I know that TeX cannot write the content of an \hbox to an auxiliary file (see "Re-parse the content of a box register").

That means that

\newwrite\foo
\immediate\openout\foo=\jobname.txt
\setbox0=\hbox{bar}
\immediate\write\foo{\box0}

can't write the content of the box. However, can LuaTeX do it? I have found

\directlua{
 n = tex.getbox(0)
}

But I don't understand what n represents, or whether I could use it to write the box content to a file.

Maïeul
  • If you indent your code by four spaces or click the {} button with more than one line selected, the code will display correctly. – Manuel Feb 15 '15 at 21:06
  • Sorry, too accustomed to the GitHub syntax. – Maïeul Feb 15 '15 at 21:09
  • What are you hoping to get out here? If you show the content of a box using classical TeX you'll see that it's a series of typesetting instructions, not anything necessarily related to 'text'. I wonder what the aim is here. – Joseph Wright Feb 15 '15 at 21:17
  • eledmac numbers lines. To do so, it splits vboxes into small vboxes (one per \baselineskip) and then adds the line number. I would like to output one specific line to an auxiliary file. – Maïeul Feb 15 '15 at 21:24
  • A likely better approach is to catch the macros/text before they are consumed in a box. – Heiko Oberdiek Mar 29 '18 at 17:03
  • No, I need to get the content precisely as it is split in the box. – Maïeul Mar 31 '18 at 09:35

1 Answer

Edit: here is new code that works for ligatures:

\documentclass{article}
\usepackage{fontspec}
\begin{document}
\setbox0=\hbox{Příliš žluťoučký \textit{kůň} úpěl \hbox{ďábelské} ódy, diffierence, difference}
\directlua{
  % local fontstyles = require "l4fontstyles"
  local char = unicode.utf8.char
  local glyph_id = node.id("glyph")
  local glue_id  = node.id("glue")
  local hlist_id = node.id("hlist")
  local vlist_id = node.id("vlist")
  local disc_id  = node.id("disc")
  local minglue  = tex.sp("0.2em")
  local usedcharacters = {}
  local identifiers = fonts.hashes.identifiers
  local function get_unicode(xchar,font_id)
    local current = {}
    local uchar = identifiers[font_id].characters[xchar].tounicode
    for i= 1, string.len(uchar), 4 do
      local cchar = string.sub(uchar, i, i + 3)
      print(xchar,uchar,cchar, font_id, i)
      table.insert(current,char(tonumber(cchar,16)))
    end
    return current
  end
  local function nodeText(n)
    local t =  {}
    for x in node.traverse(n) do
      % glyph node
      if x.id == glyph_id then
        % local currentchar = fonts.hashes.identifiers[x.font].characters[x.char].tounicode
        local chars = get_unicode(x.char,x.font)
        for _, current_char in ipairs(chars) do
          table.insert(t,current_char)
        end
      % glue node
      elseif x.id == glue_id and  node.getglue(x) > minglue then
        table.insert(t," ")
      % discretionaries
      elseif x.id == disc_id then
        table.insert(t, nodeText(x.replace))
      % recursively process hlist and vlist nodes
      elseif x.id == hlist_id or x.id == vlist_id then
        table.insert(t,nodeText(x.head))
      end
    end
    return table.concat(t)
  end
  local n = tex.getbox(0)
  print(nodeText(n.head))
  local f = io.open("hello.txt","w")
  f:write(nodeText(n.head))
  f:close()
}

\box0
\end{document}

Result in hello.txt:

Příliš žluťoučký kůň úpěl ďábelské ódy, diffierence, difference
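
The get_unicode helper above splits the tounicode string into chunks of four hex digits. As a minimal sketch, assuming that tounicode is a hex string of UTF-16BE code units (which is what the ligature case suggests, but is an assumption here), the same decoding can be written as standalone Lua that also combines surrogate pairs into one code point. The name decode_tounicode is hypothetical, and the sketch uses utf8.char from Lua 5.3 rather than unicode.utf8.char. It uses -- comments, so it is meant for a separate .lua file rather than for pasting into \directlua.

```lua
-- Decode a hex string of UTF-16BE code units (e.g. "00660066") into UTF-8.
-- Hypothetical helper; assumes Lua 5.3+ for utf8.char.
local function decode_tounicode(hex)
  local out = {}
  local i = 1
  while i + 3 <= #hex do
    local unit = tonumber(hex:sub(i, i + 3), 16)
    if not unit then break end          -- stop on anything that is not hex
    i = i + 4
    -- A high surrogate followed by a low surrogate encodes a single
    -- code point above U+FFFF.
    if unit >= 0xD800 and unit <= 0xDBFF and i + 3 <= #hex then
      local low = tonumber(hex:sub(i, i + 3), 16)
      if low and low >= 0xDC00 and low <= 0xDFFF then
        unit = 0x10000 + (unit - 0xD800) * 0x400 + (low - 0xDC00)
        i = i + 4
      end
    end
    out[#out + 1] = utf8.char(unit)
  end
  return table.concat(out)
end
```

With this variant a ligature value such as "006600660069" still decodes to the three letters f, f, i, while a pair such as "DB80DC07" becomes one private-use code point instead of two unpaired surrogates.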

Original answer:

The variable n in your example is a node list. Various types of nodes exist, such as glyph for characters, glue for spacing, or hlist, which is the type you get for your \hbox. An hlist contains child nodes, which are accessible through the n.head attribute. You can then loop over this child list looking for glyphs and glues.

Each node type can be distinguished by the value of the n.id attribute. The node types and their attributes are described in chapter "8 Nodes" of the LuaTeX manual. In this particular example we only need to process glyph and glue nodes, but keep in mind that node lists are recursive and various nodes, such as hlist and vlist, can contain child lists. You can support them with a recursive call of nodeText on the head attribute of the current node.
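
To see which node types a box actually contains, a small walker can help. This is only a sketch: describe is a hypothetical name, and the list is followed through the plain next and head fields (instead of node.traverse) so that the function also runs on mock tables outside LuaTeX. The id-to-name lookup is passed in as a parameter. Inside LuaTeX, node.type should serve for that, since it converts a numeric id into a type name.

```lua
-- Collect one indented type name per node, descending into nested lists.
-- type_name maps a numeric id to a name (inside LuaTeX: node.type).
local function describe(head, type_name, depth, out)
  depth, out = depth or 0, out or {}
  local x = head
  while x do
    local name = type_name(x.id)
    out[#out + 1] = string.rep("  ", depth) .. name
    -- hlist and vlist nodes carry their own child list in the head field
    if name == "hlist" or name == "vlist" then
      describe(x.head, type_name, depth + 1, out)
    end
    x = x.next
  end
  return out
end

-- Possible use inside LuaTeX (not executed here):
--   print(table.concat(describe(tex.getbox(0), node.type), "\n"))
```

Looking ids up by name, as the answer does with node.id("glyph"), is safer than hard-coding numbers, because the numeric values are not guaranteed to stay the same between LuaTeX versions.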

As for glyph nodes, the char attribute contains the Unicode value only if you use OpenType or TrueType fonts. With old 8-bit fonts it contains just an 8-bit value whose meaning depends on the font encoding, and it isn't easy to convert it to Unicode.

\documentclass{article}
\usepackage{fontspec}
\begin{document}
\setbox0=\hbox{Příliš žluťoučký \textit{kůň} úpěl \hbox{ďábelské} ódy}
\directlua{
  local fontstyles = require "l4fontstyles"
  local char = unicode.utf8.char
  local glyph_id = node.id("glyph")
  local glue_id  = node.id("glue")
  local hlist_id = node.id("hlist")
  local vlist_id = node.id("vlist")
  local minglue = tex.sp("0.2em")
  local usedcharacters = {}
  local identifiers = fonts.hashes.identifiers
  local function get_unicode(xchar,font_id)
     return char(tonumber(identifiers[font_id].characters[xchar].tounicode,16))
  end
  local function nodeText(n)
    local t =  {}
    for x in node.traverse(n) do
      % glyph node
      if x.id == glyph_id then
        % local currentchar = fonts.hashes.identifiers[x.font].characters[x.char].tounicode
        table.insert(t,get_unicode(x.char,x.font))
        local y = fontstyles.get_fontinfo(x.font)
        print(x.char,y.name,y.weight,y.style)
      % glue node
      elseif x.id == glue_id and  node.getglue(x) > minglue then
        table.insert(t," ")
      elseif x.id == hlist_id or x.id == vlist_id then
        table.insert(t,nodeText(x.head))
      end
    end
    return table.concat(t)
  end
  local n = tex.getbox(0)
  print(nodeText(n.head))
  local f = io.open("hello.txt","w")
  f:write(nodeText(n.head))
  f:close()
}

\box0
\end{document}

The nodeText function returns the text contained in the node list. In this example it is used to print the \hbox contents to the terminal and to write them to the file hello.txt.

For basic information about the font style, you can try the l4fontstyles module, like this:

local fontstyles = require "l4fontstyles"
...
if x.id == glyph_id then
  table.insert(t,char(x.char))
  local y = fontstyles.get_fontinfo(x.font)
  print(y.name,y.weight,y.style)
michal.h21
  • Thanks a lot for the answer and explanation! It's what I need! – Maïeul Feb 15 '15 at 22:14
  • Just another question. In the LuaTeX manual, I can't find anything about the possible values of node.id. I understand that 37 = non-blank character and 10 = blank character, but what else is possible? Is it possible to get some information about formatting, like weight or emphasis? – Maïeul Feb 16 '15 at 09:37
  • @Maïeul see updated answer – michal.h21 Feb 16 '15 at 10:22
  • Thanks. I didn't understand that the id value was the number in parentheses at the beginning of chapter 8. – Maïeul Feb 16 '15 at 10:45
  • There seems to be a problem with combined Unicode characters. In the input, ךְ (U+05DA U+05B0) becomes `` (U+DB80 U+DC07) in the output (but not in the .pdf). Is there any library to solve this problem? – Maïeul Feb 17 '15 at 15:19
  • @Maïeul it is written to the file hello.txt as U+05DA U+05B0. LuaLaTeX sees the combined characters as two characters, and it doesn't seem to do any conversion. – michal.h21 Feb 17 '15 at 15:54
  • Indeed. So it must be some other code of mine that is problematic. – Maïeul Feb 17 '15 at 16:02
  • OK, it seems that \newfontfamily\hebrewfont[Script=Hebrew]{Ezra SIL} creates the problem, but \newfontfamily\hebrewfont[]{Ezra SIL} does not. – Maïeul Feb 17 '15 at 16:06
  • @Maïeul so it is probably remapped in the font. I don't know how to solve that, maybe using an OpenType feature file? – michal.h21 Feb 17 '15 at 16:36
  • Something mapped in one direction could be mapped back in the other, of course. But for my specific problem it doesn't matter: the .pdf has no importance, so I can drop the argument. – Maïeul Feb 17 '15 at 16:39
  • Will this code work for all cases? Or is it just a partially functioning proof of concept? Thanks – codepoet Sep 20 '20 at 08:25
  • @reportaman Honestly, I don't know. It is five years old and there have been a lot of changes in LuaTeX since then, so I wouldn't be surprised if it doesn't work at all anymore. – michal.h21 Sep 20 '20 at 08:30
  • I wonder if there is a way to detect a hyphen added by TeX as opposed to one that is really part of the word. What makes that complex is that some words with a real hyphen can, I guess, also be hyphenated. – codepoet Sep 20 '20 at 08:36
  • It depends on when you call this function. When you call it before line breaking, you should be fine. Otherwise, hyphen should be the last glyph in hlist with character equal to tex.hyphenchar, I think. – michal.h21 Sep 20 '20 at 10:48
  • @michal.h21 Thanks, I just tried to compile, and it fails on local fontstyles = require "l4fontstyles" with the error lualibs-basic-merged.lua:378: attempt to call a nil value (field 'cpath specification') – codepoet Sep 21 '20 at 00:13
  • Also, this code skips the characters of ligatures, probably because they are 'disc' nodes. For instance, an English word like difference gets reported as dierence, as if ff never existed. – codepoet Sep 21 '20 at 00:27
  • And this: Try adding the word diffierence (instead of difference), a word with ffi ligature breaks the compile completely. Error message: bad argument #1 to 'char' (invalid value), while the word difference with ff ligature doesn't break compile (just doesn't print ff in the output). Otherwise works fine! The recursion through nested boxes works seamlessly :) – codepoet Sep 21 '20 at 00:41
  • @reportaman l4fontstyles seems to be used only for debugging messages, so you can remove it, and also remove local y = fontstyles.get_fontinfo(x.font). – michal.h21 Sep 21 '20 at 07:12
  • @reportaman it seems that for ligatures it returns several Unicode code points in one string, but we convert the whole string to a number. This results in an invalid value. – michal.h21 Sep 21 '20 at 07:13
  • @reportaman I've updated my answer with new code that works for ligatures – michal.h21 Sep 21 '20 at 12:06
  • I used another approach to get similar result, for which I will post a related question, and link this to it. Though instead of get_unicode function, I just iterate through the complex composed glyphs, and discs to get the character. – codepoet Sep 21 '20 at 22:27
  • @reportaman the main purpose of get_unicode is to correctly convert glyph.char nodes back to Unicode. glyph.char contains a glyph index after Harfbuzz shaping, so it is necessary to convert it back to Unicode. – michal.h21 Sep 22 '20 at 06:17
  • @micah Can’t you iterate through ligature/shaped glyph nodes to get its Unicode components? The decomposed glyph data is stored under the glyph node afaik. That’s how I am able to get the components of shaped glyphs back. – codepoet Sep 22 '20 at 07:22
  • @reportaman it loops over ligature glyph components, see elseif x.id == disc_id then. But you still have to process glyphs that have no Unicode representation, like some ligatures or process of shaping of Arabic and other complex scripts. – michal.h21 Sep 22 '20 at 07:59
  • Thanks, I have posted my code (that seems to produce same result without get_unicode function), and asked a question here: https://tex.stackexchange.com/questions/563731/luatex-utf-8-character-extraction-is-one-way-more-accurate-and-or-preferable-t – codepoet Sep 22 '20 at 09:50
  • Both of our techniques fail when using HarfBuzz renderer, I have posted complete code with error signature here: https://tex.stackexchange.com/questions/563794/devanagari-utf-8-text-extraction-fails-in-harfbuzz-mode-passes-in-node-mode – codepoet Sep 22 '20 at 19:52