
While experimenting with ways to extract UTF-8 character strings from TeX boxes, I found a post from user micahl-h21 here: UTF-8 text extraction. After looking at how glyph data is stored in the node list, I modified the code to see whether another approach works. In my approach, I traverse the components of composed complex glyphs/discretionaries to extract the constituent characters. In his approach, he passes complex glyphs such as ligatures to a function that decomposes them. The output printed by both versions of the code (for the test string in the example) looks the same. Can someone please review them and say whether both approaches are functionally equivalent and correct? (I am aware that my code requires special handling of TeX ligatures; please ignore that.) And if so, which one would be better for performance? (I can cache unicode.utf8.char in a local like he does; please ignore that discrepancy in any comment on performance.)

Here's the output text written to the terminal and to an output file hello.txt: Příliš žluťoučký kůň úpěl ďábelské ódy difference diffierence. His complete code is at UTF-8 text extraction. The place where our codes differ is that I don't use his get_unicode function below; I just apply unicode.utf8.char(<glyph node>.char) to the glyph components, whereas he applies get_unicode to decompose complex glyphs instead of digging a level deeper into the glyph node to get the decomposed glyphs (as far as I understand).

-- aliases from elsewhere in his code (assumed): identifiers should be
-- fonts.hashes.identifiers and char should be unicode.utf8.char
local identifiers = fonts.hashes.identifiers
local char = unicode.utf8.char

-- the glyph's .tounicode mapping is a hex string with four hex digits
-- per code point; convert each four-digit group back to a character
local function get_unicode(xchar, font_id)
    local current = {}
    local uchar = identifiers[font_id].characters[xchar].tounicode
    for i = 1, string.len(uchar), 4 do
        local cchar = string.sub(uchar, i, i + 3)
        print(xchar, uchar, cchar, font_id, i)
        table.insert(current, char(tonumber(cchar, 16)))
    end
    return current
end
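
For contrast, here is a minimal sketch of the component traversal I do instead (it mirrors the \directlua block below; ligature_chars is just an illustrative name, and it assumes g is a glyph node whose subtype has bit 1 set, i.e. a ligature built by the node shaper):

local utfchar = unicode.utf8.char

-- read the constituent characters straight off the ligature's
-- components list, e.g. ffi carries three glyphs: f, f, i
local function ligature_chars(g)
    local t = {}
    for gc in node.traverse_id(node.id("glyph"), g.components) do
        table.insert(t, utfchar(gc.char))
    end
    return table.concat(t)
end

Here is my complete test file: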
\documentclass{article}
\usepackage[lmargin=0.5in,tmargin=0.5in,rmargin=0.5in,bmargin=0.5in]{geometry}
\usepackage{fontspec}
\usepackage{microtype}
\usepackage[english]{babel}
\usepackage{blindtext}

\begin{document}

\setbox0=\hbox{Příliš žluťoučký \textit{kůň} úpěl \hbox{ďábelské} ódy difference diffierence.}

\directlua{
  local glyph_id = node.id("glyph")
  local disc_id = node.id("disc")
  local glue_id = node.id("glue")
  local hlist_id = node.id("hlist")
  local vlist_id = node.id("vlist")
  local minglue = tex.sp("0.2em")
  local function nodeText(n)
    local t = {}
    for x in node.traverse(n) do
      % glyph node
      if x.id == glyph_id then
        if bit32.band(x.subtype,2) \csstring~=0
            and unicode.utf8.char(x.char) \csstring~="“"
            and unicode.utf8.char(x.char) \csstring~="”" then
          %
          for g in node.traverse_id(glyph_id,x.components) do
            if bit32.band(g.subtype, 2) \csstring~=0 then
              for gc in node.traverse_id(glyph_id,g.components) do
                table.insert(t,unicode.utf8.char(gc.char))
              end
            else
              table.insert(t,unicode.utf8.char(g.char))
            end
          end
        else
          table.insert(t,unicode.utf8.char(x.char))
        end
      % disc node
      elseif x.id == disc_id then
        for g in node.traverse_id(glyph_id,x.replace) do
          if bit32.band(g.subtype, 2) \csstring~=0 then
            for gc in node.traverse_id(glyph_id,g.components) do
              table.insert(t,unicode.utf8.char(gc.char))
            end
          else
            table.insert(t,unicode.utf8.char(g.char))
          end
        end
      % glue node
      elseif x.id == glue_id and node.getglue(x) > minglue then
        table.insert(t," ")
      elseif x.id == hlist_id or x.id == vlist_id then
        table.insert(t,nodeText(x.head))
      end
    end
    return table.concat(t)
  end
  local n = tex.getbox(0)
  print(nodeText(n.head))
  local f = io.open("hello.txt","w")
  f:write(nodeText(n.head))
  f:close()
}

\box0

\end{document}

codepoet
  • I've not really followed your code, but you should test with a script that does more font shaping (and test with harf mode), where the reverse translation from glyph identifiers to Unicode input characters is harder. – David Carlisle Sep 22 '20 at 11:54
  • @DavidCarlisle: Thanks for the feedback, I will try that next. The control-flow difference between his code and mine is that for ligatures (where bit 1 of <glyph node>.subtype is set), I iterate over the glyph node's .components list. The .components list has the individual parts that make up the ligature; e.g. the ffi ligature's .components list has 3 individual glyphs: f, f, i. – codepoet Sep 22 '20 at 12:05
  • @DavidCarlisle I had tested Devanagari with the Node shaper before your comment, and it had worked. Upon your suggestion I tried HarfBuzz, and it failed. It failed using user micahl-h21's idea too (link to his technique: https://tex.stackexchange.com/questions/228312/write-the-content-of-an-hbox-in-an-auxilary-file). I have filed another question to discuss why it fails in Harf mode and not in Node mode: https://tex.stackexchange.com/questions/563794/devanagari-utf-8-text-extraction-fails-in-harfbuzz-mode-passes-in-node-mode – codepoet Sep 22 '20 at 19:38
  • @DavidCarlisle I wonder if you are aware of the difference in how the 'char' field is used by the HarfBuzz renderer. I have explained my observations on a small Devanagari test case here: https://github.com/Josef-Friedrich/nodetree/issues/6#issuecomment-697907215 – codepoet Sep 23 '20 at 19:32
  • Not in detail, but I'm not exactly surprised; the point of harf mode is that consecutive character runs get passed to HarfBuzz, which does whatever it does and returns a list of glyph nodes, so the TeX side doesn't really have the character-level information it has in the other modes, where it does all the shaping itself. You see something similar in xetex logs: where tex normally reports every character, xetex just reports the first character of each word. – David Carlisle Sep 23 '20 at 20:05
  • @DavidCarlisle Thanks, that's very useful to know. Khaled explained what we should do on that GitHub thread. What I see is that, for HarfBuzz fonts, glyph nodes have a field 'glyph_info' in their properties field, and it contains the string of text associated with that node. Glyph nodes that have nothing to print (because they were part of a cluster) have an empty string. So my code should work if I detect that a font uses the HarfBuzz renderer and get the Unicode code points from 'glyph_info' instead of 'char' (a sketch of this follows these comments). – codepoet Sep 24 '20 at 20:48
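
A sketch of what that last comment means, with the caveats that the glyph_info property name comes from the linked GitHub thread, glyph_text is an illustrative name, and this is untested here (an assumption, not luaotfload's documented API):

local utfchar = unicode.utf8.char

-- prefer the HarfBuzz-provided cluster text when present; assumes
-- harf-mode glyph nodes carry a glyph_info string in their node
-- properties, per the comment above (not verified here)
local function glyph_text(g)
    local props = node.getproperty(g)
    if props and props.glyph_info then
        return props.glyph_info  -- may be "" inside a cluster
    end
    return utfchar(g.char)  -- node-shaper case
end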
