4

The Devanagari UTF-8 text extraction code doesn't produce a Lua error when using fontspec's Node renderer, but it still doesn't produce the correct result, due to a possible bug in the Node renderer. I have filed that as a separate question: LuaTeX: Devanagari glyph order is reversed in TeX's internal nodelist, how to recover the correct order while traversing glyph nodes?

While experimenting with different techniques for UTF-8 text extraction from TeX boxes, I found that the two techniques that do not produce any Lua error when extracting Devanagari text with fontspec's Node renderer both produce Lua errors when using fontspec's HarfBuzz renderers (Renderer=Harfbuzz, Renderer=OpenType).

The two techniques are detailed here: technique-1 (using michal.h21's get_unicode function) and here: technique-2 (just applying unicode.utf8.char to the constituent components of complex glyphs). I tried multiple Devanagari fonts; all result in the same behavior.

Complete test code for both techniques and their respective error signatures are listed one after the other in the blocks below. For my example, I used the freely available Noto Sans Devanagari (regular weight), available here: link to google fonts GitHub for Noto Sans Devanagari

Technique-1 with Devanagari and HarfBuzz (no Lua error if compiled with the Node renderer):

\documentclass{article}
\usepackage[lmargin=0.5in,tmargin=0.5in,rmargin=0.5in,bmargin=0.5in]{geometry}
\usepackage{fontspec}
\usepackage{microtype}

%\newfontscript{Devanagari}{deva,dev2}
\newfontfamily{\devanagarifam}{Noto Sans Devanagari}[Script=Devanagari, Scale=1, Renderer=HarfBuzz]

\begin{document}

% Devanagari text is at the right end of following line
% of code, you might have to scroll right to read it
\setbox0=\hbox{Příliš žluťoučký \textit{kůň} úpěl \hbox{ďábelské} ódy difference diffierence. \devanagarifam एक गांव -- में मोहन नाम का लड़का रहता था। उसके पिताजी एक मामूली मजदूर थे।}

\directlua{
% local fontstyles = require "l4fontstyles"
local char = unicode.utf8.char
local glyph_id = node.id("glyph")
local glue_id = node.id("glue")
local hlist_id = node.id("hlist")
local vlist_id = node.id("vlist")
local disc_id = node.id("disc")
local minglue = tex.sp("0.2em")
local usedcharacters = {}
local identifiers = fonts.hashes.identifiers
local function get_unicode(xchar,font_id)
  local current = {}
  local uchar = identifiers[font_id].characters[xchar].tounicode
  for i= 1, string.len(uchar), 4 do
    local cchar = string.sub(uchar, i, i + 3)
    print(xchar,uchar,cchar, font_id, i)
    table.insert(current,char(tonumber(cchar,16)))
  end
  return current
end
local function nodeText(n)
  local t = {}
  for x in node.traverse(n) do
    % glyph node
    if x.id == glyph_id then
      % local currentchar = fonts.hashes.identifiers[x.font].characters[x.char].tounicode
      local chars = get_unicode(x.char,x.font)
      for _, current_char in ipairs(chars) do
        table.insert(t,current_char)
      end
    % glue node
    elseif x.id == glue_id and node.getglue(x) > minglue then
      table.insert(t," ")
    % discretionaries
    elseif x.id == disc_id then
      table.insert(t, nodeText(x.replace))
    % recursivelly process hlist and vlist nodes
    elseif x.id == hlist_id or x.id == vlist_id then
      table.insert(t,nodeText(x.head))
    end
  end
  return table.concat(t)
end
local n = tex.getbox(0)
print(nodeText(n.head))
local f = io.open("hello.txt","w")
f:write(nodeText(n.head))
f:close()
}

\box0
\end{document}

Error signature for Technique-1 (HarfBuzz renderer):

[\directlua]:1: bad argument #1 to 'len' (string expected, got nil)
stack traceback:
    [C]: in function 'string.len'
    [\directlua]:1: in upvalue 'get_unicode'
    [\directlua]:1: in local 'nodeText'
    [\directlua]:1: in main chunk.
l.62 }

Technique-2 with Devanagari and HarfBuzz (no Lua error if compiled with the Node renderer):

\documentclass{article}
\usepackage[lmargin=0.5in,tmargin=0.5in,rmargin=0.5in,bmargin=0.5in]{geometry}
\usepackage{fontspec}
\usepackage{microtype}

%\newfontscript{Devanagari}{deva,dev2}
\newfontfamily{\devanagarifam}{Noto Sans Devanagari}[Script=Devanagari, Scale=1, Renderer=HarfBuzz]

\begin{document}

% Devanagari text is at the right end of following line
% of code, you might have to scroll right to read it
\setbox0=\hbox{Příliš žluťoučký \textit{kůň} úpěl \hbox{ďábelské} ódy difference diffierence. \devanagarifam एक गांव -- में मोहन नाम का लड़का रहता था। उसके पिताजी एक मामूली मजदूर थे।}

\directlua{
local glyph_id = node.id("glyph")
local disc_id = node.id("disc")
local glue_id = node.id("glue")
local hlist_id = node.id("hlist")
local vlist_id = node.id("vlist")
local minglue = tex.sp("0.2em")
local function nodeText(n)
  local t = {}
  for x in node.traverse(n) do
    % glyph node
    if x.id == glyph_id then
      if bit32.band(x.subtype,2) \csstring~=0 and unicode.utf8.char(x.char) \csstring~="“" and unicode.utf8.char(x.char) \csstring~="”" then %
        for g in node.traverse_id(glyph_id,x.components) do
          if bit32.band(g.subtype, 2) \csstring~=0 then
            for gc in node.traverse_id(glyph_id,g.components) do
              table.insert(t,unicode.utf8.char(gc.char))
            end
          else
            table.insert(t,unicode.utf8.char(g.char))
          end
        end
      else
        table.insert(t,unicode.utf8.char(x.char))
      end
    % disc node
    elseif x.id == disc_id then
      for g in node.traverse_id(glyph_id,x.replace) do
        if bit32.band(g.subtype, 2) \csstring~=0 then
          for gc in node.traverse_id(glyph_id,g.components) do
            table.insert(t,unicode.utf8.char(gc.char))
          end
        else
          table.insert(t,unicode.utf8.char(g.char))
        end
      end
    % glue node
    elseif x.id == glue_id and node.getglue(x) > minglue then
      table.insert(t," ")
    elseif x.id == hlist_id or x.id == vlist_id then
      table.insert(t,nodeText(x.head))
    end
  end
  return table.concat(t)
end
local n = tex.getbox(0)
print(nodeText(n.head))
local f = io.open("hello.txt","w")
f:write(nodeText(n.head))
f:close()

}

\box0
\end{document}

Error signature for Technique-2 (HarfBuzz renderer):

[\directlua]:1: bad argument #1 to 'char' (invalid value)
stack traceback:
    [C]: in field 'char'
    [\directlua]:1: in local 'nodeText'
    [\directlua]:1: in main chunk.
l.64 }
codepoet
  • 1,316
  • I get the same message with any font rendered with harfbuzz (\documentclass{article} \usepackage{fontspec} \newfontfamily\fordinary{Noto Sans}[Renderer=HarfBuzz] \begin{document} \fordinary \setbox0=\hbox{abc}...), as if there is no string anymore. – Cicada Sep 23 '20 at 09:50
  • @Cicada I am not sure what your complete code was. If I add \box0 \end{document} to the tail end of your code, it does compile successfully and produce a correct PDF. I am using TeX Live 2020 with packages last updated a week or so ago. – codepoet Sep 25 '20 at 10:03
  • The luacode (too long to paste in a comment): \documentclass{article} \usepackage{fontspec} \newfontfamily\fordinary{Noto Sans}[Renderer=HarfBuzz] \begin{document} \fordinary \setbox0=\hbox{abc} \directlua{ .... } \box0 \end{document}. – Cicada Sep 26 '20 at 03:14

2 Answers

3

I don't know too much about how HarfBuzz fonts are handled by Luaotfload, but I was able to find a way to get the tounicode fields, thanks to table.serialize (see the inspection sketch after the hello.txt output below). So my original code adapted for HarfBuzz looks like this:

\documentclass{article}
\usepackage[lmargin=0.5in,tmargin=0.5in,rmargin=0.5in,bmargin=0.5in]{geometry}
\usepackage{fontspec}
\usepackage{microtype}
\usepackage{luacode}

%\newfontscript{Devanagari}{deva,dev2}
\newfontfamily{\devanagarifam}{Noto Sans Devanagari}[Script=Devanagari, Scale=1, Renderer=HarfBuzz]
\newfontfamily{\arabicfam}{Amiri}[Script=Arabic, Scale=1, Renderer=HarfBuzz]

\begin{document}

\setbox0=\hbox{Příliš žluťoučký \textit{kůň} úpěl \hbox{ďábelské} ódy difference diffierence. \devanagarifam एक गांव -- में मोहन नाम का लड़का रहता था। उसके पिताजी एक मामूली मजदूर थे।}
\setbox1=\hbox{\arabicfam \textdir TRT هذه المقالة عن براغ. لتصفح عناوين مشابهة، انظر براغ (توضيح).}

\begin{luacode*}
-- local fontstyles = require "l4fontstyles"
local char = unicode.utf8.char
local glyph_id = node.id("glyph")
local glue_id = node.id("glue")
local hlist_id = node.id("hlist")
local vlist_id = node.id("vlist")
local disc_id = node.id("disc")
local minglue = tex.sp("0.2em")
local usedcharacters = {}
local identifiers = fonts.hashes.identifiers
local fontcache = {}

local function to_unicode_chars(uchar)
  local uchar = uchar or ""
  -- put characters into a table
  local current = {}
  -- each codepoint is 4 bytes long, we loop over tounicode entry and cut it into 4 bytes chunks
  for i= 1, string.len(uchar), 4 do
    local cchar = string.sub(uchar, i, i + 3)
    -- codepoint is hex string, we need to convert it to number ad then to UTF8 char
    table.insert(current,char(tonumber(cchar,16)))
  end
  return current
end

-- cache character lookup, to speed up things
local function get_character_from_cache(xchar, font_id)
  local current_font = fontcache[font_id] or {characters = {}}
  -- initialize font cache for the current font if it doesn't exist
  fontcache[font_id] = current_font
  return current_font.characters[xchar]
end

-- save characters to cache for faster lookup
local function save_character_to_cache(xchar, font_id, replace)
  fontcache[font_id][xchar] = replace
  -- return value
  return replace
end

local function initialize_harfbuzz_cache(font_id, hb)
  -- save some harfbuzz tables for faster lookup
  local current_font = fontcache[font_id]
  -- the unicode data can be in two places
  -- 1. hb.shared.glyphs[glyphid].backmap
  current_font.glyphs = current_font.glyphs or hb.shared.glyphs
  -- 2. hb.shared.unicodes
  -- it contains mapping between Unicode and glyph id
  -- we must create new table that contains reverse mapping
  if not current_font.backmap then
    current_font.backmap = {}
    for k,v in pairs(hb.shared.unicodes) do
      current_font.backmap[v] = k
    end
  end
  -- save it back to the font cache
  fontcache[font_id] = current_font
  return current_font.glyphs, current_font.backmap
end

local function get_unicode(xchar,font_id)
  -- try to load character from cache first
  local current_char = get_character_from_cache(xchar, font_id)
  if current_char then return current_char end
  -- get tounicode for non HarfBuzz fonts
  local characters = identifiers[font_id].characters
  local uchar = characters[xchar].tounicode
  -- stop processing if tounicode exists
  if uchar then return save_character_to_cache(xchar, font_id, to_unicode_chars(uchar)) end
  -- detect if font is processed by Harfbuzz
  local hb = identifiers[font_id].hb
  -- try HarfBuzz data
  if not uchar and hb then
    -- get glyph index of the character
    local index = characters[xchar].index
    -- load HarfBuzz tables from cache
    local glyphs, backmap = initialize_harfbuzz_cache(font_id, hb)
    -- get tounicode field from HarfBuzz glyph info
    local tounicode = glyphs[index].tounicode
    if tounicode then
      return save_character_to_cache(xchar, font_id, to_unicode_chars(tounicode))
    end
    -- if this fails, try backmap, which contains mapping between glyph index and Unicode
    local backuni = backmap[index]
    if backuni then
      return save_character_to_cache(xchar, font_id, {char(backuni)})
    end
    -- if this fails too, discard this character
    return save_character_to_cache(xchar, font_id, {})
  end
  -- return just the original char if everything else fails
  return save_character_to_cache(xchar, font_id, {char(xchar)})
end

local function nodeText(n)
  -- output buffer
  local t = {}
  for x in node.traverse(n) do
    -- glyph node
    if x.id == glyph_id then
      -- get table with characters for current node.char
      local chars = get_unicode(x.char,x.font)
      for _, current_char in ipairs(chars) do
        -- save characters to the output buffer
        table.insert(t,current_char)
      end
    -- glue node
    elseif x.id == glue_id and node.getglue(x) > minglue then
      table.insert(t," ")
    -- discretionaries
    elseif x.id == disc_id then
      table.insert(t, nodeText(x.replace))
    -- recursivelly process hlist and vlist nodes
    elseif x.id == hlist_id or x.id == vlist_id then
      table.insert(t,nodeText(x.head))
    end
  end
  return table.concat(t)
end

local n = tex.getbox(0)
local n1 = tex.getbox(1)
print(nodeText(n.head))
local f = io.open("hello.txt","w")
f:write(nodeText(n.head))
f:write(nodeText(n1.head))
f:close()
\end{luacode*}

\box0

\box1
\end{document}

I've also added an Arabic sample from Wikipedia. Here is the content of hello.txt:

Příliš žluťoučký kůň úpěl ďábelské ódy difference diffierence. एक गांव -- में मोहन नाम का लड़का रहता था। उसके पताजी एक मामूली मजदूर थे।هذه المقالة عن براغ. لتصفح عناوين مشابهة، انظر براغ (توضيح).
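
For reference, the place where luaotfload keeps the HarfBuzz data can be found by dumping the font tables with table.serialize, as mentioned above. A minimal inspection sketch (my own illustration, not part of the code above; the use of font.current() is just one way to pick a font id, and the subtable names are the ones used in the code above):

  -- run inside a luacode* environment after the HarfBuzz font has been selected,
  -- then search the log for the tounicode fields
  local fontdata = fonts.hashes.identifiers[font.current()]
  if fontdata.hb then
    texio.write_nl(table.serialize(fontdata.hb.shared.glyphs, "glyphs"))
    texio.write_nl(table.serialize(fontdata.hb.shared.unicodes, "unicodes"))
  else
    texio.write_nl(table.serialize(fontdata.characters, "characters"))
  end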

Two important functions are these

  local function to_unicode_chars(uchar)
    local uchar = uchar or ""
    local current = {}
    for i= 1, string.len(uchar), 4 do
      local cchar = string.sub(uchar, i, i + 3)
      table.insert(current,char(tonumber(cchar,16)))
    end
    return current
  end

The to_unicode_chars function splits tounicode entries into four-byte chunks, which are then converted to UTF-8 characters. It can also handle glyphs without a tounicode entry; in that case it simply returns an empty table.
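
For example (my own illustration, not part of the answer's code; it assumes it is placed where to_unicode_chars is in scope, e.g. inside the luacode* block above), a glyph that maps to two codepoints simply has eight hex digits in its tounicode value:

  -- "0915094D" is a UTF-16 hex tounicode value for क (U+0915) followed by ् (U+094D);
  -- to_unicode_chars cuts it into "0915" and "094D" and converts each chunk
  print(table.concat(to_unicode_chars("0915094D")))  -- क्
  print(#to_unicode_chars(nil))                      -- 0, no tounicode entry at all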

  local function get_unicode(xchar,font_id)
    -- try to load character from cache first
    local current_char = get_character_from_cache(xchar, font_id) 
    if current_char then return current_char end
    -- get tounicode for non HarfBuzz fonts
    local characters = identifiers[font_id].characters
    local uchar = characters[xchar].tounicode
    -- stop processing if tounicode exists
    if uchar then return save_character_to_cache(xchar, font_id, to_unicode_chars(uchar)) end
    -- detect if font is processed by Harfbuzz
    local hb = identifiers[font_id].hb
    -- try HarfBuzz data
    if not uchar and hb then 
      -- get glyph index of the character
      local index = characters[xchar].index
      -- load HarfBuzz tables from cache
      local glyphs, backmap = initialize_harfbuzz_cache(font_id, hb)
      -- get tounicode field from HarfBuzz glyph info
      local tounicode = glyphs[index].tounicode
      if tounicode then
        return save_character_to_cache(xchar, font_id, to_unicode_chars(tounicode))
      end
      -- if this fails, try backmap, which contains mapping between glyph index and Unicode
      local backuni = backmap[index]
      if backuni then 
        return save_character_to_cache(xchar, font_id, {char(backuni)})
      end
      -- if this fails too, discard this character
      return save_character_to_cache(xchar, font_id, {})
    end
    -- return just the original char if everything else fails
    return save_character_to_cache(xchar, font_id, {char(xchar)})
  end

This function first tries to load Unicode data from the current font info. If that fails, it looks the character up in the HarfBuzz tables. Most characters have a tounicode mapping in the glyphs table. If that isn't available, it tries the backmap built from the unicodes table, which maps Unicode codepoints to glyph indices. If even this fails, the character is discarded.
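
The backmap step can be shown in isolation (a sketch of mine with made-up glyph indices, not data from a real font): since hb.shared.unicodes maps Unicode codepoints to glyph indices, inverting it gives the glyph-index-to-Unicode table used as the fallback.

  -- hypothetical numbers, for illustration only
  local unicodes = { [0x092A] = 351, [0x093F] = 372 }  -- as if taken from hb.shared.unicodes
  local backmap = {}
  for uni, gid in pairs(unicodes) do backmap[gid] = uni end
  print(utf8.char(backmap[351]))                       -- प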

michal.h21
  • 50,697
  • Seems very close to the solution, though there is a problem. I see a difference between the input and output Devanagari strings; there is one complex glyph missing in the Unicode string output to the terminal & the file hello.txt (it's there in the PDF). The input Devanagari string looks like this: ...उसके पिताजी एक..., whereas the output string produced by your code looks like this: ... उसके ताजी एक.... There is one missing glyph: पि in the word पिताजी. It would also help if you could add some inline comments in the function to_unicode_chars or elaborate on it more; I don't understand the format of the input and the math there. – codepoet Sep 23 '20 at 02:38
  • @reportaman I've expanded my answer. It seems that this one character that is missing doesn't have any mapping. It also isn't valid utf-8 codepoint, so we cannot use it at all. – michal.h21 Sep 23 '20 at 11:12
  • Hi Michal, I am yet to test your latest code, though there is a bug somewhere (either our usage of the 'char' field when using the HarfBuzz renderer, or the HarfBuzz renderer's potential misuse of the 'char' field by putting a 'string' there instead of a 'char'). I have dissected the error further in my comment on a bug I filed on GitHub for the same issue. Here's the link: https://github.com/Josef-Friedrich/nodetree/issues/6#issuecomment-697907215 Please let me know what you think. – codepoet Sep 23 '20 at 19:28
  • @reportaman you shouldn't get errors at this moment, because char function is not used directly on node.char in most cases, only on tounicode entries, which should be safe. Khaled's explanation what node.char contains is nice, especially that if node.char value > 0x120000 then it is a glyph index. you could use char on lower values in theory, but it will produce ligatures, so you don't want it. the real issue is that this particular glyph index 1180255 doesn't have any Unicode info in the font table, so we cannot use it. – michal.h21 Sep 23 '20 at 21:40
  • Wow, that looks like a lot of possible sources of information. Have you ever tried the function get_glyph_info pointed out by Khaled? Given that he says it works with all the renderers, I wonder if a solution based on it would be a relatively shorter one that works for all cases (of valid input Unicode text fed to TeX)? Thanks anyway. – codepoet Sep 24 '20 at 05:35
  • @reportaman yes, but it doesn't work for your use case - it doesn't separate ligature components and chars with high values are printed as ^^^^^^12025F, so you won't get original text this way. – michal.h21 Sep 24 '20 at 08:47
  • Still something off in the same Unicode substring that had the problem: ...उसके पिताजी एक... in the TeX input is now being output as ...उसके पताजी एक... in hello.txt & the terminal. पि is made up of प + ि, and in the output ि is missing. Here are the Unicode points: प is U+092A, and ि is U+093F. I wonder if there is a problem with the ordering of प + ि that makes ि disappear? – codepoet Sep 24 '20 at 22:15
  • I have added a related question here: https://tex.stackexchange.com/questions/564097/luatex-devanagari-glyph-order-in-reversed-in-texs-internal-nodelist-how-to-re – codepoet Sep 25 '20 at 02:18
2

In node mode, restoring the full text is not generally possible because you get shaped output, and shaped glyphs cannot be uniquely mapped back to input text. You can only approximate it by using tounicode values. These map to the actual PDF file ToUnicode CMap entries and therefore follow their restricted model of glyph-to-Unicode mapping: every glyph is equivalent to a fixed sequence of Unicode codepoints. These mappings are then concatenated in rendering order. As you have seen, this model is not sufficient to map Devanagari glyphs back to input text.
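
To make this concrete, take the syllable पि from the question's sample text (an illustration of mine, not from the answer; the reordering shown is what the shaper does with a pre-base vowel sign, as discussed in the linked follow-up question):

  -- पि is typed as प (U+092A) followed by the vowel sign ि (U+093F), but the
  -- shaper places the pre-base matra glyph before the consonant, so per-glyph
  -- ToUnicode values concatenated in rendering order come out reversed
  local typed    = utf8.char(0x092A, 0x093F)  -- "पि", input order
  local rendered = utf8.char(0x093F, 0x092A)  -- "िप", glyph/rendering order
  print(typed == rendered)                    -- false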

You can use harf mode instead to avoid the issue: harf mode is not affected by this limited model because it does not just give you a shaped list of glyphs, but additionally creates PDF marked-content ActualText entries overriding the ToUnicode mappings in sequences which cannot be correctly modeled through ToUnicode. The data necessary for this mapping can be queried from Lua code using the glyph_data property. (This is an undocumented implementation detail and might change in the future.)
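
To see what that property holds, you can dump it for every glyph in a box (a probe of my own, not part of the module below; it assumes box 0 has already been set with HarfBuzz-shaped material and uses the glyph_info field name that the module below relies on):

  -- run inside a luacode* environment after \setbox0 has been set
  for n in node.traverse_id(node.id("glyph"), tex.getbox(0).head) do
    local p = node.getproperty(n)
    texio.write_nl(string.format("char=%d  glyph_info=%s", n.char, tostring(p and p.glyph_info)))
  end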

If you want to extract as much as possible from any text, you can combine the property-based and the ToUnicode-based approach in your Lua code:

Create a file extracttext.lua with

local type = type
local char = utf8.char
local unpack = table.unpack
local getproperty = node.getproperty
local getfont = font.getfont
local is_glyph = node.is_glyph

-- tounicode id UTF-16 in hex, so we need to handle surrogate pairs...
local utf16hex_to_utf8 do -- Untested, but should more or less work
  local l = lpeg
  local tonumber = tonumber
  local hex = l.R('09', 'af', 'AF')
  local byte = hex * hex
  local simple = byte * byte / function(s) return char(tonumber(s, 16)) end
  local surrogate = l.S'Dd' * l.C(l.R('89', 'AB', 'ab') * byte)
                  * l.S'Dd' * l.C(l.R('CF', 'cf') * byte)
                  / function(high, low)
                      return char(0x10000 + ((tonumber(high, 16) & 0x3FF) << 10 | (tonumber(low, 16) & 0x3FF)))
                    end
  utf16hex_to_utf8 = l.Cs((surrogate + simple)^0)
end

-- First the non-harf case

-- Standard caching setup
local identity_table = setmetatable({}, {__index = function(_, id) return char(id) end})
local cached_text = setmetatable({}, {__index = function(t, fid)
  local fontdir = getfont(fid)
  local characters = fontdir and fontdir.tounicode == 1 and fontdir.characters
  local font_cache = characters and setmetatable({}, {__index = function(tt, slot)
    local character = characters[slot]
    local text = character and character.tounicode or slot
    -- At this point we have the tounicode value in text. This can have different forms.
    -- The order in the if ... elseif chain is based on how likely it is to encounter them.
    -- This is a small performance optimization.
    local t = type(text)
    if t == 'string' then
      text = utf16hex_to_utf8:match(text)
    elseif t == 'number' then
      text = char(text)
    elseif t == 'table' then
      text = char(unpack(text)) -- I haven't tested this case, but it should work
    end
    tt[slot] = text
    return text
  end}) or identity_table
  t[fid] = font_cache
  return font_cache
end})

-- Now the tounicode case just has to look up the value
local function from_tounicode(n)
  local slot, fid = is_glyph(n)
  return cached_text[fid][slot]
end

-- Now the traversing stuff. Nothing interesting to see here except for the
-- glyph case
local traverse = node.traverse
local glyph, glue, disc, hlist, vlist =
    node.id'glyph', node.id'glue', node.id'disc', node.id'hlist', node.id'vlist'
local extract_text_vlist
-- We could replace i by #t+1 but this should be slightly faster
local function real_extract_text(head, t, i)
  for n, id in traverse(head) do
    if id == glyph then
      -- First handle harf mode: Look for a glyph_info property. If that does not exists
      -- use from_tounicode. glyph_info will sometimes/often be an empty string. That's
      -- intentional and it should not trigger a fallback. The actual mapping will be
      -- contained in surrounding chars.
      local props = getproperty(n)
      t[i] = props and props.glyph_info or from_tounicode(n)
      i = i + 1
    elseif id == glue then
      if n.width > 1001 then -- 1001 is arbitrary but sufficiently high to be bigger than most weird glue uses
        t[i] = ' '
        i = i + 1
      end
    elseif id == disc then
      i = real_extract_text(n.replace, t, i)
    elseif id == hlist then
      i = real_extract_text(n.head, t, i)
    elseif id == vlist then
      i = extract_text_vlist(n.head, t, i)
    end
  end
  return i
end
function extract_text_vlist(head, t, i) -- glue should not become a space here
  for n, id in traverse(head) do
    if id == hlist then
      i = real_extract_text(n.head, t, i)
    elseif id == vlist then
      i = extract_text_vlist(n.head, t, i)
    end
  end
  return i
end
return function(list)
  local t = {}
  real_extract_text(list.head, t, 1)
  return table.concat(t)
end

This can be used as a normal Lua module:

\documentclass{article}
\usepackage{fontspec}

\newfontfamily{\devharf}{Noto Sans Devanagari}[Script=Devanagari, Renderer=HarfBuzz]
\newfontfamily{\devnode}{Noto Sans Devanagari}[Script=Devanagari, Renderer=Node]

\begin{document}

% Devanagari text is at the right end of following line
% of code, you might have to scroll right to read it
\setbox0=\hbox{Příliš žluťoučký \textit{kůň} úpěl \hbox{ďábelské} ódy difference diffierence. \devharf एक गांव -- में मोहन नाम का लड़का रहता था। उसके पिताजी एक मामूली मजदूर थे।}
\setbox1=\hbox{Příliš žluťoučký \textit{kůň} úpěl \hbox{ďábelské} ódy difference diffierence. \devnode एक गांव -- में मोहन नाम का लड़का रहता था। उसके पिताजी एक मामूली मजदूर थे।}

\directlua{
  local extracttext = require'extracttext'
  local f = io.open("hello.harf.txt","w")
  % Can reproduce the full input text
  f:write(extracttext(tex.getbox(0)))
  f:close()

f = io.open("hello.node.txt","w") % In node mode, we only get an approximation f:write(extracttext(tex.getbox(1))) f:close() }

\box0
\box1
\end{document}

A more general note: As you can see, there is some work involved when it comes to getting text from a shaped list, especially in the ToUnicode case where we have to map surrogate pairs and so on. This is mostly because shaped text is not intended for such use. As soon as the glyph nodes are protected (i.e. subtype(n) >= 256, or not is_char(n) is true), the .char entries no longer contain Unicode values but internal identifiers, the .font entry might no longer be the value you expect, and some glyphs might not be represented as glyphs at all. In most cases in which you want to actually access the text behind a box, and not just the visual display of the text, you really want to intercept the list before it gets shaped in the first place.
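
A small illustration of that distinction (my sketch, not part of the answer; it assumes box 0 has already been built, so most or all of its glyphs will already be protected):

  -- run inside a luacode* environment after box 0 has been set
  local is_char  = node.is_char
  local glyph_id = node.id("glyph")
  for n in node.traverse_id(glyph_id, tex.getbox(0).head) do
    if is_char(n) then
      -- still a character: .char is a Unicode codepoint and safe for utf8.char
      texio.write_nl("char  " .. utf8.char(n.char))
    else
      -- protected glyph: .char is a font-internal identifier, not a codepoint
      texio.write_nl("glyph " .. n.char)
    end
  end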

  • Which implies that time's directionality is part of the definition of the typeset object, which in turn implies that the process is not symmetrical when time's arrow is reversed. Indeed, with \expandafter\newcommand\csname xx123\endcsname{^^^^0065} \setbox0=\hbox{abc{\Huge \csname xx123\endcsname}}, the "text" that is retrieved is abce. The analogy of TeX's 'mouth' and 'stomach' comes to mind. ("Reconstructed" might be more suitable a term than "retrieved".) – Cicada Sep 26 '20 at 05:11