4

The Devanagari UTF-8 text extraction code doesn't produce a Lua error when using fontspec's Node renderer, but it still doesn't produce the correct result, due to a possible bug in the Node renderer. I have filed that as a separate question: LuaTeX: Devanagari glyph order is reversed in TeX's internal nodelist, how to recover the correct order while traversing glyph nodes?

While experimenting with different techniques for UTF-8 text extraction from TeX boxes, I found that the two techniques that do not produce any Lua error when extracting Devanagari text with fontspec's Node renderer both produce Lua errors when using fontspec's HarfBuzz renderers (Renderer=Harfbuzz, Renderer=OpenType).

The two techniques are detailed here: technique-1 (using michal.h21's get_unicode function) and here: technique-2 (just applying unicode.utf8.char to the constituent components of complex glyphs). I tried multiple Devanagari fonts; all result in the same behavior.

Complete test code for both techniques and their respective error signatures are listed one after the other in the blocks below. For my example, I used the freely available Noto Sans Devanagari (regular weight), available here: link to google fonts GitHub for Noto Sans Devanagari

Technique-1 with Devanagari and HarfBuzz (no Lua error if compiled with the Node renderer):

\documentclass{article}
\usepackage[lmargin=0.5in,tmargin=0.5in,rmargin=0.5in,bmargin=0.5in]{geometry}
\usepackage{fontspec}
\usepackage{microtype}

%\newfontscript{Devanagari}{deva,dev2}
\newfontfamily{\devanagarifam}{Noto Sans Devanagari}[Script=Devanagari, Scale=1, Renderer=HarfBuzz]

\begin{document}

% Devanagari text is at the right end of following line
% of code, you might have to scroll right to read it
\setbox0=\hbox{Příliš žluťoučký \textit{kůň} úpěl \hbox{ďábelské} ódy difference diffierence. \devanagarifam एक गांव -- में मोहन नाम का लड़का रहता था। उसके पिताजी एक मामूली मजदूर थे।}

\directlua{
% local fontstyles = require "l4fontstyles"
local char = unicode.utf8.char
local glyph_id = node.id("glyph")
local glue_id = node.id("glue")
local hlist_id = node.id("hlist")
local vlist_id = node.id("vlist")
local disc_id = node.id("disc")
local minglue = tex.sp("0.2em")
local usedcharacters = {}
local identifiers = fonts.hashes.identifiers
local function get_unicode(xchar,font_id)
  local current = {}
  local uchar = identifiers[font_id].characters[xchar].tounicode
  for i= 1, string.len(uchar), 4 do
    local cchar = string.sub(uchar, i, i + 3)
    print(xchar,uchar,cchar, font_id, i)
    table.insert(current,char(tonumber(cchar,16)))
  end
  return current
end
local function nodeText(n)
  local t = {}
  for x in node.traverse(n) do
    % glyph node
    if x.id == glyph_id then
      % local currentchar = fonts.hashes.identifiers[x.font].characters[x.char].tounicode
      local chars = get_unicode(x.char,x.font)
      for _, current_char in ipairs(chars) do
        table.insert(t,current_char)
      end
    % glue node
    elseif x.id == glue_id and node.getglue(x) > minglue then
      table.insert(t," ")
    % discretionaries
    elseif x.id == disc_id then
      table.insert(t, nodeText(x.replace))
    % recursivelly process hlist and vlist nodes
    elseif x.id == hlist_id or x.id == vlist_id then
      table.insert(t,nodeText(x.head))
    end
  end
  return table.concat(t)
end
local n = tex.getbox(0)
print(nodeText(n.head))
local f = io.open("hello.txt","w")
f:write(nodeText(n.head))
f:close()
}

\box0
\end{document}

Error signature for Technique-1 (HarfBuzz renderer):

[\directlua]:1: bad argument #1 to 'len' (string expected, got nil)
stack traceback:
    [C]: in function 'string.len'
    [\directlua]:1: in upvalue 'get_unicode'
    [\directlua]:1: in local 'nodeText'
    [\directlua]:1: in main chunk.
l.62 }

Technique-2 with Devanagari and HarfBuzz (no Lua error if compiled with the Node renderer):

\documentclass{article}
\usepackage[lmargin=0.5in,tmargin=0.5in,rmargin=0.5in,bmargin=0.5in]{geometry}
\usepackage{fontspec}
\usepackage{microtype}

%\newfontscript{Devanagari}{deva,dev2}
\newfontfamily{\devanagarifam}{Noto Sans Devanagari}[Script=Devanagari, Scale=1, Renderer=HarfBuzz]

\begin{document}

% Devanagari text is at the right end of following line
% of code, you might have to scroll right to read it
\setbox0=\hbox{Příliš žluťoučký \textit{kůň} úpěl \hbox{ďábelské} ódy difference diffierence. \devanagarifam एक गांव -- में मोहन नाम का लड़का रहता था। उसके पिताजी एक मामूली मजदूर थे।}

\directlua{
local glyph_id = node.id("glyph")
local disc_id = node.id("disc")
local glue_id = node.id("glue")
local hlist_id = node.id("hlist")
local vlist_id = node.id("vlist")
local minglue = tex.sp("0.2em")
local function nodeText(n)
  local t = {}
  for x in node.traverse(n) do
    % glyph node
    if x.id == glyph_id then
      if bit32.band(x.subtype,2) \csstring~=0 and unicode.utf8.char(x.char) \csstring~="“" and unicode.utf8.char(x.char) \csstring~="”" then %
        for g in node.traverse_id(glyph_id,x.components) do
          if bit32.band(g.subtype, 2) \csstring~=0 then
            for gc in node.traverse_id(glyph_id,g.components) do
              table.insert(t,unicode.utf8.char(gc.char))
            end
          else
            table.insert(t,unicode.utf8.char(g.char))
          end
        end
      else
        table.insert(t,unicode.utf8.char(x.char))
      end
    % disc node
    elseif x.id == disc_id then
      for g in node.traverse_id(glyph_id,x.replace) do
        if bit32.band(g.subtype, 2) \csstring~=0 then
          for gc in node.traverse_id(glyph_id,g.components) do
            table.insert(t,unicode.utf8.char(gc.char))
          end
        else
          table.insert(t,unicode.utf8.char(g.char))
        end
      end
    % glue node
    elseif x.id == glue_id and node.getglue(x) > minglue then
      table.insert(t," ")
    elseif x.id == hlist_id or x.id == vlist_id then
      table.insert(t,nodeText(x.head))
    end
  end
  return table.concat(t)
end
local n = tex.getbox(0)
print(nodeText(n.head))
local f = io.open("hello.txt","w")
f:write(nodeText(n.head))
f:close()

}

\box0
\end{document}

Error signature for Technique-2 (HarfBuzz renderer):

[\directlua]:1: bad argument #1 to 'char' (invalid value)
stack traceback:
    [C]: in field 'char'
    [\directlua]:1: in local 'nodeText'
    [\directlua]:1: in main chunk.
l.64 }
codepoet
  • 1,316
  • I get the same message with any font rendered with harfbuzz (\documentclass{article} \usepackage{fontspec} \newfontfamily\fordinary{Noto Sans}[Renderer=HarfBuzz] \begin{document} \fordinary \setbox0=\hbox{abc}...), as if there is no string anymore. – Cicada Sep 23 '20 at 09:50
  • @Cicada I am not sure what your complete code was. If I add \box0 \end{document} to the tail end of your code, it does compile successfully and produce a correct PDF. I am using TeX Live 2020 with packages last updated a week or so ago. – codepoet Sep 25 '20 at 10:03
  • The luacode (too long to paste in a comment): \documentclass{article} \usepackage{fontspec} \newfontfamily\fordinary{Noto Sans}[Renderer=HarfBuzz] \begin{document} \fordinary \setbox0=\hbox{abc} \directlua{ .... } \box0 \end{document}. – Cicada Sep 26 '20 at 03:14

2 Answers

3

I don't know too much about how HarfBuzz fonts are handled by Luaotfload, but I was able to find a way to get the tounicode fields, thanks to table.serialize (see the inspection sketch after the hello.txt output below). So my original code adapted for HarfBuzz looks like this:

\documentclass{article}
\usepackage[lmargin=0.5in,tmargin=0.5in,rmargin=0.5in,bmargin=0.5in]{geometry}
\usepackage{fontspec}
\usepackage{microtype}
\usepackage{luacode}

%\newfontscript{Devanagari}{deva,dev2}
\newfontfamily{\devanagarifam}{Noto Sans Devanagari}[Script=Devanagari, Scale=1, Renderer=HarfBuzz]
\newfontfamily{\arabicfam}{Amiri}[Script=Arabic, Scale=1, Renderer=HarfBuzz]

\begin{document}

\setbox0=\hbox{Příliš žluťoučký \textit{kůň} úpěl \hbox{ďábelské} ódy difference diffierence. \devanagarifam एक गांव -- में मोहन नाम का लड़का रहता था। उसके पिताजी एक मामूली मजदूर थे।}
\setbox1=\hbox{\arabicfam \textdir TRT هذه المقالة عن براغ. لتصفح عناوين مشابهة، انظر براغ (توضيح).}

\begin{luacode*}
-- local fontstyles = require "l4fontstyles"
local char = unicode.utf8.char
local glyph_id = node.id("glyph")
local glue_id = node.id("glue")
local hlist_id = node.id("hlist")
local vlist_id = node.id("vlist")
local disc_id = node.id("disc")
local minglue = tex.sp("0.2em")
local usedcharacters = {}
local identifiers = fonts.hashes.identifiers
local fontcache = {}

local function to_unicode_chars(uchar)
  local uchar = uchar or ""
  -- put characters into a table
  local current = {}
  -- each codepoint is 4 bytes long, we loop over tounicode entry and cut it into 4 bytes chunks
  for i= 1, string.len(uchar), 4 do
    local cchar = string.sub(uchar, i, i + 3)
    -- codepoint is hex string, we need to convert it to number ad then to UTF8 char
    table.insert(current,char(tonumber(cchar,16)))
  end
  return current
end

-- cache character lookup, to speed up things
local function get_character_from_cache(xchar, font_id)
  local current_font = fontcache[font_id] or {characters = {}}
  -- initialize font cache for the current font if it doesn't exist
  fontcache[font_id] = current_font
  return current_font.characters[xchar]
end

-- save characters to cache for faster lookup
local function save_character_to_cache(xchar, font_id, replace)
  fontcache[font_id][xchar] = replace
  -- return value
  return replace
end

local function initialize_harfbuzz_cache(font_id, hb)
  -- save some harfbuzz tables for faster lookup
  local current_font = fontcache[font_id]
  -- the unicode data can be in two places
  -- 1. hb.shared.glyphs[glyphid].backmap
  current_font.glyphs = current_font.glyphs or hb.shared.glyphs
  -- 2. hb.shared.unicodes
  -- it contains mapping between Unicode and glyph id
  -- we must create new table that contains reverse mapping
  if not current_font.backmap then
    current_font.backmap = {}
    for k,v in pairs(hb.shared.unicodes) do
      current_font.backmap[v] = k
    end
  end
  -- save it back to the font cache
  fontcache[font_id] = current_font
  return current_font.glyphs, current_font.backmap
end

local function get_unicode(xchar,font_id)
  -- try to load character from cache first
  local current_char = get_character_from_cache(xchar, font_id)
  if current_char then return current_char end
  -- get tounicode for non HarfBuzz fonts
  local characters = identifiers[font_id].characters
  local uchar = characters[xchar].tounicode
  -- stop processing if tounicode exists
  if uchar then return save_character_to_cache(xchar, font_id, to_unicode_chars(uchar)) end
  -- detect if font is processed by Harfbuzz
  local hb = identifiers[font_id].hb
  -- try HarfBuzz data
  if not uchar and hb then
    -- get glyph index of the character
    local index = characters[xchar].index
    -- load HarfBuzz tables from cache
    local glyphs, backmap = initialize_harfbuzz_cache(font_id, hb)
    -- get tounicode field from HarfBuzz glyph info
    local tounicode = glyphs[index].tounicode
    if tounicode then
      return save_character_to_cache(xchar, font_id, to_unicode_chars(tounicode))
    end
    -- if this fails, try backmap, which contains mapping between glyph index and Unicode
    local backuni = backmap[index]
    if backuni then
      return save_character_to_cache(xchar, font_id, {char(backuni)})
    end
    -- if this fails too, discard this character
    return save_character_to_cache(xchar, font_id, {})
  end
  -- return just the original char if everything else fails
  return save_character_to_cache(xchar, font_id, {char(xchar)})
end

local function nodeText(n)
  -- output buffer
  local t = {}
  for x in node.traverse(n) do
    -- glyph node
    if x.id == glyph_id then
      -- get table with characters for current node.char
      local chars = get_unicode(x.char,x.font)
      for _, current_char in ipairs(chars) do
        -- save characters to the output buffer
        table.insert(t,current_char)
      end
    -- glue node
    elseif x.id == glue_id and node.getglue(x) > minglue then
      table.insert(t," ")
    -- discretionaries
    elseif x.id == disc_id then
      table.insert(t, nodeText(x.replace))
    -- recursivelly process hlist and vlist nodes
    elseif x.id == hlist_id or x.id == vlist_id then
      table.insert(t,nodeText(x.head))
    end
  end
  return table.concat(t)
end

local n = tex.getbox(0)
local n1 = tex.getbox(1)
print(nodeText(n.head))
local f = io.open("hello.txt","w")
f:write(nodeText(n.head))
f:write(nodeText(n1.head))
f:close()
\end{luacode*}

\box0

\box1
\end{document}

I've also added an Arabic sample from Wikipedia. Here is the content of hello.txt:

Příliš žluťoučký kůň úpěl ďábelské ódy difference diffierence. एक गांव -- में मोहन नाम का लड़का रहता था। उसके पताजी एक मामूली मजदूर थे।هذه المقالة عن براغ. لتصفح عناوين مشابهة، انظر براغ (توضيح).
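
For reference, the place where luaotfload keeps the HarfBuzz data can be found by dumping the font tables with table.serialize, as mentioned above. A minimal inspection sketch (my own illustration, not part of the code above; the use of font.current() is just one way to pick a font id, and the subtable names are the ones used in the code above):

  -- run inside a luacode* environment after the HarfBuzz font has been selected,
  -- then search the log for the tounicode fields
  local fontdata = fonts.hashes.identifiers[font.current()]
  if fontdata.hb then
    texio.write_nl(table.serialize(fontdata.hb.shared.glyphs, "glyphs"))
    texio.write_nl(table.serialize(fontdata.hb.shared.unicodes, "unicodes"))
  else
    texio.write_nl(table.serialize(fontdata.characters, "characters"))
  end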

Two important functions are these

  local function to_unicode_chars(uchar)
    local uchar = uchar or ""
    local current = {}
    for i= 1, string.len(uchar), 4 do
      local cchar = string.sub(uchar, i, i + 3)
      table.insert(current,char(tonumber(cchar,16)))
    end
    return current
  end

The to_unicode_chars function splits tounicode entries into four-byte chunks, which are then converted to UTF-8 characters. It can also handle glyphs without a tounicode entry; in that case it simply returns an empty table.
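
For example (my own illustration, not part of the answer's code; it assumes it is placed where to_unicode_chars is in scope, e.g. inside the luacode* block above), a glyph that maps to two codepoints simply has eight hex digits in its tounicode value:

  -- "0915094D" is a UTF-16 hex tounicode value for क (U+0915) followed by ् (U+094D);
  -- to_unicode_chars cuts it into "0915" and "094D" and converts each chunk
  print(table.concat(to_unicode_chars("0915094D")))  -- क्
  print(#to_unicode_chars(nil))                      -- 0, no tounicode entry at all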

  local function get_unicode(xchar,font_id)
    -- try to load character from cache first
    local current_char = get_character_from_cache(xchar, font_id) 
    if current_char then return current_char end
    -- get tounicode for non HarfBuzz fonts
    local characters = identifiers[font_id].characters
    local uchar = characters[xchar].tounicode
    -- stop processing if tounicode exists
    if uchar then return save_character_to_cache(xchar, font_id, to_unicode_chars(uchar)) end
    -- detect if font is processed by Harfbuzz
    local hb = identifiers[font_id].hb
    -- try HarfBuzz data
    if not uchar and hb then 
      -- get glyph index of the character
      local index = characters[xchar].index
      -- load HarfBuzz tables from cache
      local glyphs, backmap = initialize_harfbuzz_cache(font_id, hb)
      -- get tounicode field from HarfBuzz glyph info
      local tounicode = glyphs[index].tounicode
      if tounicode then
        return save_character_to_cache(xchar, font_id, to_unicode_chars(tounicode))
      end
      -- if this fails, try backmap, which contains mapping between glyph index and Unicode
      local backuni = backmap[index]
      if backuni then 
        return save_character_to_cache(xchar, font_id, {char(backuni)})
      end
      -- if this fails too, discard this character
      return save_character_to_cache(xchar, font_id, {})
    end
    -- return just the original char if everything else fails
    return save_character_to_cache(xchar, font_id, {char(xchar)})
  end

This function first tries to load Unicode data from the current font info. If that fails, it looks the character up in the HarfBuzz tables. Most characters have a tounicode mapping in the glyphs table. If that isn't available, it tries the backmap built from the unicodes table, which maps Unicode codepoints to glyph indices. If even this fails, the character is discarded.
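
The backmap step can be shown in isolation (a sketch of mine with made-up glyph indices, not data from a real font): since hb.shared.unicodes maps Unicode codepoints to glyph indices, inverting it gives the glyph-index-to-Unicode table used as the fallback.

  -- hypothetical numbers, for illustration only
  local unicodes = { [0x092A] = 351, [0x093F] = 372 }  -- as if taken from hb.shared.unicodes
  local backmap = {}
  for uni, gid in pairs(unicodes) do backmap[gid] = uni end
  print(utf8.char(backmap[351]))                       -- प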

michal.h21
  • 50,697
  • Seems very close to the solution, though there is a problem. I see a difference between the input and output Devanagari strings; there is one complex glyph missing in the Unicode string output to the terminal & the file hello.txt (it's there in the PDF). The input Devanagari string looks like this: ...उसके पिताजी एक..., whereas the output string produced by your code looks like this: ... उसके ताजी एक.... There is one missing glyph: पि in the word पिताजी. It would also help if you could add some inline comments in the function to_unicode_chars or elaborate on it more; I don't understand the format of the input and the math there. – codepoet Sep 23 '20 at 02:38
  • @reportaman I've expanded my answer. It seems that this one character that is missing doesn't have any mapping. It also isn't valid utf-8 codepoint, so we cannot use it at all. – michal.h21 Sep 23 '20 at 11:12
  • Hi Michal, I am yet to test your latest code, though there is a bug somewhere (either our usage of the 'char' field when using the HarfBuzz renderer, or the HarfBuzz renderer's potential misuse of the 'char' field by putting a 'string' there instead of a 'char'). I have dissected the error further in my comment on a bug I filed on GitHub for the same issue. Here's the link: https://github.com/Josef-Friedrich/nodetree/issues/6#issuecomment-697907215 Please let me know what you think. – codepoet Sep 23 '20 at 19:28
  • @reportaman you shouldn't get errors at this moment, because char function is not used directly on node.char in most cases, only on tounicode entries, which should be safe. Khaled's explanation what node.char contains is nice, especially that if node.char value > 0x120000 then it is a glyph index. you could use char on lower values in theory, but it will produce ligatures, so you don't want it. the real issue is that this particular glyph index 1180255 doesn't have any Unicode info in the font table, so we cannot use it. – michal.h21 Sep 23 '20 at 21:40
  • Wow, that looks like a lot of possible sources of information. Have you ever tried the function get_glyph_info pointed out by Khaled? Given that he says it works with all the renderers, I wonder if a solution based on it would be a relatively shorter one that works for all cases (of valid input Unicode text fed to TeX)? Thanks anyway. – codepoet Sep 24 '20 at 05:35
  • @reportaman yes, but it doesn't work for your use case - it doesn't separate ligature components and chars with high values are printed as ^^^^^^12025F, so you won't get original text this way. – michal.h21 Sep 24 '20 at 08:47
  • Still something off in the same Unicode substring that had the problem: ...उसके पिताजी एक... in the TeX input is now being output as ...उसके पताजी एक... in hello.txt & the terminal. पि is made up of प + ि, and in the output ि is missing. Here are the Unicode points: प is U+092A, and ि is U+093F. I wonder if there is a problem with the ordering of प + ि that makes ि disappear? – codepoet Sep 24 '20 at 22:15
  • I have added a related question here: https://tex.stackexchange.com/questions/564097/luatex-devanagari-glyph-order-in-reversed-in-texs-internal-nodelist-how-to-re – codepoet Sep 25 '20 at 02:18
2

In node mode, restoring the full text is not generally possible because you get shaped output, and shaped glyphs cannot be uniquely mapped back to input text. You can only approximate it by using tounicode values. These map to the actual PDF file ToUnicode CMap entries and therefore follow their restricted model of glyph-to-Unicode mapping: every glyph is equivalent to a fixed sequence of Unicode codepoints. These mappings are then concatenated in rendering order. As you have seen, this model is not sufficient to map Devanagari glyphs back to input text.
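
To make this concrete, take the syllable पि from the question's sample text (an illustration of mine, not from the answer; the reordering shown is what the shaper does with a pre-base vowel sign, as discussed in the linked follow-up question):

  -- पि is typed as प (U+092A) followed by the vowel sign ि (U+093F), but the
  -- shaper places the pre-base matra glyph before the consonant, so per-glyph
  -- ToUnicode values concatenated in rendering order come out reversed
  local typed    = utf8.char(0x092A, 0x093F)  -- "पि", input order
  local rendered = utf8.char(0x093F, 0x092A)  -- "िप", glyph/rendering order
  print(typed == rendered)                    -- false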

You can use harf mode instead to avoid the issue: harf mode is not affected by this limited model because it does not just give you a shaped list of glyphs, but additionally creates PDF marked-content ActualText entries overriding the ToUnicode mappings in sequences which cannot be correctly modeled through ToUnicode. The data necessary for this mapping can be queried from Lua code using the glyph_data property. (This is an undocumented implementation detail and might change in the future.)
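
To see what that property holds, you can dump it for every glyph in a box (a probe of my own, not part of the module below; it assumes box 0 has already been set with HarfBuzz-shaped material and uses the glyph_info field name that the module below relies on):

  -- run inside a luacode* environment after \setbox0 has been set
  for n in node.traverse_id(node.id("glyph"), tex.getbox(0).head) do
    local p = node.getproperty(n)
    texio.write_nl(string.format("char=%d  glyph_info=%s", n.char, tostring(p and p.glyph_info)))
  end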

If you want to extract as much as possible from any text, you can combine the property-based and the ToUnicode-based approach in your Lua code:

Create a file extracttext.lua with

local type = type
local char = utf8.char
local unpack = table.unpack
local getproperty = node.getproperty
local getfont = font.getfont
local is_glyph = node.is_glyph

-- tounicode id UTF-16 in hex, so we need to handle surrogate pairs...
local utf16hex_to_utf8 do -- Untested, but should more or less work
  local l = lpeg
  local tonumber = tonumber
  local hex = l.R('09', 'af', 'AF')
  local byte = hex * hex
  local simple = byte * byte / function(s) return char(tonumber(s, 16)) end
  local surrogate = l.S'Dd' * l.C(l.R('89', 'AB', 'ab') * byte)
                  * l.S'Dd' * l.C(l.R('CF', 'cf') * byte)
                  / function(high, low)
                      return char(0x10000 + ((tonumber(high, 16) & 0x3FF) << 10 | (tonumber(low, 16) & 0x3FF)))
                    end
  utf16hex_to_utf8 = l.Cs((surrogate + simple)^0)
end

-- First the non-harf case

-- Standard caching setup
local identity_table = setmetatable({}, {__index = function(_, id) return char(id) end})
local cached_text = setmetatable({}, {__index = function(t, fid)
  local fontdir = getfont(fid)
  local characters = fontdir and fontdir.tounicode == 1 and fontdir.characters
  local font_cache = characters and setmetatable({}, {__index = function(tt, slot)
    local character = characters[slot]
    local text = character and character.tounicode or slot
    -- At this point we have the tounicode value in text. This can have different forms.
    -- The order in the if ... elseif chain is based on how likely it is to encounter them.
    -- This is a small performance optimization.
    local t = type(text)
    if t == 'string' then
      text = utf16hex_to_utf8:match(text)
    elseif t == 'number' then
      text = char(text)
    elseif t == 'table' then
      text = char(unpack(text)) -- I haven't tested this case, but it should work
    end
    tt[slot] = text
    return text
  end}) or identity_table
  t[fid] = font_cache
  return font_cache
end})

-- Now the tounicode case just has to look up the value
local function from_tounicode(n)
  local slot, fid = is_glyph(n)
  return cached_text[fid][slot]
end

-- Now the traversing stuff. Nothing interesting to see here except for the
-- glyph case
local traverse = node.traverse
local glyph, glue, disc, hlist, vlist =
    node.id'glyph', node.id'glue', node.id'disc', node.id'hlist', node.id'vlist'
local extract_text_vlist
-- We could replace i by #t+1 but this should be slightly faster
local function real_extract_text(head, t, i)
  for n, id in traverse(head) do
    if id == glyph then
      -- First handle harf mode: Look for a glyph_info property. If that does not exists
      -- use from_tounicode. glyph_info will sometimes/often be an empty string. That's
      -- intentional and it should not trigger a fallback. The actual mapping will be
      -- contained in surrounding chars.
      local props = getproperty(n)
      t[i] = props and props.glyph_info or from_tounicode(n)
      i = i + 1
    elseif id == glue then
      if n.width > 1001 then -- 1001 is arbitrary but sufficiently high to be bigger than most weird glue uses
        t[i] = ' '
        i = i + 1
      end
    elseif id == disc then
      i = real_extract_text(n.replace, t, i)
    elseif id == hlist then
      i = real_extract_text(n.head, t, i)
    elseif id == vlist then
      i = extract_text_vlist(n.head, t, i)
    end
  end
  return i
end
function extract_text_vlist(head, t, i) -- glue should not become a space here
  for n, id in traverse(head) do
    if id == hlist then
      i = real_extract_text(n.head, t, i)
    elseif id == vlist then
      i = extract_text_vlist(n.head, t, i)
    end
  end
  return i
end
return function(list)
  local t = {}
  real_extract_text(list.head, t, 1)
  return table.concat(t)
end

This can be used as a normal Lua module:

\documentclass{article}
\usepackage{fontspec}

\newfontfamily{\devharf}{Noto Sans Devanagari}[Script=Devanagari, Renderer=HarfBuzz]
\newfontfamily{\devnode}{Noto Sans Devanagari}[Script=Devanagari, Renderer=Node]

\begin{document}

% Devanagari text is at the right end of following line
% of code, you might have to scroll right to read it
\setbox0=\hbox{Příliš žluťoučký \textit{kůň} úpěl \hbox{ďábelské} ódy difference diffierence. \devharf एक गांव -- में मोहन नाम का लड़का रहता था। उसके पिताजी एक मामूली मजदूर थे।}
\setbox1=\hbox{Příliš žluťoučký \textit{kůň} úpěl \hbox{ďábelské} ódy difference diffierence. \devnode एक गांव -- में मोहन नाम का लड़का रहता था। उसके पिताजी एक मामूली मजदूर थे।}

\directlua{
  local extracttext = require'extracttext'
  local f = io.open("hello.harf.txt","w")
  % Can reproduce the full input text
  f:write(extracttext(tex.getbox(0)))
  f:close()

f = io.open("hello.node.txt","w") % In node mode, we only get an approximation f:write(extracttext(tex.getbox(1))) f:close() }

\box0
\box1
\end{document}

A more general note: As you can see, there is some work involved when it comes to getting text from a shaped list, especially in the ToUnicode case where we have to map surrogate pairs and so on. This is mostly because shaped text is not intended for such use. As soon as the glyph nodes are protected (i.e. subtype(n) >= 256, or not is_char(n) is true), the .char entries no longer contain Unicode values but internal identifiers, the .font entry might no longer be the value you expect, and some glyphs might not be represented as glyphs at all. In most cases in which you want to actually access the text behind a box, and not just the visual display of the text, you really want to intercept the list before it gets shaped in the first place.
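
A small illustration of that distinction (my sketch, not part of the answer; it assumes box 0 has already been built, so most or all of its glyphs will already be protected):

  -- run inside a luacode* environment after box 0 has been set
  local is_char  = node.is_char
  local glyph_id = node.id("glyph")
  for n in node.traverse_id(glyph_id, tex.getbox(0).head) do
    if is_char(n) then
      -- still a character: .char is a Unicode codepoint and safe for utf8.char
      texio.write_nl("char  " .. utf8.char(n.char))
    else
      -- protected glyph: .char is a font-internal identifier, not a codepoint
      texio.write_nl("glyph " .. n.char)
    end
  end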

  • Which implies that time's directionality is part of the definition of the typeset object, which in turn implies that the process is not symmetrical when time's arrow is reversed. Indeed, with \expandafter\newcommand\csname xx123\endcsname{^^^^0065} \setbox0=\hbox{abc{\Huge \csname xx123\endcsname}}, the "text" that is retrieved is abce. The analogy of TeX's 'mouth' and 'stomach' comes to mind. ("Reconstructed" might be more suitable a term than "retrieved".) – Cicada Sep 26 '20 at 05:11