LuaTeX: Devanagari glyph order is reversed in TeX's internal node list; how to recover the correct order while traversing glyph nodes?

Question

Note: The problem discussed in this question might affect not just Devanagari, but also other scripts in which the character (logical/typed) order and the glyph (presentational/typeset) order differ for some character sequences.

In Devanagari, which is an abugida writing system, some vowel notations (comparable to accents in the Latin alphabet) that are typed after a consonant change how the pair is drawn: the vowel notation occupies the space to the left of the consonant and pushes the consonant glyph to the right, even though Devanagari is a left-to-right writing/typing system.[1] LuaTeX produces a PDF that is good for reading and printing. However, it doesn't always[2] produce a PDF (for Devanagari) that is good for searching or for copying text into a text editor, even in the simplest Hello World-type test (which I have added below).

Note: This question is not just about copying/searching text correctly, so please read further before posting a reply. I discovered the problem while testing my own and user michal-h21's techniques for extracting text from a TeX box. The technique is simple: after TeX has set a box, traverse its node list, collect the Unicode character of every glyph node, and concatenate them in order to get the Unicode string of the text set in the box.

Let's take an example where a vowel notation typed after a consonant appears to precede the consonant: Hello पिताजी. In this text, the first consonant-vowel pair पि is typed in the following order: प (consonant), then ि (vowel notation). As you can see, however, the resulting text (correctly) appears with the presentational order reversed: पि (as if ि preceded प). You can try this in your text editor to see the magic.

Now let's discuss the options for typesetting this Hello पिताजी text in a PDF using LuaLaTeX and the fontspec package, along with the related problems of copying/searching and text extraction. The font used in the examples below is Noto Sans Devanagari; it's available for download here. I am using Adobe Acrobat Reader DC (free) to try copying from and searching the PDF, because not all PDF readers have good support for non-Latin scripts (you might encounter problems copy-pasting non-Latin text from other PDF readers).

Fontspec with Renderer=Node (the default value for Renderer): This mode seems particularly broken for copying/searching, and I am not sure whether there is any way to decipher the real glyph order of Devanagari text. For our test text Hello पिताजी, if you copy the text from the produced PDF and paste it into a text editor, it shows up as Hello िपताजी. The root of the problem is that, in the internal node list representation of this text, TeX smartly swapped the order of [प, ि] to [ ि,प], because in the typeset output ि is to appear to the left of प. But in doing so, it also "baked" this erroneous order into the PDF. As a result, searching the PDF for पि doesn't produce a hit, and copying from the PDF yields िप instead of the expected पि. Lastly, because TeX changed the internal order of the glyphs, text extracted by traversing the glyph nodes also contains िप instead of पि.

So the questions for Renderer=Node are: (1) Is there a way to decipher the real order of glyphs while traversing the node list? This would help with extracting the text and operating on it in other ways. (2) Is there a way to produce a correct PDF, in which text set in Devanagari can be searched/copied correctly, just like text set in Latin script?

Fontspec with Renderer=HarfBuzz (new): This mode seems to produce, at least in this small test case, a PDF in which searching/copying text works correctly. I am still trying to figure out how to extract the text correctly by traversing nodes. The node list is structured differently, and I think a solution might be found there.

So the questions for Renderer=HarfBuzz are: (1) Which fields in the node list should we look at to extract the text in the correct order? (2) The fontspec documentation says "Support for the Harfbuzz renderer is preliminary and may be improved over time."; moreover, I vaguely remember (from TUG 2020) that this renderer has some limitations compared with Node mode. Can someone list the most serious limitations? (3) What do the authors of fontspec mean by "preliminary" and "may" in the above excerpt?

Here's the test code for Renderer=Node:

% >>lualatex testdevanode.tex
\documentclass{article}
\usepackage[lmargin=0.5in,tmargin=0.5in,rmargin=0.5in,bmargin=0.5in]{geometry}
\usepackage{fontspec}
\usepackage[callback={}]{nodetree}
%\newfontscript{Devanagari}{deva,dev2}
\newfontfamily{\devanagarifam}{Noto Sans Devanagari}[Script=Devanagari, Scale=1, Renderer=Node]
\begin{document}
\NodetreeRegisterCallback{hpack_filter}

\setbox0=\hbox{Hello \devanagarifam पिताजी}

\box0

\NodetreeUnregisterCallback{hpack_filter}


\end{document}

And for Renderer=HarfBuzz. CTAN's version of nodetree encounters a bug with Renderer=HarfBuzz, so the nodetree lines have been commented out; you can uncomment them if you download and use the latest version from its GitHub repository:

% >>lualatex testdevaharf.tex
\documentclass{article}
\usepackage[lmargin=0.5in,tmargin=0.5in,rmargin=0.5in,bmargin=0.5in]{geometry}
\usepackage{fontspec}
% \usepackage[callback={}]{nodetree}
\newfontfamily{\devanagarifam}{Noto Sans Devanagari}[Script=Devanagari, Scale=1, Renderer=HarfBuzz]
\begin{document}
% \NodetreeRegisterCallback{hpack_filter}

\setbox0=\hbox{Hello \devanagarifam पिताजी}

\box0

% \NodetreeUnregisterCallback{hpack_filter}


\end{document}
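
For reference, the node-traversal extraction described earlier could look roughly like the sketch below. It is only a minimal illustration under stated assumptions: the text sits in box register 0, every glue node is treated as a space, and the helper name extract_box_text is hypothetical (it does not come from any package). It reads the char field of each glyph node in node-list order, so with Renderer=Node it would be expected to reproduce the reordered िप rather than recover पि. Glyphs that the font loader maps to private slots (ligatures, for example) may also come out as private-use code points.

```latex
\documentclass{article}
\usepackage{fontspec}
\usepackage{luacode}
\newfontfamily{\devanagarifam}{Noto Sans Devanagari}[Script=Devanagari, Renderer=Node]
\begin{luacode*}
-- Hypothetical helper (not part of any package): concatenate the
-- characters of all glyph nodes found in a box, in node-list order.
local GLYPH = node.id("glyph")
local GLUE  = node.id("glue")
local HLIST = node.id("hlist")
local VLIST = node.id("vlist")

local function collect(head, out)
  for n in node.traverse(head) do
    if n.id == GLYPH then
      out[#out + 1] = utf8.char(n.char)  -- code point stored in the glyph node
    elseif n.id == GLUE then
      out[#out + 1] = " "                -- assumption: every glue is a word space
    elseif n.id == HLIST or n.id == VLIST then
      collect(n.head, out)               -- descend into nested boxes
    end
  end
end

function extract_box_text(boxnum)
  local out = {}
  collect(tex.box[boxnum].head, out)
  -- Write the result to the terminal and the log file.
  texio.write_nl("term and log", "extracted: " .. table.concat(out))
end
\end{luacode*}
\begin{document}
\setbox0=\hbox{Hello \devanagarifam पिताजी}
\directlua{extract_box_text(0)}
\box0
\end{document}
```

The Lua code is placed in a luacode* environment so that characters such as # reach Lua unchanged. Kern, discretionary and other node types are simply skipped here; a more careful extractor would need to decide how to treat them.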

Update: I am attaching the relevant excerpt from chapter 12 of the Unicode Standard, version 13.0, which confirms that the order produced by the Node renderer violates the standard. The link to the document was pointed out by user Davislor.

[1] To a person trained in reading Latin-based scripts, the component glyphs that form a unit of sound (let's call it a consonant-vowel pair going forward) might suggest that the vowel was typed before the consonant, but that's not the case.

[2] Below we discuss a way that does seem, at least in this test example, to produce a PDF that is good for copying Devanagari text, though it might need more testing.

The encoding you describe appears to violate section 12.1 of the Unicode Standard 13.0, if anyone wants an official reference. — Davislor, Sep 25 '20 at 04:24
The simplest workaround might be to use the HarfBuzz renderer? You could also, I suppose, do a regex search-and-replace in expl3 or Lua to normalize the mis-encoded forms. — Davislor, Sep 25 '20 at 04:25
I’m not familiar with extracting the node list, unfortunately. I’m a bit curious which project this is for. — Davislor, Sep 25 '20 at 04:53
@Davislor No it does not violate Unicode Standard 13.0! Please scroll to bottom of page 461 and see the figure 12-9, it reads: "Memory Representation and Rendering Order. The storage of plain text in Devanagari and all other Indic scripts generally follows phonetic order; that is, a CV syllable with a dependent vowel is always encoded as a consonant letter C followed by a vowel sign V in the memory representation. This order is employed by the ISCII standard and corresponds to both the phonetic order and the keying order of textual data (see Figure 12-9)." — codepoet, Sep 25 '20 at 05:18
I think we’re saying the same thing? I’m referring to this part: “TeX smartly swapped the order of [प, ि] to [ ि,प] as in typeset output ि does appear to the left of प.” But, as you just quoted, the dependent vowel should always follow the consonant in the string representation. If I understand correctly. The rendering engine is supposed to sort it all out. — Davislor, Sep 25 '20 at 05:21
Oh ok, it sounded like you are saying that my input to TeX: Hello पिताजी somehow has incorrect unicode encoding in it, and that the input itself is in violation of section 12.1. Now I get it, yes we are on the same page :) — codepoet, Sep 25 '20 at 05:24
Marcel had a talk about HarfBuzz at TUG 2020. https://m.youtube.com/watch?v=xPj6vNo8exY — Ulrike Fischer, Sep 25 '20 at 05:57
@UlrikeFischer Thanks for the link. Such a coincidence that Marcel has the same example of glyph reordering in it (though the video doesn't dive into the problems this brings about in detail). What's the right place to file an improvement request with the Node renderer team? — codepoet, Oct 03 '20 at 21:45
Is this still a problem? Search and copy-paste work OK at both ends of the pipe, from and to. The font (containing the conjunct/ligature etc. information) and the font rendering engine (to handle such information) must both be suitable and applicable. The recent Noto fonts are, and so is the HarfBuzz font renderer for xelatex/lualatex. Modern browsers, operating systems and so on can now also render such fonts correctly. Both orders are correct: typing/phonetic order and rendered/display order. The font contains the rules to map from one to the other. — Cicada, Aug 21 '21 at 13:56
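
The search-and-replace workaround suggested in the comments above could be sketched in Lua as follows. This is a deliberately naive illustration, not a general solution: it assumes that the only reordering is the vowel sign ि (U+093F) appearing directly before a single consonant in the range U+0915 to U+0939, and it ignores conjuncts, nukta, reph and every other pre-base form. The function name fix_visual_order is hypothetical.

```latex
\documentclass{article}
\usepackage{luacode}
\begin{luacode*}
-- Hypothetical helper: move a vowel sign I (U+093F) that directly precedes
-- a consonant back to its logical position after that consonant.
local I_SIGN = 0x093F

local function is_consonant(cp)
  return cp >= 0x0915 and cp <= 0x0939   -- Devanagari KA .. HA
end

function fix_visual_order(s)
  -- Decode the UTF-8 string into a list of code points.
  local cps = {}
  for _, cp in utf8.codes(s) do cps[#cps + 1] = cp end
  local i = 1
  while i < #cps do
    if cps[i] == I_SIGN and is_consonant(cps[i + 1]) then
      cps[i], cps[i + 1] = cps[i + 1], cps[i]  -- swap into logical order
      i = i + 2
    else
      i = i + 1
    end
  end
  -- Re-encode the code points as UTF-8.
  local out = {}
  for k, cp in ipairs(cps) do out[k] = utf8.char(cp) end
  return table.concat(out)
end

-- Sample input: the visually ordered string copied from the Node-mode PDF,
-- written as code points (vowel sign I first, then PA, TA, AA, JA, II).
local extracted = utf8.char(0x093F, 0x092A, 0x0924, 0x093E, 0x091C, 0x0940)
texio.write_nl("term and log", "normalized: " .. fix_visual_order(extracted))
\end{luacode*}
\begin{document}
See the log file for the normalized string.
\end{document}
```

For the sample input the swap puts U+092A before U+093F, which is the logical order of पि. A real text would need cluster-aware handling, which is why a renderer that keeps the logical order in the first place is the more robust route.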
