
Background

I have been working, on and off for a couple of years, on a book about the world's scripts and languages. It covers all the scripts listed in Unicode, as well as a dozen or so for which there is no Unicode standard yet, where I use images instead of fonts.

With the Noto fonts, Unicode-aware LuaLaTeX, and l3 maturing, I have been able to print a reasonable range of every script needed in the write-up; the exception is the East Asian scripts, of which I have only a few pages per script. I use Brill as the main font and have added fallback fonts to cover the rest of the scripts. The book so far hovers around 350 pages, and I anticipate a final size of about 600 pages. To cover the Unicode code points, the fonts need to provide about 150,000 glyphs. Not all code points are used in the book; as mentioned, I estimate I need only about half of that. Obviously and understandably, compilation speed is an issue, so I am trying to understand the algorithm used by luaotfload-fallback.lua to see if I can improve the processing time. I am looking at strategies to optimize compilation times, not only for my document but in general.

I have identified bottlenecks mainly in three areas: (a) fonts, (b) images, and (c) logging (disk writes in general). For images I will use a preprocessor to optimize all of them and produce PDFs; I will write the preprocessor in Go, which can also do marking if needed. Ideas for fonts and logging are below.

  1. I have this (crazy) idea that the glyph info required at the nodes during processing be obtained via a local server, so that some tasks can be externalized and run concurrently. I am thinking of some form of priority queue, so that data for frequently used code points can be served fast, and code points unused on a second run can be evicted from the cache. Again, I will use Go and SQLite here, since everything is local. At the moment I have a Lua table that maps Unicode code points to fonts, based on a config file (see the sketch after this list).

  2. All logging would also be sent to a server rather than written to disk. The same could be done for the aux files.

  3. Generating the PDF also takes time, but I am undecided at this point whether it can be optimized. Current compilation speed is about 1.3 seconds per page, plus an initial 30-40 seconds.
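For reference, here is a minimal sketch of the kind of mapping table mentioned in point 1. The ranges, font names, and config format are illustrative only, not my actual setup:

-- Illustrative only: a config file maps Unicode ranges to font names,
-- expanded once into a flat codepoint -> font lookup table.
local config = {
    { first = 0x0370, last = 0x03FF, font = "NotoSerifGreek-Regular" },
    { first = 0x0600, last = 0x06FF, font = "NotoNaskhArabic-Regular" },
}

local codepoint_to_font = {}
for _, range in ipairs(config) do
    for cp = range.first, range.last do
        codepoint_to_font[cp] = range.font
    end
end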

Question

Can someone explain the algorithmic steps in luaotfload-fallback.lua? When and how is it used by LuaTeX when building a document? At which point is the glyph info needed? Any ideas welcome. Thank you for reading this far.

yannisl
  • Your level of expertise is waaaay above my head, but: Seems to me that your concept 1 is on the right track. Maybe use PHP or some other external script to pre-process the TeX file, with appropriate substitutions? Then the font loader has less to think about. – rallg Jan 08 '24 at 18:56
  • Ad 1) Wouldn't it be easier to create a Lua file with a huge hash table containing information about each glyph, including which font should be used to render it, script, languages, etc.? I guess it will be faster than querying a server, even if it is local and fast. It should be possible to get Luaotfload's font ID for the font name saved in the glyph info, and assign it to the node's font field in some of the node processing callbacks. I guess you could bypass font callbacks this way. – michal.h21 Jan 08 '24 at 19:00
  • @michal.h21 I was hoping you were around, as I value your thoughts on anything Lua. I already have a Lua table like you described; the thought behind the server was that, as you process, you prioritize certain glyphs and remove unused ones. Latency on a local server is about the same as disk reads/writes, so as the document grows these transactions hopefully get progressively faster :) – yannisl Jan 08 '24 at 19:11
  • @rallg Yes, the language is not important; it can also be done with Lua. The advantage of Go is that I now know it better than PHP (I gave up on PHP after CodeIgniter died and have forgotten most of it), and Go has the concept of runes (native UTF-8). Lua is another option too. – yannisl Jan 08 '24 at 19:16
  • I still think Lua should suffice for that. I am a bit ill at the moment, so I don't have the energy to do much right now, but I've found a nice script in LuaTeX's manual, section 12.5, page 244. It can be executed from the shell, and it loads an OTF or TTF font, listing all glyphs. It can be modified to list the supported Unicode codepoints for these glyphs if you replace g.name with g.unicode. – michal.h21 Jan 08 '24 at 20:34
  • Another possibility for finding the Unicode characters supported by fonts is to use Luaotfload's luaotfload.patch_font callback (page 30 of the manual). Here you can traverse the font metadata in tfmdata; the characters supported by the font are in tfmdata.characters. I am not sure whether processing and storing this information during the compilation is slower than reading it from an external file or database. I guess it depends on the font size. – michal.h21 Jan 08 '24 at 22:26
  • @michal.h21 Thanks, and speedy recovery. – yannisl Jan 09 '24 at 06:16
  • @yannisl I strongly doubt that the fonts are an issue here, unless you are doing something very weird. Trying a simple example document loading a fallback font whose component fonts contain about 190,000 glyphs, and then creating a full document where every page is filled with different glyphs, I get roughly 3 seconds of initialization time followed by less than 0.1 seconds per page. You could make the initialization a bit faster by caching a few more things, but that's already mostly reading cached files. – Marcel Krüger Jan 09 '24 at 23:42
  • Afterwards, during the run, TeX already has the glyph data it needs in its internal data structures, at which point it can be accessed with a single in-memory read at a fixed offset. Adding an external server there to provide data would not only require significant patching but also add a huge overhead. – Marcel Krüger Jan 09 '24 at 23:45
  • Regarding what luaotfload-fallback.lua does: it creates a Lua table mapping glyphs to font ids at initialization time, and then at runtime it just iterates over all glyphs and changes the font according to the table, without any further logic involved (see the sketch after these comments). – Marcel Krüger Jan 09 '24 at 23:49
  • @MarcelKrüger Thanks. I will create a couple of minimal examples, retest, and post them here tomorrow with a bounty when the post becomes eligible. I hear you about the server; it was a crazy idea, as I mentioned. – yannisl Jan 10 '24 at 03:47
  • @MarcelKrüger Can you please add Unicode's LastResort font and see if you get a noticeable difference when you run your test script? – yannisl Jan 10 '24 at 04:07
  • @yannisl My script already included LastResort. – Marcel Krüger Jan 10 '24 at 08:06
  • @MarcelKrüger thanks – yannisl Jan 10 '24 at 08:14
  • More precisely, it used Latin Modern, Noto Serif CJK, Unifont, Unifont Upper and LastResort. – Marcel Krüger Jan 10 '24 at 08:55
  • @MarcelKrüger I retried my scripts outside LuaLaTeX and got results similar to yours. I have added a bounty; if you can please summarize your comments, I will accept them as an answer. – yannisl Jan 11 '24 at 05:20
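To make Marcel Krüger's description concrete, here is a minimal sketch of that mechanism. This is not the actual luaotfload code; the callback choice and the names fallback_table and apply_fallbacks are illustrative only:

-- Illustrative sketch: at initialization time, build a table mapping
-- codepoints to the font id that provides them; at runtime, walk the
-- glyph nodes and retarget any glyph listed in the table.
local GLYPH = node.id("glyph")

-- Built once at initialization: codepoint -> font id.
local fallback_table = {
    -- e.g. [0x0416] = 42,
}

local function apply_fallbacks(head)
    for n in node.traverse_id(GLYPH, head) do
        local replacement = fallback_table[n.char]
        if replacement then
            n.font = replacement
        end
    end
    return head
end

luatexbase.add_to_callback("pre_linebreak_filter", apply_fallbacks,
    "illustrative-fallback")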

1 Answer


This doesn't answer the question in the title at all, but I think that it (hopefully) addresses the issues presented in the question body.

Indirect Answer

Here's a solution that loads 231 unique fonts and prints 83 020 unique characters (103 pages) in 7.505 seconds (on average) using LuaLaTeX.

First, run this script to download all the fonts:

#!/bin/sh
set -eu

mkdir fonts
cd fonts

git clone --depth 1 --no-checkout --filter=blob:none \
    https://github.com/notofonts/notofonts.github.io.git
cd notofonts.github.io
git sparse-checkout set --no-cone '!/*' '/fonts/*/hinted/ttf/*-Regular.ttf'
git checkout main
cd ..

git clone --depth 1 --no-checkout --filter=blob:none \
    https://github.com/notofonts/noto-cjk.git
cd noto-cjk
git sparse-checkout set --no-cone '!/*' '/Serif/SubsetOTF/*/*-Regular.otf'
git checkout main
cd ..

wget -O unifont-Regular.otf \
    https://unifoundry.com/pub/unifont/unifont-15.1.04/font-builds/unifont-15.1.04.otf
wget -O unifont_upper-Regular.otf \
    https://unifoundry.com/pub/unifont/unifont-15.1.04/font-builds/unifont_upper-15.1.04.otf

wget -O NotoEmoji-Regular.ttf \
    "$(curl 'https://fonts.googleapis.com/css2?family=Noto+Emoji' | grep -o 'https.*ttf')"

cd ..

Then, place the following in all-characters.lua:

-- Save some globals for speed
local ipairs = ipairs
local max = math.max
local new_node = node.new
local node_write = node.write
local pairs = pairs

-- Define some constants
local GLUE_ID = node.id("glue")
local GLYPH_ID = node.id("glyph")
local SIZE = tex.sp("10pt")

-- Get all the fonts
local fontpaths = dir.glob("*-Regular.*", "./fonts")

-- Sort the fonts such that the "preferred" fonts are last
table.sort(fontpaths, function(a, b)
    local a = file.nameonly(a):match("(.+)-Regular")
    local b = file.nameonly(b):match("(.+)-Regular")

    if a:match("Serif") and not b:match("Serif") then
        return false
    end
    if b:match("Serif") and not a:match("Serif") then
        return true
    end
    if a:match("unifont") and not b:match("unifont") then
        return true
    end
    if b:match("unifont") and not a:match("unifont") then
        return false
    end
    if #a == #b then
        return a > b
    end
    return #a > #b
end)

-- Create a mapping from codepoint to font id
local by_character = {}
local virtual_fonts = {}

for _, filename in ipairs(fontpaths) do
    local fontdata = fonts.definers.read {
        lookup = "file",
        name = filename,
        size = SIZE,
        features = {},
    }
    local id = font.define(fontdata)
    fonts.definers.register(fontdata, id)

    virtual_fonts[#virtual_fonts + 1] = { id = id }

    for codepoint, char in pairs(fontdata.characters) do
        if char.unicode == codepoint then
            by_character[codepoint] = {
                width = char.width,
                height = char.height,
                depth = char.depth,
                font = id,
                commands = {
                    -- "slot" selects this codepoint from the n-th font
                    -- in the virtual font's "fonts" list
                    { "slot", #virtual_fonts, codepoint }
                },
            }
        end
    end
end

local function print_all_chars()
    local count = 0

    tex.forcehmode()
    for codepoint, data in table.sortedpairs(by_character) do
        local glyph = new_node(GLYPH_ID)
        glyph.font = data.font
        glyph.char = codepoint

        local space = new_node(GLUE_ID)
        space.width = max(2 * SIZE - glyph.width, 0)
        glyph.next = space

        node_write(glyph)
        count = count + 1
    end
    tex.sprint("\\par Characters: " .. count)
    tex.sprint("\\par Fonts: " .. #virtual_fonts)
end

-- Make the virtual font
local id = font.define {
    name = "all-characters",
    parameters = {},
    characters = by_character,
    properties = {},
    type = "virtual",
    fonts = virtual_fonts,
}

-- Define document-level commands, for both LaTeX and ConTeXt
local new_command
if ltx then
    new_command = function(name, func)
        local index = luatexbase.new_luafunction(name)
        lua.get_functions_table()[index] = func
        token.set_lua(name, index, "protected")
    end
elseif context then
    new_command = function(name, func)
        interfaces.implement {
            name = name,
            actions = func,
            public = true,
        }
    end
end

new_command("printallchars", print_all_chars) new_command("allcharactersfont", function() font.current(id) end)

Then, you can print all the characters using the following document:

\documentclass{article}

\ExplSyntaxOn
\lua_load_module:n { all-characters }
\ExplSyntaxOff

\begin{document}
    \printallchars
\end{document}

ConTeXt is 50% faster at 4.849 seconds on average:

\ctxloadluafile{all-characters}

\starttext
    \printallchars
\stoptext

More usefully, this also defines a virtual font \allcharactersfont that contains characters from all the loaded fonts:

\documentclass{article}
\pagestyle{empty}

\ExplSyntaxOn
\lua_load_module:n { all-characters }
\ExplSyntaxOff

\begin{document}
{\allcharactersfont A Ξ Ж س क ௵ ෴ ფ ጄ ᑠ ᘟ Ⅶ ∰ ⡿ だ 㬯 ䷥}
\end{document}

[output image: the sample characters rendered from the virtual font]

Direct Answer

  1. I have this (crazy) idea that the glyph info required at the nodes during processing be obtained via a local server, so that some tasks can be externalized and run concurrently. I am thinking of some form of priority queue, so that data for frequently used code points can be served fast, and code points unused on a second run can be evicted from the cache. Again, I will use Go and SQLite here, since everything is local. At the moment I have a Lua table that maps Unicode code points to fonts, based on a config file.

The document below loads all 231 fonts in 2.426 seconds on average, so there's not much room to speed up the font loading.

\ExplSyntaxOn
\lua_load_module:n { all-characters }
\csname@@end\endcsname

If you did still want to speed it up, the easiest way would be to place the font files and luaotfload caches in a RAM disk.

  2. All logging would also be sent to a server rather than written to disk. The same could be done for the aux files.

Aside from some package initialization spam and overfull box warnings, your document shouldn't be producing that much log output. If you do have that much output, then I'd try to reduce the amount of output rather than trying to optimize how it is written.

  3. Generating the PDF also takes time, but I am undecided at this point whether it can be optimized. Current compilation speed is about 1.3 seconds per page, plus an initial 30-40 seconds.

Disabling PDF compression can help a little, but 1.3 seconds per page suggests that something else is going on.
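If you want to try that, here is a minimal sketch in Lua (these setters are part of LuaTeX's pdf backend library; run this early in the preamble, e.g. via \directlua):

-- Trade a larger output file for slightly faster PDF generation by
-- turning off stream and object-stream compression.
pdf.setcompresslevel(0)
pdf.setobjcompresslevel(0)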

Another common issue is complicated TikZ figures, so if you're drawing any glyphs with TikZ then you should externalize and cache them.

Loading images can also be slow, so if you're loading a bunch of characters as individual files, then it's quite a bit faster to combine them all into a single PDF file and select the character by page number. pdfTeX (and maybe LuaTeX too?) closes each opened PDF file after every page, so it's much faster to load all the pages/characters into individual boxes at the start of each run than it is to reload the PDF file each time. (Or better yet, see the suggestion below.)
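As a sketch of that idea in Lua (the file name, page count, and function names are hypothetical; img.scan, img.copy, and img.node are from LuaTeX's image library):

-- Hypothetical sketch: open the combined "glyphs.pdf" once, scan each
-- page into an image object, and later emit copies by page number
-- instead of reopening the file for every character.
local glyph_images = {}

local function preload_glyphs(filename, npages)
    for page = 1, npages do
        glyph_images[page] = img.scan { filename = filename, page = page }
    end
end

local function insert_glyph(page)
    -- img.node() turns a (copied) image object into a whatsit node
    node.write(img.node(img.copy(glyph_images[page])))
end

preload_glyphs("glyphs.pdf", 250)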

as well as a dozen or so for which there is no Unicode standard yet, where I use images instead of fonts.

[...]

For images I will use a preprocessor to optimize all of them and produce PDFs

If you have the character images available as SVG files, then my (unreleased/experimental) unnamed-emoji package solves almost this exact problem. There's a little bit of end-user documentation, but for actually building the “font” files you'll need to use the Makefile as a rough guide.

Max Chernoff
  • Thanks Max, impressive speeds. I will try your suggestions; I think the idea of a virtual font is much better than a fallback font. I wanted to try it but could not really understand how to use it, and your post clarifies it. I only have 3-4 scripts with images rather than fonts (I used .pngs). Some of my code: I have the fonts in a local directory, on Windows; see https://gist.github.com/yannisl/29615e20a71d54d9fa55c870605fad78 for a script. It goes about it in a roundabout way rather than using the virtual font, hence my question. – yannisl Jan 14 '24 at 11:29