Extract a substring from a Chinese string variable

Question

Looks like the stringstrings package can do this:

\substring{This is a string.}{1}{4}

gives This, but it seems to fail for non-ASCII characters:

\substring{这是一个句子。}{1}{4}

does not give 这是一个.

Is there a similar package that works with Chinese characters?

Please tell us which TeX engine (pdfLaTeX, XeLaTeX, LuaLaTeX, something else?) you employ and which language-related packages (if any) you load. — Mico, May 16 '18 at 11:41
XeLaTeX, with the xeCJK, xunicode, and fontspec packages. Thanks. — oaheix, May 16 '18 at 11:43
I think this happens because there are no spaces between Chinese characters, so the command treats the whole string as a single word. — gvgramazio, May 16 '18 at 11:46
@giusva - More likely, it's because (a) in the utf8 encoding system, Chinese characters take up more than 2 bytes and (b) the stringstrings package is not sufficiently utf8-aware. (The package's user guide says that it can handle some 2-byte-encoded characters; however, that's not full utf8-awareness.) — Mico, May 16 '18 at 12:10

score 3 · Accepted Answer · answered May 16 '18 at 12:01

If you can switch to LuaLaTeX, it's straightforward to create a "wrapper" LaTeX macro that invokes the Lua function unicode.utf8.sub. (unicode.utf8.sub is a utf8-aware version of the standard string.sub function.)

\documentclass{article}
\usepackage{fontspec}
\setmainfont{MingLiU} % or whatever font you prefer (it must contain CJK glyphs)
% \substring{<string>}{<first>}{<last>}: positions count characters, not bytes
\newcommand\substring[3]{\directlua{%
   tex.sprint( unicode.utf8.sub ( "#1", #2 , #3 ) ) }}
\begin{document}
这是一个句子。

\substring{这是一个句子。}{1}{4} % typesets 这是一个
\end{document}
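
To see why the utf8-aware function matters, here is a minimal sketch (not part of the code above) that compares byte and character counts for the same string. It assumes LuaLaTeX and that the unicode.utf8 table is available in your LuaTeX build, as it is for the wrapper above; the tostring calls are only there to make sure tex.sprint receives a string.

```latex
% Compile with LuaLaTeX.
\documentclass{article}
\begin{document}
% string.len counts bytes: each of the 7 characters occupies 3 bytes in UTF-8.
Bytes: \directlua{tex.sprint(tostring(string.len("这是一个句子。")))} % 21

% unicode.utf8.len counts characters instead.
Characters: \directlua{tex.sprint(tostring(unicode.utf8.len("这是一个句子。")))} % 7
\end{document}
```

By the same logic, the plain string.sub with positions 1 and 4 would return the three bytes of 这 plus the first byte of 是, which is not valid UTF-8, whereas unicode.utf8.sub returns the first four characters.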

Mico

Thanks very much, but my case depends heavily on the xeCJK package, which does not work with LuaLaTeX. If no XeLaTeX-based solution turns up in the near future, I will mark your answer as accepted. – oaheix May 17 '18 at 00:29
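
For the XeLaTeX case raised in this comment, one thing that may be worth trying is the xstring package. This is a sketch under assumptions, not something proposed in the thread: it assumes that under XeLaTeX each Chinese character reaches xstring as a single token, so that \StrMid counts characters rather than bytes, and that xeCJK can find a CJK font on your system (otherwise set one with \setCJKmainfont).

```latex
% Compile with XeLaTeX.
\documentclass{article}
\usepackage{xeCJK}   % assumption: a CJK font is available or configured
\usepackage{xstring} % provides \StrMid{<string>}{<first>}{<last>}
\begin{document}
这是一个句子。

% Intended to extract characters 1 through 4 of the string.
\StrMid{这是一个句子。}{1}{4}
\end{document}
```

If the string is stored in a macro, xstring expands its arguments by default, so passing the macro in place of the literal string should behave the same way.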
