Parsing text letter by letter

Question

The PGF package has a module named parser that can parse a block of text from an initial state to a final state letter by letter.

For example, the following MWE will parse a given text and count the "Z's", ignore the b's and highlight the a's.

\documentclass{article}
\usepackage{pgf}
\usepackage{soul}
\usepgfmodule{parser}
\begin{document}
\newcount\mycount

\pgfparserdef{myparser}{initial}{the letter Z}{\advance\mycount by 1\relax Z}
\pgfparserdef{myparser}{initial}{the letter a}{\hl{a}}
\pgfparserdef{myparser}{initial}{the letter b}{} % do nothing
\pgfparserdef{myparser}{initial}{the letter c}{\pgfparserswitch{final}}% done!

\pgfparserparse{myparser}ZZZZaabaabaZbabbbbbabaabcccc%

There are \the\mycount\ Z's.

\end{document}

The code triggers an error if the sample text includes an unknown letter (I would like such letters simply to be ignored). Is there a shorthand way to define such an action, or would I need to define every letter?

I never did figure out how to make that parser useful. I also want to know the answer to this. — Ryan Reich, Feb 24 '12 at 20:31
@RyanReich The manual does not give much info. I have been battling to get it to accept a blank space; the manual states that it should work as long as the character is represented by its \meaning. Maybe the German version of the manual gives more info. — yannisl, Feb 24 '12 at 20:47
@AhmedMusa For example the letter A hasn't been defined using \pgfparserdef and leads to errors. — yannisl, Feb 24 '12 at 21:02
The blank space thing was what got me also; I figured that I just didn't know the correct \meaning (it is rather tricky to get). I wonder if the parser doesn't accept spaces correctly? — Ryan Reich, Feb 24 '12 at 22:21
Continuing on Ahmed's question: If a non-pgf solution is ok, is it ok to require the text that you parse to be given as a macro argument? — Bruno Le Floch, Feb 24 '12 at 23:41
@RyanReich The \meaning of a blank space is a blank, but sure it does not get accepted. — yannisl, Feb 25 '12 at 04:28
@Yiannis: the \meaning of a blank space is blank space (with two trailing spaces, hence the difficulty to get it into TeX). — Bruno Le Floch, Feb 25 '12 at 04:48
@AhmedMusa If you have a beautiful and clever solution go ahead and post it:) — yannisl, Feb 25 '12 at 05:18
@BrunoLeFloch Thanks, so the one trailing space is the active part and the next its expansion? — yannisl, Feb 25 '12 at 05:40
@YiannisLazarides Not sure what you mean by "active part". The scheme for every catcode is the same: <catcode description><space><character>. For instance, the letter<space>A, or, say for a [catcode=10, charcode=64] token, blank space<space>@. To see that, try \def\showmeaning#1{\showtokens\expandafter{\meaning#1}}\lccode32=64\lowercase{\showmeaning{ }}. — Bruno Le Floch, Feb 25 '12 at 07:36
@RyanReich: I just learned about this feature of pgf 2.10, and I have to say that it's a pity I didn't know about it (and that it wasn't available 4 years ago, when I needed exactly this and had to code it myself). My use case is extracting a city name from a full postal address; in Poland, it is customary to give a zip code (of the form dd-ddd, where "d" is a digit) right before the city name. — mbork, Feb 25 '12 at 12:04
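
As a small illustration of the \meaning strings discussed in these comments (a sketch, not taken from the PGF manual): the third argument of \pgfparserdef in the question, such as "the letter Z", is the text that \meaning produces for that token. The following document typesets those strings so they can be checked directly.

```latex
\documentclass{article}
\begin{document}
\ttfamily
% \meaning describes the token that follows it
\meaning Z\par   % typesets: the letter Z
\meaning !\par   % typesets: the character !
\end{document}
```

The space cannot be inspected this way: a space typed after \meaning in the source is skipped when the control word is read, which is part of the difficulty described above.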

egreg · Answer 1 · 2012-02-25T11:04:23.967

Do you want something like this?

\documentclass{article}
\usepackage{xparse}
\usepackage{soul}
\ExplSyntaxOn
\NewDocumentCommand{\xparserdef}{mmmm}
  {
   \cs_new:cpn { xparser_name_#1_state_#2_#3: } { #4 }
  }
\NewDocumentCommand{\xparserparse}{mm}
  {
   \tl_set:Nn \l_xparser_state_tl { initial }
   \tl_set:Nx \l_tmpa_tl { \tl_to_str:n {#2} } 
   \tl_replace_all:NnV \l_tmpa_tl { ~ } \c_catcode_other_space_tl
   \tl_map_inline:Nn\l_tmpa_tl
     {
      \str_if_eq:VnF \l_xparser_state_tl { final }
        { \use:c { xparser_name_#1_state_ \l_xparser_state_tl _##1: } }
     }
  }
\tl_new:N \l_xparser_state_tl
\cs_generate_variant:Nn \tl_replace_all:Nnn {NnV}
\NewDocumentCommand{\xparserswitch}{m}
  {
   \tl_set:Nn \l_xparser_state_tl { #1 }
  }
\ExplSyntaxOff

\begin{document}
\newcount\mycount

\xparserdef{myparser}{initial}{Z}{\advance\mycount by 1\relax Z}
\xparserdef{myparser}{initial}{a}{\hl{a}}
\xparserdef{myparser}{initial}{b}{} % do nothing
\xparserdef{myparser}{initial}{ }{\textcolor{red}{S}}
\xparserdef{myparser}{initial}{c}{\xparserswitch{final}}% done!
\xparserdef{myparser}{initial}{|}{\xparserswitch{bar}}
\xparserdef{myparser}{bar}{|}{\xparserswitch{initial}}

\xparserparse{myparser}{ZZZZa ab|aabaZ|baZbbbbbabaabccccZ}

There are \the\mycount\ Z's.

\end{document}

In the definition of \xparserdef there should probably be a check that the second argument is not final.
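
A minimal sketch of such a check, assuming it is acceptable to raise an error when someone tries to attach an action to the state final. The message name final-state and its wording are made up for this illustration; the rest follows the definition of \xparserdef above.

```latex
\documentclass{article}
\usepackage{xparse}
\ExplSyntaxOn
% Hypothetical message, used only in this sketch
\msg_new:nnn { xparser } { final-state }
  { Parser~'#1':~no~action~may~be~defined~for~state~'final'. }
\NewDocumentCommand{\xparserdef}{mmmm}
  {
   % Refuse definitions for the state "final", accept all others
   \str_if_eq:nnTF { #2 } { final }
     { \msg_error:nnn { xparser } { final-state } { #1 } }
     { \cs_new:cpn { xparser_name_#1_state_#2_#3: } { #4 } }
  }
\ExplSyntaxOff
\begin{document}
\xparserdef{myparser}{initial}{a}{A} % accepted
% \xparserdef{myparser}{final}{a}{A} % would raise the error above
\end{document}
```

Since \xparserparse never looks up an action once the state is final, such a definition would otherwise be accepted silently and never used.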


Notice that the | hides the fifth Z and that the sixth is ignored because we are in state final. The macros also allow defining an action for "space" (thanks to Bruno Le Floch for suggesting the way).

Good, but that will still choke on spaces. You need to use \futurelet to do that properly, or perhaps \tl_set:Nx \l_tmpa_tl { \tl_to_str:n {#2} } \tl_replace_all:Nno \l_tmpa_tl { ~ } { \c_other_space_tl } \tl_map_inline:Nn\l_tmpa_tl { ... }. — Bruno Le Floch, Feb 25 '12 at 00:47

score 5 · Accepted Answer · answered Feb 25 '12 at 20:12

Some time ago I created a similar parser function with LuaLaTeX. I used it to read text files, count and change some characters, and insert some LaTeX commands into the output.

\documentclass{book}
\usepackage{filecontents}

%create a test text file
\begin{filecontents*}{lorem.txt}
Lorem ipsum dolor sit amet, consetetur sadipscing elitr,
sed diam nonumy eirmod tempor invidunt ut labore et dolore 
magna aliquyam erat, sed diam voluptua. At vero eos et accusam 
et justo duo dolores et ea rebum.
\end{filecontents*}

%create a lua script file
\begin{filecontents*}{luaFunctions.lua}

function createReplaceTable()    
    replaceTable = {}

    -- create a table with all ASCII chars
    -- the name and(!) the value of each table item is the ASCII char
    -- this is important if the char shouldn't be replaced
    -- the table has 128 items, each filled with the corresponding char
    for i = 1, 128, 1 do    
       replaceTable[string.char(i-1)] = string.char(i-1)
    end
end

function parseString(input)
    outputString = ""

    -- for each char in the given string we replace
    -- the char with the content of the table item
    -- because the table items have the same names as the chars
    -- we have access to the table item via the given char
    for i = 1, string.len(input) do
        char = input:sub(i, i)
        outputString = outputString..replaceTable[char]
    end

    tex.print(outputString)
end

function parseFile(fileName)
    -- open the file named by the argument (fail with a message if missing)
    local input = assert(io.open(fileName, 'r'))

    -- parse each line
    for line in input:lines() do
        parseString(line)
    end

    input:close()
end

function fillReplaceTable()
    -- here we fill/override the replacements for each ASCII char
    replaceTable["L"] = "\\textbf{\\large L}\\marginpar{\\tiny 'L'(\\stepcounter{counterForL}\\#\\thecounterForL)}"
    replaceTable["o"] = "\\underline{o}"
    replaceTable["e"] = ""
end

\end{filecontents*}    

% load the external Lua file to define the functions,
% without calling any of them yet
\directlua{dofile("luaFunctions.lua")}

%create and fill the tables
\directlua{createReplaceTable()}
\directlua{fillReplaceTable()}

% latex commands to execute the lua functions
\def\parseString#1{\directlua{parseString("#1")}}
\def\parseFile#1{\directlua{parseFile("#1")}}

%counter for the letter 'L'
\newcounter{counterForL}

\begin{document}
\parseString{%
 Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.%
}

\parseFile{lorem.txt}
\end{document}


How would you handle escaping of characters in Lua. Would you need to escape them in the text file or is there a mechanism to automate this. What happens when you read the character? — yannisl, Feb 26 '12 at 08:45
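
One possible way to handle the escaping on the TeX side, offered as a sketch rather than as part of the answer above: LuaTeX provides \luaescapestring, which escapes quotes, backslashes and line ends so that the text can sit safely inside a Lua string literal. The macro name \showlen is invented for this example; it only prints the length of the string that Lua receives.

```latex
% compile with LuaLaTeX
\documentclass{article}
% \showlen is a hypothetical helper for this sketch only
\newcommand\showlen[1]{%
  \directlua{tex.sprint(tostring(string.len(
    "\luaescapestring{\unexpanded{#1}}")))}%
}
\begin{document}
% The double quotes would end the Lua string early without escaping
\showlen{He said "hi"}
\end{document}
```

Text read from a file with io.open never passes through TeX on the way in, so it needs no such escaping; only the replacement strings written in Lua source need their backslashes doubled, as in fillReplaceTable above.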

score 3 · Answer 3 · answered Feb 25 '12 at 05:26

So far, I have managed to keep the parser quiet and save typing by looping through the alphabet with \@tfor and creating the macros.

I also tried PGF's \foreach, unsuccessfully, and would welcome some pointers in this respect.

% Letter definitions
\@tfor\next:=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890[].;:-=\  \do{%
  \def\command@factory#1{%
  \pgfparserdef{myparser}{initial}{\meaning #1}{\textcolor{purple}{#1}}%
  }
 \expandafter\command@factory\next
}

For the space, entering it as \ (a control space) appears to work, but it would be better if the parser worked without such manual marking.

Interestingly, if one adds \lipsum to the character list above, i.e.,

 abcdef\lipsum g...

it is parsed and expanded as a single token (its whole output is printed in purple) in the MWE below.

\documentclass{article}
\usepackage{lipsum}
\usepackage{pgf}
\usepackage{soul}
\usepgfmodule{parser}
\usepackage{pgffor}
\begin{document}
\makeatletter
\newcount\mycount

% Letter definitions
\@tfor\next:=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890[].;:-=\lipsum\space\ \do{%
  \def\command@factory#1{%
  \pgfparserdef{myparser}{initial}{\meaning #1}{\textcolor{purple}{#1}}%
  }
 \expandafter\command@factory\next
}

%\foreach \x in {a,...,z} {\command@factory\x}

\pgfparserdef{myparser}{initial}{the letter Z}{\advance\mycount by 1\relax Z}
\pgfparserdef{myparser}{initial}{the letter a}{\hl{a}}
\pgfparserdef{myparser}{initial}{the letter b}{} % do nothing
\pgfparserdef{myparser}{initial}{the letter c}{c} % print c unchanged
\pgfparserdef{myparser}{initial}{the letter G}{\textcolor{blue}{George}}
\pgfparserdef{myparser}{initial}{the character !}{\pgfparserswitch{final}}% done!

\pgfparserparse{myparser}ZZZZaabaabaZebabbdbQG012\ 345booopsbabaabggg[g][1=;].cccc\lipsum!! 

\end{document}

Without looking into it at all, I'm guessing that \foreach breaks because it executes each iteration in a group, and \pgfparserdef is local. Try setting \globaldefs=1 (this was suggested in an answer here that I can't find right now). — Ryan Reich, Feb 25 '12 at 05:36
Also, thanks for answering a question I didn't ask but was wondering about: how the parser handles macros. Apparently it does the logical thing and treats them as a single token with a specific meaning. Since you set the behavior of the parser on \lipsum itself, it will pick up that token as a single entity. (If you hadn't, I bet it would have raised an error). — Ryan Reich, Feb 25 '12 at 05:39
@Yiannis: I don't know the file that defines \pgfparserdef, but there is a spurious space in the definition. — Ahmed Musa, Feb 26 '12 at 06:25
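
A small sketch of the grouping behaviour mentioned in the comments, independent of the parser: \foreach runs each iteration inside a TeX group, so a local definition made in the loop body is gone afterwards, while a global one survives. The macro name \result is only for illustration.

```latex
\documentclass{article}
\usepackage{pgffor}
\begin{document}
\def\result{unset}

% Local definition: lost when each iteration's group ends.
\foreach \x in {a,b,c} {\def\result{\x}}
Local: \result   % typesets "Local: unset"

% Global definition with \x expanded: survives the loop.
\foreach \x in {a,b,c} {\xdef\result{\x}}
Global: \result  % typesets "Global: c"
\end{document}
```

There is probably a second obstacle in the commented-out \foreach line of the MWE: \command@factory\x receives the macro \x itself rather than the letter it holds, so \meaning would describe \x. Writing \expandafter\command@factory\x, as the \@tfor loop does with \next, passes the letter instead.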

score 2 · Answer 4 · answered Feb 24 '12 at 22:52

I have not used this before, but it looks as though you could change a copy of the file, at line 45, from

    \ifx\pgf@parser@action\relax%
      \PackageError{pgfparse}{Unexpected character
        '\meaning\pgf@parser@symbol' in parser '\pgf@parser' in state
        '\pgf@parser@state'}{}%
    \fi%

to

    \ifx\pgf@parser@action\relax%
      \PackageWarning{pgfparse}{Unexpected character
        '\meaning\pgf@parser@symbol' in parser '\pgf@parser' in state
        '\pgf@parser@state'}%
    \fi%

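With that change the unknown character is ignored with a warning; if you simply delete those lines, it is ignored silently. (Note that \PackageWarning takes only two arguments, unlike \PackageError, which takes a third help-text argument.)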

Then the silence package should be a better way to do that, shouldn't it? — Bruno Le Floch, Feb 24 '12 at 23:31

Ahmed Musa · Answer 5 · 2012-02-28T20:41:49.773

\documentclass{article}
\usepackage[dvipsnames]{xcolor}
\usepackage{soul}
\usepackage{ltxkeys}
\makeatletter

\new@def*\cptifcmdeqTF#1{\expandafter\ifcseqTF\cpt@car#1\cpt@quark\car@nil}
\cptswap{ }{\let\cptblankspace= }
\new@def*\cptendparse{\@gobble\cptendparse}
\newletcs\cptstopparse\cptendparse
\ltxkeys@declarekeys*[CPT]{parserparse}[cpt@parser@]{%
  cmd/id/currparser;
  cmd/state/initial;
}
% \cptparserdef{<parserid>}{<state>}{\meaning<token>}{<defn>}
\robust@def\cptparserdef#1#2#3#4{%
  \long\csn@edef{cpt@parser@#1@#2@#3}{\unexpanded{#4}}%
}
% \ParserParseDef{<keyval>}{<tokenlist>}{<defn>}
% Use '#1' in <defn> to access the current token of <tokenlist>.
\robust@def*\ParserParseDef{\cpt@teststopt\cpt@ParserParseDef{}}
\robust@def\cpt@ParserParseDef[#1]#2#3{%
  \let\ifcpt@parser@st\ifcpt@st
  \ltxkeys@launchkeys[CPT]{parserparse}{#1}%
  \ifcpt@parser@st\expandafter\expandafter\fi
  \cpttfor#2\dofor{%
    \cptifcmdeqTF{##1}\cptblankspace{%
      % Current system definition for space token:
      \cptparserdef{\cpt@parser@id}{\cpt@parser@state}
        {blank space\@space\@space}{\@space}%
    }{%
      \edef\parser@tempa{\cpttrimspace{##1}}%
      \edef\parser@tempa{\expandafter\meaning\parser@tempa}%
      \cptparserdef{\cpt@parser@id}{\cpt@parser@state}{\parser@tempa}{#3}%
    }%
  }%
}
% \ParserParseSelectDef{<keyval>}{<tokenlist>}
% <tokenlist> -> {<token>}{<defn>}
% You can use '#1' in <defn> to access the first token of the current
% pair of <tokenlist>.
\robust@def*\ParserParseSelectDef{\cpt@teststopt\cpt@ParserParseSelectDef{}}
\robust@def\cpt@ParserParseSelectDef[#1]#2{%
  \let\ifcpt@parser@st\ifcpt@st
  \ltxkeys@launchkeys[CPT]{parserparse}{#1}%
  \begingroup
  \@tempcnta\z@pt
  \def\parser@do##1{%
    \cptifcmdeqTF{##1}\parser@do{}{%
      \advance\@tempcnta\@ne\parser@do
    }%
  }%
  \ifcpt@parser@st\expandafter\expandafter\fi
  \parser@do#2\parser@do
  \ifodd\@tempcnta
    \cpt@err{User list items not pairwise balanced}
      {List items for \noexpand\ParserParseSelectDef
      must be even in number}%
  \fi
  \endgroup
  \def\parser@do##1##2{%
    \cptifcmdeqTF{##1}\parser@do{}{%
      \cptifcmdeqTF{##1}\cptblankspace{%
        % Current user definition for space token:
        \cptparserdef{\cpt@parser@id}{\cpt@parser@state}
          {blank space\@space\@space}{##2}%
      }{%
        \edef\parser@tempa{\cpttrimspace{##1}}%
        \edef\parser@tempa{\expandafter\meaning\parser@tempa}%
        % This trick is to enable '#1' to be used in <defn> to access the
        % first token of the current pair of <tokenlist>.
        \def\reserved@a####1{\@temptokena{##2}}%
        \reserved@a{##1}%
        \def\reserved@a####1{%
          \cptparserdef{\cpt@parser@id}{\cpt@parser@state}{\parser@tempa}{####1}%
        }%
        \expandafter\reserved@a\expandafter{\the\@temptokena}%
      }%
      \parser@do
    }%
  }%
  \ifcpt@parser@st\expandafter\expandafter\fi
  \parser@do#2\parser@do\parser@do
}
\robust@def*\cptparserparse{\cpt@testopt\cpt@parserparse@a{}}
\robust@def*\cpt@parserparse@a[#1]{%
  \ltxkeys@launchkeys[CPT]{parserparse}{#1}%
  % Gobble any space after ']'. If the user needs any space after ']',
  % he has to insert an explicit space token:
  \expandafter\cpt@parserparse@b\romannumeral-`\q\noexpand
}
\robust@def*\cpt@parserparse@b{%
  \futurelet\cpt@parsersymbol\cpt@parserparse@c
}
\robust@def*\cpt@parserparse@c{%
  \ifx\cpt@parsersymbol\cptendparse
    \let\cpt@parseraction\relax
    \def\cpt@parserparse@d{\let\cpt@parserignore=}%
  \else
    \def\cpt@parserparse@d{%
      \afterassignment\cpt@parserparse@b\let\cpt@parserignore= %
    }%
    \letcstocsn\cpt@parseraction{cpt@parser@\cpt@parser@id
      @\cpt@parser@state @\meaning\cpt@parsersymbol}%
    \ifdefTF\cpt@parseraction{}{%
      \cpt@err{Unexpected character '\meaning\cpt@parsersymbol'
        in parser '\cpt@parser@id' of state '\cpt@parser@state'}\@ehc
    }%
  \fi
  \cpt@parseraction
  \cpt@parserparse@d
}
\edef\cptparserchars{%
  abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ%
  1234567890\cpt@otherchars\noexpand\cptblankspace
}

% Examples:
\newcommand*\sometext[1][1]{%
  \cptdotimes{#1}{Here is some sample text that should fit in the given space.}
}
\cptrobustify\sometext

% Default system definitions; initialization is by the user:
\ParserParseDef*[id=myparser,state=initial]\cptparserchars
{\textcolor{orange}{#1}}

% Peculiar user definitions:
\ParserParseSelectDef[id=myparser,state=initial]{%
  {Z}{\advance\mycount\@ne\textcolor{red}{\fbox{#1}}}
  {a}{\hl{a}} {b}{} {c}{c}
  {G}{\textcolor{blue}{George}}
  {!}{This is exclamation mark.}
  {\cptblankspace}{\textcolor{green}{\texttt{@}}}
  {\sometext}{\textcolor{purple}{#1}}
}

\makeatother

\begin{document}
\newcount\mycount
\noindent
\cptparserparse[id=myparser,state=initial]
Z ABC XYZ aabaaba Z ebab QOG 012345 booops babaab egg [foo]/[1=;].cccc.
\sometext!\cptendparse

\par\medskip\noindent
Number of character \textcolor{red}{\texttt{\fbox{Z}}} in token list: \number\mycount.

\end{document}


To-do

What happens to brace-groups in the given token list? For example, what happens to {ZZZaababa} in the following?

\cptparserparse{myparser}Z {ZZZaababa}.cccc!

Should parsing start anew locally in {ZZZaababa}? In selective sanitization, each brace-group is treated according to specific instructions for brace-groups.

Also, what level of nesting of brace-groups should the instructions apply to? For example, how far should parsing go in

\cptparserparse{myparser}Z {{{{{x{ZZZaababa}}}}}}.cccc!

Thanks - this is a very good tip. However, most of my examples and responses use common constructs such as for-loops, as they make them more understandable. — yannisl, Feb 26 '12 at 07:43
It is more or less. I will check it out a bit later and give you a proper response, as I need to familiarize myself with catoptions to understand your code. Thanks for all the effort. — yannisl, Feb 26 '12 at 09:03
If the code is useful I can remove the dependence on catoptions package. — Ahmed Musa, Feb 26 '12 at 17:35
Thanks - I am trying to build a generalized parser and thought this would be easier than from scratch. Not to worry about further edits. I will post a more specific question over the next few days and ping you here. — yannisl, Feb 26 '12 at 18:10
