Extraction of data from a simple XML file

Question

I have an XML file with the following contents:

<?xml version="1.0" encoding="utf-8"?>
<job xmlns="http://www.sample.com/">programming</job>

I need a way to extract what is inside the <job..> </job> tags, "programming" in this case. This should be done at the Linux command prompt, using grep/sed/awk.

If your XML file contained this: Tom &amp; Jerry, would you want the result to have the XML escaping left alone (Tom &amp; Jerry), or would you want the escaping to be undone, as an XML parser would (Tom & Jerry)? If it's the latter, sorry, I don't know how to do that with Unix text tools. – Paul Clapham, Feb 09 '10 at 03:04
@Paul `s/&amp;/\&/g`, same for `&quot;` etc. Of course, it won't generalize to user-defined entities etc. – 13ren, Feb 10 '10 at 11:54
[https://stackoverflow.com/a/17333829/3291390](https://stackoverflow.com/a/17333829/3291390) — Stack Underflow, Jan 25 '20 at 03:41
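
To make the two outcomes from the comments above concrete, here is a minimal sketch in Python (standard library only). The input string and the extra `&quot;` mapping are assumptions chosen for illustration; it contrasts what a parser returns with text-level unescaping of an already extracted string.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import unescape

# Hypothetical input: the escaped form discussed in the comments above.
raw = "<job>Tom &amp; Jerry</job>"

# A parser undoes the escaping while reading the element text.
parsed_text = ET.fromstring(raw).text

# Text-level unescaping of an already extracted string. unescape() handles
# &amp;, &lt; and &gt; by default; other entities must be passed explicitly.
manual_text = unescape("Tom &amp; Jerry &quot;live&quot;", {"&quot;": '"'})

print(parsed_text)
print(manual_text)
```

Neither approach covers entities defined in a DTD, which is the limitation the second comment points out.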

amarillion · Accepted Answer · 2010-02-08T20:34:24.903

66

Do you really have to use only those tools? They're not designed for XML processing, and although it's possible to get something that works OK most of the time, it will fail on edge cases, like encoding, line breaks, etc.

I recommend xml_grep:

xml_grep 'job' jobs.xml --text_only

Which gives the output:

programming

On Ubuntu/Debian, xml_grep is in the xml-twig-tools package.
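
Where installing xml_grep is not an option, a similar extraction can be sketched with Python's standard library. This is not what xml_grep does internally, only a comparable parser-based approach; the function name `job_texts` is made up for the example.

```python
import xml.etree.ElementTree as ET

# The document from the question; bytes are used because the XML
# declaration names an encoding.
DOC = b'''<?xml version="1.0" encoding="utf-8"?>
<job xmlns="http://www.sample.com/">programming</job>
'''

NS = "http://www.sample.com/"

def job_texts(xml_bytes):
    """Return the text of every job element, including the root itself."""
    root = ET.fromstring(xml_bytes)
    # ElementTree spells a namespaced tag as {namespace-uri}localname.
    return [el.text for el in root.iter(f"{{{NS}}}job")]

print(job_texts(DOC))
```

Note that the namespace has to be spelled out here, because the job element in the question carries a default xmlns.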

edited Feb 08 '10 at 20:34

answered Feb 08 '10 at 14:31

amarillion

Tight installation instructions would be great for xml_grep – paul_h Apr 01 '17 at 11:41
sudo apt-get install xml-twig-tools – FredFury Jul 25 '17 at 08:35
"grep" is just a synonym for painless text searching. – dr0i Jul 02 '18 at 10:42

score 15 · Answer 2 · answered Feb 08 '10 at 14:49

 grep '<job' file_name | cut -f2 -d">"|cut -f1 -d"<"

Vijay

only that it fails if tags are on separate lines – ghostdog74 Feb 08 '10 at 23:53
There are about a dozen other ways that well-formed XML can make that fail. – Robert Rossney Feb 09 '10 at 03:10

score 11 · Answer 3 · answered Jul 02 '10 at 18:33

Using xmlstarlet:

echo '<job xmlns="http://www.sample.com/">programming</job>' | \
   xmlstarlet sel -N var="http://www.sample.com/" -t -m "//var:job" -v '.'
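
The prefix-to-URI binding that `-N var=...` sets up has a close analogue in Python's ElementTree, sketched below. The wrapper document with a `jobs` root is an assumption made for the example: `findall()` only searches below the element it is called on, so the job elements need a parent for the expression to match.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper document: several job elements under one root.
DOC = """<jobs xmlns="http://www.sample.com/">
  <job>programming</job>
  <job>designing</job>
</jobs>"""

# The prefix "var" is arbitrary, as in the xmlstarlet call; only the URI matters.
NSMAP = {"var": "http://www.sample.com/"}

root = ET.fromstring(DOC)
# findall() searches below the element it is called on, so the job
# elements must be descendants of the root for this expression to match.
jobs = [el.text for el in root.findall(".//var:job", NSMAP)]
print(jobs)
```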

lmxy

There is a significant number of different tools which use standard XPath notation to extract information from XML -- `xmlstarlet` is just one. Others include `xmllint`, `xpath`, etc. See http://stackoverflow.com/questions/15461737/how-to-execute-xpath-one-liners-from-shell – tripleee Jun 10 '15 at 07:28

score 9 · Answer 4 · answered Jun 10 '15 at 10:25

Please don't use line-based and regex-based parsing on XML. It is a bad idea. You can have semantically identical XML with different formatting, and regex-based and line-based parsing simply cannot cope with it.

Consider things like self-closing tags and variable line wrapping. These snippets all 'say' the same thing:

<root>
  <sometag val1="fish" val2="carrot" val3="narf"></sometag>
</root>


<root>
  <sometag
      val1="fish"
      val2="carrot"
      val3="narf"></sometag>
</root>

<root
><sometag
val1="fish"
val2="carrot"
val3="narf"
></sometag></root>

<root><sometag val1="fish" val2="carrot" val3="narf"/></root>

Hopefully this makes it clear why writing a regex-based or line-based parser is difficult. Fortunately, you don't need to. Most scripting languages have at least one XML parser available, and sometimes several.
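
One way to check that a parser really treats the four snippets as equivalent is sketched below in Python (standard library only). The helper name `sometag_attributes` is made up for the example; the four strings are the snippets above.

```python
import xml.etree.ElementTree as ET

# The four formatting variants from the answer, verbatim.
VARIANTS = [
    '<root>\n  <sometag val1="fish" val2="carrot" val3="narf"></sometag>\n</root>',
    '<root>\n  <sometag\n      val1="fish"\n      val2="carrot"\n      val3="narf"></sometag>\n</root>',
    '<root\n><sometag\nval1="fish"\nval2="carrot"\nval3="narf"\n></sometag></root>',
    '<root><sometag val1="fish" val2="carrot" val3="narf"/></root>',
]

def sometag_attributes(xml_text):
    """Parse one variant and return the attributes of its sometag element."""
    return dict(ET.fromstring(xml_text).find("sometag").attrib)

results = [sometag_attributes(v) for v in VARIANTS]
# True when every variant yields the same attribute dictionary.
print(all(r == results[0] for r in results))
```

A line-based tool would need a different pattern for each of these layouts, while the parser sees one and the same element.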

As a previous poster has alluded to, xml_grep is available. It is a tool built on the XML::Twig Perl library. It uses 'xpath expressions' to find things, and it differentiates between document structure, attributes and 'content'.

E.g.:

xml_grep 'job' jobs.xml --text_only

However, in the interest of making better answers, here are a couple of 'roll your own' examples based on your source data:

First way:

Use twig handlers that catch elements of a particular type and act on them. The advantage of doing it this way is that it parses the XML 'as you go' and lets you modify it in flight if you need to. This is particularly useful for discarding 'processed' XML when you're working with large files, using purge or flush:

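#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

# Slurp the whole input (a file named on the command line, or STDIN) into
# one string; a bare <> in argument position would pass one line per argument.
my $xml = do { local $/; <> };

XML::Twig->new(
    twig_handlers => {
        # Called once for each completed 'job' element.
        'job' => sub { print $_->text }
    }
)->parse($xml);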

This uses <> to take input (piped in, or from a file named on the command line, e.g. ./myscript somefile.xml) and processes it: for each job element, it extracts and prints any associated text. (You might want print $_->text, "\n" to insert a linefeed.)
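
The same 'handle each element as it completes, then discard it' pattern can be sketched in Python with `iterparse`. This is a rough analogue and not a port of XML::Twig; the function name `stream_job_texts` is made up, and `elem.clear()` only loosely corresponds to what purge does.

```python
import io
import xml.etree.ElementTree as ET

def stream_job_texts(source):
    """Yield the text of each job element as soon as its end tag is read."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Strip any "{namespace}" prefix so the match works with or without xmlns.
        if elem.tag.rsplit("}", 1)[-1] == "job":
            yield "".join(elem.itertext())
            # Drop the element's content to keep memory use flat on large
            # inputs (loosely comparable to purge in XML::Twig).
            elem.clear()

sample = io.BytesIO(b'<job xmlns="http://www.sample.com/">programming</job>')
print(list(stream_job_texts(sample)))
```

`source` can be a file name or any binary file object, so the function can also be pointed at a large file on disk.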

Because it's matching on 'job' elements, it'll also match on nested job elements:

<job>programming
    <job>anotherjob</job>
</job>

This matches twice, and part of the output is printed twice as well, because the text of the outer element includes the text of the inner one. You can, however, match on /job instead if you prefer. Usefully, handlers also let you, for example, print and delete an element, or copy and paste one, modifying the XML structure.
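
The double match on nested elements is not specific to XML::Twig. Here is a small Python sketch of the same effect, using the nested snippet above as input; the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

NESTED = """<job>programming
    <job>anotherjob</job>
</job>"""

root = ET.fromstring(NESTED)

# iter() visits the root and every descendant, so both job elements match.
# itertext() includes the text of nested elements, hence the repetition.
all_jobs = ["".join(el.itertext()).split() for el in root.iter("job")]

# Restricting the match to the document element gives a single hit.
top_only = [root.text.strip()] if root.tag == "job" else []

print(all_jobs)
print(top_only)
```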

Alternatively - parse first, and 'print' based on structure:

my $xml  = do { local $/; <> };    # slurp the whole input into one string
my $twig = XML::Twig->new( )->parse($xml);
print $twig -> root -> text;

As job is your root element, all we need do is print the text of it.

But we can be a bit more discerning, and look for job or /job and print that specifically instead:

my $xml  = do { local $/; <> };    # slurp the whole input into one string
my $twig = XML::Twig->new( )->parse($xml);
print $twig -> findnodes('/job',0)->text;

You can use XML::Twig's pretty_print option to reformat your XML too:

my $xml = do { local $/; <> };
XML::Twig->new( 'pretty_print' => 'indented_a' )->parse($xml) -> print;

There's a variety of output format options, but for simpler XML (like yours) most will look pretty similar.

score 8 · Answer 5 · answered Feb 08 '10 at 23:51

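Just use awk; there is no need for other external tools. The command below also works if the desired tags span multiple lines. Note that a multi-character RS is an extension to POSIX awk, supported by gawk.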

$ cat file
test
<job xmlns="http://www.sample.com/">programming</job>
<job xmlns="http://www.sample.com/">
programming</job>

$ awk -vRS="</job>" '{gsub(/.*<job.*>/,"");print}' file
programming

programming

ghostdog74

` job>` is valid, but your script doesn't recognize it. `` is a comment that needs to be ignored (and ` ]]>` is literal data), but your script doesn't know *that*. And then there are cases like having a DTD that defines new macros, such that `&foo;` expands to something locally-specified, and the simple cases like needing to convert `&amp;` to `&`. Trying to roll your own XML parsing (or worse, generation) leads to no end of corner cases and little details that need to be individually run down and fixed. – Charles Duffy Sep 25 '17 at 14:28
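
To illustrate the kind of input the comment above describes, here is a sketch in Python. The document string is made up for the example: it combines whitespace inside the tags, a comment containing a fake element, an entity, and a CDATA section, all of which a real parser resolves without any special handling.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: line breaks before ">", a comment containing a fake
# element, an escaped ampersand, and a CDATA section with literal markup.
TRICKY = """<job xmlns="http://www.sample.com/"
   ><!-- <job>not a real job</job> -->R&amp;D <![CDATA[<raw> & literal]]></job
>"""

root = ET.fromstring(TRICKY)

# The comment is ignored, the entity is decoded, and the CDATA content is
# kept as literal text; no child elements are created.
print(repr(root.text))
print(len(root))
```

The awk one-liner above would have to special-case each of these constructs separately.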

score 6 · Answer 6 · answered Feb 08 '16 at 16:13

Using the sed command:

Example:

$ cat file.xml
<note>
        <to>Tove</to>
                <from>Jani</from>
                <heading>Reminder</heading>
        <body>Don't forget me this weekend!</body>
</note>

$ cat file.xml | sed -ne '/<heading>/s#\s*<[^>]*>\s*##gp'
Reminder

Explanation:

cat file.xml | sed -ne '/<pattern_to_find>/s#\s*<[^>]*>\s*##gp'

-n - suppresses automatic printing of every line
-e - introduces the script

/<pattern_to_find>/ - selects lines that contain the specified pattern, e.g. <heading>

Next is the substitution part, s///p, which removes everything except the desired value. The / delimiter is replaced with # for better readability:

s#\s*<[^>]*>\s*##gp
\s* - matches surrounding whitespace, if any (same at the end)
<[^>]*> - matches one <xml_tag>; it stands in for a non-greedy regex, because <.*?> does not work in sed
g - applies the substitution to every match on the line, so the closing </xml_tag> tag is removed as well
p - prints the line after a successful substitution
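
The difference between `<[^>]*>` and a greedy `<.*>` can be checked quickly in Python, whose `re` module accepts the same pattern. This is only a sketch of how the regex behaves on the sample line; it inherits all the limitations of line-based matching discussed elsewhere in this thread.

```python
import re

# Same pattern as in the sed command: optional whitespace, one tag,
# optional whitespace.
TAG = re.compile(r"\s*<[^>]*>\s*")

line = "        <heading>Reminder</heading>"

# [^>]* cannot run past the first ">", so each match covers exactly one tag.
stripped = TAG.sub("", line)

# A greedy .* would instead span from the first "<" to the last ">".
greedy = re.sub(r"\s*<.*>\s*", "", line)

print(repr(stripped))
print(repr(greedy))
```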

score 5 · Answer 7 · answered Feb 10 '10 at 11:45

Assuming same line, input from stdin:

sed -ne '/<\/job>/ { s/<[^>]*>\(.*\)<\/job>/\1/; p }'

Notes: -n stops sed from printing every line automatically; -e means the script is given on the command line (as opposed to a script file); /<\/job>/ acts like a grep; s strips the opening tag with its attributes and the end tag; ; starts a new statement; p prints; {} makes the grep apply to both statements as one.

score 0 · Answer 8 · answered Feb 08 '10 at 14:29

How about:

cat a.xml | grep '<job' | cut -d '>' -f 2 | cut -d '<' -f 1

codaddict

UUOC. `grep ' – ghostdog74 Feb 08 '10 at 23:53
@ghost *but but but, I think it's cleaner / nicer / not that much of a waste / my privilege to waste processes!* http://partmaps.org/era/unix/award.html#cat (actually, I think it's easier to edit the filename, because nearer the start) – 13ren Feb 10 '10 at 12:13
If you use `< a.xml | grep ...` you get it even closer to the start. – Thor Aug 23 '12 at 13:11

score 0 · Answer 9 · answered Dec 06 '15 at 13:00

A bit late to the show.

xmlcutty cuts out nodes from XML:

$ cat file.xml
<?xml version="1.0" encoding="utf-8"?>
<job xmlns="http://www.sample.com/">programming</job>
<job xmlns="http://www.sample.com/">designing</job>
<job xmlns="http://www.sample.com/">managing</job>
<job xmlns="http://www.sample.com/">teaching</job>

The path argument names the path to the element you want to cut out. In this case, since we are not interested in the tags at all, we rename the tag to \n, so we get a nice list:

$ xmlcutty -path /job -rename '\n' file.xml
programming
designing
managing
teaching

Note that the XML was not well-formed to begin with (there is no single root element). xmlcutty can work with slightly broken XML, too.
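
For comparison, a strict parser rejects such input outright. The Python sketch below shows the failure and one possible workaround, wrapping the fragment in a made-up `wrapper` element; the XML declaration is left out of the fragment because it may only appear at the very start of a document.

```python
import xml.etree.ElementTree as ET

# Fragment with several top-level elements, as in file.xml (declaration omitted).
FRAGMENT = """<job xmlns="http://www.sample.com/">programming</job>
<job xmlns="http://www.sample.com/">designing</job>"""

def parses(xml_text):
    """Return True if a strict parser accepts the text as one document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Workaround sketch: wrap the fragment in a made-up root element.
wrapped = ET.fromstring(f"<wrapper>{FRAGMENT}</wrapper>")
texts = [el.text for el in wrapped]

print(parses(FRAGMENT))
print(texts)
```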

score 0 · Answer 10 · answered Jun 04 '20 at 06:45

yourxmlfile.xml

<item> 
  <title>15:54:57 - George:</title>
  <description>Diane DeConn? You saw Diane DeConn!</description> 
</item> 
<item> 
  <title>15:55:17 - Jerry:</title> 
  <description>Something huh?</description>
</item>

grep 'title' yourxmlfile.xml

  <title>15:54:57 - George:</title>
  <title>15:55:17 - Jerry:</title>

grep 'title' yourxmlfile.xml | awk -F">" '{print $2}'

  15:54:57 - George:</title
  15:55:17 - Jerry:</title

grep 'title' yourxmlfile.xml | awk -F">" '{print $2}' | awk -F"<" '{print $1}'

  15:54:57 - George:
  15:55:17 - Jerry:
