
I have thousands of XML files like the following:

<names>
    <Id>1518845</Id>
    <Name>Confessions of a Thug (Paperback)</Name>
    <Authors>Philip Meadows Taylor</Authors>
    <Publisher>Rupa & Co</Publisher>
    <CountsOfReview>2.0</CountsOfReview>
</names>

I have tried the following code to parse them:

from lxml import etree

root = etree.parse("xm_file.xml")

and

import xml.etree.ElementTree as ET

tree = ET.parse("xm_file.xml")

and

parser = ET.XMLParser(encoding="utf-8")
tree = ET.parse("xm_file.xml", parser=parser)

All of them lead to one of these errors:

ParseError: not well-formed (invalid token): line 10, column 18
XMLSyntaxError: xmlParseEntityRef: no name, line 10, column 19

I have searched and tried many approaches to make this work for all files, but in vain.

NOTE: this did not help me: How to parse invalid (bad / not well-formed) XML?

Another situation is:

<names>
    <Id>1481744</Id>
    <Name>Lettres de René-Édouard Claparède <1832-1871>.: Choisies et annotées</Name>
    <Authors>René-Édouard Claparède</Authors>
    <ISBN>3796505635</ISBN>
    <Rating>2.0</Rating>
    <PublishYear>1971</PublishYear>
    <PublishMonth>31</PublishMonth>
    <PublishDay>12</PublishDay>
</names>

When parsing, it handles the XML as if it were:

<names>
    <Id>1481744</Id>
    <Name>Lettres de René-Édouard Claparède</Name>
</names>

and the other information does not appear.

1 Answer

You could replace the & beforehand:

import xml.etree.ElementTree as ET

data = """

<names>
    <Id>1518845</Id>
    <Name>Confessions of a Thug (Paperback)</Name>
    <Authors>Philip Meadows Taylor</Authors>
    <Publisher>Rupa & Co</Publisher>
    <CountsOfReview>2.0</CountsOfReview>
</names>

"""

data = data.replace('&', '&amp;')
tree = ET.ElementTree(ET.fromstring(data))

for publisher in tree.findall("Publisher"):
    print(publisher.text)

This yields:

Rupa & Co
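
One caveat with a plain str.replace: if some of the files already contain correctly escaped entities such as &amp;amp; or &amp;lt;, those would be escaped a second time. A minimal sketch that only touches bare ampersands, assuming a regular expression is acceptable here (the pattern and the helper name escape_bare_ampersands are my own, not part of any library):

```python
import re
import xml.etree.ElementTree as ET

# Match "&" unless it already starts an entity such as &amp;, &#38; or &#x26;
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(text):
    """Hypothetical helper: escape only ampersands that are not part of an entity."""
    return BARE_AMP.sub("&amp;", text)


# One raw ampersand and one that is already escaped
sample = "<names><Publisher>Rupa & Co</Publisher><Name>A &amp; B</Name></names>"
root = ET.fromstring(escape_bare_ampersands(sample))

print(root.findtext("Publisher"))
print(root.findtext("Name"))
```

Text that merely looks like an entity (for example "R&D;") is left alone by this pattern and would still make the parser fail, so treat it as a starting point rather than a complete fix.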

One possible approach is to read each file first, replace the &, and then feed the result to xml.etree.ElementTree, as in:

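with open("some_cool_file", encoding="utf-8") as fp:  # encoding assumed from the question
    content = fp.read()

# escape the raw ampersands, then parse the cleaned string
content = content.replace('&', '&amp;')
xml = ET.ElementTree(ET.fromstring(content))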
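
The second sample in the question (a literal <1832-1871> inside <Name>) is not covered by replacing & alone. Here is a rough sketch, assuming every value sits on a single line as <Tag>text</Tag> with no attributes and no nested children, as in the samples shown. The names sanitize_line and parse_dirty_xml are hypothetical, and the sample combines lines from both files in the question:

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

# One leaf element per line: <Tag>text</Tag>, no attributes, no nested tags (assumption)
LEAF = re.compile(r"^(\s*)<(\w+)>(.*)</\2>\s*$")


def sanitize_line(line):
    """Hypothetical helper: escape &, < and > inside a single-line leaf element."""
    match = LEAF.match(line)
    if match is None:
        return line  # container tags such as <names> are left untouched
    indent, tag, text = match.groups()
    # unescape first so entities that are already escaped are not escaped twice
    return f"{indent}<{tag}>{escape(unescape(text))}</{tag}>"


def parse_dirty_xml(content):
    """Hypothetical helper: sanitize line by line, then parse the result."""
    cleaned = "\n".join(sanitize_line(line) for line in content.splitlines())
    return ET.fromstring(cleaned)


sample = """<names>
    <Id>1481744</Id>
    <Name>Lettres de René-Édouard Claparède <1832-1871>.: Choisies et annotées</Name>
    <Publisher>Rupa & Co</Publisher>
    <Rating>2.0</Rating>
</names>"""

root = parse_dirty_xml(sample)
print(root.findtext("Name"))    # full title, including <1832-1871>
print(root.findtext("Rating"))  # elements after <Name> are no longer lost
```

If the real files contain multi-line values, attributes, or nested elements, this line-based approach will not apply as written and the pattern would need to be adapted.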
Jan