Parsing XML with illegal special characters (&)

Question

I have thousands of XML files like the following:

<names>
    <Id>1518845</Id>
    <Name>Confessions of a Thug (Paperback)</Name>
    <Authors>Philip Meadows Taylor</Authors>
    <Publisher>Rupa & Co</Publisher>
    <CountsOfReview>2.0</CountsOfReview>
</names>

I've tried the following code to parse them:

from lxml import etree

root = etree.parse("xm_file.xml")

import xml.etree.ElementTree as ET

tree = ET.parse("xm_file.xml")

and

parser = ET.XMLParser(encoding="utf-8")
tree = ET.parse("xm_file.xml", parser=parser)

and all of them lead to one of these errors:

ParseError: not well-formed (invalid token): line 10, column 18

XMLSyntaxError: xmlParseEntityRef: no name, line 10, column 19

I have searched and tried many things to find a solution that works for all the files, but in vain.

NOTE: this didn't help me: How to parse invalid (bad / not well-formed) XML?

Another situation is:

<names>
    <Id>1481744</Id>
    <Name>Lettres de René-Édouard Claparède <1832-1871>.: Choisies et annotées</Name>
    <Authors>René-Édouard Claparède</Authors>
    <ISBN>3796505635</ISBN>
    <Rating>2.0</Rating>
    <PublishYear>1971</PublishYear>
    <PublishMonth>31</PublishMonth>
    <PublishDay>12</PublishDay>
</names>

When parsing, it just handles the XML as if it were:

<names>
    <Id>1481744</Id>
    <Name>Lettres de René-Édouard Claparède</Name>
</names>

and the other info doesn't appear.

Maybe this helps? https://stackoverflow.com/questions/7604436/xmlparseentityref-no-name-warnings-while-loading-xml-into-a-php-file — Jan, Apr 30 '21 at 19:53
It’s not XML, Jim, at least not as we know it. Your question isn’t titled correctly - what you’re trying to parse *isn’t XML* — DisappointedByUnaccountableMod, Apr 30 '21 at 20:48
No, it's ***not*** XML. @barny is right. You did not understand the duplicate link the last time you asked this exact question. You cannot expect an XML parser, which is written based on following the rules that *define* XML, to succeed with arbitrary transgressions against those rules. — kjhughes, Apr 30 '21 at 21:24
You don't have thousands of XML files. You have thousands of non-XML files. In fact, you have a heap of junk. — Michael Kay, Apr 30 '21 at 21:28

score 1 · Accepted Answer · answered Apr 30 '21 at 20:03

You could replace the & beforehand:

import xml.etree.ElementTree as ET

data = """

<names>
    <Id>1518845</Id>
    <Name>Confessions of a Thug (Paperback)</Name>
    <Authors>Philip Meadows Taylor</Authors>
    <Publisher>Rupa & Co</Publisher>
    <CountsOfReview>2.0</CountsOfReview>
</names>

"""

data = data.replace('&', '&amp;')
tree = ET.ElementTree(ET.fromstring(data))

for publisher in tree.findall("Publisher"):
    print(publisher.text)

Which yields

Rupa & Co

A possible way would be to load the files in question first, replace the &, and feed the result to xml.etree.ElementTree, as in:

with open("some_cool_file", encoding="utf-8") as fp:
    content = fp.read()

# Escape the ampersands, then parse the repaired text.
content = content.replace('&', '&amp;')
xml = ET.ElementTree(ET.fromstring(content))
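
A blanket replace also rewrites ampersands that already belong to a valid reference such as &amp;, and it does nothing for a stray < in text. The sketch below is one possible refinement, not a general repair tool: it escapes only an & that does not already start an entity or character reference, and a < that is not followed by a character that can begin a tag. The names BARE_AMP, STRAY_LT and escape_stray_markup are made up for this illustration, and the rule for < is a heuristic that assumes stray brackets are followed by something like a digit or a space, as in the second sample of the question. Named entities other than the five predefined ones (for example &nbsp;) are left alone and will still be rejected by ElementTree.

```python
import re
import xml.etree.ElementTree as ET

# "&" that does NOT already start an entity or character reference
# such as &amp;, &#38; or &#x26;.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")

# Heuristic: "<" that is not followed by a character which can begin a
# tag, closing tag, comment, CDATA section or processing instruction.
STRAY_LT = re.compile(r"<(?![A-Za-z_/!?])")


def escape_stray_markup(text):
    """Escape bare '&' and stray '<' so the text has a chance to parse."""
    text = BARE_AMP.sub("&amp;", text)
    return STRAY_LT.sub("&lt;", text)


# Hypothetical sample combining both problems from the question.
sample = """<names>
    <Name>Lettres de René-Édouard Claparède <1832-1871>.: Choisies et annotées</Name>
    <Publisher>Rupa & Co</Publisher>
    <Note>AT&amp;T</Note>
</names>"""

root = ET.fromstring(escape_stray_markup(sample))
for child in root:
    print(child.tag, "->", child.text)
```

An existing &amp; is left untouched, so it is not turned into &amp;amp;. Text that happens to look like a reference (for example "R&D;") or like a tag (for example "<b") is still misread, so the results are worth spot-checking before relying on them for thousands of files.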

Jan

Please see my edit on question – Ashraf Khaled Apr 30 '21 at 20:30
Good try, but the OP added another test for you. – DisappointedByUnaccountableMod Apr 30 '21 at 20:48
@Jan, replacing `&` like that is dangerous because there could be actual XML entities in the data that would then be mangled. Yes, you could use a more complicated regex to catch many of them. Anyway, this general topic of how to parse bad "XML" has already been much more thoroughly addressed in other Q/A. – kjhughes Apr 30 '21 at 21:22
