Following an answer to "Parsing XML with namespace in Python via 'ElementTree'",
I tried several ways to parse the XML below and get the fragment elements:
import xml.etree.ElementTree as etree
tree = etree.ElementTree(file='BaileyInternetMarketing.epub.annot')
root = tree.getroot()
namespaces = {'adobe': 'http://ns.adobe.com/digitaleditions/annotations'} # add more as needed
fragments = root.findall('adobe:fragment', namespaces)
#fragments = root.findall('{http://ns.adobe.com/digitaleditions/annotations}fragment')
#fragments = root.findall('fragment')
for fr in fragments:
    print(fr)
Here is the XML header and one record:
<annotationSet xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns="http://ns.adobe.com/digitaleditions/annotations">
<publication>
<dc:identifier>urn:uuid:c6726f70-43eb-4d0a-80a9-01c381fbf624</dc:identifier>
<dc:title>Internet Marketing</dc:title>
<dc:creator>Matt Bailey</dc:creator>
<dc:publisher></dc:publisher>
</publication>
<annotation>
<dc:identifier>urn:uuid:cb5b14c1-51fc-4398-8f9e-06a696e2c238</dc:identifier>
<dc:date>2018-01-07T14:31:48Z</dc:date>
<target>
<fragment start="OEBPS/9781118087220c03.xhtml#point(/1/4/2/142/1:20)" end="OEBPS/9781118087220c03.xhtml#point(/1/4/2/142/1:45)" progress="0">
<text>hamster-wheel analytics (</text>
</fragment>
</target>
<content>
<dc:date>2018-01-07T14:31:48Z</dc:date>
</content>
</annotation>
And the Python console output:
>>> exec(open('parse.py').read())
>>> print(tree)
<xml.etree.ElementTree.ElementTree object at 0x000001F51E8687F0>
>>> print(root)
<Element '{http://ns.adobe.com/digitaleditions/annotations}annotationSet' at 0x000001F51EBCDB88>
>>> print(fragments)
[]
None of the three methods I tried populated the fragments list.
Is it possible to use the ElementTree library to parse this XML file with namespaces? Is it a problem that ns.adobe.com does not exist?
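On the second question: as far as I understand, a namespace URI is only an identifying string, and ElementTree compares it as text without trying to download anything. A minimal sketch, assuming a made-up URI (`urn:example:does-not-exist`) and hypothetical element names `root` and `item` that are not part of my file:

```python
import xml.etree.ElementTree as etree

# Hypothetical document: the namespace URI is invented and points nowhere.
xml_text = '<root xmlns="urn:example:does-not-exist"><item>a</item></root>'
doc = etree.fromstring(xml_text)

# The URI is used purely as a string that qualifies the tag name.
ns = {'x': 'urn:example:does-not-exist'}
items = doc.findall('x:item', ns)

print(doc.tag)
print([item.text for item in items])
```

If this sketch is right, the fact that ns.adobe.com does not resolve should not matter for parsing.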
EDIT: I created a minimal, self-contained example that reproduces the problem, following
"Parse XML with using ElementTree library":
import xml.etree.ElementTree as etree
xmldata='''
<annotationSet xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns="http://ns.adobe.com/digitaleditions/annotations">
<publication>
<dc:identifier>urn:uuid:c6726f70-43eb-4d0a-80a9-01c381fbf624</dc:identifier>
<dc:title>Internet Marketing</dc:title>
<dc:creator>Matt Bailey</dc:creator>
<dc:publisher></dc:publisher>
</publication>
<annotation>
<dc:identifier>urn:uuid:cb5b14c1-51fc-4398-8f9e-06a696e2c238</dc:identifier>
<dc:date>2018-01-07T14:31:48Z</dc:date>
<target>
<fragment start="OEBPS/9781118087220c03.xhtml#point(/1/4/2/142/1:20)" end="OEBPS/9781118087220c03.xhtml#point(/1/4/2/142/1:45)" progress="0">
<text>hamster-wheel analytics (</text>
</fragment>
</target>
<content>
<dc:date>2018-01-07T14:31:48Z</dc:date>
</content>
</annotation>
</annotationSet>
'''
# fromstring() returns the root Element directly, so there is no getroot() call
root = etree.fromstring(xmldata)
#namespaces = {'adobe': 'http://ns.adobe.com/digitaleditions/annotations'} # add more as needed
#fragments = root.findall('adobe:fragment', namespaces)
fragments = root.findall('{http://ns.adobe.com/digitaleditions/annotations}fragment')
#fragments = root.findall('fragment')
for fr in fragments:
    print(fr)
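One thing worth checking in the example above: `findall` with a bare tag name matches only direct children of the element it is called on, and `fragment` sits three levels below `annotationSet`. A minimal sketch, assuming that is the cause, using a trimmed copy of the record and the `.//` prefix to search all descendants (the prefix name `adobe` is arbitrary):

```python
import xml.etree.ElementTree as etree

# Trimmed copy of the record above, kept only deep enough to show the nesting.
xmldata = '''
<annotationSet xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns="http://ns.adobe.com/digitaleditions/annotations">
<annotation>
<target>
<fragment start="OEBPS/9781118087220c03.xhtml#point(/1/4/2/142/1:20)" end="OEBPS/9781118087220c03.xhtml#point(/1/4/2/142/1:45)" progress="0">
<text>hamster-wheel analytics (</text>
</fragment>
</target>
</annotation>
</annotationSet>
'''
root = etree.fromstring(xmldata)
namespaces = {'adobe': 'http://ns.adobe.com/digitaleditions/annotations'}

# A bare (prefixed) tag name searches direct children of root only.
direct = root.findall('adobe:fragment', namespaces)

# './/' searches every level below root.
nested = root.findall('.//adobe:fragment', namespaces)

print(len(direct), len(nested))
for fr in nested:
    # Child lookups need the namespace as well.
    print(fr.get('progress'), fr.find('adobe:text', namespaces).text)
```

The fully qualified form `'.//{http://ns.adobe.com/digitaleditions/annotations}fragment'` should behave the same way without the prefix map.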