Following an answer to "Parsing XML with namespace in Python via 'ElementTree'", I tried several ways to parse the XML below and extract its fragment elements:

import xml.etree.ElementTree as etree

tree = etree.ElementTree(file='BaileyInternetMarketing.epub.annot')

root = tree.getroot()
namespaces = {'adobe': 'http://ns.adobe.com/digitaleditions/annotations'} # add more as needed

fragments = root.findall('adobe:fragment', namespaces)
#fragments = root.findall('{http://ns.adobe.com/digitaleditions/annotations}fragment')
#fragments = root.findall('fragment')

for fr in fragments:
    print(fr)

Here is the XML header and one record:

<annotationSet xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns="http://ns.adobe.com/digitaleditions/annotations">
  <publication>
    <dc:identifier>urn:uuid:c6726f70-43eb-4d0a-80a9-01c381fbf624</dc:identifier>
    <dc:title>Internet Marketing</dc:title>
    <dc:creator>Matt Bailey</dc:creator>
    <dc:publisher></dc:publisher>
  </publication>
  <annotation>
    <dc:identifier>urn:uuid:cb5b14c1-51fc-4398-8f9e-06a696e2c238</dc:identifier>
    <dc:date>2018-01-07T14:31:48Z</dc:date>
    <target>
      <fragment start="OEBPS/9781118087220c03.xhtml#point(/1/4/2/142/1:20)" end="OEBPS/9781118087220c03.xhtml#point(/1/4/2/142/1:45)" progress="0">
      <text>hamster-wheel analytics (</text>
      </fragment>
    </target>
    <content>
      <dc:date>2018-01-07T14:31:48Z</dc:date>
    </content>
  </annotation>

And the Python console output:

>>> exec(open('parse.py').read())
>>> print(tree)
<xml.etree.ElementTree.ElementTree object at 0x000001F51E8687F0>
>>> print(root)
<Element '{http://ns.adobe.com/digitaleditions/annotations}annotationSet' at 0x000001F51EBCDB88>
>>> print(fragments)
[]

None of the three methods I tried populated the fragments list.

Is it possible to use the ElementTree library to parse this XML file with namespaces? Is it a problem that ns.adobe.com does not exist?

EDIT: I created a minimal, self-contained example that reproduces the problem, following "Parse XML with using ElementTree library":

import xml.etree.ElementTree as etree

xmldata='''
<annotationSet xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns="http://ns.adobe.com/digitaleditions/annotations">
  <publication>
    <dc:identifier>urn:uuid:c6726f70-43eb-4d0a-80a9-01c381fbf624</dc:identifier>
    <dc:title>Internet Marketing</dc:title>
    <dc:creator>Matt Bailey</dc:creator>
    <dc:publisher></dc:publisher>
  </publication>
  <annotation>
    <dc:identifier>urn:uuid:cb5b14c1-51fc-4398-8f9e-06a696e2c238</dc:identifier>
    <dc:date>2018-01-07T14:31:48Z</dc:date>
    <target>
      <fragment start="OEBPS/9781118087220c03.xhtml#point(/1/4/2/142/1:20)" end="OEBPS/9781118087220c03.xhtml#point(/1/4/2/142/1:45)" progress="0">
      <text>hamster-wheel analytics (</text>
      </fragment>
    </target>
    <content>
      <dc:date>2018-01-07T14:31:48Z</dc:date>
    </content>
  </annotation>
</annotationSet>
'''

tree = etree.fromstring(xmldata)

#root = tree.getroot()
#namespaces = {'adobe': 'http://ns.adobe.com/digitaleditions/annotations'} # add more as needed

#fragments = root.findall('adobe:fragment', namespaces)
fragments = tree.findall('{http://ns.adobe.com/digitaleditions/annotations}fragment')
#fragments = root.findall('fragment')

for fr in fragments:
    print(fr)
Gergely
    By the way, it indeed is useful to add namespaces. The caveat is that you must *pass that `namespace`* into `findall`: `fragments = tree.findall('.//adobe:fragment', namespaces)` – Jongware Jun 21 '18 at 12:04
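
To illustrate the comment above, here is a minimal sketch that assumes a trimmed copy of the question's XML. The names `XML_DATA`, `ADOBE_NS`, `direct_children`, `by_prefix`, and `by_clark` are made up for this illustration. A path without a leading `.//` only matches direct children of the element `findall` is called on, which would explain the empty list, since `fragment` sits under `annotation/target`. The namespace URI is only an identifier string that ElementTree never fetches, so it should not matter whether ns.adobe.com exists.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the annotation data from the question (illustration only).
XML_DATA = """
<annotationSet xmlns:dc="http://purl.org/dc/elements/1.1/"
               xmlns="http://ns.adobe.com/digitaleditions/annotations">
  <annotation>
    <dc:identifier>urn:uuid:cb5b14c1-51fc-4398-8f9e-06a696e2c238</dc:identifier>
    <target>
      <fragment start="OEBPS/9781118087220c03.xhtml#point(/1/4/2/142/1:20)"
                end="OEBPS/9781118087220c03.xhtml#point(/1/4/2/142/1:45)"
                progress="0">
        <text>hamster-wheel analytics (</text>
      </fragment>
    </target>
  </annotation>
</annotationSet>
"""

# Hypothetical names: the prefix 'adobe' is arbitrary, only the URI must match.
ADOBE_NS = "http://ns.adobe.com/digitaleditions/annotations"
namespaces = {"adobe": ADOBE_NS}

# fromstring() returns the root Element directly, so no getroot() is needed.
root = ET.fromstring(XML_DATA)

# A bare tag matches direct children of root only, so this list is empty.
direct_children = root.findall("adobe:fragment", namespaces)

# './/' searches all descendants; the prefix map resolves 'adobe:'.
by_prefix = root.findall(".//adobe:fragment", namespaces)

# The same search in {uri}tag notation, which needs no prefix map.
by_clark = root.findall(".//{%s}fragment" % ADOBE_NS)

for fragment in by_prefix:
    # Child elements inherit the default namespace, so 'text' needs the prefix too.
    text = fragment.find("adobe:text", namespaces)
    print(fragment.get("start"), "->", text.text)
```

With this data, `direct_children` stays empty while both descendant searches find the single `fragment` element. The same pattern should carry over to a tree loaded from a file with `ET.parse(...).getroot()`.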
