I have a docx file that I would like to load into Doccano in the following format:

{"text": "EU rejects German call to boycott British lamb."}
{"text": "Peter Blackburn"}
...
{"text": "President Obama"}

My goal is for the "text" values to be approximately the same length and to end cleanly (with a period or a semicolon).
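To make the target format concrete, here is a minimal sketch of how I would write it, assuming Doccano accepts one JSON object per line exactly as shown above. The helper name `to_jsonl` is my own, not part of Doccano.

```python
import json


def to_jsonl(chunks):
    """Serialize each chunk as one JSON object per line (JSON Lines)."""
    # ensure_ascii=False keeps accented characters readable in the output.
    return "\n".join(json.dumps({"text": chunk}, ensure_ascii=False)
                     for chunk in chunks)


print(to_jsonl(["EU rejects German call to boycott British lamb.",
                "Peter Blackburn"]))
```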

I thought about using this gist: https://gist.github.com/etienned/7539105 to read the docx file and get its paragraphs.

This function:

"""
Module that extracts text from an MS XML Word document (.docx).
(Inspired by python-docx <https://github.com/mikemaccana/python-docx>)
"""

# cElementTree and Element.getiterator() were removed in Python 3.9,
# so use ElementTree and Element.iter() instead.
from xml.etree.ElementTree import XML
import zipfile

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'


def get_docx_text(path):
    """
    Take the path of a docx file as argument, return its text as a string.
    """
    # A .docx file is a zip archive; the body lives in word/document.xml.
    with zipfile.ZipFile(path) as document:
        xml_content = document.read('word/document.xml')
    tree = XML(xml_content)

    paragraphs = []
    for paragraph in tree.iter(PARA):
        # Join the text runs of the paragraph, skipping empty ones.
        texts = [node.text
                 for node in paragraph.iter(TEXT)
                 if node.text]
        if texts:
            paragraphs.append(''.join(texts))

    # Paragraphs are separated by a blank line.
    return '\n\n'.join(paragraphs)

Next, I would concatenate all the paragraphs into one big text and cut it so that it is distributed evenly across the "text" values, but I am not sure whether this is a good method.

Does anyone know how to do this?
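For reference, this is a rough sketch of the cutting step I have in mind. It is only an illustration under my own assumptions: the function name `split_into_chunks` and the `target_len` parameter are hypothetical, and the sentence splitting is a naive regex that will also break after abbreviations such as "Mr.". A paragraph without closing punctuation (a bare name, for example) gets merged with the text that follows it.

```python
import re


def split_into_chunks(text, target_len=200):
    """Group sentences into chunks of roughly target_len characters.

    A chunk is only closed at a sentence boundary, so each chunk ends with
    '.', ';', '!' or '?' (except possibly the last one, if the text itself
    does not end with punctuation).
    """
    # Split on whitespace that follows ., ;, ! or ?; the punctuation is kept.
    sentences = [s.strip()
                 for s in re.split(r'(?<=[.;!?])\s+', text)
                 if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        # Close the current chunk if adding this sentence would overshoot.
        if current and len(candidate) > target_len:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

The idea would be to call it on the string returned by `get_docx_text` and write each chunk as one `{"text": ...}` line. A single sentence longer than `target_len` stays whole, so the lengths are only approximately equal.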

jos97