How to build a corpus of text files from the text and titles of MS Word documents in Python?

Question

I'm preprocessing a batch of MS Word documents, which I automatically converted from .doc to .docx, so that I can later use them to train an NLP model with entity recognition. I'm new to Python and to spaCy, though I have some programming experience in other languages. My biggest question right now, the one that makes me feel like "I don't know what to do or how to do it", is this: I have the documents in a folder. I need to parse the raw text and the titles (which are the file names of the documents, not the first line inside each document) to make a corpus that will later be used to train the NLP model.

Since I'm a newbie I have a lot to learn, so I've already done a lot of research on this topic. In the beginning it was a pain in the *** to convert all these .doc files to .docx files, but I've finally found a way to do that. Since I need to get the title and text from a bunch of documents, I assumed that I need to 'walk' over the documents in the folder using a for loop, which I did like this:

import os

path = '/path/to/folder'
for filename in os.listdir(path):
    if filename.endswith('.docx'):
        # Use a separate name so the folder path is not overwritten
        file_path = os.path.join(path, filename)

I've also tried what I found in this Stack Overflow question (using the python-docx module): extracting text from MS word files in python

But this gave me this TypeError: sequence item 0: expected str instance, bytes found
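
For reference, this is the message Python 3 raises when `str.join` is given `bytes` items instead of `str` items, which can happen when each paragraph is encoded (for example with `.encode('utf-8')`) before joining. The snippet below is only a minimal reproduction with made-up data, not the code from the linked question:

```python
# Hypothetical data: paragraphs that were encoded to bytes before joining
paragraphs = [b'first paragraph', b'second paragraph']

try:
    '\n'.join(paragraphs)  # a str separator cannot join bytes items
except TypeError as err:
    message = str(err)
    print(message)

# Decoding each item back to str first makes the join work
joined = '\n'.join(p.decode('utf-8') for p in paragraphs)
print(joined)
```

If the paragraphs are never encoded in the first place, no decoding step is needed.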

edit: The TypeError problem is solved. I tried 3 different ways to extract text from a Word document again, and this one gave me the best output (without errors):

```
import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    # Collect the text of every paragraph in the document
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

print(getText('test.docx'))
```

So now I (finally) know how to extract the text from a Word document properly. I still need to figure out how to do this for a whole folder, and what my next steps are in the process of making a corpus that will be used for NLP.
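
A rough sketch of what the folder-wide step might look like, assuming one plain-text file per document with the title (taken from the file name) on the first line. The names `title_from_filename`, `get_text` and `build_corpus`, the output layout, and the skipping of Word's `~$` lock files are all assumptions made for illustration, not an established recipe:

```python
import os


def title_from_filename(filename):
    """Return the file name without its extension, used here as the title."""
    return os.path.splitext(filename)[0]


def get_text(docx_path):
    """Extract paragraph text with python-docx (same approach as getText)."""
    import docx  # imported lazily so the other helpers work without python-docx
    doc = docx.Document(docx_path)
    return '\n'.join(para.text for para in doc.paragraphs)


def build_corpus(in_dir, out_dir, extract=get_text):
    """Write one .txt file per .docx file in in_dir; return the written paths."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for filename in sorted(os.listdir(in_dir)):
        # Skip non-Word files and the '~$...' lock files Word leaves behind
        if not filename.endswith('.docx') or filename.startswith('~$'):
            continue
        title = title_from_filename(filename)
        text = extract(os.path.join(in_dir, filename))
        out_path = os.path.join(out_dir, title + '.txt')
        # Assumed layout: title on the first line, blank line, then the body
        with open(out_path, 'w', encoding='utf-8') as f:
            f.write(title + '\n\n' + text)
        written.append(out_path)
    return written
```

It could then be called as `build_corpus('/path/to/folder', '/path/to/corpus')`, where the second path is a hypothetical output folder. Whether the title belongs in the same file as the text or in separate metadata depends on how the corpus is consumed later.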

By the way, I'm using PyCharm in an Ubuntu 18.04 virtual machine with Python 3.6.

(I've also explained my problem in a slightly different way in this post: https://python-forum.io/Thread-Data-extraction-from-multiple-MS-Word-file-s-in-python (see comment #9). I posted that yesterday, before trying out what I found in the Stack Overflow link.)

Could anyone give me an idea of a good way to extract titles from MS Word documents in order to make a corpus of files to use in spaCy?

Thank you very much for taking the time.

Please can you post the code you used to actually try to extract text from the files, and the full traceback for the TypeError that you got? At the moment you say you've tried something but we can't see exactly what. — Tom Dalton, Sep 20 '19 at 10:29
Have you checked this post? https://stackoverflow.com/questions/25228106/how-to-extract-text-from-an-existing-docx-file-using-python-docx — Tiago Duque, Sep 20 '19 at 10:56
@TomDalton So far I have tried 3 different ways to extract text from a Word document. I used the code which I found in Tiago Duque's link. I tried this code a few days ago as well and somehow it didn't work back then, but now it works. I've edited my question and added the code there. I don't get any errors in the output anymore, but I'm wondering how I should do this for a whole batch of documents and save the files to make an NLP corpus. — Jonas, Sep 20 '19 at 13:52

0 Answers