2

How can I read from a word (docx) file in python. I can read from a txt file but can not do the same for MS Office word document. Any suggestions?

Sina Cengiz
  • 140
  • 1
  • 8
  • 1
    Possible duplicate of [How to extract text from an existing docx file using python-docx](https://stackoverflow.com/questions/25228106/how-to-extract-text-from-an-existing-docx-file-using-python-docx) – Brian Apr 10 '20 at 20:15
  • Does this answer your question? [Read Docx files via python](https://stackoverflow.com/questions/29309085/read-docx-files-via-python) – Elisabeth Strunk Apr 10 '20 at 20:21

2 Answers2

6

There are a couple of packages that let you do this. Check

  1. python-docx.

  2. docx2txt (note that it does not seem to work with .doc). As per this, it seems to get more info than python-docx. From original documentation:

import docx2txt

# extract text
text = docx2txt.process("file.docx")

# extract text and write images in /tmp/img_dir
text = docx2txt.process("file.docx", "/tmp/img_dir") 
  1. textract (which works via docx2txt).

  2. Since .docx files are simply .zip files with a changed extension, this shows how to access the contents. This is a significant difference with .doc files, and the reason why some (or all) of the above do not work with .docs. In this case, you would likely have to convert doc -> docx first. antiword is an option.

2

See this library that allows for reading docx files https://python-docx.readthedocs.io/en/latest/

You should use the python-docx library available on PyPi. Then you can use the following

doc = docx.Document('myfile.docx')
allText = []
for docpara in doc.paragraphs:
    allText.append(docpara.text)
Sri
  • 2,128
  • 2
  • 13
  • 21
  • Hello! While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. – Brian Apr 10 '20 at 20:15