How can I read from a word (docx) file in python. I can read from a txt file but can not do the same for MS Office word document. Any suggestions?
Asked
Active
Viewed 4,776 times
2
-
1Possible duplicate of [How to extract text from an existing docx file using python-docx](https://stackoverflow.com/questions/25228106/how-to-extract-text-from-an-existing-docx-file-using-python-docx) – Brian Apr 10 '20 at 20:15
-
Does this answer your question? [Read Docx files via python](https://stackoverflow.com/questions/29309085/read-docx-files-via-python) – Elisabeth Strunk Apr 10 '20 at 20:21
2 Answers
6
There are a couple of packages that let you do this. Check
docx2txt (note that it does not seem to work with
.doc). As per this, it seems to get more info than python-docx. From original documentation:
import docx2txt
# extract text
text = docx2txt.process("file.docx")
# extract text and write images in /tmp/img_dir
text = docx2txt.process("file.docx", "/tmp/img_dir")
Since
.docxfiles are simply.zipfiles with a changed extension, this shows how to access the contents. This is a significant difference with.docfiles, and the reason why some (or all) of the above do not work with.docs. In this case, you would likely have to convertdoc->docxfirst.antiwordis an option.
sancho.s ReinstateMonicaCellio
- 13,213
- 18
- 76
- 164
-
Thank you doc2txt solved my problem. I had been struggling with this for a long time. – Sina Cengiz Apr 11 '20 at 19:26
2
See this library that allows for reading docx files https://python-docx.readthedocs.io/en/latest/
You should use the python-docx library available on PyPi. Then you can use the following
doc = docx.Document('myfile.docx')
allText = []
for docpara in doc.paragraphs:
allText.append(docpara.text)
Sri
- 2,128
- 2
- 13
- 21
-
Hello! While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. – Brian Apr 10 '20 at 20:15