
I'm working on a project where I need to read the text from multiple .doc and .docx files. The .docx files were easy to handle with the docx2txt module, but I cannot for the life of me make it work for .doc files. I've tried textract, but it doesn't seem to work on Windows. I only need the text in each file, no pictures or anything like that. Any ideas?

  • Does this answer your question? [Read .doc file with python](https://stackoverflow.com/questions/36001482/read-doc-file-with-python) – user202729 Jun 09 '20 at 12:10
  • It is not easy to do. `textract` can do it if you have antiword installed. Tika can extract the text, but not the formatting. – erip Jun 09 '20 at 12:10

1 Answer


I found that this seems to work:

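import os
import win32com.client
# Requires Windows with Microsoft Word and pywin32 installed.
word = win32com.client.Dispatch("Word.Application")
word.Visible = False
try:
    # Pass an absolute path: Word may not resolve paths relative to the script.
    doc = word.Documents.Open(os.path.abspath("myfile.doc"))
    print(doc.Range().Text)
    doc.Close(False)  # close without saving changes
finally:
    word.Quit()  # otherwise a hidden Word process keeps running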
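
Since the question mentions multiple files, the same idea can be wrapped in a small helper that reuses one Word instance for all of them. This is only a minimal sketch: the function name `extract_doc_texts` and its `word_app` parameter are made up for illustration, and it assumes `word_app` behaves like the object returned by `win32com.client.Dispatch("Word.Application")`.

```python
import os


def extract_doc_texts(word_app, paths):
    """Return a {path: text} dict, reusing one Word instance for every file.

    `word_app` is assumed to behave like the COM object returned by
    win32com.client.Dispatch("Word.Application").
    """
    texts = {}
    for path in paths:
        # Use an absolute path, since Word may not share the script's
        # working directory.
        doc = word_app.Documents.Open(os.path.abspath(path))
        try:
            texts[path] = doc.Range().Text
        finally:
            # False: close the document without saving changes.
            doc.Close(False)
    return texts


# Possible usage on Windows (assumes pywin32 and Microsoft Word are installed):
#   import win32com.client
#   word = win32com.client.Dispatch("Word.Application")
#   word.Visible = False
#   try:
#       texts = extract_doc_texts(word, ["myfile.doc"])
#   finally:
#       word.Quit()
```

Taking the Word object as a parameter means Word starts only once, which should be noticeably faster than launching it per file, and it leaves the caller responsible for calling `Quit()`.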
Kehinde