I have a document with a .doc extension that contains information in tables. Is there any method in Python to copy all the data into a text file (.txt)?
Viewed 4,627 times
2
- use textract, works well with all file extensions. – Ajay Gupta Jun 13 '18 at 10:26
- Possible duplicate of [Read .doc file with python](https://stackoverflow.com/questions/36001482/read-doc-file-with-python) – Jan Christoph Terasa Jun 13 '18 at 10:56
1 Answer
-1
This is a duplicate question; I have just gathered the answers in one place.
For Linux users: use the textract library, which is not readily available on Windows.

    import textract

    # process() returns bytes, so decode them to get a str
    text = textract.process("path/to/file.extension")
    text = text.decode("utf-8")
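The question asks for a .txt file, so the extracted text still has to be written out. A minimal sketch of that step follows; `save_text` and `doc_to_txt` are hypothetical helper names, not part of textract, and the UTF-8 output encoding is an assumption.

```python
from pathlib import Path


def save_text(text: str, out_path: str) -> None:
    """Write the extracted text to a UTF-8 encoded .txt file."""
    Path(out_path).write_text(text, encoding="utf-8")


def doc_to_txt(doc_path: str, out_path: str) -> None:
    """Extract text with textract and save it (hypothetical helper)."""
    # Imported lazily so save_text can be used without textract installed.
    import textract

    raw = textract.process(doc_path)  # returns bytes
    save_text(raw.decode("utf-8"), out_path)
```

It could then be called as `doc_to_txt("path/to/file.doc", "output.txt")`.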
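For Windows users, if you know the file's encoding (this only works when the .doc file actually contains HTML or similar text markup rather than the binary Word format):

    from bs4 import BeautifulSoup as bs

    # Replace "utf-8" with the file's actual encoding.
    with open(filename, encoding="utf-8") as f:
        soup = bs(f.read(), "html.parser")
    # Drop style and script elements so only visible text remains.
    for s in soup(["style", "script"]):
        s.extract()
    tmpText = soup.get_text()
    # Remove tabs and newlines.
    text = tmpText.replace("\t", "").replace("\n", "").strip()
    print(text)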
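Only for Windows users (requires Microsoft Word and the pywin32 package):

    import os
    import win32com.client

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    # Word resolves relative paths against its own working directory, so pass an absolute path.
    doc = word.Documents.Open(os.path.abspath("myfile.doc"))
    print(doc.Range().Text)
    # Close the document without saving and quit Word so no hidden process is left running.
    doc.Close(False)
    word.Quit()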
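Because the document contains tables, the text returned by `doc.Range().Text` will probably need some cleanup before it is saved. Word's COM interface typically ends paragraphs with `\r` and marks the end of a table cell with the `\x07` character. The sketch below assumes that convention and turns cell markers into tabs; `clean_word_text` and `write_clean_text` are hypothetical helper names, and the exact marker layout may differ between documents.

```python
def clean_word_text(raw: str) -> str:
    """Replace Word's control characters with plain-text separators."""
    # A cell's text usually ends with "\r\x07"; collapse that to one marker.
    text = raw.replace("\r\x07", "\x07")
    # Assumed end-of-cell marker becomes a tab.
    text = text.replace("\x07", "\t")
    # Remaining carriage returns are paragraph ends.
    return text.replace("\r", "\n")


def write_clean_text(raw: str, out_path: str) -> None:
    """Clean the raw Word text and write it to a UTF-8 .txt file."""
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(clean_word_text(raw))
```

With the Word snippet above, this might be called as `write_clean_text(doc.Range().Text, "myfile.txt")` before the document is closed.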
RCvaram