1

docx to txt:

I tried the following command to extract text from a docx file. It does not work when the docx contains images.

unzip -p some.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'

For pptx to txt, I found a Perl script that extracts the text. It has the same problem: it does not work when the pptx contains images.

I want the extracted text so that I can enable searching across documents. A command or script that skips the images and converts only the text content of a docx to txt would be enough.

RPS

1 Answer

4

The SO question "How to extract just plain text from .doc & .docx files?" provides other options.
The LibreOffice answer there almost works, and probably did work in 2012.
Now (LibreOffice 5.1), try:

libreoffice --convert-to txt:Text some.docx

or

libreoffice --headless --convert-to txt:Text some.docx

Make sure that LibreOffice is not already open when you run the command.
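If LibreOffice is not available, the unzip pipeline from the question can be extended a little. The following is only a rough sketch, not a full docx parser: `docx_xml_to_text` is a hypothetical helper name, and it assumes that paragraphs in word/document.xml end with `</w:p>` and that only the basic XML entities need decoding. It does not handle tabs, line breaks inside a paragraph, headers, footnotes, or field codes.

```shell
# Hypothetical helper (name is illustrative): reads WordprocessingML on
# stdin and writes plain text on stdout, one line per paragraph.
docx_xml_to_text() {
    awk '{
        gsub(/<\/w:p>/, "\n")     # end of paragraph becomes a newline
        gsub(/<[^>]*>/, "")       # drop every remaining XML tag
        gsub(/&lt;/, "<")         # decode the basic XML entities
        gsub(/&gt;/, ">")
        gsub(/&amp;/, "\\&")      # "\\&" is a literal ampersand in awk
        print
    }'
}
```

It would be used in place of the sed call, for example:

unzip -p some.docx word/document.xml | docx_xml_to_text > some.txt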

Ry-
Rolf of Saxony