I'm trying to convert DOC or DOCX files to plain text (TXT), ensuring that all formatting and styles are ignored, that encodings render properly, and that no manual pre-processing is required from the user.

The officer package gets me most of the way. The following code yields a TXT file free of junk characters, and without any text indicating heading styles, etc.:

doc <- officer::read_docx("my_doc.docx")
content <- officer::docx_summary(doc)
con <- file("textout.txt", encoding = "UTF-8")
writeLines(content$text, con)
close(con)

However, this output includes the raw field codes. For example, a date field in the input file renders as:

"DATE \@ "d MMMM yyyy" 17 July 2019"

And the Table of Contents object is omitted entirely.

Again, I cannot do any manual pre-processing, unless it's automatable with code! I'm aware that I can unlink all the field codes in Word, but unless there's an automated way of doing that at the command line or in R, that's not an option.
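One automatable route, assuming the xml2 package: a .docx is just a zip archive, and the field instructions (the `DATE \@ "d MMMM yyyy"` part) live in w:instrText nodes inside word/document.xml, while the cached field result (17 July 2019) sits in ordinary w:t runs. Deleting the w:instrText nodes leaves only the result text. A minimal sketch of the XML surgery, run on a hand-made fragment rather than a real document.xml:

```r
library(xml2)

# hand-made WordprocessingML fragment standing in for word/document.xml;
# real files declare the same "w" namespace on the root element
xml_txt <- paste0(
  '<w:document xmlns:w=',
  '"http://schemas.openxmlformats.org/wordprocessingml/2006/main">',
  '<w:p><w:r><w:instrText>DATE \\@ "d MMMM yyyy"</w:instrText></w:r>',
  '<w:r><w:t>17 July 2019</w:t></w:r></w:p></w:document>'
)

x <- read_xml(xml_txt)
# drop the field instructions; the cached field result stays behind
xml_remove(xml_find_all(x, "//w:instrText"))
cleaned <- xml_text(x)
```

For a real file, unzip() the .docx to a temp directory, apply the same read_xml()/xml_remove()/write_xml() steps to word/document.xml, and re-zip with zip() before handing the cleaned copy to officer or pandoc.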

As an alternative, using pandoc leads to text that fixes the field code problem:

rmarkdown::pandoc_convert("my_doc.docx", to = "plain", from = "docx")

But the encodings aren't right. Examples:

"those with an affinityÂ"
"Station’s business model?Â"
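Those artifacts look like the classic signature of UTF-8 bytes being displayed as Windows-1252/Latin-1, rather than pandoc producing bad output: "Â " is the two UTF-8 bytes of a non-breaking space read one byte at a time, and "’" is a right single quote treated the same way (pandoc always writes UTF-8). If a downstream step has already baked the misreading into the strings, the damage is reversible, because Windows-1252 round-trips the bytes. A sketch of the repair, using a string matching the first example (assumes a UTF-8 locale):

```r
# mojibake as produced by reading UTF-8 text as Windows-1252:
# U+00C2 U+00A0 is really the UTF-8 encoding of a non-breaking space
bad <- "those with an affinity\u00c2\u00a0"

# re-encode the characters back to their single-byte CP1252 values,
# recovering the original UTF-8 byte stream...
raw_bytes <- iconv(bad, from = "UTF-8", to = "CP1252", toRaw = TRUE)[[1]]

# ...then declare those bytes to be what they always were: UTF-8
fixed <- rawToChar(raw_bytes)
Encoding(fixed) <- "UTF-8"
```

Better still, avoid the misreading in the first place: write pandoc's output to a file with the output argument of pandoc_convert() and read it back with readLines(path, encoding = "UTF-8"), or confirm your editor/console is set to UTF-8 before concluding the conversion is wrong.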

Can somebody help me sort out a solution here? I'm happy to incorporate other tools, but an R-only approach would be ideal.

  • Perhaps print to PDF [https://superuser.com/questions/393118/how-to-convert-word-doc-to-pdf-from-windows-command-line] and then scrape that with Tabulizer? – Jon Spring Jul 18 '19 at 00:35
  • https://stackoverflow.com/a/33149947? https://word2md.com? https://gist.github.com/vzvenyach/7278543? – r2evans Jul 18 '19 at 05:48
  • are you sure you're viewing the output file as utf-8? it should "just work" see https://pandoc.org/MANUAL.html#character-encoding – mb21 Jul 18 '19 at 10:31
