I'm trying to convert DOC or DOCX files to plain text (TXT), ensuring that all formatting and styles are ignored, that encodings render properly, and that no manual pre-processing is required from the user.

The officer package gets me most of the way. The following code yields a TXT file free of junk characters, and without any text indicating heading styles, etc.:

doc <- officer::read_docx("my_doc.docx")
content <- officer::docx_summary(doc)
con <- file("textout.txt", encoding = "UTF-8")
writeLines(content$text, con)
close(con)

However, this output includes the raw field codes. For example, a date field in the input file renders as:

"DATE \@ "d MMMM yyyy" 17 July 2019"

And the Table of Contents object is omitted entirely.

Again, I cannot do any manual pre-processing, unless it's automatable with code! I'm aware that I can unlink all the field codes in Word, but unless there's an automated way of doing that at the command line or in R, that's not an option.
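One automatable route, assuming the xml2 package: a .docx is just a zip archive, and the field instructions (the `DATE \@ "d MMMM yyyy"` part) live in w:instrText nodes inside word/document.xml, while the cached field result (17 July 2019) sits in ordinary w:t runs. Deleting the w:instrText nodes leaves only the result text. A minimal sketch of the XML surgery, run on a hand-made fragment rather than a real document.xml:

```r
library(xml2)

# hand-made WordprocessingML fragment standing in for word/document.xml;
# real files declare the same "w" namespace on the root element
xml_txt <- paste0(
  '<w:document xmlns:w=',
  '"http://schemas.openxmlformats.org/wordprocessingml/2006/main">',
  '<w:p><w:r><w:instrText>DATE \\@ "d MMMM yyyy"</w:instrText></w:r>',
  '<w:r><w:t>17 July 2019</w:t></w:r></w:p></w:document>'
)

x <- read_xml(xml_txt)
# drop the field instructions; the cached field result stays behind
xml_remove(xml_find_all(x, "//w:instrText"))
cleaned <- xml_text(x)
```

For a real file, unzip() the .docx to a temp directory, apply the same read_xml()/xml_remove()/write_xml() steps to word/document.xml, and re-zip with zip() before handing the cleaned copy to officer or pandoc.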

As an alternative, using pandoc leads to text that fixes the field code problem:

rmarkdown::pandoc_convert("my_doc.docx", to = "plain", from = "docx")

But the encodings aren't right. Examples:

"those with an affinityÂ"
"Station’s business model?Â"
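Those artifacts look like the classic signature of UTF-8 bytes being displayed as Windows-1252/Latin-1, rather than pandoc producing bad output: "Â " is the two UTF-8 bytes of a non-breaking space read one byte at a time, and "’" is a right single quote treated the same way (pandoc always writes UTF-8). If a downstream step has already baked the misreading into the strings, the damage is reversible, because Windows-1252 round-trips the bytes. A sketch of the repair, using a string matching the first example (assumes a UTF-8 locale):

```r
# mojibake as produced by reading UTF-8 text as Windows-1252:
# U+00C2 U+00A0 is really the UTF-8 encoding of a non-breaking space
bad <- "those with an affinity\u00c2\u00a0"

# re-encode the characters back to their single-byte CP1252 values,
# recovering the original UTF-8 byte stream...
raw_bytes <- iconv(bad, from = "UTF-8", to = "CP1252", toRaw = TRUE)[[1]]

# ...then declare those bytes to be what they always were: UTF-8
fixed <- rawToChar(raw_bytes)
Encoding(fixed) <- "UTF-8"
```

Better still, avoid the misreading in the first place: write pandoc's output to a file with the output argument of pandoc_convert() and read it back with readLines(path, encoding = "UTF-8"), or confirm your editor/console is set to UTF-8 before concluding the conversion is wrong.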

Can somebody help me sort out a solution here? I'm happy to incorporate other tools, but an R-only approach would be ideal.

  • Perhaps print to PDF [https://superuser.com/questions/393118/how-to-convert-word-doc-to-pdf-from-windows-command-line] and then scrape that with Tabulizer? – Jon Spring Jul 18 '19 at 00:35
  • https://stackoverflow.com/a/33149947? https://word2md.com? https://gist.github.com/vzvenyach/7278543? – r2evans Jul 18 '19 at 05:48
  • are you sure you're viewing the output file as utf-8? it should "just work" see https://pandoc.org/MANUAL.html#character-encoding – mb21 Jul 18 '19 at 10:31
