How can I read a Microsoft .docx file in R and get the text as one field and page number as another?

With the readtext R package I can read the text, but is there a way to get the page number as well?

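# The example file ships with docxtractr, so that package must be installed too
install.packages(c("readtext", "docxtractr"))
library(readtext)

doc <- readtext(system.file("examples/realworld.docx", package = "docxtractr"))
doc$text  # the full document text as a single string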

The desired output would be:

text                page_number
text from page 1     1
text from page 2     2

Please advise.

Geet
  • From looking into it, I'm not sure Word actually stores page numbers; it just dynamically flows text onto a new page when the current one is full. There are page break tags in the `xml`, but I think those are only for breaks that were inserted manually. I'd be interested in knowing if this is possible. https://stackoverflow.com/questions/23980268/find-a-new-page-in-a-word-document – Anonymous coward Jul 30 '18 at 20:39
  • I found that the read_pdf function of the textreadr R package does return the page number and line number, but then how would I convert a .docx file to PDF using R? – Geet Jul 30 '18 at 20:48
  • You can see if `pandoc` works. https://stackoverflow.com/questions/49113503/how-to-convert-docx-to-pdf-in-r – Anonymous coward Jul 30 '18 at 21:08
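
Following the conversion idea in the comments, here is a minimal sketch of the "convert to PDF, then read page by page" route. It is not a tested solution for this exact file. It assumes LibreOffice's `soffice` binary is on the PATH and that the pdftools package is installed, and it uses LibreOffice rather than `pandoc` for the conversion, since pandoc typically builds PDFs through LaTeX and so would not be expected to reproduce Word's page layout. The function names `pages_to_df`, `docx_to_pdf`, and `read_docx_pages` are made up for illustration.

```r
# Sketch: convert a .docx to PDF, then read the PDF one page at a time.
# Assumptions: `soffice` (LibreOffice) is on the PATH and pdftools is installed.
# All function names below are hypothetical helpers, not part of any package.

# Turn a character vector with one element per page into the desired table.
pages_to_df <- function(pages) {
  data.frame(
    text = trimws(pages),
    page_number = seq_along(pages),
    stringsAsFactors = FALSE
  )
}

# Convert the .docx with LibreOffice in headless mode and return the PDF path.
docx_to_pdf <- function(docx_path, out_dir = tempdir()) {
  status <- system2(
    "soffice",
    c("--headless", "--convert-to", "pdf", "--outdir",
      shQuote(out_dir), shQuote(docx_path))
  )
  pdf_path <- file.path(
    out_dir,
    paste0(tools::file_path_sans_ext(basename(docx_path)), ".pdf")
  )
  if (status != 0 || !file.exists(pdf_path)) {
    stop("Conversion failed; is LibreOffice installed and on the PATH?")
  }
  pdf_path
}

# Full pipeline: .docx -> PDF -> one row per page.
read_docx_pages <- function(docx_path) {
  if (!requireNamespace("pdftools", quietly = TRUE)) {
    stop("Please install the pdftools package first.")
  }
  pdf_path <- docx_to_pdf(docx_path)
  # pdftools::pdf_text() returns one string per page of the PDF
  pages_to_df(pdftools::pdf_text(pdf_path))
}

# Usage (not run here):
# doc <- read_docx_pages(
#   system.file("examples/realworld.docx", package = "docxtractr")
# )
# doc[, c("text", "page_number")]
```

Because a .docx file does not store page numbers, the page boundaries you get this way are whatever the converter's layout engine produces, and they may differ from what Word shows for the same document.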

0 Answers