I'm trying to extract data from tables inside some pdf reports.

I've seen some examples using pdftools and similar packages, and I was successful in getting the text. However, I just want to extract the tables.

Is there a way to use R to recognize and extract only tables?

Brian Tompsett - 汤莱恩
RCS

2 Answers

Awesome question, I wondered about the same thing recently, thanks!

I did it with tabulizer ‘0.2.2’, as @hrbrmstr also suggests. If you are using R > 3.5.x, the following solution should work. Install the three packages in this specific order:

# install.packages("rJava")
# library(rJava) # load and attach 'rJava' now
# install.packages("devtools")
# devtools::install_github("ropensci/tabulizer", args="--no-multiarch")

Update: After just testing the approach again, it looks like it's enough to just do install.packages("tabulizer") now. rJava will be installed automatically as a dependency.

Now you are ready to extract tables from your PDF reports.

library(tabulizer)

## load report
l <- "https://sedl.org/afterschool/toolkits/science/pdf/ast_sci_data_tables_sample.pdf" 
m <- extract_tables(l, encoding="UTF-8")[[2]]  ## comes as a character matrix
## Note: peep into `?extract_tables` for further specs (page, location etc.)!

## use first row as column names
dat <- setNames(type.convert(as.data.frame(m[-1, ]), as.is=TRUE), m[1, ])
## example-specific date conversion: %y reads 00-68 as 20xx, so shift those years back a century
dat$Date <- as.POSIXlt(dat$Date, format="%m/%d/%y")
dat <- within(dat, Date$year <- ifelse(Date$year > 120, Date$year - 100, Date$year))

dat ## voilà
#    Speed (mph)          Driver                        Car    Engine       Date
# 1      407.447 Craig Breedlove          Spirit of America    GE J47 1963-08-05
# 2      413.199       Tom Green           Wingfoot Express    WE J46 1964-10-02
# 3      434.220      Art Arfons              Green Monster    GE J79 1964-10-05
# 4      468.719 Craig Breedlove          Spirit of America    GE J79 1964-10-13
# 5      526.277 Craig Breedlove          Spirit of America    GE J79 1965-10-15
# 6      536.712      Art Arfons              Green Monster    GE J79 1965-10-27
# 7      555.127 Craig Breedlove Spirit of America, Sonic 1    GE J79 1965-11-02
# 8      576.553      Art Arfons              Green Monster    GE J79 1965-11-07
# 9      600.601 Craig Breedlove Spirit of America, Sonic 1    GE J79 1965-11-15
# 10     622.407   Gary Gabelich                 Blue Flame    Rocket 1970-10-23
# 11     633.468   Richard Noble                   Thrust 2 RR RG 146 1983-10-04
# 12     763.035      Andy Green                 Thrust SSC   RR Spey 1997-10-15

Hope it works for you.

Limitations: The table in this example is quite simple. For messier tables you may have to clean up the extracted cells with gsub and similar functions.
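As a rough illustration of that kind of cleanup, here is a minimal sketch. The matrix `m` is made up for the example (it only imitates the shape of one element returned by `extract_tables`), and the specific problems it contains (stray whitespace, a thousands separator) are assumptions, not something observed in the PDF above.

```r
# Hypothetical character matrix, shaped like one element of extract_tables() output.
m <- matrix(
  c("Speed (mph)", "Driver",           "Engine",
    " 407.447 ",   "Craig  Breedlove", "GE J47",
    "1,234.5",     "Andy Green\r",     " RR Spey"),
  ncol = 3, byrow = TRUE
)

# Squeeze every run of whitespace (including \r) to one space and trim the ends.
# Assigning into m[] keeps the matrix shape.
m[] <- trimws(gsub("\\s+", " ", m))

# Promote the first row to column names.
dat <- setNames(
  as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE),
  m[1, ]
)

# Strip thousands separators before converting the numeric column.
dat[["Speed (mph)"]] <- as.numeric(gsub(",", "", dat[["Speed (mph)"]], fixed = TRUE))

dat
```

Which substitutions are needed depends entirely on the PDF, so treat the patterns here as placeholders.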

jay.sf
  • tabulizer can be ridiculously difficult to install. I never got it working on my Mac. – Nettle Sep 03 '18 at 20:45
  • .@jaySf - The issue I am facing is that `tabulizer()` is reading all the tables but only the header of the table and not the contents of it. Any suggestion how to solve this? – Chetan Arvind Patil Oct 17 '18 at 18:56
  • @ChetanArvindPatil Hard to tell w/o any example. I assume that it depends on the software that created the pdf whether tabulizer works or not. – jay.sf Oct 18 '18 at 06:49
  • I found this helpful, but it still didn't work completely. https://stackoverflow.com/questions/43884603/installing-tabulizer-package-in-r gave alternative steps that worked for me (Win10). – Marcus D Dec 31 '19 at 10:23

I would love to know the answer to this as well. But from my experience, you need to use regular expressions to get the data in a format that you want. You can see the following as an example:

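library(pdftools)
## read the PDF; pdf_text() returns one string per page
dat <- pdf_text("https://s3-eu-central-1.amazonaws.com/de-hrzg-khl/kh-ffe/public/artikel-pdfs/Free_PDF/BF_LISTE_20016.pdf")
## glue all pages into a single string
dat <- paste0(dat, collapse = " ")
## match everything from the table header to the last entry of the table
pattern <- "Berufsfeuerwehr\\s+Straße(.)*02366.39258"
extract <- regmatches(dat, regexpr(pattern, dat))
## turn line breaks into column gaps, then split on runs of 2+ whitespace characters
extract <- gsub('\n', "  ", extract)
strsplit(extract, "\\s{2,}")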

From here the data can then be looped to create the table as desired. But as you can see in the link, the PDF is not only a table.
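A minimal sketch of that last step, assuming every row has the same number of columns and that columns are separated by at least two spaces. The `page` string below is invented for the example (it imitates what `pdf_text` returns for one page) and is not taken from the linked PDF.

```r
# Hypothetical page text: rows separated by "\n", columns by runs of spaces.
page <- paste(
  "Name            Street              Phone",
  "Station A       1 Main Street       0123 4567",
  "Station B       22 Long Road        0123 9999",
  sep = "\n"
)

# Split into rows first, then split each row on runs of 2+ whitespace characters.
rows   <- strsplit(page, "\n", fixed = TRUE)[[1]]
fields <- strsplit(trimws(rows), "\\s{2,}")

# Keep only rows with as many fields as the header, so that stray
# non-table lines do not break rbind().
n_col  <- length(fields[[1]])
fields <- fields[lengths(fields) == n_col]

# Stack the data rows and use the header row for the column names.
tab <- as.data.frame(do.call(rbind, fields[-1]), stringsAsFactors = FALSE)
names(tab) <- fields[[1]]

tab
```

Splitting by line before splitting by column keeps the row boundaries, which are lost when all line breaks are replaced first. Cells that contain two consecutive spaces, or empty cells, would need extra handling.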

halfer