Extracting .pdf table

Asked May 15 '18 at 12:43

Active May 15 '18 at 15:04

Viewed 238 times

I wrote a chunk of R code that works and gets me the .pdf tables I am interested in, but there must be a better way. So importing the data from the PDF is not the problem. I am looking for a BETTER way than the following to extract the tables I am interested in.

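library(tabulizer)  # provides extract_tables()

# URL of the PDF that contains the tables
df_st <- "http://www.drustvo-antropologov.si/AN/PDF/2012_2/Anthropological_Notebooks_XVIII_2_Bjelica.pdf"

# Extract every table that tabulizer detects in the PDF
df_st_table <- extract_tables(df_st)

# Flatten all extracted tables into a single 195-row data frame
df_str <- data.frame(matrix(unlist(df_st_table), nrow = 195, byrow = TRUE))

# Slice out the row ranges of interest and rebuild each slice as a data frame
df_str_a <- df_str[29:52, ]
df_str_a <- data.frame(matrix(unlist(df_str_a), nrow = 24, byrow = TRUE))
df_str_b <- df_str[53:76, ]
df_str_b <- data.frame(matrix(unlist(df_str_b), nrow = 24, byrow = TRUE))
df_str_c <- df_str[101:126, ]
df_str_c <- data.frame(matrix(unlist(df_str_c), nrow = 26, byrow = TRUE))
df_str_d <- df_str[127:152, ]
df_str_d <- data.frame(matrix(unlist(df_str_d), nrow = 26, byrow = TRUE))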

...and then I merge them all. Too long and inelegant.
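
One way the repeated slice-and-rebuild step could be shortened is to keep the row ranges in a named list and apply a single helper to each of them. This is only a minimal sketch under assumptions: `reshape_block` and `row_ranges` are hypothetical names, `df_str` is replaced by a made-up 195-row, 3-column stand-in so the snippet runs without the PDF, and stacking with `rbind` is a guess at what "merge them all" means here.

```r
# Stand-in for the df_str built in the question (195 rows). The 3 columns are
# an assumption; the real data frame comes from tabulizer::extract_tables().
df_str <- data.frame(matrix(seq_len(195 * 3), nrow = 195, byrow = TRUE))

# Row ranges copied from the question, one entry per table of interest.
row_ranges <- list(a = 29:52, b = 53:76, c = 101:126, d = 127:152)

# Hypothetical helper: the same slice-and-rebuild step the question repeats,
# with the number of rows taken from the range instead of being hard-coded.
reshape_block <- function(df, rows) {
  data.frame(matrix(unlist(df[rows, ]), nrow = length(rows), byrow = TRUE))
}

# Apply the helper to every range; the result is a named list of data frames.
blocks <- lapply(row_ranges, reshape_block, df = df_str)

# Assumption: "merge" means stacking the blocks on top of each other.
combined <- do.call(rbind, blocks)
```

With this layout, adding or changing a table only means editing `row_ranges`; `blocks$a` plays the role of `df_str_a`, and so on.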

edited May 15 '18 at 15:04

Brian Tompsett - 汤莱恩

asked May 15 '18 at 12:43

Helena

Possible duplicate of [Recognize PDF table using R](https://stackoverflow.com/questions/44141160/recognize-pdf-table-using-r) – Dror Bogin May 15 '18 at 12:58
@Dhiraj They are already using that package: `tabulizer::extract_tables` – zx8754 May 15 '18 at 13:05
@zx8754 yeah just realized, my bad! – Dhiraj May 15 '18 at 13:07
What is the expected output? – zx8754 May 15 '18 at 13:14
Have a look at this [post](https://medium.com/@CharlesBordet/how-to-extract-and-clean-data-from-pdf-files-in-r-da11964e252e). It shows how to use two packages for PDF extraction (pdftools and tm) – SeGa May 15 '18 at 13:24
Sorry, I forgot to mention that I am using both tabulizer and tm. I guess there is something I am missing, though. Anyway, the code I have copied here does work, but it is awful. I am trying to find something more agile. – Helena May 15 '18 at 13:31

0 Answers