How to check if a PDF is a scanned image or contains text in bulk? I want to split 1000 files into 2 folders automatically

Asked Apr 08 '22 at 04:20

Active Apr 08 '22 at 04:20

Viewed 26 times

Aim to split them into 2 folders only. Don't want to extract text or whatever.

asked Apr 08 '22 at 04:20

Teoh Sin Yee

Does this answer your question? [How to check if PDF is scanned image or contains text](https://stackoverflow.com/questions/55704218/how-to-check-if-pdf-is-scanned-image-or-contains-text) – Savvas Nicolaou Apr 08 '22 at 04:23
Thanks @SavvasNicolaou, I found this snippet (https://stackoverflow.com/a/59421043/12307615) might work for a half pipeline. It prints out the pdf types. But how to store the PDFs into respective folder automatically? Imagine after running the code, all PDF files already split into 2 folders. – Teoh Sin Yee Apr 08 '22 at 04:26
To be honest I'm not sure. I haven't used python in a while...but you could try to use a loop and move each file based on searchability and filesize using import os. Unless it is something more complicated? – Savvas Nicolaou Apr 08 '22 at 04:34
Thanks @SavvasNicolaou. Have solved it recently. 1st, I loop through all files and check the PDF types of each of them. (Scanned-image, Non-scanned-image) Then use shutil to move the files to their respective folders. – Teoh Sin Yee Apr 12 '22 at 06:04

0 Answers0