0

I have a large reservoir of PDFs that get converted to images before ultimately being turned into text files. I ultimately will be performing machine learning on the text and we prefer to have them in paragraph form at that point.

I've found the following thread that explains how to take an image and place a green box around each block the script considered to be a paragraph. I need something that will (maybe using the coordinates this script generates to draw the boxes?) send the text into another data structure, like a list, and then create the final txt document with carriage returns between each "paragraph".

How to detect paragraphs in a text document image for a non-consistent text structure in Python

0 Answers0