Separating paragraphs in an image

Asked Aug 11 '21 at 17:58

Active Aug 11 '21 at 17:58

Viewed 46 times

I have a large collection of PDFs that are converted to images and then to text files. The text will eventually be used for machine learning, and we would prefer it to be in paragraph form at that point.

I found the following thread, which explains how to take an image and draw a green box around each block that the script considers to be a paragraph. I need something that will (maybe using the coordinates this script generates to draw the boxes?) put the text from each box into another data structure, like a list, and then create the final txt document with carriage returns between each "paragraph".

How to detect paragraphs in a text document image for a non-consistent text structure in Python
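To make the goal concrete, here is a minimal sketch of the step I am after. It assumes the boxes are available as `(x, y, w, h)` tuples (the form `cv2.boundingRect` returns), that the image is a NumPy array, and that the page has a single column. The names `sort_reading_order`, `boxes_to_paragraphs`, and `write_paragraphs` are made up for illustration, and the OCR function is passed in rather than fixed.

```python
from typing import Callable, Iterable, List, Tuple

# Assumed box format: (x, y, width, height), as returned by cv2.boundingRect.
Box = Tuple[int, int, int, int]


def sort_reading_order(boxes: Iterable[Box]) -> List[Box]:
    """Sort boxes top-to-bottom, then left-to-right (single-column assumption)."""
    return sorted(boxes, key=lambda b: (b[1], b[0]))


def boxes_to_paragraphs(image, boxes: Iterable[Box], ocr: Callable) -> List[str]:
    """Crop each box out of the image, OCR it, and return one string per box."""
    paragraphs = []
    for x, y, w, h in sort_reading_order(boxes):
        crop = image[y:y + h, x:x + w]  # NumPy slicing: rows (y) first, then columns (x)
        text = " ".join(ocr(crop).split())  # collapse line breaks inside the block
        if text:  # skip boxes where OCR found nothing
            paragraphs.append(text)
    return paragraphs


def write_paragraphs(paragraphs: List[str], path: str) -> None:
    """Write the paragraphs to a text file, separated by one blank line."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n\n".join(paragraphs) + "\n")
```

With the linked script, I would expect to collect the boxes inside its contour loop and then call something like `boxes_to_paragraphs(image, boxes, pytesseract.image_to_string)`, but I have not confirmed that this is the right approach.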

asked Aug 11 '21 at 17:58

PracticingPython

Did you try to use `pytesseract`? – crackanddie Aug 11 '21 at 18:39
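One possible way to follow up on this suggestion, sketched under assumptions: `pytesseract.image_to_data` with `output_type=Output.DICT` returns per-word lists that include `page_num`, `block_num`, `par_num`, and `text`, so words can be grouped by Tesseract's own paragraph numbering. The helper names `group_words_by_paragraph` and `ocr_paragraphs` are hypothetical, and how well Tesseract's paragraph segmentation matches the boxes from the linked thread is untested here.

```python
from typing import Dict, List


def group_words_by_paragraph(data: Dict[str, list]) -> List[str]:
    """Group OCR words into paragraphs keyed by (page, block, paragraph) number.

    `data` is expected to look like the dict returned by
    pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT).
    """
    paragraphs: Dict[tuple, List[str]] = {}
    for page, block, par, word in zip(
        data["page_num"], data["block_num"], data["par_num"], data["text"]
    ):
        word = word.strip()
        if not word:  # structural rows (page, block, line) carry empty text
            continue
        # Dicts preserve insertion order, so paragraphs stay in Tesseract's order.
        paragraphs.setdefault((page, block, par), []).append(word)
    return [" ".join(words) for words in paragraphs.values()]


def ocr_paragraphs(image_path: str) -> List[str]:
    """Run Tesseract on one image file and return its text as a list of paragraphs."""
    # Imported here so the grouping helper above works without these packages.
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    return group_words_by_paragraph(data)
```

The resulting list could then be joined with `"\n\n"` and written to the txt file, which would give one blank line between paragraphs.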


0 Answers