How to OCR tables of contents to proper outputs?

Question

Usually, when you OCR a table of contents, the columns are separated by a large space, so the output is not properly ordered. For example, for a table like this:

The output would be:

The Rank Function
Permutations of Atoms
Pure Set Theory and Axiom System ZF
3.5
3.6
3.7

I'd like it to be:

3.5 The Rank Function\112
3.6 Permutations of Atoms\116
3.7 Pure Set Theory and Axiom System ZF\118

But different TOCs have different output patterns, so there is no way to build a regex script that automatically fixes every book. The best approach is to fix it in the first place. But how?

Stilgar Dragonclaw · Accepted Answer · 2018-07-10T00:57:27.487

2

Define what you mean by "fix it in the first place".

If you want to fix wrong output after the OCR analysis, you will never find a simple solution that works for an unbounded set of TOCs. You cannot cover every variation. You would have to create a machine learning algorithm that analyzes each TOC variant.

Or, for a simple TOC, count the substrings that share the same characteristics.

Chapter number
Chapter number
Chapter number
Chapter number
Chapter number
...

= 5

Chapter title
Chapter title
Chapter title
Chapter title
Chapter title
...

= 5
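The counting idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `pair_toc_lines` and the regex for chapter numbers are hypothetical, and the sketch handles only the two kinds of lines shown in the question's sample output (chapter numbers and titles), not page numbers. If the two counts match, the lines are paired in order; otherwise the sketch refuses to guess.

```python
import re

# Assumption: a chapter number looks like "3", "3.5" or "3.5.1".
NUMBER = re.compile(r"^\d+(\.\d+)*$")

def pair_toc_lines(lines):
    """Pair chapter numbers with titles that OCR emitted as separate runs."""
    lines = [ln.strip() for ln in lines if ln.strip()]
    numbers = [ln for ln in lines if NUMBER.match(ln)]
    titles = [ln for ln in lines if not NUMBER.match(ln)]
    # Only pair when both groups have the same size, as in the counting idea.
    if len(numbers) != len(titles):
        raise ValueError(
            f"{len(numbers)} numbers vs {len(titles)} titles: cannot pair"
        )
    return [f"{n} {t}" for n, t in zip(numbers, titles)]

ocr_output = [
    "The Rank Function",
    "Permutations of Atoms",
    "Pure Set Theory and Axiom System ZF",
    "3.5",
    "3.6",
    "3.7",
]
for entry in pair_toc_lines(ocr_output):
    print(entry)
```

This relies on the OCR keeping each column in top-to-bottom order, which is not guaranteed for every TOC layout.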

If you want to fix the OCR analysis itself, it's good to first answer: what OCR tool do you use?

For example, in Tesseract you can set the text to be processed by rows instead of columns.
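The answer does not name the setting, so the following is a guess at what is meant: Tesseract's page segmentation mode, passed on the command line as `--psm`. Mode 6 tells Tesseract to assume a single uniform block of text, which skips column detection, so each line tends to be read across the full page width. The helper names and the file names `toc.png` and `toc` are hypothetical, and older 3.x releases spell the flag `-psm`.

```python
import subprocess

def tesseract_command(image_path, output_base, psm=6):
    """Build a Tesseract CLI call with an explicit page segmentation mode."""
    # psm 6: assume a single uniform block of text (no column analysis).
    return ["tesseract", image_path, output_base, "--psm", str(psm)]

def ocr_toc(image_path, output_base, psm=6):
    """Run Tesseract; requires the tesseract binary to be on PATH."""
    subprocess.run(tesseract_command(image_path, output_base, psm), check=True)
    # Tesseract writes its plain-text result to <output_base>.txt
    return output_base + ".txt"

# Example (not executed here): ocr_toc("toc.png", "toc")
print(" ".join(tesseract_command("toc.png", "toc")))
```

If mode 6 still splits the columns, mode 4 (a single column of text of variable sizes) is another value worth trying.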

edited Jul 10 '18 at 00:57

answered Jul 10 '18 at 00:38

This sounds promising. What GUI do you use to make that screenshot? I've checked the 3rd party projects but don't know which one – Ooker Jul 10 '18 at 05:21
I've opened a question in Software Recommendation: What OCR tools can recognize text by rows instead of columns? – Ooker Jul 10 '18 at 06:00
@Ooker : I think most OCR tools let you set how the page should be analyzed. The picture is my schematic illustration. But FineReader and gImageReader (GitHub), the advanced GUI for Tesseract, also show such borders. For automatic analysis, though, you do not need a GUI unless you want to edit something manually, IMHO, because you cannot batch-process a large number of TOCs automatically if you want to check and edit each one. – Stilgar Dragonclaw Jul 10 '18 at 18:12

Ooker · Answer 2 · 2018-07-10T06:00:56.807

-1

This doesn't really answer the question, but some books on Google Books have a TOC:

edited Jul 10 '18 at 06:00

answered May 21 '18 at 13:57
