2

Usually, when you OCR a table of contents, the columns are separated by a large space, so the output is not properly ordered. For example, for a table like this:

The output would be:

The Rank Function
Permutations of Atoms
Pure Set Theory and Axiom System ZF
3.5
3.6
3.7

I'd like it to be:

3.5 The Rank Function\112
3.6 Permutations of Atoms\116
3.7 Pure Set Theory and Axiom System ZF\118

But different TOCs produce different output patterns, so there is no way to write a regex script that automatically fixes every book. The best approach is to fix it in the first place. But how?

Ooker

2 Answers

2

Define what you mean by "fix it in the first place".

If you want to fix wrong output after the OCR analysis, you will never find a simple solution that works for an unbounded set of TOCs, because you cannot cover every variation. You would have to build a machine learning algorithm that analyzes each TOC variant.

Or, for a simple TOC, count the substrings that share the same characteristics:

Chapter number
Chapter number
Chapter number
Chapter number
Chapter number
...

= 5

Chapter title
Chapter title
Chapter title
Chapter title
Chapter title
...

= 5
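
The counting idea above can be sketched roughly as follows. This is only a minimal illustration, not a general fix: the function name `regroup_toc` is made up, and it assumes section numbers always contain a dot (like `3.5`), page numbers are bare integers, and each kind of item comes out of the OCR in the same top-to-bottom order. If the counts do not match, it refuses to guess.

```python
import re

# Assumption: section numbers contain at least one dot, e.g. "3.5" or "3.5.1".
SECTION_RE = re.compile(r"^\d+(\.\d+)+$")
# Assumption: page numbers are bare integers, e.g. "112".
PAGE_RE = re.compile(r"^\d+$")


def regroup_toc(lines):
    """Regroup OCR lines into 'number title' or 'number title\\page' entries."""
    numbers, titles, pages = [], [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if SECTION_RE.match(line):
            numbers.append(line)
        elif PAGE_RE.match(line):
            pages.append(line)
        else:
            titles.append(line)

    # The counts must agree, otherwise the pairing would be a guess.
    if len(numbers) != len(titles):
        raise ValueError(
            f"{len(numbers)} section numbers but {len(titles)} titles"
        )
    if pages and len(pages) != len(titles):
        raise ValueError(
            f"{len(pages)} page numbers but {len(titles)} titles"
        )

    entries = []
    for i, (number, title) in enumerate(zip(numbers, titles)):
        entry = f"{number} {title}"
        if pages:
            entry += "\\" + pages[i]  # backslash separator, as in the question
        entries.append(entry)
    return entries
```

Called on the scrambled lines from the question, this pairs each title with its section number; if the page numbers are also present as separate lines, they are appended after a backslash.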

If you want to fix the OCR analysis itself, it would help to know: which OCR tool do you use?

For example, in Tesseract you can set the text to be processed by rows instead of columns.
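
A minimal sketch of what that could look like, assuming the setting meant here is Tesseract's page segmentation mode (`--psm`). Mode 6 tells Tesseract to treat the page as a single uniform block of text, which tends to keep each TOC row together. The helper names and the file name `toc.png` are hypothetical, and the `tesseract` command-line tool must be installed for `ocr_toc` to work.

```python
import subprocess


def build_tesseract_cmd(image_path, psm=6):
    """Build a tesseract command line that prints the recognized text to stdout.

    psm=6 means "assume a single uniform block of text"; whether this is the
    best mode for a given TOC scan is an assumption worth testing.
    """
    return ["tesseract", image_path, "stdout", "--psm", str(psm)]


def ocr_toc(image_path, psm=6):
    """Run tesseract on the image and return the recognized text."""
    result = subprocess.run(
        build_tesseract_cmd(image_path, psm),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return result.stdout


# Example usage (requires tesseract and an image file):
# print(ocr_toc("toc.png"))
```

If mode 6 still splits the columns, trying mode 4 (a single column of text of variable sizes) through the `psm` argument is a cheap experiment.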

Stilgar Dragonclaw
-1

This doesn't really answer the question, but some books in Google Books have a TOC.

Ooker