Are there ways to automatically detect what book an e-text is?

Question

I know that these days, there are services (I think Google provides one) that can take a "digital fingerprint" of a song, and help you catalogue that song, e.g. composer/band/genre/year.

Is there a similar service for books, where it would, based on text snippets, automatically tell you the author/title/year/ISBN?

Don't care if the service is packaged as a "book cataloging" app or merely a per-single-book API that I can use in my own software.

+1 An interesting question, although typically just searching online will yield the book name if it is sufficiently popular. — Nathan Osman, Dec 20 '13 at 19:24
@NathanOsman - I'm interested in something that can be used from within a program, as opposed to requiring my own eyeballs and result-choice heuristics, as a random Google search would. — DVK, Dec 20 '13 at 19:28
I can see copyright laws preventing a single source storing all the text for [all/most] books. Maybe some of the big players can come together and provide such a service. — Jason Down, Dec 24 '13 at 14:56
@JasonDown - you don't need to store all the text. Just signatures (think the services that identify MP3 for you from sound signature; or Google's thing that automatically detects copyrighted material on Youtube) — DVK, Dec 24 '13 at 15:08
Yes, that makes sense. When you said based on text snippets, I was thinking you meant based on any text snippet within the document. — Jason Down, Dec 24 '13 at 15:12
This would be cool if it existed (hopeful!) @palacsint may be on to something. — Shimmy Hacked, Dec 25 '13 at 02:01

score 3 · Answer 1 · answered Dec 25 '13 at 13:05

I am not aware of any such tool, service, or API. Publishers generally don't offer APIs, IMHO mainly because copyright-infringing sites or competing businesses might use them.

So you need to take a custom approach. Most sites handle their search queries with the GET method, so you can build the query URL yourself and do some data mining on the results with scripts (wget, Selenium, etc.).

You could do it like this:

1. Search for the exact text in Google. Example search:

   "Numerical boundaries take many forms but are always applied in finite games. Persons are selected for finite play."

2. Look for the ISBN, or for the title and author, in the resulting pages using regular expressions, CSS selectors, XPath, etc.

3. Search using an advanced query on Amazon or another site: http://www.amazon.com/gp/search/ref=sr_adv_b/?search-alias=stripbooks&unfiltered=1&field-isbn=1476731713

   Notice &field-isbn=1476731713; the same could be done with &field-author= or &field-title=

4. Use regular expressions to extract all the book data.
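
As a rough illustration of the ISBN-extraction and query-building steps above, here is a minimal Python sketch. It assumes the page text has already been fetched by some other means (wget, Selenium, etc.), so it makes no network calls. The function names (find_isbns, amazon_search_url) are made up for this example, and the Amazon URL format is copied from the link above and may no longer be accepted by the site. Candidates are checked against the standard ISBN-10 and ISBN-13 check digits so that arbitrary digit runs are not mistaken for ISBNs.

```python
import re
from urllib.parse import urlencode

# Runs of 10 to 13 digits, optionally separated by spaces or hyphens,
# possibly ending in X (the ISBN-10 check character).
_CANDIDATE = re.compile(r"(?<![0-9-])(?:[0-9][ -]?){9,12}[0-9Xx](?![0-9-])")


def is_valid_isbn10(s):
    """True if s is 10 characters with a valid ISBN-10 check digit."""
    if not re.fullmatch(r"[0-9]{9}[0-9X]", s):
        return False
    total = sum((10 - i) * (10 if ch == "X" else int(ch))
                for i, ch in enumerate(s))
    return total % 11 == 0


def is_valid_isbn13(s):
    """True if s is 13 digits with a valid ISBN-13 check digit."""
    if not re.fullmatch(r"[0-9]{13}", s):
        return False
    total = sum(int(ch) * (3 if i % 2 else 1) for i, ch in enumerate(s))
    return total % 10 == 0


def find_isbns(text):
    """Return the distinct valid ISBNs found in text, in order of appearance."""
    found = []
    for match in _CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group()).upper()
        if (is_valid_isbn10(digits) or is_valid_isbn13(digits)) and digits not in found:
            found.append(digits)
    return found


def amazon_search_url(isbn=None, author=None, title=None):
    """Build an advanced-search URL in the format quoted in the answer (assumed, not verified)."""
    params = {"search-alias": "stripbooks", "unfiltered": "1"}
    if isbn:
        params["field-isbn"] = isbn
    if author:
        params["field-author"] = author
    if title:
        params["field-title"] = title
    return "http://www.amazon.com/gp/search/ref=sr_adv_b/?" + urlencode(params)


if __name__ == "__main__":
    page_text = "Paperback. ISBN-10: 1476731713. Published 2013."
    for isbn in find_isbns(page_text):
        print(isbn, amazon_search_url(isbn=isbn))
```

The checksum step is a design choice to cut down on false positives such as phone numbers; title and author extraction would still need site-specific selectors or patterns.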

This would be my approach.
