I have a simple text analyzer that generates keywords for a given input text. Until now I have evaluated it manually, i.e., by selecting keywords for a text by hand and comparing them against the ones the analyzer generates.

Is there any way to automate this? I have searched extensively for free keyword generators that could help with this evaluation but have not found any so far. I would appreciate any suggestions on how to go about this.

MånsT
  • This sounds like you just want help with writing some code that will automatically evaluate the validity of your algorithm. If so, & there's nothing really about the statistical &/or machine-learning aspects that you want help with, this may be better suited to other Stack Exchange sites, like Stack Overflow for general programming, or Theoretical Computer Science for a different perspective, but I can't be sure. – gung - Reinstate Monica Aug 06 '12 at 14:29
  • @snow_leopard What is a valid keyword ? Do you create them just to have them or are you planning a follow up usage / application ? – steffen Aug 06 '12 at 15:10
  • @gung I think it depends on the application. E.g. if the keyword extraction is the first step to create a document clustering, it is suitable for stats.SE. – steffen Aug 06 '12 at 15:12
  • Right now it's just keyword generation for creating tags for a document. My intention here is not to get help in writing code to evaluate but to explore ways to automate the testing, e.g. maybe links to some good free keyword generators whose output I can compare mine with. In general, how do people evaluate a statistical engine which generates tags for a given text? – snow_leopard Aug 06 '12 at 16:42

4 Answers

A free keyword generator is the Alchemy API (you need to register in order to get an API key). It generates keywords from any content in English and assigns a relevance score to each of them. In my first tests I found that it does a pretty good job on texts about molecular biology.

You could use it to compare your results with theirs and weigh the differences by their relevance score.
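One possible way to sketch such a relevance-weighted comparison is shown below. It does not call the Alchemy API; it assumes you have already collected the reference keywords and their relevance scores into a plain dictionary. The function name `weighted_recall` and the sample data are hypothetical, chosen only for illustration.

```python
def weighted_recall(reference, candidate):
    """Share of the reference relevance mass covered by the candidate keywords.

    reference: dict mapping reference keyword -> relevance score (assumed format)
    candidate: iterable of keywords produced by your own analyzer
    """
    # Normalize case and whitespace so trivial differences do not count as misses.
    candidate_norm = {kw.strip().lower() for kw in candidate}
    total = sum(reference.values())
    if total == 0:
        return 0.0
    hit = sum(
        score
        for kw, score in reference.items()
        if kw.strip().lower() in candidate_norm
    )
    return hit / total


# Made-up example data, not real API output.
reference = {"gene expression": 0.5, "protein": 0.25, "cell": 0.25}
mine = ["Protein", "gene expression", "microscope"]
score = weighted_recall(reference, mine)
print(score)
```

Missing a highly relevant reference keyword lowers the score more than missing a marginal one, which is the point of weighting by relevance. Keywords that only your analyzer produces are ignored here; a precision-style measure would be needed to penalize them.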

gui11aume
  • Thanks. Yep, this seems promising. The only issue seems to be the limited usage :(. I wonder why I did not find this by googling! What other approaches do people employ to test keywords? One way, which people have suggested above, is to use the keywords for some other application like document clustering and test that. Are there any other ways of testing keyword generation stand-alone? – snow_leopard Aug 06 '12 at 16:48
  • Good point. I don't know how people usually measure quality of keyword extraction. There might be manually curated datasets that are used to train your method. I could not find one after 10 min. Googling, perhaps you'll be luckier. – gui11aume Aug 06 '12 at 17:38
There are several ways to evaluate keywords ...

Stand alone (evaluating only one generator at a time)

According to Wikipedia (Index Term as a synonym for keyword in Information Retrieval), a keyword is

(a) term that captures the essence of the topic of a document

which can either mean that the term is a summary that does not appear in the document (hard for machines) or a term that appears often in the document, possibly in variations (easy for machines), but not too often (otherwise it might be a common word like "and"). A commonly used method here is the TF-IDF score.

But what do "often" and "too often" mean? This is unclear; it is in the eye of the beholder, and that is exactly why this sort of standalone validation is not possible.
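For readers unfamiliar with the TF-IDF score mentioned above, here is a minimal sketch in plain Python. It uses one common variant (relative term frequency times the logarithm of the inverse document frequency); other weightings exist, and the tiny corpus is invented for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            # Frequent in this document, rare across the corpus -> high score.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Invented toy corpus, already split into tokens.
docs = [
    "the gene and the protein".split(),
    "the protein and the cell".split(),
    "the market and the price".split(),
]
scores = tf_idf(docs)
top = max(scores[0], key=scores[0].get)
print(top, scores[0])
```

Words that occur in every document, such as "the" and "and", get a score of zero, while a term that is specific to one document ranks highest there. Where to put the cut-off between keyword and non-keyword is still a judgment call, which is the point made above.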

Comparing the output of two keyword generators

... for the same set of documents. Assuming that you trust one of the generators and hence use it as a reference, you can calculate the overlap using e.g. the Jaccard index.

As a result, the keywords of your generator are as valid as the ones from the reference generator, but not necessarily valid or useful per se.
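The overlap computation described above could look roughly like this. The sketch assumes both generators' outputs are available as dictionaries mapping a document id to a list of keywords; the function names and the sample data are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two keyword collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # Convention chosen here: two empty sets count as identical.
    return len(a & b) / len(a | b)


def mean_jaccard(mine, reference):
    """Average Jaccard index over the documents present in both mappings."""
    shared = mine.keys() & reference.keys()
    if not shared:
        return 0.0
    return sum(jaccard(mine[d], reference[d]) for d in shared) / len(shared)


# Made-up keyword sets for two documents.
mine = {"doc1": ["love", "feeling"], "doc2": ["hate"]}
reference = {"doc1": ["feeling", "emotion"], "doc2": ["hate"]}
overlap = mean_jaccard(mine, reference)
print(overlap)
```

A value of 1 means both generators agree on every document and 0 means they share no keyword at all. In practice you would probably normalize case and word forms before comparing.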

Evaluating the keyword relevance for an application

... which illustrates why standalone validation is not possible.

Suppose you have two documents, each containing the following words (among useless others)

  • document A: love, feeling
  • document B: hate, feeling

and 100000 more documents, all about statistics, in which none of these words appears.

Now you have to pick one keyword per document, and only one. Which one is the best? It depends ...

  • If you want to cluster the documents according to their topic, you have to use feeling.
  • If you want to create a sentiment classifier, which labels all documents as positive, negative or neutral, you have to use love and hate, because otherwise you cannot distinguish the two documents.
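The two cases above can be played through in a few lines. This is only a toy illustration: documents are reduced to sets of words, the words of the statistics document are invented, and `represent` is a hypothetical helper that keeps just the words belonging to a chosen keyword vocabulary.

```python
doc_a = {"love", "feeling"}
doc_b = {"hate", "feeling"}
stats_doc = {"variance", "regression"}  # Stand-in for the statistics documents.


def represent(doc, vocabulary):
    """Keep only the words of a document that belong to the keyword vocabulary."""
    return frozenset(doc & vocabulary)


topic_vocab = {"feeling"}
sentiment_vocab = {"love", "hate"}

# Topic clustering: A and B look identical and differ from the statistics document.
same_topic = represent(doc_a, topic_vocab) == represent(doc_b, topic_vocab)
apart_from_stats = represent(doc_a, topic_vocab) != represent(stats_doc, topic_vocab)

# Sentiment classification: only love/hate separates A from B.
separates_sentiment = represent(doc_a, sentiment_vocab) != represent(doc_b, sentiment_vocab)

print(same_topic, apart_from_stats, separates_sentiment)
```

With the keyword feeling, documents A and B collapse into one group, which is what a topic clustering wants and what a sentiment classifier cannot work with.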

In summary, one can easily evaluate whether a set of keywords is useful for an application, be it a sentiment classifier, a spam detector or a search engine. But a keyword that is useful for one application is not necessarily useful for another.

Update

Seems to be a rule of the internet: Everything you can think of is probably already a research discipline: Terminology Extraction.

steffen
  • Thanks Steffen. This is a great explanation! I guess there are lots of open source datasets available for text classification, clustering etc like Reuters, Newsgroup20, but I could not find any good dataset for keywords evaluation. Do you know of any corpus having text files along with their keywords on which I can base my evaluation ? – snow_leopard Aug 08 '12 at 18:09
  • @snow_leopard I found a little gem and updated the answer. The most promising link is unfortunately broken, but here is the website of Mr. Wilson Wong. Or here, search for Ontology learning from text using Web data for information exploration. The example data does not seem to be available anymore, but if you contact him (or other researchers with promising papers) and ask politely, he might help you out. In my experience, researchers react gratefully when contacted about their work (if they have time ;)). – steffen Aug 08 '12 at 20:34
  • Fantastic! Wish me luck in getting response from them :) – snow_leopard Aug 09 '12 at 11:04
Another free keyword generator is Text Mechanic's Keyword Generator tool.

I found it extremely useful in my work since it automated everything with "Load", "Random", "Auto Save", etc.

Yahoo! has a content analysis service [1] that might do what you want. However, there are QPS (queries per second) limits, and it is usually intended for non-commercial use. It should still be fine if you are doing an in-house evaluation of your tool rather than plugging the API directly into your application, provided the QPS limit is high enough to generate your own evaluation data.

[1] http://developer.yahoo.com/search/content/V2/contentAnalysis.html

TenaliRaman