8

I am looking for an English news dataset with (relevant) entities mentioned in the article labelled with the sentiment/connotation expressed on the entity by the article.

e.g.

A sense of the change in political winds in Karnataka is also evident with how Bangalore has voted. The state capital, which hugely favoured the <NEGATIVE>BJP</NEGATIVE> last time with hopes of development, chose to ignore it.

Of course, I am not looking for the data in the exact same format as above.
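To make the target format concrete, here is a minimal sketch of how inline tags like the one above could be turned into standoff records (entity, label, character offsets), which is the shape many corpora use instead. Only the `NEGATIVE` tag appears in my example; the `POSITIVE` and `NEUTRAL` labels and the `extract_labels` helper are assumptions made up for illustration.

```python
import re

# NEGATIVE comes from the example above; POSITIVE and NEUTRAL are assumed labels.
TAG = re.compile(r"<(POSITIVE|NEGATIVE|NEUTRAL)>(.*?)</\1>")


def extract_labels(annotated):
    """Strip inline sentiment tags and return (plain_text, records).

    Each record holds the entity text, its label, and its character
    offsets in the returned plain text.
    """
    parts = []      # pieces of the untagged text
    records = []    # one dict per labelled entity
    pos = 0         # read position in the annotated string
    offset = 0      # write position in the plain string
    for m in TAG.finditer(annotated):
        before = annotated[pos:m.start()]
        parts.append(before)
        offset += len(before)
        entity = m.group(2)
        records.append({
            "entity": entity,
            "label": m.group(1),
            "start": offset,
            "end": offset + len(entity),
        })
        parts.append(entity)
        offset += len(entity)
        pos = m.end()
    parts.append(annotated[pos:])
    return "".join(parts), records


plain, records = extract_labels(
    "favoured the <NEGATIVE>BJP</NEGATIVE> last time"
)
print(plain)
print(records)
```

Any dataset that can be mapped to records of this kind would do.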

svick
redoc

4 Answers

5

Could you explain more about what you need the data for? I'm not aware of any pre-built data sets, but you could attempt to construct your own. You'll need to break the problem into two parts though.

The easiest route to identifying the entities is the OpenCalais API, which, despite its name, is a closed-source service, though it has generous usage limits. You can also look at the American National Corpus, which contains a large number of automatically tagged entities in an open data set.

You'll then need to work out, separately, the sentiment associated with each entity. Doing that with full accuracy is still an AI-complete problem, especially in an example like yours, where it requires understanding the meaning of the sentence. Most sentiment analysis techniques look at the frequency of particular words or small sequences of words. You can find a good overview of the algorithms here, along with some datasets matching words with their sentiment.
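As a rough illustration of the word-frequency approach, here is a minimal sketch that sums lexicon scores for words within a few tokens of an entity mention. The `LEXICON` scores, the `entity_sentiment` helper, and the window size are all made up for this example and are not taken from any published word list or library.

```python
import re

# Toy lexicon with hypothetical scores; a real one would come from a
# published word list matching words with their sentiment.
LEXICON = {
    "favoured": 1,
    "hopes": 1,
    "development": 1,
    "ignore": -1,
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def entity_sentiment(text, entity, window=5):
    """Sum lexicon scores of words within `window` tokens of each mention.

    Only single-token entities are handled in this sketch.
    """
    tokens = tokenize(text)
    target = entity.lower()
    score = 0
    for i, tok in enumerate(tokens):
        if tok == target:
            lo = max(0, i - window)
            nearby = tokens[lo:i] + tokens[i + 1:i + 1 + window]
            score += sum(LEXICON.get(w, 0) for w in nearby)
    return score


sentence = (
    "The state capital, which hugely favoured the BJP last time "
    "with hopes of development, chose to ignore it."
)
print(entity_sentiment(sentence, "BJP"))  # prints 2
```

With this toy lexicon the score comes out positive, even though your example labels the entity as negative: the words near "BJP" are positive, and the negative signal ("chose to ignore it") sits further away and refers to the party only through a pronoun. That is the kind of case where counting words is not enough.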

Pete Warden
  • Thanks, but http://neuro.imm.dtu.dk/wiki/Text_sentiment_analysis does not include the kind of data I am looking for. – redoc May 24 '13 at 12:14
1

I think the simplest answer would be to look at how articles are labelled by the news aggregators.

If you go to http://newsnow.co.uk or https://news.google.com/ or http://www.moreover.com/, you'll see that they aggregate news articles and then label them, which can help you with sentiment.

amelvin
1

The FIRST corpus sounds like it might fit the bill. You might also look at the corpora used in similar research, such as "Good News or Bad News? Let the Market Decide". A Google search along the lines of "sentiment analysis labeled news corpus" (without the quotes) might turn up some other relevant leads to follow.

Putting a high quality corpus together is expensive, which is why you don't see many of them freely available, but there's a lot of work being done on bootstrapping methods to help generate semi-automatically labelled corpora.

Tom Morris
1

I found the MPQA Opinion corpus here: http://mpqa.cs.pitt.edu/

This is close to what I am looking for.

redoc