
I'm looking to create a sentence extraction program, that is, a program that picks out the most important sentences from a body of text. The first step for me is to work out what characteristics important sentences share and how they differ from unimportant sentences.

As such, I am looking for a corpus of documents in which the most important sentences are marked in some way. I already have some metrics in mind, and I would like to infer how important each of those metrics is in distinguishing the interesting sentences from the filler, so I need some labeled data.
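To make the kind of inference I have in mind concrete, here is a minimal sketch in Python. The two metrics (word count and relative position), the function names, and the toy sentences and labels are all made up for illustration and do not come from any real corpus. The score is just a standardized difference in means, which is one simple choice among many.

```python
from statistics import mean, pstdev


def sentence_metrics(sentences):
    """Compute two hypothetical metrics for each sentence of one document."""
    n = len(sentences)
    return [
        {
            # Number of whitespace-separated tokens in the sentence.
            "word_count": len(s.split()),
            # 0.0 for the first sentence, 1.0 for the last one.
            "relative_position": i / (n - 1) if n > 1 else 0.0,
        }
        for i, s in enumerate(sentences)
    ]


def metric_separation(rows, labels, metric):
    """Difference in the metric's mean between important and filler
    sentences, divided by its standard deviation over all sentences."""
    important = [r[metric] for r, y in zip(rows, labels) if y]
    filler = [r[metric] for r, y in zip(rows, labels) if not y]
    if not important or not filler:
        return 0.0
    spread = pstdev(important + filler)
    if spread == 0:
        return 0.0
    return (mean(important) - mean(filler)) / spread


# Toy document with made-up human labels (True = marked as important).
sentences = [
    "The council approved the new budget on Tuesday.",
    "It was a rainy day.",
    "Spending on schools will rise by ten percent.",
    "Many people attended.",
]
labels = [True, False, True, False]

rows = sentence_metrics(sentences)
for metric in ("word_count", "relative_position"):
    print(metric, round(metric_separation(rows, labels, metric), 3))
```

A metric whose score is far from zero separates the two groups well on the labeled data, while a score near zero suggests it carries little signal on its own. With a real human-labeled corpus, the same rows could instead be fed to a classifier to weigh the metrics jointly.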

Edit: I believe that using the output of a different sentence extraction program would defeat the purpose, so I am looking for the results of human work instead.

John Madden
  • Try DeepMind's paper (http://arxiv.org/pdf/1506.03340v1.pdf) and their dataset (on their github.com page). – Anton Tarasenko Dec 26 '15 at 13:42
  • Thanks very much for your reply. I poked through their GitHub and didn't see anything that exactly matched the specifications I'm looking for here. However, that paper was a very interesting read. – John Madden Dec 26 '15 at 20:07
  • https://github.com/deepmind/rc-data – Anton Tarasenko Dec 27 '15 at 06:15
  • What would be the common assessment of "important"? It seems like a considerable judgment call, probably best left to whatever researcher was going to act on the definition. – Joe Germuska Dec 31 '15 at 18:15
  • @AntonTarasenko Thank you for the clarification. I had actually already downloaded that folder. Inside it I found a Python file and a couple of text files, the purpose of which is to generate question/answer pairs. I suppose you're implying I could repurpose this to my specifications. While that is certainly interesting, I think I am looking for the results of actual human work, and I have clarified my initial question. – John Madden Jan 01 '16 at 00:31
  • @JoeGermuska That is a great question, one I was hoping to avoid by looking at the results of someone else's work. But to be specific, I am ideally looking for the results of "Here is an article, circle the sentences which you feel to be important" addressed to an untrained individual. As such, what "important" means is not exactly defined, but is left to be interpreted by the individual completing the task, and may vary. – John Madden Jan 01 '16 at 00:33
  • @JohnMadden Unless you're specifically interested in manually-tagged corpora, you may also want to look for papers about computational methods for identifying the "important" parts of documents. Most of what I've seen is more word-based than sentence based, but I haven't looked that hard. – Joe Germuska Jan 02 '16 at 17:15
  • I believe I am, because I believe that the words in a sentence do not completely explain whether or not the sentence is important, and so I hope to look at manually tagged corpora. Thanks for the suggestion though. – John Madden Jan 03 '16 at 21:51
  • Have you found any interesting corpus? – Franck Dernoncourt Aug 25 '17 at 22:10
  • @FranckDernoncourt I actually ended up making one by paying folks on Mechanical Turk to label docs. I'll throw it on my github and link to it when I get home, thanks for reminding me of this question. – John Madden Aug 26 '17 at 15:24
  • @FranckDernoncourt I actually already had em on there, I've pasted in the link in an answer. – John Madden Aug 26 '17 at 21:13

1 Answer


I actually ended up paying folks on Mechanical Turk to label sentences as important/unimportant in a couple of news articles I downloaded. There are 410 sentences in total, which I have on my GitHub here.

John Madden