As you can see from this NGram, the words in the indexed English corpus that were tagged as nouns, verbs, adjectives, adverbs, determiners, pronouns, adpositions, numerals, conjunctions, or particles add up to around 83% of the total.

This could be an infrastructure or programming issue, but assuming it's not, what could explain why this number is not 100%? Things like interjections are left out, but are 17% of the words in the entirety of Google's indexed literature really interjections, or is there a better explanation?

RegDwigнt
  • From here, they posted the full list of what tags they use. Did you include all of them in your search? –  May 24 '13 at 01:36
  • Yep. You can go to the link in the original post and double check, if you want. – Lincoln Bergeson May 24 '13 at 01:39
  • Hmm, interesting. Is it possible NGrams just don't tag everything? –  May 24 '13 at 01:42
  • This is a database question, not about English. – MetaEd May 24 '13 at 01:48
  • Are there no other parts of speech than the ones I identified in the original post? I was simply asking if there is a linguistic explanation. If there isn't one, I'll head over to stack overflow or something. – Lincoln Bergeson May 24 '13 at 01:49
  • @MετάEd I think the root question is really "what are all the parts of speech", which could still be on-topic –  May 24 '13 at 01:51
  • What's also interesting is that in 1500, the total number of all those parts of speech was around 12%. – Lincoln Bergeson May 24 '13 at 01:53
  • @Lincoln: There are many different categorisation systems for "parts of speech". It would be Not Constructive to ask for a definitive list here. But as MετάEd says, it's simply Off Topic to ask why you get those results from NGram - we can't say why they coded the database and query facilities like that. Most part of speech list results suggest 8 categories - somewhat different to the 10 NGrams uses (but at least the first one included interjections!) :) – FumbleFingers May 24 '13 at 02:01
  • Huh, I didn't realize "parts of speech" was such a broad and loosely defined term. Would someone with the proper authority please move this question to Programmers.StackExchange? (Preferably with the irrelevant comments deleted.) – Lincoln Bergeson May 24 '13 at 02:04

1 Answer

Algorithmically speaking, identifying a word or phrase as a particular part of speech is not possible in all contexts. A word functions as one part of speech in a context and another in a different place, even without any morphological changes.
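
To make the ambiguity concrete, here is a minimal sketch of a context-free lookup. The lexicon is a hypothetical toy, not taken from NGrams or any real tagger, and the function names are made up for illustration. It only shows that a word with more than one possible tag cannot be resolved without context.

```python
# Toy lexicon: hypothetical entries for illustration only,
# not the data or method behind Google NGrams.
TOY_LEXICON = {
    "run": {"NOUN", "VERB"},
    "light": {"NOUN", "VERB", "ADJ"},
    "the": {"DET"},
    "quickly": {"ADV"},
}


def candidate_tags(word):
    """Return every part of speech the toy lexicon allows for a word."""
    return TOY_LEXICON.get(word.lower(), set())


def tag_without_context(word):
    """Return a tag only when the lexicon leaves exactly one choice."""
    tags = candidate_tags(word)
    if len(tags) == 1:
        return next(iter(tags))
    return None  # ambiguous or unknown word: left untagged


if __name__ == "__main__":
    for w in ["the", "run", "light", "zyzzyva"]:
        print(w, tag_without_context(w))
```

A real tagger would look at the surrounding words to choose between the candidates. The point here is only that the same spelling can legitimately carry several tags.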

To make matters more difficult, fiction thrives on using words creatively, leaving the reader to fend for himself as to how a word is to be interpreted or in how many different ways.

The statistics probably reflect that up to about 17 per cent of the words could not be categorically determined to belong to one or the other part of speech -- not that they do not belong to any of the known parts of speech.
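
The arithmetic behind that figure can be sketched as follows. The per-tag percentages are made-up placeholders chosen only so that they sum to the roughly 83% reported in the question; they are not real NGram values, and `untagged_share` is a hypothetical helper name.

```python
def untagged_share(tag_shares):
    """Return the percentage of words not covered by any listed tag."""
    return 100.0 - sum(tag_shares.values())


# Placeholder percentages (NOT real NGram data); they sum to 83.
PLACEHOLDER_SHARES = {
    "NOUN": 25, "VERB": 15, "ADJ": 7, "ADV": 4, "DET": 10,
    "PRON": 5, "ADP": 11, "NUM": 1, "CONJ": 3, "PRT": 2,
}

if __name__ == "__main__":
    print(untagged_share(PLACEHOLDER_SHARES))
```

Whatever the real per-tag values are, the remainder is simply what is left after the ten tagged categories are added up.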

Between circa 1523 and 1650 AD, the share of words that are accounted for climbs steeply from 14.2% to 83.1%, where it settles down and remains to date (2007: 83.9%). That is probably because the English language, especially in fiction writing, was not "normalized" until very recently.

Hm... as you can see this is a mere hypothesis and I could be wrong, even.

Kris
  • I still have my doubts -- why should the figure hover around 80% all the time? What's special about that magic number? – Kris May 24 '13 at 05:54
  • I was just about to comment thus on the question: "It's possible (probable?) that their automated heuristics can't positively identify all words as a particular POS, so they are left as unidentified. (For example, what is unidentified in that sentence?)" – Andrew Leach May 24 '13 at 05:55
  • @AndrewLeach 'Automated heuristics' are built from human knowledge and theories, strengthened by statistics. Even the human reader cannot always tell for sure what a word is: 'For example, what is unidentified in that sentence?' :) – Kris May 24 '13 at 05:59
  • Exactly so: I was agreeing (and have upvoted accordingly). – Andrew Leach May 24 '13 at 06:01
  • It's an -ed word. – Edwin Ashworth May 24 '13 at 08:30
  • Speaking of gradience, there is what appears to be a comprehensive article on -ing forms by Janigova at http://books.google.co.uk/books?id=I3rVBbQ2QHoC&pg=PA73&lpg=PA73&dq=conjunction+followed+by+gerund+gradience&source=bl&ots=lh98hsW_P4&sig=vIwiPslq7koiwCMD1Hrkfe2OS-0&hl=en&sa=X&ei=uUGfUerDHdSg0wWq3ICYBw&ved=0CDMQ6AEwAQ#v=onepage&q=conjunction%20followed%20by%20gerund%20gradience&f=false . Sadly, I can't access all of it, so miss a discussion of terminology, I believe. – Edwin Ashworth May 24 '13 at 10:35