
Q: How can I find noun 2-grams in the English language (e.g., "roller coaster", "test tube")? Better yet, how can I find them together with their relative frequencies (proportions)?

Ultimate goal: Generate distinct single images for each English letter-pair (e.g., "RC" -> "roller coaster" -> distinct image of roller coaster; "TT" -> "test tube" -> distinct image of a test tube)

My attempts:

  1. Woxikon, AcronymFinder, etc. There are some good ideas here. E.g., this is where I found "TT" -> "test tube". But most of these acronyms don't admit nice, distinct, single images.
  2. I've never done any textual analysis before. I adapted the Introduction to tidytext vignette for my case. See script below. It didn't produce what I was looking for. Maybe Jane Austen books aren't the best input :) Still, I think there's probably something more generally wrong with my approach.
  3. Use different rules for generating the letter-pair images. E.g., a phonetic rule where "PB" -> "^p[:vowels:]b.*$" -> "pub" -> image of a specific pub (see the sketch just after this list). Unfortunately, I haven't found any rules that don't require a bunch of conditional exceptions. E.g., the phonetic rule I proposed has problems with double vowels such as "AE".
  4. LaDEC is a database of compound noun 1-grams. This is very close to what I'm looking for. But still, I'm trying to get 2-grams.
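
To make attempt 3 concrete, here's a rough sketch of that phonetic rule in R. The tiny word list and the vowel class [aeiou] are just placeholders for illustration, and it still has the double-vowel problem mentioned above:

```r
# Attempt 3, sketched: letter pair -> regex -> candidate words
words <- c("pub", "probe", "tube", "test", "roller", "coaster")  # toy stand-in for a noun lexicon

pair_to_candidates <- function(pair, wordlist) {
  chars <- strsplit(tolower(pair), "")[[1]]
  # first letter, then one or more vowels, then the second letter, then anything
  pattern <- paste0("^", chars[1], "[aeiou]+", chars[2], ".*$")
  wordlist[grepl(pattern, wordlist)]
}

pair_to_candidates("PB", words)
#> [1] "pub"
```
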
```r
# tidytext.R
# https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html

library(tidyverse)
library(janeaustenr)
library(tidytext)

# Character names and titles to exclude from the results
names <- c("mr", "mrs", "miss", "sir", "captain", "lady", "father", "mother",
           "brother", "sister", "colonel", "jane", "frank", "fanny", "crawford",
           "emma", "elinor", "dr", "grant", "elizabeth", "catherine", "robert",
           "martin", "harriet", "smith", "camden", "edmund", "marianne", "elliot",
           "norris", "anne", "aunt", "tilney", "catherine", "lacey", "thornton")

get_stopwords() -> stop_words

# Keep only words tagged as nouns or pronouns
parts_of_speech %>%
  filter(pos == "Noun" | pos == "Pronoun") -> nouns

austen_books() %>%
  group_by(book) %>%
  mutate(line = row_number(),
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup() -> original_books

# Tokenize into bigrams, drop stop words and digits,
# keep noun-noun pairs, and count them
original_books %>%
  mutate(text = str_remove_all(text, "'")) %>%
  unnest_tokens(word, text, token = "ngrams", n = 2) %>%
  drop_na(word) %>%
  separate(word, into = c("w1", "w2"), sep = "\\s") %>%
  filter(!(str_detect(w1, "\\d") | str_detect(w2, "\\d"))) %>%
  anti_join(stop_words, by = c("w1" = "word")) %>%
  anti_join(stop_words, by = c("w2" = "word")) %>%
  inner_join(nouns, by = c("w1" = "word")) %>%
  inner_join(nouns, by = c("w2" = "word")) %>%
  group_by(w1, w2) %>%
  count() %>%
  ungroup() %>%
  arrange(-n) %>%
  filter(!w1 %in% names) %>%
  filter(!w2 %in% names) -> result

result

# A tibble: 15,160 × 3
   w1      w2           n
   <chr>   <chr>    <int>
 1 great   deal       200
 2 young   man        200
 3 dare    say        180
 4 let     us         128
 5 drawing room       106
 6 one     day         84
 7 young   ladies      82
 8 nothing else        80
 9 one     morning     60
10 young   woman       59
# … with 15,150 more rows

```

lowndrul

2 Answers


One common proxy for meaningful bigrams is frequent co-occurrence; frequently co-occurring n-grams are called collocations.

There are many techniques for finding collocations. One simple method is to count all bigrams, sort them by frequency, and keep only those above a frequency threshold, as in the sketch below.
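
For example, a minimal sketch with the same packages used in the question might look like this (the cutoff value is an arbitrary placeholder, not a recommendation):

```r
library(dplyr)
library(tidyr)
library(tidytext)
library(janeaustenr)

min_count <- 50  # arbitrary frequency threshold; tune to the corpus size

austen_books() %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  drop_na(bigram) %>%                 # lines too short to form a bigram yield NA
  count(bigram, sort = TRUE) %>%      # frequency of each bigram, most common first
  filter(n >= min_count)              # keep only bigrams above the threshold
```

The same POS and stop-word filtering you already do in your question can be layered on top of this before thresholding.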

Brian Spiering
  • I think that's effectively what I did with the Jane Austen books text. I was even able to filter the list down to noun-noun bigrams. Still, what I came up with was unusable for my purpose, where I'm looking for bigrams which are, effectively, compound nouns, admitting a single meaningful image. Did you have something different in mind? – lowndrul Sep 26 '22 at 14:39
  • An alternative I'm considering is searching all Wikipedia titles which are noun-noun bigrams. I'm putzing around with Wikidata Query Service right now. I'm new to this tool and it's taking me longer than I would like. But it's worth a shot. – lowndrul Sep 26 '22 at 14:42

If you have enough data for it, I've gotten some good mileage out of Normalized Pointwise Mutual Information for Collocation Extraction (Bouma, 2009). I use the Python library Gensim for it (specifically Phraser), but there's undoubtedly an R implementation.
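
In case it's useful, here is a tiny hand-rolled R sketch of Bouma's NPMI score computed from raw counts. It only illustrates the formula; the toy bigram table and the normalization choice are assumptions for demonstration, not Gensim's implementation:

```r
library(dplyr)

# Toy input: one row per observed bigram token (stand-in for real corpus output)
bigrams <- tibble::tribble(
  ~w1,      ~w2,
  "roller", "coaster",
  "roller", "coaster",
  "test",   "tube",
  "young",  "man",
  "young",  "lady"
)

n_words <- 2 * nrow(bigrams)                          # total word tokens in the toy corpus
word_p  <- table(c(bigrams$w1, bigrams$w2)) / n_words # unigram probabilities

scores <- bigrams %>%
  count(w1, w2, name = "n_xy") %>%
  mutate(
    p_xy = n_xy / n_words,
    p_x  = as.numeric(word_p[w1]),
    p_y  = as.numeric(word_p[w2]),
    # Bouma (2009): NPMI(x, y) = log(p(x,y) / (p(x) p(y))) / -log(p(x,y)), in [-1, 1]
    npmi = log(p_xy / (p_x * p_y)) / -log(p_xy)
  ) %>%
  arrange(desc(npmi))

scores  # "roller coaster" and "test tube" score 1; "young man" / "young lady" score lower
```

Ranking by NPMI and cutting at a threshold tends to surface tightly bound pairs like "roller coaster" ahead of merely frequent but loosely associated ones like "young man".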

Andy
  • I'd just found Gensim yesterday. It seems that there are plenty of tools there.

    Btw, what text do you use for this sort of exercise? Does Gensim come with pre-loaded texts which you use, similar to that which is in the R package janeaustenr? Or do you look elsewhere?

    – lowndrul Sep 26 '22 at 14:47
  • I used it with an in-house collection of 30M+ news articles, but there are good built-in data sets in different NLP libraries like nltk. – Andy Sep 26 '22 at 14:53