
Q: How can I find noun 2-grams in the English language (e.g., "roller coaster", "test tube")? Better yet, how can I find them together with their relative frequencies (proportions)?

Ultimate goal: Generate distinct single images for each English letter-pair (e.g., "RC" -> "roller coaster" -> distinct image of roller coaster; "TT" -> "test tube" -> distinct image of a test tube)

My attempts:

  1. Woxikon, AcronymFinder, etc. There are some good ideas here. E.g., this is where I found "TT" -> "test tube". But most of these acronyms don't admit nice, distinct, single images.
  2. I've never done any textual analysis before. I adapted the Introduction to tidytext vignette for my case. See script below. It didn't produce what I was looking for. Maybe Jane Austen books aren't the best input :) Still, I think there's probably something more generally wrong with my approach.
  3. Use different rules for generating the letter-pair images. E.g., a phonetic rule where "PB" -> "^p[:vowels:]b.*$" -> "pub" -> image of a specific pub (see the sketch just after this list). Unfortunately, I haven't found any rules that don't require a bunch of conditional exceptions. E.g., the phonetic rule I proposed has problems with double vowels such as "AE".
  4. LaDEC is a database of compound noun 1-grams. This is very close to what I'm looking for. But still, I'm trying to get 2-grams.
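
To make attempt 3 concrete, here's a rough sketch of that phonetic rule in R. The tiny word list and the vowel class [aeiou] are just placeholders for illustration, and it still has the double-vowel problem mentioned above:

```r
# Attempt 3, sketched: letter pair -> regex -> candidate words
words <- c("pub", "probe", "tube", "test", "roller", "coaster")  # toy stand-in for a noun lexicon

pair_to_candidates <- function(pair, wordlist) {
  chars <- strsplit(tolower(pair), "")[[1]]
  # first letter, then one or more vowels, then the second letter, then anything
  pattern <- paste0("^", chars[1], "[aeiou]+", chars[2], ".*$")
  wordlist[grepl(pattern, wordlist)]
}

pair_to_candidates("PB", words)
#> [1] "pub"
```
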
```r
# tidytext.R
# https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html

library(tidyverse)
library(janeaustenr)
library(tidytext)

# Character names and titles to exclude from the results
names <- c("mr", "mrs", "miss", "sir", "captain", "lady", "father", "mother",
           "brother", "sister", "colonel", "jane", "frank", "fanny", "crawford",
           "emma", "elinor", "dr", "grant", "elizabeth", "catherine", "robert",
           "martin", "harriet", "smith", "camden", "edmund", "marianne", "elliot",
           "norris", "anne", "aunt", "tilney", "catherine", "lacey", "thornton")

get_stopwords() -> stop_words

# Keep only words tagged as nouns or pronouns
parts_of_speech %>%
  filter(pos == "Noun" | pos == "Pronoun") -> nouns

austen_books() %>%
  group_by(book) %>%
  mutate(line = row_number(),
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup() -> original_books

# Tokenize into bigrams, drop stop words and digits,
# keep noun-noun pairs, and count them
original_books %>%
  mutate(text = str_remove_all(text, "'")) %>%
  unnest_tokens(word, text, token = "ngrams", n = 2) %>%
  drop_na(word) %>%
  separate(word, into = c("w1", "w2"), sep = "\\s") %>%
  filter(!(str_detect(w1, "\\d") | str_detect(w2, "\\d"))) %>%
  anti_join(stop_words, by = c("w1" = "word")) %>%
  anti_join(stop_words, by = c("w2" = "word")) %>%
  inner_join(nouns, by = c("w1" = "word")) %>%
  inner_join(nouns, by = c("w2" = "word")) %>%
  group_by(w1, w2) %>%
  count() %>%
  ungroup() %>%
  arrange(-n) %>%
  filter(!w1 %in% names) %>%
  filter(!w2 %in% names) -> result

result

# A tibble: 15,160 × 3
   w1      w2           n
   <chr>   <chr>    <int>
 1 great   deal       200
 2 young   man        200
 3 dare    say        180
 4 let     us         128
 5 drawing room       106
 6 one     day         84
 7 young   ladies      82
 8 nothing else        80
 9 one     morning     60
10 young   woman       59
# … with 15,150 more rows

```

lowndrul

2 Answers


One common proxy for meaningful bigrams is frequent co-occurrence; frequently co-occurring n-grams are called collocations.

There are many techniques for finding collocations. One simple method is to count all bigrams, sort them by frequency, and keep only those above a frequency threshold, as in the sketch below.
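
For example, a minimal sketch with the same packages used in the question might look like this (the cutoff value is an arbitrary placeholder, not a recommendation):

```r
library(dplyr)
library(tidyr)
library(tidytext)
library(janeaustenr)

min_count <- 50  # arbitrary frequency threshold; tune to the corpus size

austen_books() %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  drop_na(bigram) %>%                 # lines too short to form a bigram yield NA
  count(bigram, sort = TRUE) %>%      # frequency of each bigram, most common first
  filter(n >= min_count)              # keep only bigrams above the threshold
```

The same POS and stop-word filtering you already do in your question can be layered on top of this before thresholding.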

Brian Spiering
  • I think that's effectively what I did with the Jane Austen books text. I was even able to filter the list down to noun-noun bigrams. Still, what I came up with was unusable for my purpose, where I'm looking for bigrams which are, effectively, compound nouns, admitting a single meaningful image. Did you have something different in mind? – lowndrul Sep 26 '22 at 14:39
  • An alternative I'm considering is searching all Wikipedia titles which are noun-noun bigrams. I'm putzing around with Wikidata Query Service right now. I'm new to this tool and it's taking me longer than I would like. But it's worth a shot. – lowndrul Sep 26 '22 at 14:42

If you have enough data for it, I've gotten some good mileage out of Normalized Pointwise Mutual Information for Collocation Extraction (Bouma, 2009). I use the Python library Gensim for it (specifically Phraser), but there's undoubtedly an R implementation.
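
In case it's useful, here is a tiny hand-rolled R sketch of Bouma's NPMI score computed from raw counts. It only illustrates the formula; the toy bigram table and the normalization choice are assumptions for demonstration, not Gensim's implementation:

```r
library(dplyr)

# Toy input: one row per observed bigram token (stand-in for real corpus output)
bigrams <- tibble::tribble(
  ~w1,      ~w2,
  "roller", "coaster",
  "roller", "coaster",
  "test",   "tube",
  "young",  "man",
  "young",  "lady"
)

n_words <- 2 * nrow(bigrams)                          # total word tokens in the toy corpus
word_p  <- table(c(bigrams$w1, bigrams$w2)) / n_words # unigram probabilities

scores <- bigrams %>%
  count(w1, w2, name = "n_xy") %>%
  mutate(
    p_xy = n_xy / n_words,
    p_x  = as.numeric(word_p[w1]),
    p_y  = as.numeric(word_p[w2]),
    # Bouma (2009): NPMI(x, y) = log(p(x,y) / (p(x) p(y))) / -log(p(x,y)), in [-1, 1]
    npmi = log(p_xy / (p_x * p_y)) / -log(p_xy)
  ) %>%
  arrange(desc(npmi))

scores  # "roller coaster" and "test tube" score 1; "young man" / "young lady" score lower
```

Ranking by NPMI and cutting at a threshold tends to surface tightly bound pairs like "roller coaster" ahead of merely frequent but loosely associated ones like "young man".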

Andy
  • I'd just found Gensim yesterday. It seems that there are plenty of tools there.

    Btw, what text do you use for this sort of exercise? Does Gensim come with pre-loaded texts which you use, similar to that which is in the R package janeaustenr? Or do you look elsewhere?

    – lowndrul Sep 26 '22 at 14:47
  • I used it with an in-house collection of 30M+ news articles, but there are good built-in data sets in different NLP libraries like nltk. – Andy Sep 26 '22 at 14:53