Q: How can I find noun 2-grams in English (e.g., "roller coaster", "test tube")? Better yet, how can I find them along with their relative frequencies (proportions)?
Ultimate goal: Generate a distinct single image for each English letter pair (e.g., "RC" -> "roller coaster" -> a distinct image of a roller coaster; "TT" -> "test tube" -> a distinct image of a test tube).
My attempts:
- Woxikon, AcronymFinder, etc. There are some good ideas here. E.g., this is where I found "TT" -> "test tube". But most of these acronyms don't admit nice, distinct, single images.
- I've never done any textual analysis before. I adapted the Introduction to tidytext vignette for my case. See script below. It didn't produce what I was looking for. Maybe Jane Austen books aren't the best input :) Still, I think there's probably something more generally wrong with my approach.
- Use different rules for generating the letter-pair images, e.g., a phonetic rule where "PB" -> "^p[aeiou]+b.*$" -> "pub" -> image of a specific pub (a rough sketch of this idea is at the end of the post). Unfortunately, I haven't found any rule that doesn't require a bunch of conditional exceptions. E.g., the phonetic rule above breaks down for letter pairs where both letters are vowels, such as "AE".
- LaDEC is a database of compound noun 1-grams (compounds written as a single word). This is very close to what I'm looking for, but I'm after 2-grams (compounds written as two words).
```r
# tidytext.R
# https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html
library(tidyverse)
library(janeaustenr)
library(tidytext)
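# Character names and titles from the novels, to drop from the final results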
names <- c("mr","mrs","miss","sir","captain","lady",
"father","mother", "brother", "sister",
"colonel", "jane", "frank", "fanny",
"crawford", "emma", "elinor", "dr", "grant",
"elizabeth", "catherine", "robert", "martin",
"harriet", "smith", "camden", "edmund", "marianne",
"elliot", "norris", "anne", "aunt", "tilney", "catherine",
"lacey", "thornton")
get_stopwords() ->
stop_words
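# Nouns and pronouns from tidytext's parts_of_speech dataset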
parts_of_speech %>%
filter(pos == "Noun" | pos == "Pronoun") ->
nouns
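# Number the lines and chapters within each book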
austen_books() %>%
group_by(book) %>%
mutate(line = row_number(),
chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup() ->
original_books
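# Tokenize into bigrams, split into two columns, drop stop words and digits,
# keep only noun/pronoun pairs, count each pair, then drop character names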
original_books %>%
mutate(text = str_remove_all(text, "'")) %>%
unnest_tokens(word, text, token = "ngrams", n=2) %>%
drop_na(word) %>%
separate(word, into = c("w1", "w2"), sep = "\\s") %>%
filter(!(str_detect(w1, "\\d") | str_detect(w2, "\\d"))) %>%
anti_join(stop_words, by = c("w1" = "word")) %>%
anti_join(stop_words, by = c("w2" = "word")) %>%
inner_join(nouns, by = c("w1" = "word")) %>%
inner_join(nouns, by = c("w2" = "word")) %>%
group_by(w1, w2) %>%
count() %>%
ungroup() %>%
arrange(-n) %>%
filter(!w1 %in% names) %>%
filter(!w2 %in% names) ->
result
result
# A tibble: 15,160 × 3
w1 w2 n
<chr> <chr> <int>
1 great deal 200
2 young man 200
3 dare say 180
4 let us 128
5 drawing room 106
6 one day 84
7 young ladies 82
8 nothing else 80
9 one morning 60
10 young woman 59
# … with 15,150 more rows
```
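
To make the phonetic-rule bullet above concrete, here is a rough sketch of what I mean (the `pair_regex` helper and the toy `word_list` are placeholders I made up for illustration, not part of any real solution). It also shows the vowel-pair problem:

```r
# phonetic_rule.R -- rough sketch of the "letter pair -> regex -> word" idea
library(stringr)

# Toy candidate list; a real attempt would need a large noun list
word_list <- c("pub", "probe", "tube", "cube", "ale", "apple")

# Build a regex from a letter pair: first letter, one or more vowels,
# second letter, then anything
pair_regex <- function(pair) {
  chars <- strsplit(tolower(pair), "")[[1]]
  paste0("^", chars[1], "[aeiou]+", chars[2], ".*$")
}

pair_regex("PB")
#> [1] "^p[aeiou]+b.*$"
str_subset(word_list, pair_regex("PB"))
#> [1] "pub"

# The problem case: both letters are vowels, so the rule matches
# little or nothing sensible
str_subset(word_list, pair_regex("AE"))
#> character(0)
```

Any rule like this seems to need a pile of conditional exceptions, which is why I'd rather start from actual noun 2-gram frequencies.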