3

As explained here, t-SNE maps high dimensional data such as word embedding into a lower dimension in such that the distance between two words roughly describe the similarity. It also begins to create naturally forming clusters. For example with the code

if(!"pacman" %in% installed.packages()[,"Package"]) install.packages("pacman")
pacman::p_load(dplyr)
# grab reviews
reviews_all = read.csv("https://raw.githubusercontent.com/rjsaito/Just-R- 
Things/master/NLP/sample_reviews_venom.csv", stringsAsFactors = F)
# create ID for reviews
review_df <- reviews_all %>%
  mutate(id = row_number())
str(reviews_all)
pacman::p_load(text2vec, tm, ggrepel)
tokens <- space_tokenizer(reviews_all$comments %>%
                          tolower() %>%
                          removePunctuation())
# Create vocabulary. Terms will be unigrams (simple words).
it = itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it)

vocab <- prune_vocabulary(vocab, term_count_min = 5L)

Use our filtered vocabulary

vectorizer <- vocab_vectorizer(vocab)

use window of 5 for context words

tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)

glove = GlobalVectors$new(rank = 50, x_max = 10) glove$fit_transform(tcm, n_iter = 20)

word_vectors = glove$components

load packages

pacman::p_load(tm, Rtsne, tibble, tidytext, scales)

create vector of words to keep, before applying tsne (i.e. remove stop words)

keep_words <- setdiff(colnames(word_vectors), stopwords())

keep words in vector

word_vec <- word_vectors[, keep_words]

prepare data frame to train

train_df <- data.frame(t(word_vec)) %>% rownames_to_column("word")

train tsne for visualization

tsne <- Rtsne(train_df[,-1], dims = 2, perplexity = 50, verbose=TRUE, max_iter = 500)

create plot

colors = rainbow(length(unique(train_df$word))) names(colors) = unique(train_df$word)

plot_df <- data.frame(tsne$Y) %>% mutate( word = train_df$word, col = colors[train_df$word] ) %>% left_join(vocab, by = c("word" = "term")) %>% filter(doc_count >= 20)

p <- ggplot(plot_df, aes(X1, X2)) + geom_text(aes(X1, X2, label = word, color = col), size = 3) + xlab("") + ylab("") + theme(legend.position = "none") p

We obtain the following picture.

enter image description here

What can be done after this preliminary analysis? is it possible to get the word list for each cluster? Or can a clustering algorithm be applied to the points represented in the image (and stored in plot_df)?

Mark
  • 307

1 Answers1

4

You can, of course, use any numerical clustering (k-means, hierarchical clustering, spectral clustering...) on the projected data (the points). But why should it be better than clustering in the original, high dimensional feature space?

Igor F.
  • 9,089