How to tokenize text like in tidytext?

Question

I am trying to reproduce in Python the exploding tokenization of tidytext, where each token ends up on its own row:

> tibble(text = c('hasta la vista baby',
+                 'I am the terminator'),
+        value = c(1,2)) %>% 
+   unnest_tokens(input = 'text',output = 'word', token = 'words')
# A tibble: 8 x 2
  value word      
  <dbl> <chr>     
1     1 hasta     
2     1 la        
3     1 vista     
4     1 baby      
5     2 i         
6     2 am        
7     2 the       
8     2 terminator

Is it possible to do so in Pandas as well? I am focusing on speed of execution here.

import pandas as pd

pd.DataFrame({'text': ['hasta la vista baby', 'I am the terminator'],
              'value': [1,2]})
Out[3]: 
                  text  value
0  hasta la vista baby      1
1  I am the terminator      2

Thanks!

Similar to [this question](https://stackoverflow.com/questions/62216774/extracting-top-words-by-date/62217094#62217094) — Quang Hoang, Jun 05 '20 at 19:46
`df.assign(text=df['text'].str.split()).explode('text')` in pandas — anky, Jun 05 '20 at 19:47
@anky very interesting, thanks! but I guess the pandas native solution only allows a very simple tokenization (here based on white spaces)... — ℕʘʘḆḽḘ, Jun 05 '20 at 19:52
Not necessarily: you can pass a delimiter to `str.split()`. For example, for a comma you would do `df.assign(text=df['text'].str.split(",")).explode('text')`. You can check more [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html) — anky, Jun 05 '20 at 19:54
I believe this is similar to [this](https://stackoverflow.com/questions/53218931/how-to-unnest-explode-a-column-in-a-pandas-dataframe/53218939#53218939). Do you think this is a dupe? It also covers all versions of pandas — anky, Jun 05 '20 at 19:57
Not a dupe, because we focus on text here. Perhaps you can specify how to split on sentences as well? — ℕʘʘḆḽḘ, Jun 05 '20 at 19:59
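
Building on the `str.split` / `explode` suggestion in the comments, here is a rough sketch of how the `unnest_tokens` output from the question might be approximated in pandas. The helper name `unnest_tokens` and its `pattern` parameter are hypothetical (they are not part of pandas), and the `\w+` regex is only an approximation of tidytext's word tokenizer: it lowercases and drops punctuation, but it will, for instance, split contractions at the apostrophe. No claim is made here about how its speed compares to other approaches.

```python
import pandas as pd

df = pd.DataFrame({'text': ['hasta la vista baby', 'I am the terminator'],
                   'value': [1, 2]})

def unnest_tokens(frame, input_col, output_col, pattern=r"\w+"):
    """Hypothetical helper: return one row per token.

    Lowercases the text, extracts every regex match of `pattern`,
    drops the input column, and explodes the token lists into rows.
    """
    # Series.str.findall returns a list of matches for each row
    tokens = frame[input_col].str.lower().str.findall(pattern)
    out = frame.drop(columns=[input_col]).assign(**{output_col: tokens})
    # explode repeats the other columns once per list element
    return out.explode(output_col).reset_index(drop=True)

words = unnest_tokens(df, 'text', 'word')
print(words)
```

A different regex can be passed as `pattern` if another notion of token is needed; proper sentence splitting would likely call for a dedicated tokenizer rather than a regex.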
