
I used tokenizer = RegexpTokenizer(r'\w+'), which keeps only alphanumeric tokens. But how do I change the regular expression so that it drops everything else and keeps only tokens longer than 2 characters?

Below is one row of the dataframe, which contains random text:

0 [ANOTHER 2'' F/P SAMPLE 01:52 ...A13232 / AS OUTPUT MSG...

Wiktor Stribiżew
Hackerds

1 Answer


I think you need to match words longer than 2 characters:

RegexpTokenizer(r'\w{3,}')

Or, if you need only letters:

RegexpTokenizer(r'[a-zA-Z]{3,}')
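
To see how the two patterns differ, here is a minimal sketch using only the standard library. It assumes a shortened sample string loosely based on the row in the question (the real row is truncated, so the text here is made up). It uses re.findall as a stand-in for the tokenizer: for a pattern with no capturing groups, RegexpTokenizer(pattern).tokenize(text) should return the same matches. The helper name tokenize is hypothetical.

```python
import re

# Assumed sample text, loosely based on the visible part of the row in the question.
text = "ANOTHER 2'' F/P SAMPLE 01:52 A13232 / AS OUTPUT MSG"

def tokenize(pattern, text):
    # Hypothetical helper standing in for RegexpTokenizer(pattern).tokenize(text);
    # it returns every non-overlapping match of the pattern.
    return re.findall(pattern, text)

# Runs of 3 or more word characters (letters, digits, underscore).
words = tokenize(r'\w{3,}', text)

# Runs of 3 or more ASCII letters only.
letters = tokenize(r'[a-zA-Z]{3,}', text)

print(words)    # ['ANOTHER', 'SAMPLE', 'A13232', 'OUTPUT', 'MSG']
print(letters)  # ['ANOTHER', 'SAMPLE', 'OUTPUT', 'MSG']
```

Note that the letters-only pattern matches runs of letters inside mixed tokens too, so a token such as ABC123 would yield ABC. With NLTK and pandas, something like df[col].apply(tokenizer.tokenize) would apply the tokenizer to each row, where col is a placeholder for your column name.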
jezrael