
I have a text that I split into sentences and words. Next I need to extract the punctuation tokens (, . ? ! ...) from it, and this is where I am stuck. Can you advise which regex to use?

This is my code, which splits the text into sentences and words:

String s = ReadFromFile();
String sentences[] = s.split("[.!?]\\s*");
String words[][] = new String[sentences.length][]; 
for (int i = 0; i < sentences.length; ++i)
{
    words[i] = sentences[i].split("[\\p{Punct}\\s]+");
}
System.out.println(Arrays.deepToString(words));

So I have a separate array of sentences and an array of words, but I cannot work out how to get the tokens.

Input data

Arithmetic operators are used in mathematical expressions in the same way that they are used in algebra. The following table lists the arithmetic operators: Assume integer variable A holds 10 and variable B holds 20, then:

Expected result

. : , :

Pshemo
vika

1 Answer


The simplest solution is not to use split, which requires you to describe the things you don't want in the result. Use Matcher#find instead and describe the things you do want to find.

String s = "Arithmetic operators are used in mathematical expressions in the same way that they are used in algebra. The following table lists the arithmetic operators: Assume integer variable A holds 10 and variable B holds 20, then:";

// needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
Pattern p = Pattern.compile("\\p{Punct}");
// or Pattern.compile("[.]{3}|\\p{Punct}"); if you also want to find "..." as a single token
Matcher m = p.matcher(s);
while (m.find()) {
    System.out.println(m.group());
}

Output:

.
:
,
:

Instead of printing m.group(), you can store each match in a collection such as a List.
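A minimal sketch of that idea, collecting the matches into a List instead of printing them. The class name PunctuationTokens and the method findTokens are hypothetical names chosen for this illustration; the pattern is the "..." variant mentioned in the comment above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class PunctuationTokens {
    // "[.]{3}" is listed first so that "..." is kept as one token
    // instead of being matched as three separate dots.
    private static final Pattern TOKEN = Pattern.compile("[.]{3}|\\p{Punct}");

    // Returns every punctuation token in the order it appears in the text.
    static List<String> findTokens(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(findTokens("Wait... what? Yes, really!"));
    }
}
```

A List is convenient here because the number of tokens is not known in advance; call toArray(new String[0]) on it if an array is needed afterwards.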

Pshemo
  • Maybe you know how to fix this error? `String s = ReadFromFile(); int i = 0; int j= 0; String sentences[] = s.split("[.!?]\\s*"); String tokens[][] = new String[sentences.length][]; Pattern p = Pattern.compile("\\p{Punct}"); Matcher m = p.matcher(s); while (m.find()) { if (m.group().equals(".")) { i++; j = 0; tokens[i][j] = m.group(); } else { tokens[i][j] = m.group(); j++; } }` I get a java.lang.NullPointerException on the line tokens[i][j] = m.group(); – vika Oct 23 '15 at 17:05
  • http://stackoverflow.com/questions/218384/what-is-a-null-pointer-exception-and-how-do-i-fix-it – Pshemo Oct 23 '15 at 18:02
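For context on the exception in the comment above: new String[sentences.length][] allocates only the outer array, so every tokens[i] is still null when tokens[i][j] is assigned. One possible way around this, sketched below, is to use nested lists that grow as needed. This is an assumption about what the commenter wants (tokens grouped per sentence); the names TokensPerSentence and tokensPerSentence are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class illustrating one way to avoid the NullPointerException.
class TokensPerSentence {
    private static final Pattern PUNCT = Pattern.compile("\\p{Punct}");

    // Groups punctuation tokens by sentence. A sentence is assumed to end at
    // '.', '!' or '?', mirroring the split("[.!?]\\s*") used in the question.
    static List<List<String>> tokensPerSentence(String text) {
        List<List<String>> result = new ArrayList<>();
        List<String> current = new ArrayList<>();
        Matcher m = PUNCT.matcher(text);
        while (m.find()) {
            String token = m.group();
            current.add(token);
            if (token.equals(".") || token.equals("!") || token.equals("?")) {
                // The terminator closes the current sentence; start a new list.
                result.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            // Tokens that follow the last sentence terminator.
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokensPerSentence("One, two. Three: four, five:"));
    }
}
```

Because each inner list is created before anything is added to it, there is no unallocated row to dereference, and no index bookkeeping with i and j is needed.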