
I need clean text containing only words, with all digits, extra spaces, dashes, commas, dots, brackets, and so on removed. It will feed a word generation algorithm (taken from Gamasutra). I suppose a regular expression can help here. How can I do this with String.split?

UPD:

Input: I have 1337 such a string with different stuff in it: commas, many spaces, digits - 2 3 4, dashes. How can I remove all stuff?

Output: I have such a string with different stuff in it commas many spaces digits dashes How can I remove all stuff

vladfau
  • possible duplicate of [Splitting strings through regular expressions by punctuation and whitespace etc in java](http://stackoverflow.com/questions/7384791/splitting-strings-through-regular-expressions-by-punctuation-and-whitespace-etc) – xlecoustillier Jun 12 '13 at 08:36
  • Please add an example with input text and expected output text. – pepuch Jun 12 '13 at 08:38

3 Answers


In two steps you could do:

String s = "asd asd   asd.asd, asd";
String clean = s.replaceAll("[\\d[^\\w\\s]]+", " ").replaceAll("(\\s{2,})", " ");
System.out.println(clean);

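The first step replaces every run of characters that are neither letters nor whitespace (digits, punctuation, symbols) with a single space. The second step collapses runs of two or more whitespace characters into one space. Note that underscores survive, because \w includes them.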

Output:

asd asd asd asd asd
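
As a rough check, here is the same two-step replacement applied to the input from the question. The class and method names are made up for illustration, and the trailing trim() is an addition: without it, the space that replaces the final "?" would remain at the end of the string.

```java
final class CleanTextSketch {
    // Hypothetical helper; the two replaceAll calls are the ones shown above.
    static String clean(String s) {
        return s.replaceAll("[\\d[^\\w\\s]]+", " ") // digits and punctuation -> one space
                .replaceAll("\\s{2,}", " ")          // collapse runs of whitespace
                .trim();                             // added: drop leading/trailing space
    }

    public static void main(String[] args) {
        String input = "I have 1337 such a string with different stuff in it: commas, "
                + "many spaces, digits - 2 3 4, dashes. How can I remove all stuff?";
        // prints: I have such a string with different stuff in it commas many spaces digits dashes How can I remove all stuff
        System.out.println(clean(input));
    }
}
```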


If all you need is an array containing the words, then this would be enough:

String[] words = s.trim().split("[\\W\\d]+");
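
One caveat with the split approach: trim() only strips whitespace, so if the text starts with a digit or punctuation mark, split returns an empty string as the first element. A minimal sketch that filters those out, assuming Java 8 streams are available (the class and method names are hypothetical):

```java
import java.util.Arrays;

final class SplitWordsSketch {
    // Hypothetical helper: same regex as above, with empty tokens dropped.
    static String[] words(String s) {
        return Arrays.stream(s.split("[\\W\\d]+"))
                .filter(w -> !w.isEmpty()) // a leading separator yields "" at index 0
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // The plain split would give ["", "words", "here"] for this input.
        System.out.println(Arrays.toString(words("1337 words, here!"))); // [words, here]
    }
}
```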
assylias

If you care about Unicode (you should), then use Unicode properties.

String[] result = s.split("\\P{L}+");

\p{L} is the Unicode property for a letter in any language.

\P{L} is the negation of \p{L}: it matches everything that is not a letter. (As I understand it, that is what you want.)
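
To illustrate the difference, the sketch below splits the same text with an ASCII-oriented pattern and with \P{L}. By default, Java's \w covers only [a-zA-Z_0-9], so accented letters act as separators there. The sample string and all names are invented for the example, and the accented characters are written as Unicode escapes to keep the source encoding-independent.

```java
import java.util.Arrays;

final class UnicodeSplitSketch {
    // Splits on anything outside ASCII word characters, plus digits.
    static String[] asciiWords(String s) {
        return s.split("[\\W\\d]+");
    }

    // Splits on anything that is not a Unicode letter.
    static String[] letterWords(String s) {
        return s.split("\\P{L}+");
    }

    public static void main(String[] args) {
        // "naive cafe, 42 grosse" with i-diaeresis, e-acute, o-umlaut and sharp s
        String s = "na\u00EFve caf\u00E9, 42 gr\u00F6\u00DFe";
        System.out.println(Arrays.toString(asciiWords(s)));  // [na, ve, caf, gr, e]
        System.out.println(Arrays.toString(letterWords(s))); // the three words, intact
    }
}
```

As with any split, a text that begins with a non-letter still produces an empty first element.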

stema
  • This perfectly fits in our scenario where we required an accurate length of strings coming from WordPress (via GraphQL). They were showing a different length (`string.length`, usually +1 than the real length) due to the presence of non-Unicode characters and this helped to purge those. – KeshavDulal Dec 15 '21 at 06:31

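I would do it this way:

    str = str.replaceAll("\\p{Punct}|\\d", " ");  // punctuation and digits become spaces, so "a.b" stays two words
    str = str.replaceAll("\\s+", " ").trim();     // then collapse whitespace and drop leading/trailing spaces
    String[] words = str.split(" ");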
Evgeniy Dorofeev