Get the most used words with special characters

Question

I want to get the most used word from an array. The only problem is that the Swedish characters (Å, Ä, and Ö) will only show as �.

$string = 'This is just a test post with the Swedish characters Å, Ä, and Ö. Also as lower cased characters: å, ä, and ö.';
echo '<pre>';
print_r(array_count_values(str_word_count($string, 1, 'àáãâçêéíîóõôúÀÁÃÂÇÊÉÍÎÓÕÔÚ')));
echo '</pre>';

That code will output the following:

Array
(
    [This] => 1
    [is] => 1
    [just] => 1
    [a] => 1
    [test] => 1
    [post] => 1
    [with] => 1
    [the] => 1
    [Swedish] => 1
    [characters] => 2
    [�] => 1
    [�] => 1
    [and] => 2
    [�] => 1
    [Also] => 1
    [as] => 1
    [lower] => 1
    [cased] => 1
    [�] => 1
    [�] => 1
    [�] => 1
)

How can I make it "see" the Swedish characters and other special characters?

You shouldn't be surprised by any PHP function with a name starting with `str` not being multi-byte safe. The user comments in the manual suggest alternatives. — CBroe, Sep 24 '16 at 11:51
@CBroe `...PHP function with a name starting with str...` where is this function? — SaidbakR, Sep 24 '16 at 12:02
Try the function `mb_str_word_count` instead of `str_word_count`: http://stackoverflow.com/a/17725577/6797531 — CatalinB, Sep 24 '16 at 12:07
@CatalinB Thank you but the output will then be like this: `Array([This is just a test post with the Swedish characters �, �, and Ö. Also as lower cased characters: �, �, and �.] => 1)` — Airikr, Sep 24 '16 at 13:03

user3942918 · Answer 1 · 2016-09-24T14:51:38.223

All of this is running under the assumption that you're using UTF-8.

You can take a naive approach using preg_split() to split your string on any separator, punctuation, or control character.

`preg_split` example:

$split = preg_split('/[\pZ\pP\pC]/u', $string, -1, PREG_SPLIT_NO_EMPTY);
print_r(array_count_values($split));

Output:

Array
(
    [This] => 1
    [is] => 1
    [just] => 1
    [a] => 1
    [test] => 1
    [post] => 1
    [with] => 1
    [the] => 1
    [Swedish] => 1
    [characters] => 2
    [Å] => 1
    [Ä] => 1
    [and] => 2
    [Ö] => 1
    [Also] => 1
    [as] => 1
    [lower] => 1
    [cased] => 1
    [å] => 1
    [ä] => 1
    [ö] => 1
)

This works fine for your given string, but it does not necessarily split words in a locale-aware way. For example, a contraction such as "isn't" would be broken up into "isn" and "t", because the apostrophe counts as punctuation.
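To see the limitation concretely, here is a small sketch that runs the same pattern over a sentence containing a contraction and a hyphenated word. The sample sentence is made up for illustration.

```php
<?php
// Same pattern as above: split on separators (\pZ), punctuation (\pP),
// and control characters (\pC).
$sample = "This isn't locale-aware";
$split = preg_split('/[\pZ\pP\pC]/u', $sample, -1, PREG_SPLIT_NO_EMPTY);

// The apostrophe and the hyphen are both punctuation, so they act as
// word boundaries just like the spaces do.
print_r($split);
```

Here "isn't" comes back as "isn" and "t", and "locale-aware" as "locale" and "aware".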

Thankfully the Intl extension adds a great deal of functionality for dealing with things like this in PHP 7.

The plan would be to:

- *Normalize the input with Normalizer::normalize() to make sure graphemes are all encoded in a consistent manner. For example ä might be encoded, and hence counted, in a couple ways:
  - U+00E4 'LATIN SMALL LETTER A WITH DIAERESIS' or
  - U+0061 'LATIN SMALL LETTER A' followed by U+0308 'COMBINING DIAERESIS'
- Get an IntlBreakIterator that breaks on words in a locale-dependent way via IntlBreakIterator::createWordInstance(). This understands what makes up a "word" for the given locale, including handling contractions like "isn't".
- Get its IntlPartsIterator via IntlBreakIterator::getPartsIterator() for ease of iterating over the text fragments.
- Skip things you don't care about via IntlChar::ispunct() and IntlChar::isspace().

(*Note that you'll likely want to perform normalization regardless of what method you use to break up the string - it'd be appropriate to do before the preg_split above or whatever you decide to go with.)
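The effect of the normalization step can be illustrated with a minimal sketch, assuming the Intl extension is loaded. The word "här" is just a sample chosen for this illustration; it is written once with the precomposed ä and once with the combining sequence.

```php
<?php
// Two encodings of the same word; "\u{...}" escapes need PHP 7+.
$precomposed = "h\u{00E4}r";   // ä as U+00E4
$decomposed  = "ha\u{0308}r";  // a (U+0061) followed by U+0308 COMBINING DIAERESIS

// Without normalization these are different byte strings,
// so they end up under two separate keys.
$raw = array_count_values([$precomposed, $decomposed]);

// Normalizer::normalize() defaults to NFC, which composes both
// spellings into the U+00E4 form.
$normalized = array_count_values([
    Normalizer::normalize($precomposed),
    Normalizer::normalize($decomposed),
]);

var_dump(count($raw));        // int(2)
var_dump(count($normalized)); // int(1)
```

After normalization the word is counted twice under a single key instead of once under each of two keys that look identical on screen.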

Intl example:

$string = Normalizer::normalize($string);

$iter = IntlBreakIterator::createWordInstance("sv_SE");
$iter->setText($string);
$words = $iter->getPartsIterator();

$split = [];
foreach ($words as $word) {
    // skip text fragments consisting only of a space or punctuation character
    if (IntlChar::isspace($word) || IntlChar::ispunct($word)) {
        continue;
    }
    $split[] = $word;
}

print_r(array_count_values($split));

Output:

Array
(
    [This] => 1
    [is] => 1
    [just] => 1
    [a] => 1
    [test] => 1
    [post] => 1
    [with] => 1
    [the] => 1
    [Swedish] => 1
    [characters] => 2
    [Å] => 1
    [Ä] => 1
    [and] => 2
    [Ö] => 1
    [Also] => 1
    [as] => 1
    [lower] => 1
    [cased] => 1
    [å] => 1
    [ä] => 1
    [ö] => 1
)

This is more verbose but may be worthwhile if you'd prefer ICU (the library backing the Intl extension) to do the heavy lifting when it comes to understanding what makes up a word.
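As a variation on the above, and only as a sketch: ICU also tags each boundary with a rule status, which can be used instead of the IntlChar checks to decide whether a fragment is a word. The helper name `countWords` and the sample sentence are hypothetical. The sketch assumes that createWordInstance() returns an IntlRuleBasedBreakIterator (the class that provides getRuleStatus()) and that boundary positions are byte offsets into the UTF-8 string.

```php
<?php
function countWords(string $text, string $locale = 'sv_SE'): array
{
    $text = Normalizer::normalize($text);

    $iter = IntlBreakIterator::createWordInstance($locale);
    if (!$iter instanceof IntlRuleBasedBreakIterator) {
        throw new RuntimeException('Rule status is not available for this iterator');
    }
    $iter->setText($text);

    $counts = [];
    $start = $iter->first();
    while (($end = $iter->next()) !== IntlBreakIterator::DONE) {
        // Statuses below WORD_NONE_LIMIT mark spaces, punctuation, and
        // other fragments that are not words.
        if ($iter->getRuleStatus() >= IntlBreakIterator::WORD_NONE_LIMIT) {
            $word = substr($text, $start, $end - $start);
            $counts[$word] = ($counts[$word] ?? 0) + 1;
        }
        $start = $end;
    }

    return $counts;
}

$counts = countWords("Det är bra. Isn't det bra?");
arsort($counts); // most frequent word first
print_r($counts);
```

Sorting with arsort() puts the most used word at the top, which is what the question ultimately asks for. Note that "Det" and "det" are still counted separately; lowercasing with mb_strtolower() before counting would merge them.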

Many thanks for a very detailed answer. Both your answer and MarZab's answer are very good. Your regex will accept smileys while MarZab's regex will not. If I could, I would accept both answers, but since MarZab's regex doesn't accept smileys, I'll accept his answer instead. — Airikr, Sep 24 '16 at 14:13

score 1 · Accepted Answer · edited Oct 03 '16 at 17:04

Here is a regex solution that splits the string on Unicode punctuation and whitespace to get the "words", followed by a regular occurrence count on the resulting array.

print_r(array_count_values(preg_split('/[[:punct:]\s]+/u', $string, -1, PREG_SPLIT_NO_EMPTY)));

Produces:

Array
(
    [This] => 1
    [is] => 1
    [just] => 1
    [a] => 1
    [test] => 1
    [post] => 1
    [with] => 1
    [the] => 1
    [Swedish] => 1
    [characters] => 2
    [Å] => 1
    [Ä] => 1
    [and] => 2
    [Ö] => 1
    [Also] => 1
    [as] => 1
    [lower] => 1
    [cased] => 1
    [å] => 1
    [ä] => 1
    [ö] => 1
)

This was tested in a Unicode console. If you are viewing the output in a browser, you may need to declare the encoding explicitly: add a <meta> tag, set the encoding in your browser, or send a Content-Type header from PHP.
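For the browser case, a minimal sketch of declaring UTF-8 both in the HTTP header and in the markup might look like this. The sample string and the surrounding HTML skeleton are assumptions made for the illustration.

```php
<?php
// Tell the browser the response is UTF-8; must be sent before any output.
header('Content-Type: text/html; charset=utf-8');

$string = 'Å, Ä, and Ö';
$counts = array_count_values(
    preg_split('/[[:punct:]\s]+/u', $string, -1, PREG_SPLIT_NO_EMPTY)
);

// The <meta> tag repeats the declaration inside the document itself.
echo '<!DOCTYPE html><html><head><meta charset="utf-8"></head><body><pre>';
echo htmlspecialchars(print_r($counts, true), ENT_QUOTES, 'UTF-8');
echo '</pre></body></html>';
```

The PHP source file itself also has to be saved as UTF-8 for the literal characters in `$string` to be valid input for the `/u` modifier.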

score 0 · Answer 3 · answered Sep 24 '16 at 13:15

I managed to get rid of the � sign by adding ÅåÄäÖö to the character list àáãâçêéíîóõôúÀÁÃÂÇÊÉÍÎÓÕÔÚ passed to str_word_count().
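A short sketch of why this works and where it stops working, assuming UTF-8 input: str_word_count() treats the extra character list as a set of single bytes, so a multi-byte letter only survives if every one of its bytes happens to be in the list. The sample words are made up for the illustration.

```php
<?php
// "\u{...}" escapes (PHP 7+) keep this independent of the file's encoding.
$swedish = "\u{C5}\u{E5}\u{C4}\u{E4}\u{D6}\u{F6}"; // ÅåÄäÖö

// "smörgås": the bytes of ö and å are all in the list, so the word stays whole.
$ok = str_word_count("sm\u{F6}rg\u{E5}s", 1, $swedish);

// "café": é is encoded as the bytes C3 A9. Only C3 is in the list,
// so the word is cut in the middle of the character.
$cut = str_word_count("caf\u{E9}", 1, $swedish);

var_dump($ok[0] === "sm\u{F6}rg\u{E5}s"); // bool(true)
echo bin2hex($cut[0]), "\n";              // hex dump of the truncated word
```

This means the list has to be extended again for every new accented letter, which is the main reason the regex and Intl approaches are more robust.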

Airikr
