All of this assumes that your strings are encoded as UTF-8.
You can take a naive approach using preg_split() to split your string on any separator (\pZ), punctuation (\pP), or control/other (\pC) character.
preg_split example:
$split = preg_split('/[\pZ\pP\pC]/u', $string, -1, PREG_SPLIT_NO_EMPTY);
print_r(array_count_values($split));
Output:
Array
(
[This] => 1
[is] => 1
[just] => 1
[a] => 1
[test] => 1
[post] => 1
[with] => 1
[the] => 1
[Swedish] => 1
[characters] => 2
[Å] => 1
[Ä] => 1
[and] => 2
[Ö] => 1
[Also] => 1
[as] => 1
[lower] => 1
[cased] => 1
[å] => 1
[ä] => 1
[ö] => 1
)
This works fine for your given string, but it does not necessarily split words in a locale-aware way. For example, contractions such as "isn't" would be broken up into "isn" and "t".
Thankfully, the Intl extension (including the IntlChar class added in PHP 7) provides a great deal of functionality for dealing with things like this.
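To make the contraction problem concrete, here is a minimal sketch that runs the same pattern over a short English sentence. The sample sentence is made up for illustration.

```php
<?php
// Same pattern as above: split on separators, punctuation, and "other" characters.
$sample = "This isn't a test";
$split = preg_split('/[\pZ\pP\pC]/u', $sample, -1, PREG_SPLIT_NO_EMPTY);
print_r($split);
```

The apostrophe falls under the punctuation category, so "isn't" comes back as two separate fragments.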
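The plan would be to:
1. Normalize the string*
2. Break it into text fragments with a locale-aware IntlBreakIterator
3. Skip the fragments that are only whitespace or punctuation
4. Count the remaining words
(*Note that you'll likely want to normalize regardless of which method you use to break up the string. It would be just as appropriate before the preg_split() call above.)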
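As a rough illustration of why normalization matters for counting, the sketch below counts the same letter written two ways: the precomposed "Å" (U+00C5) and "A" followed by a combining ring (U+030A). The sample values are made up for the example, and it relies on Normalizer::normalize() defaulting to NFC.

```php
<?php
// Two byte-wise different spellings of the same visible character (illustrative data).
$precomposed = "\u{00C5}";   // Å as a single code point
$decomposed  = "A\u{030A}";  // A followed by COMBINING RING ABOVE

// Without normalization the two forms end up under different array keys.
$raw = array_count_values([$precomposed, $decomposed]);

// After normalization (NFC by default) both are U+00C5 and are counted together.
$normalized = array_count_values([
    Normalizer::normalize($precomposed),
    Normalizer::normalize($decomposed),
]);

print_r($raw);
print_r($normalized);
```

Without the normalization step, a word typed with combining marks would be tallied separately from the visually identical precomposed spelling.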
Intl example:
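// Normalize (NFC by default) so composed and decomposed forms compare equal
$string = Normalizer::normalize($string);
// Word-boundary iterator using the rules for Swedish (Sweden)
$iter = IntlBreakIterator::createWordInstance("sv_SE");
$iter->setText($string);
// Iterate over the text fragments between consecutive boundaries
$words = $iter->getPartsIterator();
$split = [];
foreach ($words as $word) {
    // skip text fragments consisting only of a single space or punctuation character
    // (the IntlChar functions expect exactly one code point, so longer fragments are kept)
    if (IntlChar::isspace($word) || IntlChar::ispunct($word)) {
        continue;
    }
    $split[] = $word;
}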
print_r(array_count_values($split));
Output:
Array
(
[This] => 1
[is] => 1
[just] => 1
[a] => 1
[test] => 1
[post] => 1
[with] => 1
[the] => 1
[Swedish] => 1
[characters] => 2
[Å] => 1
[Ä] => 1
[and] => 2
[Ö] => 1
[Also] => 1
[as] => 1
[lower] => 1
[cased] => 1
[å] => 1
[ä] => 1
[ö] => 1
)
This is more verbose but may be worthwhile if you'd prefer ICU (the library backing the Intl extension) to do the heavy lifting when it comes to understanding what makes up a word.
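If you would rather let ICU decide which fragments are words, one possible variation is to check the break iterator's rule status instead of testing characters with IntlChar. The sketch below is only an illustration: splitWords() is a hypothetical helper name, the sample sentence is made up, and it assumes that getRuleStatus() describes the fragment ending at the boundary just returned by next().

```php
<?php
// Hypothetical helper (not part of the Intl extension): return word-like fragments only.
function splitWords(string $text, string $locale): array
{
    $text = Normalizer::normalize($text);
    $iter = IntlBreakIterator::createWordInstance($locale);
    $iter->setText($text);

    $words = [];
    $start = $iter->first();
    for ($end = $iter->next(); $end !== IntlBreakIterator::DONE; $start = $end, $end = $iter->next()) {
        // Assumption: the rule status refers to the fragment ending at the current boundary.
        $status = $iter->getRuleStatus();
        // Statuses in [WORD_NONE, WORD_NONE_LIMIT) mark spaces and punctuation.
        if ($status >= IntlBreakIterator::WORD_NONE && $status < IntlBreakIterator::WORD_NONE_LIMIT) {
            continue;
        }
        // Boundaries are byte offsets into the UTF-8 string, so substr() is used here.
        $words[] = substr($text, $start, $end - $start);
    }
    return $words;
}

print_r(array_count_values(splitWords("This isn't a test, a test with å and Å.", "sv_SE")));
```

Because the filter looks at the rule status rather than a single code point, runs of several spaces or punctuation marks are skipped as well, and contractions such as "isn't" are kept as one word.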