5

I want to split a large string by a series of words.

E.g.

$splitby = array('these','are','the','words','to','split','by');
$text = 'This is the string which needs to be split by the above words.';

Then the results would be:

$text[0]='This is';
$text[1]='string which needs';
$text[2]='be';
$text[3]='above';
$text[4]='.';

How can I do this? Is preg_split the best way, or is there a more efficient method? I'd like it to be as fast as possible, as I'll be splitting hundreds of MB of files.

hakre
Alasdair
  • Afternote: racar's answer is the fastest, if array_flip is performed on $splitby and then isset() is used instead of in_array(). preg_split does not work because there are hundreds of words in $splitby. – Alasdair Nov 10 '11 at 07:05

4 Answers

7

This should be reasonably efficient. However, you may want to test it with some of your files and report back on the performance.

$splitby = array('these','are','the','words','to','split','by');
$text = 'This is the string which needs to be split by the above words.';
// implode() takes the separator first; the array-first order was removed in PHP 8.
$pattern = '/\s?'.implode('\s?|\s?', $splitby).'\s?/';
$result = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);
mellamokb
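A possible hardening of the regex approach, offered as a sketch rather than as part of the original answer: escape each word with preg_quote() and wrap the alternation in \b word boundaries so that a split word such as "the" does not match inside "other". The function name split_by_words_regex is hypothetical. Note that a very long word list can still fail to compile, in which case preg_split() returns false.

```php
<?php
// Hypothetical helper (not from the answers): splits $text on whole words only.
function split_by_words_regex(string $text, array $splitby): array
{
    // Escape each word so characters such as "." or "/" are matched literally.
    $escaped = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $splitby);

    // \b restricts matches to whole words; \s* absorbs the surrounding whitespace.
    $pattern = '/\s*\b(?:' . implode('|', $escaped) . ')\b\s*/';

    $parts = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($parts === false) {
        // For example, the pattern was too large to compile.
        throw new RuntimeException('preg_split failed, error code ' . preg_last_error());
    }
    return $parts;
}

$splitby = array('these','are','the','words','to','split','by');
$text = 'This is the string which needs to be split by the above words.';
print_r(split_by_words_regex($text, $splitby));
```

For the sample input this gives the five pieces asked for in the question: "This is", "string which needs", "be", "above" and ".".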
5

preg_split can be used as:

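// Group the alternation so that \s* also applies before the first word and after the last one.
$pieces = preg_split('/\s*(?:'.implode('|',$splitby).')\s*/',$text,-1,PREG_SPLIT_NO_EMPTY);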


codaddict
4

I don't think a PCRE regex is necessary if all you need is to split on whole words.

You could do something like this and benchmark it to see whether it's faster:

$splitby = array('these','are','the','words','to','split','by');
$text = 'This is the string which needs to be split by the above words.';

$split = explode(' ', $text);
$result = array();
$temp = array();

foreach ($split as $s) {

    if (in_array($s, $splitby)) {
        if (sizeof($temp) > 0) {
            $result[] = implode(' ', $temp);
            $temp = array();
        }
    } else {
        $temp[] = $s;
    }
}

if (sizeof($temp) > 0) {
    $result[] = implode(' ', $temp);
}

var_dump($result);

/* output

array(4) {
  [0]=>
  string(7) "This is"
  [1]=>
  string(18) "string which needs"
  [2]=>
  string(2) "be"
  [3]=>
  string(12) "above words."
}

*/

The only difference from your output is the last element: "words." != "words", so it is not treated as a split word.

malletjo
  • Thank you for your help. Though in_array() is very slow for large arrays, preg_split is much faster. – Alasdair Nov 10 '11 at 04:00
  • Maybe you're right, but you may get "Compilation failed: regular expression is too large at offset ******" if you use preg_split. I just tried with an array of 5490 words and it failed. – malletjo Nov 10 '11 at 04:44
  • Well, it turned out that preg_split was taking too long for my liking. See my solution below. Your solution is good, but the in_array() function has problems in PHP. A faster way to check whether a value exists in an array is to array_flip the array and then check for the key with isset(), which is about 1000x faster than using in_array(). – Alasdair Nov 10 '11 at 04:52
  • array_flip + isset seems like a good idea. But the difference is "only" 30ms for an array of 200k elements. – malletjo Nov 10 '11 at 05:12
  • In my experience the difference is seconds vs. hours, literally. I think there's a serious problem with in_array(). Anyway, neither the preg_split nor my method I posted then deleted has achieved what I want. I'm now testing your method modified to use isset(). – Alasdair Nov 10 '11 at 05:22
  • Genius! With modification to use array_flip() & isset() this is both fast and efficient. I'm using this now. Thank you! – Alasdair Nov 10 '11 at 05:43
  • You can still optimize my code by replacing the sizeof calls with a variable, and maybe with some other micro-optimizations. – malletjo Nov 10 '11 at 22:24
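The array_flip() + isset() modification discussed in these comments could look roughly like the following. This is a sketch under assumptions, not the code Alasdair actually ran: the function name split_by_words_lookup is made up for illustration, and the loop otherwise follows the explode-based answer above. Flipping the word list once turns every membership test into a key lookup instead of a scan through the whole array.

```php
<?php
// Hypothetical helper name; the logic mirrors the explode/foreach answer.
function split_by_words_lookup(string $text, array $splitby): array
{
    // Flip once so each split word becomes an array key.
    $lookup = array_flip($splitby);
    $result = array();
    $temp = array();

    foreach (explode(' ', $text) as $word) {
        if (isset($lookup[$word])) {
            // Split word reached: emit the chunk collected so far, if any.
            if ($temp) {
                $result[] = implode(' ', $temp);
                $temp = array();
            }
        } else {
            $temp[] = $word;
        }
    }

    // Emit whatever is left after the last split word.
    if ($temp) {
        $result[] = implode(' ', $temp);
    }
    return $result;
}

$splitby = array('these','are','the','words','to','split','by');
$text = 'This is the string which needs to be split by the above words.';
var_dump(split_by_words_lookup($text, $splitby));
```

As in the original answer, the last element is "above words." because the comparison is exact and "words." carries a trailing period.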
-1

Since the words in your $splitby array are not regular expressions, maybe you can use

str_split

Yada
  • `str_split()` cannot separate a string by a string. It merely splits a string into an array of chunks whose length is given by the second argument (which defaults to 1). – Bailey Parker Nov 10 '11 at 03:20
  • This answer doesn't make sense, considering he wants to split the string by the specific words, not split it into word-sized chunks. – Joe C. Nov 10 '11 at 03:24
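To illustrate the point made in these comments, here is a small sketch of what str_split() does with a sample string (the input is chosen only for illustration): it cuts at fixed lengths and pays no attention to words.

```php
<?php
// str_split() cuts by length, not by delimiter.
$chunks = str_split('split by words', 5);
print_r($chunks); // "split", " by w", "ords"

// With no length argument, every chunk is a single character.
$chars = str_split('abc');
print_r($chars); // "a", "b", "c"
```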