I am working in PHP.
I have some text from a web page (an HTML file) that has been through a readability/simplification process, and I now want to split it into phrases/messages of no more than a certain number of characters.
Currently, I start with the full page and run strip_tags on it, keeping only the paragraph tags. I then replace the closing paragraph tags with nothing and explode on the opening paragraph tags. (This idea was taken from another question/answer hereabouts.)
This gives an array of paragraphs.
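Roughly, the paragraph step looks like the sketch below. The function name `split_paragraphs` is made up for illustration, and matching opening tags with a regular expression (so that attributes are tolerated) is an assumption on top of the plain explode described above.

```php
<?php
// Hypothetical helper: turn simplified HTML into an array of paragraph strings.
function split_paragraphs(string $html): array {
    // Keep only <p> tags so the paragraph boundaries survive.
    $text = strip_tags($html, '<p>');
    // Drop closing tags, then split on opening tags (with or without attributes).
    $text = str_ireplace('</p>', '', $text);
    $parts = preg_split('/<p\b[^>]*>/i', $text);
    // Trim whitespace and discard empty pieces, such as the empty string
    // that precedes the first <p>.
    $parts = array_map('trim', $parts);
    return array_values(array_filter($parts, 'strlen'));
}
```

For example, `split_paragraphs('<p>One.</p><p>Two.</p>')` would give an array holding the two paragraph strings.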
For each paragraph longer than the maximum allowed length, I explode it on '. '.
This gives an array of sentences.
For each sentence still longer than the maximum allowed length, I explode on '.' (no space, to catch lazy punctuation).
For each piece still longer than the maximum allowed length, I look for the last ' ' within the maximum length and split there.
If any text is still too long, it is chunked to the maximum length.
This is all quite sequential and loopy, and multiple short sentences that could go out together as a single message are sent individually. I am sure this could be done better with a couple of regular expressions.
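What I have in mind is something like the sketch below: one regular expression to split on sentence boundaries, wordwrap() for anything still too long, then a greedy pass that packs short pieces back together up to the limit. The name `pack_phrases` is made up for illustration. It counts bytes with strlen(), so multibyte text would need the mb_* functions instead, and collapsing all whitespace to single spaces is an assumption.

```php
<?php
// Hypothetical sketch: split text into sentences, then pack them into
// messages of at most $maxlen bytes, keeping short sentences together.
function pack_phrases(string $text, int $maxlen): array {
    // Collapse runs of whitespace (including newlines) to single spaces.
    $text = trim(preg_replace('/\s+/', ' ', $text));
    // Split at a space that follows ., ? or !; the space itself is dropped.
    $sentences = preg_split('/(?<=[.?!]) /', $text, -1, PREG_SPLIT_NO_EMPTY);

    $pieces = array();
    foreach ($sentences as $sentence) {
        if (strlen($sentence) <= $maxlen) {
            $pieces[] = $sentence;
        } else {
            // Too long: break at the last space within $maxlen,
            // cutting inside a word only when the word itself is too long.
            $wrapped = wordwrap($sentence, $maxlen, "\n", true);
            foreach (explode("\n", $wrapped) as $part) {
                $pieces[] = $part;
            }
        }
    }

    // Greedy packing: append the next piece while the result still fits.
    $messages = array();
    $current = '';
    foreach ($pieces as $piece) {
        if ($current === '') {
            $current = $piece;
        } elseif (strlen($current) + 1 + strlen($piece) <= $maxlen) {
            $current .= ' ' . $piece;
        } else {
            $messages[] = $current;
            $current = $piece;
        }
    }
    if ($current !== '') {
        $messages[] = $current;
    }
    return $messages;
}
```

With a limit of 20, the two short sentences in "One. Two. Three is a somewhat longer sentence." would go out as one message, and the long sentence would be wrapped at spaces into two more.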
Edit:
This is what I have ended up with:
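    function phraseify($text, $maxlen) {
        $text = strip_tags($text, '<p>');
        // Replace typographic quotes and em dashes with ASCII equivalents.
        $srch = array('/‘/u', '/’/u', '/“/u', '/”/u', '/—/u');
        $repl = array('\'', '\'', '"', '"', '-');
        $text = preg_replace($srch, $repl, $text);
        $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
        $text = str_replace('</p>', '', $text);
        $paras = explode('<p>', $text);
        // Delimiters are tried in order of preference.
        $paras = phraseit($paras, array('. ', ', ', '? ', '.', ',', '?', '-', ' '), $maxlen);
        return $paras;
    }

    function phraseit($arr, $on, $maxlen) {
        // Note: strlen/substr count bytes, so multibyte text needs the mb_* equivalents.
        $ret = array();
        foreach ($arr as $str) {
            // Peel chunks off the front until the remainder fits.
            while (strlen($str) > $maxlen) {
                $sub = substr($str, 0, $maxlen);
                $pos = false;
                foreach ($on as $delim) {
                    $pos = strrpos($sub, $delim);
                    if ($pos !== false) { break; }
                }
                // Cut just after the delimiter's first character,
                // or hard-cut at $maxlen if no delimiter was found.
                $cut = ($pos === false) ? $maxlen : $pos + 1;
                array_push($ret, substr($str, 0, $cut));
                $str = substr($str, $cut);
            }
            // Keep whatever is left, skipping empty strings.
            if (strlen($str) > 0) { array_push($ret, $str); }
        }
        return $ret;
    }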