I am working in PHP.

I have some text from a web page/HTML file that has been through a readability/simplification process. I now want to split it up into phrases/messages of no more than a certain number of characters.

Currently, I start with the full page: I strip_tags everything except paragraph tags, replace the paragraph closing tags with nothing, and explode on the opening paragraph tags. (This idea was taken from another question/answer hereabouts.)

This gives an array of paragraphs.
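
The paragraph step could also be done with a single regular expression. This is only a sketch: `split_paragraphs` is a hypothetical name, and it assumes the simplified HTML uses plain `<p>` tags (possibly with attributes) to mark paragraphs.

```php
<?php
// Hypothetical helper (not from the original post): split simplified HTML into paragraphs.
function split_paragraphs(string $html): array {
    // Keep only <p> tags, then split on any opening or closing paragraph tag.
    $text = strip_tags($html, '<p>');
    $parts = preg_split('#</?p\b[^>]*>#i', $text);
    // Trim whitespace and drop the empty pieces left between adjacent tags.
    $parts = array_map('trim', $parts);
    return array_values(array_filter($parts, 'strlen'));
}

print_r(split_paragraphs('<div><p class="a">First one.</p><p>Second <b>one</b>.</p></div>'));
```

Splitting on both the opening and closing tag avoids the empty first element that `explode('<p>', ...)` produces, and it also copes with `<p class="...">`.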

For each paragraph longer than the maximum allowed length, I explode it on '. '.

Giving an array of sentences.

For each sentence longer than the max allowed length, I explode on '.'. (No space... for the lazy).

For each of these longer than max allowed length I look for the last ' ' within max length and split on this.

If any text is still too long, it is chunked to the maximum length.
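
The last two steps (split at the last space within the limit, then hard-chunk anything still too long) are roughly what PHP's built-in `wordwrap()` does when its fourth argument is `true`. A minimal sketch, where `split_long` is a hypothetical name and lengths are counted in bytes, as with `strlen`:

```php
<?php
// Hypothetical helper: break one over-long sentence into pieces of at most $maxlen bytes.
function split_long(string $sentence, int $maxlen): array {
    // Collapse whitespace so "\n" is free to serve as the break marker.
    $sentence = trim(preg_replace('/\s+/', ' ', $sentence));
    // wordwrap() breaks at the last space within $maxlen; with the fourth
    // argument true it also cuts words that are themselves longer than $maxlen.
    return explode("\n", wordwrap($sentence, $maxlen, "\n", true));
}

print_r(split_long('the quick brown fox', 10));   // two pieces: "the quick", "brown fox"
print_r(split_long('abcdefghijkl', 5));           // hard-chunked: "abcde", "fghij", "kl"
```

Note that `wordwrap()` is not multibyte-aware, so UTF-8 text with non-ASCII characters may need a different approach.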

This is all quite sequential and loopy. Also, multiple short sentences that could go together as a single message are sent individually. I am sure this could be done better with a couple of regular expressions.
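
One way the short sentences could be combined is to split on sentence boundaries with a regular expression and then greedily pack the sentences into messages. This is a sketch under assumptions rather than a definitive solution: `pack_sentences` is a hypothetical name, sentences are assumed to end in '.', '?' or '!' followed by whitespace, and a sentence that is itself longer than `$maxlen` is passed through unchanged and would still need further splitting.

```php
<?php
// Hypothetical helper: pack whole sentences into messages of at most $maxlen bytes.
function pack_sentences(string $paragraph, int $maxlen): array {
    // Split on whitespace that follows '.', '?' or '!'; the lookbehind
    // keeps the punctuation attached to its sentence.
    $sentences = preg_split('/(?<=[.?!])\s+/', trim($paragraph), -1, PREG_SPLIT_NO_EMPTY);
    $messages = array();
    $current = '';
    foreach ($sentences as $sentence) {
        $candidate = ($current === '') ? $sentence : $current . ' ' . $sentence;
        if (strlen($candidate) <= $maxlen) {
            $current = $candidate;          // still fits: keep accumulating
        } else {
            if ($current !== '') {
                $messages[] = $current;     // flush what has been collected
            }
            $current = $sentence;           // may itself exceed $maxlen
        }
    }
    if ($current !== '') {
        $messages[] = $current;
    }
    return $messages;
}

print_r(pack_sentences('One. Two. Three is longer. Four.', 20));
```

Here "One." and "Two." end up in the same message, while "Three is longer." and "Four." do not fit together within 20 bytes and stay separate.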

Edit:

This is what I have ended up with:

function phraseify($text,$maxlen) {
    $text = strip_tags($text,'<p>');
    $srch= array ('/&lsquo;/u', '/&rsquo;/u', '/&ldquo;/u', '/&rdquo;/u', '/&mdash;/u');
    $repl= array ('\'','\'','"','"','-');
    $text = preg_replace($srch,$repl,$text);
    $text = html_entity_decode($text,ENT_QUOTES, 'UTF-8');
    $text = str_replace('</p>','',$text);
    $paras = explode('<p>',$text);
    $paras = phraseit($paras,array ('. ',', ','? ','.',',','?','-',' '),$maxlen);
    return $paras;
}

function phraseit($arr,$on,$maxlen) {
    $ret = array();
    foreach ($arr as $str) {
        if (strlen($str)<=$maxlen) {
            array_push($ret,$str);
        } else {
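            // Only keep splitting while the remainder is still too long.
            while (strlen($str)>$maxlen) {
                $sub=substr($str,0,$maxlen);
                $pos=false;
                // Try delimiters in priority order; use the last occurrence of the first one found.
                for ($i=0; $i<count($on); $i++ ) {
                    $pos=strrpos($sub,$on[$i]);
                    if ($pos!==false) {break; }
                }
                // No delimiter found: hard-chunk at exactly $maxlen characters.
                if ($pos===false) {$pos = $maxlen-1; }
                array_push($ret,substr($str,0,$pos+1));
                $str = substr($str,$pos+1);
            }
            // Whatever is left now fits in a single message.
            array_push($ret,$str);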
        }
    }
    return $ret;
}
pperrin
  • Well, I am surprised you can do all of that without a single line of code? – MSadura Apr 04 '14 at 11:12
  • **Don't**. Use an HTML parser instead of whatever it is you're doing. [`DOMDocument`](http://www.php.net/dom) is built into PHP. If you need jQuery/css selectors you can use [simplehtmldom](http://simplehtmldom.sourceforge.net/). – h2ooooooo Apr 04 '14 at 11:21
  • @marks my existing code is irrelevant - it works as described, the question was for an alternative using regular expressions to do as described... – pperrin Apr 10 '14 at 08:30
  • @h2ooooooo thanks, but simplexml will only parse down to the DOM element - I am parsing text based on full stops, commas, spaces etc... My starting point (after the doc has been through the simplification/readability process) is really a list of paragraphs, with all other *ml stuff stripped out. – pperrin Apr 10 '14 at 08:35

0 Answers