I am using a regular expression to extract the price on the right from the following HTML:

<p class="pricing ats-product-price"><em class="old_price">$99.99</em>$94.99</p>

Using preg_match_all in PHP:

preg_match_all('!<p class="pricing ats-product-price"><em class="old_price">.*?<\/em>(.*?)<\/p>!', $output, $prices);

However, I noticed that the HTML sometimes doesn't include an old price. In that case it looks like this:

<p class="pricing ats-product-price">$129.99</p>

It seems like my goal should be to extract the last price in the paragraph, in other words the text that sits directly before the closing </p>. This sort of expression is way out of my league though, so I'm hoping for some help here. Thanks.

Ben86

1 Answer

Use a regular expression in combination with a parser:

<?php

$data = <<<DATA
    <p class="pricing ats-product-price">
        <em class="old_price">$99.99</em>
        $94.99
    </p>
    <p class="pricing ats-product-price">$129.99</p>
DATA;

# set up the dom
$dom = new DOMDocument();
$dom->loadHTML($data, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

# set up the xpath
$xpath = new DOMXPath($dom);

$regex = '~\$\d+[\d.]*\b\s*\Z~';
foreach ($xpath->query("//p") as $line) {
    if (preg_match($regex, $line->nodeValue, $match)) {
        echo $match[0] . "\n";
    }
}

This yields

$129.99
$129.99


The snippet sets up the DOM, queries it for p tags and searches for the last price within.
See a demo for the expression on regex101.com.
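
To see what the expression does on its own, here is a minimal sketch that applies it to plain strings instead of DOM nodes. The helper name `lastPrice` and the two sample strings are assumptions made for illustration, and a capture group was added so that trailing whitespace is not part of the result.

```php
<?php
// Hypothetical helper (not part of the answer above): return the last price
// found at the end of a piece of text, or null if there is none.
function lastPrice(string $text): ?string
{
    // \$\d+[\d.]*  a dollar sign, digits, then any further digits or dots
    // \b\s*\Z      only optional whitespace may follow before the end of the string
    if (preg_match('~(\$\d+[\d.]*)\b\s*\Z~', $text, $match)) {
        return $match[1];
    }
    return null;
}

// Assumed sample values, shaped like the text content of the two paragraphs.
$withOldPrice = "\n    \$99.99\n    \$94.99\n";
$withoutOldPrice = '$129.99';

echo lastPrice($withOldPrice), "\n";    // $94.99
echo lastPrice($withoutOldPrice), "\n"; // $129.99
```

Because of the `\Z` anchor, an earlier price such as `$99.99` can never match: something other than whitespace follows it.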
Jan
  • I don't think it's necessary to check what the nodeValue looks like with a pattern. Here the main goal is to return the old price when it exists or the price when the old price doesn't exist. You can do it with an XPath query, for example: `//p[./@class[contains(.,"pricing") and contains(.,"ats-product-price")]]/em[contains(@class,"old_price")] | //p[./@class[contains(.,"pricing") and contains(.,"ats-product-price")]][not(./em)]` – Casimir et Hippolyte Jan 31 '18 at 21:13
  • @CasimiretHippolyte: I find mine slightly more readable, admittedly :) But you are right, one could do without a regular expression here. – Jan Jan 31 '18 at 21:14
  • Also, be careful with the option `LIBXML_HTML_NOIMPLIED`: when the document doesn't have a root element, libxml uses the first element as the root and silently moves its closing tag to the end. – Casimir et Hippolyte Jan 31 '18 at 21:20
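
Taking both comments together, here is a regex-free sketch that picks the current price (the one the question asks for) with XPath alone. It loads the markup with the default options, so libxml adds its own `<html><body>` wrapper and the two paragraphs stay siblings. If the restructuring described in the last comment applies, the second `<p>` would end up nested inside the first, which may be why the output in the answer shows `$129.99` twice. The variable names and the choice to match on a single class are assumptions for illustration.

```php
<?php
// Sample markup from the question; a nowdoc keeps the dollar signs literal.
$html = <<<'HTML'
<p class="pricing ats-product-price"><em class="old_price">$99.99</em>$94.99</p>
<p class="pricing ats-product-price">$129.99</p>
HTML;

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);            // default options: no LIBXML_HTML_NOIMPLIED
$xpath = new DOMXPath($dom);

$prices = [];
foreach ($xpath->query('//p[contains(@class, "ats-product-price")]') as $p) {
    // Last non-blank text node that is a direct child of <p>;
    // the text inside <em class="old_price"> is not a direct child, so it is skipped.
    $text = $xpath->query('text()[normalize-space()][last()]', $p)->item(0);
    if ($text !== null) {
        $prices[] = trim($text->nodeValue);
    }
}

print_r($prices);
```

For the sample markup this should collect `$94.99` and `$129.99`, one entry per paragraph.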