I'm looking for a regex that can extract the date from the following HTML code:

<div class="image-subtitle">
                    Fri 10 Jun 2022             </div>

It should also find the same date format in strings like these, where there can be additional markup before or after the date, but the beginning and end patterns stay the same:

<div class="image-subtitle">
                    Thu 9 Jun 2022 <sup class="sold-out-highlight highlight-light">&nbsp;Few Tickets Left&nbsp;</sup>               </div>

or like this:

<div class="image-subtitle">
                    <span class="sold-out sold-out-light">Sat 4 Jun 2022</span> <sup class="sold-out-highlight highlight-light">&nbsp;Sold Out&nbsp;</sup>              </div>
cooperano
  • 17
  • 2
  • Please try this: [A-Z]{1}[a-z]{2} [1-3]{0,1}[0-9]{1} [A-Z]{1}[a-z]{2} [1-2]{1}[0-9]{3} https://regex101.com/r/vQ8X4N/1 – Markus Meyer Jun 04 '22 at 04:24
  • It works! Thank you so much, but is there a way for me to specify the beginning and end pattern for the regex to look in? There might be more dates of the same format that may be irrelevant to what I'm looking for – cooperano Jun 04 '22 at 04:36
  • Please have a look: https://regex101.com/r/YLPV4w/1 – Markus Meyer Jun 04 '22 at 04:41

1 Answer

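This will extract any date-looking string. The day is written as \d{1,2} so that single-digit days such as "9" also match, and the pattern is a raw string so the backslashes reach the regex engine unchanged:

import re
re.findall(r"\w{3}\s+\d{1,2}\s+\w{3}\s+\d{4}", the_text)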

This one is restricted to actual weekday and month abbreviations:

import re
re.findall(r"(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun)\s+\d{1,2}\s+(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{4}", the_text)

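If you want the match to start with the same opening tag (<div class="image-subtitle">) every time, anchor the pattern on it. The (?:<[^>]+>\s*)* part skips any tags, such as the <span ...> in your third example, that sit between the opening tag and the date:

import re
re.findall(r'<div class="image-subtitle">\s*(?:<[^>]+>\s*)*((?:Mon|Tue|Wed|Thu|Fri|Sat|Sun)\s+\d{1,2}\s+(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{4})', the_text)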
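
As a quick check, here is a minimal sketch that runs an anchored pattern of this kind against the three snippets from the question, plus one extra date outside the div. The names DATE_IN_SUBTITLE and SAMPLES and the extra <p> line are made up for illustration, and allowing optional tags before the date is an assumption added so that the <span> case matches.

```python
import re

# Hypothetical names; this is one possible variant of the anchored pattern.
DATE_IN_SUBTITLE = re.compile(
    r'<div class="image-subtitle">\s*'   # fixed opening tag
    r'(?:<[^>]+>\s*)*'                   # optional tags before the date, e.g. <span ...>
    r'((?:Mon|Tue|Wed|Thu|Fri|Sat|Sun)\s+\d{1,2}\s+'
    r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{4})'
)

# The three snippets from the question, followed by an unrelated date
# (the <p> line is invented here) that should NOT be picked up.
SAMPLES = '''<div class="image-subtitle">
                    Fri 10 Jun 2022             </div>
<div class="image-subtitle">
                    Thu 9 Jun 2022 <sup class="sold-out-highlight highlight-light">&nbsp;Few Tickets Left&nbsp;</sup>               </div>
<div class="image-subtitle">
                    <span class="sold-out sold-out-light">Sat 4 Jun 2022</span> <sup class="sold-out-highlight highlight-light">&nbsp;Sold Out&nbsp;</sup>              </div>
<p>Posted Mon 6 Jun 2022</p>'''

# With a single capturing group, findall returns just the captured dates.
dates = DATE_IN_SUBTITLE.findall(SAMPLES)
print(dates)  # ['Fri 10 Jun 2022', 'Thu 9 Jun 2022', 'Sat 4 Jun 2022']
```

For anything beyond a fixed, predictable snippet like this, an HTML parser is usually more robust than a regex, since attribute order or extra classes on the div would break the literal match.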

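You still need to verify that each match is a valid date, though: a regex only checks the shape of the string, so something like "Wed 30 Feb 2022" would still match.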
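
One way to do that validation is to hand each match to datetime.strptime. This is only a sketch: parse_date is a hypothetical helper name, and it assumes an English (or C) locale so that %a and %b correspond to names like "Thu" and "Jun".

```python
from datetime import datetime

def parse_date(text):
    """Return a datetime.date for strings like 'Thu 9 Jun 2022', or None if invalid.

    Hypothetical helper; assumes English weekday/month abbreviations.
    """
    try:
        parsed = datetime.strptime(text, "%a %d %b %Y")
    except ValueError:
        # Wrong shape or impossible date, e.g. 30 Feb
        return None
    # As far as I know, strptime does not cross-check the weekday name
    # against the calendar date, so compare it explicitly.
    if parsed.strftime("%a") != text.split()[0]:
        return None
    return parsed.date()

print(parse_date("Thu 9 Jun 2022"))   # 2022-06-09
print(parse_date("Wed 30 Feb 2022"))  # None
```

Drop the weekday comparison if you only care that the day, month and year form a real date.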

Nawal
  • 34
  • 3