I have the OCR text shown below, extracted from a European driving license with Tesseract.js. I would like to write several regular expressions, each matching a different data field on the license. The numbers below correspond to the number printed before each field on any European driving license; the labelling rules for these documents are described on Wikipedia:

  1. surname (last name)
  2. other names ( first name(s) )
  3. date of birth
  4b. date of expiry
  5. driving license ID number
  6. address

HR 1. UZORAK
2. SPECIMEN
3. 01011977
1 42.01.07.2013 4. PUDUBROVACKO - NERETVANSKA
e 4b. 01.07.2023
5. 1234587 S i
, E %I\\'\f Dt — |
: = L 9.8 =
D112345671234567890121012017<2

My question is: why does the regex /4(\.|,)*(b|!|8)\.?\s*[0-9\.]*/u match 4b. 01.07.2023, while /4(\.|,)*(b|!|8)*\.?\s*[0-9\.]*/u (one extra asterisk after the second capture group, compared to the former regex) does not? (This can be checked on regex101.)

M.Ionut

1 Answer

My issue was that I had not set the proper flags. I also needed the g flag at the end of the regex so that I do not get back only the first match. g stands for "global", i.e. "don't return after the first match", as regex101 shows. With the extra asterisk the b becomes optional, so the pattern also matches earlier text such as 42.01.07.2013; without the g flag, that earlier occurrence was the first match returned, not the one I expected. The regex that eventually worked for my scenario is: /4(\.|,)*(b|!|8)*\.?\s*[0-9\.]*/gu
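
To make the difference concrete, here is a minimal JavaScript sketch that runs the same pattern with and without the g flag. It assumes the OCR text is available as a plain string (the two garbled lines are left out for brevity), and the variable names and the final filter step are only illustrative, not part of my original code.

```javascript
// OCR text from the question; the two garbled lines are omitted for brevity.
const ocrText = [
  "HR 1. UZORAK",
  "2. SPECIMEN",
  "3. 01011977",
  "1 42.01.07.2013 4. PUDUBROVACKO - NERETVANSKA",
  "e 4b. 01.07.2023",
  "5. 1234587 S i",
  "D112345671234567890121012017<2",
].join("\n");

// The same pattern, once without and once with the global flag.
const firstOnly = /4(\.|,)*(b|!|8)*\.?\s*[0-9\.]*/u;
const allMatches = /4(\.|,)*(b|!|8)*\.?\s*[0-9\.]*/gu;

// Without g, String.prototype.match stops at the leftmost match.
// Because (b|!|8)* may match nothing, that is "42.01.07.2013".
const first = ocrText.match(firstOnly)[0];

// With g, match returns an array with every matched substring.
const all = ocrText.match(allMatches);

// Illustrative extra step (an assumption, not in the original answer):
// keep only the matches that really start with a "4b"-like label.
const expiryCandidates = all.filter((m) => /^4[.,]*[b!8]/.test(m));

console.log(first);
console.log(all);
console.log(expiryCandidates);
```

The global pattern still returns several unwanted matches next to the expiry date, so some filtering like the last step, or a stricter pattern, is probably needed in practice.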

M.Ionut