I have a field which contains two letters, some digits and potentially a single trailing letter. For example:

QU1Y
ZL002
FX16
TD8
BF007P
VV1395
HM18743
JK0001

I would like to consistently return all letters in their original position, but digits as follows.

For 1 to 3 digits: return all the digits, left-padded with zeros to three digits where needed.

For 4 or more digits: the result must not begin with a zero. Return the first 4 digits, or, if the first digit is a zero, truncate to three digits.

Expected output for the data above:

QU001Y
ZL002
FX016
TD008
BF007P
VV1395
HM1874
JK001

The implementation will be in R, but I'm interested in a straight regex solution; I'll work out the R side of things. It may not be possible in straight regex, which is why I can't get my head round it.

This identifies the correct ones, but I'm hoping to correct those which are not right.

"[A-Z]{2}[1-9]{0,1}[0-9]{1,3}[FYP]{0,1}"

For the curious, they are flight numbers but entered by a human. Hence the variety...

rj3838

1 Answer

You may use

> library(gsubfn)
> l <- c("QU1Y", "ZL002", "FX16", "TD8", "BF007P", "VV1395", "HM18743", "JK0001")
> gsubfn('^[A-Z]{2}\\K0*(\\d{1,4})\\d*', ~ sprintf("%03d",as.numeric(x)), l, perl=TRUE)
[1] "QU001Y" "ZL002"  "FX016"  "TD008"  "BF007P" "VV1395" "HM1874" "JK001" 

The pattern matches

  • ^ - start of string
  • [A-Z]{2} - two uppercase letters
  • \\K - match reset: the text matched so far (the two letters) is dropped from the match, so it is left untouched by the replacement
  • 0* - 0 or more zeros
  • (\\d{1,4}) - Capturing group 1: one to four digits
  • \\d* - zero or more further digits, which are matched and therefore discarded.

Group 1 is passed to the callback function, where sprintf("%03d",as.numeric(x)) left-pads the value with zeros to a minimum width of three digits. Four-digit values pass through unchanged.
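
If installing gsubfn is not an option, the same idea can be sketched in base R. This is only an illustration, not part of the answer above: the helper name normalise_flight is hypothetical, and it assumes R 3.3.0 or later so that regexec() accepts perl = TRUE. Since \\K is not needed here, the two letters and any trailing text are captured and pasted back around the padded digits.

```r
# Hypothetical helper (not from the original answer): base R only.
normalise_flight <- function(x) {
  # Group 1: two letters; leading zeros skipped; group 2: up to four digits;
  # any further digits are matched but dropped; group 3: whatever follows.
  pattern <- "^([A-Z]{2})0*([0-9]{1,4})[0-9]*(.*)$"
  parts <- regmatches(x, regexec(pattern, x, perl = TRUE))
  vapply(seq_along(x), function(i) {
    p <- parts[[i]]
    # No match: return the input unchanged rather than guessing.
    if (length(p) == 0) return(x[i])
    # Pad the digits to at least three characters, then reassemble.
    paste0(p[2], sprintf("%03d", as.integer(p[3])), p[4])
  }, character(1))
}

l <- c("QU1Y", "ZL002", "FX16", "TD8", "BF007P", "VV1395", "HM18743", "JK0001")
print(normalise_flight(l))
```

For the sample vector this should give the same eight values as the gsubfn call. Strings that do not start with two uppercase letters followed by a digit are returned as they are, which is a design choice made for this sketch.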

Wiktor Stribiżew