-2

I have a dataframe which contains a series of patterns.

Example dataframe:

mydata: 
[ 'Ticket number INS 00909',
'Ticket number INS00909',
'Ticket number REQ 8776',
'Ticket number REQ#8777',
'Ticket number REQ #8778',
'Ticket number REQ8778',
'Number is CR 0098445554',
'No INS number',
'No REQUEST',
'DESCRIBED']

Search_Values = ['INS', 'REQ', 'CR' ]
pattern = '|'.join(Search_Values)
df['check'] = df['mydata'].str.contains(pattern)

Basically, I want to find and extract only the ticket numbers. What logic should be applied to pull the CR, REQ , INS followed by number? Sometimes the string would contain special characters like #, -, :, *, space(\s) in between Req and number.Can anyone please help?

Wiktor Stribiżew
  • 561,645
  • 34
  • 376
  • 476
Newbie
  • 13
  • 3
  • Which is your desired output? A list of numbers all together? A dictionary as `{'type of ticket': list of ticket numbers}`? A new dataframe column with a string `'type of ticket - ticket number'`? – Zephyr Jun 06 '20 at 23:00
  • @Andrea Blengino: I want the entire ticket number in a new dataframe column. – Newbie Jun 07 '20 at 02:34
  • FYI: Thoroughly answering questions is time-consuming. If your question is **solved**, say thank you by _**accepting** the solution that is **best for your needs**_. The **✔** is below the **▲** / **▼** arrow, at the top left of the answer. A new solution can be accepted if a better one shows up. You may also vote on the usefulness of an answer with the **▲** / **▼** arrow, if you have a 15+ reputation. **Leave a comment if a solution doesn't answer the question**. [What should I do when someone answers my question?](https://stackoverflow.com/help/someone-answers). Thank you – Zephyr Aug 22 '21 at 16:32

2 Answers2

2

This will give you a new df column. The indices of the mydata strings are used to take slices with only your desired info. The last if/else block checks to see if there are numbers in the string to avoid appending false positive matches.

order_list = []

for idx, row in df.iterrows():

    if 'INS' in row['mydata']:
        index = row['mydata'].index('INS')
    elif  'REQ' in row['mydata']:
        index = row['mydata'].index('REQ')
    elif  'CR' in row['mydata']:
        index = row['mydata'].index('CR')


    if any(map(str.isdigit, row['mydata'])):
        order_list.append(row['mydata'][index:])
    else:
        order_list.append('')

df['order'] = order_list

print(df)

    mydata                  order
0   Ticket number INS 00909 INS 00909
1   Ticket number INS00909  INS00909
2   Ticket number REQ 8776  REQ 8776
3   Ticket number REQ#8777  REQ#8777
4   Ticket number REQ #8778 REQ #8778
5   Ticket number REQ8778   REQ8778
6   Number is CR 0098445554 CR 0098445554
7   No INS number   
8   No REQUEST  
9   DESCRIBED   
Matthew Borish
  • 2,796
  • 2
  • 11
  • 23
1

If you do not have to distinguish between INS, RES and CR numbers, you can use this code:

import re

data = [ 'Ticket number INS 00909',
'Ticket number INS00909',
'Ticket number REQ 8776',
'Ticket number REQ#8777',
'Ticket number REQ #8778',
'Ticket number REQ8778',
'Number is CR 0098445554',
'No INS number',
'No REQUEST',
'DESCRIBED']

numbers = []

for item in data:
    res = re.findall(r'[0-9]+$', item)
    if res != []:
        numbers.append(res[0])

print(numbers)

which gives a simple list with all the ticket numbers together:

['00909', '00909', '8776', '8777', '8778', '8778', '0098445554']

Otherwise, if you need to distinguish the ticket types, use this:

import re

data = [ 'Ticket number INS 00909',
'Ticket number INS00909',
'Ticket number REQ 8776',
'Ticket number REQ#8777',
'Ticket number REQ #8778',
'Ticket number REQ8778',
'Number is CR 0098445554',
'No INS number',
'No REQUEST',
'DESCRIBED']

numbers = {'INS': [], 'REQ': [], 'CR': []}

for item in data:
    res = re.findall(r'[0-9]+$', item)
    if res != []:
        if 'INS' in item:
            numbers['INS'].append(res[0])
        elif 'REQ' in item:
            numbers['REQ'].append(res[0])
        elif 'CR' in item:
            numbers['CR'].append(res[0])

print(numbers)

which gives a dictionary with a key for each ticket type:

{'INS': ['00909', '00909'], 'REQ': ['8776', '8777', '8778', '8778'], 'CR': ['0098445554']}

I kept the ticket numbers as str and I had not converted them into int in order to avoid this, perhaps unwanted, conversion: '00909' ---> 909.
This solution is based on the assumption that the ticket number is always at the end of the string.

Zephyr
  • 10,450
  • 29
  • 41
  • 68