I have a DataFrame (df) of journals with many different titles.

I want to extract only the journals whose titles exactly match one of the following:

Blood, Cancer, Chest, Circulation, Diabetes, JAMA, Endocrinology, Gastroenterology, Gut, Medicine, Neurology, Pediatrics, Physical therapy, Radiology, Surgery, Geriatrics

Other journal titles contain the same words (Blood circulation, Cancer History, etc.). I do not want to select those.

Example

Id Title
1  Blood
2  Blood
3  Blood purification
4  Blood transfusion
5  Cancer
6  Chest
7  Cancer History
8  Chest Analysis

I want to keep only the exact journal titles and store them in a new column "Influential", but I cannot find a way to do this with str.contains or str.match.

I have tried two approaches:

df.loc[df['Title'].str.contains("Blood", case = True, na = False), 'Influential'] = 'Blood'
df.loc[df['Title'].str.match("Blood", case = True, na = False), 'Influential'] = 'Blood'

Expected output with the exact title of the journal:

Id Title              Influential
1  Blood              Blood
2  Blood              Blood
3  Blood purification NA
4  Blood transfusion  NA
5  Cancer             Cancer
6  Chest              Chest
7  Cancer History     NA
8  Chest Analysis     NA

Should I do it somehow via regex? Thanks.

Anakin Skywalker

1 Answer

If you want to fill the Influential column with the value from the Title column only where Title exactly matches one of the entries in a list (lst below), you can use Series.isin:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Id':[1,2,3,4,5,6,7,8], 'Title': ['Blood','Blood', 'Blood purification', 'Blood transfusion', 'Cancer', 'Chest', 'Cancer History', 'Chest Analysis']})
lst = ['Blood', 'Chest', 'Cancer']
# Keep the title where it is exactly one of the entries in lst, otherwise NaN
df['Influential'] = np.where(df['Title'].isin(lst), df['Title'], np.nan)
# >>> df
#    Id               Title Influential
# 0   1               Blood       Blood
# 1   2               Blood       Blood
# 2   3  Blood purification         NaN
# 3   4   Blood transfusion         NaN
# 4   5              Cancer      Cancer
# 5   6               Chest       Chest
# 6   7      Cancer History         NaN
# 7   8      Chest Analysis         NaN

Note the use of numpy.where (also suggested in the comments).
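
As for the regex part of the question: it is possible, but the two attempts in the question do not anchor the end of the string. str.contains matches anywhere in the title, and str.match only anchors at the start, so both also select "Blood purification". A minimal sketch of a regex alternative, assuming pandas 1.1 or later (where Series.str.fullmatch is available); the names titles, pattern and the *_hits variables are only illustrative:

```python
import re

import pandas as pd

df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6, 7, 8],
    "Title": ["Blood", "Blood", "Blood purification", "Blood transfusion",
              "Cancer", "Chest", "Cancer History", "Chest Analysis"],
})
titles = ["Blood", "Chest", "Cancer"]  # illustrative subset of the full list

# str.contains matches anywhere; str.match anchors only at the start.
# Both therefore also flag "Blood purification" and "Blood transfusion".
contains_hits = df["Title"].str.contains("Blood", na=False)
match_hits = df["Title"].str.match("Blood", na=False)

# str.fullmatch requires the whole string to match the pattern.
# re.escape guards against regex metacharacters in a title, and the
# non-capturing group keeps the alternation together as one unit.
pattern = "(?:" + "|".join(re.escape(t) for t in titles) + ")"
full_hits = df["Title"].str.fullmatch(pattern, na=False)

# Series.where keeps values where the mask is True and puts NaN elsewhere.
df["Influential"] = df["Title"].where(full_hits)
```

For a plain list of literal titles, isin remains the simpler choice; fullmatch is mainly useful when the titles need real pattern matching.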

Wiktor Stribiżew