How to extract only uppercase substring from pandas series?

Question

I have been trying to extract the uppercase substring from a pandas DataFrame column, but to no avail. How can I extract only the uppercase substring in pandas?

Here is my MWE:

MWE

import numpy as np
import pandas as pd


df = pd.DataFrame({'col': ['cat', 'cat.COUNT(example)','cat.N_MOST_COMMON(example.ord)[2]']})
df['feat'] = df['col'].str.extract(r"[^A-Z]*([A-Z]*)[^A-Z]*")


print(df)

"""
                                 col feat
0                                cat  NaN
1                 cat.COUNT(example)    T
2  cat.N_MOST_COMMON(example.ord)[2]    N
""";

Expected output

                                 col feat
0                                cat  
1                 cat.COUNT(example)    COUNT
2  cat.N_MOST_COMMON(example.ord)[2]    N_MOST_COMMON

Possible duplicate of [this question](https://stackoverflow.com/questions/15886340/how-to-extract-all-upper-from-a-string-python). In any case, you can apply any of those alternatives in a lambda function, just like the answer here. – Cainã Max Couto-Silva Oct 20 '20 at 19:45
@CainãMaxCouto-Silva That question is about the re module; here I am trying to use the pandas str.extract method, not re.sub. – BhishanPoudel Oct 20 '20 at 19:47
@cs95 In my dataframe each value contains only one aggregation function, written in UPPERCASE; the original column names are already lowercase. – BhishanPoudel Oct 20 '20 at 19:51
@MilkyWay001, Got it! It seems like you have your (quite neat) answer then =) – Cainã Max Couto-Silva Oct 20 '20 at 19:56
I have a follow-up question to this. Instead of updating this question, I posted a new one: https://stackoverflow.com/questions/64452644/how-to-extract-the-uppercase-as-well-as-some-substring-from-pandas-dataframe-usi – BhishanPoudel Oct 20 '20 at 20:11

score 3 · Accepted Answer · answered Oct 20 '20 at 19:47

How about:

 df['feat'] = df.col.str.extract('([A-Z_]+)').fillna('')

Output:

                                 col           feat
0                                cat               
1                 cat.COUNT(example)          COUNT
2  cat.N_MOST_COMMON(example.ord)[2]  N_MOST_COMMON
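As a minimal sketch of the same idea, the snippet below passes `expand=False` so that `str.extract` returns a Series rather than a one-column DataFrame, then fills the unmatched rows. The `'my_col.COUNT(x)'` value and the stricter pattern `[A-Z][A-Z_]*` are illustrative assumptions, not part of the question: they show that `[A-Z_]+` would pick up a lone underscore if a lowercase name contained one.

```python
import pandas as pd

df = pd.DataFrame({'col': ['cat', 'cat.COUNT(example)',
                           'cat.N_MOST_COMMON(example.ord)[2]']})

# With a single capture group, expand=False yields a Series.
# Rows without a match come back as NaN, so fill them with ''.
df['feat'] = df['col'].str.extract(r'([A-Z_]+)', expand=False).fillna('')
print(df['feat'].tolist())

# Hypothetical input: a lowercase name that itself contains an underscore.
s = pd.Series(['my_col.COUNT(x)'])
loose = s.str.extract(r'([A-Z_]+)', expand=False)       # first run is the bare '_'
strict = s.str.extract(r'([A-Z][A-Z_]*)', expand=False)  # run must start with a capital
print(loose.tolist(), strict.tolist())
```

If the lowercase column names never contain underscores, as in the question's data, the simpler pattern is enough.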

answered Oct 20 '20 at 19:47

Quang Hoang

I appreciate the extract method. Actually, I wanted to extract both `agg` and `feat` from the string. Instead of updating this question, I am going to ask a new question as a follow-up. – BhishanPoudel Oct 20 '20 at 20:04
I have a more complicated question about using extract here: https://stackoverflow.com/questions/64452644/how-to-extract-the-uppercase-as-well-as-some-substring-from-pandas-dataframe-usi – BhishanPoudel Oct 20 '20 at 20:13

score 2 · Answer 2 · answered Oct 20 '20 at 19:59

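If, as you say, each cell contains only one uppercase word, you can also use str.replace to delete every character that is not an uppercase letter or an underscore:

# regex=True is required in pandas >= 2.0, where str.replace defaults to literal matching
df['feat'] = df['col'].str.replace(r"[^A-Z_]", '', regex=True)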

Out[681]:
                                 col           feat
0                                cat
1                 cat.COUNT(example)          COUNT
2  cat.N_MOST_COMMON(example.ord)[2]  N_MOST_COMMON
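The "only one uppercase word" condition matters because deleting characters joins whatever is left. A small sketch of the difference follows; the `'cat.SUM(x).MAX(y)'` value is a made-up example, not from the question's data.

```python
import pandas as pd

# Second value is hypothetical: two uppercase words in one cell.
s = pd.Series(['cat.COUNT(example)', 'cat.SUM(x).MAX(y)'])

# Deleting everything except A-Z and '_' concatenates all uppercase runs.
removed = s.str.replace(r'[^A-Z_]', '', regex=True)

# Extracting returns only the first uppercase run.
first = s.str.extract(r'([A-Z_]+)', expand=False)

print(removed.tolist())
print(first.tolist())
```

With a single uppercase word per cell the two approaches agree, except that `str.replace` gives an empty string for rows without a match while `str.extract` gives NaN.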

score 1 · Answer 3 · answered Oct 20 '20 at 19:43

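You can use re.sub() with the pattern [^A-Z_] (inside a character class `|` is a literal pipe, not alternation, so it is left out):

import re
import pandas as pd
df = pd.DataFrame({'col': ['cat', 'cat.COUNT(example)','cat.N_MOST_COMMON(example.ord)[2]']})
# Remove every character that is not an uppercase letter or an underscore
df['feat'] = df['col'].apply(lambda x: re.sub(r'[^A-Z_]', '', x))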
df
Out[1]: 
                                 col           feat
0                                cat               
1                 cat.COUNT(example)          COUNT
2  cat.N_MOST_COMMON(example.ord)[2]  N_MOST_COMMON
