10

Assume that we have the following pandas dataframe:

df = pd.DataFrame({'col1':['A>G','C>T','C>T','G>T','C>T', 'A>G','A>G','A>G'],'col2':['TCT','ACA','TCA','TCA','GCT', 'ACT','CTG','ATG'], 'start':[1000,2000,3000,4000,5000,6000,10000,20000]})

input:
 col1 col2  start
0  A>G  TCT   1000
1  C>T  ACA   2000
2  C>T  TCA   3000
3  G>T  TCA   4000
4  C>T  GCT   5000
5  A>G  ACT   6000
6  A>G  CTG  10000
7  A>G  ATG  20000
8  C>A  TCT  10000
9  C>T  ACA   2000
10 C>T  TCA   3000
11 C>T  TCA   4000

What I want to get is the number of consecutive values in col1 and length of these consecutive values and the difference between the last element's start and first element's start:

output:
 type length  diff
0  C>T  2   1000
1  A>G  3   14000
2  C>T  3   2000
burcak
  • 203
  • 1
  • 2
  • 4

1 Answers1

12

Break col1 into sub-groups of consecutive strings. Extract first and last entry per sub-group.

Something like this:

df = pd.DataFrame({'col1':['A>G','C>T','C>T','G>T','C>T', 'A>G','A>G','A>G'],'col2':['TCT','ACA','TCA','TCA','GCT', 'ACT','CTG','ATG'], 'start':[1000,2000,3000,4000,5000,6000,10000,20000]})

df['subgroup'] = (df['col1'] != df['col1'].shift(1)).cumsum()

col1 col2 start subgroup 0 A>G TCT 1000 1 1 C>T ACA 2000 2 2 C>T TCA 3000 2 3 G>T TCA 4000 3 4 C>T GCT 5000 4 5 A>G ACT 6000 5 6 A>G CTG 10000 5 7 A>G ATG 20000 5

df.groupby('subgroup',as_index=False).apply(lambda x: (x['col1'].head(1), x.shape[0], x['start'].iloc[-1] - x['start'].iloc[0]))

0 ([A>G], 1, 0) 1 ([C>T], 2, 1000) 2 ([G>T], 1, 0) 3 ([C>T], 1, 0) 4 ([A>G], 3, 14000)

Tweak as needed.

UPDATE: for pandas 1.1+ replace the last part with:

def func(x):
    result = {"type":x['col1'].head(1).values[0], "length": x.shape[0], "diff": x['start'].iloc[-1] - x['start'].iloc[0]}
    return pd.Series(result, name="index")
df.groupby('subgroup',as_index=False).apply(func)
oW_
  • 6,347
  • 4
  • 28
  • 47
  • Thanks for your solution. shift(1) shifts the column one row down but how does the cumsum works after this point? I did not understand that part. Could you please explain? Thanks. – burcak Nov 19 '18 at 22:26
  • The first part is True whenever a new subgroup starts, .cumsum() simply adds (or counts) all the True values up to this point (True and False can be treated as 1 and 0). Since there is exactly one True per subgroup this gives the result. – oW_ Nov 19 '18 at 22:44
  • Thanks @oW_ How do we return a dataframe and other columns of df that are after start column? – burcak Nov 19 '18 at 23:10
  • you can add them to the tuple and unpack it. if you need it in a different way I suggest asking a question on stackoverflow. this is not really data science related. – oW_ Nov 19 '18 at 23:27
  • Can use .ne() method instead of != – smci Jan 07 '20 at 07:24
  • 2
    @oW_ I don't believe this works anymore, at least in my environment. I get an error ValueError: 1 columns passed, passed data had 3 columns. – d84_n1nj4 Dec 07 '20 at 15:18
  • Thanks for pointing that out. I think the update should work in the newer pandas versions as well. – oW_ May 17 '22 at 15:51