Calculating mutation frequencies for every gene

Question

I have a mutation dataset and I want to calculate the mutation frequency for every gene.

df (this is only a small subset of the data)

Gene name   Sample id   MUTATION_ID Mutation Description
ARID1B  2719660 171258500   Substitution - Missense
ARID1B  2719659     
ARID1B  2719661 171258501   Substitution - Missense
ARID1B  2719662     
ARID1B  2719663 171258501   Substitution - Nonsense
CD58    2878555 110346783   Substitution - Nonsense
CD58    2877956     
CD58    2878557     
CD58    2877958 110346784   Substitution - Nonsense
CD58    2878559 110346785   Substitution - Nonsense
CD58    2877960     
MRE11   2861617 123320443   Substitution - coding silent
MRE12   2861617 123320444   Substitution - coding 
MRE13   2861617     
MRE14   2861617 123320445   Substitution - coding silent
MRE15   2861617     
MRE16   2861617 123320446   Substitution - coding

The formula for calculating the mutation frequency is

Positives ÷ (Positives + Negatives) x 100

where,

Positives = No of samples where MUTATION_ID is present

Negatives = No of samples where MUTATION_ID is absent
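
As a quick check of the formula, here is a minimal sketch using the ARID1B rows from the table above. It assumes "Negatives" means samples with an empty MUTATION_ID; the variable names are only illustrative.

```python
# Counts read off the ARID1B rows in the table above.
positives = 3  # samples with a MUTATION_ID
negatives = 2  # samples with an empty MUTATION_ID (assumed meaning of "Negatives")

# Positives / (Positives + Negatives) x 100
mutation_frequency = positives / (positives + negatives) * 100
print(f"ARID1B: {mutation_frequency:.1f}%")
```

Since Positives + Negatives is simply the number of samples for the gene, the formula reduces to Positives divided by the sample count, times 100.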

I want to calculate the mutation frequency for every gene in column 1 (Gene name) with a Python script.

I tried the following code:

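# count() returns the number of non-null values per column for each gene
df = df.groupby("Gene name").count()
Positives = df["MUTATION_ID"]  # samples with a mutation
Negatives = df["Sample id"] - df["MUTATION_ID"]  # samples without one
df['Mutation_Frequency'] = Positives / (Positives + Negatives) * 100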

Are your fields tab separated? They probably have to be since you have spaces in the description field. And what are those extra -? Is that a separate field with no header? — terdon, Mar 01 '23 at 10:12

M__ · Accepted Answer · 2023-03-01T01:21:36.913

The code is fine. The only issue is that you have whitespace (\s) characters inside the column header names. I've added an additional "mutation_frequency" column because that's how I'd do it.

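import pandas as pd

# Assumes the header names in the file have no internal whitespace (Genename, Sampleid)
df = pd.read_csv('/pathtodir/test/bioinfo.csv', sep=r"\s", engine='python')
# count() gives the number of non-null values per column for each gene
df = df.groupby("Genename").count()
# Frequency1: positives / (samples + negatives)
df['Mutation_Frequency1'] = ((df["MUTATION_ID"] / (df["Sampleid"] + (df["Sampleid"] - df["MUTATION_ID"]))) * 100).round(2)
# Frequency2: positives / samples, i.e. Positives / (Positives + Negatives) from the question
df['Mutation_Frequency2'] = (df["MUTATION_ID"] / df["Sampleid"]) * 100
df.drop(['Description', 'Mutation'], axis=1, inplace=True)
print(df)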

-	Sampleid	MUTATION_ID	Mutation_Frequency1	Mutation_Frequency2
ARID1B	5	3	42.86	60.0
CD58	6	3	33.33	50.0
MRE11	1	1	100.00	100.0
MRE12	1	1	100.00	100.0
MRE13	1	0	0.00	0.0
MRE14	1	1	100.00	100.0
MRE15	1	0	0.00	0.0
MRE16	1	1	100.00	100.0
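
For comparison, here is a minimal sketch that keeps the original header names by reading the data as tab-separated text. It assumes the file really is tab-delimited, which the comment on the question suggests but does not confirm. The inline tsv string is a hypothetical stand-in for the real file and holds only the ARID1B and CD58 rows from the question.

```python
import io

import pandas as pd

# Hypothetical tab-separated excerpt of the question's table (assumed layout).
tsv = (
    "Gene name\tSample id\tMUTATION_ID\tMutation Description\n"
    "ARID1B\t2719660\t171258500\tSubstitution - Missense\n"
    "ARID1B\t2719659\t\t\n"
    "ARID1B\t2719661\t171258501\tSubstitution - Missense\n"
    "ARID1B\t2719662\t\t\n"
    "ARID1B\t2719663\t171258501\tSubstitution - Nonsense\n"
    "CD58\t2878555\t110346783\tSubstitution - Nonsense\n"
    "CD58\t2877956\t\t\n"
    "CD58\t2878557\t\t\n"
    "CD58\t2877958\t110346784\tSubstitution - Nonsense\n"
    "CD58\t2878559\t110346785\tSubstitution - Nonsense\n"
    "CD58\t2877960\t\t\n"
)

# With an explicit tab separator, header names containing spaces stay intact
# and empty fields are read as NaN.
df = pd.read_csv(io.StringIO(tsv), sep="\t")

# count() ignores NaN, so MUTATION_ID counts only the mutated samples.
counts = df.groupby("Gene name")[["Sample id", "MUTATION_ID"]].count()
counts["Mutation_Frequency"] = (
    counts["MUTATION_ID"] / counts["Sample id"] * 100
).round(2)
print(counts)
```

On these rows the result should agree with the Mutation_Frequency2 column above (60.0 for ARID1B, 50.0 for CD58), since positives divided by the sample count is the same as Positives ÷ (Positives + Negatives).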
