
I have a mutation dataset and I want to calculate the mutation frequency for each gene.

df (this is only a small subset of the data)

Gene name   Sample id   MUTATION_ID Mutation Description
ARID1B  2719660 171258500   Substitution - Missense
ARID1B  2719659     
ARID1B  2719661 171258501   Substitution - Missense
ARID1B  2719662     
ARID1B  2719663 171258501   Substitution - Nonsense
CD58    2878555 110346783   Substitution - Nonsense
CD58    2877956     
CD58    2878557     
CD58    2877958 110346784   Substitution - Nonsense
CD58    2878559 110346785   Substitution - Nonsense
CD58    2877960     
MRE11   2861617 123320443   Substitution - coding silent
MRE12   2861617 123320444   Substitution - coding 
MRE13   2861617     
MRE14   2861617 123320445   Substitution - coding silent
MRE15   2861617     
MRE16   2861617 123320446   Substitution - coding 

The formula for calculating the mutation frequency is

Positives ÷ (Positives + Negatives) × 100

where,

Positives = number of samples where MUTATION_ID is present

Negatives = number of samples where MUTATION_ID is absent
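
As a quick sanity check of the formula, here is a minimal sketch in plain Python. The function name `mutation_frequency` is hypothetical, and the counts for ARID1B (3 samples with a MUTATION_ID, 2 without) are read off the subset above.

```python
def mutation_frequency(positives: int, negatives: int) -> float:
    """Return Positives / (Positives + Negatives) x 100 as a percentage."""
    total = positives + negatives
    if total == 0:
        raise ValueError("gene has no samples")
    # Multiply before dividing so simple ratios stay exact in floating point.
    return 100 * positives / total


# ARID1B in the subset above: 3 samples with a MUTATION_ID, 2 without.
arid1b_frequency = mutation_frequency(positives=3, negatives=2)
print(arid1b_frequency)  # 60.0
```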

I want to calculate the mutation frequency for every gene in the first column (Gene name) with a Python script.

I tried the following code:

df = df.groupby("Gene name").count()
Positives = df["MUTATION_ID"]
Negatives = df["Sample id"] - df["MUTATION_ID"] 
df['Mutation_Frequency'] = Positives / (Positives + Negatives) * 100
Priya
  • Are your fields tab separated? They probably have to be since you have spaces in the description field. And what are those extra -? Is that a separate field with no header? – terdon Mar 01 '23 at 10:12

1 Answer


The code is fine. The only issue is that you've put whitespace (\s) characters between the words in the column header names. I've added a second column, "Mutation_Frequency2", because that's how I'd do it.

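import pandas as pd

# Split on whitespace characters; the raw string avoids the invalid "\s" escape warning.
df = pd.read_csv('/pathtodir/test/bioinfo.csv', sep=r"\s", engine='python')
# count() tallies non-null values per gene: Sampleid = all samples, MUTATION_ID = positives.
df = df.groupby("Genename").count()
# Note: this denominator is Sampleid + Negatives, not the Positives + Negatives of the question.
df['Mutation_Frequency1'] = ((df["MUTATION_ID"] / (df["Sampleid"] + (df["Sampleid"] - df["MUTATION_ID"]))) * 100).round(2)
# Positives / (Positives + Negatives) * 100, since Positives + Negatives equals the Sampleid count.
df['Mutation_Frequency2'] = (df["MUTATION_ID"] / df["Sampleid"]) * 100
df.drop(['Description', 'Mutation'], axis=1, inplace=True)
print(df)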
- Sampleid MUTATION_ID Mutation_Frequency1 Mutation_Frequency2
ARID1B 5 3 42.86 60.0
CD58 6 3 33.33 50.0
MRE11 1 1 100.00 100.0
MRE12 1 1 100.00 100.0
MRE13 1 0 0.00 0.0
MRE14 1 1 100.00 100.0
MRE15 1 0 0.00 0.0
MRE16 1 1 100.00 100.0
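
As an alternative sketch (not the code above): since Positives + Negatives is simply the number of rows per gene, the frequency can also be taken as the per-gene mean of a "has a MUTATION_ID" flag. This assumes the columns keep the names from the question (`Gene name`, `MUTATION_ID`) and that missing IDs are read as NaN. The inline rows are the ARID1B and CD58 samples from the question, and the names `has_mutation` and `mutation_frequency` are only illustrative.

```python
import pandas as pd

# ARID1B and CD58 rows from the question; None marks a sample without a mutation.
df = pd.DataFrame(
    {
        "Gene name": ["ARID1B"] * 5 + ["CD58"] * 6,
        "Sample id": [2719660, 2719659, 2719661, 2719662, 2719663,
                      2878555, 2877956, 2878557, 2877958, 2878559, 2877960],
        "MUTATION_ID": [171258500, None, 171258501, None, 171258501,
                        110346783, None, None, 110346784, 110346785, None],
    }
)

# Flag samples that carry a mutation, then average the flag within each gene.
has_mutation = df["MUTATION_ID"].notna()
mutation_frequency = (
    has_mutation.groupby(df["Gene name"])
    .mean()
    .mul(100)
    .round(2)
    .rename("Mutation_Frequency")
)
print(mutation_frequency)
```

This counts rows, so it assumes each sample appears at most once per gene. On these rows it gives 60.0 for ARID1B and 50.0 for CD58, matching Mutation_Frequency2 above.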
M__