How can I make this algorithm more efficient using dataframes?

Question

I am trying to detect the outliers of a column using the IQR. Once I have the outliers, I want to set the corresponding values in my main dataframe to null so that I can impute them afterwards. This is how I implemented it:

 df_outliers_detected = detect_outliers_IQR(df['Outliers'])
 df_outliers_detected = pd.DataFrame(df_outliers_detected)
 print(df_outliers_detected)

 for i in range(len(df)):
   for j in range(len(df_outliers_detected)):
     if df.loc[i, "Outliers"] == df_outliers_detected.iloc[j, 0]:
       df.loc[i, 'Outliers'] = None
                    
 print(df['Outliers'].head(100))

These two for loops make the program really slow. Is there a better way to implement this?

The code of the function "detect_outliers_IQR":

def detect_outliers_IQR(df):

    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1
    
    print(df)
    print("\n")
    df_outlier = df[((df<(Q1-1.5*IQR)) | (df>(Q3+1.5*IQR)))]
    print(len(df_outlier))
    return df_outlier

Pantelis · Answer 1 · 2022-03-24T02:06:53.667

2

You can take advantage of the logical indexing you already used in your function.

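def detect_outliers_IQR(df_input):
    # work on a copy so the caller's data is not modified
    df = df_input.copy()
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1
    # boolean mask: True where a value lies outside the IQR fences
    df_outlier = (df<(Q1-1.5*IQR)) | (df>(Q3+1.5*IQR))
    # set all outliers to null in one vectorized assignment
    df[df_outlier] = None
    return df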

# replace outliers
df_outliers_detected = detect_outliers_IQR(df['Outliers'])
print(df_outliers_detected)
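
If the goal is to null the outliers directly in the main dataframe, the same boolean mask can be written back to the column. The following is only a minimal sketch of that idea: the helper name iqr_outlier_mask, the parameter k and the sample data are made up for illustration, and Series.mask is used in place of the direct assignment above.

```python
import pandas as pd

def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    q1 = s.quantile(0.25)
    q3 = s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Hypothetical sample data; 100.0 is the only value outside the fences.
df = pd.DataFrame({"Outliers": [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]})

mask = iqr_outlier_mask(df["Outliers"])

# Series.mask replaces values with NaN wherever the condition is True,
# so no Python-level loop over the rows is needed.
df["Outliers"] = df["Outliers"].mask(mask)

print(df)
```

Because the result is assigned back to the whole column, the outlier rows are left as NaN and ready for imputation, e.g. with fillna or interpolate.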

edited Mar 24 '22 at 02:06

answered Mar 24 '22 at 01:42

Pantelis


Of course, happy to help! – Pantelis Mar 24 '22 at 01:59
I get this warning, though: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame – Miguel Angel Peláez Mar 24 '22 at 02:03
I have edited the code to create a copy of the dataframe when it's passed to the function. This should get rid of the warning. – Pantelis Mar 24 '22 at 02:08
Thanks again, why is this needed by the way? – Miguel Angel Peláez Mar 24 '22 at 02:14
The warning appears when you assign to an object that may be either a view or a copy of a slice of the original dataframe, so pandas cannot tell whether the change will reach the original. Here is a more detailed explanation: https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas – Pantelis Mar 24 '22 at 02:24
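
To illustrate the point about copies, here is a small sketch, assuming made-up sample data and an arbitrary threshold of 50 in place of the IQR fences. It shows that an explicit .copy() gives an independent object: assigning to it is unambiguous and leaves the original dataframe untouched.

```python
import pandas as pd

# Hypothetical sample data, not taken from the question.
df = pd.DataFrame({"Outliers": [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]})

# An explicit copy is independent of df, so pandas knows exactly
# which object the assignment below is meant to change.
col = df["Outliers"].copy()
col[col > 50] = float("nan")  # 50 is an arbitrary threshold for this demo

print(col.tolist())             # the copy now contains NaN in the last position
print(df["Outliers"].tolist())  # the original column still holds 100.0
```

The flip side is that the changes live only in the copy, so the returned Series has to be assigned back (df['Outliers'] = ...) if the main dataframe should be updated.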
