I have a dataset of customers and their purchase data. That is, for each customer id, I have variables indicating the number of unique products they bought, the number of online orders they placed, how many different product categories they bought from, etc.

Basically, I have a customer id plus many other columns, about 80% of which are count data.

As I intend to apply clustering (and I also have some numeric variables that are on different scales), I followed the steps below:

a) As my data distribution was skewed for certain variables, I applied log transform

b) Later, I also applied standardization to normalize the data.

But as soon as I do step a), the log transform returns -inf for zero values (records where the value is 0), and I understand that this is expected.

For the further processing steps to work, I replaced -inf with zero again.

Is it okay to do it this way? Or should I drop the records that contain -inf?

Or should I first normalize my data and then apply log transform?

How can I address this?

My code is below:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Log-transform the data; zero values become -inf
data_log = data.apply(np.log, axis=1).round(3)
# Replace -inf (produced by zero values) with 0
data_log = data_log.replace(-np.inf, 0)

scaler = StandardScaler()
scaler.fit(data_log)
# Store it separately for clustering
data_normalized = scaler.transform(data_log)
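
To make the effect of the replacement step concrete, here is a minimal sketch on made-up counts. The frame `toy` and the column name `n_products` are hypothetical stand-ins, not the real data. Since log(1) is 0, replacing -inf with 0 gives a count of 0 the same transformed value as a count of 1.

```python
import numpy as np
import pandas as pd

# Hypothetical toy counts, not the real dataset
toy = pd.DataFrame({"n_products": [0, 1, 2, 10]})

# log(0) evaluates to -inf; silence the divide-by-zero warning for the demo
with np.errstate(divide="ignore"):
    toy_log = np.log(toy)

# Same replacement as in the code above
toy_fixed = toy_log.replace(-np.inf, 0)

# After the replacement, the rows with counts 0 and 1 hold the same value
same = toy_fixed["n_products"].iloc[0] == toy_fixed["n_products"].iloc[1]
print(same)
```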

The Great
  • You're treating values of 0 as if they were 1. That's changing the data. It is hard to say how important that is to your project. There are many other fudges and dodges. One is to replace the logarithm that is undefined not with zero, but with some negative number implying a fraction, say $\log(1/2)$. Another is to use $\log(\text{whatever} + 1)$ systematically. – Nick Cox May 28 '22 at 08:20
  • @NickCox - Is it mandatory to apply a log transform when we are also standardizing/scaling? – The Great May 28 '22 at 08:26
  • Don't both try to make the data distribution look normal? So, is applying two different transforms necessary? – The Great May 28 '22 at 08:26
  • Mandatory? No. The main reason for taking logarithms is to pull in long right tails if (1) they exist and (2) they would be a problem for whatever else you want to do. Neither (1) nor (2) is true of all data. This is, or should be, covered in any good general statistics text. – Nick Cox May 28 '22 at 09:08
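
The $\log(\text{whatever} + 1)$ suggestion from the comments could be sketched roughly as follows, using NumPy's `np.log1p` and the same `StandardScaler` as in the question. The frame `toy` and its column names are hypothetical stand-ins for the customer table, and this is only one of the possible fudges mentioned, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the customer table
toy = pd.DataFrame({
    "n_products": [0, 1, 2, 10, 50],
    "n_online_orders": [0, 0, 3, 7, 100],
})

# log(1 + x) is defined at x = 0 and keeps counts of 0 and 1 distinct
toy_log1p = np.log1p(toy)

# Standardize each column to zero mean and unit variance
scaler = StandardScaler()
toy_scaled = scaler.fit_transform(toy_log1p)
```

Standardization only shifts and rescales each column, so it leaves the skewness unchanged. In this sketch it is the `log1p` step that pulls in the long right tail.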
