I have a dataset of customers and their purchase data: for each customer id, I have variables indicating the number of unique products they bought, the number of online orders they placed, how many different product categories they bought from, etc.
Basically, I have a customer id plus count data in multiple other columns (80% of the columns are count data).
As I intend to apply clustering (and some of my numeric variables are on different scales), I followed the steps below:
a) As the distribution of certain variables was skewed, I applied a log transform.
b) Then I applied standardization to normalize the data.
But as soon as I do step a), the log transform returns -inf for zero values (records where the value is 0), and I understand that this is expected.
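For example, with a toy series (not my real data, just to reproduce the issue):

import numpy as np
import pandas as pd

# A hypothetical count column containing a zero
counts = pd.Series([0, 2, 5])
# log(0) -> -inf (numpy also emits a divide-by-zero RuntimeWarning)
print(np.log(counts))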
But for the further processing steps to work, I replaced -inf with zero again.
Is it okay to do it this way, or should I drop the records containing -inf?
Or should I first normalize my data and then apply the log transform?
How can I address this?
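For reference, numpy also has np.log1p, which computes log(1 + x), so a zero count maps to log(1) = 0 instead of -inf. A minimal sketch with the same toy values (I am not sure whether switching to it is the right fix here):

import numpy as np
import pandas as pd

# log1p computes log(1 + x), so zero counts map to 0 rather than -inf
counts = pd.Series([0, 2, 5])
print(np.log1p(counts))  # no -inf values to clean up afterwards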
Find my code below:
import numpy as np
from sklearn.preprocessing import StandardScaler

# data is the customer DataFrame described above
# Element-wise natural log; zero counts become -inf here
data_log = np.log(data).round(3)
# Replace the -inf values (coming from log(0)) with zero
data_log = data_log.replace(-np.inf, 0)

# Standardize, and store the result separately for clustering
scaler = StandardScaler()
scaler.fit(data_log)
data_normalized = scaler.transform(data_log)
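For completeness, here is a self-contained version with made-up toy data (the column names and values are placeholders for my real columns), just to show the pipeline runs end to end:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the real customer data
data = pd.DataFrame({
    "n_unique_products": [0, 3, 12],
    "n_online_orders":   [1, 0, 7],
})

data_log = np.log(data).round(3).replace(-np.inf, 0)
data_normalized = StandardScaler().fit_transform(data_log)
print(data_normalized)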