1 year ago

#373843

Sergio Rivera

How do I remove skewness from a distribution?

I am working with the well-known Credit Card Fraud Detection dataset, which includes 28 PCA-transformed columns. I'm looking at the most skewed feature of all, which turns out to be V28 after running the following snippet:

abs_skew_values = pca.skew().abs().sort_values(ascending=False)
selected_feature = abs_skew_values.index[0]  # index[0]: most skewed feature
selected_feature  # 'V28'

Here, pca is the pandas DataFrame containing the entire dataset with the PCA columns (V1, V2, V3, etc.).

Now, I wanted to test two things:

  1. How much does the original distribution resemble a normal distribution?
  2. How much skewness (left or right) is there in the original distribution?

The first thing I have done is plot the histogram of the feature V28:

[Figure: histogram of V28]

Many data points lie far from 0, and they skew the distribution to the right, giving a skewness of 11.192. There are also many outliers outside the boxplot fences.

I addressed this by applying a signed log transformation, sign(x) * log(|x|), rather than plain log(x), because the distribution contains negative values.
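For reference, here is a minimal sketch of the transformation described above, run on a tiny made-up array rather than the real V28 column. The function names signed_log and signed_log1p are hypothetical, and mapping exact zeros to 0 is an assumption on my part (log(0) is undefined). The second function is a commonly used variant, not what I applied; I include it only because sign(x) * log(|x|) flips the sign of values with |x| < 1, which may or may not matter here.

```python
import numpy as np

def signed_log(x):
    """sign(x) * log(|x|). Assumption: exact zeros are mapped to 0."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    nonzero = x != 0
    out[nonzero] = np.sign(x[nonzero]) * np.log(np.abs(x[nonzero]))
    return out

def signed_log1p(x):
    """Variant: sign(x) * log(1 + |x|). Defined at 0 and order-preserving."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(np.abs(x))

# Made-up values, only to show the behavior around |x| < 1.
demo = np.array([-10.0, -0.5, 0.0, 0.5, 10.0])
print(signed_log(demo))    # the entries for -0.5 and 0.5 change sign
print(signed_log1p(demo))  # signs and ordering are preserved
```

With a DataFrame like pca, either function could be applied as signed_log(pca["V28"]) and the result compared with the original via Series.skew().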

[Figure: histogram of V28 after the log transformation]

This reduced the skewness substantially, to 0.184, and the transformed distribution shows fewer outliers.

Running some normality tests also shows that the data clearly does not come from a normal distribution.

Anderson-Darling test
---------------------
15.000: 0.576, data does not look normal (reject H0)
10.000: 0.656, data does not look normal (reject H0)
5.000: 0.787, data does not look normal (reject H0)
2.500: 0.918, data does not look normal (reject H0)
1.000: 1.092, data does not look normal (reject H0)

D'Agostino K^2 test
-------------------
statistic=96189.836, pvalue=0.000
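Output in this shape can be produced roughly as follows. This is a sketch using scipy.stats.anderson and scipy.stats.normaltest on a synthetic exponential sample, which stands in for the real column; the variable names and the wording of the verdict strings are assumptions. Each Anderson-Darling line shows a significance level (in percent) and its critical value, and H0 is rejected when the test statistic exceeds that critical value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.exponential(size=5000)  # synthetic stand-in, clearly non-normal

# Anderson-Darling: compare the statistic against each critical value.
result = stats.anderson(sample, dist="norm")
print("Anderson-Darling test")
for level, critical in zip(result.significance_level, result.critical_values):
    if result.statistic < critical:
        verdict = "data looks normal (fail to reject H0)"
    else:
        verdict = "data does not look normal (reject H0)"
    print(f"{level:.3f}: {critical:.3f}, {verdict}")

# D'Agostino K^2: combines skewness and kurtosis into one statistic.
k2, pvalue = stats.normaltest(sample)
print("D'Agostino K^2 test")
print(f"statistic={k2:.3f}, pvalue={pvalue:.3f}")
```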

It turns out that, after the log transformation, only 26 outliers remain. These rows may (or may not) be outliers in other features, so I don't think I can simply remove them from the original dataset.
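To make the outlier count concrete, here is a sketch of one common way to flag points outside the boxplot fences, assuming the usual 1.5 * IQR rule. The helper name iqr_outlier_mask and the toy data are hypothetical. It returns a boolean mask instead of dropping rows, so the same rows can be checked against other features before deciding anything.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Toy data: only the last value falls outside the fences.
data = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 100.0])
mask = iqr_outlier_mask(data)
print(int(mask.sum()))  # number of flagged points
```

With a DataFrame, the mask for one column could be used to index the full frame and inspect how the flagged rows look in the other columns.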

So my question is: am I right in assuming that the transformation I applied is enough to correct the skewness of the original distribution?

Bonus points: why is the p-value in D'Agostino's test exactly 0? Shouldn't it be a small nonzero number?
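One guess I can check numerically, offered as a lead rather than a confirmed explanation: assuming the p-value is the chi-squared survival function with 2 degrees of freedom evaluated at the K^2 statistic (I believe that is how the test is defined, but treat it as an assumption), then p = exp(-statistic / 2). For a statistic this large, that value is far below what a 64-bit float can represent. Separately, printing with three decimals would show 0.000 for any p-value below 0.0005.

```python
import math
import sys

statistic = 96189.836  # K^2 statistic reported above

# Assumption: p = P(chi2 with 2 dof > statistic) = exp(-statistic / 2).
log_p = -statistic / 2.0
p_value = math.exp(log_p)  # silently underflows in double precision

print(f"log10(p) = {log_p / math.log(10):.1f}")
print(f"p as float64 = {p_value}")
print(f"smallest positive normal float64 = {sys.float_info.min}")

# Formatting alone also hides small but representable p-values.
small_p = 1e-6
print(f"pvalue={small_p:.3f}")
```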

python

pandas

data-science

normal-distribution

skew

0 Answers
