1 year ago
#373843
Sergio Rivera
How do I remove skewness from a distribution?
I am working with the well-known Credit Card Fraud Detection dataset, which includes 28 PCA-transformed columns. I'm looking at the most skewed feature of all, which, after running the following snippet, turns out to be V28:
abs_skew_values = pca.skew().abs().sort_values(ascending=False)
selected_feature = abs_skew_values.index[0] # index[0]: most skewed feature
selected_feature # 'V28'
pca is the pandas DataFrame containing the entire dataset with the PCA columns (V1, V2, V3, etc.).
Now, I wanted to test two things:
- How much does the original distribution resemble a normal distribution?
- How much skewness (left or right) is there in the original distribution?
The first thing I did was plot the histogram of the feature V28:
There are a lot of data points far from 0, and these right-skew the distribution, giving a skew score of 11.192. There are also many outliers outside the boxplot fences.
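For reference, a minimal sketch of how a skew score and a count of points outside the boxplot fences could be computed. The helper name `fence_outliers`, the 1.5 × IQR fence rule, and the synthetic `demo` series are assumptions for illustration; with the real data, `pca["V28"]` would take the place of `demo`.

```python
import numpy as np
import pandas as pd


def fence_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of points outside the Tukey fences (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Synthetic right-skewed stand-in (assumption: this is NOT the real V28 column).
rng = np.random.default_rng(0)
demo = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=10_000), name="demo")

skew_score = demo.skew()                       # same estimator as pca.skew()
n_outliers = int(fence_outliers(demo).sum())   # points outside the fences
print(f"skew={skew_score:.3f}, outliers={n_outliers}")
```

A positive score indicates a right-skewed distribution, a negative one a left-skewed distribution.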
I fixed this by applying a log transformation sign(x) * log(|x|) rather than plain log(x) because there are negative values in the distribution.
It significantly reduced the skew score to 0.184, and you can see fewer outliers in the distribution.
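A minimal sketch of the transformation described above, assuming NumPy. The function names are hypothetical. Note that sign(x) * log(|x|) is undefined at x = 0 and returns a negative value for 0 < x < 1, so the ordering of small values is not preserved. The second function, based on log1p, is a commonly used alternative and not the transformation I applied; it is shown only for comparison.

```python
import numpy as np


def signed_log(x):
    """sign(x) * log(|x|), as described in the post. Undefined at x == 0."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log(np.abs(x))


def signed_log1p(x):
    """Alternative (assumption, not the post's method): sign(x) * log(1 + |x|).

    Defined at 0 and monotonically increasing, so the ordering is preserved.
    """
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(np.abs(x))


values = np.array([-2.0, -0.5, 0.5, 2.0])
print(signed_log(values))    # the middle two values swap sign
print(signed_log1p(values))  # signs and ordering match the input
```

With the real data, either function could be applied to `pca["V28"]` and the skew recomputed on the result.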
Running some normality tests also gives some insight into how this is clearly not coming from a normal distribution.
Anderson-Darling test
---------------------
15.000: 0.576, data does not look normal (reject H0)
10.000: 0.656, data does not look normal (reject H0)
5.000: 0.787, data does not look normal (reject H0)
2.500: 0.918, data does not look normal (reject H0)
1.000: 1.092, data does not look normal (reject H0)
D'Agostino K^2 test
-------------------
statistic=96189.836, pvalue=0.000
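Output in this format can be produced with a sketch along the following lines, assuming SciPy's `scipy.stats.anderson` and `scipy.stats.normaltest` and a SciPy version in which the Anderson-Darling result exposes `critical_values` and `significance_level`. The synthetic `sample` is an assumption standing in for the transformed feature. Each Anderson-Darling line prints a significance level (in percent) and the matching critical value; H0 is rejected when the test statistic exceeds that critical value.

```python
import numpy as np
from scipy import stats

# Synthetic, clearly non-normal stand-in (assumption: not the real feature).
rng = np.random.default_rng(0)
sample = rng.lognormal(size=5_000)

print("Anderson-Darling test")
result = stats.anderson(sample, dist="norm")
for level, critical in zip(result.significance_level, result.critical_values):
    if result.statistic > critical:
        verdict = "data does not look normal (reject H0)"
    else:
        verdict = "data looks normal (fail to reject H0)"
    print(f"{level:.3f}: {critical:.3f}, {verdict}")

print("D'Agostino K^2 test")
k2, p_value = stats.normaltest(sample)
print(f"statistic={k2:.3f}, pvalue={p_value:.3f}")
```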
It turns out that, after the log transformation, there are only 26 outliers, which may (or may not) be outliers in other features; therefore, I don't think I can outright remove them from the original dataset.
So, my question is, am I right in assuming that the transformation I applied is enough to correct the skewness that originally came from the given distribution?
Bonus points: Why is the pvalue in D'Agostino's test exactly 0? Shouldn't it be a small number?
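One possible explanation, sketched with the standard library only: under H0 the K^2 statistic is approximately chi-squared with 2 degrees of freedom, whose survival function is exp(-x/2). For the statistic reported above, that value is far below the smallest positive double, so it underflows to 0.0. In addition, any p-value below 0.0005 prints as 0.000 when formatted to three decimals. The variable names here are illustrative.

```python
import math

k2 = 96189.836  # statistic reported in the output above

# Chi-squared survival function with 2 degrees of freedom: exp(-x / 2).
p_value = math.exp(-k2 / 2)         # underflows in double precision
log10_p = -k2 / (2 * math.log(10))  # order of magnitude of the true p-value

print(p_value)             # 0.0
print(log10_p)             # a large negative exponent
print(f"{1e-10:.3f}")      # 0.000, three-decimal formatting hides small values
```

So the true p-value is a tiny positive number that simply cannot be represented as a float.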
python
pandas
data-science
normal-distribution
skew
0 Answers