YoshiLeo

How to optimize Spark Correlation to work similarly to pandas correlation?

I am trying to get the correlation between columns in a large dataset in Spark.

Initially I followed the documentation for the Spark MLlib Correlation:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

def calculate_corr(df):
    # ALL_MY_COLUMN is a placeholder for my list of column names.
    # handleInvalid="keep" retains rows with nulls by turning them into NaN
    vector_col = "corr_features"
    assembler = VectorAssembler(inputCols=[ALL_MY_COLUMN], outputCol=vector_col, handleInvalid="keep")
    df_vector = assembler.transform(df).select(vector_col)

    corr_matrix = Correlation.corr(df_vector, vector_col, method=correlation_type)
    return corr_matrix

This worked fine initially, but after some analysis I realized the correlations did not make much sense. So I loaded a sample of the data and ran both the pandas and the Spark correlation methods on it, which gave me different results.

I figured out that the reason for the different results was the null values I had in my dataframe.
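To make the difference concrete, here is a toy version of the kind of check I mean (made-up data, not my actual columns):

import pandas as pd

pdf = pd.DataFrame({"a": [1.0, 2.0, None, 4.0],
                    "b": [2.0, None, 6.0, 8.0]})

# pandas silently drops the rows where either value is NaN for each
# pair (pairwise deletion), so this correlation uses only rows 0 and 3
print(pdf["a"].corr(pdf["b"]))  # 1.0

# Spark keeps those rows instead: VectorAssembler with
# handleInvalid="keep" turns the nulls into NaN, and the NaNs
# propagate into the correlation result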

So I tried using Spark's correlation function on a few column pairs, selecting only the rows that were not null in both columns. Doing so gave me the same result as the pandas function.

My question is: is there a way to make Spark automatically ignore the null values when taking the correlation of all columns in the dataframe, just like pandas does?
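For reference, the pandas behaviour I mean is the whole-frame call, where every cell of the matrix is computed from the rows where that particular pair of columns is non-null (illustrative data again):

import pandas as pd

pdf = pd.DataFrame({"a": [1.0, 2.0, None, 4.0],
                    "b": [2.0, None, 6.0, 8.0],
                    "c": [5.0, 1.0, 3.0, 2.0]})

# Each pair may be computed from a different subset of rows;
# pandas handles the null filtering per pair automatically
print(pdf.corr(method="pearson"))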

Currently I have a loop in my code that calculates the correlation of every possible column pair while filtering out nulls, but this takes very long:

def calculate_corr(colA, colB):
    vector_col = "corr_features"
    # Keep only the rows where both columns are non-null, matching
    # pandas' pairwise deletion
    pair_df = df.dropna(subset=[colA, colB])
    assembler = VectorAssembler(inputCols=[colA, colB], outputCol=vector_col, handleInvalid="keep")
    df_vector = assembler.transform(pair_df).select(vector_col)

    corr_matrix = Correlation.corr(df_vector, vector_col, method=correlation_type)
    # The result is a single-row DataFrame holding the 2x2 matrix;
    # the output column is named "<method>(<vector_col>)"
    corr_coef = corr_matrix.collect()[0][f"{correlation_type}({vector_col})"].toArray()[0, 1]

    return corr_coef

for columnA, columnB in column_pairs:
    coefficient = calculate_corr(columnA, columnB)

Any suggestions on how to improve the performance? Maybe running the calculations in parallel and storing the results somewhere?
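Something along these lines is what I had in mind for the parallel part (a rough sketch only, not tested at scale; calculate_corr is the function above and the pool size is arbitrary):

from concurrent.futures import ThreadPoolExecutor

# A single SparkSession accepts jobs from several threads at once,
# so each pair's small correlation job can be submitted concurrently
results = {}
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = {pool.submit(calculate_corr, a, b): (a, b) for a, b in column_pairs}
    for future, pair in futures.items():
        results[pair] = future.result()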

Tags: python, apache-spark, pyspark, pearson-correlation
