PySpark: scalable OneHotEncoder with binary-column representation

I've been trying to find the best way to one-hot encode a large number of categorical columns (e.g. 200+), each with an average of 4 categories, on a very large number of rows. I've found two ways to do this:

  1. Using StringIndexer + OneHotEncoder + VectorAssembler + Pipeline from pyspark.ml (see the sketch after this list). I like this approach because I can chain several of these transformers and get a final one-hot-encoded vector representation.
  2. Appending binary columns to the main DataFrame (using .withColumn), where each column represents one category of a categorical feature.
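
A minimal sketch of what I mean by point 1, assuming Spark 3.x (where OneHotEncoder accepts inputCols/outputCols); the DataFrame and column names are just illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for the real data: two categorical columns.
df = spark.createDataFrame(
    [("red", "small"), ("blue", "large"), ("green", "small")],
    ["color", "size"],
)
cat_cols = ["color", "size"]

indexers = [StringIndexer(inputCol=c, outputCol=f"{c}_idx") for c in cat_cols]
encoder = OneHotEncoder(
    inputCols=[f"{c}_idx" for c in cat_cols],
    outputCols=[f"{c}_ohe" for c in cat_cols],
    dropLast=False,  # keep one vector slot per category
)
assembler = VectorAssembler(
    inputCols=[f"{c}_ohe" for c in cat_cols], outputCol="features"
)

model = Pipeline(stages=indexers + [encoder, assembler]).fit(df)
encoded = model.transform(df)
```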

Since I'll be doing the modelling (including feature selection and data balancing) with Python packages, I need the one-hot-encoded features in the format of point 2. And since .withColumn is computationally very expensive, it isn't feasible for the number of columns this would produce (roughly 200 columns × 4 categories ≈ 800 new columns). Is there any way to get a reference to the column names after fitting the pipeline described in point 1? Or is there a scalable version of point 2 that doesn't involve appending columns to the DataFrame one at a time?
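
On the first question, I've seen it suggested that Spark attaches ML attribute metadata to the assembled vector column, and that per-slot names can be recovered from it. A sketch of what I mean, continuing from `encoded` above (the exact name format of the entries is an assumption on my part and may vary across Spark versions):

```python
# Recover a name for each slot of the "features" vector from the
# ML attribute metadata that VectorAssembler writes into the schema.
attrs = encoded.schema["features"].metadata["ml_attr"]["attrs"]
slots = sorted(
    (a["idx"], a["name"])
    for group in attrs.values()  # attribute groups, e.g. "binary"/"numeric"
    for a in group
)
slot_names = [name for _, name in slots]  # slot_names[i] labels features[i]
```

On the second question, the closest I've come to a scalable variant of point 2 is replacing the sequential .withColumn loop with a single select that builds every binary column at once, since each .withColumn call adds another projection to the query plan while one select keeps the plan flat:

```python
from pyspark.sql import functions as F

exprs = []
for c in cat_cols:
    # One small distinct() job per column; cheap with ~4 categories each.
    categories = [r[0] for r in df.select(c).distinct().collect()]
    exprs += [
        F.when(F.col(c) == cat, 1).otherwise(0).alias(f"{c}_{cat}")
        for cat in categories
    ]

# One flat projection instead of ~800 chained withColumn calls.
binary_df = df.select("*", *exprs)
```

If that's a reasonable direction, binary_df.toPandas() (or a write to Parquet) would then hand plain integer columns to the Python packages.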

Tags: python, apache-spark, pyspark, one-hot-encoding
