1 year ago
#366855
tomas-silveira
PySpark: scalable OneHotEncoder with binary representation
I've been trying to find the best way to one-hot encode a large number of categorical columns (e.g. 200+), each with an average of 4 categories, on a very large number of instances. I've found two ways to do this:

- Using `StringIndexer` + `OneHotEncoder` + `VectorAssembler` + `Pipeline` from `pyspark.ml`. I like this approach because I can just chain several of these transformers and get a final one-hot encoded vector representation (first sketch below).
- Appending binary columns to the main dataframe (using `.withColumn`), where each column represents one category of a categorical feature (second sketch below).
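For concreteness, here is a minimal sketch of the first approach. The column list `cat_cols` and the dataframe `df` are placeholders for my real data:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

cat_cols = ["colour", "size"]  # placeholder; in reality 200+ columns

# One StringIndexer + OneHotEncoder pair per categorical column,
# then a VectorAssembler to merge everything into a single vector.
indexers = [
    StringIndexer(inputCol=c, outputCol=c + "_idx", handleInvalid="keep")
    for c in cat_cols
]
encoders = [
    OneHotEncoder(inputCol=c + "_idx", outputCol=c + "_ohe")
    for c in cat_cols
]
assembler = VectorAssembler(
    inputCols=[c + "_ohe" for c in cat_cols], outputCol="features"
)

pipeline = Pipeline(stages=indexers + encoders + [assembler])
model = pipeline.fit(df)
encoded = model.transform(df)  # "features" is a single sparse vector column
```

And the second approach, which gives me the plain binary columns I need but chains one `.withColumn` call per category:

```python
import pyspark.sql.functions as F

for c in cat_cols:
    # Enumerate the distinct categories of this column (small: ~4 each).
    categories = [row[0] for row in df.select(c).distinct().collect()]
    for cat in categories:
        # One new 0/1 column per (column, category) pair.
        df = df.withColumn(
            f"{c}_{cat}", F.when(F.col(c) == cat, 1).otherwise(0)
        )
```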
Since I'll be doing the modelling (including feature selection and data balancing) with Python packages, I need the one-hot encoded features laid out as plain binary columns, as in Point 2. And since calling `.withColumn` repeatedly is computationally very expensive, Point 2 as written is not feasible for the number of columns I end up with. Is there any way to get a reference to the generated column names after fitting the Pipeline described in Point 1? Or is there a scalable version of Point 2 that doesn't sequentially append columns to the dataframe?
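One direction I've been considering for the second question is building every binary column as an expression and applying them all in a single `select`, so the query plan grows once rather than once per category. A minimal sketch (the per-column `distinct()` scan is just my assumed way of enumerating categories):

```python
import pyspark.sql.functions as F

binary_cols = []
for c in cat_cols:
    categories = [row[0] for row in df.select(c).distinct().collect()]
    binary_cols += [
        F.when(F.col(c) == cat, 1).otherwise(0).alias(f"{c}_{cat}")
        for cat in categories
    ]

# Single select: all original columns plus every binary indicator at once.
encoded = df.select(*df.columns, *binary_cols)
```

I haven't verified whether this actually avoids the slowdown at 200+ columns, which is partly why I'm asking.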
Tags: python, apache-spark, pyspark, one-hot-encoding