1 year ago

#385792

smati

PySpark mergeSchema on read: Parquet vs Avro

I have around 200 Parquet files, each with a different schema. Reading them with mergeSchema enabled takes almost 2 hours. If I instead create equivalent Avro files and read them with the mergeSchema option (available only on Databricks runtime 9.3 LTS), the merge finishes within 5 minutes.
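For reference, a minimal sketch of the two reads being compared. It assumes an existing SparkSession named `spark`; the helper name `read_with_merged_schema` and the directory paths are placeholders for illustration, not the actual job. The Avro `mergeSchema` option is the Databricks-specific one mentioned above, so the second call would only work on such a runtime.

```python
def read_with_merged_schema(spark, path, fmt):
    """Read all files under `path` in format `fmt`, merging per-file schemas.

    `spark` is an existing SparkSession; `path` and `fmt` are supplied by the
    caller (hypothetical values are shown in the usage comments below).
    """
    return (
        spark.read.format(fmt)
        .option("mergeSchema", "true")  # ask the reader to merge the file schemas
        .load(path)
    )


# Usage (placeholder paths, assumed for illustration):
# parquet_df = read_with_merged_schema(spark, "/data/parquet_files/", "parquet")
# avro_df = read_with_merged_schema(spark, "/data/avro_files/", "avro")
```

Both calls go through the same reader API; only the format string and the input directory differ, so the timing gap comes from how each format source performs the merge.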

Question - Why does the Parquet schema merge on read take so long, whereas the same merge over Avro files is much faster?

pyspark

databricks

0 Answers
