1 year ago
#385792
smati
PySpark mergeSchema on Read operation Parquet vs Avro
I have around 200 parquet files with each parquet file having a different schema and I am trying to read these parquet files using mergeSchema enabled during read and it takes almost 2 hours. If I instead create equivalent Avro files and try to read them using the mergeSchema option on read ( Available only on Databricks runtime 9.3 LTS ) , it can do the merge within 5 minutes.
Question - Why does Parquet Schema merge on Read take too long whereas the Avro files are faster ?
pyspark
databricks
0 Answers
Your Answer