1 year ago


dragonachu

pyspark csv format - mergeschema

I have a large data dump spanning several terabytes. The files contain daily activity data: day 1 may have 2 columns and day 2 may have 3. The dump is in CSV format, and I now need to read all these files and load them into a single table. The problem is that, with CSV, I'm not sure how to merge the schemas without losing any columns. I know this can be achieved in Parquet through mergeSchema, but I can't convert the files one by one to Parquet because the data is huge. Is there any way to merge schemas when the format is CSV?
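The schema merge being asked for can be sketched without Spark: union the headers of all files in first-seen order, and fill missing columns with nulls. The sample data below (`day1`, `day2`, a hypothetical `merged_rows` helper) is illustrative, not from the question:

```python
import csv
import io

# Hypothetical sample data: day 1 has 2 columns, day 2 adds a third.
day1 = "user,clicks\nalice,3\nbob,5\n"
day2 = "user,clicks,country\nalice,2,US\ncarol,7,DE\n"

def merged_rows(csv_texts):
    """Merge rows from CSVs with differing headers into one schema.

    Columns are unioned in first-seen order; rows that lack a column
    get None, mirroring what Parquet's mergeSchema produces.
    """
    columns = []
    parsed = []
    for text in csv_texts:
        reader = csv.DictReader(io.StringIO(text))
        for name in reader.fieldnames:
            if name not in columns:
                columns.append(name)
        parsed.extend(reader)
    return columns, [[row.get(c) for c in columns] for row in parsed]

cols, rows = merged_rows([day1, day2])
print(cols)     # ['user', 'clicks', 'country']
print(rows[0])  # ['alice', '3', None]
```

In Spark itself (3.1+), the same effect can be had without converting to Parquet by reading each day's files into its own DataFrame and combining them with `df1.unionByName(df2, allowMissingColumns=True)`, or by passing an explicit superset schema to `spark.read.csv`.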

apache-spark

pyspark

file-format

0 Answers
