Databricks Delta tables from json files: Ignore initial load when running COPY INTO
I am working with Databricks on AWS. I have mounted an S3 bucket as /mnt/bucket-name/. This bucket contains json files under the prefix jsons. I create a Delta table from these json files as follows:
%python
df = spark.read.json('/mnt/bucket-name/jsons')
df.write.format('delta').save('/mnt/bucket-name/delta')
%sql
CREATE TABLE IF NOT EXISTS default.table_name
USING DELTA
LOCATION '/mnt/bucket-name/delta'
So far, so good. Then new json files arrive in the bucket. In order to update the Delta table, I run the following:
%sql
COPY INTO default.table_name
FROM '/mnt/bucket-name/jsons'
FILEFORMAT = JSON
This does indeed update the Delta table, but it duplicates the rows contained in the initial load, i.e. the rows in df are now contained in table_name twice. I have the following workaround, whereby I create an empty dataframe with the correct schema:
%python
df_schema = spark.read.json('/mnt/bucket-name/jsons').schema
df = spark.createDataFrame([], df_schema)
df.write.format('delta').save('/mnt/bucket-name/delta')
%sql
CREATE TABLE IF NOT EXISTS default.table_name
USING DELTA
LOCATION '/mnt/bucket-name/delta'
%sql
COPY INTO default.table_name
FROM '/mnt/bucket-name/jsons'
FILEFORMAT = JSON
This works and there is no duplication, but it seems neither elegant nor efficient, since spark.read.json('/mnt/bucket-name/jsons').schema reads all the json files, even though only the schema needs to be inferred. (The schema of the json files can be assumed to be stable.) Is there a way to tell COPY INTO to ignore the initial json files? There's the option modifiedAfter, but that would be cumbersome and doesn't sit well with idempotent re-runs. I also considered recreating the dataframe and then running df.write.format('delta').mode('append').save('/mnt/bucket-name/delta') followed by REFRESH TABLE default.table_name, but this seems inefficient, since why should the initial json files be read again? (Edit: This method also duplicates the initial load.)
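(A thought on making the workaround's schema step cheaper, assuming the schema really is stable across files: the JSON reader's samplingRatio option restricts schema inference to a fraction of the input records. I haven't measured how much this actually saves, so the following is only a sketch.)
%python
# Sketch: infer the schema from a sample of the json records instead of all of them.
# Assumes the schema is stable; 0.1 is an arbitrary illustrative sampling fraction.
df_schema = (
    spark.read
    .option('samplingRatio', 0.1)
    .json('/mnt/bucket-name/jsons')
    .schema
)
df = spark.createDataFrame([], df_schema)
df.write.format('delta').save('/mnt/bucket-name/delta')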
Or is there a way to circumvent using a Spark dataframe entirely and create a Delta table from the json files directly? I have searched for such a solution but to no avail.
One last point: Schema inference is crucial and so I do not want a solution that requires the schema of the json files to be written out manually.
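For concreteness, the kind of direct creation I have in mind would be something along the lines of the CTAS sketch below, which reads the json files via Spark SQL's file-path syntax and infers the schema. I assume this alone would not solve the duplication problem, since COPY INTO presumably only skips files that it has loaded itself:
%sql
-- Sketch only: create the Delta table straight from the json files with an inferred schema.
-- Assumes the target location is empty; COPY INTO would presumably still treat the
-- initial json files as unseen and load them again.
CREATE TABLE default.table_name
USING DELTA
LOCATION '/mnt/bucket-name/delta'
AS SELECT * FROM json.`/mnt/bucket-name/jsons`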
Tags: json, apache-spark, databricks, delta-lake