Databricks Delta tables from json files: Ignore initial load when running COPY INTO

I am working with Databricks on AWS. I have mounted an S3 bucket as /mnt/bucket-name/. This bucket contains json files under the prefix jsons. I create a Delta table from these json files as follows:

%python
# Initial load: read all existing json files and write them out as a Delta table.
df = spark.read.json('/mnt/bucket-name/jsons')
df.write.format('delta').save('/mnt/bucket-name/delta')

%sql
CREATE TABLE IF NOT EXISTS default.table_name
USING DELTA
LOCATION '/mnt/bucket-name/delta'

So far, so good. Then new json files arrive in the bucket. In order to update the Delta table, I run the following:

%sql
COPY INTO default.table_name
FROM '/mnt/bucket-name/jsons'
FILEFORMAT = JSON

This does indeed update the Delta table, but it duplicates the rows contained in the initial load, i.e. the rows in df are now contained in table_name twice. I have the following workaround, whereby I create an empty dataframe with the correct schema:

%python
# Infer the schema from the json files without keeping any of their rows.
df_schema = spark.read.json('/mnt/bucket-name/jsons').schema
# Write an empty DataFrame with that schema so the Delta table starts out empty.
df = spark.createDataFrame([], df_schema)
df.write.format('delta').save('/mnt/bucket-name/delta')

%sql
CREATE TABLE IF NOT EXISTS default.table_name
USING DELTA
LOCATION '/mnt/bucket-name/delta'

%sql
COPY INTO default.table_name
FROM '/mnt/bucket-name/jsons'
FILEFORMAT = JSON

This works and there is no duplication, but it seems neither elegant nor efficient, since spark.read.json('/mnt/bucket-name/jsons').schema reads all the json files even though only the schema needs to be inferred. (The schema of the json files can be assumed to be stable.)

Is there a way to tell COPY INTO to ignore the initial json files? There is the modifiedAfter option, but that would be cumbersome and doesn't play well with idempotency. I also considered recreating the dataframe and then running df.write.format('delta').mode('append').save('/mnt/bucket-name/delta') followed by REFRESH TABLE default.table_name, but that seems inefficient: why should the initial json files be read again?

Edit: This method also duplicates the initial load.
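Regarding the schema-inference cost: I suppose the workaround could at least be made cheaper by inferring the schema from a single file instead of the whole prefix. A rough sketch (sample.json is just a placeholder for any one of the existing json files):

%python
# Sketch: infer the schema from one sample file rather than scanning the whole prefix.
# 'sample.json' is a hypothetical file name; any existing json file under the prefix would do.
df_schema = spark.read.json('/mnt/bucket-name/jsons/sample.json').schema
# As in the workaround above: write an empty DataFrame with that schema to create the Delta table.
spark.createDataFrame([], df_schema).write.format('delta').save('/mnt/bucket-name/delta')

This still reads one file in full, but it avoids scanning every json file just to obtain the schema.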

Or is there a way to circumvent using a Spark dataframe entirely and create a Delta table from the json files directly? I have searched for such a solution but to no avail.
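For concreteness, this is roughly what I am imagining, assuming an empty, schemaless Delta table can serve as the COPY INTO target and the schema can be picked up from the json files themselves (I have not confirmed that COPY_OPTIONS ('mergeSchema' = 'true') behaves this way for json):

%sql
-- Sketch only: create an empty Delta table with no columns and let COPY INTO supply the schema.
-- That mergeSchema lets COPY INTO evolve the empty table's schema from the json files is an
-- assumption on my part.
CREATE TABLE IF NOT EXISTS default.table_name;

COPY INTO default.table_name
FROM '/mnt/bucket-name/jsons'
FILEFORMAT = JSON
COPY_OPTIONS ('mergeSchema' = 'true')

Since the table would start out empty and COPY INTO tracks which files it has already loaded, the initial json files would only ever be ingested once.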

One last point: Schema inference is crucial and so I do not want a solution that requires the schema of the json files to be written out manually.

Tags: json, apache-spark, databricks, delta-lake
