1 year ago
#361689
M4rk
From Postgres CDC to BigQuery with materialized views
I'm designing a Change Data Capture (CDC) pipeline from our Postgres database to BigQuery.
I came across what looks like a powerful combination of:
- the Google BigQuery Sink connector for Kafka Connect: https://docs.confluent.io/kafka-connect-bigquery/current/overview.html
- BigQuery Materialized Views: https://cloud.google.com/bigquery/docs/materialized-views-intro
My intuition is that by combining these two features I should be able to get always-up-to-date, incrementally synced tables with the least amount of engineering.
However, the lack of documentation for this combination (and the fact that Google's own docs suggest different approaches) makes me skeptical that it would work. For example, from the Google docs:
- This seems a good approach, but it falls short on explaining why materialized views don't work, and just names a scheduled merge job instead (sketched below): https://cloud.google.com/architecture/database-replication-to-bigquery-using-change-data-capture
- This one falls short on the compaction of changes: https://cloud.google.com/architecture/capturing-change-logs-with-debezium
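(For reference, my reading of the first doc is that the scheduled job would be a periodic MERGE roughly like the sketch below; myDataset.myTable_changelog is a placeholder for wherever the connector lands the raw change events, with the same columns as the base table.)

    -- Sketch: compact the changelog to the latest version per id,
    -- then merge it into the main table.
    MERGE myDataset.myTable AS t
    USING (
      SELECT * EXCEPT (rn)
      FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY version DESC) AS rn
        FROM myDataset.myTable_changelog
      )
      WHERE rn = 1
    ) AS s
    ON t.id = s.id
    WHEN MATCHED AND s.soft_delete THEN
      DELETE
    WHEN MATCHED AND s.version > t.version THEN
      UPDATE SET version = s.version,
                 json_values = s.json_values,
                 soft_delete = s.soft_delete
    WHEN NOT MATCHED AND NOT s.soft_delete THEN
      INSERT (id, version, json_values, soft_delete)
      VALUES (s.id, s.version, s.json_values, s.soft_delete);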
So yes, the idea would be to have a materialized view such as the following. Given a table composed of the columns (id, version, json_values, soft_delete), and with the view and dataset names as placeholders:

    CREATE MATERIALIZED VIEW myDataset.myTable_latest AS
    SELECT *
    FROM myDataset.myTable
    -- '-' separator avoids collisions like (id=1, version=23) vs (id=12, version=3)
    WHERE CONCAT(CAST(id AS STRING), '-', CAST(version AS STRING)) IN (
      SELECT CONCAT(CAST(id AS STRING), '-', CAST(MAX(version) AS STRING))
      FROM myDataset.myTable
      GROUP BY id
    );
Would this be extremely inefficient, or simply not work?
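For comparison, here is the same latest-version-per-id filter written with a window function, which is how I'd express it as a plain view or scheduled query if the IN subquery turns out to be a problem (table name is again a placeholder):

    SELECT *
    FROM myDataset.myTable
    WHERE TRUE  -- BigQuery requires WHERE/GROUP BY/HAVING alongside QUALIFY
    QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY version DESC) = 1;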
System size:
- about 1,000 tables
- 1M to 500M rows each
- 10-50M updates/day total across all tables
Expected cost:
- 5-15k/month
postgresql
google-bigquery
gcloud
debezium
cdc