1 year ago
#361689
M4rk
From Postgres CDC to BigQuery with materialized views
I'm designing a Change Data Capture (CDC) pipeline from our Postgres database to BigQuery.
I came across what looks like a powerful combination of:
- the Google BigQuery Sink connector for Kafka Connect: https://docs.confluent.io/kafka-connect-bigquery/current/overview.html
- BigQuery Materialized Views: https://cloud.google.com/bigquery/docs/materialized-views-intro
My intuition is that by combining these two features I should be able to get always-up-to-date, incrementally synced tables with the least amount of engineering.
However, the lack of documentation for this combination (and the fact that Google's own docs suggest different approaches) makes me skeptical that it would work. For example, from the Google docs:
- This seems a good approach, but it falls short on explaining why materialized views don't work, and just names a scheduled merge job instead (sketched below): https://cloud.google.com/architecture/database-replication-to-bigquery-using-change-data-capture
- This one falls short on the compaction of changes: https://cloud.google.com/architecture/capturing-change-logs-with-debezium
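(For reference, my reading of the first doc is that the scheduled job would be a periodic MERGE roughly like the sketch below; myDataset.myTable_changelog is a placeholder for wherever the connector lands the raw change events, with the same columns as the base table.)

    -- Sketch: compact the changelog to the latest version per id,
    -- then merge it into the main table.
    MERGE myDataset.myTable AS t
    USING (
      SELECT * EXCEPT (rn)
      FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY version DESC) AS rn
        FROM myDataset.myTable_changelog
      )
      WHERE rn = 1
    ) AS s
    ON t.id = s.id
    WHEN MATCHED AND s.soft_delete THEN
      DELETE
    WHEN MATCHED AND s.version > t.version THEN
      UPDATE SET version = s.version,
                 json_values = s.json_values,
                 soft_delete = s.soft_delete
    WHEN NOT MATCHED AND NOT s.soft_delete THEN
      INSERT (id, version, json_values, soft_delete)
      VALUES (s.id, s.version, s.json_values, s.soft_delete);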
So yes, the idea would be to have a materialized view such as the following. Given a table composed of the columns (id, version, json_values, soft_delete), and with the view and dataset names as placeholders:

    CREATE MATERIALIZED VIEW myDataset.myTable_latest AS
    SELECT *
    FROM myDataset.myTable
    -- '-' separator avoids collisions like (id=1, version=23) vs (id=12, version=3)
    WHERE CONCAT(CAST(id AS STRING), '-', CAST(version AS STRING)) IN (
      SELECT CONCAT(CAST(id AS STRING), '-', CAST(MAX(version) AS STRING))
      FROM myDataset.myTable
      GROUP BY id
    );
Would this be extremely inefficient, or simply not work?
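For comparison, here is the same latest-version-per-id filter written with a window function, which is how I'd express it as a plain view or scheduled query if the IN subquery turns out to be a problem (table name is again a placeholder):

    SELECT *
    FROM myDataset.myTable
    WHERE TRUE  -- BigQuery requires WHERE/GROUP BY/HAVING alongside QUALIFY
    QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY version DESC) = 1;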
System size:
- about 1,000 tables
- 1M to 500M rows each
- 10-50M updates/day total across all tables
Expected cost:
- 5-15k/month
postgresql
google-bigquery
gcloud
debezium
cdc