Tuesday, 23 December 2025

Google BigQuery - the good and the odd parts of the columnar store, but not a document store yet

Good parts:

Google BigQuery is a truly Big Data-scale, performant storage system with SQL querying capabilities. It is also positioning itself as a useful part of AI platforms, for RAG enhancements, vector search and more.

Note: this post does not aspire to be a full review of BigQuery; it is rather notes from the battlefield, taken while I was approaching problems and needed efficient solutions.


JSON - work in progress - and there actually is progress

It is a columnar store. Some field types, while available, may not be as useful as in other database systems. For example, keeping complex data in a JSON column does not give you the quick, direct subqueries over JSON that you get with PostgreSQL or any Postgres-compatible database, such as Google Cloud AlloyDB or AWS RDS; those all offer far more advanced querying and efficient indexing for JSON-like structured documents kept in fields. In BigQuery you can build a query to retrieve what you want, but it will run a full scan of your column, which means slower and more costly queries. In such a case, a good option can be to create a search index on the fields of your choice, as in the sketch below.
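
To make this concrete, here is a minimal sketch using the Python client library, assuming a hypothetical table my_dataset.events with a JSON column named payload; JSON_VALUE, CREATE SEARCH INDEX and SEARCH are regular BigQuery SQL:

    # Minimal sketch; the table my_dataset.events and its JSON column
    # "payload" are hypothetical placeholders.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses Application Default Credentials

    # Extracting values from a JSON column works, but scans the whole column.
    rows = client.query("""
        SELECT JSON_VALUE(payload, '$.user.id') AS user_id
        FROM `my_dataset.events`
        WHERE JSON_VALUE(payload, '$.status') = 'failed'
    """).result()
    for row in rows:
        print(row.user_id)

    # A search index over the JSON column...
    client.query("""
        CREATE SEARCH INDEX IF NOT EXISTS events_payload_idx
        ON `my_dataset.events`(payload)
    """).result()

    # ...lets queries that go through the SEARCH function avoid full scans.
    hits = client.query("""
        SELECT COUNT(*) AS failed_events
        FROM `my_dataset.events`
        WHERE SEARCH(payload, 'failed')
    """).result()

Note that the plain JSON_VALUE query above is still a full scan; only queries that can use the index, such as those going through SEARCH, benefit from it.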


Odd parts:

BigQuery lacks good tooling for typical operations like:

- full, correct backup/restore of tables

- different formats support a different range of BigQuery capabilities, field types and kinds of compression, for example:
    - Parquet and Avro seem to be the best choices, yet Parquet cannot handle BigQuery JSON fields, and Avro by itself does not move date-time/timestamp fields correctly - you need to help it with explicit schemas (a sketch of such a round trip follows this list)

- due to such constraints, moving a dataset between projects can be adventurous
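
To illustrate, here is a minimal sketch of such a round trip through Cloud Storage with the Python client; the bucket, dataset and table names are hypothetical placeholders. For the timestamp issue I use the use_avro_logical_types flag, with an explicit schema on the load job being the alternative way to help Avro along:

    # Minimal sketch of export + load as a "backup"; all names below
    # are hypothetical placeholders.
    from google.cloud import bigquery

    client = bigquery.Client()

    # Export: with use_avro_logical_types, TIMESTAMP stays a timestamp
    # in the Avro files instead of degrading to a plain integer.
    extract_config = bigquery.ExtractJobConfig(
        destination_format=bigquery.DestinationFormat.AVRO,
        use_avro_logical_types=True,
    )
    client.extract_table(
        "my_project.my_dataset.my_table",
        "gs://my-bucket/backup/my_table-*.avro",
        job_config=extract_config,
    ).result()

    # Restore: the same flag is needed on the way back in.
    load_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.AVRO,
        use_avro_logical_types=True,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    client.load_table_from_uri(
        "gs://my-bucket/backup/my_table-*.avro",
        "my_project.my_dataset.my_table_restored",
        job_config=load_config,
    ).result()

Even then, table descriptions and partitioning/clustering settings do not travel with the files, which is part of why I would hesitate to call this a full, correct backup/restore.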

 

Tooling and permissions: bq is different from using the API

Shell scripts using the bq command line tool usually run with the permissions of your GCP account, while Python scripts, executed the same way, use API calls that need a service account with the expected BigQuery permissions.

While the Python API access may be more complicated to set up permission-wise, it pays off in much easier development and maintenance of the tooling, especially when you want to run a lot of asynchronous operations, as in the sketch below.
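
For example, here is a minimal sketch of the API style, assuming a hypothetical service account key file key.json and hypothetical table names; client.query() returns as soon as the job is submitted, which is what makes running many operations in parallel easy:

    # Minimal sketch; key.json and the table names are hypothetical.
    from google.cloud import bigquery

    # Explicit service account credentials instead of the gcloud user.
    client = bigquery.Client.from_service_account_json("key.json")

    # Each call submits a query job and returns without waiting,
    # so the three jobs below run concurrently on the BigQuery side.
    jobs = {
        table: client.query(f"SELECT COUNT(*) AS n FROM `my_dataset.{table}`")
        for table in ("orders", "users", "events")
    }

    # .result() blocks until the given job finishes.
    for table, job in jobs.items():
        count = next(iter(job.result())).n
        print(table, count)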


Not so flexible parts:

- You cannot rename datasets

- There is no easy way to efficiently move data between GCP accounts; however, the Data Transfer Service (DTS) may be the helpful missing link (see the sketch below)
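
For dataset copies specifically, DTS exposes a cross_region_copy data source that also works across projects. Here is a minimal sketch with the google-cloud-bigquery-datatransfer client, where all project and dataset names are hypothetical placeholders:

    # Minimal sketch of a DTS dataset copy; all names are hypothetical.
    from google.cloud import bigquery_datatransfer

    client = bigquery_datatransfer.DataTransferServiceClient()

    transfer_config = bigquery_datatransfer.TransferConfig(
        destination_dataset_id="dest_dataset",
        display_name="copy source_dataset into dest-project",
        data_source_id="cross_region_copy",  # DTS dataset copy source
        params={
            "source_project_id": "source-project",
            "source_dataset_id": "source_dataset",
        },
        schedule="every 24 hours",
    )

    transfer_config = client.create_transfer_config(
        parent=client.common_project_path("dest-project"),
        transfer_config=transfer_config,
    )
    print("Created transfer config:", transfer_config.name)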


Summary:

A very interesting product with an ecosystem of tools and data-pipeline solutions, but the deeper you go, the more advanced your understanding of the choices and the pros/cons of the chosen path needs to be.

Update after 6 months:

The BigQuery ecosystem is evolving, and it looks like Google is actually putting good effort into developing this product and its ecosystem of tooling. Tools like DTS can help a lot both in day-to-day work and in these less frequent operations.
