Tuesday, 23 December 2025

Google BigQuery - the good and the odd parts of a columnar store, but maybe not a document store yet


[in picture: Big Easy - Elizabeth Line station at Canary Wharf, London, before the line opened]

Good parts:

Google BigQuery is a truly Big Data-scale, performant storage system with SQL querying capabilities. It is also positioning itself as a useful part of AI platforms, for RAG enhancements, vector search and more.

Note: this post does not aspire to be a full review of BigQuery; these are rather notes from the battlefield, taken while I was approaching problems and needed efficient solutions.


JSON - work in progress - and there actually is progress

BigQuery is a columnar store. Some field formats, while available, may not be as useful as in other database systems. For example, keeping complex data inside a JSON field does not give you the quick, direct subqueries over JSON datatypes that PostgreSQL or any Postgres-compatible database, such as Google Cloud AlloyDB or AWS RDS, offers; those systems provide far more advanced querying and efficient indexing for dealing with JSON-like structured documents in fields. In BigQuery you can build a query to retrieve what you want, but it will perform a full scan of your column, which means slower and more costly. In such cases, a good option can be to create a search index on the fields of your choice.
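
To make that concrete, here is a minimal sketch with hypothetical project, dataset, table and column names (my-project, my_dataset, events, payload), querying a JSON column and then creating a search index so that needle-in-a-haystack lookups do not always pay for a full scan:

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # assumes default credentials

    # JSON_VALUE works on JSON columns, but without an index it scans the whole column.
    query = """
        SELECT JSON_VALUE(payload, '$.customer.id') AS customer_id
        FROM `my-project.my_dataset.events`
        WHERE JSON_VALUE(payload, '$.customer.country') = 'PL'
    """
    for row in client.query(query).result():
        print(row.customer_id)

    # A search index can make point lookups cheaper; it is used by e.g. the
    # SEARCH() function: WHERE SEARCH(payload, 'PL')
    client.query("""
        CREATE SEARCH INDEX IF NOT EXISTS events_search_idx
        ON `my-project.my_dataset.events` (ALL COLUMNS)
    """).result()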


Odd parts:

BigQuery lacks good tooling for typical operations like:

- full, correct backup/restore of tables

There are caveats such as:

- different output formats for your tables support different ranges of BigQuery capabilities, fields and types of compression, for example:
    - Parquet and Avro seem to be the best choices, yet Parquet cannot handle BigQuery JSON fields, and Avro cannot correctly carry date-time/timestamp fields on its own - you need to help it with explicit schemas, which you back up to separate files and use while restoring the data (see the sketch after this list).

- due to such constraints, moving datasets between projects can be adventurous
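
As an illustration, here is a minimal sketch of that backup/restore workflow with the Python client, using hypothetical names (my-project, my_dataset.orders, a my-backup-bucket GCS bucket). It is one possible approach, not an official recipe:

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")
    table_id = "my-project.my_dataset.orders"
    gcs_uri = "gs://my-backup-bucket/orders/*.avro"

    # 1. Backup: extract to Avro, keeping TIMESTAMP/DATETIME as Avro logical types.
    extract_config = bigquery.ExtractJobConfig(
        destination_format=bigquery.DestinationFormat.AVRO,
        use_avro_logical_types=True,
    )
    client.extract_table(table_id, gcs_uri, job_config=extract_config).result()

    # 2. Save the table schema next to the files - this is the "explicit schema"
    #    used during the restore.
    client.schema_to_json(client.get_table(table_id).schema, "orders_schema.json")

    # 3. Restore: recreate the table from the saved schema, then load the Avro
    #    files into it, so the declared column types win over what Avro carries.
    dest_id = "my-project.my_dataset.orders_restored"
    client.create_table(
        bigquery.Table(dest_id, schema=client.schema_from_json("orders_schema.json"))
    )
    load_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.AVRO,
        use_avro_logical_types=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    client.load_table_from_uri(gcs_uri, dest_id, job_config=load_config).result()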

 

Permissions for the command line and the API are different - User Account vs Service Account

Tooling and permissions: the command-line bq tool works differently from the API. Calling bq is quite straightforward from the permissions perspective: it simply uses your User Account permissions, and you may already have quite a lot of those assigned just to do your job.

When you use the API by calling it from Python or any other language - probably via the GCP SDK to make things easier - you act as a service, an app, and Service Account permissions apply. These are usually set at the lowest possible level to accommodate the app's current needs, but nothing more: the principle of least privilege.
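
A minimal sketch of that setup from Python, assuming a hypothetical Service Account key file and project ID (in many environments you would rely on GOOGLE_APPLICATION_CREDENTIALS or workload identity instead of a key file):

    from google.cloud import bigquery
    from google.oauth2 import service_account

    # Key of a Service Account that holds only the roles the app actually needs.
    credentials = service_account.Credentials.from_service_account_file("etl-sa-key.json")
    client = bigquery.Client(project="my-project", credentials=credentials)

    # Every call now runs with the Service Account's (least-privilege) permissions,
    # not with your personal User Account.
    for dataset in client.list_datasets():
        print(dataset.dataset_id)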

While the Python API access feels more complicated to set up permission-wise, it pays off in much easier development and maintenance of the tooling compared to e.g. bash with bq, especially when you want to run a lot of operations asynchronously using multi-threading with ThreadPoolExecutor and similar features of modern Python.
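
For example, here is a minimal sketch, with hypothetical dataset and table names, of fanning out several copy jobs in parallel with ThreadPoolExecutor; each thread mostly waits on a server-side job, so a thread pool is a good fit:

    from concurrent.futures import ThreadPoolExecutor, as_completed
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")
    tables = ["events", "orders", "customers"]  # hypothetical tables to copy

    def copy_table(name: str) -> str:
        source = f"my-project.my_dataset.{name}"
        dest = f"my-project.my_backup_dataset.{name}"
        client.copy_table(source, dest).result()  # blocks until the copy job finishes
        return name

    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(copy_table, t): t for t in tables}
        for future in as_completed(futures):
            print(f"copied {future.result()}")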


Not-so-flexible parts:

- You cannot rename datasets

- There are no easy ways to efficiently move data between GCP accounts; however, the Data Transfer Service (DTS) may be the missing link that helps (see the sketch below).
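
As a sketch of what that can look like, the snippet below creates a DTS dataset-copy transfer with the Python client. Project and dataset names are hypothetical, the destination dataset must already exist, and the caller needs the right permissions in both projects:

    from google.cloud import bigquery_datatransfer

    transfer_client = bigquery_datatransfer.DataTransferServiceClient()

    # Copy my_dataset from old-project into new-project.my_dataset_copy.
    transfer_config = bigquery_datatransfer.TransferConfig(
        destination_dataset_id="my_dataset_copy",
        display_name="Copy my_dataset from old-project",
        data_source_id="cross_region_copy",  # the dataset-copy data source
        params={
            "source_project_id": "old-project",
            "source_dataset_id": "my_dataset",
        },
        schedule="every 24 hours",  # DTS copies run on a schedule by default
    )

    transfer_config = transfer_client.create_transfer_config(
        parent=transfer_client.common_project_path("new-project"),
        transfer_config=transfer_config,
    )
    print(f"Created transfer config: {transfer_config.name}")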


Summary:

A very interesting product with an ecosystem of tools and data-pipeline solutions, but the deeper you go, the better you need to understand the choices and the pros and cons of the chosen path.

Update after 6 months:

The BigQuery ecosystem is evolving, and it looks like Google is actually putting good effort into developing the product and its ecosystem of tooling. Tools like DTS can help a lot in day-to-day tasks, but also in those less frequent operations.
