Skip to main content
Skip to main content
Edit this page

Ordering keys

Ordering Keys (a.k.a. sorting keys) define how data is sorted on disk and indexed for a table in ClickHouse. When replicating from Postgres, ClickPipes sets the Postgres primary key of a table as the ordering key for the corresponding table in ClickHouse. In most cases, the Postgres primary key serves as a sufficient ordering key, as ClickHouse is already optimized for fast scans, and custom ordering keys are often not required.

For larger use cases, you should include additional columns beyond the Postgres primary key in the ClickHouse ordering key to optimize queries. By default, choosing an ordering key different from the Postgres primary key can cause data deduplication issues in ClickHouse. This happens because the ordering key in ClickHouse serves a dual role: it controls data indexing and sorting while acting as the deduplication key. The easiest way to address this issue is by defining refreshable materialized views.

Use Refreshable Materialized Views

A simple way to define custom ordering keys (ORDER BY) is using refreshable materialized views (MVs). These allow you to periodically (e.g., every 5 or 10 minutes) copy the entire table with the desired ordering key.

Below is an example of a Refreshable MV with a custom ORDER BY and required deduplication:

Custom ordering keys without refreshable materialized views

If refreshable materialized views don't work due to the scale of data, here are a few recommendations you can follow to define custom ordering keys on larger tables and overcome deduplication-related issues.

Choose ordering key columns that don't change for a given row

When including additional columns in the ordering key for ClickHouse (besides the primary key from Postgres), we recommend selecting columns that don't change for each row. This helps prevent data consistency and deduplication issues with ReplacingMergeTree.

For example, in a multi-tenant SaaS application, using (tenant_id, id) as the ordering key is a good choice. These columns uniquely identify each row, and tenant_id remains constant for an id even if other columns change. Since deduplication by id aligns with deduplication by (tenant_id, id), it helps avoid data deduplication issues that could arise if tenant_id were to change.

Note: If you have scenarios where ordering keys need to include columns that change, please reach out to us at support@clickhouse.com. There are advanced methods to handle this, and we will work with you to find a solution.

Set Replica Identity on Postgres Tables to Custom Ordering Key

For Postgres CDC to function as expected, it is important to modify the REPLICA IDENTITY on tables to include the ordering key columns. This is essential for handling DELETEs accurately.

If the REPLICA IDENTITY does not include the ordering key columns, Postgres CDC will not capture the values of columns other than the primary key - this is a limitation of Postgres logical decoding. All ordering key columns besides the primary key in Postgres will have nulls. This affects deduplication, meaning the previous version of the row may not be deduplicated with the latest deleted version (where _peerdb_is_deleted is set to 1).

In the above example with owneruserid and id, if the primary key does not already include owneruserid, you need to have a UNIQUE INDEX on (owneruserid, id) and set it as the REPLICA IDENTITY for the table. This ensures that Postgres CDC captures the necessary column values for accurate replication and deduplication.

Below is an example of how to do this on the events table. Make sure to apply this to all tables with modified ordering keys.

Projections

Projections are useful for running queries on a column that is not a part of the primary key.

The biggest caveat with Projections is that they get skipped when querying the table with the FINAL keyword and do not account for deduplication. This could work for a few use cases where duplicates (updates, deletes) are not present or are less common.

Projections are defined on the table we want to add a custom ordering key for. Then, each time a query is executed on this table, ClickHouse determines if the query execution can benefit from using one of the existing Projections.

Let's take an example where we want to order the table posts by the field creationdate instead of the current one id. This would benefit query that filter using a date range.

Consider the following query that finds the most viewed posts mentioning "clickhouse" in 2024.

By default, ClickHouse needs to do a full scan of the table as the order by is id, we can note in the last query processed 4.69 million rows. Now, let's add a Projection to order by creationdate.

Then, we run again the same query.

ClickHouse utilized the Projection to execute the query, reducing rows scanned to just 386,000 compared to 4.69 million previously, while also lowering memory usage.