7 Reasons why I like ClickHouse
One of the rising names in the world of databases, particularly those designed for storing large volumes of data, is ClickHouse.
ClickHouse is not a newcomer to the market. In fact, its development began back in 2012 as an internal project at Yandex, aiming to support their competitor to Google Analytics, called Yandex.Metrica.
Over the years, ClickHouse became an open-source project, and these days it is maintained by the ClickHouse company (no longer part of Yandex).
Over the last couple of years, ClickHouse has not only been developing and improving its core database product (which anyone can download and run wherever they wish) but also offers a cloud-based version that lets you use ClickHouse as a managed service on AWS, GCP, or Azure.
ClickHouse is an extremely flexible engine, but its greatest value shines in several types of workloads:
- Event Tables: If your workload includes a significant portion of tables representing events that are immutable (not modified or deleted), there’s a good chance you’ll benefit from storing them in ClickHouse. It doesn’t really matter whether these events represent logs, monitoring data, security information, or domain-specific content.
- OLAP Usage: Massive queries (that can leverage ClickHouse’s Massively Parallel Processing capabilities) analyzing large amounts of data, most of which is located in a single table.
- Time-Series Data: Tables where data is organized by time, and many queries focus on specific time ranges, can benefit greatly from ClickHouse’s ability to keep the data sorted by the sorting key and efficiently access it later in queries.
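To make the time-series point concrete, here is a minimal sketch of such a table (the table and column names are made up; MergeTree and the ORDER BY clause are the real ClickHouse mechanisms involved):

```sql
-- A hypothetical events table, sorted on disk by (ServiceName, EventTime).
-- Queries that filter on a service and a time range read only the matching
-- parts of the data instead of scanning the whole table.
CREATE TABLE events
(
    EventTime   DateTime,
    ServiceName String,
    Message     String
)
ENGINE = MergeTree
ORDER BY (ServiceName, EventTime);

-- A typical time-range query that benefits from the sorting key:
SELECT count()
FROM events
WHERE ServiceName = 'api'
  AND EventTime >= now() - INTERVAL 1 DAY;
```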
Since the database market is full of trends, and nearly every month you hear about a new product claiming to be “the next big thing,” it’s understandable that people might be skeptical about products they encounter for the first time.
My goal in this post is to share the key things I love about ClickHouse and why I enjoy using it. Perhaps you’ll come to appreciate it as well.
It’s Fast.
If you search for ClickHouse on Google, you’ll find that query speed and data loading speed are the main reasons many users start working with ClickHouse.
One of the first tables I loaded while experimenting with ClickHouse contained hundreds of millions of rows, including a short text column (about 200–300 characters), on which I performed a LIKE query. It was fast. Very fast. In fact, considering that the text data wasn’t pre-indexed (as in Elasticsearch or similar engines), my jaw dropped when I saw a scan rate of hundreds of millions of rows per second.
When it comes to filtering and performing aggregations on tables (especially those that don’t return many groups), it’s not uncommon to see ClickHouse process billions of rows per second without any special optimizations. Of course, once optimizations are introduced, even significantly larger datasets become manageable (there are clusters handling petabytes of data out there).
Naturally, you don’t have to just take my word for it. On the ClickHouse benchmarks page, they compare their performance to various engines on different hardware setups.
If you search Google for comparisons between ClickHouse and your favorite database engine, you’ll likely find plenty of posts detailing how people were driven to migrate (due to performance) from other products to ClickHouse. A recurring theme in these posts and benchmarks is the smaller hardware footprint ClickHouse requires to achieve the same performance as its competitors, making it far more cost-effective.
It’s Efficient at Storing Data.
When discussing data warehouses or simply storing massive tables, one of the key factors is how well the data is compressed and how efficient/fast the compression method is (since we pay the decompression cost when querying).
ClickHouse is renowned for its efficient data storage capabilities, achieving high compression ratios (5:1 is common, but 10:1 or higher is possible, depending on the data).
This efficiency is achieved through several mechanisms:
- Columnar Data Storage:
ClickHouse’s primary table format, MergeTree, stores data in a columnar format. Each column is stored in a separate file (with some nuances when tables contain very few rows, but we’re focusing on the common case). This storage model allows significantly better compression compared to row-based formats, as columns are more likely to contain repetitive values within themselves than across rows.
- Compression Algorithms:
By default, ClickHouse compresses table data using the LZ4 algorithm, which balances good compression with computational efficiency (enabling faster data scans). Other compression algorithms can be used as well, such as those offering more aggressive compression at the cost of higher CPU usage during querying.
- Optimizing Column Data to Reduce Size:
Beyond compression, ClickHouse provides additional options to minimize the storage footprint of tables:
LowCardinality(String): If a column contains a small number of repeated values (e.g., street names, city names, or user-agent strings), it can be defined as LowCardinality(String) instead of a regular String. This performs dictionary encoding, meaning that instead of repeating the full string, only a number pointing to the full string is stored.
Using different types of codecs: For example, for slowly changing values, codecs like Delta can store just the difference between consecutive values instead of the full value. So, instead of storing 8 bytes for every precise timestamp, only the difference (which fits into 4 bytes or less) is saved. This is just one example, as ClickHouse offers a range of sophisticated codecs for further optimization.
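Both techniques show up directly in the table definition. A rough sketch (the table and column names are hypothetical; LowCardinality and CODEC are the real syntax):

```sql
CREATE TABLE requests
(
    -- Dictionary-encoded: each distinct city is stored only once
    City      LowCardinality(String),
    -- Delta stores the difference between consecutive values, which then
    -- compresses far better than full 8-byte timestamps
    Timestamp DateTime64(3) CODEC(Delta, ZSTD),
    Url       String
)
ENGINE = MergeTree
ORDER BY Timestamp;

-- You can check the effect per column via the system.columns table:
SELECT name,
       formatReadableSize(data_compressed_bytes)   AS compressed,
       formatReadableSize(data_uncompressed_bytes) AS uncompressed
FROM system.columns
WHERE table = 'requests';
```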
Here are some examples from my own personal experience of how storage-efficient ClickHouse is compared to some other products:
- Logs in ClickHouse vs. Elasticsearch:
Logs stored in ClickHouse take up only 28% of the space they occupy in Elasticsearch, even when considering that our ClickHouse deployment uses 3 replicas compared to Elasticsearch’s 2. This allows us to retain logs in ClickHouse for a much longer period than in Elasticsearch. (In fact, our main motivation for using Elasticsearch here is Kibana and the rapid full-text search, although new versions of ClickHouse already include a beta version of an inverted index.)
- Structured, repetitive data in ClickHouse vs. Elasticsearch:
For a table with more structured and repetitive data (without free-text columns), where LowCardinality can be used in ClickHouse, the storage difference is even more pronounced: the same data for the same time period takes 14x more space in Elasticsearch. Even if we disabled _source storage in Elasticsearch to save space, the gap would still be impressive.
- MSSQL vs. ClickHouse:
One of our tables is duplicated, for various reasons, to both MSSQL and ClickHouse. The MSSQL table takes about 8.6TB (data size, not including any non-clustered indexes; the clustered index is compressed using page compression), while the ClickHouse table storing the same data takes 3.2TB.
What’s remarkable is that this storage efficiency comes without compromising performance — in fact, performance is often superior to the alternatives for many types of queries.
It uses SQL for querying data
The fact that SQL is used is already a significant advantage for me (compared to using a whole new query-language). In the case of ClickHouse, their SQL variant is particularly well-designed, which enhances this advantage even further.
It offers a lot of useful functions that cover virtually anything you might need. There are also certain changes compared to standard SQL that make writing a query more convenient.
Some of the things I particularly like about ClickHouse’s SQL variant:
- The ability to define an alias for a column or expression in SELECT and use it later in WHERE and GROUP BY simplifies query writing compared to traditional SQL.
- The various combinators that can be appended to functions make queries more readable. For example, here’s a sample SQL query for ClickHouse that anyone familiar with writing SQL for other databases will appreciate for its simplicity and elegance:
```sql
SELECT
    toStartOfFifteenMinutes(DateCreated) AS t,
    countIf(MetricValue = 500) AS NumErrors,
    countIf(MetricValue = 200) AS NumSuccess,
    argMax(Username, ElapsedMs) AS UserThatWaitedLongestInTimeFrame
FROM tbl
GROUP BY t
HAVING NumErrors > 0
ORDER BY t
```
It’s easy to install and to maintain (also on-prem)
Many of the data-related products released in recent years have focused primarily — or even exclusively — on the cloud, often without any support for on-premises deployment. Some made this choice as a business decision, while others are products created by the major cloud providers themselves.
Even with products that do allow on-premises deployment (or installation on IaaS infrastructure in a cloud provider), there’s often a tendency toward unnecessary complexity.
This complexity typically involves requiring the installation of numerous components that serve various parts of the product, along with the need to manage their scaling independently.
With ClickHouse, things are much simpler: to run ClickHouse on a single node, all you need is the ClickHouse executable — nothing more.
To run ClickHouse in a cluster, you need the nodes themselves (all executing the same binary, as there’s only one role among ClickHouse nodes) and ClickHouse Keeper (a Zookeeper-compatible product developed by ClickHouse to replace Zookeeper and provide better performance). ClickHouse Keeper can run as a standalone service or as part of the ClickHouse executable itself.
Maintaining ClickHouse is also quite straightforward: like many other SQL databases, it exposes its internal information (metrics, logs, etc.) as system-tables, which can be easily queried. This makes the creation of dashboards for monitoring and other purposes very convenient.
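For example, here are two monitoring-style queries against real system tables (the specific filters are just illustrative):

```sql
-- Current values of internal metrics:
SELECT metric, value
FROM system.metrics
WHERE metric LIKE '%Connection%';

-- The slowest recently finished queries:
SELECT query_duration_ms, query
FROM system.query_log
WHERE type = 'QueryFinish'
ORDER BY query_duration_ms DESC
LIMIT 10;
```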
You don’t need external ETL tools for the basic tasks
Anyone working in the data field is familiar with the need to handle ETL tasks. These typically fall into several categories:
- Importing data from external sources
- Exporting data to external sources
- Running various summary or transformation processes
ClickHouse has built-in table engines that allow the creation of tables where the data is not actually stored in ClickHouse but represents data located elsewhere: a topic in Kafka, a queue in RabbitMQ, a table in MSSQL, or any other external database accessible via ODBC, and more.
Using these tables, along with materialized views or functions designed for querying data in external sources, makes it significantly easier to import and export data from ClickHouse automatically, without requiring external tools outside the product (though if desired, ClickHouse does offer numerous integrations with tools such as Kafka Connect and others).
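As a rough sketch of this pattern (broker address, topic, and table names are made up; the Kafka table engine and TO-style materialized views are the real ClickHouse features involved), continuous ingestion from Kafka typically looks like this:

```sql
-- 1. A table that consumes from Kafka (stores nothing itself):
CREATE TABLE kafka_events
(
    EventTime DateTime,
    Message   String
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'broker:9092',
         kafka_topic_list = 'events',
         kafka_group_name = 'clickhouse-consumer',
         kafka_format = 'JSONEachRow';

-- 2. A regular MergeTree table for long-term storage:
CREATE TABLE events_store
(
    EventTime DateTime,
    Message   String
)
ENGINE = MergeTree
ORDER BY EventTime;

-- 3. A materialized view that continuously moves rows from Kafka to storage:
CREATE MATERIALIZED VIEW events_consumer TO events_store AS
SELECT EventTime, Message
FROM kafka_events;
```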
ClickHouse also integrates with many BI tools (a list that is continuously expanding). Personally, I often use the ClickHouse ODBC driver for Windows, which allows setting up a linked server from MSSQL to ClickHouse (enabling queries that originate in MSSQL but access data stored in ClickHouse).
This is highly convenient, especially if you’re already using other databases. However, I’d love to see improved stability in the ODBC driver for Windows, as I’ve run into crashes and bugs too many times.
ClickHouse also has a built-in solution for creating tables that summarize data to make querying faster. These typically involve a combination of ClickHouse’s materialized views (which are somewhat different from what the term means in other database products; the equivalent of those is what ClickHouse calls ‘Refreshable Materialized Views’) and one of the other “flavors” of MergeTree (ClickHouse’s main table format), which allows controlling how the data is stored.
For example, one can use ReplacingMergeTree to store only the latest version of each row. If you want to store pre-calculated aggregations, you can use AggregatingMergeTree.
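A minimal sketch of the ReplacingMergeTree idea (table and column names are hypothetical):

```sql
-- Keeps only the row with the highest UpdatedAt per UserId
-- (deduplication happens eventually, during background merges):
CREATE TABLE user_profiles
(
    UserId    UInt64,
    Name      String,
    UpdatedAt DateTime
)
ENGINE = ReplacingMergeTree(UpdatedAt)
ORDER BY UserId;

-- Until merges catch up, FINAL forces deduplication at query time:
SELECT *
FROM user_profiles FINAL;
```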
Recently, ClickHouse has been introducing many features that let you use object storage from other providers (not just S3) and to efficiently handle various table formats (Parquet, Delta Lake, Iceberg, and others). This makes it possible to use ClickHouse as a query engine over a data lake, regardless of whether the data is eventually stored in ClickHouse itself.
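For instance, using the built-in s3 table function, data can be queried in place (the bucket URL here is a made-up placeholder):

```sql
-- Query Parquet files directly from object storage,
-- without loading them into a ClickHouse table first:
SELECT count()
FROM s3('https://my-bucket.s3.amazonaws.com/data/*.parquet', 'Parquet');
```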
Separation between compute and storage
ClickHouse can use different types of storage for a table’s data. The most straightforward and obvious option is, of course, storing data on local disks. However, in ClickHouse, you can also configure data to be stored on object storage for cost savings, creating a separation between the compute and storage layers, and reducing interdependence between them.
The ability to store MergeTree tables on S3 in ClickHouse is an integral part of the product. However, some of the more advanced capabilities developed by ClickHouse for separating compute from storage are only available in the cloud version and not in the open-source version that can be downloaded and installed independently. For example, the SharedMergeTree allows storing data in object storage only once, instead of replicating it based on the number of replicas in each shard.
It’s important to emphasize that using S3 or any other object storage is not mandatory in ClickHouse. If you prefer — whether for operational, performance, convenience, or other reasons — to store all data on local disks, nothing prevents you from doing so.
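In the open-source version, pointing tables at object storage goes through the server’s storage configuration. A rough, hypothetical sketch of an S3 disk and a storage policy that uses it (the endpoint and credentials are placeholders):

```xml
<clickhouse>
    <storage_configuration>
        <disks>
            <s3_disk>
                <type>s3</type>
                <endpoint>https://my-bucket.s3.amazonaws.com/data/</endpoint>
                <access_key_id>PLACEHOLDER</access_key_id>
                <secret_access_key>PLACEHOLDER</secret_access_key>
            </s3_disk>
        </disks>
        <policies>
            <s3_policy>
                <volumes>
                    <main>
                        <disk>s3_disk</disk>
                    </main>
                </volumes>
            </s3_policy>
        </policies>
    </storage_configuration>
</clickhouse>
```

A table then opts in with `SETTINGS storage_policy = 's3_policy'` in its CREATE TABLE statement.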
It’s developing rapidly.
Almost every month, a new version of ClickHouse is released, often containing interesting new features and performance improvements. In fact, just browsing through the pull requests on ClickHouse’s GitHub shows how fast the product is evolving (both by the “official” developers and by the community).
I’ve been working with ClickHouse for about three years, and the product’s progress during this time is definitely impressive.
Updating ClickHouse is easy, and if you’re worried about what happens if things don’t work as expected after an upgrade due to bugs (and that’s the flip side — sometimes releases contain bugs you’d rather not deal with), it’s also very easy to roll back.
By the way, if you want to stay up to date with the new features in ClickHouse, it’s highly recommended to review their changelog document and watch the monthly webinar of new features usually presented by Alexey Milovidov (ClickHouse’s CTO) on YouTube.
But there are also downsides
Like everything, using ClickHouse is not without its drawbacks, and there are some things I wish would become better.
- ClickHouse doesn’t handle queries that require a lot of memory very well.
If you try to do massive JOINs (where each side has many rows), or attempt a GROUP BY on columns with high cardinality, you’ll find that it can become quite painful. To be honest, this is a tough issue, and ClickHouse is working on improving various aspects of it (for example, by developing new JOIN algorithms).
- Some new versions of ClickHouse contain regressions or severe bugs.
As mentioned, this is the flip side of the frequent release policy. Sometimes bugs or regressions are introduced that, as a user, you’d really rather not see. This is especially true with new features, which often land before they’re mature enough and clearly haven’t been tested well enough.
- There are operational limitations that would be better removed.
For example, every ReplicatedMergeTree table comes with some overhead to manage synchronization with Zookeeper and to bring new data to the other replicas in the same shard. This creates an effective limit on the number of such tables you can maintain before you see a significant rise in CPU consumption and, as a result, a longer time for data to become accessible across all replicas.
- The documentation is just OK.
ClickHouse has documentation, and it’s not bad. However, there have been times when I remembered seeing a pull request related to a specific feature, or something that appeared in the changelog but wasn’t mentioned in the documentation, and I had to look at the code or tests to understand exactly how to use that feature. In a product with many contributors and a lot of code being added, it’s probably not easy to keep the documentation comprehensive and up-to-date, yet not overwhelming or confusing. But without that, useful features can simply get lost.
You should try using ClickHouse
It’s very easy to start using ClickHouse and check if it’s suitable for your use cases.
You can download the Docker images and easily run a single node locally (which is probably powerful enough to handle basic experiments).
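For example, using the official image on Docker Hub, a throwaway single node can be started like this (the container name is arbitrary):

```shell
# Start a single ClickHouse node (default ports: 8123 HTTP, 9000 native)
docker run -d --name clickhouse-test \
    -p 8123:8123 -p 9000:9000 \
    clickhouse/clickhouse-server

# Run a query with the bundled client
docker exec -it clickhouse-test clickhouse-client --query "SELECT version()"
```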
If you want to experiment with a full cluster, there are docker-compose files that let you easily spin up a full cluster on your own machine.
Good luck!