Petabyte-scale log analytics with Elasticsearch

Andrew Montalenti

Chief Product Officer

Parse.ly

Elasticsearch is best known as a distributed search engine and as a core plank of the Elastic Stack (used together with projects like Kibana and Logstash), but it can also be thought of as a powerful columnar data store and time series analytics engine. This makes Elasticsearch usable in contexts where you might normally consider systems like Google BigQuery, Amazon Athena, Snowflake, or, in open source, Dremel, Presto, Druid, or Spark. (You might even have considered tools like TimescaleDB, InfluxDB, or Prometheus in this category.)

Elasticsearch has some important advantages as a time series log analytics engine, especially when you need your ES cluster to power live concurrent queries from real users. Using Elasticsearch in this way, however, requires some lateral thinking about how to index, store, and query your data, especially with regard to pre-aggregation.

In this 20-minute talk, we will discuss how a small engineering team built a large-scale time series analytics engine on top of Elasticsearch, running in production on AWS EC2. We started with a small cluster powered by Elasticsearch 1.3 (2014), when "aggregations" first became stable at scale, and grew to a much larger cluster powered by Elasticsearch 6.8 (2019-2020), as our system passed a petabyte of stored log data. We'll also discuss our home-grown query layer for Elasticsearch, which bridges the gap between the ES query DSL and time-series-aware SQL.

Finally, we will discuss the open source work the team has done to prototype "index-stored aggregations" in Elasticsearch, with a focus on the challenging cardinality aggregation (aka approximate distinct count), where we see the potential for massive cost savings and query performance improvements from just a little open source work. We'll close with a discussion of the broader open source effort for index-stored data aggregations and data sketches in ES, which might lead to even more innovation and make ES yet more competitive with other columnar time series storage options.
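
To illustrate why the cardinality aggregation is the hard case for pre-aggregation, here is a minimal Python sketch. It is not the team's implementation: the rollup layout, field names, and helper functions are all hypothetical, and a plain Python set stands in for a mergeable sketch such as HyperLogLog. The point is only that event counts from hourly rollups can be summed, while distinct counts cannot, so the rollup has to store something mergeable.

```python
# Hypothetical pre-aggregated rollups: one entry per hour, holding a plain
# event count and the visitor IDs seen in that hour. The set is a stand-in
# for a mergeable sketch (e.g. HyperLogLog); it is exact, not approximate.
hourly_rollups = [
    {"hour": "2020-01-01T00", "events": 3, "visitors": {"a", "b", "c"}},
    {"hour": "2020-01-01T01", "events": 4, "visitors": {"b", "c", "d"}},
    {"hour": "2020-01-01T02", "events": 2, "visitors": {"a", "d"}},
]


def total_events(rollups):
    """Event counts are additive, so hourly rollups can simply be summed."""
    return sum(r["events"] for r in rollups)


def naive_distinct(rollups):
    """Summing per-hour distinct counts double-counts returning visitors."""
    return sum(len(r["visitors"]) for r in rollups)


def merged_distinct(rollups):
    """Merging the stored per-hour structures gives the correct distinct count."""
    merged = set()
    for r in rollups:
        merged |= r["visitors"]
    return len(merged)


print(total_events(hourly_rollups))     # 9
print(naive_distinct(hourly_rollups))   # 8 (overcounts)
print(merged_distinct(hourly_rollups))  # 4 (correct)
```

A real sketch structure would trade exactness for a small, fixed size per rollup, which is what makes storing it alongside each pre-aggregated document attractive.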
