Petabyte-scale log analytics with Elasticsearch
Andrew Montalenti
Chief Product Officer
Elasticsearch is best known as a distributed search engine and as a core plank of the Elastic Stack (alongside projects like Kibana and Logstash), but it can also be thought of as a powerful columnar data store and time series analytics engine. This makes Elasticsearch usable in contexts where you might normally reach for systems like Google BigQuery, Amazon Athena, or Snowflake, or, in open source, Dremel, Presto, Druid, or Spark. (You might even have considered tools like TimescaleDB, InfluxDB, or Prometheus in this category.) Elasticsearch has some important advantages as a time series log analytics engine, especially when you need your ES cluster to power live concurrent queries from real users. Using Elasticsearch in this way, however, requires some lateral thinking about how to index, store, and query your data, especially with regard to pre-aggregation.
In this 20-minute talk, we will discuss how a small engineering team built a large-scale time series analytics engine atop Elasticsearch, running in production on AWS EC2. The story starts with a small cluster powered by Elasticsearch 1.3 (2014), when "aggregations" first became stable at scale, and runs through a much larger cluster powered by Elasticsearch 6.8 (2019-2020), by which point our system had crossed a petabyte of stored log data. We'll also discuss our home-grown query layer for Elasticsearch, which bridges the gap between the ES query DSL and time-series-aware SQL.
Finally, we will discuss the open source work the team has done to prototype "index-stored aggregations" in Elasticsearch, with a focus on the challenging cardinality aggregation (aka approximate distinct count), where we see the potential for massive cost savings and query performance improvements from just a little open source work.
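For readers unfamiliar with the cardinality aggregation, here is a minimal sketch of the kind of request body it involves today, where the approximate distinct count is computed at query time inside each time bucket. The field names (`@timestamp`, `visitor_id`), the aggregation names, and the helper function are assumptions made for illustration; they are not taken from the talk or from the team's query layer. The `interval` key matches the Elasticsearch 6.x request format; Elasticsearch 7.2 and later use `calendar_interval` instead.

```python
import json


def daily_unique_visitors_query(start, end,
                                ts_field="@timestamp",     # hypothetical field name
                                id_field="visitor_id",     # hypothetical field name
                                precision_threshold=3000,
                                interval_key="interval"):  # "calendar_interval" on ES 7.2+
    """Build a search body: one bucket per day, approximate distinct IDs per bucket."""
    return {
        "size": 0,  # return only aggregation results, no individual hits
        "query": {"range": {ts_field: {"gte": start, "lt": end}}},
        "aggs": {
            "per_day": {
                "date_histogram": {"field": ts_field, interval_key: "day"},
                "aggs": {
                    "unique_visitors": {
                        # approximate distinct count, computed at query time
                        "cardinality": {
                            "field": id_field,
                            "precision_threshold": precision_threshold,
                        }
                    }
                },
            }
        },
    }


body = daily_unique_visitors_query("2020-01-01", "2020-01-08")
print(json.dumps(body, indent=2))
```

In SQL terms this is roughly a `COUNT(DISTINCT visitor_id)` grouped by day. Because the distinct count is rebuilt from raw documents on every request, it is a natural candidate for being stored in the index ahead of time.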
We'll close with a discussion of the broader open source effort for index-stored data aggregations and data sketches in ES, which might lead to even more innovation and make Elasticsearch yet more competitive relative to other columnar time series storage options.
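To illustrate why data sketches suit pre-aggregation, the toy HyperLogLog below shows that two sketches built separately (say, one per hour) can be merged into a sketch of their union without revisiting the raw data. This is a simplified teaching sketch under stated assumptions, not Elasticsearch's implementation (its cardinality aggregation is based on HyperLogLog++) and not the team's prototype; the class, the helper, and the sample IDs are all hypothetical.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog with 2**p registers (illustrative only)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        # Take 64 bits of a SHA-1 digest as the hash of the value.
        h = int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # The union of two sketches is the element-wise max of their registers.
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


def sketch_of(ids):
    s = HyperLogLog()
    for i in ids:
        s.add(f"user-{i}")
    return s


hour_1 = sketch_of(range(0, 20000))       # made-up visitor IDs
hour_2 = sketch_of(range(10000, 30000))   # overlaps hour_1 by 10,000 IDs
merged = hour_1.merge(hour_2)
print(round(merged.count()))              # estimate of the 30,000 distinct IDs
```

Since merging is just an element-wise maximum over small fixed-size arrays, a sketch stored per time bucket could answer distinct-count queries over arbitrary ranges by merging buckets rather than scanning raw log records.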

Copyright © 2023 CTO Connection, All Rights Reserved