Petabyte-scale log analytics with Elasticsearch
Andrew Montalenti
Chief Product Officer
Parse.ly
Elasticsearch is best known as a distributed search engine and as a core plank of the Elastic Stack (used together with projects like Kibana and Logstash), but it can also be thought of as a powerful columnar data store and time series analytics engine. This makes Elasticsearch usable in contexts where you might normally reach for systems like Google BigQuery, Amazon Athena, Snowflake, or, in open source, Dremel, Presto, Druid, or Spark. (You might even have considered tools like TimescaleDB, InfluxDB, or Prometheus in this category.) Elasticsearch has some important advantages as a time series log analytics engine, especially when you need your ES cluster to power live concurrent queries from real users. Using Elasticsearch in this way, however, requires some lateral thinking about how to index, store, and query your data, especially with regard to pre-aggregation.
In this 20-minute talk, we will discuss how a small engineering team built a large-scale time series analytics engine on top of Elasticsearch, running in production on AWS EC2. The story starts with a small cluster powered by Elasticsearch 1.3 (2014), when "aggregations" first became stable at scale, and runs through a much larger cluster powered by Elasticsearch 6.8 (2019-2020), as our system crossed a petabyte of stored log data. We'll also discuss our home-grown query layer for Elasticsearch, which bridges the gap between the ES query DSL and time-series-aware SQL.
Finally, we will discuss the open source work the team has done to prototype "index-stored aggregations" in Elasticsearch, with a focus on the challenging cardinality aggregation (aka approximate distinct count), where we see the potential for massive cost savings and query performance improvements from just a little open source work.
We'll close with a discussion of the broader open source effort for index-stored data aggregations and data sketches in ES, which might lead to even more innovation, and make it yet more competitive relative to other columnar time series storage options.
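For readers unfamiliar with the cardinality aggregation mentioned above, here is a minimal sketch of the kind of query it appears in: a time-bucketed approximate distinct count expressed in the ES query DSL. It only builds the request body as a Python dict, so it runs without a cluster. The function name and the field names (`@timestamp`, `visitor_id`) are hypothetical and are not taken from the talk or from Parse.ly's query layer. The `interval` parameter matches the Elasticsearch 6.x era discussed here; later versions replace it with `calendar_interval` or `fixed_interval`.

```python
import json


def build_unique_visitors_query(start, end, interval="1d",
                                time_field="@timestamp",
                                id_field="visitor_id"):
    """Build an ES query DSL body: approximate distinct count per time bucket.

    All names here are illustrative assumptions, not Parse.ly's schema.
    """
    return {
        # We only want aggregation results, not the matching documents.
        "size": 0,
        # Restrict the scan to the requested time window.
        "query": {"range": {time_field: {"gte": start, "lt": end}}},
        "aggs": {
            "per_bucket": {
                # Split the window into fixed time buckets (ES 6.x syntax).
                "date_histogram": {"field": time_field, "interval": interval},
                "aggs": {
                    "unique_visitors": {
                        # Approximate distinct count of id_field per bucket.
                        # precision_threshold trades memory for accuracy.
                        "cardinality": {
                            "field": id_field,
                            "precision_threshold": 3000,
                        }
                    }
                },
            }
        },
    }


if __name__ == "__main__":
    body = build_unique_visitors_query("2020-01-01", "2020-02-01")
    print(json.dumps(body, indent=2))
```

Without pre-aggregation, a query like this is computed at query time over the raw documents in every bucket, which is one reason the cardinality aggregation is a natural target for the index-stored aggregation work described above.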

Copyright © 2022 CTO Connection, All Rights Reserved