Logs, metrics, and the evolution of observability at Coinbase
Up until mid 2018, the only supported way to generate custom metrics for an application at Coinbase was to print to standard output. Each printed line would work its way through a series of pipelines where they could be analyzed, transformed, and parsed until finally landing on an index in our monolithic self-managed Elasticsearch cluster, where engineers could build visualizations and dashboards with Kibana.
This workflow represents the best of many worlds. Log entries are easy to generate, and because they can contain fields with unlimited cardinality at microsecond granularity, it’s possible to slice, dice, and aggregate these entries to diagnose the root cause of even the most complex issues.
In 2016, we reaffirmed our decision to self-manage Elasticsearch. At the time, managed solutions like Amazon Elasticsearch Service were still in their infancy (for example, IAM policy changes would require a full redeploy of the cluster) and we were reluctant to trust not yet mature third party vendor technology at the time with the potentially sensitive nature of our applications’ standard output data. We built elaborate automation to manage and blue-green deploy our Elasticsearch data, master and coordinating nodes with our existing deployment pipeline and comply with our self-enforced 30 day project.
Managing this newly automated Elasticsearch ourselves actually worked out fine for over a year. As an organization of thirty engineers serving up two moderately trafficked web applications, both our indexing (~100GB per day) and search volume (a few queries per minute) were relatively tame.
However during 2017, interest in cryptocurrency skyrocketed, catapulting our moderately trafficked web applications into the spotlight. This massive surge in traffic to our customer facing applications (60x!) had the downstream effect of an equivalent surge in downstream log entries. Earlier technical challenges contributed to this surge of new entries as engineers worked overtime to add diagnostics to our platform (see our previous scaling blog post for more detail). During this period, a single Coinbase API call could generate up to 100 separate log entries, at times resulting in terabytes of log data per hour.
The cluster became unwieldy. As the size of the engineering team increased dramatically, the combined weight of terabytes per day of logs, a monolithic cluster, and unrestricted access patterns led to outages that halted engineering productivity and created operational hurdles.
It became clear that a monolithic Elasticsearch cluster would not be sufficient for our observability needs. Some of our problems included:
- A single bad Kibana query (hello, Timelion) could cause a hiccup severe enough to affect the entire engineering organization.
- Certain issues were difficult to diagnose or prevent — Elasticsearch, Kibana and X-Pack lack the ability to analyze slow/expensive queries by user nor the controls to prevent expensive queries like multi-week aggregations.
- Many of Elasticsearch’s load-related failure conditions (like a series of very expensive queries) would leave the cluster in a disabled state even after load had entirely subsided. Our only recourse to recover from these crippled states was to perform a full cluster restart.
- Elasticsearch does not have a separate management interface, meaning that during failures we would be unable to perform basic diagnostics on the cluster. Opening a support request with Elastic requires diagnostic dumps that we were unable to generate until the cluster had recovered.
- Alerting on data stored in Elasticsearch data was not intuitive — tools like Elastalert and Watcher don’t allow engineers to interactively build alerts and were difficult to integrate with PagerDuty and Slack.
- We were forced to reduce log retention to reduce the impact of massive queries and speed up cluster recovery following failures.
- Aside from application server standard output, we were also storing and querying security data like AWS VPC Flow Logs and Linux Auditd records which required separate security controls and exhibited different performance characteristics.
- Our two largest internal applications consumed 80–90% of our log pipeline capacity, reducing the performance and retention for other smaller applications in our environment.
We chose to solve these problems in two ways:
- Introduce a new platform to provide engineers with features that are impractical to provide with Elasticsearch (namely inexpensive aggregations, alerting, and downsampling for long retention).
- Split our self-managed, monolithic Elasticsearch cluster into several managed clusters segmented by business unit and use case.
Based on our challenges self-managing Elasticsearch combined with our short experiments at developing a reliable blue-green deploy strategy for Prometheus backends influenced our decision to choose a managed service for our metrics provider. After evaluating several providers, we settled on Datadog.
Unlike log platforms which store data as discrete structured events, metric stores like Datadog store numerical values over time in a matrix of time by tag/metric value. In other words, in Datadog, if you send a single gauge or counter metric 100,000 times in a 10s period, it will only be stored once per tag value combination while on a logs platform that same value would result in 100,000 separate documents.
As a result of this tradeoff, metrics stores like Datadog allow for extremely fast aggregations and long retention at the cost of reduced granularity and cardinality. Specifically with cardinality, because every additional tag value added to a given metric requires a new column to be added to the time vs tag/metric matrix, Datadog makes it expensive to breach their 10k unique tag combination per host limit.
Despite these tradeoffs, we’ve found Datadog to be a near-perfect complement to our internal logs pipeline.
- Creating and iterating on alerts is fast and intuitive. Internally, we’ve built a tool to automatically monitor and codify our Datadog monitors and dashboards to prevent accidental modifications.
- Pulling up complex dashboards with metrics from the past several months is guilt-free and near-instantaneous.
- Features like distribution metrics allow engineers to easily understand the performance from a block of code with global percentiles.
- Limitations on cardinality can be confusing for engineers who want to keep track of high cardinality values like user_id or wallet address.
Metrics vs Logs
A common topic internally, we’ve found the distinction between Metrics and Logs to be important. At a high level the tools have similar feature sets — both allow applications to send arbitrary data which can be used to combine visualizations into fancy dashboards. However, beyond their basic feature sets, there are major differences in the features, performance, and retention of these tools. Neither tool can support 100% of use cases, so it’s likely that engineers will need to leverage both in order to provide full visibility into their applications.
In short, we think that Metrics should be used for dashboards and alerts; while Logs should be used as an investigative tool to find the root cause of issues.
Securing the Datadog Agent
From a security perspective, we are comfortable using a third-party service for metrics, but not for logs. This is because metrics tend to be numeric values associated with string keys and tags, compared to logs which contain entire lines of standard output.
Datadog offers operators a series of tight integrations — if you provide the Datadog agent access to a host’s /proc directory, Docker socket, and AWS EC2 metadata service, you’ll be provided rich metadata and system stats attached to every metric you generate. Running a 3rd party agent like Datadog on every host in your infrastructure, however, carries some security risk regardless of vendor or product, so we chose to take a more secure approach to employing this technology.
We took several actions in order to gain maximum exposure to these Datadog integrations while also reducing the risk associated with running a third party agent.
- Rather than use the pre-built Docker container, we built our own stripped down version with as few optional third party integrations as possible.
- By default, the agent is opt-in — projects need to explicitly choose to allow the Datadog container.
- On every host, put the container on a separate “untrusted” network bridge without access to other containers on the host, our VPC, or the EC2 AWS metadata service. We hard code the `DD_HOSTNAME` environment to the host’s instance ID to allow the AWS integration to continue working.
- Run a special Docker socket proxy service on hosts to enable the Datadog container integration without exposing the container’s potentially secret environment variable values.
Developing a strong metrics foundation at Coinbase helped to alleviate some of the problems we experienced with Elasticsearch as many workloads naturally migrated to Datadog. Now at least when an issue did occur on the monolithic cluster, engineers had data they could fall back on.
But the Elasticsearch issues and outages continued. As engineer headcount continued to grow, logging continued to be a source of frustration. There were about 7 new engineering productivity impeding incidents in Q4 2018. Each incident would require operational engineers to step through elaborate runbooks to shut down dependent services, fully restart the cluster, and backfill data once the cluster had stabilized.
The root cause of each incident was opaque — could it be a large aggregation query by an engineer? A security service gone rouge? However the source of our frustration was clear — we’d jammed so many use cases into this single Elasticsearch cluster that operating and diagnosing the cluster had become a nightmare. We needed to separate our workloads in order to speed incident diagnosis and reduce the impact of failures when they did occur.
Functionally sharding the cluster by use case seemed like a great next step. We just needed to make a decision between investing further in the elaborate automation we’d put in place to manage our existing cluster, or reapproach a managed solution to handle our log data.
So we chose to reevaluate managed solutions for handling our log data. While we’d previously decided against using Amazon Elasticsearch Service due to what we considered at the time to be a limited feature set and stories of questionable reliability, we found ourselves intrigued by its simplicity, approved vendor status, and AWS ecosystem integration.
We used our existing codification framework to launch several new clusters. Since we leverage AWS Kinesis consumers to write log entries to Elasticsearch, simply launching duplicate consumers pointed at the newly launched clusters allowed us to quickly evaluate the performance of Amazon Elasticsearch Service against our most heavy workloads.
Our evaluation of Amazon Elasticsearch Service went smoothly, indicating that the product had matured significantly over the past two years. Compared to our previous evaluation, we were happy to see the addition of instance storage, the support of modern versions of Elasticsearch (only a minor version or two behind at most), as well as various other small improvements like instant IAM policy modification.
While our monolithic cluster relied heavily on X-Pack to provide authentication and permissions for Kibana. Amazon Elasticsearch Service relies on IAM to handle permissions at a very coarse level (no document or index level permissions here!). We were able to work around this lack of granularity by dividing the monolith into seven new clusters, four for the vanilla logs use case, and three for our various security team use cases. Each cluster’s access is controlled by leveraging a cleverly configured nginx proxy and our existing internal SSO service.
Migrating a team of over 200 engineers from a single, simple to find Kibana instance (kibana.cb-internal.fakenet) to several separate Kibana instances (one for each of our workloads) presented a usability challenge. Our solution is to point a new wildcard domain (*.kibana.cb-internal.fakenet) at our nginx proxy, and use a project’s Github organization to direct engineers the appropriate Kibana instance. This way we can point several smaller organizations at the same Elasticsearch cluster with the option to split them out as their usage grows.
Functionally sharding Elasticsearch has not only had a massive impact on the reliability of our logging pipeline, but has dramatically reduced the cognitive overhead required by the team to manage the system. In the end, we’re thrilled to hand over the task of managing runbooks, tooling, and a fickle Elasticsearch cluster to AWS so that we can focus on building the next generation of observability tooling at Coinbase.
Today, we’re focusing on building the next generation of observability at Coinbase — if these types of challenges sound interesting to you, we’re hiring Reliability Engineers in San Francisco and Chicago (see our talk at re:invent about the types of problems the Reliability team is solving at Coinbase). We have many other positions available in San Francisco, Chicago, New York, and London — visit our careers page at http://coinbase.com/careers to see if any positions spark your interest!
This website may contain links to third-party websites or other content for information purposes only (“Third-Party Sites”). The Third-Party Sites are not under the control of Coinbase, Inc., and its affiliates (“Coinbase”), and Coinbase is not responsible for the content of any Third-Party Site, including without limitation any link contained in a Third-Party Site, or any changes or updates to a Third-Party Site. Coinbase is not responsible for webcasting or any other form of transmission received from any Third-Party Site. Coinbase is providing these links to you only as a convenience, and the inclusion of any link does not imply endorsement, approval or recommendation by Coinbase of the site or any association with its operators.
Unless otherwise noted, all images provided herein are by Coinbase.