Over the years of my engineering career I’ve been involved in a number of monitoring, observability, and visualization projects. I’ve found that many of the system-oriented techniques I’ve learned across industries aren’t intuitively obvious to newer engineers, so I’ve wanted to write about the subject for some time. Here you’ll find my thoughts on observability, instrumentation, monitoring, etc.
The key insight for me has always been looking at any given system as a system in the abstract sense. A system of any kind has an intended (or emergent) purpose, various inputs and outputs, and a number of interconnected components. Each component can in turn be viewed as a system and further subdivided. It is this layered breakdown of a system into components, and the connections between them, that guides my approach to system observability.
Continue reading to find my take on what I mean by “observability,” what the goal of observation is, and how I approach instrumenting a system. I’ll provide examples from different types of system, and compare them to software systems.
What is observability?
For this article, by “observability” I mean the ability to monitor the performance of a system to ensure it serves its intended purpose. The term can also be used to refer to the tools and practices involved in observing a system, e.g. an engineer asked to “add observability to the X system” would be adding necessary instrumentation to that system to allow stakeholders to monitor the system.
You’ll come across a few different terms when reading about observability. Here’s how I might define those terms:
- instrumentation
- The modifications made to a system to allow gathering data about its performance.
- monitoring
- Systems for continuously gathering data from instrumentation. Can also be synonymous with “observability.”
- metrics
- Numeric data describing the performance of a system at a given point in time or over a range of time.
- logging
- Records of events that occur within a system.
- alerting
- Systems for sending notifications when human intervention is required.
If you want to build an observable system, you will instrument the system, install tools for monitoring the instrumented system, track performance metrics of the system, log crucial events and changes within the system, and alert someone if the system metrics or logs indicate human intervention is needed.
Common observability pitfalls
Before we explore system-oriented observability in detail, here are some common pitfalls and failures I’ve seen in software systems:
- Logging noise without details
- Not logging factors affecting key decisions
- Having multiple disconnected tools for monitoring different systems
- Recording too much data
- Recording too little data
- Missing subsystems and components
Instead you want to:
- Log the what and the why in a way that anyone could understand. Your future self will thank you when you don’t have to figure out what `DEBUG: foo here` means.
- Consolidate observability tooling as much as possible, and use request IDs to link across systems when consolidation isn’t possible.
- Don’t record 100% of detailed cross-system tracing through third-party tools unless you want your costs to explode.
- Make sure you’re recording the essentials to understand all of your systems.
Keep reading for more detail about what to record and how.
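To make the logging advice above concrete, here is a minimal Python sketch of logging both the what and the why at the point a decision is made. The `retry_invoice` function, its thresholds, and its messages are all hypothetical:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("billing")

def retry_invoice(invoice_id, attempts, max_attempts=3):
    # Bad:  log.debug("foo here")  -- tells a future reader nothing.
    # Good: record what happened and why, with the data behind the decision.
    if attempts >= max_attempts:
        log.warning("invoice %s: giving up after %d attempts (max %d)",
                    invoice_id, attempts, max_attempts)
        return False
    log.info("invoice %s: retrying, attempt %d of %d",
             invoice_id, attempts + 1, max_attempts)
    return True
```

A reader of this log sees both the outcome and the factors that drove it without having to open the source code.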
Observability by example
Definitions will only take you so far, so let’s look at some examples. For each example, pay attention to the critical purpose of the system, the components within the system, and the parameters we use to judge system performance. Note how the system design influences what we measure and how. Later we’ll come up with some general principles for turning a system diagram into an observability plan.
Car
Let’s start with something most of us are familiar with, the everyday automobile. We generally know what cars are used for, and have seen what metrics are present on a car dashboard. But you may not have thought about a system breakdown of a car. We’ll look at a gasoline-powered car since there are more inputs and outputs than for an electric car.
Purpose
Transportation over roads, while following traffic laws. Some cars may have other purposes, such as racing, fun, comfort, or off-road transportation.
Components
- Chassis
- Engine
- Transmission
- Suspension
- Wheels
- Fuel tank
Inputs
- Fuel
- Air
- Steering position
- Throttle position
- Brake position
- Gear selection
Outputs
- Exhaust
- Heat
- Motion
Key metrics
- Speed
- Distance
- Engine RPM
- Oil temperature
- Fuel level
- Fuel economy
Logs and events
- Engine starts and stops
- Gear shifts
- System faults
- ABS activation events
Alerts
- Low fuel
- Engine failure
- ABS failure
Swimming pool treatment system
Another system you might have used without even realizing it is the chemical and thermal management system of a swimming pool. Pools must be kept within certain ranges for the health and comfort of swimmers, and in larger commercial pools this may be automated. Either way, the caretaker of the pool needs to monitor chemical levels and temperatures to make sure the pool is safe and comfortable.
Purpose
Maintain neutral pH, comfortable temperature, cleanliness, and clarity of swimming pool water.
Components
- Pool
- Water pump
- Filter
- Chlorine pump
- Acid pump
- Boiler
Inputs
- Inlet water
- Chlorine
- Acid
- Electricity
- Natural gas
Outputs
- Outlet water
- Exhaust
Key metrics
- pH
- Free chlorine
- Temperature
- Pressure
Logs and events
- Water pump started or stopped
- Boiler started or stopped
- Chemical pump started or stopped
Alerts
- pH out of range
- Temperature out of range
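These out-of-range alerts amount to checking each reading against an allowed band. A Python sketch for illustration (the bands below are made up for the example, not water-safety guidance):

```python
def range_alerts(readings, limits):
    """Return an alert message for every metric outside its allowed band."""
    alerts = []
    for name, value in readings.items():
        low, high = limits[name]
        if not (low <= value <= high):
            alerts.append(f"{name} out of range: {value} (expected {low}-{high})")
    return alerts

# Illustrative bands only.
POOL_LIMITS = {"pH": (7.2, 7.8), "temperature_c": (25.0, 29.0)}
alerts = range_alerts({"pH": 8.1, "temperature_c": 27.0}, POOL_LIMITS)
```

The same pattern covers most threshold alerts in this article, whatever the system.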
Electric generation station
Power generation stations are a great case study for observability. You’ve probably seen scenes on the news or in documentaries of power plant control rooms. Because of the critical nature of power generation systems, a lot of thought is put into monitoring and controlling them. So we can gain a lot of insight into system observability by looking at how power plants are monitored.
See this video on the saVRee channel for a more detailed breakdown of a thermal power station.
Purpose
Provide electrical power to the electric grid.
Components
- Boiler
- Turbine
- Generator
- Cooling tower
- Circuit breaker
- Transmission line
Inputs
- Coal
- Air
- Water
Outputs
- Electricity
- Steam
- Exhaust
Key metrics
- Frequency / turbine speed
- Voltage
- Current
- Temperature
Logs and events
- Generator started
- Circuit breaker tripped
- Circuit breaker reset
Alerts
- Frequency outside range
- Voltage outside range
- Overcurrent
- Temperature outside range
Sound system
The world of sound is also a great source of observability inspiration. Sound systems are full of meters, gauges, indicators, and visualizations that can serve as examples for other domains. As with other types of system, the breakdown into components and connections helps us decide what to monitor and how.
Purpose
Combine multiple sound inputs from a musical or spoken performance in a pleasing way, with the result amplified for an audience.
Components
- Preamp
- Equalizer
- Compressor
- Mixer
- Amplifier
- Speaker
Inputs
- Audio (line/mic level)
- Channel level controls
- Frequency controls
Outputs
- Audio (speaker level)
Key metrics
- Volume
- Frequency
- Phase
- Power output
- Temperature
Logs and events
- Input selection changed
- Preset loaded
- System powered on/off
Alerts
- Clipping
- Loss of signal
- Amp protection tripped
- Equipment overheated
Sales funnel
While engineers may not often think about sales, some of the same observability concepts are used by product owners, marketing, and sales to study and improve their sales and conversion funnel. The inputs are leads, the components are participants in the funnel, the connections are opportunities, and the outputs are won and lost deals. Just as with other systems, we monitor flows along the connections and key metrics of the components.
Purpose
Identify the best leads and shepherd them through the conversion process, resulting in product sales.
Components
- Organic search
- Search ads
- Video ads
- Industry publications
- Sales team
- Self-serve sign-up
- Onboarding
- Payment
Inputs
- Leads
Outputs
- Disqualified leads
- Successful sales
Key metrics
- Conversion rate
- Lead quality
- Total conversions
- Lifetime customer value
Logs and events
- Ad clicked
- Sales contacted
- Account created
- Onboarding completed
- Payment made
Alerts
- High rate of fraud from a given lead source
- Sudden change in conversion rate
- Sudden change in lead count
Web app backend
Now that we’ve looked at a variety of non-software systems, let’s compare those to how we monitor a typical software system. We’ll look at a simplified Ruby on Rails web backend, with a web server, a job queue, and a database. The metrics for the job queue in this simple example are similar to metrics we would use for a more complex message queue system.
Purpose
Serve dynamic and static content to users, while storing and processing user-supplied data asynchronously.
Components
- Web server
- Job server
- Database
Inputs
- Web requests
Outputs
- Web responses
Key metrics
- Requests per minute
- Request size
- Error rate
- Processing delay
- Processing time
- Queue size
- Record count
- Connection count
- Message rate
Logs and events
- Request received
- Websocket connected
- User logged in
- Invalid sign-in
- Job completed
- Error occurred
Alerts
- High error rate
- New error type
- Site not loading
- Site certificate expiring soon
- Sudden change in throughput
- Queue latency above threshold
- Average processing time above threshold
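As a sketch of how a few of these metrics can be derived from raw request data, here is a hypothetical in-process counter. This is not a Rails API; a production system would use an established metrics library rather than rolling its own:

```python
from collections import Counter

class RequestMetrics:
    """Toy in-process request counters for illustration only."""
    def __init__(self):
        self.counts = Counter()

    def record(self, status, duration_ms):
        self.counts["requests"] += 1
        self.counts["total_ms"] += duration_ms
        if status >= 500:
            self.counts["errors"] += 1

    def error_rate(self):
        total = self.counts["requests"]
        return self.counts["errors"] / total if total else 0.0

    def avg_processing_ms(self):
        total = self.counts["requests"]
        return self.counts["total_ms"] / total if total else 0.0

m = RequestMetrics()
m.record(200, 40)
m.record(200, 60)
m.record(500, 120)
```

With three requests recorded, one of them a server error, the error rate is one third and the average processing time is the mean of the three durations.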
Instrumenting any system
We’ve seen several examples of systems and some of the key metrics that will be monitored for those systems. Let’s generalize this concept so that we can instrument and observe any system. The key questions are:
- What do we monitor?
- How do we monitor it?
As part of answering these questions, you’ll want to ask a few more:
- What is our system’s purpose?
- What are the components of our system?
- What are the layers of infrastructure above, below, and within our system?
- What are the connections between components?
As you explore the answers to these questions for your system, you should start to find natural metrics for each subsystem, component, and connection, especially when you consider the type of connection.
What do we monitor?
Let’s start identifying metrics and events that we will want to monitor based on the system purpose and component breakdown. We might not need to monitor everything that we possibly could monitor, but it’s always helpful to examine the entire system in case we’ve missed something important.
Once we understand our system and have a system diagram, choosing what to monitor becomes pretty straightforward:
- Look at the lines — Monitor the inputs, outputs, and connections in the system.
- Look at the boxes — Monitor the components of the system.
- Look at the layers — Monitor each major layer of abstraction.
- Look at the big picture — Monitor the system as a whole.
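These four steps can be applied almost mechanically. A toy Python sketch, with illustrative component names, that turns a diagram's boxes, lines, and layers into a first-pass checklist:

```python
def monitoring_checklist(components, connections, layers):
    """Turn a system diagram into a first-pass list of things to monitor."""
    checklist = [f"line: {src} -> {dst}" for src, dst in connections]
    checklist += [f"box: {name}" for name in components]
    checklist += [f"layer: {name}" for name in layers]
    checklist.append("big picture: is the system serving its purpose?")
    return checklist

items = monitoring_checklist(
    components=["web server", "job server", "database"],
    connections=[("web server", "database"), ("web server", "job server")],
    layers=["hardware", "OS", "application", "users"])
```

Each entry then needs at least one natural metric, as described in the sections below.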
Connections
Look for the natural metrics for a given input, output, or connection. Some examples:
- The natural metrics for an electric power connection are voltage and current.
- Natural metrics for a fluid in a pipe are temperature, pressure, acidity, and flow rate.
- For an audio connection, you’d probably monitor amplitude.
- For web requests, we want to know the number of requests per second (throughput).
- For a message queue, monitor queue size, message size, latency, and throughput.
- For a network connection, look at bytes per second, packets per second, and packet error rate.
Whatever the type of connection, there will be natural metrics to monitor the performance of the connection.
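For example, throughput and latency summaries can be computed from timestamped samples. A minimal sketch, using the nearest-rank definition of percentile (one of several common definitions):

```python
import math

def throughput_per_second(event_times, start, end):
    """Events per second over the half-open window [start, end)."""
    return sum(start <= t < end for t in event_times) / (end - start)

def latency_percentile(samples, p):
    """Nearest-rank percentile, a natural summary for latency samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]
```

Real monitoring systems compute these over sliding windows and pre-aggregated histograms, but the underlying metrics are the same.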
Components
A chain is only as strong as its weakest link, and a system only works as well as its poorest performing component. Monitoring each component will allow us to anticipate pending failures or bottlenecks and debug any issues.
It’s especially important to monitor internal states of components that aren’t otherwise easy to determine, and the reasons for key decisions.
Some examples of internal component states to monitor as metrics:
- Gain reduction amount on an audio compressor
- CPU usage on a web server
- Disk usage on a database server
Some key decisions and reasons to log as events:
- User sign-in rejected because…
- The password was invalid.
- The captcha system is down.
- The user’s IP address has failed to sign in too many times.
- Chemical mixer shut down because…
- The input flow rate dropped below set threshold X, indicating a pipe blockage.
- The mixer’s input electrical current exceeded set threshold Y, indicating a solidified mixture or failed bearing.
Note that you might not always send this information to your end users, but you will definitely want these things internally.
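The chemical mixer example might look like the following as code. This is a hedged sketch: the function name, thresholds, and units are all illustrative:

```python
import logging

log = logging.getLogger("mixer")

def check_mixer(flow_rate, motor_current, min_flow=2.0, max_current=15.0):
    """Log the reason at the point the shutdown decision is made."""
    if flow_rate < min_flow:
        log.warning("mixer shutdown: input flow %.1f below threshold %.1f "
                    "(possible pipe blockage)", flow_rate, min_flow)
        return "shutdown"
    if motor_current > max_current:
        log.warning("mixer shutdown: motor current %.1f A above threshold %.1f A "
                    "(possible solidified mixture or failed bearing)",
                    motor_current, max_current)
        return "shutdown"
    return "running"
```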
If you still aren’t sure how to monitor a given component, try breaking it down into smaller pieces and connections, or try looking for similar examples.
Layers
While we haven’t spent as much time in this article looking at layers of abstraction, it’s important to understand the infrastructure your system is built on.
For a physical structure like a bridge, this might mean monitoring strain and vibration in the structure as well as the weight and number of vehicles.
For a software system, you might monitor hardware-level metrics like temperature, OS-level metrics like CPU and memory usage, app-level metrics like request rate and error rate, and user-level metrics like conversion rate and customer satisfaction.
System
What does the system do, and is it actually doing it? Whatever the system’s purpose, you will need to be able to answer this question. If you are struggling to identify a key metric for the system as a whole, try working on the components and connections for a while, then revisit the whole-system monitoring question.
In most cases there will be some key binary metric that tells you whether the system is functioning or not. For a web server, this can be as simple as, “Does the sign-in page load?”
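Such a binary check can be kept simple and testable by injecting the HTTP call. A sketch, where the `fetch` callable and the `/sign_in` path are hypothetical stand-ins for your HTTP client and your app's real route:

```python
def sign_in_page_healthy(fetch):
    """Whole-system binary check: does the sign-in page load?
    `fetch` is any callable returning (status_code, body), e.g. a thin
    wrapper around your HTTP client, injected so the check is testable."""
    try:
        status, body = fetch("/sign_in")
    except Exception:
        return False  # connection refused, DNS failure, timeout, etc.
    return status == 200 and "password" in body.lower()
```

Checking the body for expected content, not just the status code, catches the case where the server returns 200 for an error or maintenance page.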
How do we monitor it?
Once you know what data to monitor, you need to find the best way to gather and record that data. This process is called instrumentation. You also need ways to store and display data once gathered.
Instrumentation
As mentioned in the Connections section above, there are often natural metrics for a system, and standard ways to measure them. This could be pressure gauges, strain sensors, etc.
For software, many frameworks also have standard instrumentation available for things like request rate, errors, logging, etc. either built-in or as a third-party library. Make sure you understand what’s available off the shelf for your framework before implementing something from scratch.
If your monitoring needs exceed what’s available or standardized in your industry, here are some common approaches for deciding where to add custom instrumentation:
- First, again, focus on what’s built-in or available off the shelf. Always start with the standard metrics for your system or framework.
- For logic decisions, report or log the decision at the point the decision is made.
- Identify or create a bottleneck through which connections flow. E.g. run all pipes along the same route so all gauges can be placed together, or design software so that all requests pass through one common code path.
- Add cross-system tracking information. For web requests, this can be a header you set in the global configuration for your HTTP client and log on both source and destination systems.
- For software, be careful not to log sensitive data like passwords, personal information, etc. when adding your instrumentation.
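The cross-system tracking idea can be sketched as a small helper that attaches a correlation ID to outgoing headers. The `X-Request-Id` header name used here is a common convention, not a standard:

```python
import uuid

def with_request_id(headers, header_name="X-Request-Id"):
    """Attach a correlation ID to outgoing request headers if one isn't
    already present, so both source and destination systems can log it."""
    headers = dict(headers)  # avoid mutating the caller's dict
    headers.setdefault(header_name, uuid.uuid4().hex)
    return headers
```

Preserving an existing ID is what lets the same value follow a request through every hop in the system.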
Data storage
Once your system is instrumented, you need a place to store current and historical data. It’s common to have separate systems for metrics, logs, and errors, but the more you can centralize storage, the easier it will be to cross-check and correlate different sources of data.
Focusing on software systems, you’ll want at minimum a log storage system and a metric storage system. This could be something traditional like syslog and MRTG, a modern self-hosted system like the ELK stack and Grafana, or a third-party service like New Relic or Datadog.
Access and visualization
Data is only as valuable as it is accessible, so think about how you’re going to visualize your data. Dashboards are essential for metrics, and search and chronological review are critical for logs.
Charts and dashboards
Metrics are most useful when presented on an organized dashboard. You’ll use metrics to understand the overall health of your system, identify patterns of use and abuse, and anticipate possible failures.
- Start with a heads-up view of the key metrics for your system. You want to be able to know at a glance if your system is approaching failure or overload.
- Ideally, you’ll also want a query language or user interface for charting raw data in new ways.
- If possible, include request IDs and other metadata as dimensions in your metrics so you can correlate specific metric changes with detailed logs.
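One way to sketch metrics with dimensions is a counter keyed by metric name plus tags. This is a toy in-process version; real metric stores handle dimensions natively:

```python
from collections import defaultdict

class DimensionalCounter:
    """Count increments keyed by (metric name, sorted dimension pairs),
    so a spike can later be sliced by endpoint, status, and so on."""
    def __init__(self):
        self.data = defaultdict(int)

    def incr(self, name, **dims):
        self.data[(name, tuple(sorted(dims.items())))] += 1

    def total(self, name):
        return sum(v for (n, _), v in self.data.items() if n == name)

c = DimensionalCounter()
c.incr("requests", endpoint="/sign_in", status=200)
c.incr("requests", endpoint="/sign_in", status=500)
```

When the `requests` total spikes, the dimensions let you ask which endpoint and which status code are responsible.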
Logs
Logs should be useful, actionable, and searchable. You’ll use logs to debug specific events and failures within the system.
- Make sure your logs are useful — you want to maximize signal to noise.
- Log everything that’s essential, unless you have severe cost constraints on log storage.
- If you do have storage constraints, start by filtering out the least important events, such as health checks, debugging info, and successful requests to non-sensitive endpoints.
- Make sure your logs are detailed — don’t just log `here` in the middle of a function. Instead log a description of the step that was just completed, along with any non-sensitive data that will affect the outcome of the process. For example, `OAuth flow: users service check passed for ID 5` is far more informative than `here`.
- Include request IDs and cross-system correlation IDs in your logs, plus any other relevant metadata (user ID, account ID, etc.). If you find someone abusing your system, you want to be able to reconstruct their full list of actions.
- Use structured logs (e.g. JSON) rather than raw text if at all possible.
- Have both a search-focused and timeline-focused way of viewing your logs. You might start by searching for a specific request ID, then look at all logs within a few seconds of the first log event you find.
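Structured logging can be as simple as a formatter that emits one JSON object per line. A sketch using Python's standard `logging` module; the `ctx` attribute is a convention invented for this example, not part of the library:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line, so logs are
    machine-searchable and easy to correlate by request ID."""
    def format(self, record):
        entry = {"level": record.levelname, "message": record.getMessage()}
        entry.update(getattr(record, "ctx", {}))  # merge structured context
        return json.dumps(entry)

logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# `extra` attaches our ctx dict to the record for the formatter to merge.
logger.info("user signed in", extra={"ctx": {"user_id": 5, "request_id": "abc123"}})
```

Every field you add here becomes something you can search and filter on later.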
A few must-haves for web systems
While you’ll definitely want to use this ground-up systems approach to monitoring a complex system, there are some basic best practices you will always want to have in the web development world (sometimes you can find all of these together from one vendor):
- APM — Application Performance Monitoring. This is your tool for monitoring throughput and latency, broken down by page or controller. A good APM tool can give you a waterfall plot for slow requests as well, so you can identify bottlenecks in your system. Examples: New Relic, Datadog, Skylight, Honeybadger
- Logging — Every web system should have a request log. Make sure you log request IDs/correlation IDs, request processing time, signed-in user ID, session information, and any critical decisions plus the factors that led to that decision. Structured logs are best, but text logs are also common. Ideally your log system is fully searchable and filterable. Examples: ELK stack, CloudWatch
- Tracing — If you have more than one component in your web system that handles a request (e.g. microservices), make sure you can track the request across every component in the system. Examples: OpenTracing, simple correlation ID headers
- Error tracking — It’s always useful to have a separate tool for error reporting. At a minimum you’ll want error counts over time, stack traces, and affected users. Examples: Rollbar
- Uptime monitoring — Make sure you have something monitoring your site’s basic health. If your site goes down, you should know before your customers. Examples: UptimeRobot, Pingdom
- Alerting — Finally, when a fault is detected by the previous systems, you need a way of finding out and getting people to respond. Examples: Opsgenie, PagerDuty
Conclusion
We’ve seen some examples of systems and how they are monitored, and a process for turning any system design into an observability plan. These are the techniques I use in my own engineering, and I hope you find this systems approach to observability as useful as I have. If you’ve found any errors in this article, please get in touch.