Observability: What to instrument?

One of the struggles I've had onboarding a team to observability is a chicken-and-egg problem: you don't have enough code instrumented to get real value out of your observability tools, but engineers won't instrument their code in a meaningful way until they've entered that virtuous cycle of code-instrument-evaluate-reinstrument.

Auto-instrumentation is helpful for bootstrapping your way out of this problem, but it only takes you so far. More value from observability comes from manually instrumenting your code, so that your traces and spans reflect your system's business logic and problem domain, instead of the low-level, technical (and non-differentiating) aspects picked up by auto-instrumentation.

As an observability champion in your organisation, you may have already recognised this problem and tried to instrument a large swathe of the codebase yourself before evangelising an observability tool like Honeycomb, but your efforts can only take you so far. The process is only self-sustaining if your developers instrument their own code, which will be difficult for them to get right if they are not actively using an observability tool to see what happens when they do.

It takes time to build up an understanding of observability among your developers, and for them to develop a heuristic sense of what to instrument (and what not to) when uplifting existing code, something that only comes from practice. (This article has come out of my early experience teaching a team how to instrument their code, but I suspect there is much more to it than what I've written below.)

Primitives of instrumentation

If you've played around with OpenTelemetry or any similar instrumentation tool, you probably already know this, but there seem to be four main ways we can instrument our code:

  • Add attributes to a span
  • Wrap an existing block of code in a span
  • Record an event within a span
  • Propagate context and baggage to child spans (trace ID) or related spans (links)

Each of these comes with its own strengths and disadvantages, and things which intuitively make sense (or not) as you begin to use them in earnest.

Add attributes to a span

This is the principal approach taken by auto-instrumentation, and should be your first tool when approaching uninstrumented code. Attributes add context to the activity, and both high-cardinality attributes (which make it easier to narrow your search to a particular user or transaction) and high-dimensionality attributes (which help group transactions with the same characteristics together) are equally worthwhile adding.

It's tempting to add everything you can, especially as storage is cheap, but attributes are only useful if they can be searched and indexed. Values should be short and either highly unique (like user IDs or transaction IDs), or frequently occurring but bucketed (like an HTTP method or path). Blob values and serialised objects are best left for logs or their own events, as they are neither searchable nor indexable, but very useful for looking at a particular instance of an issue.
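
To make this concrete, here's a minimal sketch using the OpenTelemetry Python API. The checkout domain and the app.* attribute names are my own illustrative choices, not anything prescribed:

```python
from opentelemetry import trace

tracer = trace.get_tracer("checkout")

def process_order(order_id: str, customer_id: str, items: list):
    with tracer.start_as_current_span("checkout.process_order") as span:
        # High-cardinality values: narrow a search to one customer or order.
        span.set_attribute("app.order.id", order_id)
        span.set_attribute("app.customer.id", customer_id)
        # High-dimensionality, bucketed values: group similar transactions.
        span.set_attribute("app.order.item_count", len(items))
        span.set_attribute("app.payment.method", "card")
        ...  # business logic
```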

Building up a hierarchy or schema of attribute names is also critical, especially when it is shared and used consistently across your codebase. A hierarchy makes attributes much easier to discover when constructing a search, and consistency means fewer clauses in your WHERE statement trying to match every permutation of an attribute.

A good place to start is the Trace Semantic Conventions defined by various OpenTelemetry working groups. Although many of these are in draft, they are usually well-described, and make it easier to align your system to industry standards. Even if you're not using a particular technology, they may provide ideas and context on how to structure your attribute hierarchies.
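
One lightweight way to keep names consistent is a shared constants module that every service imports, sitting alongside the standard semantic conventions. The app. namespace and keys below are illustrative, not an OpenTelemetry standard:

```python
# attributes.py: the team's shared attribute schema. Spelling each name
# out exactly once keeps searches consistent across services.
ORDER_ID = "app.order.id"
ORDER_STATUS = "app.order.status"
CUSTOMER_ID = "app.customer.id"
PAYMENT_METHOD = "app.payment.method"
```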

Wrap an existing block of code in a span

Auto-instrumentation also focuses on this approach, especially for network calls or calls to third-party systems (e.g. DNS, socket, and HTTP instrumentation). For business logic, it can also be helpful to wrap a set of related activity, especially when it groups together three or more other spans, to make it easier to distinguish the different parts of a trace from a systems perspective.

What isn't as helpful is wrapping single spans in another span, which is easy to overlook if you're manually instrumenting network requests that have already been auto-instrumented. Unless you get extra value from the auto-instrumented span (e.g. an HTTP call), your wrapping span (which likely contains a lot more context) should be used instead.

For example, if you're wrapping a call to Salesforce, you will add things like AccountIds or the values of fields being changed, which is far closer to the intent and purpose of the API call, so you may want to suppress the instrumentation of the underlying HTTP call (some auto-instrumentation lets you configure this). However, if your downstream system is particularly flaky, or is something under your management, keeping the low-level span may be useful for more technical debugging.
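
As a sketch of that Salesforce example (the client and field names here are hypothetical stand-ins for your actual SDK), the wrapping span carries the business intent rather than the transport details:

```python
from opentelemetry import trace

tracer = trace.get_tracer("crm")

def reassign_account(client, account_id: str, new_owner_id: str):
    # The wrapping span records the intent of the call, not its transport.
    with tracer.start_as_current_span("salesforce.reassign_account") as span:
        span.set_attribute("app.salesforce.account_id", account_id)
        span.set_attribute("app.salesforce.new_owner_id", new_owner_id)
        # The client's underlying HTTP request may still produce its own
        # auto-instrumented child span unless you suppress it.
        return client.update("Account", account_id, {"OwnerId": new_owner_id})
```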

Another consideration is the value of the spans you are creating, especially from auto-instrumentation. You should focus on adding spans when they help describe things you can improve. For example, in highly managed environments like AWS Lambda, detailed information about the underlying system, hostname, or process ID is probably unhelpful, as you do not control these aspects of the execution environment and can do little to affect them. Anything more than HTTP or AWS instrumentation will be of little value in an AWS Lambda environment, but conversely, adding network, DNS, and process instrumentation would be extremely valuable when managing your own servers or containers.

When you're running in a highly managed environment you will get more value out of a far smaller set of auto-instrumentation, so it makes sense to instrument "up the stack", i.e. to focus more on grouping related API calls and instrumenting business logic rather than on low-level system details. In environments where you manage more of your own infrastructure, both will be important to the different people that rely on them.

Record an event within a span

OpenTelemetry supports the idea of events, which are like point-in-time spans: they have no duration or end time, but still carry attributes. I've seen them used to describe DOM rendering instrumentation, but another use has been to record key decision-making in my code without generating extra span "noise" in my traces, as events have their own attributes but don't appear as another line in the trace graph.
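
For example (a sketch; the discount rule and attribute names are invented), a branch decision can be recorded as an event on the current span rather than as a child span:

```python
from opentelemetry import trace

def apply_discount(order):
    span = trace.get_current_span()
    # Record the branch taken as a point-in-time event, with attributes,
    # instead of cluttering the trace graph with another child span.
    if order.total >= 100:
        span.add_event(
            "discount.applied",
            {"app.discount.rule": "over-100", "app.discount.percent": 10},
        )
        order.total *= 0.9
    else:
        span.add_event("discount.skipped", {"app.discount.rule": "over-100"})
    return order
```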

Beyond this, I've found them fairly limited, but they may be useful to keep in mind. They have the same capacity for abuse as over-creating spans, but their impact is mostly limited to your observability tool's usage thresholds (and the associated extra costs).

Propagate context and baggage

One of the more fundamental auto-instrumentation capabilities is propagating context (effectively your trace ID and span ID) to child spans across network boundaries, such as in HTTP headers or queue message attributes, ensuring that your downstream spans get added to your parent trace instead of creating a new one. Sometimes you will need to do this manually, but this is thankfully a low-level and usually one-off task for parenting your spans into the same trace.
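
If you do have to wire it up yourself, the OpenTelemetry propagation API makes it a few lines on each side. This sketch assumes an HTTP hop made with the requests library:

```python
import requests
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("orders")

def send(url: str, payload: dict):
    headers = {}
    inject(headers)  # writes the W3C traceparent (and baggage) headers
    return requests.post(url, json=payload, headers=headers)

def handle(request_headers: dict, body: dict):
    # Extract the caller's context so this span joins their trace,
    # instead of starting a new one.
    ctx = extract(request_headers)
    with tracer.start_as_current_span("orders.handle", context=ctx):
        ...  # process the request
```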

Another way to propagate context is to create links instead, which indicate a related span in another trace. For example, traces that describe queue message processing will often create a link back to the sending span, instead of joining the sending span's trace, because the queue message processor technically runs in its own context, where it might be processing a batch of messages from unrelated senders (a span can only have one parent). Links are better for these scenarios, as they still tie related spans together, but preserve the meaning of a trace where the code processing those messages may need to be analysed per batch instead of per message. Links can also have their own attributes, a bit like an event.
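
In code, that batch processor might look like this sketch, assuming each message carries the SpanContext recovered from its propagated headers:

```python
from opentelemetry import trace

tracer = trace.get_tracer("consumer")

def process_batch(messages):
    # A span has exactly one parent, but can carry many links: one back
    # to each message's sending span, each with its own attributes.
    links = [
        trace.Link(msg.span_context, {"app.queue.message_id": msg.id})
        for msg in messages
    ]
    with tracer.start_as_current_span("queue.process_batch", links=links):
        for msg in messages:
            ...  # handle each message
```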

OpenTelemetry also gives you the concept of baggage, which is extra data you can attach to the context and which is propagated downstream. Because of this, you need to be careful about what you propagate, both in terms of volume (baggage adds extra transmission overhead every time it is propagated) and sensitivity (how much personal or technical information do you want to accidentally send to third-party systems from your instrumentation?).
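
Setting baggage is only a couple of lines (the keys here are illustrative); anything attached to the context travels downstream in the W3C baggage header alongside traceparent:

```python
from opentelemetry import baggage, context

# Small, non-sensitive values only: everything here is sent downstream.
ctx = baggage.set_baggage("app.customer.id", "cust-42")
ctx = baggage.set_baggage("app.request.source", "mobile", context=ctx)
token = context.attach(ctx)  # make it current; context.detach(token) later
```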

The other problem with baggage is that its values aren't automatically added to child spans as attributes: you still need to create a custom span processor that looks for baggage in the context and adds it to child spans. With this in place, baggage becomes much more useful, as your child spans will carry the extra attributes. It's great for propagating the high-cardinality ones, so that problems related to a particular user or transaction surface in your searches directly at the span that caused them, instead of you having to navigate into each trace to see if there was an exception.
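
A minimal version of such a processor, assuming you namespace your own baggage keys under app. so you don't hoover up baggage set by other systems:

```python
from opentelemetry.baggage import get_all
from opentelemetry.sdk.trace import SpanProcessor

class BaggageSpanProcessor(SpanProcessor):
    """Copies namespaced baggage entries onto every span as it starts."""

    def __init__(self, prefix: str = "app."):
        self._prefix = prefix

    def on_start(self, span, parent_context=None):
        for key, value in get_all(parent_context).items():
            if key.startswith(self._prefix):
                span.set_attribute(key, str(value))
```

Register it on your TracerProvider with provider.add_span_processor(BaggageSpanProcessor()), alongside the processor that feeds your exporter.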

Deciding what to add to baggage and what to copy from baggage onto your spans is key, but the end result is much more searchable spans that make it quicker to find individual issues at their cause (instead of at the root span).

This article goes into much more detail about the why and how of baggage.

Bringing it together

As suggested, each of the above primitives has its own strengths and weaknesses, and it's only by applying each of them appropriately that we get the full picture of our system in the traces it produces. For example, wrapping everything in custom spans is of no use if we don't add the right cardinality or dimensionality with custom attributes to find and filter them; propagating baggage makes no sense if we haven't got the right spans to structure the description of our system's behaviour into different units of work.

Getting a sense of what is useful and what is not comes from experimenting. Everyone will then get a better sense of what is working, what needs to be improved, and what is delivering little value. Teams where developers investigate and fix problems as they arise in production will find this easier than those that separate operational and development roles, or where only the senior developers investigate issues. You should be encouraging the developers on your team to look at the traces produced by their own code, both when developing it and when it is running in production.