Instrument your build pipelines to level up your DevOps maturity

Authors
  • Chris Armstrong

Note: this is part of a series on build pipeline instrumentation and observability

Observability is not just metrics, traces and logs: it's your organisation's ability to identify issues in your production systems in a way that matters to your product domain, and to respond to them efficiently.

However, a common blind spot is to think that observability is just about production systems. Critical development infrastructure like build systems and internal tools should not be left out of the observability story, as they affect your development team's ability to build and test changes and to maintain a regular release cadence.

Why it matters

Failing to identify when your build pipeline is slow, flaky or unstable leaves your organisation vulnerable to unexplained drops in development velocity. These problems only show up when you realise your planned estimates are continually too short, or when developers start padding estimates to deal with "build issues".

Maybe you examine your cycle time as part of a regular review of [DORA metrics](https://www.atlassian.com/devops/frameworks/dora-metrics), but that information will still arrive too late unless you also collect regular metrics on your build system.

Although DevOps has become something of a dirty word and has fallen out of vogue, it identifies a very real need to address the "engineering side of engineering" that would otherwise be forgotten by developers.

Build pipeline problems are one of those issues that can become systemic in organisations where developers work on the same codebase, but no one takes responsibility for monitoring and improving build pipeline speeds.1 This can also occur in organisations with relatively independent development teams, where they may not be responsible for building and deploying their own code.2

Metrics show you what, traces show you why

You may already be in a position where releases are weekly or monthly and the chance of production-level bugs is high, but you are blocked on increasing release frequency because the "build system is too slow". This is where more attention to your build system's performance, beyond basic metrics, will help you understand why things are slow.

Depending on your build tool, you may already have metrics on which build jobs take the longest, or which ones fail the most. If you're lucky, your build tool keeps historical data, allowing you to see whether jobs have become slower over time or error rates have increased.
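
If your build tool doesn't expose this data, it is not hard to emit it yourself. Here is a minimal sketch, assuming the OpenTelemetry Python SDK; the instrument names and the `record_job` wrapper are illustrative, not part of any particular build tool:

```python
import time

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

# Console exporter for demonstration; a real setup would export over OTLP
# to whatever backend keeps your historical benchmarks.
reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
meter = metrics.get_meter("build-pipeline")

# Hypothetical instruments: a duration histogram and a failure counter per job.
job_duration = meter.create_histogram("build.job.duration", unit="s")
job_failures = meter.create_counter("build.job.failures")


def record_job(name: str, fn) -> None:
    """Run a build step and record how long it took and whether it failed."""
    start = time.monotonic()
    try:
        fn()
    except Exception:
        job_failures.add(1, {"job.name": name})
        raise
    finally:
        job_duration.record(time.monotonic() - start, {"job.name": name})
```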

But even if you can measure build times and error rates, and you regularly monitor them for divergence against historical benchmarks, metrics alone will not show you why your builds are slow or constantly falling over.

Spans help with prioritising improvements

This is where proper instrumentation of build pipelines and jobs will show you not only what is slow or failing, but why. Building traces with spans does more than show how long each job takes; it lets you monitor performance in more actionable ways:

  • which jobs are on the critical path (i.e. which jobs block other jobs)
  • how long pipelines take end-to-end, e.g. how long it takes to spin up a new review environment for developers to test their changes3
  • where there are opportunities to parallelise jobs
  • which build jobs are the slowest or flakiest and should be targeted first for remediation
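
As a concrete sketch of what this looks like, the following assumes the OpenTelemetry Python SDK and a hypothetical Python wrapper script driving the build; the job names and commands are illustrative. Each job becomes a child span of a single pipeline span, so the resulting trace shows both the end-to-end duration and which jobs sit on the critical path:

```python
import subprocess

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Console exporter for demonstration; a real pipeline would export over OTLP.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("build-pipeline")


def run_job(name: str, command: list[str]) -> None:
    # Each build job becomes a child span of the surrounding pipeline span.
    with tracer.start_as_current_span(name) as span:
        result = subprocess.run(command)
        span.set_attribute("job.exit_code", result.returncode)


with tracer.start_as_current_span("pipeline"):
    run_job("install-dependencies", ["npm", "ci"])
    run_job("unit-tests", ["npm", "test"])
    run_job("build", ["npm", "run", "build"])
```

Sequential `run_job` calls like these show up as a serial chain in the trace, which is exactly where opportunities to parallelise jobs become visible.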

Rich attributes help you drill down to the cause

Adding spans is only part of the story with tracing. Ensuring that your spans have a rich set of attributes about their execution context, failures, build tools, and interactions with third-party systems will help you pin down:

  • the causes of flaky tests, not just which ones are failing the most
  • common sources of failures across otherwise unrelated build jobs (such as a particular exception or error message)
  • commonalities between build jobs that may be affecting performance (e.g. the use of a particular package management tool)4
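
As a sketch of what such attributes might look like (the attribute keys below are illustrative rather than official OpenTelemetry semantic conventions, and the tracer provider is assumed to be configured as in the previous example):

```python
import os
import platform
import subprocess

from opentelemetry import trace

tracer = trace.get_tracer("build-pipeline")

with tracer.start_as_current_span("integration-tests") as span:
    # Execution context: where the job ran and what it was building.
    span.set_attributes({
        "ci.runner.os": platform.system(),
        "ci.branch": os.environ.get("GIT_BRANCH", "unknown"),
        "ci.commit": os.environ.get("GIT_COMMIT", "unknown"),
        # Build tooling in use, to spot commonalities across slow jobs.
        "build.tool": "npm",
    })
    try:
        subprocess.run(["npm", "test"], check=True)
    except Exception as exc:
        # Record the failure so common error messages can be compared
        # across otherwise unrelated build jobs.
        span.record_exception(exc)
        span.set_status(trace.StatusCode.ERROR)
        raise
```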

Instrumenting your code with build information helps you profile your tests

Lastly, instrumenting not only your builds, but also the code that runs in your development and test systems with information about your builds, can help you trace your API implementations back to specific builds, letting you:

  • identify calls to third-party systems that may be a common cause of test failures (but that don't show up in production)
  • find particularly slow tests that call many APIs, or tests that depend on particularly slow APIs
  • build a timeline or profile of your test cases that link your integration or E2E tests to particular API calls or system flows
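
One simple way to do this, sketched below under the assumption that the services under test are already instrumented with the OpenTelemetry Python SDK, is to stamp build information onto the service's telemetry as resource attributes; the environment variable names and service name here are hypothetical:

```python
import os

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# Build metadata injected by the pipeline into the review/test environment.
resource = Resource.create({
    "service.name": "orders-api",  # hypothetical service under test
    "build.id": os.environ.get("BUILD_ID", "unknown"),
    "build.pipeline": os.environ.get("PIPELINE_NAME", "unknown"),
    "vcs.commit": os.environ.get("GIT_COMMIT", "unknown"),
})

# Every span this service emits, including spans around calls to third-party
# APIs during integration or E2E tests, now carries the build metadata and
# can be joined back to the build pipeline's own traces.
trace.set_tracer_provider(TracerProvider(resource=resource))
```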

Next up: how instrumentation helps find performance and error issues, and practical steps to instrument your build system with OpenTelemetry

Footnotes

  1. If it's everyone's responsibility, then it's no one's responsibility.

  2. This can also be the case in highly autonomous teams that manage their own infrastructure, but where there is not a strong culture of frequent releases and build pipeline maintenance.

  3. A review environment per branch is very common in serverless-based development organisations, which can spin up a full replica of production for each branch a developer works on. Mature engineering organisations that have complex inter-system dependencies will also use automated environment initialisation or re-initialisation to ensure that developers do not waste time debugging issues caused by stale testing data or misconfigured local setups.

  4. This may especially be the case in larger organisations where there are very heterogeneous tool sets, e.g. individual teams can choose their own build tools or programming languages.