
Observability

Incidents in production environments, bugs we can never understand or reproduce in test environments, unexpected behaviors that alter data in ways we can't explain: a lack of mature observability can lead us to all of this.

According to Dynatrace (a major vendor in the monitoring space): "Observability is the degree to which the internal states of a system can be inferred from externally available data. An observable software system offers the ability to understand any problem that arises. Conventionally, the three pillars of observability data are metrics, logs, and traces. Dynatrace extends this with UX and topology information. However, turning data into answers requires more than just an observable system."

As software becomes increasingly complex, observing it becomes just as complex. Starting with the basics, the first layer of observability is making sure data flows out of our applications and infrastructure. To get there, we need three things: logging, metrics, and tracing.

  1. Logging: the application emits a string or an object describing something that happened while the code was running. These entries can serve different purposes, such as informing about normal events or warning about suspicious ones (see the logging sketch after this list).

  2. Metrics: a service provides a metric key (the what) and a value, combined with a timestamp (the when) to form time series data, so that values can be charted over a time interval as a set of data points. For both logging and metrics, it's not just the application that provides insight: the fabric (such as cloud infrastructure), databases, caches, queues, servers, and all sorts of other components generate telemetry of varying depth (see the metrics sketch after this list).

  3. Tracing: application tracing records the execution flow through a piece of software, tracking things like method call details, response times, and so on (see the tracing sketch after this list).
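
For the logging item, here is a minimal sketch using Python's standard logging module; the service name, event messages, and the 10,000 threshold are illustrative, not from the original text. In production, a handler would ship these records to an aggregator such as the ELK stack or Datadog.

```python
import logging

# Configure a basic logger; a real service would attach a handler that
# ships records to a log aggregation backend.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("orders-service")

def place_order(order_id: str, amount: float) -> None:
    # Informational event: something routine but noteworthy happened.
    logger.info("order received: id=%s amount=%.2f", order_id, amount)
    if amount > 10_000:
        # Warning event: not an error, but worth someone's attention.
        logger.warning("unusually large order: id=%s amount=%.2f", order_id, amount)

place_order("A-123", 25_000.00)
```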
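For the metrics item, the sketch below shows the key/value/timestamp shape described above as a plain in-memory data structure; the metric name and the storage list are hypothetical stand-ins for a real metrics backend (Prometheus, Datadog, Dynatrace, and so on).

```python
import time
from dataclasses import dataclass

@dataclass
class DataPoint:
    key: str          # the "what", e.g. "http_request_duration_ms"
    value: float      # the observed value
    timestamp: float  # the "when", seconds since the epoch

# An in-memory time series; a real service would push these points
# to a metrics backend instead of keeping them in a list.
series: list[DataPoint] = []

def record_metric(key: str, value: float) -> None:
    series.append(DataPoint(key, value, time.time()))

# Example: record the latency of each handled request as a data point.
record_metric("http_request_duration_ms", 42.0)
record_metric("http_request_duration_ms", 118.0)
```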
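For the tracing item, a toy decorator can illustrate the idea of recording which method ran and how long it took. This is only a sketch: a real tracer (OpenTelemetry, for instance) would create spans tied to a trace id propagated across service boundaries.

```python
import functools
import time

def traced(func):
    """Record which function ran and how long it took (a toy span)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            # Printing stands in for exporting the span to a tracing backend.
            print(f"span: {func.__qualname__} took {elapsed_ms:.1f} ms")
    return wrapper

@traced
def fetch_customer(customer_id: str) -> dict:
    time.sleep(0.05)  # simulate a database call
    return {"id": customer_id}

fetch_customer("C-42")
```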

The next layer of observability maturity is actual monitoring. Its purpose is to derive indicators from the metrics produced by those data sources and ultimately detect anomalies. Examples include flagging a problem when a given API returns errors more than (say) 1% of the time, or when response time exceeds (say) 100 ms.
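
A hedged sketch of how such checks could be evaluated over a window of collected telemetry; the thresholds mirror the examples above, while the response record shape (`status`, `duration_ms`) is an assumption made for illustration.

```python
def check_api_health(responses: list[dict]) -> list[str]:
    """Evaluate simple indicators over a window of recorded responses."""
    anomalies = []
    total = len(responses)
    if total == 0:
        return anomalies

    # Indicator 1: share of server errors in the window.
    errors = sum(1 for r in responses if r["status"] >= 500)
    error_rate = errors / total
    if error_rate > 0.01:  # more than 1% of calls failed
        anomalies.append(f"error rate {error_rate:.1%} exceeds 1%")

    # Indicator 2: responses slower than the latency threshold.
    slow = [r for r in responses if r["duration_ms"] > 100]
    if slow:
        anomalies.append(f"{len(slow)} responses exceeded 100 ms")

    return anomalies

window = [
    {"status": 200, "duration_ms": 35},
    {"status": 500, "duration_ms": 220},
    {"status": 200, "duration_ms": 48},
]
print(check_api_health(window))
```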

Last but not least, alerting. The third layer of observability ensures that the monitoring platform generates an event each time an anomaly is detected, so that it can be acted upon.
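
As a final sketch, here is one way an anomaly could be turned into an alert event. The event fields, severity, and service name are illustrative, and the `print` call stands in for whatever delivery channel (a webhook, a pager, e-mail) the monitoring platform would actually notify.

```python
import json
import time

def raise_alert(anomaly: str, service: str) -> None:
    """Turn a detected anomaly into an alert event (illustrative shape)."""
    event = {
        "service": service,
        "anomaly": anomaly,
        "severity": "critical",
        "detected_at": time.time(),
    }
    # Printing stands in for the real notification channel.
    print("ALERT:", json.dumps(event))

# Each anomaly reported by the monitoring layer becomes one alert event.
for anomaly in ["error rate 2.3% exceeds 1%", "3 responses exceeded 100 ms"]:
    raise_alert(anomaly, service="orders-api")
```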

There are plenty of tools on the market that will help you implement this in your software, such as Dynatrace, Datadog, and the ELK stack. Alongside the tools, there are design patterns for the implementation itself that can come in handy; check the Observability section of the microservices.io portal.

This is a brief explanation of how to build mature observability; we recommend digging deeper, starting from the references below.

Credits