Observability
Incidents in production environments, bugs we can never understand or reproduce in test environments, data whose state changes in ways we don't understand: a lack of mature observability can lead us there.
According to Dynatrace (a major player in the monitoring business): "Observability is the extent to which the internal states of a system can be inferred from externally available data. An observable software system provides the ability to understand any issue that arises. Conventionally, the three pillars of observability data are metrics, logs and traces. Dynatrace extends this with UX and topology information. However, turning data into answers requires more than just an observable system."
As software gets more and more complex, observation complexity follows. Starting from the basics, the first layer of observability is having data coming out of our apps and infrastructure. For this challenge we need to guarantee three things: logging, metrics, and tracing.
Logging: the application emits a string or an object detailing something that has happened while the code is running. These entries can serve different purposes, such as informing about normal operation or warning about something unexpected.
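As an illustration, here is a minimal logging sketch using Python's standard logging module; the logger name, function, and messages are invented for the example, and a real setup would typically emit structured logs and ship them to a central store.

```python
import logging

# Basic configuration; production setups usually emit structured (e.g. JSON)
# logs and forward them to a central store such as the ELK stack.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("payments")  # hypothetical service name

def charge(order_id: str, amount: float) -> None:
    # An informational entry: something normal happened.
    logger.info("charging order %s for %.2f", order_id, amount)
    if amount <= 0:
        # A warning entry: something unusual, but not necessarily an error.
        logger.warning("order %s has a non-positive amount: %.2f", order_id, amount)

charge("ord-42", 19.99)
```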
Metrics: a service provides a metric key (the what) and a value, which is combined with a timestamp (the when) to form time-series data, so that values can be charted over a time interval as a set of data points. For both logging and metrics, though, it's not just the application that provides insight: fabric (like cloud infrastructure), databases, caches, queues, servers, and all sorts of other components generate telemetry with varying degrees of insight.
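As a sketch, emitting metrics might look like this with the prometheus_client library (one possible choice, not something the text prescribes); the metric names, labels, and port are assumptions made for the example.

```python
# Sketch of exposing a counter and a latency histogram via prometheus_client.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total HTTP requests handled", ["status"])
LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")

def handle_request() -> None:
    start = time.monotonic()
    # ... real work would happen here; we simulate it ...
    time.sleep(random.uniform(0.01, 0.05))
    status = "200" if random.random() > 0.05 else "500"
    REQUESTS.labels(status=status).inc()       # the metric key (what) and its value
    LATENCY.observe(time.monotonic() - start)  # one data point for the time series

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for a scraper to collect
    while True:
        handle_request()
```

The timestamp (the when) is attached when the monitoring system scrapes /metrics, which is what turns these values into a chartable time series.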
Tracing: application tracing is about recording the execution flow through a piece of software, tracking things like method call details, response times, and so on.
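A minimal tracing sketch, assuming the OpenTelemetry Python SDK (any tracing library would do); the span names and attributes are made up, and spans are simply printed to the console instead of being exported to a tracing backend.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print finished spans to the console; a real deployment would export them
# to a tracing backend instead.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer(__name__)

def fetch_user(user_id: str) -> dict:
    # Each span records what was called and how long it took.
    with tracer.start_as_current_span("fetch_user") as span:
        span.set_attribute("user.id", user_id)
        # A nested span shows up as a child step in the execution flow.
        with tracer.start_as_current_span("db.query"):
            return {"id": user_id, "name": "example"}

fetch_user("123")
```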
The next layer of observability maturity is having actual monitoring. The purpose of this layer is to infer indicators from metrics derived from those data sources, and ultimately detect anomalies. Some examples of this might include flagging a problem if a given API returns errors more than (say) 1% of the time, or if response time exceeds (say) 100ms.
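As a toy sketch of such a rule (the thresholds mirror the 1% and 100ms examples above; the data structures are invented for illustration):

```python
# Flag an anomaly if the error rate exceeds 1% or the 95th-percentile
# response time exceeds 100 ms over a window of samples.
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class Sample:
    status: int        # HTTP status code
    latency_ms: float  # response time in milliseconds

def check_window(samples: list[Sample]) -> list[str]:
    anomalies = []
    error_rate = sum(s.status >= 500 for s in samples) / len(samples)
    p95 = quantiles([s.latency_ms for s in samples], n=20)[18]  # 95th percentile
    if error_rate > 0.01:
        anomalies.append(f"error rate {error_rate:.1%} exceeds 1%")
    if p95 > 100:
        anomalies.append(f"p95 latency {p95:.0f} ms exceeds 100 ms")
    return anomalies
```

In practice a monitoring platform evaluates rules like this continuously over rolling windows of the collected telemetry.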
Last but not least, alerting. The third layer of observability ensures that the monitoring platform generates an event each time an anomaly is detected, so that someone (or something) can act on it.
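A sketch of what that might look like, assuming a hypothetical webhook endpoint; real alerting targets (PagerDuty, Opsgenie, Slack, and so on) define their own APIs and payload shapes.

```python
# Emit an alert event whenever the monitoring check reports anomalies.
import json
import urllib.request
from datetime import datetime, timezone

ALERT_WEBHOOK = "https://alerts.example.com/hook"  # placeholder endpoint

def send_alert(service: str, anomalies: list[str]) -> None:
    event = {
        "service": service,
        "anomalies": anomalies,
        "detected_at": datetime.now(timezone.utc).isoformat(),
    }
    request = urllib.request.Request(
        ALERT_WEBHOOK,
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)  # fire the alert to the on-call channel

# Example wiring with the monitoring sketch above:
# anomalies = check_window(samples)
# if anomalies:
#     send_alert("payments-api", anomalies)
```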
There are plenty of tools on the market that will help you implement this in your software, such as Dynatrace, Datadog, and the ELK stack. Along with the tools, there are some design patterns for the implementation itself that can come in handy; check the Observability section of the microservices.io portal.
This is a brief explanation of how to build mature observability; we recommend digging deeper, starting from the references below.
Credits