Over the last couple of years, data observability companies have pushed hard for adoption within the modern data stack. Educating the data community along the way is both crucial and challenging.
There is no doubt that data observability improves efficiency and reliability across the data value chain. However, we also need to start building best practices across the ecosystem to maximize its value.
In the application observability ecosystem (Datadog, for example), once DevOps and SRE teams are notified of an issue, the subsequent remedial actions are clear and well understood. They may involve well-known processes such as refactoring code or spawning new instances.
In comparison, things are quite different in a data observability ecosystem. The remedial actions vary in complexity and difficulty depending on the incident, the data stack, and the data infrastructure.
So here I am sharing a few remedial actions as good practices to consider post-notification. I'm not here to tell you which platform is best or who notifies better; the focus is on the next step: as a data engineer, what you should do with that notification or alert.
Let us revisit the data observability pillars, which together quantify data health.
Volume incidents are raised when the number of events is greater or lesser than the expected threshold. These are usually notified at the table level.
Freshness incidents also occur at the table level rather than the field level, and indicate that a certain table is not being updated as intended. This is particularly alarming when a team is about to publish a report or send out emails and discovers that the data is stale.
Volume and freshness debugging steps often overlap, which is why I have merged them.
Here are a few debugging tips.
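A first pass often means verifying the numbers yourself before digging into the pipeline. Here is a minimal sketch of that, assuming a DB-API-style cursor and a table with an `updated_at` timestamp column; the table name, column name, thresholds, and paramstyle are all placeholders to adapt to your own stack.

```python
from datetime import datetime, timedelta, timezone

def check_volume(cursor, table, lookback_hours=24, min_rows=1000):
    """Compare the recent row count against an expected floor."""
    since = datetime.now(timezone.utc) - timedelta(hours=lookback_hours)
    # Table names can't be bound as parameters; `table` must be a
    # trusted internal name, never user input.
    cursor.execute(
        f"SELECT COUNT(*) FROM {table} WHERE updated_at >= ?", (since,)
    )
    row_count = cursor.fetchone()[0]
    return row_count >= min_rows, row_count

def check_freshness(cursor, table, max_staleness_hours=6):
    """Confirm the table was updated within the expected window."""
    cursor.execute(f"SELECT MAX(updated_at) FROM {table}")
    last_update = cursor.fetchone()[0]  # assumes tz-aware datetimes
    if last_update is None:
        return False, None  # empty table: treat as stale
    age = datetime.now(timezone.utc) - last_update
    return age <= timedelta(hours=max_staleness_hours), last_update
```

If both checks pass, the alert may be a threshold-tuning issue rather than a pipeline failure; if either fails, the next stop is usually the logs of the job that feeds the table.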
Schema-change incidents notify you of any structural changes at the table level that may impact your end report or downstream dependencies.
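When such a notification arrives, one useful first step is diffing the table's current schema against the last known good state, so you can tell downstream owners exactly which columns were added, dropped, or retyped. A rough sketch, assuming a warehouse that exposes `information_schema.columns` and a driver using the `%s` paramstyle (both assumptions; adjust for your stack):

```python
def get_schema(cursor, table):
    """Snapshot column names and types from information_schema."""
    cursor.execute(
        "SELECT column_name, data_type FROM information_schema.columns "
        "WHERE table_name = %s ORDER BY column_name",
        (table,),
    )
    return dict(cursor.fetchall())

def diff_schema(previous, current):
    """Report added, dropped, and retyped columns between two snapshots."""
    shared = set(previous) & set(current)
    return {
        "added": set(current) - set(previous),
        "dropped": set(previous) - set(current),
        "retyped": {c for c in shared if previous[c] != current[c]},
    }
```

Persisting a snapshot after each successful run gives you a last-known-good state to diff against the next time an alert fires.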
I personally believe this is where a data observability tool makes the difference and drives ROI. Whether it's validating a field or checking data quality for ML models, field-level monitoring is the most critical module for companies to leverage. Debugging steps vary with the architecture and with what falls under the data team's purview.
Let me list a few possible debugging steps.
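To make one of these concrete, here is a hedged sketch of two field-level health checks using pandas and SciPy: a null-rate check, plus a two-sample Kolmogorov-Smirnov test against a baseline sample to flag distribution drift in a numeric column. The thresholds and names are illustrative, not prescriptive.

```python
import pandas as pd
from scipy.stats import ks_2samp  # two-sample Kolmogorov-Smirnov test

def field_health(df: pd.DataFrame, column: str, baseline: pd.Series,
                 max_null_rate: float = 0.01, drift_alpha: float = 0.05):
    """Return a list of field-level issues found in df[column]."""
    issues = []
    null_rate = df[column].isna().mean()
    if null_rate > max_null_rate:
        issues.append(f"null rate {null_rate:.2%} exceeds {max_null_rate:.2%}")
    # The KS test flags a shift in the column's distribution vs. baseline;
    # it only applies to numeric columns.
    _, p_value = ks_2samp(df[column].dropna(), baseline.dropna())
    if p_value < drift_alpha:
        issues.append(f"possible distribution drift (KS p-value {p_value:.4f})")
    return issues
```

A check like this run right after the alert tells you whether the anomaly is still present in the current data or was a transient blip.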
One of the most difficult modules/projects is data backfilling. "Upserting events" seems straightforward, but it isn't. There are a few critical aspects to keep in mind: ensuring events aren't duplicated, records don't leak, and actively written tables are handled with care.
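To make the duplication point concrete, here is a rough sketch of a staging-and-merge pattern that keeps a backfill idempotent: load the repaired events into a staging table, then merge on a natural key, bounded to the backfill window so live rows outside it are untouched. `MERGE` support and bind-parameter syntax vary by warehouse, and every name below is a placeholder.

```python
# Idempotent backfill merge, expressed as a Python constant so it can be
# run by whatever client your stack uses. Merging on a natural key means
# retries don't create duplicates; bounding updates to the backfill
# window keeps the merge from leaking into live rows outside the range.
MERGE_BACKFILL = """
MERGE INTO events AS target
USING staging_events AS source
  ON target.event_id = source.event_id
WHEN MATCHED AND source.event_ts BETWEEN :start_ts AND :end_ts THEN
  UPDATE SET payload = source.payload,
             updated_at = source.updated_at
WHEN NOT MATCHED THEN
  INSERT (event_id, event_ts, payload, updated_at)
  VALUES (source.event_id, source.event_ts,
          source.payload, source.updated_at);
"""
```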
Our next post will go into detail on data backfilling and the techniques a company may use based on its data stack.
I hope this post gives a high-level overview of what to do after receiving a notification from a data observability tool. I’m excited to learn more from the comments and feedback.
Interested in learning more about data observability? Reach out to me, or visit our site for more info.
P.S.: We plan to launch our product by the end of September; join the waitlist to get a 30-day free trial.