Over the last couple of years, data observability companies have pushed hard for adoption within the modern data stack. Educating the data community along the way is both crucial and challenging.
There is no doubt that data observability improves efficiency and reliability across the data value chain. However, we also need to start building best practices across the ecosystem to maximize its value.
In the application observability ecosystem (Datadog, for example), once DevOps and SRE teams are notified of an issue, the subsequent remedial actions are clear and well understood. They may involve well-known processes such as refactoring code or spawning new instances.
In comparison, things are quite different in a data observability ecosystem. The remedial actions vary in complexity and difficulty depending on the incident, the data stack, and the data infrastructure.
So here I am sharing a few remedial actions as good practices to consider post-notification. I'm not here to tell you which platform is best or who notifies better; the focus is on the next step: as a data engineer, what you should do with that notification or alert.
Let us revisit the data observability pillars, which together quantify data health.
Volume incidents are raised when the number of events is greater or lesser than the expected threshold. These are usually notified at the table level.
Freshness incidents also occur at the table level rather than the field level, and indicate that a certain table is not being updated as intended. This is particularly alarming when a team is about to publish a report or send out emails and discovers that the data is stale.
Volume and freshness debugging steps often overlap, which is why I have merged them.
Here are a few debugging tips.
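A first pass often means verifying the numbers yourself before digging into the pipeline. Here is a minimal sketch of that, assuming a DB-API-style cursor and a table with an `updated_at` timestamp column; the table name, column name, thresholds, and paramstyle are all placeholders to adapt to your own stack.

```python
from datetime import datetime, timedelta, timezone

def check_volume(cursor, table, lookback_hours=24, min_rows=1000):
    """Compare the recent row count against an expected floor."""
    since = datetime.now(timezone.utc) - timedelta(hours=lookback_hours)
    # Table names can't be bound as parameters; `table` must be a
    # trusted internal name, never user input.
    cursor.execute(
        f"SELECT COUNT(*) FROM {table} WHERE updated_at >= ?", (since,)
    )
    row_count = cursor.fetchone()[0]
    return row_count >= min_rows, row_count

def check_freshness(cursor, table, max_staleness_hours=6):
    """Confirm the table was updated within the expected window."""
    cursor.execute(f"SELECT MAX(updated_at) FROM {table}")
    last_update = cursor.fetchone()[0]  # assumes tz-aware datetimes
    if last_update is None:
        return False, None  # empty table: treat as stale
    age = datetime.now(timezone.utc) - last_update
    return age <= timedelta(hours=max_staleness_hours), last_update
```

If both checks pass, the alert may be a threshold-tuning issue rather than a pipeline failure; if either fails, the next stop is usually the logs of the job that feeds the table.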
Schema-change incidents notify you of any structural changes at the table level that may impact your end report or downstream dependencies.
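When such a notification arrives, one useful first step is diffing the table's current schema against the last known good state, so you can tell downstream owners exactly which columns were added, dropped, or retyped. A rough sketch, assuming a warehouse that exposes `information_schema.columns` and a driver using the `%s` paramstyle (both assumptions; adjust for your stack):

```python
def get_schema(cursor, table):
    """Snapshot column names and types from information_schema."""
    cursor.execute(
        "SELECT column_name, data_type FROM information_schema.columns "
        "WHERE table_name = %s ORDER BY column_name",
        (table,),
    )
    return dict(cursor.fetchall())

def diff_schema(previous, current):
    """Report added, dropped, and retyped columns between two snapshots."""
    shared = set(previous) & set(current)
    return {
        "added": set(current) - set(previous),
        "dropped": set(previous) - set(current),
        "retyped": {c for c in shared if previous[c] != current[c]},
    }
```

Persisting a snapshot after each successful run gives you a last-known-good state to diff against the next time an alert fires.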
I personally believe this is where a data observability tool makes the difference and drives ROI. Whether it's validating a field or checking data quality for ML models, field-level monitoring is the most critical module for companies to leverage. Debugging steps vary with the architecture and with what falls under the data team's purview.
Let me list a few possible debugging steps.
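To make one of these concrete, here is a hedged sketch of two field-level health checks using pandas and SciPy: a null-rate check, plus a two-sample Kolmogorov-Smirnov test against a baseline sample to flag distribution drift in a numeric column. The thresholds and names are illustrative, not prescriptive.

```python
import pandas as pd
from scipy.stats import ks_2samp  # two-sample Kolmogorov-Smirnov test

def field_health(df: pd.DataFrame, column: str, baseline: pd.Series,
                 max_null_rate: float = 0.01, drift_alpha: float = 0.05):
    """Return a list of field-level issues found in df[column]."""
    issues = []
    null_rate = df[column].isna().mean()
    if null_rate > max_null_rate:
        issues.append(f"null rate {null_rate:.2%} exceeds {max_null_rate:.2%}")
    # The KS test flags a shift in the column's distribution vs. baseline;
    # it only applies to numeric columns.
    _, p_value = ks_2samp(df[column].dropna(), baseline.dropna())
    if p_value < drift_alpha:
        issues.append(f"possible distribution drift (KS p-value {p_value:.4f})")
    return issues
```

A check like this run right after the alert tells you whether the anomaly is still present in the current data or was a transient blip.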
One of the most difficult modules/projects is data backfilling. "Upserting events" seems straightforward, but it isn't. There are a few critical aspects to keep in mind: ensuring events aren't duplicated, records don't leak, and actively written tables are handled with care.
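To make the duplication point concrete, here is a rough sketch of a staging-and-merge pattern that keeps a backfill idempotent: load the repaired events into a staging table, then merge on a natural key, bounded to the backfill window so live rows outside it are untouched. `MERGE` support and bind-parameter syntax vary by warehouse, and every name below is a placeholder.

```python
# Idempotent backfill merge, expressed as a Python constant so it can be
# run by whatever client your stack uses. Merging on a natural key means
# retries don't create duplicates; bounding updates to the backfill
# window keeps the merge from leaking into live rows outside the range.
MERGE_BACKFILL = """
MERGE INTO events AS target
USING staging_events AS source
  ON target.event_id = source.event_id
WHEN MATCHED AND source.event_ts BETWEEN :start_ts AND :end_ts THEN
  UPDATE SET payload = source.payload,
             updated_at = source.updated_at
WHEN NOT MATCHED THEN
  INSERT (event_id, event_ts, payload, updated_at)
  VALUES (source.event_id, source.event_ts,
          source.payload, source.updated_at);
"""
```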
Our next post will go into detail on data backfilling and the techniques a company may use based on its data stack.
I hope this post gives a high-level overview of what to do after receiving a notification from a data observability tool. I’m excited to learn more from the comments and feedback.
Interested in learning more about data observability? Reach out to me, or visit our site for more info.
P.S.: We plan to launch our product by the end of September; join the waitlist to get a 30-day free trial.