I started my career as a first-generation analyst focusing on writing SQL scripts, learning R, and publishing dashboards. As things progressed, I graduated into data science and data engineering, where my focus shifted to managing the life cycle of ML models and data pipelines. 2022 is my 16th year in the data industry, and I am still learning new ways to be productive and impactful. Today, I head a data science and data engineering function at one of the unicorns, and I would like to share my findings and where I am heading next.
When I look at the big picture, I realize that the problems most companies face are quite similar. Their vision of becoming data-driven has turned into a BHAG, a Big Hairy Audacious Goal (pronounced "bee hag").
We data folks like patterns, so here are my findings:
- In 5 out of 10 review meetings, I have witnessed people question the reliability of the data, report, or dashboard. On top of that, heads of departments (HODs) will try to convince others that their data is the most accurate or reliable :)
- A lot of the time, an HOD comes and says the data is not updated, while the data team is already working to fix the report or data table.
- A new product was launched the week before, yet we still have not figured out its performance. The data team is working on a query change and will soon update the CXO team.
- Everyone has built expertise around writing complicated ML (machine learning) models, yet very few talk about or deploy inference monitoring. Without efficient monitoring and observation, there is a high probability of model drift or performance drift in the coming weeks or months (a minimal drift check is sketched after this list).
- Very few companies deploy solutions or models to detect performance anomalies.
The list is long; I am sure you can relate to it or add more.
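To make the inference-monitoring point concrete, here is a minimal sketch of a drift check using the Population Stability Index (PSI). The feature values, bin count, and alert threshold are illustrative assumptions, not part of any particular platform.

```python
import numpy as np

def population_stability_index(baseline, current, bins=10):
    """Compare the distribution of a feature (or model score) between a
    baseline window and a current window. Higher PSI => more drift."""
    # Bin edges are derived from the baseline distribution.
    edges = np.histogram_bin_edges(baseline, bins=bins)
    base_pct = np.histogram(baseline, bins=edges)[0] / len(baseline)
    curr_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Avoid log(0) for empty bins.
    base_pct = np.clip(base_pct, 1e-6, None)
    curr_pct = np.clip(curr_pct, 1e-6, None)
    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))

# Hypothetical example: scores logged at training time vs. last week's scores.
baseline_scores = np.random.normal(0.6, 0.10, 10_000)
current_scores = np.random.normal(0.5, 0.15, 10_000)

psi = population_stability_index(baseline_scores, current_scores)
if psi > 0.2:  # 0.2 is a commonly used rule of thumb for significant drift
    print(f"Model drift suspected (PSI={psi:.2f}); investigate before it hurts KPIs")
```

Run on a schedule against recent scores, a check like this surfaces drift well before stakeholders notice a drop in business KPIs.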
In a nutshell, I found that data reliability is a BIG challenge, and there is a need for a solution that is easy to use, understand, and deploy, and that is not heavy on investment.
I am Jatin Solanki, and I am on a mission to build a solution that makes your data reliable.
What is needed to make your data more reliable?
Complexities around data infrastructure are surging as companies gear up to gain a competitive edge and deliver out-of-the-box offerings.
Every company goes through a data maturity matrix. To reach a level where you can deploy AI models or self-service analytics, you need to invest in a robust foundation.
In my opinion, the foundation begins with a reliable data source, or a defined source of truth. Your data models won't be impactful if they are fed with bad data; you know the saying: garbage in, garbage out.
At a high level, here are a few checks you can implement to ensure data reliability (a minimal sketch of several of these checks follows the list):
- Volume: ensures all the rows/events are captured or ingested.
- Freshness: the recency of the data. If your data gets updated every xx minutes, this test ensures it is updated on schedule and raises an incident if not.
- Schema change: if the schema changes or a new feature is launched, your data team needs to be aware so they can update their scripts.
- Distribution: all the events fall within an acceptable range. For example, if a critical column shouldn't contain null values, this test raises an alert for any null or missing values.
- Lineage: this is a must-have module, yet we always underplay it. Lineage gives the data team handy information about upstream and downstream dependencies.
- Reconciliation: recon, or finding the deltas between two given datasets. This can be used to understand the difference between `staging` and `production`, or between `source` and `destination`. It can also be effective for financial recon, such as matching the payment gateway to the sales table.
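To make the first few checks concrete, here is a minimal sketch that runs volume, freshness, and distribution (not-null) tests as plain SQL. It uses an in-memory SQLite table named orders purely for illustration; in practice the same queries would run against your warehouse connection, and a failed check would raise an incident instead of printing.

```python
import sqlite3
from datetime import datetime, timedelta

# --- Hypothetical demo table; in practice `conn` would point at your warehouse. ---
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, loaded_at TEXT)")
now = datetime.utcnow()
rows = [(i, 100.0 + i, (now - timedelta(minutes=i)).isoformat()) for i in range(500)]
rows.append((999, None, now.isoformat()))  # one bad row to trip the null check
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

def check_volume(conn, table, min_rows):
    """Volume: did we ingest at least the expected number of rows?"""
    count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return count >= min_rows, f"{table}: {count} rows (expected >= {min_rows})"

def check_freshness(conn, table, ts_column, max_lag_minutes):
    """Freshness: was the latest row loaded recently enough?"""
    latest = conn.execute(f"SELECT MAX({ts_column}) FROM {table}").fetchone()[0]
    lag = datetime.utcnow() - datetime.fromisoformat(latest)
    return lag <= timedelta(minutes=max_lag_minutes), f"{table}: last load {lag} ago"

def check_not_null(conn, table, column):
    """Distribution: a critical column should never contain nulls."""
    nulls = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    ).fetchone()[0]
    return nulls == 0, f"{table}.{column}: {nulls} null values"

for ok, message in [
    check_volume(conn, "orders", min_rows=100),
    check_freshness(conn, "orders", "loaded_at", max_lag_minutes=30),
    check_not_null(conn, "orders", "amount"),
]:
    # In a real setup a failed check would open an incident or page someone.
    print(("PASS " if ok else "ALERT") + " | " + message)
```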
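Reconciliation can start as a simple delta query between two datasets. The sketch below assumes hypothetical payment_gateway and sales tables that share an order_id, and reports orders that are missing on one side or disagree on the amount.

```python
import sqlite3

# Hypothetical tables for illustration: a payment gateway feed and the sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment_gateway (order_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE sales (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO payment_gateway VALUES (?, ?)",
                 [(1, 100.0), (2, 250.0), (3, 80.0)])
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(1, 100.0), (2, 240.0)])  # order 2 mismatched, order 3 missing

# Rows captured by the gateway that are missing in sales, or whose amounts disagree.
RECON_SQL = """
SELECT pg.order_id, pg.amount AS gateway_amount, s.amount AS sales_amount
FROM payment_gateway pg
LEFT JOIN sales s ON s.order_id = pg.order_id
WHERE s.order_id IS NULL OR pg.amount <> s.amount
"""

for order_id, gateway_amount, sales_amount in conn.execute(RECON_SQL):
    # Each delta is an incident candidate the data team should investigate.
    print(f"order {order_id}: gateway={gateway_amount}, sales={sales_amount}")
```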
What next? How do we implement this?
The most common question people face:
Build versus Buy
I am a big fan of open-source tech; however, for some critical modules I prefer buying an out-of-the-box solution because it is scalable and already tested in the market. Developing in-house might cost you around US$2k per month, which includes a few hours of engineers' time along with cloud costs.
If you are inclined toward buying an out-of-the-box solution, here are a few factors that should be part of your checklist.
- Should connect to popular sources with minimal configuration.
- Extract information automatically without the need for additional code.
- No-code or CLI (I leave it to you)
- Lineage and Catalog module.
- Data reconciliation along with a scheduling feature.
- Anomaly detection (a simple sketch follows this list).
- And of course, all the tests we discussed earlier, along with alerts that can tell you where to `debug`.
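As a flavour of what anomaly detection looks like in this context, the sketch below flags a day whose row count deviates sharply from the recent average using a simple z-score. The daily counts and the threshold of 3 standard deviations are illustrative assumptions; commercial platforms typically use more sophisticated, seasonality-aware models.

```python
import statistics

# Hypothetical daily row counts for a table over the last two weeks.
daily_row_counts = [10_120, 10_340, 9_980, 10_210, 10_450, 10_300, 10_150,
                    10_280, 10_390, 10_220, 10_310, 10_180, 10_260, 6_900]

history, today = daily_row_counts[:-1], daily_row_counts[-1]
mean = statistics.mean(history)
stdev = statistics.stdev(history)

# Flag today's volume if it is more than 3 standard deviations from the mean.
z_score = (today - mean) / stdev
if abs(z_score) > 3:
    print(f"Volume anomaly: {today} rows today vs ~{mean:.0f} on average (z={z_score:.1f})")
```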
A robust platform provides easy access to all incidents and also evaluates overall data health.
It should be able to automatically detect your critical data assets and apply hygiene checks to them.
It should also group related alerts instead of pushing 100+ separate ones.
Finally, the solution should help you reduce data quality incidents and make your data more reliable.
So, do I need a data observability platform?
If your answer to any of the below questions or scenarios is “Yes”, then you should procure or deploy a data observability solution right away.
- Dashboards not getting updated on a regular basis?
- Don't know which report is accurate?
- Business stakeholders are the first to learn about data incidents.
- Questions keep coming up in meetings about performance stats.
- You have at least two members on the data team.
- You have deployed a business intelligence tool.
Just as software developers have leveraged solutions such as Datadog and Dynatrace to ensure web/app uptime, data leaders should invest in data observability solutions to ensure data reliability.