In this article, I will talk about how I improved overall data processing efficiency by optimizing the choice and usage of data warehouses.
A valuable asset for anyone looking to break into the Data Engineering field is understanding the different types of data and the Data Pipeline.
Meet The Entrepreneur: Alon Lev, CEO, Qwak
2021 Noonies Nominee General Interview with Veronika. Read on for more about cloud services, data engineering, and Python.
This is the first completed webinar of our “Great Expectations 101” series. The goal of this webinar is to show you what it takes to deploy and run Great Expectations successfully.
Location-based information is what makes the field of geospatial analytics so popular today. Collecting useful data requires some unique tools, which are covered in this blog.
Learning about the best data visualisation tools may be the first step in utilising data analytics to your advantage and to the benefit of your company.
The worst nightmare of analytics managers is accidentally blowing up the data warehouse cost. How can we avoid receiving unexpectedly expensive bills?
mParticle & HackerNoon are excited to host a Growth Marketing Writing Contest. Here’s your chance to win money from a whopping $12,000 prize pool!
Is the data engineer still the "worst seat at the table?" Maxime Beauchemin, creator of Apache Airflow and Apache Superset, weighs in.
Governance is the Gordian Knot of all Your Business Problems.
See mParticle data events and attributes displayed in an eCommerce UI, and experiment with implementing an mParticle data plan yourself.
Put your organization on the path to consistent data quality by adopting these six habits of highly effective data teams.
How do you become a data leader that data engineers love?
In "Towards Open Options Chains", Chris Chow presents his solution for collecting options data: a data pipeline with Airflow, PostgreSQL, and Docker.
Processing large datasets, e.g. for cleansing, aggregation, or filtering, is blazingly fast with the Polars data frame library in Python, thanks to its design.
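To make that concrete, here is a minimal sketch of Polars' lazy expression API, assuming a recent Polars release (where group_by replaced groupby); the column names and data are invented for illustration.

```python
# A minimal sketch, assuming a recent Polars release; columns are invented.
import polars as pl

df = pl.DataFrame({
    "user": ["a", "b", "a", "c"],
    "amount": [10.0, 3.5, 7.25, 1.0],
})

# Build a lazy query plan, then execute it in parallel with collect().
result = (
    df.lazy()
      .filter(pl.col("amount") > 2)
      .group_by("user")
      .agg(pl.col("amount").sum().alias("total"))
      .collect()
)
print(result)
```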
Data trust starts and ends with communication. Here’s how best-in-class data teams are certifying tables as approved for use across their organization.
Data teams come in all different shapes and sizes. How do you build data observability into your pipeline in a way that suits your team structure? Read on.
In the previous article, I described the concept and design of the Structured Data Service in the Alluxio 2.1.0 release. This article will go through an example to demonstrate how it helps SQL and structured data workloads.
Noom helps you lose weight. We help you get a job at Noom. In today’s article, we’ll show you one of Noom’s hard SQL interview questions.
With each day, enterprises increasingly rely on data to make decisions.
In this article, we cover how to use pipeline patterns in Python data engineering projects: creating a functional pipeline, installing fastcore, and other steps.
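A hedged sketch of the functional-pipeline idea (not the article's exact code); it assumes fastcore's compose helper from fastcore.basics, which chains functions left to right. The stage functions are invented for the example.

```python
# A sketch of the functional-pipeline pattern; assumes
# fastcore.basics.compose applies functions left to right.
from fastcore.basics import compose

def strip_rows(rows): return [r.strip() for r in rows]
def split_rows(rows): return [r.split(",") for r in rows]
def to_floats(rows):  return [[float(x) for x in r] for r in rows]

pipeline = compose(strip_rows, split_rows, to_floats)
print(pipeline([" 1,2 ", "3,4"]))  # [[1.0, 2.0], [3.0, 4.0]]
```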
Applying machine learning models at scale in production can be hard. Here are the four biggest challenges data teams face and how to solve them.
Self-serve systems are a big priority for data leaders, but what exactly does that mean? And is it more trouble than it's worth?
This is not really an article, but rather some notes about how we use dbt in our team.
Standard Audiences: A product that extends the functionality of regular Audiences, one of the most flexible, powerful, and heavily leveraged tools on mParticle.
Implementing tracking code based on an outdated version of your organization's data plan can result in time-consuming debugging and dirty data pipelines.
After speaking to hundreds of teams, I discovered ~80% of data issues aren’t covered by testing alone. Here are 4 layers to building a data reliability stack.
Too lazy to scrape NLP data yourself? In this post, I’ll show you a quick way to scrape NLP datasets using YouTube and Python.
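One quick way to do this (a sketch, not necessarily the post's exact approach) is the third-party youtube-transcript-api package, which fetches a video's captions as text segments; the video ID below is a placeholder.

```python
# pip install youtube-transcript-api
from youtube_transcript_api import YouTubeTranscriptApi

video_id = "VIDEO_ID_HERE"  # placeholder: any video with captions
segments = YouTubeTranscriptApi.get_transcript(video_id)

# Each segment is a dict with 'text', 'start', and 'duration';
# join the text fields to build an NLP corpus.
corpus = " ".join(seg["text"] for seg in segments)
print(len(corpus), "characters of transcript text")
```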
Amazon AI/ML Stack
Predictive modeling in data science answers the question: “What is going to happen in the future, based on known past behaviors?”
From simplifying data collection to enabling data-driven feature development, Customer Data Platforms (CDPs) have far-reaching value for engineers.
Write efficient and flexible data pipelines in Python that generalise to changing requirements.
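One common way to get that flexibility (a sketch with my own stage names, not the article's code) is to build pipelines from small generator stages, so a requirements change is a one-line edit.

```python
# Illustrative stage names: each stage is a generator, so data
# streams through without loading everything into memory at once.
def read_lines(path):
    with open(path) as f:
        yield from f

def strip_blank(rows):
    return (r.strip() for r in rows if r.strip())

def parse_csv(rows):
    return (r.split(",") for r in rows)

def pipeline(path):
    # Adding, removing, or reordering stages is a one-line change.
    return parse_csv(strip_blank(read_lines(path)))
```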
In "Towards Open Options Chains", Chris Chow presents his solution for collecting options data: a data pipeline with Airflow, PostgreSQL, and Docker.
Learn about the impact of Airflow on data quality checks and why you should look for an alternative tool.
The 5 things every data analyst should know, and why they are not Python or SQL.
PyTorch Geometric Temporal is a deep learning library for neural spatiotemporal signal processing.
Migrating from Convox to Nomad, and some AWS performance issues we encountered along the way, thanks to Datadog.
This is a collaboration between Baolong Mao's team at JD.com and my team at Alluxio. The original article was published on Alluxio's blog. This article describes how JD built an interactive OLAP platform combining two open-source technologies: Presto and Alluxio.
In "Towards Open Options Chains", Chris Chow presents his solution for collecting options data: a data pipeline with Airflow, PostgreSQL, and Docker.
A brief description of the difference between Data Science and Data Engineering.
Here are six important steps for setting goals for data teams.
Best practices for building a data team at a hypergrowth startup, from hiring your first data engineer to IPO.
This blog post is a refresh of a talk that James and I gave at Strata back in 2017. Why recap a 3-year-old conference talk? Well, the core ideas have aged well, we’ve never actually put them into writing before, and we’ve learned some new things in the meantime. Enjoy!
See how to leverage the Airflow ShortCircuitOperator to create data circuit breakers to prevent bad data from reaching your data pipelines.
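For readers unfamiliar with the operator, here is a minimal sketch of the circuit-breaker idea, assuming Airflow 2.x; the DAG and task names are invented. When the check callable returns False, all downstream tasks are skipped.

```python
# A minimal sketch, assuming Airflow 2.x; DAG and task names are invented.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator, ShortCircuitOperator

def data_is_fresh():
    # Replace with a real freshness/quality check; returning False
    # short-circuits the DAG and skips all downstream tasks.
    return True

with DAG("quality_circuit_breaker", start_date=datetime(2022, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    check = ShortCircuitOperator(task_id="check_data_quality",
                                 python_callable=data_is_fresh)
    load = PythonOperator(task_id="load_to_warehouse",
                          python_callable=lambda: print("loading..."))
    check >> load
```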
In "Towards Open Options Chains", Chris Chow presents his solution for collecting options data: a data pipeline with Airflow, PostgreSQL, and Docker.
Migrating Presto workloads from a fully on-premise environment to cloud infrastructure has numerous benefits, including alleviating resource contention and reducing costs by paying for computation resources on an on-demand basis. In the case of Presto running on data stored in HDFS, the separation of compute in the cloud and storage on-premises is apparent since Presto’s architecture enables the storage and compute components to operate independently. The critical issue in this hybrid environment of Presto in the cloud retrieving HDFS data from an on-premise environment is the network latency between the two clusters.
Find out how to set up and work locally with the most granular demographics dataset that is out there.
I've worked on teams building ML-powered product features, everything from personalization to propensity paywalls. Some days my time was consumed by meetings to find and get access to data; other days it went to building ETLs to fetch and clean that data. The worst situations were when I had to deal with existing microservice-oriented architectures. I wouldn't advocate that we stop using microservices, but if you want to fit an ML project into an already in-place, strict microservice-oriented architecture, you're doomed.
This case study describes how we built a custom library that combines data housed in disparate sources to acquire the insights we needed.
In this article, we’ll investigate use cases for which data engineers may need to interact with NoSQL databases, as well as the pros and cons.
See how a hybrid architecture marries the best of the SaaS world and on-prem world for modern data stack software.
As the third largest e-commerce site in China, Vipshop processes large amounts of data collected daily to generate targeted advertisements for its consumers. In this article, guest author Gang Deng from Vipshop describes how to meet SLAs by improving struggling Spark jobs on HDFS by up to 30x, and optimize hot data access with Alluxio to create a reliable and stable computation pipeline for e-commerce targeted advertising.
How I learned to stop using pandas and love SQL.
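In that spirit, a hedged sketch of the idea: let the database do the aggregation and hand Python only the small result set. The table and columns are hypothetical, using the stdlib sqlite3 driver.

```python
import sqlite3

# Hypothetical data, so the example is self-contained and runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("eu", 10.0), ("us", 4.0), ("eu", 2.5)])

# The database does the heavy lifting instead of a pandas groupby.
query = """
    SELECT region, SUM(amount) AS total
    FROM orders
    GROUP BY region
    ORDER BY total DESC
"""
for region, total in conn.execute(query):
    print(region, total)
```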
Metabase is a business intelligence tool for your organisation that plugs into various data sources so you can explore data and build dashboards. I'll aim to provide a series of articles on provisioning and building this out for your organisation. This article is about getting up and running quickly.
Data lakes are an essential component in building any future-proof data platform. In this article, we round up 7 reasons why you need a data lake.
How to detect, capture, and propagate changes in source databases to target systems in a real-time, event-driven manner with Change Data Capture (CDC).
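As a rough sketch of the pattern (assuming a Debezium-style connector publishing row changes to Kafka; the topic, fields, and handlers below are illustrative, not from the article):

```python
# pip install kafka-python
import json
from kafka import KafkaConsumer

def apply_upsert(row): print("upsert into target:", row)  # stand-in writer
def apply_delete(row): print("delete from target:", row)

consumer = KafkaConsumer(
    "dbserver1.public.orders",            # hypothetical CDC topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    event = message.value
    op = event.get("op")                  # "c"=create, "u"=update, "d"=delete
    if op in ("c", "u"):
        apply_upsert(event["after"])
    elif op == "d":
        apply_delete(event["before"])
```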
This article covers 7 data engineering gotchas in an ML project. The list is sorted in descending order based on the number of times I've encountered each one.
HarperDB is more than just a database, and for certain users or projects, HarperDB is not serving as a database at all. How can this be possible?
Delight is an open-source, cross-platform monitoring dashboard for Apache Spark, with memory & CPU metrics complementing the Spark UI and Spark History Server.
It doesn’t matter if you are running background tasks, preprocessing jobs, or ML pipelines. Writing tasks is the easy part. The hard part is the orchestration: managing dependencies among tasks, scheduling workflows, and monitoring their execution is tedious.
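To see why orchestration is the hard part, here is a toy sketch: even the smallest runner needs dependency resolution, which the stdlib graphlib module provides; real orchestrators add scheduling, retries, and monitoring on top. The tasks are invented for illustration.

```python
# A toy task runner: stdlib topological sort over a tiny DAG.
from graphlib import TopologicalSorter

def extract():   print("extract")
def transform(): print("transform")
def load():      print("load")

# Map each task to the set of tasks it depends on.
dag = {transform: {extract}, load: {transform}}

for task in TopologicalSorter(dag).static_order():
    task()  # real orchestrators add scheduling, retries, monitoring here
```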
Do we need a radical new approach to data warehouse technology? An immutable data warehouse starts with the data consumer SLAs and pipes data in pre-modeled.
This article presents the collaboration of Alibaba, Alluxio, and Nanjing University in tackling the problem of Deep Learning model training in the cloud. Various performance bottlenecks are analyzed with detailed optimizations of each component in the architecture. This content was previously published on Alluxio's Engineering Blog, featuring Alibaba Cloud Container Service Team's case study (White Paper here). Our goal was to reduce the cost and complexity of data access for Deep Learning training in a hybrid environment, which resulted in over 40% reduction in training time and cost.
Tiered Locality is a feature led by my colleague Andrew Audibert at Alluxio. This article dives into the details of how tiered locality helps provide optimized performance and lower costs. The original article was published on Alluxio’s engineering blog.
The benefits that come with using Docker containers are well known: they provide consistent and isolated environments so that applications can be deployed anywhere - locally, in dev / testing / prod environments, across all cloud providers, and on-premise - in a repeatable way.
Ask anyone in the data industry what’s hot and chances are “data mesh” will rise to the top of the list. But what is a data mesh and is it right for you?
Bridging the gap between application developers and data scientists, demand for data engineers rose by up to 50% in 2020, largely due to increased investment in AI-based SaaS products.
Since the big bang in the data technology landscape a decade and a half ago, which gave rise to technologies like Hadoop that cater to the four ‘V’s (volume, variety, velocity, and veracity), there has been an uptick in the use of databases with specialized capabilities to cater to different types of data and usage patterns. You can now see companies using graph databases, time-series databases, document databases, and others for different customer and internal workloads.
This article introduces Structured Data Management (Developer Preview) available in the latest Alluxio 2.1.0 release, a new effort to provide further benefits to SQL and structured data workloads using Alluxio. The original concept was discussed on Alluxio’s engineering blog. This article is part one of the two articles on the Structured Data Management feature my team worked on.
Today, I am going to cover why I consider data science a team sport.
In this first post in our 2-part ML Ops series, we are going to look at ML Ops and highlight how and why data quality is key to ML Ops workflows.
In this listicle, you'll find some of the best data engineering courses, and career paths that can help you jumpstart your data engineering journey!
The art of building a large catalog of connectors is thinking in onion layers.
What's Deep Data Observability, and how is it different from Shallow Data Observability?
In this post, I discuss the nested loop, hash join, and merge join algorithms in Python.
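For context, hedged mini-versions of two of those algorithms on (key, value) rows; the post's actual implementations may differ.

```python
def nested_loop_join(left, right):
    # O(len(left) * len(right)): compare every pair of rows.
    return [(lk, lv, rv) for lk, lv in left
                         for rk, rv in right if lk == rk]

def hash_join(left, right):
    # O(len(left) + len(right)): build a hash table on one side,
    # then probe it with the other.
    table = {}
    for rk, rv in right:
        table.setdefault(rk, []).append(rv)
    return [(lk, lv, rv) for lk, lv in left for rv in table.get(lk, ())]

left  = [(1, "a"), (2, "b")]
right = [(1, "x"), (1, "y"), (3, "z")]
assert nested_loop_join(left, right) == hash_join(left, right)
```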
In this blog, guest writer Derek Tan, Executive Director of Infra & Simulation at WeRide, describes how engineers leverage Alluxio as a hybrid cloud data gateway for applications on-premises to access public cloud storage like AWS S3.
Writing ML code as pipelines from the get-go reduces technical debt and increases the velocity of getting ML into production.
Goldman Will Dominate Consumer Banking
When it comes to Big Data infrastructure on Google Cloud Platform, the most popular choices data architects need to consider today are Google BigQuery, a serverless, highly scalable, and cost-effective cloud data warehouse; Apache Beam-based Cloud Dataflow; and Dataproc, a fully managed cloud service for running Apache Spark and Apache Hadoop clusters in a simpler, more cost-efficient way.
Why we chose to finally buy a unified data workspace (Atlan), after spending 1.5 years building our own internal solution with Amundsen and Atlas
Sometimes, we might not be able to afford a paid subscription on Slack. Here's a tutorial on how you can save and search through your Slack history for free.
A multi-part series that will take you from beginner to expert in Delta Lake.
This tutorial shows how the Alibaba Cloud Container team runs PyTorch on HDFS using Alluxio in a Kubernetes environment. The original Chinese article was published on Alibaba Cloud's engineering blog, then translated and published on Alluxio's Engineering Blog.
handoff is a serverless data pipeline orchestration framework that simplifies the process of deploying ETL/ELT tasks to AWS Fargate.
Is Astronomy data science?
Influenza Vaccines and Data Science in Biology
Congratulations, you’ve successfully implemented data testing in your pipeline!
Maximizing efficiency is about knowing how the data science puzzles fit together and then executing them.
This blog covers real-world use cases of businesses embracing the machine learning and data engineering revolution to optimize their marketing efforts.
This post explains what a data connector is and provides a framework for building connectors that replicate data from different sources into your data warehouse.
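The framework idea in miniature (a sketch with invented names, not the post's actual interface): every connector implements one small read contract, so the replication loop stays source-agnostic.

```python
from abc import ABC, abstractmethod
from typing import Iterator, Optional

class Connector(ABC):
    @abstractmethod
    def read(self, since: Optional[str] = None) -> Iterator[dict]:
        """Yield records changed since the given cursor."""

class PostgresConnector(Connector):  # illustrative stub
    def read(self, since=None):
        yield {"id": 1, "updated_at": "2022-01-01"}

def replicate(source: Connector, write_batch, cursor=None):
    # Generic loop: pull from any source, push to a warehouse writer.
    write_batch(list(source.read(cursor)))

replicate(PostgresConnector(), write_batch=print)
```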
Learn how to build an n8n workflow that processes text, stores data in two databases, and sends messages to Slack.
An overview of the modern data stack after interviewing 200+ data leaders, with a decision matrix for benchmarking (DW, ETL, Governance, Visualisation, Documentation, etc.).
Data Version Control (DVC) is a data-focused version of Git. In fact, it’s almost exactly like Git in terms of features and workflows associated with it.
Data augmentation is a technique practitioners use to expand a dataset by creating modified versions of existing data.
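A tiny hedged example of the idea with images stored as NumPy arrays; real pipelines use richer transforms (crops, noise, color jitter).

```python
import numpy as np

def augment(image: np.ndarray) -> list:
    # Derive new training examples from one existing example.
    return [
        image,             # original
        np.fliplr(image),  # horizontal flip
        np.flipud(image),  # vertical flip
        np.rot90(image),   # 90-degree rotation
    ]

img = np.arange(9).reshape(3, 3)
dataset = augment(img)     # 4 examples from 1
```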