The art of building a large catalog of connectors is thinking in onion layers.
We’re building an open-source data integration platform at Airbyte. We launched our MVP about a month ago. We were thrilled by the amount of feedback and support we got from the community. We even got our first big pull request from a contributor this week (2,000+ lines of code). But during this full month, we didn’t release any new connectors. You might wonder why we didn’t build on that momentum. If people were excited about our MVP even though it had only 6 connectors, you might think we should have ramped up the number of connectors as fast as possible. We didn’t do that for two very important and differentiating reasons.
First, we were defining exactly what the best data protocol would be if we wanted to solve data integration once and for all, for every company. You can learn more about our specification here. Even though it’s not final yet, it gives you a glimpse of our vision for the future.
Second, and just as important, we were building a real manufacturing plant for data integration connectors. See, our team led data integration at LiveRamp, which has more than 1,000 data ingestion connectors and 1,000+ distribution connectors. So we have experience abstracting what can be abstracted and simplifying the manufacturing of new integrations (very often without code). We haven’t fully built our manufacturing plant, but engineers can already add one new connector every day.
This article describes how we built this connector manufacturing plant.
When building a large catalog of connectors, there are several things that you need to think through.
Initial build
This is when you start from a blank page. This step usually requires a little bit of planning since it involves communication with external teams/companies.
The initial build step involves:
Tests
Tests are essential to make sure that any code or protocol change won’t affect the connectors. They need to run before every merge.
They also ensure that the connector behaves as you expect. For that, you need to run your connector against the actual production service. For example, if you’re working on the Salesforce connector, you must make sure that Salesforce actually behaves the way you expect. It is not unusual for the documentation of an API or service to differ from how it really behaves.
We currently have the foundation of our test framework; it allows developers to focus solely on providing inputs and outputs, and the rest is taken care of by the framework.
These tests give us 90% certainty that the connector is fully functional. If there are edge cases, it is always possible to add more custom tests.
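To make the "inputs and outputs" idea concrete, here is a minimal sketch of what such a test harness could look like. It is not Airbyte's actual framework: `ConnectorTestCase`, `run_test_case`, and `fake_read` are hypothetical names, and the connector is assumed to expose a single `read(config)` function that yields records.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]


@dataclass
class ConnectorTestCase:
    """What the connector developer provides: an input config and the expected output."""
    name: str
    config: Dict[str, object]
    expected_records: List[Record]


def run_test_case(
    read: Callable[[Dict[str, object]], Iterable[Record]],
    case: ConnectorTestCase,
) -> bool:
    """The generic part: run the connector and compare its output to the expectation."""
    actual = list(read(case.config))
    return actual == case.expected_records


# Hypothetical stand-in for a real connector's read function.
def fake_read(config: Dict[str, object]) -> Iterable[Record]:
    for i in range(int(config["count"])):
        yield {"id": i}


case = ConnectorTestCase(
    name="three rows",
    config={"count": 3},
    expected_records=[{"id": 0}, {"id": 1}, {"id": 2}],
)
passed = run_test_case(fake_read, case)
```

In a real setup, `read` would call the production service rather than a stub, which is what lets the same test catch gaps between the documentation and the service's real behavior.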
Liveliness & Change detection
It is essential to ensure that the source or destination continues to behave the way it did when the connector was written during the initial build phase. For monitoring purposes, we also need to check that the source or destination is still alive.
These verifications must run on a regular cadence, and any failure needs to be investigated and fixed, which leads to the maintenance phase.
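One possible way to implement such a check is to record a fingerprint of the source's schema at build time and compare it on every scheduled run. The sketch below assumes this approach. `schema_fingerprint`, `check_source`, and the `fetch_schema` callable are hypothetical and not part of Airbyte.

```python
import hashlib
import json
from typing import Callable, Dict


def schema_fingerprint(schema: Dict[str, object]) -> str:
    """Hash a schema in a key-order-independent way."""
    canonical = json.dumps(schema, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_source(
    fetch_schema: Callable[[], Dict[str, object]],
    expected_fingerprint: str,
) -> str:
    """Return 'dead', 'changed', or 'ok' for one scheduled verification run."""
    try:
        schema = fetch_schema()
    except Exception:
        # The source did not answer at all: liveliness failure.
        return "dead"
    if schema_fingerprint(schema) != expected_fingerprint:
        # The source answered, but not the way it did at build time.
        return "changed"
    return "ok"


# Fingerprint recorded during the initial build (illustrative schema).
baseline = schema_fingerprint({"id": "integer", "email": "string"})
```

A scheduler would call `check_source` at the chosen cadence and open a maintenance task for anything other than `"ok"`.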
Maintenance
We need to define how we are going to update the connector, push changes and propagate the changes to all the running instances of Airbyte.
Segmenting cattle code
To make a parallel with the pet/cattle concept that is well known in DevOps/Infrastructure, a connector is cattle code, and you want to spend as little time on it as possible. Anything you can do to prevent yourself from doing work in the future, you need to do. This will accelerate your production tremendously.
Abstractions as onion layers
Maximizing high-leverage work leads you to build your architecture with an onion-esque structure:
The center defines the lowest level of the API. Implementing a connector at that level requires a lot of engineering time, but it is your escape hatch for very complex connectors where you need a lot of control.
Then, you build new layers of abstraction that help tackle families of connectors very quickly.
Today, we’ve built one of these abstractions to support existing Singer integration. Building an integration leveraging Singer takes us less than 3 hours, and our goal is to bring it down to less than 10 minutes.
We have the same ambition for every other family of sources and destinations.
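The layering can be illustrated with a small sketch. It assumes a hypothetical family of sources that all share cursor-based pagination. `Source`, `PaginatedSource`, and `InMemorySource` are made-up names for illustration, not Airbyte's API.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List, Optional, Tuple

Record = Dict[str, object]
Page = Tuple[List[Record], Optional[str]]


class Source(ABC):
    """Center of the onion: the lowest-level API, with full control and full cost."""

    @abstractmethod
    def read(self, config: Dict[str, object]) -> Iterator[Record]:
        ...


class PaginatedSource(Source):
    """An outer layer: handles the pagination loop for a whole family of sources."""

    @abstractmethod
    def fetch_page(self, config: Dict[str, object], cursor: Optional[str]) -> Page:
        """Return the records of one page and the next cursor (None when done)."""

    def read(self, config: Dict[str, object]) -> Iterator[Record]:
        cursor: Optional[str] = None
        while True:
            records, cursor = self.fetch_page(config, cursor)
            yield from records
            if cursor is None:
                break


class InMemorySource(PaginatedSource):
    """A connector built on the outer layer: only the source-specific part remains."""

    PAGES: Dict[Optional[str], Page] = {
        None: ([{"id": 1}, {"id": 2}], "p2"),
        "p2": ([{"id": 3}], None),
    }

    def fetch_page(self, config: Dict[str, object], cursor: Optional[str]) -> Page:
        return self.PAGES[cursor]


records = list(InMemorySource().read({}))
```

A very complex connector can still subclass `Source` directly, while most connectors in the family only implement `fetch_page`.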
As we continue to improve our manufacturing plant for connectors, we will build tools that will allow us to handle 95% of integrations with no or very little code.
This is how we are going to address the long tail of integrations and how we’re going to make integrations a commodity.
We’ve built the following:
We want to reach a rate of 5 connectors per day and accelerate even beyond that.
We also want to provide the community with more tools to build and contribute their own connectors. Ideally, 95% of connectors can be added to Airbyte with no code.
We hope this gives you a better understanding of what we’ve been up to and what our real ambitions are. If you see any ways to improve this architecture, we’re all ears. Don’t hesitate to join our Slack to discuss any questions or suggestions with the team.
Previously published at https://airbyte.io/articles/data-engineering-thoughts/how-to-build-thousands-of-connectors/