
Open-Sourcing Code from a Private Monorepo

by Joe McKenney, May 9th, 2023


This post is about open-sourcing modules from a private monorepo. It’s a hard problem with a decent number of gotchas. We want to share our experience wiring this up to help others who venture down the same path.

How we structure our monorepo

A bit of context about Dopt before we dive in: Dopt is a web application for designing user-state machines on a canvas, paired with APIs and SDKs for utilizing those machines at runtime. The idea is that you can instantiate these machines per user of your product.


We let you progress the user through the machine and handle the persistence of the user’s state in each machine for you. You can iterate on and version your machines, and we’ll handle migrating your users across machines’ versions. This should be deep enough to contextualize any Dopt-specific bits in this article (but if you’re interested in diving deeper, you can check out our docs).


We develop in a monorepo at Dopt. Our monorepo is home to the apps that live on dopt.com and the packages/services they share. It’s the source of truth for all things Dopt.


We use pnpm and pnpm workspaces at Dopt. The pnpm-workspace.yaml defines the root of our workspace and allows us to define/constrain where packages can live in the monorepo.


It looks something like this:

packages:
  - "apps/**/*"
  - "packages/**/*"
  - "apis/**/*"
  - "services/**/*"


Our monorepo's structure is app-centric, i.e., we are building products as opposed to reusable libraries.


This is reflected in our folder structure:

├── apps          (apps that live on dopt.com)
├── services      (internal services used by app(s) and API(s))
├── apis          (public APIs hosted on dopt subdomains)
└── packages      (packages shared by apps, services, and APIs)


Child directories in these folders correspond to package scopes, e.g.,

├── apps
│   ├── @app
│   ├── @www
│   ├── @blog
│   └── @docs
├── services
│   ├── @gateway
│   └── @transitions
├── apis
│   ├── @users
│   └── @blocks
└── packages
    ├── @bodo
    └── @dopt


and their children correspond to packages themselves

├── apps
│   ├── @app
│   ├─────── client
│   ├─────── server
│   ├─────── database
│   ├── @www
│   ├─────── app
│   ├── @blog
│   ├─────── app
│   ├── @docs
│   └─────── app
├── services
│   ├── @gateway
│   ├─────── service
│   ├─────── definition
│   ├── @transitions
│   ├─────── service
│   └─────── definition
├── apis
│   ├── @users
│   ├─────── service
│   ├─────── definition
│   ├── @blocks
│   ├─────── service
│   └─────── definition
└── packages
    ├── @bodo
    ├─────── alert
    ├─────── box
    ├─────── ...
    ├── @dopt
    ├─────── app-client-sdk
    ├─────── app-middleware
    ├─────── block-client-sdk
    └─────── ...


Scopes in the apps directory correspond directly to apps on subdomains of dopt.com:

└─── apps
   ├── @app         (app.dopt.com)
   ├── @www         (www.dopt.com)
   ├── @blog        (blog.dopt.com)
   └── @docs        (docs.dopt.com)


Scopes in the apis directory correspond to public APIs hosted on subdomains of dopt.com:

└─── apis
   ├── @blocks        (blocks.dopt.com)
   └── @users         (users.dopt.com)


Problem: open-sourcing from a private monorepo means filtering and syncing commits

The apis directory is home to services that power our public APIs. These are the APIs folks use to drive our user-state machines.


There are really three ways to use said APIs:


  1. Directly (e.g., curl, fetch, or some other language-specific tool for making HTTP requests)
  2. API Clients (i.e., generated language-specific abstractions for talking to our REST API)
  3. SDKs (i.e., framework-specific libraries that go beyond basic REST requests, e.g., they create socket connections).


For API clients and SDKs, we need to get that code into folks’ hands. That means publishing already built versions of those packages to language-specific registries (e.g., npm, pip, etc.) for usage in their codebase and open-sourcing the raw, uncompiled source so folks can peruse, build, debug, contribute, and file issues.


Packages in a monorepo can depend on other packages in the same monorepo. This is the core benefit of a monorepo, i.e., you can easily create reusable, shareable code by extracting things out into their own modules.
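
For example, here's a sketch of what such a dependency looks like in a package.json; the @dopt/shared-types package is hypothetical, but the workspace: protocol is how pnpm wires up in-repo dependencies.

{
  "name": "@dopt/block-client-sdk",
  "dependencies": {
    "@dopt/shared-types": "workspace:*"
  }
}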


Open-sourcing a package from a monorepo means open-sourcing everything that the package depends on as well. This concept is likely intuitive, but it’s worth pointing out a few implications of this that are perhaps less intuitive.


  • The open-source repository will itself need to be a monorepo.
  • Any change to an open-source package OR one of the packages it depends on (directly or indirectly) requires code to be synced.
  • Uni-directional (e.g., private to public) syncing is straightforward. Bi-directional syncing (e.g., private to public and public to private) adds some cognitive overhead.


I started to talk about “syncing” above. To expand a bit, given that we use git as our version control system, we will be syncing commits.


This is the problem within the problem, i.e., how do you safely sync commits between two repositories?


The rest of this post will explore the required steps for us to make this happen.

  • Coming up with a convention for configuring and statically identifying open-source packages
  • Identifying an open-source package's dependencies (direct and indirect)
  • Syncing commits between repositories
  • Automating the process

Marking packages to be open-sourced

Given that we use pnpm's workspace concept, our monorepo's packages are all node modules, independent of whether their source code is written in JavaScript. Said another way, every package in our monorepo has a package.json, the dependencies of which define our workspace's topology.


The schema/type definition for a package.json file is quite flexible, allowing us to add arbitrary fields to the top-level JSON.


To mark packages as open source, we introduced an optional boolean field named openSource.

// `PackageJson` could come from a typings package like `type-fest` (an assumption)
import type { PackageJson } from 'type-fest';

export interface DoptPackageJson extends PackageJson {
  openSource?: boolean;
}


You can see the usage of this field in our JavaScript Block API client.

{
  "name": "@dopt/blocks-javascript-client",
  "version": "1.0.0",
  "private": false,
  "description": "A generated JavaScript API client for Dopt's blocks API",
  ...
  "openSource": true // <= new field configuring this package as open source
}


Having marked packages, we need a way to identify them programmatically. Whether you use pnpm, yarn, or npm to configure workspaces, they all offer some tooling, albeit primitive, for listing packages in the workspace.
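
For instance, pnpm can already dump the workspace's packages as JSON. A minimal sketch of that primitive, which the wrapper below builds on:

import { execSync } from 'node:child_process';

// `pnpm ls -r --depth -1 --json` lists every workspace package
// (name, version, path, ...) without descending into dependencies
const workspacePackages = JSON.parse(
  execSync('pnpm ls -r --depth -1 --json').toString()
);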


We ended up wrapping pnpm’s functionality in a package called @dopt/wutils, which is conveniently open-sourced from our private monorepo. The code to locate open-source packages in the monorepo looks something like this.


import fs from 'node:fs';
import path from 'node:path';

import { getPackageLocationsSync } from '@dopt/wutils';

// Read each workspace package's package.json and keep
// only those marked with `openSource: true`
export const openSourcePackages = getPackageLocationsSync()
  .map((workspacePath) =>
    JSON.parse(
      fs.readFileSync(
        path.resolve(process.cwd(), `${workspacePath}/package.json`),
        { encoding: 'utf8' }
      )
    )
  )
  .filter((pkg) => pkg.openSource);
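
Note that the package locations are resolved against process.cwd(), so this snippet assumes it's run from the root of the workspace.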


Static analysis: filtering to open source packages and their dependencies

Now that we’ve marked open source packages as such and come up with a way of filtering down the monorepo to that set of packages, we should be able to script a solution for identifying a package's dependencies (both direct and indirect).


We have two options for tools that can help us with this part: pnpm or Turborepo. We use pnpm as our package manager and turbo as our monorepo build tool. Since both solve monorepo-related problems, they both offer tools for filtering the monorepo and doing static analysis related to dependencies.


Turbo's Filter API design is heavily inspired by pnpm’s. In this case, we ended up using pnpm, but either would have worked just fine.


Below is an extension of the example above that illustrates how to use pnpm’s ... filtering syntax to select a package and its dependencies (direct and indirect).


import { execSync } from 'node:child_process';

// The open source packages from the previous example
// (the module path here is hypothetical)
import { openSourcePackages } from './open-source-packages.mjs';

// Use the open source packages to template pnpm's `...` filter syntax,
// which selects each package plus everything it depends on
const templatedPnpmFilter = `pnpm ls -r --depth -1 ${openSourcePackages
  .map(({ name }) => ` --filter ${name}... `)
  .join('')} --json`;

// Execute the templated pnpm command and parse its JSON output
const targetPackages = JSON.parse(execSync(templatedPnpmFilter).toString());
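
The resulting targetPackages array has one entry per selected package; pnpm's --json output includes each package's name, version, and path, and those names and paths are exactly what we need to drive the git history filtering below.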


Git 🪄: filtering and syncing commits

We have open-source packages and their dependencies. The next step is to identify the changes that impacted those packages. In git terms, “changes” are commits, so we'll want to operate on the git commit history.


Operations like this are aptly termed “history rewrites,” and there are a few tools meant to help with them, given how complicated and dangerous they can become: git filter-branch and git filter-repo.


The former warns against its usage and suggests using the latter instead, which is exactly what we did.


Our goal with this tool will be to extract the git history for each path associated with the open-source packages and their dependencies. Building from the code example in the previous section, we can leverage the collection of targetPackages (each of which contains the path to a package) to form the git filter-repo query.


import { execSync } from 'node:child_process';

import { getPackagesSync } from '@dopt/wutils';

type PackageDef = {
  name: string;
  path: string;
};

// Get all the packages in the monorepo
const packages: PackageDef[] = getPackagesSync();

// Collect the names of the target packages identified above
const targetPackageNames = targetPackages.map(({ name }) => name);

// Filter all packages down to the target packages and form the query
const templatedGitFilterRepoQuery = `git filter-repo ${packages
  .filter(({ name }) => targetPackageNames.includes(name))
  .map(({ path }) => `--path ${path}`)
  .join(' ')}`;

// Execute the query, rewriting the git history on the
// current branch down to only the desired commits.
execSync(templatedGitFilterRepoQuery);
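
With the example tree above, the templated query might expand to something like git filter-repo --path packages/@dopt/app-client-sdk --path packages/@dopt/block-client-sdk ..., i.e., one --path flag per target package.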


This all happened on a clean clone of our private monorepo, and we now have a history that has been filtered down to what I’ll call open-source commits.


How do we sync these commits over to our other repository?

# Create a fresh clone of the private monorepo
git clone git@github.com:<org>/<private_monorepo>.git
cd <private_monorepo>;

# Run the git commit filter script
node ./filter-to-open-source-commits.mjs

# Move up a directory so the repos are siblings
cd ..;

# Create a fresh clone of the open source monorepo
git clone git@github.com:<org>/<opensource_monorepo>.git
cd <opensource_monorepo>;

git checkout -b sync-commits-from-private-monorepo;
git remote add source ../<private_monorepo>;
# Replace "main" with the branch name in the private monorepo
git pull source main --allow-unrelated-histories --strategy-option theirs --no-edit;

# Finally, set your remote and push!
git remote set-url --push origin https://github.com/<org>/<opensource_monorepo>.git
git push --set-upstream origin sync-commits-from-private-monorepo;
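
Two flags do the heavy lifting here: --allow-unrelated-histories is required because the rewritten history produced by git filter-repo shares no common ancestor with the open-source repository's history, and --strategy-option theirs resolves any conflicts in favor of the incoming commits from the private monorepo.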


Automating this workflow

With everything working manually, our next goal is to “set and forget” this workflow/pipeline. We solved this in CI/CD, creating a GitHub action that is invoked on any merge to the main branch of our private monorepo.


Check out the code for the GitHub Action below. Additionally, here’s a link to one of the pull requests this action created in the open-source repository.

# Example CI/CD
name: Sync OSS Packages
on:
  push:
    branches:
      - main
jobs:
  sync:
    runs-on: ubuntu-latest
    steps:
      - name: Check out the monorepo
        uses: actions/checkout@v3
        with:
          repository: dopt/monorepo
          path: ./monorepo

      - name: Check out the open monorepo
        uses: actions/checkout@v3
        with:
          repository: dopt/odopt
          path: ./odopt

      - name: Filter the open-source packages
        working-directory: monorepo/
        run: node ./filter.js;

      - name: Sync commits
        working-directory: odopt/
        run: |
          git checkout -b sync/${{ github.run_id }};
          git remote add source ../monorepo;
          git pull source main --allow-unrelated-histories --strategy-option theirs --no-edit;

      - name: Push commits
        working-directory: odopt/
        run: |
          git remote set-url --push origin https://github.com/dopt/odopt.git
          git push --set-upstream origin sync/${{ github.run_id }};
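
One practical note: the checkout steps above are abbreviated, and checking out and pushing to a second repository typically requires a personal access token with access to both repositories, since the default GITHUB_TOKEN is scoped to the repository running the workflow.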


Learnings

First and foremost, we learned that you can have your cake and eat it too, i.e., enjoy all the benefits of a monorepo while still open-sourcing parts of it.


It definitely requires a decent bit of setup and cognitive overhead to understand the problems you need to solve, but by following a similar pattern you can also share pieces of your monorepo with the community with little continued maintenance. We’ve not touched this pipeline since putting it in place!


Lastly, our use case was primarily open-sourcing API clients and SDKs, but having this pipeline has promoted a healthy habit of open-sourcing packages/modules that we think the community would benefit from, e.g.:


  • please (GitHub, npm)

  • mercator (GitHub, npm)

  • wutils (GitHub)


Check out odopt to see what else we are building and sharing!


