Filesystems, version control, and the blockchain
This is the fourth, and possibly final, article in my series on the blockchain without blockchain.
I've always had a warm place in my heart for filesystems.
I taught myself shell scripting while automating the installation of Disksuite, Sun's free but sadistic disk mirroring software. I barely recall the actual work, instead remembering a hallway. I undertook a literal journey to learn programming, a repeated pilgrimage to the desk of a friend who took visible pleasure in explaining to me what I was doing wrong.1 It's fair to say that if filesystems were less painful in the 90s, I would not be where I am today.
When Sun started advertising ZFS as the (finally!) successor to Disksuite and UFS, the filesystem it was built around, most of its functionality seemed obviously good: make the computers manage the disks, don't demand people know up front how big a filesystem should be, don't fail miserably when the server crashes, little things like that. But what was this data integrity thing? I'm embarrassed to say it took me a while to realize I needed it (who really cares if your filesystem is good at storing data, amirite?) and even longer to understand how it worked.
To explain it, I'm going to have to teach you cryptography. Just a little. You're welcome to skip ahead if you've already got this part covered, but I expect most could use a little, ah, refresher. Step 1 in cryptography guides is usually: "Get a master's in mathematics from MIT." I'm hoping to do a bit better than that. Cryptography really is just a form of math, and while we can't all understand the details (I certainly don't) we can at least understand the "algorithms happen here" flow diagrams2.
Cryptography is most famous for its privacy utility: You use it to ensure you and only you can read your files and chat messages. It gets more complex once we need to read them on all of our different devices, but most of it is pretty similar in concept. Even more useful is ensuring both you and I can read some text, but no one else can. It's more complex, but is essentially an extension of that first use.
Privacy is not the only use case for cryptography. It's also useful for efficient validation. That is, it can be used to see if a file you have today is the same one you had yesterday. I sent you a document, you think it looks wrong; how do we make sure it did not get changed somehow in transit?
Obviously one way to do that is to just send it again. This is not a great solution, because if you did not trust it the first time, why would you trust it the second? That might also be a bad idea if bandwidth is expensive. You generally want a verification mechanism that takes less space than the original file, and less CPU power than directly comparing the two files.
Cryptography provides just such a capability, usually called a "hash function". It's an algorithm that converts, say, a large text file into a much shorter string. If you want to ensure the file is not changed in some way, just run it again and compare the output. The short strings are easier to compare than the long documents, and you could even read them over the phone to someone so they can check the file on their end. These algorithms generally produce a string of a fixed length, regardless of input, which makes them efficient for long term storage and comparison, and safe to run on any size file. Here's an example hash from my files:
03f39f4bfad04f6f2cfe09ced161ab740094905c
As you can see, it's just a long string of gibberish. It's only useful for comparison, not meaningful in its own right.
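If you want to try this yourself, Python's standard hashlib produces exactly this kind of digest. A minimal sketch (the file name is hypothetical; the comparison value is just the example above):

```python
import hashlib

def sha1_of_file(path: str) -> str:
    """Return the SHA-1 digest of a file as a 40-character hex string."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # Read in chunks so even huge files never have to fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Illustrative check: does the file you received match the one I sent?
if sha1_of_file("contract.pdf") == "03f39f4bfad04f6f2cfe09ced161ab740094905c":
    print("Same file, byte for byte.")
```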
What's critical about these algorithms is that given a unique input they always provide a unique output. If you and I each have a file that hashes to a given string, then we can be confident we have exactly the same file. Of course, this can't literally be true: We could design a hash function that only had 256 possible outputs, and there are obviously more than 256 possible inputs. This would produce a lot of what are called collisions, when two files hash to the same output, and, ah, is not terribly useful.
All of the modern hash functions produce incredibly long outputs, so a collision is possible in theory but not in practice. You'd need to execute the function about 2¹²⁸ times to expect one. That's 3.4 with 38 zeros after it. So, mathematically possible, but you can expect the sun to swallow the earth before the most secure hash functions get compromised. I mean, you can't. You'll be gone by then. But your files will still be safe.
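Both properties are easy to see for yourself: the digest is the same length whether the input is five bytes or five megabytes, and even a one-character change produces a completely different string. A quick sketch, with arbitrary inputs:

```python
import hashlib

for data in [b"hello", b"hello!", b"hello" * 1_000_000]:
    # Always 40 hex characters, no matter the input size, and the
    # one-character difference between the first two inputs changes everything.
    print(len(data), hashlib.sha1(data).hexdigest())
```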
Now that you're at least as much an expert on cryptography as most of the bitcoin hodlers, why does any of this matter?
We were talking about data integrity.
You'd be right to guess that ZFS uses these hash functions to provide it. It goes further than just validating individual files. A little bit of cryptographic genius called a Merkle tree is the key. These don't just hash the content on disk for later validation; they build a tree of hashes, where the leaf nodes are hashed by the nodes above them in the tree, which are themselves hashed by the root node. If any part of this system is corrupted (because the disk is broken, or someone changed the content some other way) it's easy to detect. It's not just that the individual hash will be different; remember each parent hashes all of its children, so now the parent is wrong. And its parent is wrong, too.
If the content is changed by any mechanism that does not also update the Merkle tree, then it is easy to detect by rehashing all of the content and comparing the results to the stored tree.
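Here's a toy version of the idea (a sketch only, not how ZFS or Git actually lay out their structures): hash every block, then hash the hashes in pairs, level by level, until a single root remains. Change any block and the root no longer matches.

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

def merkle_root(blocks: list[bytes]) -> str:
    """Hash each block, then repeatedly hash pairs of hashes up to one root."""
    level = [h(b) for b in blocks]
    while len(level) > 1:
        pairs = [level[i:i + 2] for i in range(0, len(level), 2)]
        level = [h("".join(pair).encode()) for pair in pairs]
    return level[0]

blocks = [b"block 0", b"block 1", b"block 2", b"block 3"]
root = merkle_root(blocks)

# Corrupt one block: its leaf hash changes, so its parent changes,
# and so does the root. The stored root no longer matches.
blocks[2] = b"block 2, silently flipped"
assert merkle_root(blocks) != root
```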
This is how ZFS validates data integrity. It can write a block to disk, then pull the block and ensure it still matches the hash. When it writes a block, it updates the parallel tree, and when you ask for the block later, it can tell you if the block is still correct. If it's not, it throws an error instead of handing it back to you.
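In toy form, the behavior looks something like this (a sketch of the idea, nothing like ZFS's real implementation): every write records a checksum, and every read verifies it before handing the data back.

```python
import hashlib

class CheckedBlockStore:
    """Keeps a hash for every block and verifies it on every read."""

    def __init__(self):
        self.blocks = {}  # block id -> data
        self.sums = {}    # block id -> expected hash

    def write(self, block_id: str, data: bytes) -> None:
        self.blocks[block_id] = data
        self.sums[block_id] = hashlib.sha1(data).hexdigest()

    def read(self, block_id: str) -> bytes:
        data = self.blocks[block_id]
        if hashlib.sha1(data).hexdigest() != self.sums[block_id]:
            # Return an error rather than silently corrupted data.
            raise IOError(f"block {block_id} failed its integrity check")
        return data
```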
When I first learned of this, it seemed overkill, but over time I remembered just how many ways there are for data to get corrupted. The most obvious one is someone changing it for nefarious reasons, but far more commonly you have a failure somewhere in the writing or reading process. The old spinning disks were error-prone, and the new SSD drives degrade eventually. It's the complexity of reading and writing that really gets you, though: There are multiple layers of caches, drivers, and connections, any of which could introduce corruption.
For the first time on a normal production system, you could at least detect any of those problems. It's too bad no one ever used it.3
I know, I know, you came to hear about how you could get all the awesomeness of blockchain without using the blockchain, and instead I'm giving lessons on two things you could literally not care less about: cryptography and filesystems. Don't worry. It gets worse from here.
Long after I learned about and promptly forgot ZFS (after all, it's not like I was using it), I adopted Git. It's a version control system, used for storing and managing source code. Every geek knows about it, but most of the world only recently learned of it when Microsoft bought Github for $7.5B with a "b". I was an early adopter, switching Puppet to Git in 20084. Eventually I even learned how it works. I was titillated and a bit horrified that I had duplicated in Puppet one of the key features that made Git work: A system of storing files that allowed them to be looked up by their content (or rather, a hash of their content). Normally you store files by a name, but if lots of people (or, in Puppet's case, computers) store the same file, they might not call it the same thing, so Git and Puppet instead stored them by their hash. This ensured we never backed up more than one copy of a file, saving a lot of space, and made it easy to check for changes in files.
For Puppet, we just used this to back up files we changed, in case people later wanted to revert.
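The storage trick itself is small enough to sketch (toy code, not Puppet's or Git's actual storage layer): a file's hash is its name, so identical content only ever gets stored once, and the address doubles as a checksum.

```python
import hashlib
import os

def store(directory: str, content: bytes) -> str:
    """Store content under its own hash and return that hash as its address."""
    address = hashlib.sha1(content).hexdigest()
    path = os.path.join(directory, address)
    if not os.path.exists(path):  # identical content is deduplicated for free
        with open(path, "wb") as f:
            f.write(content)
    return address

def fetch(directory: str, address: str) -> bytes:
    with open(os.path.join(directory, address), "rb") as f:
        return f.read()
```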
Git did a lot more than that.
Like ZFS, it builds a Merkle tree of the entire file repository, with a similar goal: To understand what files have changed and how. After all, git is used to track and share changes to a collection of files. The sharing is a critical component; you can easily copy an entire git repository to another computer, or another person, and it's important that they be able to confirm that they have a faithful copy.
Git stores the hash tree alongside all of the files. At any point, you can use that tree to validate every file in the repository. If there are changes (which is pretty much the whole point of a version control system), it can automatically store the new files and update the related tree.
Just like ZFS, one of the key features here is that the Merkle tree allows us to validate every file stored. We can walk the file tree and compare each file to its hash, and then compare the file listing to its own hash, all the way up. Any discrepancy is easily spotted.
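You can see the bottom of that stack for yourself: Git's ID for a file's contents (a "blob") is just a SHA-1 over a small header plus the bytes of the file. A sketch (the file name is hypothetical, and newer repositories can be configured to use SHA-256 instead):

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Compute the object ID Git assigns to a file's contents (a blob)."""
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()

# Should print the same ID as:  git hash-object README.md
with open("README.md", "rb") as f:
    print(git_blob_id(f.read()))
```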
This is my favorite kind of cleverness: It's simple in implementation, yet makes Git more flexible and useful. It has power that other version control systems are missing, just because it relies on this basic mechanism for storage and validation.
Ok. Now we get to the point.
Again, I'm not actually interested in the blockchain. I'm interested in peeling it apart, putting the useful bits to work while avoiding the whole anarcho-capitalist aspect.
It would be easy to see the blockchain as a sudden revolution, a dramatic change in what's possible. Viewed this way, it's hard to separate the pieces from the whole. If all you see is the big picture, it's easy not to notice that every individual component has its own history, its own value.
The blockchain was gradual, for both me and the industry. It was not one giant leap forward. It was part of a story, a sequence, and the most interesting aspect, Merkle trees, is decades old in math and now pushing decades old even in popular usage. Most of the interesting features touted in the blockchain come directly from them. Immutability (which isn't) and trustless systems derive from them directly.
It's worth understanding that history, to see which stages and steps apply to problems you have. The current cryptocurrency tech stack is built to solve problems I don't think exist. Certainly they aren't problems I have.
Unlike the blockchain as a whole, though, the individual technical components have been used for years, even decades, in production. Focusing on the current trend can blind you to the opportunity history demonstrates. I think you're a lot more likely to find broadly applicable solutions there than in trying to replace currency.
Because I got here from the world of filesystems and version control, I see different benefits than you might if you approach thinking of currencies or exchanges. Or chat messages. That does not make me right or wrong, but it does, at least, mean we're going to work on different problems.
I expect most of you think this is boring. Thatâs great. It will give me that much more time to build something.
- My brightest memory is learning that of course the "echo" command resets the exit code variable. This was a critical early lesson in how your own debugging can dramatically change the behavior of a program.
- When people talk about the futility of trying to ban cryptography, this is what they mean: You can't ban math.
- Yes, I know some people use and love ZFS. But never to the extent it should be.
- Resulting in one of our critical community members abandoning Puppet in protest, for some reason.
Originally published at Writing by Luke Kanies.