How serious is Sam? by @edwardstanfield

by Edward, July 9th, 2017

Too Long; Didn't Read

In response to Ayende’s code review of Resin (https://github.com/kreeben/resin), part III (https://ayende.com/blog/178947/reviewing-resin-part-iii?key=c73f964b561b4bfaab80a72624d4e568).

In response to Ayende’s code review of Resin, part III.

You may ask how seriously one should take Resin. I’m curious: how seriously are you willing to take me?

I’m going to try to create an OSS web-scale search engine. I find relevance over a web-scale corpus to be disgustingly intriguing. I’m so fascinated by it that I’ve made it my life’s quest. It’s what I shall be doing right up until the day I die. It’s what I want to be good at.

As of now, at what scale can Resin perform? My test data has been the English version of Wikimedia and ~20K novels from Project Gutenberg. Since Resin is a trie, it can take a lot of data. Lots and lots and lots, and then all of a sudden, once your data reaches a certain scale, a Unicode trie representing it ceases to expand: new documents mostly contain terms the trie already holds.
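
The saturation idea can be sketched with a toy character trie. This is only an illustration, not Resin's code: the `Trie` and `TrieNode` classes and the `add` method are hypothetical names made up for this example. The sketch counts how many new nodes each inserted term creates, to show that terms sharing prefixes with existing ones add few nodes and repeated terms add none.

```python
class TrieNode:
    """One character position in the trie (hypothetical structure)."""
    __slots__ = ("children", "is_term")

    def __init__(self):
        self.children = {}    # maps a single character to a child node
        self.is_term = False  # True if a term ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()
        self.node_count = 0  # number of nodes, not counting the root

    def add(self, term):
        """Insert a term and return how many new nodes it required."""
        node = self.root
        created = 0
        for ch in term:
            child = node.children.get(ch)
            if child is None:
                child = TrieNode()
                node.children[ch] = child
                created += 1
            node = child
        node.is_term = True
        self.node_count += created
        return created


trie = Trie()
# Each later term reuses more of what is already stored.
growth = [trie.add(t) for t in ["tree", "trie", "trees", "tree"]]
print(growth)           # [4, 2, 1, 0]
print(trie.node_count)  # 7
```

Once most terms in new documents are already present, `add` keeps returning 0 and the node count stops growing, which is the flattening described above.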

The point where all terms known to man or woman are contained within one data structure is where I’m trying to steer this newly crafted ship.

Microsoft and Google say to use this:

Microsoft DocumentDB. It’s a little boring. Can I get a drink to cabin 237 please? http://i2.cdn.cnn.com/cnnnext/dam/assets/160108142736-regents-seven-seas-explorer-super-169.jpg

I’m having a lovely time with this at the moment:

Cheap and with no distractions. Just me, my plastic boat and the sea. Wait, someone’s already here. http://www.jeffkellerphotography.com/wordpress/wp-content/uploads/2012/08/Im-on-a-boat-Sea-kayaking-is-hard-work.jpg

The critique

  1. There is cause for concern about hash collisions, and the effects of these collisions are unclear.
  2. Code readability is a little bit subpar.
  3. Corpus-wide compression vs row or column based. Interesting discussion.
  4. Serialization allocates far more memory than is strictly necessary, making writing a slower and more energy-hungry operation than it needs to be.

There is a field name restriction mentioned that seems off. I do believe the name pattern for a tree is “{indexVersion}-{fieldNameHash}.tri”. For a glimpse into the commit Oren is reviewing, click here.
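
To make the naming pattern and the collision concern concrete, here is a minimal sketch in Python. Only the pattern “{indexVersion}-{fieldNameHash}.tri” comes from the text above. The functions `field_name_hash`, `trie_file_name` and `find_collisions`, the use of SHA-256 as a stand-in hash, and the `buckets` parameter are assumptions made for illustration and say nothing about how Resin actually hashes field names.

```python
import hashlib


def field_name_hash(field_name, buckets=2**64):
    """Stand-in hash (assumption): first 8 bytes of SHA-256, reduced to `buckets` values."""
    digest = hashlib.sha256(field_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def trie_file_name(index_version, field_name, buckets=2**64):
    """Build a name following the pattern {indexVersion}-{fieldNameHash}.tri."""
    return f"{index_version}-{field_name_hash(field_name, buckets)}.tri"


def find_collisions(field_names, buckets=2**64):
    """Return pairs of distinct field names that hash to the same value."""
    seen = {}
    collisions = []
    for name in field_names:
        h = field_name_hash(name, buckets)
        if h in seen and seen[h] != name:
            collisions.append((seen[h], name))
        else:
            seen.setdefault(h, name)
    return collisions


# Any field name maps to a valid file name, since only the hash is used.
print(trie_file_name(1, "title"))

# Shrinking the hash space to a single bucket forces a collision,
# so two different fields end up with the same file name.
print(find_collisions(["title", "body"], buckets=1))  # [('title', 'body')]
print(trie_file_name(1, "title", buckets=1) == trie_file_name(1, "body", buckets=1))  # True
```

Under this scheme the field name itself never appears in the file name, so the characters it contains are not restricted by the file system. The price is that two fields with the same hash would share one file, which is the effect the first point of the critique asks about.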

See you in the comments section.