File format of a search engine

Resin can analyze, index and store documents and has querying support for term, fuzzy and prefix. Here are the six different files, with their layouts explained, needed to support its indexing, querying and document storage capabilities.

Document addresses

[index_version].da

Documents are stored in a doc file. To find one in a file one must first look up its starting location in the file and find out its length. The da file contains the addresses of documents and their sizes, each record ordered by document ID, a integer that is incremented from zero and assigned to each document at every upsert session.

One da file per upsert session.

Size of a document address: sizeof(long) + sizeof(int)

Documents

[index_version].doc

Documents of variable length are stored along side and without delimiter. QuickLZ compression in fast mode is used by default, gzip by supplying the compression flag.

One doc file per upsert session.

Size of a document: sizeof(int) + variable sized string * numOfFields

Index info

[index_version].ix

Contains the document count and a compression flag.

One per upsert session.

Size: sizeof(int) + sizeof(byte)

Primary keys

[index_name].pk

Contains the hashed value of the primary key of each document, together with a flag stating if the document is deleted or live. When writing an index supplying a primary key is required only if you need to delete or update by that key. Documents without primary key can only be deleted by term and can never be updated.

One file per upsert session.

Size of a document record: sizeof(UInt64) + sizeof(byte)

Postings

[index_version].pos

A posting is a term count and a reference to a document. Postings are grouped by term.

One postings file per upsert session.

Size of a posting: 2 * sizeof(int)

Search trees

[index_version]-[field_hash].tri

At the heart (brain?) of each search engine, the search tree, here in the form of a doubly-chained prefix tree, the perfect merger between a trie and a binary search tree. Enough about trees.

Each trie node, its char value and its meta-data is stored in the file.

Meta-data

bool EndOfWordbool HasChildbool HasSiblingshort Depth *int Weight (numOfChildren + numOfSiblings + 1)PostingsAddress Address (only if EndOfWord, same size as a document address)

It’s a lot of information but needed to be able to scan the file from disk as if it was a left-child right-sibling search tree that exists in-memory.

The weight of a node is needed in order to determine how many nodes to skip once a scan has reached a dead end but need to move on to the next leg.

The posting address found at each EndOfWord has a size and leads to a location in the postings file that marks the beginning of the postings list associated with that particular term.

One trie file per document field per upsert session.

Size of a node: sizeof(char) + 3 * sizeof(byte) + sizeof(int) + sizeof(short) + SizeOfPostingsAddress()Size of a end of word node: SizeOfNode() + SizeOfPostingsAddress()

File updates

No file updates are necessary because of a versioning scheme that at querying time detects which files are live and which are dead.

Conclusion

If these are the files Resin, Lucene and, in one form or another, Google needs, no more and no less, then it is far from rocket science, what they do there, over at Google’s.

Cheers!

File format of a search engine

Too Long; Didn't Read

Companies Mentioned

Document addresses

Documents

Index info

Primary keys

Postings

Search trees

File updates

Conclusion

About Author

TOPICS

THIS ARTICLE WAS FEATURED IN...

Categories

Trending Topics

File format of a search engine

Too Long; Didn't Read

Companies Mentioned

Document addresses

Documents

Index info

Primary keys

Postings

Search trees

File updates

Conclusion

About Author

TOPICS

THIS ARTICLE WAS FEATURED IN...

RELATED STORIES

Categories

Trending Topics