Well, Vinoth at Linked IN is waiting for, eh, 6 months for this post? Sorry for this. So, this one’s for you!
We’ll be busy with working on getting connectors to run Y3 on other scalable key values, and on AWS. So finally here it is: what is below the dynamo ring, and why?
First, let’s see the TOP 3 reasons to have our own low level key value store!
Number 3: Quick deployment & licensing
Y3 is full JAVA, so a built-in, full JAVA storage chain makes sense, allowing a fully integrated processing + storage node. Less competences to maintain: simpler is better. Customers might also want to integrate the platform with their own products: we had to eliminate any third party licensing needs for the NoSQL store.
Number 2: Ready for the future Y3 extensions
Future Y3 extensions, such as synchronous publishing for Internet Of Things require more from the KV than just storage. Also, some applications like legal archiving require a full control on how information is written and deleted.
Number 1: Cost of operations ( You knew it, didn’t you ?)
- When service production cost is the key, every cent counts. Covering the full chain with a single support contract will always be cheaper.
- The built-in storage has been optimized for delivering the job at the lowest possible cost. We wanted to be able to run on cheap CPUs with low wattage, cheap motherboards with limited memory and no RAID , cheap “green” disks not necessarily optimized for multi-user access. And all that with decent performance and some unusual fast rebuild options in case of disk crash.
So, how does it work? Behind the ring we have added the DiskMap low level key value storage. BTree based Key Value stores have decent performance with the full index in memory: you have to choose between performance and memory footprint.
Y3 is building elastic B+Trees on top of the KV ring, so range queries are not required from the NoSQL store, there are managed at the Y3 API level. This unique feature allows us to also encrypt collaboratively indexes with the Y3 built-in PKI, for easy compliance (data segregation by encryption). So all we needed was the most simple key value storage structure: the great hash table.
DiskMap is a “chained” hash table, combining a bucket table (the hash table itself), and log forward needle files containing chained data blocks of information related to the same “hash”.
From the data key, a hash is computed and links to a slot in a bucket table (in RED on the picture). If no Data is stored yet on this bucket, a new Needle is created at the end of the last Needle log file, or a new file is created if the last one is full. Then, the needle is chained by the bucket. By playing with the needle addressing you can reach a footprint of 6 to 10 bytes per key, depending on how many keys you need on the server, the minimum data block size, the log file size ( files are sequentially numbered ).
If there are already Needles behind the bucket, the new needle is built and chained to the previous one. Updates simply chain a new one or create a deletion mark needle that will be found before the obsolete one.
The needle log files are complemented by a logged bucket table, representing the in-memory bucket table at one moment in time.
There are two kind of commits: log commit and bucket commit. Like any log forward database, DiskMap does log commit on a configurable, regular basis.
Depending on activity, bucket commit occurs, transferring the bucket table to disk. This is done by marking a commit point and transferring all bucket writes to a temporary table during commit write, which can be slow depending on device type ( we use a small SSD, but it works on a HDD, anyway a separate axis is mandatory ).
Upon restart, log is parsed from commit point to rebuild the bucket table. Should the bucket table be missing, this is a cold start, reading all needles. This forward logging structure allows quick rebuilds with a simple rsync. With a dedicated dynamo ring routing, you can even pair triples of disks and rebuild quickly by attaching the disk on a empty slot of a valid server and rsync the logs, eliminating the network bottleneck.
Caches, routing, cleaning, and more
Of course, this is a simplified vision. There are needle caches and needle pointer caches to speed up access, smart routings to make sure related blocks of data fall one after the other on the same bucket chain (e.g. blocks from the same file), eventual consistency resolving at write time ( to save disk space ), and a cost/benefit cleaner for getting back unused space. Those little tricks make the store versatile and suited for small blocks aimed at real time apps, and for crazy large objects as well. But that will be for other posts…
At least you know how we can reach extremely low costs per GB, and provide competitive prices for large file sending and online file sharing. Did I say that already ? Well, yes, but we also have to boost SEO on those subjects.