# Specification of the Garage K2V API (K2V = Key/Key/Value)

- We are storing triplets of the form `(partition key, sort key, value)` -> no
  user-defined fields, the client is responsible for writing whatever it wants
  in the value (typically an encrypted blob). Values are binary blobs, which
  are always represented as their base64 encoding in the JSON API. Partition
  keys and sort keys are UTF-8 strings.

- Triplets are stored in buckets; each bucket stores a separate set of triplets.

- Bucket names and access keys are the same as for accessing the S3 API.

- K2V triplets exist separately from S3 objects. K2V triplets don't exist for
  the S3 API, and S3 objects don't exist for the K2V API.

- Values stored for triplets have associated causality information, which enables
  Garage to detect concurrent writes. In case of concurrent writes, Garage
  keeps the concurrent values until a further write supersedes them. This is
  the same method that Riak KV implements. The method used is
  based on DVVS (dotted version vector sets), described in the paper "Scalable
  and Accurate Causality Tracking for Eventually Consistent Data Stores", as
  well as [here](https://github.com/ricardobcl/Dotted-Version-Vectors).


## Data format

### Triple format

Triples in K2V consist of three fields:

- a partition key (`pk`), a UTF-8 string that defines in what partition the
  triplet is stored; triplets in different partitions cannot be listed together
  in a ReadBatch command, or deleted together in a DeleteBatch command: a
  separate command must be included in the ReadBatch/DeleteBatch call for each
  partition key in which the client wants to read/delete lists of items

- a sort key (`sk`), a UTF-8 string that defines the index of the triplet inside its
  partition; triplets are uniquely identified by their partition key + sort key

- a value (`v`), an opaque binary blob associated to the partition key + sort key;
  values are transmitted as binary when possible, but in most cases in the JSON API
  they will be represented as strings using base64 encoding; a value can also
  be `null` to indicate a deleted triplet (a `null` value is called a tombstone)

### Causality information

K2V supports storing several concurrent values associated to a pk+sk, in the
case where insertion or deletion operations are detected to be concurrent (i.e.
neither was aware of the other; they are not causally dependent on one
another). In practice, the mechanism works the other way around: to
overwrite a previously existing value, the client must provide a "causality token"
that "proves" (not in a cryptographic sense) that it had seen a previous value.
Otherwise, the value written will not overwrite an existing value; it will just
create a new concurrent value.

The causality token is a binary/b64-encoded representation of a context,
specified below.

A set of concurrent values looks like this:

```
(node1, tdiscard1, (v1, t1), (v2, t2)) ; tdiscard1 < t1 < t2
(node2, tdiscard2, (v3, t3))           ; tdiscard2 < t3
```

`tdiscard` for a node `i` means that all values inserted by node `i` with times
`<= tdiscard` are obsoleted, i.e. have been read by a client that later
overwrote them.

The associated context would be the following: `[(node1, t2), (node2, t3)]`,
i.e. if a node reads this set of values and inserts a new value, we will now
have `tdiscard1 = t2` and `tdiscard2 = t3`, to indicate that values v1, v2 and v3
are obsoleted by the new write.
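To make the notation above concrete, here is a minimal Rust sketch of the
per-node value sets and of how a causality context can be derived from them.
The type and function names are illustrative only and do not correspond to
Garage's internal code:

```rust
use std::collections::BTreeMap;

/// One entry of the set of values stored for a (pk, sk) pair, for a single node.
/// `t_discard` obsoletes every value this node inserted with timestamp <= t_discard.
struct NodeValues {
    t_discard: u64,
    /// (timestamp, value); a `None` value is a tombstone.
    values: Vec<(u64, Option<Vec<u8>>)>,
}

/// The full set of concurrent values: one entry per node that wrote to the key.
type ValueSet = BTreeMap<u64 /* node id */, NodeValues>;

/// The causality context returned to the client: for each node, the largest
/// timestamp among the values the client has seen.
fn context_of(set: &ValueSet) -> Vec<(u64, u64)> {
    set.iter()
        .map(|(node, nv)| {
            let t_max = nv.values.iter().map(|(t, _)| *t).max().unwrap_or(nv.t_discard);
            (*node, t_max)
        })
        .collect()
}

fn main() {
    // The example from the text: node1 stores (v1, t1), (v2, t2); node2 stores (v3, t3).
    let mut set = ValueSet::new();
    set.insert(1, NodeValues { t_discard: 0, values: vec![(1, Some(b"v1".to_vec())), (2, Some(b"v2".to_vec()))] });
    set.insert(2, NodeValues { t_discard: 0, values: vec![(3, Some(b"v3".to_vec()))] });
    // Yields [(1, 2), (2, 3)], i.e. [(node1, t2), (node2, t3)].
    println!("{:?}", context_of(&set));
}
```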
**Basic insertion.** To insert a new value `v4` with context `[(node1, t2), (node2, t3)]`, in a
simple case where there was no insertion in-between reading the value
mentioned above and writing `v4`, and supposing that node2 receives the
InsertItem query:

- `node2` generates a timestamp `t4` such that `t4 > t3`.
- the new state is as follows:

```
(node1, tdiscard1', ())       ; tdiscard1' = t2
(node2, tdiscard2', (v4, t4)) ; tdiscard2' = t3
```

**A more complex insertion example.** In the general case, other intermediate values could have
been written before `v4` with context `[(node1, t2), (node2, t3)]` is sent to the system.
For instance, here is a possible sequence of events:

1. First we have the set of values v1, v2 and v3 described above.
   A node reads it, and obtains values v1, v2 and v3 with context `[(node1, t2), (node2, t3)]`.

2. A node writes a value `v5` with context `[(node1, t1)]`, i.e. `v5` is only a
   successor of v1 but not of v2 or v3. Suppose node1 receives the write; it
   will generate a new timestamp `t5` larger than all of the timestamps it
   knows of, i.e. `t5 > t2`. We will now have:

```
(node1, tdiscard1'', (v2, t2), (v5, t5)) ; tdiscard1'' = t1 < t2 < t5
(node2, tdiscard2, (v3, t3))             ; tdiscard2 < t3
```

3. Now `v4` is written with context `[(node1, t2), (node2, t3)]`, and node2
   processes the query. It will generate `t4 > t3` and the state will become:

```
(node1, tdiscard1', (v5, t5)) ; tdiscard1' = t2 < t5
(node2, tdiscard2', (v4, t4)) ; tdiscard2' = t3
```

**Generic algorithm for handling insertions:** A certain node n handles the
InsertItem and is responsible for the correctness of this procedure.

1. Lock the key (or the whole table?) at this node to prevent concurrent updates of the value that would mess things up
2. Read current set of values
3. Generate a new timestamp that is larger than the largest timestamp for node n
4. Add the inserted value in the list of values of node n
5. Update the discard times to be the times set in the context, and accordingly discard overwritten values
6. Release lock
7. Propagate updated value to other nodes
8. Return to user when propagation has achieved the write quorum (propagation to other nodes continues asynchronously)

**Encoding of contexts:**

Contexts consist of a list of (node id, timestamp) pairs.
They are encoded in binary as follows:

```
checksum: u64, [ node: u64, timestamp: u64 ]*
```

The checksum is just the XOR of all of the node IDs and timestamps.

Once encoded in binary, contexts are written and transmitted in base64.
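As a rough illustration, the binary layout above could be produced along these
lines. This is a sketch only: the byte order is an assumption of this sketch
(the draft does not specify endianness), and the final base64 step is left as
a comment:

```rust
/// Encode a causality context, i.e. a list of (node id, timestamp) pairs,
/// as: checksum: u64, [ node: u64, timestamp: u64 ]*
/// The checksum is the XOR of all node IDs and timestamps.
fn encode_context(ctx: &[(u64, u64)]) -> Vec<u8> {
    let checksum = ctx.iter().fold(0u64, |acc, (node, ts)| acc ^ node ^ ts);
    let mut out = Vec::with_capacity(8 * (1 + 2 * ctx.len()));
    // NOTE: little-endian is an assumption made for this sketch only.
    out.extend_from_slice(&checksum.to_le_bytes());
    for (node, ts) in ctx {
        out.extend_from_slice(&node.to_le_bytes());
        out.extend_from_slice(&ts.to_le_bytes());
    }
    // The resulting bytes would then be base64-encoded (e.g. with the `base64`
    // crate) before being handed to the client as the causality token.
    out
}

fn main() {
    let token_bytes = encode_context(&[(1, 2), (2, 3)]);
    assert_eq!(token_bytes.len(), 8 * 5); // checksum + 2 * (node, timestamp)
    println!("{:02x?}", token_bytes);
}
```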
### Indexing

K2V keeps an index, a secondary data structure that is updated asynchronously,
that keeps track of the number of triplets stored for each partition key.
This allows easy listing of all of the partition keys for which triplets exist
in a bucket, as the partition key becomes the sort key in the index.

How indexing works:

- Each node keeps a local count of how many items it stores for each partition,
  in a local Sled tree that is updated atomically when an item is modified.
- These local counters are asynchronously stored in the index table, which is
  a regular Garage table spread in the network. Counters are stored as LWW values,
  so basically the final table will have the following structure:

```
- pk: bucket
- sk: partition key for which we are counting
- v: lwwmap (node id -> number of items)
```

The final number of items present in the partition can be estimated by taking
the maximum of the values (i.e. the value for the node that announces having
the most items for that partition). In most cases the values for different node
IDs should all be the same; more precisely, three node IDs should map to the
same non-zero value, and all other node IDs that are present are tombstones
that map to zeroes. Note that we need to filter out values from nodes that are
no longer part of the cluster layout, as when nodes are removed they won't
necessarily have had the time to set their counters to zero.
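A minimal sketch of this estimation rule, assuming the per-partition counter
map has already been fetched and that the set of node IDs currently in the
layout is known (all names here are illustrative):

```rust
use std::collections::{HashMap, HashSet};

/// Estimate the number of items in a partition from the LWW map
/// `node id -> number of items`, keeping only counters announced by nodes
/// that are still part of the cluster layout, and taking the maximum.
fn estimate_item_count(counters: &HashMap<u64, u64>, layout_nodes: &HashSet<u64>) -> u64 {
    counters
        .iter()
        .filter(|(node, _)| layout_nodes.contains(*node))
        .map(|(_, count)| *count)
        .max()
        .unwrap_or(0)
}

fn main() {
    let counters = HashMap::from([(1, 3043), (2, 3043), (3, 3043), (4, 0)]);
    // Node 4 has been removed from the layout; its (possibly stale) counter is ignored.
    let layout = HashSet::from([1, 2, 3]);
    assert_eq!(estimate_item_count(&counters, &layout), 3043);
}
```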
## Important details

**THIS SECTION CONTAINS A FEW WARNINGS ON THE K2V API WHICH ARE IMPORTANT
TO UNDERSTAND IN ORDER TO USE IT CORRECTLY.**

- **Internal server errors on updates do not mean that the update isn't stored.**
  K2V will return an internal server error when it cannot reach a quorum of nodes on
  which to save an updated value. However the value may still be stored on just one
  node, which will then propagate it to other nodes asynchronously via anti-entropy.

- **Batch operations are not transactions.** When calling InsertBatch or DeleteBatch,
  items may appear partially inserted/deleted while the operation is being processed.
  More importantly, if InsertBatch or DeleteBatch returns an internal server error,
  some of the items to be inserted/deleted might end up inserted/deleted on the server,
  while others may still have their old value.

- **Concurrent values are deduplicated.** When inserting a value for a key,
  Garage might internally end up
  storing the value several times if there are network errors. These values will end up as
  concurrent values for a key, with the same byte string (or `null` for a deletion).
  Garage fixes this by deduplicating concurrent values when they are returned to the
  user on read operations. Importantly, *Garage does not differentiate between duplicate
  concurrent values due to the user making the same call twice, or Garage having to
  do an internal retry*. This means that all duplicate concurrent values are deduplicated
  when an item is read: if the user inserts the same value twice concurrently, they will
  only read it once.

## API Endpoints

**Remark.** Example queries and responses here are given in JSON5 format
for clarity. However, the actual K2V API uses plain JSON, so all examples
and responses need to be translated.

### Operations on single items

**ReadItem: `GET /<bucket>/<partition key>?sort_key=<sort key>`**


Query parameters:

| name | default value | meaning |
| - | - | - |
| `sort_key` | **mandatory** | The sort key of the item to read |

Returns the item with the specified partition key and sort key. Values can be
returned in either of two ways:

1. a JSON array of base64-encoded values, or `null`s for tombstones, with
   header `Content-Type: application/json`

2. in the case where there are no concurrent values, the single present value
   can be returned directly as the response body (or an HTTP 204 NO CONTENT for
   a tombstone), with header `Content-Type: application/octet-stream`

The choice between return formats 1 and 2 is directed by the `Accept` HTTP header:

- if the `Accept` header is not present, format 1 is always used

- if `Accept` contains `application/json` but not `application/octet-stream`,
  format 1 is always used

- if `Accept` contains `application/octet-stream` but not `application/json`,
  format 2 is used when there is a single value, and an HTTP 409 CONFLICT
  error is returned in the case of multiple concurrent values
  (including concurrent tombstones)

- if `Accept` contains both, format 2 is used when there is a single value, and
  format 1 is used as a fallback in case of concurrent values

- if `Accept` is present but contains neither, HTTP 406 NOT ACCEPTABLE is raised

Example query:

```
GET /my_bucket/mailboxes?sort_key=INBOX HTTP/1.1
```

Example response:

```json
HTTP/1.1 200 OK
X-Garage-Causality-Token: opaquetoken123
Content-Type: application/json

[
  "b64cryptoblob123",
  "b64cryptoblob'123"
]
```

Example response in case the item is a tombstone:

```
HTTP/1.1 200 OK
X-Garage-Causality-Token: opaquetoken999
Content-Type: application/json

[
  null
]
```

Example query 2:

```
GET /my_bucket/mailboxes?sort_key=INBOX HTTP/1.1
Accept: application/octet-stream
```

Example response if multiple concurrent versions exist:

```
HTTP/1.1 409 CONFLICT
X-Garage-Causality-Token: opaquetoken123
Content-Type: application/octet-stream
```

Example response in case of a single value:

```
HTTP/1.1 200 OK
X-Garage-Causality-Token: opaquetoken123
Content-Type: application/octet-stream

cryptoblob123
```

Example response in case of a single value that is a tombstone:

```
HTTP/1.1 204 NO CONTENT
X-Garage-Causality-Token: opaquetoken123
Content-Type: application/octet-stream
```


**PollItem: `GET /<bucket>/<partition key>?sort_key=<sort key>&causality_token=<causality token>`**

This endpoint will block until a new value is written to the key.

The GET parameter `causality_token` should be set to the causality
token returned with the last read of the key, so that K2V knows
which values are concurrent with or newer than the ones that the
client previously knew about.

This endpoint returns the new value in the same format as ReadItem.
If no new value is written and the timeout elapses,
an HTTP 304 NOT MODIFIED is returned.

Query parameters:

| name | default value | meaning |
| - | - | - |
| `sort_key` | **mandatory** | The sort key of the item to read |
| `causality_token` | **mandatory** | The causality token of the last known value or set of values |
| `timeout` | 300 | The timeout before 304 NOT MODIFIED is returned if the value isn't updated |

The timeout can be set to any number of seconds, with a maximum of 600 seconds (10 minutes).


**InsertItem: `PUT /<bucket>/<partition key>?sort_key=<sort_key>`**

Inserts a single item. This request does not use JSON; the body is sent directly as a binary blob.

To supersede previous values, the HTTP header `X-Garage-Causality-Token` should
be set to the causality token returned by a previous read on this key. This
header can be omitted for the first writes to the key.
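As a side note before the example query below, here is a rough sketch of a
client-side read-modify-write cycle chaining ReadItem and InsertItem through
the causality token. It uses the `reqwest` crate's blocking client, omits the
request signing that Garage requires, and uses a placeholder endpoint URL, so
it illustrates the token flow rather than a working client:

```rust
// Sketch only: AWS-style request signing required by Garage is omitted here.
use reqwest::blocking::Client;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();
    let url = "http://localhost:3904/my_bucket/mailboxes?sort_key=INBOX"; // placeholder endpoint

    // 1. Read the current value(s); ask for the raw value if there is no conflict.
    let resp = client
        .get(url)
        .header("Accept", "application/octet-stream")
        .send()?;
    // 2. Remember the causality token returned with the read.
    let token = resp
        .headers()
        .get("X-Garage-Causality-Token")
        .and_then(|v| v.to_str().ok())
        .map(|s| s.to_string());
    let _current = resp.bytes()?; // current value (format 2), ignored in this sketch

    // 3. Write a new value that supersedes what we just read.
    let mut req = client.put(url).body("myblobblahblahblah");
    if let Some(tok) = token {
        req = req.header("X-Garage-Causality-Token", tok);
    }
    let resp = req.send()?;
    println!("InsertItem status: {}", resp.status());
    Ok(())
}
```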
Example query:

```
PUT /my_bucket/mailboxes?sort_key=INBOX HTTP/1.1
X-Garage-Causality-Token: opaquetoken123

myblobblahblahblah
```

Example response:

```
HTTP/1.1 200 OK
```

**DeleteItem: `DELETE /<bucket>/<partition key>?sort_key=<sort_key>`**

Deletes a single item. The HTTP header `X-Garage-Causality-Token` must be set
to the causality token returned by a previous read on this key, to indicate
which versions of the value should be deleted. The request will not be processed if
`X-Garage-Causality-Token` is not set.

Example query:

```
DELETE /my_bucket/mailboxes?sort_key=INBOX HTTP/1.1
X-Garage-Causality-Token: opaquetoken123
```

Example response:

```
HTTP/1.1 204 NO CONTENT
```

### Operations on index

**ReadIndex: `GET /<bucket>?start=<start>&end=<end>&limit=<limit>`**

Lists all partition keys in the bucket for which some triplets exist, and gives
for each the number of triplets, total number of values (which might be bigger
than the number of triplets in case of conflicts), total number of bytes of
these values, and number of triplets that are in a state of conflict.
The values returned are an approximation of the true counts in the bucket,
as these values are asynchronously updated, and thus eventually consistent.

Query parameters:

| name | default value | meaning |
| - | - | - |
| `prefix` | `null` | Restrict listing to partition keys that start with this prefix |
| `start` | `null` | First partition key to list, in lexicographical order |
| `end` | `null` | Last partition key to list (excluded) |
| `limit` | `null` | Maximum number of partition keys to list |
| `reverse` | `false` | Iterate in reverse lexicographical order |

The response consists of a JSON object that repeats the parameters of the query and gives the result (see below).

The listing starts at partition key `start`, or if not specified at the
smallest partition key that exists. It returns partition keys in increasing
order, or decreasing order if `reverse` is set to `true`,
and stops when either of the following conditions is met:

1. if `end` is specified, the partition key `end` is reached or surpassed (if it
   is reached exactly, it is not included in the result)

2. if `limit` is specified, `limit` partition keys have been listed

3. no more partition keys are available to list

In case 2, and if there are more partition keys to list before condition 1
triggers, then in the result `more` is set to `true` and `nextStart` is set to
the first partition key that couldn't be listed due to the limit. In case 1
(the listing stopped because the `end` parameter was reached), `more` is not set
and the `nextStart` key is not specified.

Note that if `reverse` is set to `true`, `start` is the highest key
(in lexicographical order) for which values are returned.
This means that if an `end` is specified, it must be smaller than `start`,
otherwise no values will be returned.
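The `more`/`nextStart` fields are what a client uses to paginate through a
large bucket (see also the example below). Here is a minimal sketch of such a
pagination loop, written against a caller-supplied fetch function rather than
a real HTTP client; the `IndexPage` type and the closure signature are
assumptions of this sketch:

```rust
/// One page of a ReadIndex response, reduced to the fields needed for pagination.
struct IndexPage {
    partition_keys: Vec<String>,
    more: bool,
    next_start: Option<String>,
}

/// Collect all partition keys by repeatedly calling ReadIndex, passing the
/// previous page's `nextStart` as the next `start` parameter while `more` is true.
fn list_all_partition_keys(mut fetch: impl FnMut(Option<String>) -> IndexPage) -> Vec<String> {
    let mut all = Vec::new();
    let mut start: Option<String> = None;
    loop {
        let page = fetch(start.take());
        all.extend(page.partition_keys);
        if page.more {
            start = page.next_start; // first key that couldn't be listed due to `limit`
        } else {
            break;
        }
    }
    all
}

fn main() {
    // Fake two-page index, standing in for actual ReadIndex calls.
    let mut calls = 0;
    let keys = list_all_partition_keys(|_start| {
        calls += 1;
        if calls == 1 {
            IndexPage { partition_keys: vec!["keys".into(), "mailbox:INBOX".into()], more: true, next_start: Some("mailbox:Junk".into()) }
        } else {
            IndexPage { partition_keys: vec!["mailbox:Junk".into(), "mailboxes".into()], more: false, next_start: None }
        }
    });
    assert_eq!(keys.len(), 4);
}
```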
Example query:

```
GET /my_bucket HTTP/1.1
```

Example response:

```json
HTTP/1.1 200 OK

{
  prefix: null,
  start: null,
  end: null,
  limit: null,
  reverse: false,
  partitionKeys: [
    {
      pk: "keys",
      entries: 3043,
      conflicts: 0,
      values: 3043,
      bytes: 121720,
    },
    {
      pk: "mailbox:INBOX",
      entries: 42,
      conflicts: 1,
      values: 43,
      bytes: 142029,
    },
    {
      pk: "mailbox:Junk",
      entries: 2991,
      conflicts: 0,
      values: 2991,
      bytes: 12019322,
    },
    {
      pk: "mailbox:Trash",
      entries: 10,
      conflicts: 0,
      values: 10,
      bytes: 32401,
    },
    {
      pk: "mailboxes",
      entries: 3,
      conflicts: 0,
      values: 3,
      bytes: 3019,
    },
  ],
  more: false,
  nextStart: null,
}
```


### Operations on batches of items

**InsertBatch: `POST /<bucket>`**

Simple insertion and deletion of triplets. The body is just a list of items to
insert in the following format:
`{ pk: "<partition key>", sk: "<sort key>", ct: "<causality token>"|null, v: "<value>"|null }`.

The causality token should be the one returned in a previous read request (e.g.
by ReadItem or ReadBatch), to indicate that this write takes into account the
values that were returned from these reads, and supersedes them causally. If
the triplet is inserted for the first time, the causality token should be set to
`null`.

The value is expected to be a base64-encoded binary blob. The value `null` can
also be used to delete the triplet while preserving causality information: this
makes it possible to know whether a delete happened concurrently with an insert, in which
case both are preserved and returned on reads (see below).

Partition keys and sort keys are UTF-8 strings which are stored sorted by
lexicographical ordering of their binary representation.

Example query:

```json
POST /my_bucket HTTP/1.1

[
  { pk: "mailbox:INBOX", sk: "001892831", ct: "opaquetoken321", v: "b64cryptoblob321updated" },
  { pk: "mailbox:INBOX", sk: "001892912", ct: null, v: "b64cryptoblob444" },
  { pk: "mailbox:INBOX", sk: "001892932", ct: "opaquetoken654", v: null },
]
```

Example response:

```
HTTP/1.1 200 OK
```


**ReadBatch: `POST /<bucket>?search`**, or alternatively<br/>
**ReadBatch: `SEARCH /<bucket>`**

Batch read of triplets in a bucket.

The request body is a JSON list of searches, each of which specifies a range of
items to get (to get single items, set `singleItem` to `true`). A search is a
JSON struct with the following fields:

| name | default value | meaning |
| - | - | - |
| `partitionKey` | **mandatory** | The partition key in which to search |
| `prefix` | `null` | Restrict items to list to those whose sort keys start with this prefix |
| `start` | `null` | The sort key of the first item to read |
| `end` | `null` | The sort key of the last item to read (excluded) |
| `limit` | `null` | The maximum number of items to return |
| `reverse` | `false` | Iterate in reverse lexicographical order on sort keys |
| `singleItem` | `false` | Whether to return only the item with sort key `start` |
| `conflictsOnly` | `false` | Whether to return only items that have several concurrent values |
| `tombstones` | `false` | Whether or not to return tombstone lines to indicate the presence of old deleted items |


For each of the searches, triplets are listed and returned separately. The
semantics of `prefix`, `start`, `end`, `limit` and `reverse` are the same as for ReadIndex.
The additional parameter `singleItem` allows getting a single item, whose sort key
is the one given in `start`. Parameters `conflictsOnly` and `tombstones`
control additional filters on the items that are returned.

The result is a list with one entry per search; each entry is a JSON object
specified similarly to the result of ReadIndex, but
that lists triplets within a partition key.

The format of returned tuples is as follows: `{ sk: "<sort key>", ct: "<causality
token>", v: ["<value1>", ...] }`, with the following fields:

- `sk` (sort key): any unicode string used as a sort key

- `ct` (causality token): an opaque token served by the server (generally
  base64-encoded) to be used in subsequent writes to this key

- `v` (list of values): each value is a binary blob, always base64-encoded;
  contains multiple items when concurrent values exist

- in case of concurrent update and deletion, a `null` is added to the list of concurrent values

- if the `tombstones` query parameter is set to `true`, tombstones are returned
  for items that have been deleted (this can be useful for inserting after an
  item that has been deleted, so that the insert is not considered
  concurrent with the delete). Tombstones are returned as tuples in the
  same format with only `null` values

Example query:

```json
POST /my_bucket?search HTTP/1.1

[
  {
    partitionKey: "mailboxes",
  },
  {
    partitionKey: "mailbox:INBOX",
    start: "001892831",
    limit: 3,
  },
  {
    partitionKey: "keys",
    start: "0",
    singleItem: true,
  },
]
```

Example associated response body:

```json
HTTP/1.1 200 OK

[
  {
    partitionKey: "mailboxes",
    prefix: null,
    start: null,
    end: null,
    limit: null,
    reverse: false,
    conflictsOnly: false,
    tombstones: false,
    singleItem: false,
    items: [
      { sk: "INBOX", ct: "opaquetoken123", v: ["b64cryptoblob123", "b64cryptoblob'123"] },
      { sk: "Trash", ct: "opaquetoken456", v: ["b64cryptoblob456"] },
      { sk: "Junk", ct: "opaquetoken789", v: ["b64cryptoblob789"] },
    ],
    more: false,
    nextStart: null,
  },
  {
    partitionKey: "mailbox:INBOX",
    prefix: null,
    start: "001892831",
    end: null,
    limit: 3,
    reverse: false,
    conflictsOnly: false,
    tombstones: false,
    singleItem: false,
    items: [
      { sk: "001892831", ct: "opaquetoken321", v: ["b64cryptoblob321"] },
      { sk: "001892832", ct: "opaquetoken654", v: ["b64cryptoblob654"] },
      { sk: "001892874", ct: "opaquetoken987", v: ["b64cryptoblob987"] },
    ],
    more: true,
    nextStart: "001892898",
  },
  {
    partitionKey: "keys",
    prefix: null,
    start: "0",
    end: null,
    conflictsOnly: false,
    tombstones: false,
    limit: null,
    reverse: false,
    singleItem: true,
    items: [
      { sk: "0", ct: "opaquetoken999", v: ["b64binarystuff999"] },
    ],
    more: false,
    nextStart: null,
  },
]
```



**DeleteBatch: `POST /<bucket>?delete`**

Batch deletion of triplets. The request format is the same as for `POST
/<bucket>?search` to indicate items or ranges of items, except that here they
are deleted instead of returned; only the fields `partitionKey`, `prefix`, `start`,
`end`, and `singleItem` are supported. Causality information is not given by
the user: this request will internally list all triplets and write deletion
markers that supersede all of the versions that have been read.

For each series of items to be deleted, this request returns the number of
matching items that have been found and deleted.
Example query:

```json
POST /my_bucket?delete HTTP/1.1

[
  {
    partitionKey: "mailbox:OldMailbox",
  },
  {
    partitionKey: "mailbox:INBOX",
    start: "0018928321",
    singleItem: true,
  },
]
```

Example response:

```
HTTP/1.1 200 OK

[
  {
    partitionKey: "mailbox:OldMailbox",
    prefix: null,
    start: null,
    end: null,
    singleItem: false,
    deletedItems: 35,
  },
  {
    partitionKey: "mailbox:INBOX",
    prefix: null,
    start: "0018928321",
    end: null,
    singleItem: true,
    deletedItems: 1,
  },
]
```


## Internals: causality tokens

The method used is based on DVVS (dotted version vector sets). See:

- the paper "Scalable and Accurate Causality Tracking for Eventually Consistent Data Stores"
- <https://github.com/ricardobcl/Dotted-Version-Vectors>

For DVVS to work, write operations (at each node) must take a lock on the data table.
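For reference, steps 1-6 of the insertion procedure described earlier (lock,
read, generate timestamp, insert, apply discard times, unlock) could be
sketched roughly as follows, reusing the same illustrative types as the
earlier sketches; none of this corresponds to Garage's actual code:

```rust
use std::collections::BTreeMap;
use std::sync::Mutex;

/// Per-node entry for one (pk, sk): discard time plus (timestamp, value) pairs.
#[derive(Default)]
struct NodeValues {
    t_discard: u64,
    values: Vec<(u64, Option<Vec<u8>>)>, // None = tombstone
}

type ValueSet = BTreeMap<u64, NodeValues>; // node id -> its values

/// Steps 1-6 of the insertion algorithm, run by the node handling InsertItem.
/// `context` is the decoded causality token: (node id, timestamp) pairs.
fn handle_insert(
    table_entry: &Mutex<ValueSet>, // step 1: per-key lock
    local_node: u64,
    context: &[(u64, u64)],
    new_value: Option<Vec<u8>>,
) {
    let mut set = table_entry.lock().unwrap(); // steps 1-2: lock, read current set

    // Step 3: generate a timestamp larger than any timestamp known for this node.
    let entry = set.entry(local_node).or_default();
    let t_new = entry.values.iter().map(|(t, _)| *t).max().unwrap_or(entry.t_discard) + 1;

    // Step 4: add the inserted value to this node's list.
    entry.values.push((t_new, new_value));

    // Step 5: apply the discard times from the context and drop obsoleted values.
    for (node, t_ctx) in context {
        let e = set.entry(*node).or_default();
        e.t_discard = e.t_discard.max(*t_ctx);
        let td = e.t_discard;
        e.values.retain(|(t, _)| *t > td);
    }
    // Step 6: the lock is released when `set` is dropped.
    // Steps 7-8 (propagation to other nodes and waiting for a write quorum)
    // happen outside of this function and are not sketched here.
}

fn main() {
    let entry = Mutex::new(ValueSet::new());
    handle_insert(&entry, 2, &[(1, 2), (2, 3)], Some(b"v4".to_vec()));
}
```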