# Specification of the Garage K2V API (K2V = Key/Key/Value)

- We are storing triplets of the form `(partition key, sort key, value)` -> no
  user-defined fields, the client is responsible for writing whatever they want
  in the value (typically an encrypted blob). Values are binary blobs, which
  are always represented as their base64 encoding in the JSON API. Partition
  keys and sort keys are UTF-8 strings.

- Triplets are stored in buckets; each bucket stores a separate set of triplets

- Bucket names and access keys are the same as for accessing the S3 API

- K2V triplets exist separately from S3 objects. K2V triplets don't exist for
  the S3 API, and S3 objects don't exist for the K2V API.

- Values stored for triplets have associated causality information that enables
  Garage to detect concurrent writes. In case of concurrent writes, Garage
  keeps all concurrent values until a further write supersedes them. This is
  the same method as implemented by Riak KV. The method used is based on DVVS
  (dotted version vector sets), described in the paper "Scalable and Accurate
  Causality Tracking for Eventually Consistent Data Stores", as well as
  [here](https://github.com/ricardobcl/Dotted-Version-Vectors)


## Data format

### Triplet format

Triplets in K2V consist of three fields:

- a partition key (`pk`), a UTF-8 string that defines in which partition the
  triplet is stored; triplets in different partitions cannot be listed together
  in a ReadBatch command, or deleted together in a DeleteBatch command: a
  separate command must be included in the ReadBatch/DeleteBatch call for each
  partition key in which the client wants to read/delete lists of items

- a sort key (`sk`), a UTF-8 string that defines the index of the triplet inside its
  partition; triplets are uniquely identified by their partition key + sort key

- a value (`v`), an opaque binary blob associated with the partition key + sort key;
  it is transmitted as binary when possible but in most cases in the JSON API
  it will be represented as a string using base64 encoding; a value can also
  be `null` to indicate a deleted triplet (a `null` value is called a tombstone)

### Causality information

K2V supports storing several concurrent values associated with a pk+sk, in the
case where insertion or deletion operations are detected to be concurrent (i.e.
neither was aware of the other; they are not causally dependent on one
another).  In practice, the logic works the other way around: to overwrite a
previously existing value, the client must give a "causality token" that
"proves" (not in a cryptographic sense) that it had seen a previous value.
Otherwise, the value written will not overwrite an existing value, it will just
create a new concurrent value.

The causality token is a binary/b64-encoded representation of a context,
specified below.

A set of concurrent values looks like this:

```
(node1, tdiscard1, (v1, t1), (v2, t2)) ; tdiscard1 < t1 < t2
(node2, tdiscard2, (v3, t3))           ; tdiscard2 < t3
```

`tdiscard` for a node `i` means that all values inserted by node `i` with times
`<= tdiscard` are obsoleted, i.e. have been read by a client that overwrote them
afterwards.

The associated context would be the following: `[(node1, t2), (node2, t3)]`,
i.e. if a node reads this set of values and inserts a new value, we will now
have `tdiscard1 = t2` and `tdiscard2 = t3`, to indicate that values v1, v2 and v3
are obsoleted by the new write.

**Basic insertion.** To insert a new value `v4` with context `[(node1, t2), (node2, t3)]`, in a
simple case where there was no insertion in-between reading the value
mentioned above and writing `v4`, and supposing that node2 receives the
InsertItem query:

- `node2` generates a timestamp `t4` such that `t4 > t3`.
- the new state is as follows:

```
(node1, tdiscard1', ())       ; tdiscard1' = t2
(node2, tdiscard2', (v4, t4)) ; tdiscard2' = t3
```

**A more complex insertion example.** In the general case, other intermediate values could have
been written before `v4` with context `[(node1, t2), (node2, t3)]` is sent to the system.
For instance, here is a possible sequence of events:

1. First we have the set of values v1, v2 and v3 described above.
   A node reads it, it obtains values v1, v2 and v3 with context `[(node1, t2), (node2, t3)]`.

2. A node writes a value `v5` with context `[(node1, t1)]`, i.e. `v5` is only a
   successor of v1 but not of v2 or v3. Suppose node1 receives the write, it
   will generate a new timestamp `t5` larger than all of the timestamps it
   knows of, i.e. `t5 > t2`. We will now have:

```
(node1, tdiscard1'', (v2, t2), (v5, t5)) ; tdiscard1'' = t1 < t2 < t5
(node2, tdiscard2, (v3, t3))             ; tdiscard2 < t3
```

3. Now `v4` is written with context `[(node1, t2), (node2, t3)]`, and node2
   processes the query. It will generate `t4 > t3` and the state will become:

```
(node1, tdiscard1', (v5, t5))       ; tdiscard1' = t2 < t5
(node2, tdiscard2', (v4, t4))       ; tdiscard2' = t3
```

**Generic algorithm for handling insertions:** A certain node n handles the
InsertItem and is responsible for the correctness of this procedure.

1. Lock the key (or the whole table?) at this node to prevent concurrent updates of the value that would mess things up
2. Read current set of values
3. Generate a new timestamp that is larger than the largest timestamp for node n
4. Add the inserted value in the list of values of node n
5. Update the discard times to be the times set in the context, and accordingly discard overwritten values
6. Release lock
7. Propagate updated value to other nodes
8. Return to user when propagation achieved the write quorum (propagation to other nodes continues asynchronously)
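
As a sketch of these steps, here is a minimal in-memory Python version, assuming
a hypothetical per-node local store and lock; it only illustrates the ordering of
steps 1-6, not Garage's actual storage or RPC layers (steps 7-8 are out of scope here).

```python
import threading

# Hypothetical in-memory stand-in for the node's local data store:
# key -> { node_id: (tdiscard, [(value, timestamp), ...]) }
local_store = {}
store_lock = threading.Lock()

def handle_insert_item(node_id, key, new_value, context):
    """Steps 1-6 on the node n that handles the InsertItem query.
    `context` is the decoded causality token: a list of (node, timestamp) pairs."""
    with store_lock:                                     # 1. take the lock
        entries = local_store.setdefault(key, {})        # 2. read current values
        tdiscard, values = entries.get(node_id, (0, []))
        # 3. generate a timestamp larger than all timestamps known for node n
        t_new = max([tdiscard] + [t for _, t in values]) + 1
        # 4. add the inserted value to the list of values of node n
        entries[node_id] = (tdiscard, values + [(new_value, t_new)])
        # 5. raise discard times to those given in the context, and drop
        #    the values that are thereby obsoleted
        for node, t_ctx in context:
            td, vs = entries.get(node, (0, []))
            td = max(td, t_ctx)
            entries[node] = (td, [(v, t) for (v, t) in vs if t > td])
    # 6. lock released here; 7./8. the new state would then be propagated to
    #    the other nodes, returning once a write quorum has acknowledged it.
    return entries
```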

**Encoding of contexts:**

Contexts consist of a list of (node id, timestamp) pairs.
They are encoded in binary as follows:

```
checksum: u64, [ node: u64, timestamp: u64 ]*
```

The checksum is just the XOR of all of the node IDs and timestamps.

Once encoded in binary, contexts are written and transmitted in base64.
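
As an illustration, here is a possible encoding routine in Python following the
layout above (a sketch only: details such as endianness are assumptions, so it is
not guaranteed to be byte-compatible with Garage's implementation).

```python
import base64
import struct

def encode_context(pairs):
    """Encode a context, i.e. a list of (node_id, timestamp) u64 pairs,
    as `checksum, [node, timestamp]*` and return it base64-encoded.
    Little-endian u64 fields are an assumption of this sketch."""
    checksum = 0
    for node, ts in pairs:
        checksum ^= node ^ ts                 # XOR of all node IDs and timestamps
    raw = struct.pack("<Q", checksum)
    for node, ts in pairs:
        raw += struct.pack("<QQ", node, ts)
    return base64.b64encode(raw).decode()

# e.g. the context [(node1, t2), (node2, t3)] from the example above
print(encode_context([(0x1111, 42), (0x2222, 57)]))
```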


### Indexing

K2V keeps an index, a secondary data structure that is updated asynchronously
and that keeps track of the number of triplets stored for each partition key.
This allows easy listing of all of the partition keys for which triplets exist
in a bucket, as the partition key becomes the sort key in the index.

How indexing works:

- Each node keeps a local count of how many items it stores for each partition,
  in a local database tree that is updated atomically when an item is modified.
- These local counters are asynchronously stored in the index table, which is
  a regular Garage table spread across the network. Counters are stored as LWW values,
  so the final table has the following structure:

```
- pk: bucket
- sk: partition key for which we are counting
- v:  lwwmap (node id -> number of items)
```

The final number of items present in the partition can be estimated by taking
the maximum of the values (i.e. the value for the node that announces having
the most items for that partition). In most cases the values for different node
IDs should all be the same; more precisely, three node IDs should map to the
same non-zero value, and all other node IDs that are present are tombstones
that map to zeroes.  Note that we need to filter out values from nodes that are
no longer part of the cluster layout, as when nodes are removed they won't
necessarily have had the time to set their counters to zero.
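
A sketch of this client-side estimation (hypothetical helper: `counters` is the
LWW map of node ID to item count from an index entry, and `layout_nodes` the set
of node IDs currently in the cluster layout).

```python
def estimate_partition_count(counters, layout_nodes):
    """Estimate the number of items in a partition from per-node counters,
    ignoring counters from nodes that left the cluster layout."""
    live = [count for node, count in counters.items() if node in layout_nodes]
    return max(live, default=0)   # the node announcing the most items wins
```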

## Important details

**THIS SECTION CONTAINS A FEW WARNINGS ON THE K2V API WHICH ARE IMPORTANT
TO UNDERSTAND IN ORDER TO USE IT CORRECTLY.**

- **Internal server errors on updates do not mean that the update isn't stored.**
  K2V will return an internal server error when it cannot reach a quorum of nodes on
  which to save an updated value. However the value may still be stored on just one
  node, which will then propagate it to other nodes asynchronously via anti-entropy.

- **Batch operations are not transactions.** When calling InsertBatch or DeleteBatch,
  items may appear partially inserted/deleted while the operation is being processed.
  More importantly, if InsertBatch or DeleteBatch returns an internal server error,
  some of the items to be inserted/deleted might end up inserted/deleted on the server,
  while others may still have their old value.

- **Concurrent values are deduplicated.** When inserting a value for a key,
  Garage might internally end up
  storing the value several times if there are network errors. These values will end up as
  concurrent values for a key, with the same byte string (or `null` for a deletion).
  Garage fixes this by deduplicating concurrent values when they are returned to the
  user on read operations. Importantly, *Garage does not differentiate between duplicate
  concurrent values due to the user making the same call twice and those due to Garage having to
  do an internal retry*. This means that all duplicate concurrent values are deduplicated
  when an item is read: if the user concurrently inserts the same value twice, they will
  only read it once.

## API Endpoints

**Remark.** Example queries and responses here are given in JSON5 format
for clarity. However, the actual K2V API uses plain JSON, so all examples
and responses need to be adapted accordingly.

### Operations on single items

**ReadItem: `GET /<bucket>/<partition key>?sort_key=<sort key>`**


Query parameters:

| name       | default value | meaning                          |
|------------|---------------|----------------------------------|
| `sort_key` | **mandatory** | The sort key of the item to read |

Returns the item with specified partition key and sort key. Values can be
returned in either of two ways:

1. a JSON array of base64-encoded values, or `null`'s for tombstones, with
   header `Content-Type: application/json`

2. in the case where there are no concurrent values, the single present value
   can be returned directly as the response body (or an HTTP 204 NO CONTENT for
   a tombstone), with header `Content-Type: application/octet-stream`

The choice between return formats 1 and 2 is directed by the `Accept` HTTP header:

- if the `Accept` header is not present, format 1 is always used

- if `Accept` contains `application/json` but not `application/octet-stream`,
  format 1 is always used

- if `Accept` contains `application/octet-stream` but not `application/json`,
  format 2 is used when there is a single value, and an HTTP error 409 (HTTP
  409 CONFLICT) is returned in the case of multiple concurrent values
  (including concurrent tombstones)

- if `Accept` contains both, format 2 is used when there is a single value, and
  format 1 is used as a fallback in case of concurrent values

- if `Accept` contains neither, an HTTP 406 NOT ACCEPTABLE error is returned
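
A minimal client sketch of format 1, using only the Python standard library
(the `endpoint` base URL is a placeholder and the request signing required by
the real API is omitted).

```python
import base64
import json
import urllib.parse
import urllib.request

def read_item(endpoint, bucket, pk, sk):
    """ReadItem with `Accept: application/json` (format 1). Returns the
    causality token and the list of concurrent values as bytes (None for
    tombstones). Sketch only: authentication/signing is not shown."""
    url = "%s/%s/%s?%s" % (endpoint, bucket, urllib.parse.quote(pk),
                           urllib.parse.urlencode({"sort_key": sk}))
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(req) as resp:
        token = resp.headers["X-Garage-Causality-Token"]
        values = json.loads(resp.read())
    return token, [None if v is None else base64.b64decode(v) for v in values]
```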

Example query:

```
GET /my_bucket/mailboxes?sort_key=INBOX HTTP/1.1
```

Example response:

```json
HTTP/1.1 200 OK
X-Garage-Causality-Token: opaquetoken123
Content-Type: application/json

[
  "b64cryptoblob123",
  "b64cryptoblob'123"
]
```

Example response in case the item is a tombstone:

```
HTTP/1.1 200 OK
X-Garage-Causality-Token: opaquetoken999
Content-Type: application/json

[
  null
]
```

Example query 2:

```
GET /my_bucket/mailboxes?sort_key=INBOX HTTP/1.1
Accept: application/octet-stream
```

Example response if multiple concurrent versions exist:

```
HTTP/1.1 409 CONFLICT
X-Garage-Causality-Token: opaquetoken123
Content-Type: application/octet-stream
```

Example response in case of single value:

```
HTTP/1.1 200 OK
X-Garage-Causality-Token: opaquetoken123
Content-Type: application/octet-stream

cryptoblob123
```

Example response in case of a single value that is a tombstone:

```
HTTP/1.1 204 NO CONTENT
X-Garage-Causality-Token: opaquetoken123
Content-Type: application/octet-stream
```


**PollItem: `GET /<bucket>/<partition key>?sort_key=<sort key>&causality_token=<causality token>`**

This endpoint will block until a new value is written to a key.

The GET parameter `causality_token` should be set to the causality
token returned with the last read of the key, so that K2V knows
what values are concurrent or newer than the ones that the
client previously knew.

This endpoint returns the new value in the same format as ReadItem.
If no new value is written and the timeout elapses,
an HTTP 304 NOT MODIFIED is returned.

Query parameters:

| name              | default value | meaning                                                                    |
|-------------------|---------------|----------------------------------------------------------------------------|
| `sort_key`        | **mandatory** | The sort key of the item to read                                           |
| `causality_token` | **mandatory** | The causality token of the last known value or set of values               |
| `timeout`         | 300           | The timeout before 304 NOT MODIFIED is returned if the value isn't updated |

The timeout can be set to any number of seconds, with a maximum of 600 seconds (10 minutes).
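
A sketch of a blocking poll using this endpoint (standard library only; signing
omitted as above).

```python
import json
import urllib.error
import urllib.parse
import urllib.request

def poll_item(endpoint, bucket, pk, sk, causality_token, timeout=300):
    """PollItem: block until the item changes, returning (token, values) in the
    same format as ReadItem, or None if 304 NOT MODIFIED was returned.
    Sketch only: authentication/signing is not shown."""
    qs = urllib.parse.urlencode({"sort_key": sk,
                                 "causality_token": causality_token,
                                 "timeout": timeout})
    url = "%s/%s/%s?%s" % (endpoint, bucket, urllib.parse.quote(pk), qs)
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=timeout + 10) as resp:
            return resp.headers["X-Garage-Causality-Token"], json.loads(resp.read())
    except urllib.error.HTTPError as e:
        if e.code == 304:          # timeout elapsed without a new value
            return None
        raise
```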


**InsertItem: `PUT /<bucket>/<partition key>?sort_key=<sort_key>`**

Inserts a single item. This request does not use JSON; the body is sent directly as a binary blob.

To supersede previous values, the HTTP header `X-Garage-Causality-Token` should
be set to the causality token returned by a previous read on this key. This
header can be omitted for the first writes to the key.
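
A sketch of the corresponding write, to be combined with the `read_item` sketch
above for a read-modify-write cycle (signing omitted).

```python
import urllib.parse
import urllib.request

def insert_item(endpoint, bucket, pk, sk, value, causality_token=None):
    """InsertItem: PUT the raw binary value; pass the causality token from a
    previous read to supersede the values that were read.
    Sketch only: authentication/signing is not shown."""
    url = "%s/%s/%s?%s" % (endpoint, bucket, urllib.parse.quote(pk),
                           urllib.parse.urlencode({"sort_key": sk}))
    headers = {}
    if causality_token is not None:
        headers["X-Garage-Causality-Token"] = causality_token
    req = urllib.request.Request(url, data=value, method="PUT", headers=headers)
    urllib.request.urlopen(req).close()

# token, values = read_item(endpoint, "my_bucket", "mailboxes", "INBOX")
# insert_item(endpoint, "my_bucket", "mailboxes", "INBOX", b"new blob", token)
```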

Example query:

```
PUT /my_bucket/mailboxes?sort_key=INBOX HTTP/1.1
X-Garage-Causality-Token: opaquetoken123

myblobblahblahblah
```

Example response:

```
HTTP/1.1 204 No Content
```

**DeleteItem: `DELETE /<bucket>/<partition key>?sort_key=<sort_key>`**

Deletes a single item. The HTTP header `X-Garage-Causality-Token` must be set
to the causality token returned by a previous read on this key, to indicate
which versions of the value should be deleted. The request will not be processed if
`X-Garage-Causality-Token` is not set.

Example query:

```
DELETE /my_bucket/mailboxes?sort_key=INBOX HTTP/1.1
X-Garage-Causality-Token: opaquetoken123
```

Example response:

```
HTTP/1.1 204 NO CONTENT
```

### Operations on index

**ReadIndex: `GET /<bucket>?start=<start>&end=<end>&limit=<limit>`**

Lists all partition keys in the bucket for which some triplets exist, and gives
for each the number of triplets, total number of values (which might be bigger
than the number of triplets in case of conflicts), total number of bytes of
these values, and number of triplets that are in a state of conflict.
The values returned are an approximation of the true counts in the bucket,
as these values are asynchronously updated, and thus eventually consistent.

Query parameters:

| name      | default value | meaning                                                        |
|-----------|---------------|----------------------------------------------------------------|
| `prefix`  | `null`        | Restrict listing to partition keys that start with this prefix |
| `start`   | `null`        | First partition key to list, in lexicographical order          |
| `end`     | `null`        | Last partition key to list (excluded)                          |
| `limit`   | `null`        | Maximum number of partition keys to list                       |
| `reverse` | `false`       | Iterate in reverse lexicographical order                       |

The response consists of a JSON object that repeats the parameters of the query and gives the result (see below).

The listing starts at partition key `start`, or if not specified at the
smallest partition key that exists.  It returns partition keys in increasing
order, or decreasing order if `reverse` is set to `true`,
and stops when either of the following conditions is met:

1. if `end` is specified, the partition key `end` is reached or surpassed (if it
   is reached exactly, it is not included in the result)

2. if `limit` is specified, `limit` partition keys have been listed

3. no more partition keys are available to list

In case 2, and if there are more partition keys to list before condition 1
triggers, then in the result `more` is set to `true` and `nextStart` is set to
the first partition key that couldn't be listed due to the limit. In case 1
(the listing stopped because of the `end` parameter), `more` is not set
and the `nextStart` key is not specified.

Note that if `reverse` is set to `true`, `start` is the highest key
(in lexicographical order) for which values are returned.
This means that if an `end` is specified, it must be smaller than `start`,
otherwise no values will be returned.
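
A sketch of a complete listing using this pagination (signing omitted; the
response is assumed to be plain JSON, per the remark at the beginning of this
section).

```python
import json
import urllib.parse
import urllib.request

def read_index_all(endpoint, bucket, limit=1000):
    """ReadIndex: iterate over all partition keys of a bucket, following
    `nextStart` whenever `more` is true. Sketch only; signing is not shown."""
    start = None
    while True:
        params = {"limit": limit}
        if start is not None:
            params["start"] = start
        url = "%s/%s?%s" % (endpoint, bucket, urllib.parse.urlencode(params))
        with urllib.request.urlopen(url) as resp:
            page = json.loads(resp.read())
        for entry in page["partitionKeys"]:
            yield entry
        if not page.get("more"):
            return
        start = page["nextStart"]
```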

Example query:

```
GET /my_bucket HTTP/1.1
```

Example response:

```json
HTTP/1.1 200 OK

{
  prefix: null,
  start: null,
  end: null,
  limit: null,
  reverse: false,
  partitionKeys: [
    {
      pk: "keys",
      entries: 3043,
      conflicts: 0,
      values: 3043,
      bytes: 121720,
    },
    {
      pk: "mailbox:INBOX",
      entries: 42,
      conflicts: 1,
      values: 43,
      bytes: 142029,
    },
    {
      pk: "mailbox:Junk",
      entries: 2991,
      conflicts: 0,
      values: 2991,
      bytes: 12019322,
    },
    {
      pk: "mailbox:Trash",
      entries: 10,
      conflicts: 0,
      values: 10,
      bytes: 32401,
    },
    {
      pk: "mailboxes",
      entries: 3,
      conflicts: 0,
      values: 3,
      bytes: 3019,
    },
  ],
  more: false,
  nextStart: null,
}
```


### Operations on batches of items

**InsertBatch: `POST /<bucket>`**

Simple insertion and deletion of triplets. The body is just a list of items to
insert in the following format:
`{ pk: "<partition key>", sk: "<sort key>", ct: "<causality token>"|null, v: "<value>"|null }`.

The causality token should be the one returned in a previous read request (e.g.
by ReadItem or ReadBatch), to indicate that this write takes into account the
values that were returned from these reads, and supersedes them causally. If
the triplet is inserted for the first time, the causality token should be set to
`null`.

The value is expected to be a base64-encoded binary blob. The value `null` can
also be used to delete the triplet while preserving causality information: this
makes it possible to know whether a delete happened concurrently with an insert, in which
case both are preserved and returned on reads (see below).

Partition keys and sort keys are UTF-8 strings which are stored sorted by
lexicographical ordering of their binary representation.

Example query:

```json
POST /my_bucket HTTP/1.1

[
  { pk: "mailbox:INBOX", sk: "001892831", ct: "opaquetoken321", v: "b64cryptoblob321updated" },
  { pk: "mailbox:INBOX", sk: "001892912", ct: null, v: "b64cryptoblob444" },
  { pk: "mailbox:INBOX", sk: "001892932", ct: "opaquetoken654", v: null },
]
```

Example response:

```
HTTP/1.1 204 NO CONTENT
```


**ReadBatch: `POST /<bucket>?search`**, or alternatively<br/>
**ReadBatch: `SEARCH /<bucket>`**

Batch read of triplets in a bucket.

The request body is a JSON list of searches, each of which specifies a range of
items to get (to get single items, set `singleItem` to `true`). A search is a
JSON struct with the following fields:

| name            | default value | meaning                                                                                |
|-----------------|---------------|----------------------------------------------------------------------------------------|
| `partitionKey`  | **mandatory** | The partition key in which to search                                                   |
| `prefix`        | `null`        | Restrict items to list to those whose sort keys start with this prefix                 |
| `start`         | `null`        | The sort key of the first item to read                                                 |
| `end`           | `null`        | The sort key of the last item to read (excluded)                                       |
| `limit`         | `null`        | The maximum number of items to return                                                  |
| `reverse`       | `false`       | Iterate in reverse lexicographical order on sort keys                                  |
| `singleItem`    | `false`       | Whether to return only the item with sort key `start`                                  |
| `conflictsOnly` | `false`       | Whether to return only items that have several concurrent values                       |
| `tombstones`    | `false`       | Whether or not to return tombstone lines to indicate the presence of old deleted items |


For each of the searches, triplets are listed and returned separately. The
semantics of `prefix`, `start`, `end`, `limit` and `reverse` are the same as for ReadIndex. The
additional parameter `singleItem` allows retrieving a single item, whose sort key
is the one given in `start`. Parameters `conflictsOnly` and `tombstones`
control additional filters on the items that are returned.

The result is a list with one entry per search: for each search, a JSON object
specified similarly to the result of ReadIndex, but that lists triplets within
a partition key.

The format of returned tuples is as follows: `{ sk: "<sort key>", ct: "<causality
token>", v: ["<value1>", ...] }`, with the following fields:

- `sk` (sort key): any unicode string used as a sort key

- `ct` (causality token): an opaque token served by the server (generally
  base64-encoded) to be used in subsequent writes to this key

- `v` (list of values): each value is a binary blob, always base64-encoded;
  contains multiple items when concurrent values exist

- in case of concurrent update and deletion, a `null` is added to the list of concurrent values

- if the `tombstones` query parameter is set to `true`, tombstones are returned
  for items that have been deleted (this can be useful for inserting after an
  item that has been deleted, so that the insert is not considered
  concurrent with the delete). Tombstones are returned as tuples in the
  same format with only `null` values
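
A sketch of a ReadBatch call that decodes the returned values (signing omitted).

```python
import base64
import json
import urllib.request

def read_batch(endpoint, bucket, searches):
    """ReadBatch via POST /<bucket>?search. For each search, returns the list
    of (sort key, causality token, [values]) with values decoded from base64
    (None for tombstones / concurrent deletions). Sketch only; signing omitted."""
    req = urllib.request.Request("%s/%s?search" % (endpoint, bucket),
                                 data=json.dumps(searches).encode(),
                                 method="POST",
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        results = json.loads(resp.read())
    return [[(item["sk"], item["ct"],
              [None if v is None else base64.b64decode(v) for v in item["v"]])
             for item in res["items"]]
            for res in results]

# e.g. read_batch(endpoint, "my_bucket", [{"partitionKey": "mailboxes"}])
```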

Example query:

```json
POST /my_bucket?search HTTP/1.1

[
  {
    partitionKey: "mailboxes",
  },
  {
    partitionKey: "mailbox:INBOX",
    start: "001892831",
    limit: 3,
  },
  {
    partitionKey: "keys",
    start: "0",
    singleItem: true,
  },
]
```

Example associated response body:

```json
HTTP/1.1 200 OK

[
  {
    partitionKey: "mailboxes",
    prefix: null,
    start: null,
    end: null,
    limit: null,
    reverse: false,
    conflictsOnly: false,
    tombstones: false,
    singleItem: false,
    items: [
      { sk: "INBOX", ct: "opaquetoken123", v: ["b64cryptoblob123", "b64cryptoblob'123"] },
      { sk: "Trash", ct: "opaquetoken456", v: ["b64cryptoblob456"] },
      { sk: "Junk", ct: "opaquetoken789", v: ["b64cryptoblob789"] },
    ],
    more: false,
    nextStart: null,
  },
  {
    partitionKey: "mailbox::INBOX",
    prefix: null,
    start: "001892831",
    end: null,
    limit: 3,
    reverse: false,
    conflictsOnly: false,
    tombstones: false,
    singleItem: false,
    items: [
      { sk: "001892831", ct: "opaquetoken321", v: ["b64cryptoblob321"] },
      { sk: "001892832", ct: "opaquetoken654", v: ["b64cryptoblob654"] },
      { sk: "001892874", ct: "opaquetoken987", v: ["b64cryptoblob987"] },
    ],
    more: true,
    nextStart: "001892898",
  },
  {
    partitionKey: "keys",
    prefix: null,
    start: "0",
    end: null,
    conflictsOnly: false,
    tombstones: false,
    limit: null,
    reverse: false,
    singleItem: true,
    items: [
      { sk: "0", ct: "opaquetoken999", v: ["b64binarystuff999"] },
    ],
    more: false,
    nextStart: null,
  },
]
```



**DeleteBatch: `POST /<bucket>?delete`**

Batch deletion of triplets. The request format is the same as for `POST
/<bucket>?search` to indicate items or ranges of items, except that here they
are deleted instead of returned, but only the fields `partitionKey`, `prefix`, `start`,
`end`, and `singleItem` are supported. Causality information is not given by
the user: this request will internally list all triplets and write deletion
markers that supersede all of the versions that have been read.

This request returns, for each series of items to be deleted, the number of
matching items that have been found and deleted.

Example query:

```json
POST /my_bucket?delete HTTP/1.1

[
  {
    partitionKey: "mailbox:OldMailbox",
  },
  {
    partitionKey: "mailbox:INBOX",
    start: "0018928321",
    singleItem: true,
  },
]
```

Example response:

```json
HTTP/1.1 200 OK

[
  {
    partitionKey: "mailbox:OldMailbox",
    prefix: null,
    start: null,
    end: null,
    singleItem: false,
    deletedItems: 35,
  },
  {
    partitionKey: "mailbox:INBOX",
    prefix: null,
    start: "0018928321",
    end: null,
    singleItem: true,
    deletedItems: 1,
  },
]
```

**PollRange: `POST /<bucket>/<partition key>?poll_range`**, or alternatively<br/>
**PollRange: `SEARCH /<bucket>/<partition key>?poll_range`**

Polls a range of items for changes.

The query body is a JSON object consisting of the following fields:

| name            | default value | meaning                                                                                |
|-----------------|---------------|----------------------------------------------------------------------------------------|
| `prefix`        | `null`        | Restrict items to poll to those whose sort keys start with this prefix                 |
| `start`         | `null`        | The sort key of the first item to poll                                                 |
| `end`           | `null`        | The sort key of the last item to poll (excluded)                                       |
| `timeout`       | 300           | The timeout before 304 NOT MODIFIED is returned if no value in the range is updated    |
| `seenMarker`    | `null`        | An opaque string returned by a previous PollRange call, that represents items already seen |

The timeout can be set to any number of seconds, with a maximum of 600 seconds (10 minutes).

The response is either:

- An HTTP 304 NOT MODIFIED response with an empty body, if the timeout expired and no changes occurred

- An HTTP 200 response, indicating that some changes have occurred since the last PollRange call, in which case a JSON object is returned in the body with the following fields:

| name            | meaning                                                                                |
|-----------------|----------------------------------------------------------------------------------------|
| `seenMarker`    | An opaque string that represents items already seen for future PollRange calls         |
| `items`         | The list of items that have changed since the last PollRange call, in the same format as ReadBatch |

If no seen marker is known by the caller, it can do a PollRange call
without specifying `seenMarker`. In this case, the PollRange call will
complete immediately, and return the current content of the range (which
can be empty) and a seen marker to be used in further PollRange calls. This
is the only case in which PollRange might return an HTTP 200 with an empty
set of items.

A seen marker returned as a response to a PollRange query can be used for further PollRange
queries on the same range, or for PollRange queries in a subrange of the initial range.
It may not be used for PollRange queries on ranges larger than or outside of the initial range.
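
A sketch of a long-polling loop built on these rules (signing omitted).

```python
import json
import urllib.error
import urllib.parse
import urllib.request

def watch_range(endpoint, bucket, pk, prefix=None):
    """PollRange in a loop, yielding changed items as they appear and carrying
    the seen marker from one call to the next. Sketch only; signing omitted."""
    url = "%s/%s/%s?poll_range" % (endpoint, bucket, urllib.parse.quote(pk))
    seen_marker = None
    while True:
        body = {"timeout": 300}
        if prefix is not None:
            body["prefix"] = prefix
        if seen_marker is not None:
            body["seenMarker"] = seen_marker
        req = urllib.request.Request(url, data=json.dumps(body).encode(),
                                     method="POST",
                                     headers={"Content-Type": "application/json"})
        try:
            with urllib.request.urlopen(req, timeout=310) as resp:
                result = json.loads(resp.read())
        except urllib.error.HTTPError as e:
            if e.code == 304:      # timeout elapsed with no changes; poll again
                continue
            raise
        seen_marker = result["seenMarker"]
        for item in result["items"]:
            yield item
```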

Example query:

```json
SEARCH /my_bucket/mailbox:INBOX?poll_range HTTP/1.1

{
  "prefix": "0391.",
  "start": "0391.000001973107",
  "seenMarker": "opaquestring123",
}
```


Example response:

```json
HTTP/1.1 200 OK
Content-Type: application/json

{
  "seenMarker": "opaquestring456",
  "items": [
    { sk: "0391.000001973221", ct: "opaquetoken123", v: ["b64cryptoblob123", "b64cryptoblob'123"] },
    { sk: "0391.000001974191", ct: "opaquetoken456", v: ["b64cryptoblob456", "b64cryptoblob'456"] },
  ]
}
```


## Internals: causality tokens

The method used is based on DVVS (dotted version vector sets). See:

- the paper "Scalable and Accurate Causality Tracking for Eventually Consistent Data Stores"
- <https://github.com/ricardobcl/Dotted-Version-Vectors>

For DVVS to work, write operations (at each node) must take a lock on the data table.