From 7011b71fbd782e199417ce9afa44a8c220885b4a Mon Sep 17 00:00:00 2001
From: Alex Auvolat
Date: Tue, 18 Apr 2023 12:14:13 +0200
Subject: jepsen: wip

---
 script/jepsen.garage/README.md | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)
 create mode 100644 script/jepsen.garage/README.md

(limited to 'script/jepsen.garage/README.md')

diff --git a/script/jepsen.garage/README.md b/script/jepsen.garage/README.md
new file mode 100644
index 00000000..e1d1a555
--- /dev/null
+++ b/script/jepsen.garage/README.md
@@ -0,0 +1,22 @@
+# jepsen.garage
+
+A Clojure library designed to ... well, that part is up to you.
+
+## Usage
+
+FIXME
+
+## License
+
+Copyright © 2023 FIXME
+
+This program and the accompanying materials are made available under the
+terms of the Eclipse Public License 2.0 which is available at
+http://www.eclipse.org/legal/epl-2.0.
+
+This Source Code may also be made available under the following Secondary
+Licenses when the conditions for such availability set forth in the Eclipse
+Public License, v. 2.0 are satisfied: GNU General Public License as published by
+the Free Software Foundation, either version 2 of the License, or (at your
+option) any later version, with the GNU Classpath Exception which is available
+at https://www.gnu.org/software/classpath/license.html.
-- 
cgit v1.2.3

From dc5245ce65e6acc4c2b1f81dfdf38fc76fe06d3f Mon Sep 17 00:00:00 2001
From: Alex Auvolat
Date: Tue, 18 Apr 2023 17:47:53 +0200
Subject: even without nemesis, s3 get/put/delete is not linearizable (is this normal?)

---
 script/jepsen.garage/README.md | 34 ++++++++++++++++++++++------------
 1 file changed, 22 insertions(+), 12 deletions(-)

(limited to 'script/jepsen.garage/README.md')

diff --git a/script/jepsen.garage/README.md b/script/jepsen.garage/README.md
index e1d1a555..ed956830 100644
--- a/script/jepsen.garage/README.md
+++ b/script/jepsen.garage/README.md
@@ -1,22 +1,32 @@
 # jepsen.garage

-A Clojure library designed to ... well, that part is up to you.
+Jepsen checking of Garage consistency properties.

 ## Usage

-FIXME
+Requirements:
+
+- vagrant
+- VirtualBox, configured so that nodes can take an IP in a private network `192.168.56.0/24`
+- a user that can create VirtualBox VMs
+- leiningen
+- gnuplot
+
+Set up VMs:
+
+```
+vagrant up
+```
+
+Run tests:
+
+```
+lein run test --nodes-file nodes.vagrant
+```

 ## License

-Copyright © 2023 FIXME
+Copyright © 2023 Alex Auvolat

 This program and the accompanying materials are made available under the
-terms of the Eclipse Public License 2.0 which is available at
-http://www.eclipse.org/legal/epl-2.0.
-
-This Source Code may also be made available under the following Secondary
-Licenses when the conditions for such availability set forth in the Eclipse
-Public License, v. 2.0 are satisfied: GNU General Public License as published by
-the Free Software Foundation, either version 2 of the License, or (at your
-option) any later version, with the GNU Classpath Exception which is available
-at https://www.gnu.org/software/classpath/license.html.
+terms of the GNU General Public License v3.0.
-- 
cgit v1.2.3

From 80d7b7d8582171d7ecd0e7745893792d10dd3038 Mon Sep 17 00:00:00 2001
From: Alex Auvolat
Date: Wed, 19 Apr 2023 12:56:40 +0200
Subject: remove useless files

---
 script/jepsen.garage/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'script/jepsen.garage/README.md')

diff --git a/script/jepsen.garage/README.md b/script/jepsen.garage/README.md
index ed956830..460f0b9e 100644
--- a/script/jepsen.garage/README.md
+++ b/script/jepsen.garage/README.md
@@ -29,4 +29,4 @@ lein run test --nodes-file nodes.vagrant
 Copyright © 2023 Alex Auvolat

 This program and the accompanying materials are made available under the
-terms of the GNU General Public License v3.0.
+terms of the GNU Affero General Public License v3.0.
-- 
cgit v1.2.3

From 6eb26be548c08707b59473e6086f3f5eee89fe47 Mon Sep 17 00:00:00 2001
From: Alex Auvolat
Date: Wed, 19 Apr 2023 15:27:26 +0200
Subject: Add garage set test (this one works :p)

---
 script/jepsen.garage/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

(limited to 'script/jepsen.garage/README.md')

diff --git a/script/jepsen.garage/README.md b/script/jepsen.garage/README.md
index 460f0b9e..800dde94 100644
--- a/script/jepsen.garage/README.md
+++ b/script/jepsen.garage/README.md
@@ -18,10 +18,10 @@ Set up VMs:
 vagrant up
 ```

-Run tests:
+Run tests (this one should fail):

 ```
-lein run test --nodes-file nodes.vagrant
+lein run test --nodes-file nodes.vagrant --time-limit 64 --concurrency 50 --rate 50 --workload reg
 ```

 ## License
-- 
cgit v1.2.3

From 55eb4e87c42bf0da88186eb5b2fe1fbbbdf9ed43 Mon Sep 17 00:00:00 2001
From: Alex Auvolat
Date: Wed, 19 Apr 2023 16:16:34 +0200
Subject: set tests with independant tests together

---
 script/jepsen.garage/README.md | 7 +++++++
 1 file changed, 7 insertions(+)

(limited to 'script/jepsen.garage/README.md')

diff --git a/script/jepsen.garage/README.md b/script/jepsen.garage/README.md
index 800dde94..1bba32ec 100644
--- a/script/jepsen.garage/README.md
+++ b/script/jepsen.garage/README.md
@@ -24,6 +24,13 @@ Run tests (this one should fail):
 lein run test --nodes-file nodes.vagrant --time-limit 64 --concurrency 50 --rate 50 --workload reg
 ```

+These ones are working:
+
+```
+lein run test --nodes-file nodes.vagrant --time-limit 64 --rate 50 --concurrency 50 --workload set1
+lein run test --nodes-file nodes.vagrant --time-limit 64 --rate 50 --concurrency 50 --workload set2
+```
+
 ## License

 Copyright © 2023 Alex Auvolat
-- 
cgit v1.2.3

From 74e50eddddf319ce1a32a9b57b3825ea40db3a6c Mon Sep 17 00:00:00 2001
From: Alex Auvolat
Date: Thu, 19 Oct 2023 14:34:19 +0200
Subject: jepsen: refactoring

---
 script/jepsen.garage/README.md | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

(limited to 'script/jepsen.garage/README.md')

diff --git a/script/jepsen.garage/README.md b/script/jepsen.garage/README.md
index 1bba32ec..5cb98e4d 100644
--- a/script/jepsen.garage/README.md
+++ b/script/jepsen.garage/README.md
@@ -31,6 +31,39 @@ lein run test --nodes-file nodes.vagrant --time-limit 64 --rate 50 --concurrenc
 lein run test --nodes-file nodes.vagrant --time-limit 64 --rate 50 --concurrency 50 --workload set2
 ```

+## Results
+
+**Register linear, without timestamp patch**
+
+Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 20 --concurrency 20 --workload reg --ops-per-key 100`
+
+Results: fails with a simple clock-scramble nemesis.
+
+Explanation: without the timestamp patch, nodes will create objects using their
+local clock only as a timestamp, so the ordering will be all over the place if
+clocks are scrambled.
+
+**Register linear, with timestamp patch**
+
+Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 20 --concurrency 20 --workload reg --ops-per-key 100 -I`
+
+Results:
+
+- No failure with clock-scramble nemesis
+- Fails with clock-scramble nemesis + partition nemesis
+
+Explanation: S3 objects are not meant to behave like linearizable registers. TODO explain using a counter-example
+
+**Read-after-write CRDT register model**: TODO: determine the expected semantics of such a register, code a checker and show that results are correct
+
+**Set, basic test**
+
+Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 20 --concurrency 20 --workload set1 --ops-per-key 100`
+
+Results:
+
+- ListObjects returns objects not within prefix????
+
 ## License

 Copyright © 2023 Alex Auvolat
-- 
cgit v1.2.3

From da8b1707489b70c25395ee49383ecbbd8c9f9404 Mon Sep 17 00:00:00 2001
From: Alex Auvolat
Date: Thu, 19 Oct 2023 16:45:24 +0200
Subject: jepsen: investigating listobjects error

---
 script/jepsen.garage/README.md | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

(limited to 'script/jepsen.garage/README.md')

diff --git a/script/jepsen.garage/README.md b/script/jepsen.garage/README.md
index 5cb98e4d..f6fb3a59 100644
--- a/script/jepsen.garage/README.md
+++ b/script/jepsen.garage/README.md
@@ -62,7 +62,16 @@ Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 20 --

 Results:

-- ListObjects returns objects not within prefix????
+- ListObjects returns objects not within prefix???? -> BAD, definitely a bug, but maybe it's in the instrumentation code?
+
+In `store/garage set1/20231019T163358.615+0200`:
+
+```
+INFO [2023-10-19 16:35:20,977] clojure-agent-send-off-pool-207 - jepsen.garage.set list results for prefix set20/ : (set13/0 set13/1 set13/10 set13/11 set13/12 set13/13 set13/14 set13/15 set13/16 set13/17 set13/18 set13/19 set13/2 set13/20 set13/21 set13/22 set13/23 set13/24 set13/25 set13/26 set13/27 set13/28 set13/29 set13/3 set13/30 set13/31 set13/32 set13/33 set13/34 set13/35 set13/36 set13/37 set13/38 set13/39 set13/4 set13/40 set13/41 set13/42 set13/43 set13/44 set13/45 set13/46 set13/47 set13/48 set13/49 set13/5 set13/50 set13/51 set13/52 set13/53 set13/54 set13/55 set13/56 set13/57 set13/58 set13/59 set13/6 set13/60 set13/61 set13/62 set13/63 set13/64 set13/65 set13/66 set13/67 set13/68 set13/69 set13/7 set13/70 set13/71 set13/72 set13/73 set13/74 set13/75 set13/76 set13/77 set13/78 set13/79 set13/8 set13/80 set13/81 set13/82 set13/83 set13/84 set13/85 set13/86 set13/87 set13/88 set13/89 set13/9 set13/90 set13/91 set13/92 set13/93 set13/94 set13/95 set13/96 set13/97 set13/98 set13/99) (node: http://192.168.56.25:3900 )
+
+```
+
+- Sometimes ListObjects returns an empty list???? -> BAD, quorums should ensure this doesn't happen

 ## License

-- 
cgit v1.2.3

From ef662822c9e48ff7cfd9300590617e089c0a9498 Mon Sep 17 00:00:00 2001
From: Alex Auvolat
Date: Thu, 19 Oct 2023 23:40:55 +0200
Subject: jepsen: fix the list-objects call (?)

---
 script/jepsen.garage/README.md | 37 +++++++++++++++++++++++++++++--------
 1 file changed, 29 insertions(+), 8 deletions(-)

(limited to 'script/jepsen.garage/README.md')

diff --git a/script/jepsen.garage/README.md b/script/jepsen.garage/README.md
index f6fb3a59..8dcd3766 100644
--- a/script/jepsen.garage/README.md
+++ b/script/jepsen.garage/README.md
@@ -33,7 +33,7 @@ lein run test --nodes-file nodes.vagrant --time-limit 64 --rate 50 --concurrenc

 ## Results

-**Register linear, without timestamp patch**
+### Register linear, without timestamp patch

 Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 20 --concurrency 20 --workload reg --ops-per-key 100`

@@ -43,7 +43,7 @@ Explanation: without the timestamp patch, nodes will create objects using their
 local clock only as a timestamp, so the ordering will be all over the place if
 clocks are scrambled.

-**Register linear, with timestamp patch**
+### Register linear, with timestamp patch

 Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 20 --concurrency 20 --workload reg --ops-per-key 100 -I`

@@ -54,24 +54,45 @@ Results:

 Explanation: S3 objects are not meant to behave like linearizable registers. TODO explain using a counter-example

-**Read-after-write CRDT register model**: TODO: determine the expected semantics of such a register, code a checker and show that results are correct
+### Read-after-write CRDT register model

-**Set, basic test**
+TODO: determine the expected semantics of such a register, code a checker and show that results are correct

-Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 20 --concurrency 20 --workload set1 --ops-per-key 100`
+### Set, basic test (write some items, then read)
+
+Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 100 --concurrency 100 --workload set1 --ops-per-key 100`

 Results:

-- ListObjects returns objects not within prefix???? -> BAD, definitely a bug, but maybe it's in the instrumentation code?
+- For now, no failures with clock-scramble nemesis + partition nemesis
+
+### Set, continuous test (interspersed reads and writes)
+
+TODO
+
+TODO: nemesis that reconfigures the cluster with a different subset of nodes, to have requests that occur during a resync period.
+
+
+## Investigating (and fixing) wierd behavior
+
+### Segfaults
+
+They are due to the download being interrupted in the middle (^C during first launch on clean VMs), the `garage` binary is truncated.
+Add `:force?` to the `cached-wget!` call in `daemon.clj` to re-download the binary.
+
+### In `jepsen.garage`: prefix wierdness

 In `store/garage set1/20231019T163358.615+0200`:

 ```
 INFO [2023-10-19 16:35:20,977] clojure-agent-send-off-pool-207 - jepsen.garage.set list results for prefix set20/ : (set13/0 set13/1 set13/10 set13/11 set13/12 set13/13 set13/14 set13/15 set13/16 set13/17 set13/18 set13/19 set13/2 set13/20 set13/21 set13/22 set13/23 set13/24 set13/25 set13/26 set13/27 set13/28 set13/29 set13/3 set13/30 set13/31 set13/32 set13/33 set13/34 set13/35 set13/36 set13/37 set13/38 set13/39 set13/4 set13/40 set13/41 set13/42 set13/43 set13/44 set13/45 set13/46 set13/47 set13/48 set13/49 set13/5 set13/50 set13/51 set13/52 set13/53 set13/54 set13/55 set13/56 set13/57 set13/58 set13/59 set13/6 set13/60 set13/61 set13/62 set13/63 set13/64 set13/65 set13/66 set13/67 set13/68 set13/69 set13/7 set13/70 set13/71 set13/72 set13/73 set13/74 set13/75 set13/76 set13/77 set13/78 set13/79 set13/8 set13/80 set13/81 set13/82 set13/83 set13/84 set13/85 set13/86 set13/87 set13/88 set13/89 set13/9 set13/90 set13/91 set13/92 set13/93 set13/94 set13/95 set13/96 set13/97 set13/98 set13/99) (node: http://192.168.56.25:3900 )
-
 ```

-- Sometimes ListObjects returns an empty list???? -> BAD, quorums should ensure this doesn't happen
+After inspecting, the actual S3 call made was with prefix "set13/", so at least this is not an error in Garage itself but in the jepsen code.
+
+Finally found out that this was due to closures not correctly capturing their context in the list function in s3api.clj (wtf clojure?)
+Not sure exactly where it came from but it seems to have been fixed by making list-inner a separate function and not a sub-function,
+and passing all values that were previously in the context (creds and prefix) as additional arguments.

 ## License

-- 
cgit v1.2.3

From 4b93ce179a3777c8461f3b5843dc3802bddc739c Mon Sep 17 00:00:00 2001
From: Alex Auvolat
Date: Fri, 20 Oct 2023 12:56:45 +0200
Subject: jepsen: errors in reg2 workload under investigation

---
 script/jepsen.garage/README.md | 25 +++++++++++++++++++------
 1 file changed, 19 insertions(+), 6 deletions(-)

(limited to 'script/jepsen.garage/README.md')

diff --git a/script/jepsen.garage/README.md b/script/jepsen.garage/README.md
index 8dcd3766..762901fe 100644
--- a/script/jepsen.garage/README.md
+++ b/script/jepsen.garage/README.md
@@ -35,7 +35,7 @@ lein run test --nodes-file nodes.vagrant --time-limit 64 --rate 50 --concurrenc

 ### Register linear, without timestamp patch

-Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 20 --concurrency 20 --workload reg --ops-per-key 100`
+Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 20 --concurrency 20 --workload reg1 --ops-per-key 100`

 Results: fails with a simple clock-scramble nemesis.

@@ -45,7 +45,7 @@ clocks are scrambled.

 ### Register linear, with timestamp patch

-Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 20 --concurrency 20 --workload reg --ops-per-key 100 -I`
+Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 20 --concurrency 20 --workload reg1 --ops-per-key 100 -I`

 Results:

@@ -54,9 +54,23 @@ Results:

 Explanation: S3 objects are not meant to behave like linearizable registers. TODO explain using a counter-example

-### Read-after-write CRDT register model
+### Read-after-write CRDT register model, without timestamp patch
+
+Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 100 --concurrency 100 --workload reg2 --ops-per-key 100`
+
+Results: fails with a simple clock-scramble nemesis.
+
+Explanation: old values are not overwritten correctly when their timestamps are in the future.
+
+### Read-after-write CRDT register model, with timestamp patch
+
+Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 100 --concurrency 100 --workload reg2 --ops-per-key 100 -I`
+
+Results:
+
+- Failures with clock-scramble nemesis + partition nemesis ???? TODO INVESTIGATE
+- TODO: layout reconfiguration nemesis

-TODO: determine the expected semantics of such a register, code a checker and show that results are correct

 ### Set, basic test (write some items, then read)

@@ -65,13 +79,12 @@ Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 100 -
 Results:

 - For now, no failures with clock-scramble nemesis + partition nemesis
+- TODO: layout reconfiguration nemesis

 ### Set, continuous test (interspersed reads and writes)

 TODO

-TODO: nemesis that reconfigures the cluster with a different subset of nodes, to have requests that occur during a resync period.
-

 ## Investigating (and fixing) wierd behavior

-- 
cgit v1.2.3

From d148b83d4f440dc79b2ed08eaa171aca0e2037b0 Mon Sep 17 00:00:00 2001
From: Alex Auvolat
Date: Fri, 20 Oct 2023 13:36:48 +0200
Subject: jepsen: reg2 failure seems to happen only with deleteobject

---
 script/jepsen.garage/README.md | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

(limited to 'script/jepsen.garage/README.md')

diff --git a/script/jepsen.garage/README.md b/script/jepsen.garage/README.md
index 762901fe..da6f0b77 100644
--- a/script/jepsen.garage/README.md
+++ b/script/jepsen.garage/README.md
@@ -69,6 +69,8 @@ Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 100 -
 Results:

 - Failures with clock-scramble nemesis + partition nemesis ???? TODO INVESTIGATE
+  -> the issue seems to be only after DeleteObject (deletions are not always taken into account),
+  the issue does not appear if we are using only PutObject with an actual object content
 - TODO: layout reconfiguration nemesis


@@ -86,7 +88,7 @@ Results:
 TODO


-## Investigating (and fixing) wierd behavior
+## Investigating (and fixing) errors

 ### Segfaults

@@ -107,6 +109,22 @@ Finally found out that this was due to closures not correctly capturing their co
 Not sure exactly where it came from but it seems to have been fixed by making list-inner a separate function and not a sub-function,
 and passing all values that were previously in the context (creds and prefix) as additional arguments.

+### `reg2` test inconsistency, even with timestamp fix
+
+The reg2 test is our custom checker for CRDT read-after-write on individual object keys, acting as registers which can be updated.
+The test fails without the timestamp fix, which is expected as the clock scrambler will prevent nodes from having a correct ordering of objects.
+
+With the timestamp fix, the happenned-before relationship should at least be respected, meaning that when a PutObject call starts
+after another PutObject call has ended, the second call should overwrite the value of the first call, and that value should not be
+readable by future GetObject calls.
+However, we observed inconsistencies even with the timestamp fix.
+
+The inconsistencies seemed to always happenned after writing a nil value, which translates to a DeleteObject call
+instead of a PutObject. By removing the possibility of writing nil values, therefore only doing
+PutObject calls, the issue disappears. There is therefore an issue to fix in DeleteObject.
+
+
+
 ## License

 Copyright © 2023 Alex Auvolat
-- 
cgit v1.2.3

From f5b09727815523a1bd4ba5f62d892b2b45b5bed6 Mon Sep 17 00:00:00 2001
From: Alex Auvolat
Date: Fri, 20 Oct 2023 15:00:10 +0200
Subject: jepsen: register crdt read-after-write is fixed with deleteobject patch

---
 script/jepsen.garage/README.md | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

(limited to 'script/jepsen.garage/README.md')

diff --git a/script/jepsen.garage/README.md b/script/jepsen.garage/README.md
index da6f0b77..4c3c70b3 100644
--- a/script/jepsen.garage/README.md
+++ b/script/jepsen.garage/README.md
@@ -45,7 +45,7 @@ clocks are scrambled.

 ### Register linear, with timestamp patch

-Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 20 --concurrency 20 --workload reg1 --ops-per-key 100 -I`
+Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 20 --concurrency 20 --workload reg1 --ops-per-key 100 --patch tsfix1`

 Results:

@@ -62,15 +62,13 @@ Results: fails with a simple clock-scramble nemesis.

 Explanation: old values are not overwritten correctly when their timestamps are in the future.

-### Read-after-write CRDT register model, with timestamp patch
+### Read-after-write CRDT register model, with timestamp patch (v2 with DeleteObject fix as well)

-Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 100 --concurrency 100 --workload reg2 --ops-per-key 100 -I`
+Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 100 --concurrency 100 --workload reg2 --ops-per-key 100 --patch tsfix2`

 Results:

-- Failures with clock-scramble nemesis + partition nemesis ???? TODO INVESTIGATE
-  -> the issue seems to be only after DeleteObject (deletions are not always taken into account),
-  the issue does not appear if we are using only PutObject with an actual object content
+- No failures with clock-scramble nemesis + partition nemesis
 - TODO: layout reconfiguration nemesis


@@ -123,6 +121,7 @@ The inconsistencies seemed to always happenned after writing a nil value, which
 instead of a PutObject. By removing the possibility of writing nil values, therefore only doing
 PutObject calls, the issue disappears. There is therefore an issue to fix in DeleteObject.

+The issue in DeleteObject seems to have been fixed by commit `c82d91c6bccf307186332b6c5c6fc0b128b1b2b1`


 ## License
-- 
cgit v1.2.3

From fb6c9a1243bd561d2a0de6b49c8debf37d566473 Mon Sep 17 00:00:00 2001
From: Alex Auvolat
Date: Fri, 20 Oct 2023 15:55:09 +0200
Subject: jepsen: update readme

---
 script/jepsen.garage/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

(limited to 'script/jepsen.garage/README.md')

diff --git a/script/jepsen.garage/README.md b/script/jepsen.garage/README.md
index 4c3c70b3..684bce87 100644
--- a/script/jepsen.garage/README.md
+++ b/script/jepsen.garage/README.md
@@ -69,7 +69,7 @@ Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 100 -
 Results:

 - No failures with clock-scramble nemesis + partition nemesis
-- TODO: layout reconfiguration nemesis
+- Fails with layout reconfiguration nemesis (TODO: test more and investigate)


 ### Set, basic test (write some items, then read)
@@ -79,7 +79,7 @@ Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 100 -
 Results:

 - For now, no failures with clock-scramble nemesis + partition nemesis
-- TODO: layout reconfiguration nemesis
+- TODO: layout reconfiguration nemesis (does not fail yet! but it should)

 ### Set, continuous test (interspersed reads and writes)

-- 
cgit v1.2.3

From d2c365767b0a4cb70dcbb1d20b75f41e0f9c20c8 Mon Sep 17 00:00:00 2001
From: Alex Auvolat
Date: Tue, 24 Oct 2023 11:39:45 +0200
Subject: jepsen: more testing

---
 script/jepsen.garage/README.md | 71 ++++++++++++++++++++++++++----------------
 1 file changed, 45 insertions(+), 26 deletions(-)

(limited to 'script/jepsen.garage/README.md')

diff --git a/script/jepsen.garage/README.md b/script/jepsen.garage/README.md
index 684bce87..06379d25 100644
--- a/script/jepsen.garage/README.md
+++ b/script/jepsen.garage/README.md
@@ -35,55 +35,74 @@ lein run test --nodes-file nodes.vagrant --time-limit 64 --rate 50 --concurrenc

 ### Register linear, without timestamp patch

-Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 20 --concurrency 20 --workload reg1 --ops-per-key 100`
+Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 100 --concurrency 20 --workload reg1 --ops-per-key 100`

-Results: fails with a simple clock-scramble nemesis.
+Results without timestamp patch:

-Explanation: without the timestamp patch, nodes will create objects using their
-local clock only as a timestamp, so the ordering will be all over the place if
-clocks are scrambled.
+- Fails with a simple clock-scramble nemesis (`--scenario c`).
+  Explanation: without the timestamp patch, nodes will create objects using their
+  local clock only as a timestamp, so the ordering will be all over the place if
+  clocks are scrambled.

-### Register linear, with timestamp patch
+Results with timestamp patch (`--patch tsfix2`):

-Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 20 --concurrency 20 --workload reg1 --ops-per-key 100 --patch tsfix1`
+- No failure with clock-scramble nemesis

-Results:
+- Fails with clock-scramble nemesis + partition nemesis (`--scenario cp`).

-- No failure with clock-scramble nemesis
-- Fails with clock-scramble nemesis + partition nemesis
+**This test is expected to fail.**
+Indeed, S3 objects are not meant to behave like linearizable registers.
+TODO explain using a counter-example

-Explanation: S3 objects are not meant to behave like linearizable registers. TODO explain using a counter-example

-### Read-after-write CRDT register model, without timestamp patch
+### Read-after-write CRDT register model

 Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 100 --concurrency 100 --workload reg2 --ops-per-key 100`

-Results: fails with a simple clock-scramble nemesis.
+Results without timestamp patch:

-Explanation: old values are not overwritten correctly when their timestamps are in the future.
+- Fails with a simple clock-scramble nemesis (`--scenario c`).
+  Explanation: old values are not overwritten correctly when their timestamps are in the future.

-### Read-after-write CRDT register model, with timestamp patch (v2 with DeleteObject fix as well)
+Results with timestamp patch (`--patch tsfix2`):

-Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 100 --concurrency 100 --workload reg2 --ops-per-key 100 --patch tsfix2`
+- No failures with clock-scramble nemesis + partition nemesis (`--scenario cp`).
+  This proves that `tsfix2` (PR#543) does improve consistency.

-Results:
-
-- No failures with clock-scramble nemesis + partition nemesis
-- Fails with layout reconfiguration nemesis (TODO: test more and investigate)
+- **Fails with layout reconfiguration nemesis** (`--scenario r`)
+  (TODO: note down the run id of a failed run)
+  (TODO: test more and investigate).
+  This is the failure mode we are looking for and trying to fix for NLnet task 3.


 ### Set, basic test (write some items, then read)

-Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 100 --concurrency 100 --workload set1 --ops-per-key 100`
+Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 100 --concurrency 100 --workload set1 --ops-per-key 100 --patch tsfix2`

 Results:

-- For now, no failures with clock-scramble nemesis + partition nemesis
-- TODO: layout reconfiguration nemesis (does not fail yet! but it should)
+- For now, no failures with clock-scramble nemesis + partition nemesis -> TODO long test run
+
+- Failures were not yet achieved with only the layout reconfiguration nemesis, although they should be.
+
+- **Fails with partition + layout reconfiguration nemesis** (`--scenario pr`)
+  (TODO: note down the run id of a failed run)
+  (TODO: test more and investigate).
+  This is the failure mode we are looking for and trying to fix for NLnet task 3.
+

 ### Set, continuous test (interspersed reads and writes)

-TODO
+Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 100 --concurrency 100 --workload set2 --ops-per-key 100 --patch tsfix2`
+
+Results:
+
+- For now, no failures with clock-scramble nemesis + partition nemesis -> TODO long test run
+
+- Failures were not yet achieved with only the layout reconfiguration nemesis, although they should be.
+
+- TODO: failures should be achieved with `--scenario pr`? Even with 4 or 5 consecutive test runs, no failures were achieved, why?
+  (TODO: note down the run id of a failed run)


 ## Investigating (and fixing) errors
@@ -112,7 +131,7 @@ and passing all values that were previously in the context (creds and prefix) as
 The reg2 test is our custom checker for CRDT read-after-write on individual object keys, acting as registers which can be updated.
 The test fails without the timestamp fix, which is expected as the clock scrambler will prevent nodes from having a correct ordering of objects.

-With the timestamp fix, the happenned-before relationship should at least be respected, meaning that when a PutObject call starts
+With the timestamp fix (`--patch tsfix1`), the happenned-before relationship should at least be respected, meaning that when a PutObject call starts
 after another PutObject call has ended, the second call should overwrite the value of the first call, and that value should not be
 readable by future GetObject calls.
 However, we observed inconsistencies even with the timestamp fix.
@@ -121,7 +140,7 @@ The inconsistencies seemed to always happenned after writing a nil value, which
 instead of a PutObject. By removing the possibility of writing nil values, therefore only doing
 PutObject calls, the issue disappears. There is therefore an issue to fix in DeleteObject.

-The issue in DeleteObject seems to have been fixed by commit `c82d91c6bccf307186332b6c5c6fc0b128b1b2b1`
+The issue in DeleteObject seems to have been fixed by commit `c82d91c6bccf307186332b6c5c6fc0b128b1b2b1`, which can be used using `--patch tsfix2`.


 ## License
-- 
cgit v1.2.3

From d13bde5e26098313e789dd3793368a635cf1cc16 Mon Sep 17 00:00:00 2001
From: Alex Auvolat
Date: Tue, 24 Oct 2023 15:44:05 +0200
Subject: jepsen: set1 and set2 don't fail anymore ??

---
 script/jepsen.garage/README.md | 20 +++++++++-----------
 1 file changed, 9 insertions(+), 11 deletions(-)

(limited to 'script/jepsen.garage/README.md')

diff --git a/script/jepsen.garage/README.md b/script/jepsen.garage/README.md
index 06379d25..e1dc6953 100644
--- a/script/jepsen.garage/README.md
+++ b/script/jepsen.garage/README.md
@@ -69,9 +69,9 @@ Results with timestamp patch (`--patch tsfix2`):
 - No failures with clock-scramble nemesis + partition nemesis (`--scenario cp`).
   This proves that `tsfix2` (PR#543) does improve consistency.

-- **Fails with layout reconfiguration nemesis** (`--scenario r`)
-  (TODO: note down the run id of a failed run)
-  (TODO: test more and investigate).
+- **Fails with layout reconfiguration nemesis** (`--scenario r`).
+  Example of a failed run: `garage reg2/20231024T120806.899+0200`.
+  TODO: investigate.
   This is the failure mode we are looking for and trying to fix for NLnet task 3.


@@ -83,12 +83,11 @@ Results:

 - For now, no failures with clock-scramble nemesis + partition nemesis -> TODO long test run

-- Failures were not yet achieved with only the layout reconfiguration nemesis, although they should be.
+- Does not seem to fail with only the layout reconfiguation nemesis (>20 runs), although theoretically it could

-- **Fails with partition + layout reconfiguration nemesis** (`--scenario pr`)
-  (TODO: note down the run id of a failed run)
-  (TODO: test more and investigate).
-  This is the failure mode we are looking for and trying to fix for NLnet task 3.
+- Does not seem to fail with the layout reconfiguation + partition nemesis (<10 runs), although theoretically it could
+
+TODO: make it fail!!!


 ### Set, continuous test (interspersed reads and writes)
@@ -99,10 +98,9 @@ Results:

 - For now, no failures with clock-scramble nemesis + partition nemesis -> TODO long test run

-- Failures were not yet achieved with only the layout reconfiguration nemesis, although they should be.
+- Does not seem to fail with the clock scrambler + partition + layout reconfiguation nemesis (>10 runs), although theoretically it could

-- TODO: failures should be achieved with `--scenario pr`? Even with 4 or 5 consecutive test runs, no failures were achieved, why?
-  (TODO: note down the run id of a failed run)
+TODO: make it fail!!!
 ## Investigating (and fixing) errors
--
cgit v1.2.3

From 4fa2646a75ed9b4823bf36ae6218a18cca11c471 Mon Sep 17 00:00:00 2001
From: Alex Auvolat
Date: Tue, 24 Oct 2023 17:45:22 +0200
Subject: jepsen: got a failure with set1
---
 script/jepsen.garage/README.md | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)
(limited to 'script/jepsen.garage/README.md')
diff --git a/script/jepsen.garage/README.md b/script/jepsen.garage/README.md
index e1dc6953..5d407b6a 100644
--- a/script/jepsen.garage/README.md
+++ b/script/jepsen.garage/README.md
@@ -77,17 +77,16 @@ Results with timestamp patch (`--patch tsfix2`):
 ### Set, basic test (write some items, then read)
-Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 100 --concurrency 100 --workload set1 --ops-per-key 100 --patch tsfix2`
+Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 200 --concurrency 200 --workload set1 --ops-per-key 100 --patch tsfix2`
 Results:
 - For now, no failures with clock-scramble nemesis + partition nemesis -> TODO long test run
-- Does not seem to fail with only the layout reconfiguation nemesis (>20 runs), although theoretically it could
+- Does not seem to fail with only the layout reconfiguation nemesis (<10 runs), although theoretically it could
-- Does not seem to fail with the layout reconfiguation + partition nemesis (<10 runs), although theoretically it could
-
-TODO: make it fail!!!
+- **Fails with the partition + layout reconfiguration nemesis** (`--scenario pr`).
+  EXample of a failed run: `garage set1/20231024T172214.488+0200` (1 failure in 4 runs).
 ### Set, continuous test (interspersed reads and writes)
--
cgit v1.2.3

From db921cc05f8bcfccd0d0ba1d90b6dcd77f06dcdd Mon Sep 17 00:00:00 2001
From: Alex Auvolat
Date: Wed, 25 Oct 2023 11:41:34 +0200
Subject: jepsen: reconfigure nemesis + add db nemesis
---
 script/jepsen.garage/README.md | 2 ++
 1 file changed, 2 insertions(+)
(limited to 'script/jepsen.garage/README.md')
diff --git a/script/jepsen.garage/README.md b/script/jepsen.garage/README.md
index 5d407b6a..ced8ebb5 100644
--- a/script/jepsen.garage/README.md
+++ b/script/jepsen.garage/README.md
@@ -97,6 +97,8 @@ Results:
 - For now, no failures with clock-scramble nemesis + partition nemesis -> TODO long test run
+- Does not seem to fail with partition + layout reconfiguration nemesis (>100 runs)
+
 - Does not seem to fail with the clock scrambler + partition + layout reconfiguation nemesis (>10 runs), although theoretically it could
 TODO: make it fail!!!
--
cgit v1.2.3

From fd85010a403775bbb18030ae2d9d3689b34f3e8a Mon Sep 17 00:00:00 2001
From: Alex Auvolat
Date: Wed, 25 Oct 2023 12:13:27 +0200
Subject: jepsen: failures with set2 test in --scenario r
---
 script/jepsen.garage/README.md | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)
(limited to 'script/jepsen.garage/README.md')
diff --git a/script/jepsen.garage/README.md b/script/jepsen.garage/README.md
index ced8ebb5..5e50a0f4 100644
--- a/script/jepsen.garage/README.md
+++ b/script/jepsen.garage/README.md
@@ -86,7 +86,7 @@ Results:
 - Does not seem to fail with only the layout reconfiguation nemesis (<10 runs), although theoretically it could
 - **Fails with the partition + layout reconfiguration nemesis** (`--scenario pr`).
-  EXample of a failed run: `garage set1/20231024T172214.488+0200` (1 failure in 4 runs).
+  Example of a failed run: `garage set1/20231024T172214.488+0200` (1 failure in 4 runs).
 ### Set, continuous test (interspersed reads and writes)
@@ -97,11 +97,10 @@ Results:
 - For now, no failures with clock-scramble nemesis + partition nemesis -> TODO long test run
-- Does not seem to fail with partition + layout reconfiguration nemesis (>100 runs)
-
-- Does not seem to fail with the clock scrambler + partition + layout reconfiguation nemesis (>10 runs), although theoretically it could
-
-TODO: make it fail!!!
+- **Fails with layout reconfiguration nemesis** (`--scenario r`).
+  Example of a failed run: `garage set2/20231025T115033.553+0200` (2 failures in 2 runs).
+  TODO: investigate.
+  This is the failure mode we are looking for and trying to fix for NLnet task 3.
 ## Investigating (and fixing) errors
--
cgit v1.2.3

From 9df7fa0bcd8b00dee5926fe7778853d857b5636d Mon Sep 17 00:00:00 2001
From: Alex Auvolat
Date: Wed, 25 Oct 2023 14:04:39 +0200
Subject: jepsen: use 7 nodes
---
 script/jepsen.garage/README.md | 2 ++
 1 file changed, 2 insertions(+)
(limited to 'script/jepsen.garage/README.md')
diff --git a/script/jepsen.garage/README.md b/script/jepsen.garage/README.md
index 5e50a0f4..0d647c72 100644
--- a/script/jepsen.garage/README.md
+++ b/script/jepsen.garage/README.md
@@ -87,6 +87,8 @@ Results:
 - **Fails with the partition + layout reconfiguration nemesis** (`--scenario pr`).
   Example of a failed run: `garage set1/20231024T172214.488+0200` (1 failure in 4 runs).
+  TODO: investigate.
+  This is the failure mode we are looking for and trying to fix for NLnet task 3.
 ### Set, continuous test (interspersed reads and writes)
--
cgit v1.2.3

From 5b1f50be65c251a1dc0a4358c706c409f17a82c0 Mon Sep 17 00:00:00 2001
From: Alex Auvolat
Date: Wed, 25 Oct 2023 14:43:24 +0200
Subject: jepsen: testing
---
 script/jepsen.garage/README.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
(limited to 'script/jepsen.garage/README.md')
diff --git a/script/jepsen.garage/README.md b/script/jepsen.garage/README.md
index 0d647c72..464da4da 100644
--- a/script/jepsen.garage/README.md
+++ b/script/jepsen.garage/README.md
@@ -97,10 +97,10 @@ Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 100 -
 Results:
-- For now, no failures with clock-scramble nemesis + partition nemesis -> TODO long test run
+- No failures with clock-scramble nemesis + db nemesis + partition nemesis (`--scenario cdp`) (0 failures in 10 runs).
-- **Fails with layout reconfiguration nemesis** (`--scenario r`).
-  Example of a failed run: `garage set2/20231025T115033.553+0200` (2 failures in 2 runs).
+- **Fails with just layout reconfiguration nemesis** (`--scenario r`).
+  Example of a failed run: `garage set2/20231025T141940.198+0200` (10 failures in 10 runs).
   TODO: investigate.
   This is the failure mode we are looking for and trying to fix for NLnet task 3.
--
cgit v1.2.3

From 92dd2bbe15357a24eb68a3d3d6220c4758bb81a7 Mon Sep 17 00:00:00 2001
From: Alex Auvolat
Date: Thu, 16 Nov 2023 18:09:13 +0100
Subject: jepsen: nlnet task3a seems to fix things
---
 script/jepsen.garage/README.md | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)
(limited to 'script/jepsen.garage/README.md')
diff --git a/script/jepsen.garage/README.md b/script/jepsen.garage/README.md
index 464da4da..f7479a3d 100644
--- a/script/jepsen.garage/README.md
+++ b/script/jepsen.garage/README.md
@@ -71,9 +71,12 @@ Results with timestamp patch (`--patch tsfix2`):
 - **Fails with layout reconfiguration nemesis** (`--scenario r`).
   Example of a failed run: `garage reg2/20231024T120806.899+0200`.
-  TODO: investigate.
   This is the failure mode we are looking for and trying to fix for NLnet task 3.
+- Changes brought by NLnet task 3 code (commit 707442f5de):
+  no failures with `--scenario r` (0 of 10 runs), `--scenario pr` (0 of 10 runs),
+  `--scenario cpr` (0 of 10 runs) and `--scenario dpr` (0 of 10 runs).
+
 ### Set, basic test (write some items, then read)
@@ -101,9 +104,12 @@ Results:
 - **Fails with just layout reconfiguration nemesis** (`--scenario r`).
   Example of a failed run: `garage set2/20231025T141940.198+0200` (10 failures in 10 runs).
-  TODO: investigate.
   This is the failure mode we are looking for and trying to fix for NLnet task 3.
+- Changes brought by NLnet task 3 code (commit 707442f5de):
+  no failures with `--scenario r` (0 of 10 runs), `--scenario pr` (0 of 10 runs).
+  `--scenario cpr` (0 of 10 runs) and `--scenario dpr` (0 of 10 runs).
+
 ## Investigating (and fixing) errors
--
cgit v1.2.3

From fa9247f11b89c960dffe82d6bf990ed4335788e3 Mon Sep 17 00:00:00 2001
From: Alex Auvolat
Date: Thu, 14 Dec 2023 16:23:48 +0100
Subject: jepsen: updated results, confirming that task3 works
---
 script/jepsen.garage/README.md | 55 ++++++++++++++++++++++++------------------
 1 file changed, 32 insertions(+), 23 deletions(-)
(limited to 'script/jepsen.garage/README.md')
diff --git a/script/jepsen.garage/README.md b/script/jepsen.garage/README.md
index f7479a3d..50c7eb38 100644
--- a/script/jepsen.garage/README.md
+++ b/script/jepsen.garage/README.md
@@ -7,29 +7,19 @@ Jepsen checking of Garage consistency properties.
 Requirements:
 - vagrant
-- VirtualBox, configured so that nodes can take an IP in a private network `192.168.56.0/24`
+- VirtualBox, configured so that nodes can take an IP in a private network `192.168.56.0/24` (it's the default)
 - a user that can create VirtualBox VMs
 - leiningen
 - gnuplot
-Set up VMs:
+Set up VMs before running tests:
 ```
 vagrant up
 ```
-Run tests (this one should fail):
+Run tests: see commands below.
-```
-lein run test --nodes-file nodes.vagrant --time-limit 64 --concurrency 50 --rate 50 --workload reg
-```
-
-These ones are working:
-
-```
-lein run test --nodes-file nodes.vagrant --time-limit 64 --rate 50 --concurrency 50 --workload set1
-lein run test --nodes-file nodes.vagrant --time-limit 64 --rate 50 --concurrency 50 --workload set2
-```
 ## Results
@@ -73,16 +63,19 @@ Results with timestamp patch (`--patch tsfix2`):
   Example of a failed run: `garage reg2/20231024T120806.899+0200`.
   This is the failure mode we are looking for and trying to fix for NLnet task 3.
-- Changes brought by NLnet task 3 code (commit 707442f5de):
-  no failures with `--scenario r` (0 of 10 runs), `--scenario pr` (0 of 10 runs),
+Results with NLnet task 3 code (commit 707442f5de, `--patch task3a`):
+
+- No failures with `--scenario r` (0 of 10 runs), `--scenario pr` (0 of 10 runs),
   `--scenario cpr` (0 of 10 runs) and `--scenario dpr` (0 of 10 runs).
+- Same with `--patch task3c` (commit `0041b013`, the final version).
+
 ### Set, basic test (write some items, then read)
-Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 200 --concurrency 200 --workload set1 --ops-per-key 100 --patch tsfix2`
+Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 200 --concurrency 200 --workload set1 --ops-per-key 100`
-Results:
+Results without NLnet task3 code (`--patch tsfix2`):
 - For now, no failures with clock-scramble nemesis + partition nemesis -> TODO long test run
@@ -90,15 +83,22 @@ Results:
 - **Fails with the partition + layout reconfiguration nemesis** (`--scenario pr`).
   Example of a failed run: `garage set1/20231024T172214.488+0200` (1 failure in 4 runs).
-  TODO: investigate.
   This is the failure mode we are looking for and trying to fix for NLnet task 3.
+Results with NLnet task 3 code (commit 707442f5de, `--patch task3a`):
+
+- The tests are buggy and often result in an "unknown" validity status, which
+  is caused by some requests not returning results during network partitions or
+  other nemesis-induced broken cluster states. However, when the tests were
+  able to finish, there were no failures with scenarios `r`, `pr`, `cpr`,
+  `dpr`.
+
 ### Set, continuous test (interspersed reads and writes)
-Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 100 --concurrency 100 --workload set2 --ops-per-key 100 --patch tsfix2`
+Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 100 --concurrency 100 --workload set2 --ops-per-key 100`
-Results:
+Results without NLnet task3 code (`--patch tsfix2`):
 - No failures with clock-scramble nemesis + db nemesis + partition nemesis (`--scenario cdp`) (0 failures in 10 runs).
@@ -106,17 +106,26 @@ Results:
   Example of a failed run: `garage set2/20231025T141940.198+0200` (10 failures in 10 runs).
   This is the failure mode we are looking for and trying to fix for NLnet task 3.
-- Changes brought by NLnet task 3 code (commit 707442f5de):
-  no failures with `--scenario r` (0 of 10 runs), `--scenario pr` (0 of 10 runs).
+Results with NLnet task3 code (commit 707442f5de, `--patch task3a`):
+
+- No failures with `--scenario r` (0 of 10 runs), `--scenario pr` (0 of 10 runs),
   `--scenario cpr` (0 of 10 runs) and `--scenario dpr` (0 of 10 runs).
+- Same with `--patch task3c` (commit `0041b013`, the final version).
+
+
+## NLnet task 3 final results
+
+- With code from task3 (`--patch task3c`): [reg2 and set2](results/Results-2023-12-13-task3c.png), [set1](results/Results-2023-12-14-task3-set1.png).
+- Without (`--patch tsfix2`): [reg2 and set2](results/Results-2023-12-13-tsfix2.png), set1 TBD.
 ## Investigating (and fixing) errors
 ### Segfaults
 They are due to the download being interrupted in the middle (^C during first launch on clean VMs), the `garage` binary is truncated.
-Add `:force?` to the `cached-wget!` call in `daemon.clj` to re-download the binary.
+Add `:force?` to the `cached-wget!` call in `daemon.clj` to re-download the binary,
+or restar the VMs to clear temporary files.
 ### In `jepsen.garage`: prefix wierdness
--
cgit v1.2.3