Smashing CouchBase

You might remember year 2k for Bill Gates stepping down as CEO (and promoting Steve Balmer), American Beauty winning 5 Oscars, liberation of Lebanon after 22 years of Israeli occupation, end of RSA patent, Bill Clinton becoming the first U.S. president to visit Vietnam, and I will certainly remember it for the Bulldozer Revolution in my country which lead to the resignation of Slobodan Milošević

cloud / big dataThat year marks a big milestone in the world of software engineering: in a conference keynote, prof. Eric Brewer postulated the now famous and much discussed CAP theorem (btw it was proven 2 years later by Gilbert and Lynch), and paved the way for the NoSQL revolution.

Another big factor for NoSQL was getting away from the relational model (proposed way back in 1969) and ACID to achieve scale, performance. Having felt the pain of working with object-relational mappers, developing a webapp on MongoDB for the first time was such a pleasant experience.

Unfortunately (most of) the current NoSQL systems (read MongoDB, Riak, Cassandra, …) force developers to deal with messy aspects of consistency, concurrency, non-zero latencies, weird corner cases of failure scenarios, not to mention designing CRDTs, or asking some philosophical questions like how-eventual-is-eventual-consistency (there’re some promising developments like Google’s Spanner and HyperDex, but haven’t had time to have a look).

I’ve successfully managed to avoid dealing with these thorny issues for a few years now, but this fantastic series of articles called Jepsen by Kyle Kingsbury got me motivated to address these using CouchBase. His series of articles is not only eye-opening from the developer’s point of view, but it also reveals some shocking discoveries, namely some databases claim things they don’t provide (Redis being a CP system, MongoDB being safe, etc).

I’m not smart enough to prove that a database is broken by design, and even if I was it would require more time than I afford to invest, so Kyle’s experimental approach is the preferred way.  Since I don’t speak Clojure, the implementation language of his choice for Jepsen, this will be done in Java (yeah, I know, it’s not that cool anymore, but it gets the job done).

Cluster setup

The first cluster setup was just like Kyle’s using LXC, but I ran into some networking issues, one node would just deny connection attempts, so I ended up using 5 VMs on my Mac laptop.

For a detailed explanation of what and how this test is supposed to work have a look at http://aphyr.com/posts/281-call-me-maybe-carly-rae-jepsen-and-the-perils-of-network-partitions. Here’s a quick recap:

A five node cluster of Ubuntu 14.04 machines, running in VMs (VmWare or VirtualBox). VMs – 192.168.1.131 – 192.168.1.135 (named n1-n5 – add these entries to /etc/hosts on your computer and on each of the nodes themselves).

To cause a partition, Kyle makes “nodes n1 and n2 to drop IP traffic from n3, n4, and n5–essentially by running”:

In my experiments, I wanted to have ssh access all the time on all VMs so I drop a port range that covers all CouchBase ports (they should be above 1024), like this:

To resolve the partition, we’ll reset iptables chains to their defaults:

A client will be a simple app that appends integers to a single document – read & append number & use cas to update, and record if the db acknowledged the write or there was an error. At the end I’ll compare if the db and client apps agree on the state, or acknowledged writes are missing and/or unacknowledged writes are present.

A simple time-out test first

Before going with the full cluster, let’s test the code against a single node cluster running CouchBase 2.5.1: 256MB ram; default bucket is type Couchbase, 256MB ram quota, 0 replicas, 2 workers, flush enabled.

The client app will use Java SDK 1.4.3 and Oracle JDK 1.7.0_55.

CouchTimeoutTest first opens a connection to CouchBase – CouchbaseClient with the OP timeout of 10sec, sets an empty JSON document with a single array element, appends 50 numbers to the array, then makes the node drop all traffic for CouchBase ports, and tries to add 150 more elements to the array:

All operations are synchronous. The Java SDK driver docs describe how to deal with time-outs: “When a time-out occurs, most of the synchronous methods on the client return a RuntimeException showing a time-out as the root cause.”, so the appendToDocument implementation looks like this:

Test result

The client is configured to timeout after 10sec, and the partition lasts for 15s so I expect to see some timeouts and retries. In the end the document should have 200 integers in the array. Manually checking the iptables rules I see the code correctly heals the node. What do you think happened when I ran this code (using maven exec plugin, mvn test -Pcouch-timeout)?

  1. a few operations timed out and then succeeded
  2. a few operations timed out and then failed
  3. CouchbaseClient failed to reconnect so only 50 integers were written
  4. an infinite loop

Believe it or not it’s the 4th option. CouchbaseClient never timeouts or throws anything, just keeps running forever broadcasting noops to probably keep the connection alive:

The cluster is healthy, node up and running; if I spin another client it reads/writes without any problems.

What happens if the server is unavailable for some amount of time less than the configured timeout (10sec in this test)? The magic number seems to be 7, for anything longer the test deadlocks.

The SDK docs are pretty clear in that “when a time-out occurs, most of the synchronous methods on the client return a RuntimeException“.

It turned out (thanks Michael) the problem was the test code – getting the doc was outside any try/catch and somehow the exceptions were getting swallowed (not even showing in the log?!). The fix is:

couch_timeout_test_bugfix

With this change, the test runs as expected, a few ops time out but eventually all succeed:

The next article will explore how CouchBase behaves when this test runs against a 5 node cluster.

  • http://www.nitschinger.at Michael Nitschinger

    Hi,

    can you share your builder configuration? You said its configured for 10 seconds but on gets / cas you use the default timeout which is set to 2.5s if not overriden.

    Also, it’s not clear from the article: did you just run it against a healthy cluster or did you already induce the partition?

    Thanks,
    Michael

    • mbsimonovic

      you mean this (timeoutMillis is 10000):

      builder.setOpTimeout(timeoutMillis);//2.5 by default

      builder.setObsTimeout(timeoutMillis);

      https://github.com/mbsimonovic/jepsen-couchbase/blob/master/src/main/java/com/milansimonovic/jepsen/CouchbaseClientFactory.java#L33

      I start the test by writing to a healthy 1-node cluster, and after 50 writes, make the node drop all packets for 15sec.

      • http://www.nitschinger.at Michael Nitschinger

        thanks,

        and at which point exactly does the infinite loop show up? when you start dropping the packets? After how many seconds do you reset the firewall?

        • mbsimonovic

          1. open CouchbaseClient with opTimeout = 10sec
          2. write 50 docs (works fine)
          3. make the server drop packets for 15sec
          4. try to write another doc -> blocks forever

          trying different timeout values: opTimeout =2.5sec, drop packets for 5sec, but it also never times-out

  • mbsimonovic

    you mean this (timeoutMillis is 10000):
    builder.setOpTimeout(timeoutMillis);//2.5 by default
    builder.setObsTimeout(timeoutMillis);
    https://github.com/mbsimonovic/jepsen-couchbase/blob/master/src/main/java/com/milansimonovic/jepsen/CouchbaseClientFactory.java#L33

    I start the test by writing to a healthy 1-node cluster, and after 50 writes, make the node drop all packets for 15sec.

  • http://blogs.ingenthron.org/matt Matt Ingenthron

    First off, thanks for filing the issue and I’m sure we’ll work to repro it soon. I can’t immediately think of a reason you’d not get an exception back. It’ll be interesting to look at a thread dump in that state when we repro.

    As the guy who opened SPY-30, note there that I mention the continuous operation timeout. That is a heuristic which relies on (IIRC) 1000 continuous timeouts to a given node. While that sounds kind of high, it was from an Open Source contribution by a guy who runs it at a large site and we usually see thousands of ops/s. It’s tune-able though.

    The interesting thing is the timeout under the synchronous operation is based on a java.util.concurrent latch. My suspicion is that this is exposing some un-handled error situation on the async IO thread. Even if that does happen though, how could that break java.util.concurrent’s latch? Your thread is really just creating objects, putting them on queues and waiting on latches.

    One suspicion, and I’ll try to read the code more carefully here, but you’re doing lots of retries with 10s timeouts and with synchronous failures and retries, it may just be that it’s spending a long time waiting between iterations. But, no log output. We’ll figure it out pretty quickly I’m sure.

    • http://milansimonovic.com/ Milan Simonovic

      just one remark: it also happens with 2.5sec timeouts

  • http://www.nitschinger.at Michael Nitschinger

    I think I found a bug in your code while running the sample. It’s good stuff btw! But you forgot to also move the gets call into the try/catch and since it can also time out the exception gets swallowed somewhere and your executing thread dies.

    With my PR things go through nicely: https://github.com/mbsimonovic/jepsen-couchbase/pull/2

  • Mao Geng

    Hi Milan,
    Your article/test is as same as interesting as Kyle Kingsbury’s. Thanks for the sharing.
    Did you proceed to test against a 5 node couchbase cluster? I am curious what’s the result.

    • http://milansimonovic.com/ Milan Simonovic

      Hi Mao, no still haven’t gotten back to it, at first was waiting for couch 3.0, but then had to switch back to mongo for other reasons. Hope to get back and finish this later this year.