Jepsen-testing RabbitMQ [with python]

UPDATE: the code is now available on GitHub https://github.com/mbsimonovic/jepsen-python

In this post I’ll tell you about trying to reproduce the Jepsen RabbitMQ test (using Python, not Clojure).
It’s been more than 2 years since the test, and RabbitMQ has gone from 3.3 to 3.6 in the meantime, so I was wondering if anything’s different these days (end of 2016).

Messaging is a legit communication pattern, as nicely documented in the Enterprise Integration Patterns (2003) book, and with the rise of microservices it’s even more relevant than it was 10-15 years ago.

Seems like there’s a consensus about preferring smart endpoints and dumb pipes, so when it comes to picking a messaging provider, there are a few options: RabbitMQ, Apache Kafka and Apache ActiveMQ. Google Trends does back up this claim about the rise of interest:

[Figure: Google Trends, RabbitMQ vs Kafka vs ActiveMQ]

Back in 2013 Kyle Kingsbury started publishing a fantastic series of blog posts called Jepsen, where he tested how databases behave in a distributed environment. It turned out badly for most of them, RabbitMQ included.

In short, under certain failure scenarios it’s possible to lose data, and losing here means losing a write that RabbitMQ had acknowledged.

I was wondering if today (December 2016) things are any different.

Setting up the environment (docker)

I’ve been using Docker for almost 2 years now, so that was a natural choice for this test. I use VMware Fusion to run Ubuntu Xenial, and run Docker on Ubuntu (you can grab an image from http://www.osboxes.org/ubuntu/).


# enable ssh
$ sudo apt-get install openssh-server
# login from your laptop
# update & upgrade
#
$ sudo apt-get -y install python-pip python-virtualenv python-dev
$ python --version
Python 2.7.12
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.1 LTS
Release: 16.04
Codename: xenial
$ uname -a
Linux osboxes 4.4.0-53-generic #74-Ubuntu SMP Fri Dec 2 15:59:10 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
#
# install docker https://docs.docker.com/engine/installation/linux/ubuntulinux/

Jepsen has meanwhile gotten a docker setup so I tried that first:

$ ssh osboxes@192.168.54.136
$ git clone https://github.com/jepsen-io/jepsen.git && cd jepsen/docker
$ ./up.sh # this runs docker-compose build and up, so in a new terminal:
$ sudo docker exec -it jepsen-control bash
root@control $ cd rabbitmq && lein test :only jepsen.rabbitmq-test

This failed with JSchException: Packet corrupt. ssh-ing once to each node and accepting the host key fixed this problem.
Trying lein test again failed with the message “rabbitmq-server: unrecognized service”. Looking at the console output, the test tried to install RabbitMQ on only 3 nodes (out of 5). So I manually installed it on the remaining 2 nodes and ran the test again. It then hung while starting RabbitMQ. I tried changing rabbit-test.clj and adding :nodes [:n1 :n2 :n3 :n4 :n5], but then it tried to start RabbitMQ on 4 nodes and hung again. I’m not familiar enough with Clojure to debug this, so I submitted a Jepsen bug report.

Running a RabbitMQ cluster in Docker

First, I need a RabbitMQ cluster in Docker. I’ll start with bijukunjummen/docker-rabbitmq-cluster:

$ git clone https://github.com/bijukunjummen/docker-rabbitmq-cluster.git
$ cd docker-rabbitmq-cluster/base
$ sudo docker build -t bijukunjummen/rabbitmq-base .
# this hangs while installing plugins, seems like erlang freaks out when it's pid 1, so just add && true at the end:
# RUN /usr/sbin/rabbitmq-plugins enable rabbitmq_mqtt rabbitmq_stomp rabbitmq_management rabbitmq_management_agent rabbitmq_management_visualiser rabbitmq_federation rabbitmq_federation_management sockjs && true
$ cd ../server
$ wget 'https://github.com/jepsen-io/jepsen/raw/master/rabbitmq/resources/rabbitmq/rabbitmq.config'
$ sudo docker build -t bijukunjummen/rabbitmq-server .
$ cd ../cluster
# edit docker-compose.yml

Starting up a cluster using links didn’t work for me: some nodes would hang because they didn’t know about all the other nodes (rabbit3 doesn’t know about rabbit5 because it was created earlier), so I had to go with DNS instead.

# need new docker-compose for version 2 syntax:
$ sudo apt-get install docker-compose
$ sudo docker-compose up

 

After a few seconds the cluster should be up and running, so open http://192.168.54.136:15672/ and log in with guest/guest.


Let’s run a simple hello world test:
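A minimal sketch of such a smoke test, assuming the pika client (the log output later in this post suggests pika was used). The queue name `hello` and the helpers `hello_world` and `main` are made up for illustration; host and port are the ones used in this post.

```python
def hello_world(channel, queue="hello", body=b"Hello World!"):
    """Publish one message and read it back through the same channel."""
    channel.queue_declare(queue=queue)
    # The default exchange ('') routes by queue name.
    channel.basic_publish(exchange="", routing_key=queue, body=body)
    # basic_get returns (method, properties, body); method is None
    # when the queue is empty.
    method, _properties, received = channel.basic_get(queue)
    if method is None:
        return None
    channel.basic_ack(delivery_tag=method.delivery_tag)
    return received


def main(host="192.168.54.136", port=5672):
    # Imported here so hello_world() can be exercised without a broker.
    import pika

    connection = pika.BlockingConnection(
        pika.ConnectionParameters(host=host, port=port))
    try:
        print(hello_world(connection.channel()))
    finally:
        connection.close()
```

With the cluster up, calling `main()` should print the message body; pointing `port` at another node's mapped port checks that node as well.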

Blockade – a Jepsen-inspired tool in python

There’s a Python tool called blockade, inspired by Jepsen, that injects network failures into Docker containers. So while waiting for jepsen-160, I decided to write my own RabbitMQ test using blockade.

With the cluster up and running let’s set up blockade to manage it:

 

The original Jepsen test uses triple-mirrored writes, so I need to configure that:
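One way to get three copies of a queue (a master plus two mirrors) is an HA policy with `ha-mode` set to `exactly` and `ha-params` set to 3. The sketch below builds such a policy and sends it to the management plugin's HTTP API. The policy name `jepsen-ha`, the pattern `^jepsen\.` and the helper names are assumptions for illustration, and `ha-sync-mode` is optional; the guest/guest credentials and port 15672 are the ones used in this post.

```python
import base64
import json

try:  # Python 3
    from urllib.parse import quote
    from urllib.request import Request, urlopen
except ImportError:  # Python 2.7, as used in this post
    from urllib import quote
    from urllib2 import Request, urlopen


def mirror_policy(pattern="^jepsen\\.", replicas=3):
    """Policy body: keep `replicas` copies of every matching queue."""
    return {
        "pattern": pattern,
        "apply-to": "queues",
        "definition": {
            "ha-mode": "exactly",
            "ha-params": replicas,
            "ha-sync-mode": "automatic",  # optional, an assumption here
        },
    }


def policy_url(host, name, vhost="/", port=15672):
    # The vhost is percent-encoded, so the default "/" becomes "%2F".
    return "http://{}:{}/api/policies/{}/{}".format(
        host, port, quote(vhost, safe=""), quote(name, safe=""))


def put_policy(host, name="jepsen-ha", user="guest", password="guest"):
    """PUT the policy to the management API; needs a reachable broker."""
    credentials = "{}:{}".format(user, password).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request = Request(
        policy_url(host, name),
        data=json.dumps(mirror_policy()).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": "Basic " + token},
    )
    request.get_method = lambda: "PUT"  # works with urllib2 and urllib.request
    return urlopen(request).getcode()
```

Calling `put_policy("192.168.54.136")` would apply the policy to any queue whose name starts with `jepsen.`, such as the `jepsen.queue` seen in the logs below.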



 

RabbitMQ randomly chooses two more slaves for the queue; in my case they were rabbit3 and rabbit5. You can find this on the queue tab, under Details.

Finally testing RabbitMQ

Let’s rewrite jepsen’s rabbit.clj test in python:
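As a rough illustration of the producer side (not the code from the repository linked at the top), here is a minimal sketch. It assumes a hypothetical `publish(body)` callable per node that returns True once the broker has confirmed the message. The per-producer ranges are inferred from the message bodies in the logs below, and unlike the real test this sketch does not reconnect or retry.

```python
import threading


def run_producer(publish, values, stats):
    """Send every value through `publish`; count confirmed and failed sends."""
    for value in values:
        try:
            ok = publish(str(value))
        except Exception:  # e.g. the connection dropped during a partition
            ok = False
        stats["sent" if ok else "failed"] += 1


def run_all(publishers, per_producer=5000):
    """Start one thread per publisher and wait for all of them to finish."""
    threads, all_stats = [], []
    for index, publish in enumerate(publishers):
        # Producer i sends the integers i*per_producer .. (i+1)*per_producer - 1,
        # so every message body is unique across the whole run.
        values = range(index * per_producer, (index + 1) * per_producer)
        stats = {"sent": 0, "failed": 0}
        thread = threading.Thread(
            target=run_producer, args=(publish, values, stats))
        thread.start()
        threads.append(thread)
        all_stats.append(stats)
    for thread in threads:
        thread.join()
    return all_stats
```

Keeping the message bodies unique is what later makes it possible to tell lost messages from duplicates.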

The code uses separate threads to send messages to rabbitmq, then waits for all to finish before collecting all messages. To introduce network problems while the test is running, I use blockade in another terminal:
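The partition itself can be driven from the blockade command line with `blockade partition` and `blockade join`. The sketch below wraps those two commands in Python. It assumes the containers are already registered with blockade under the names passed in; the helper name `partition_then_heal` and the choice of nodes are made up for illustration, and the 60-second hold matches the healing delay mentioned in the results.

```python
import subprocess
import time


def partition_then_heal(isolated, hold_seconds=60,
                        run=subprocess.check_call, sleep=time.sleep):
    """Cut `isolated` containers off from the rest, wait, then heal."""
    # `blockade partition a,b` puts a and b into one partition; all other
    # containers end up together in a second, implicit partition.
    run(["blockade", "partition", ",".join(isolated)])
    try:
        sleep(hold_seconds)
    finally:
        # `blockade join` removes all partitions, even if the wait
        # was interrupted.
        run(["blockade", "join"])
```

For example, `partition_then_heal(["rabbit4", "rabbit5"])` would isolate two of the five nodes for a minute while the producers keep running.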

 

Results

The clients connect and start sending messages:

/usr/bin/python2.7 src/jepsen_test.py
[INFO] (MainThread) rabbitmq client at 5672
[INFO] (MainThread) Connecting to 192.168.54.136:5672
[INFO] (MainThread) Created channel=1
[INFO] (MainThread) rabbitmq client at 5673
[INFO] (MainThread) Connecting to 192.168.54.136:5673
[INFO] (MainThread) Created channel=1
[INFO] (MainThread) rabbitmq client at 5674
[INFO] (MainThread) Connecting to 192.168.54.136:5674
[INFO] (MainThread) Created channel=1
[INFO] (MainThread) rabbitmq client at 5675
[INFO] (MainThread) Connecting to 192.168.54.136:5675
[INFO] (MainThread) Created channel=1
[INFO] (MainThread) rabbitmq client at 5676
[INFO] (MainThread) Connecting to 192.168.54.136:5676
[INFO] (MainThread) Created channel=1
[INFO] (MainThread) starting producer
[INFO] (rabbit 5672) sending messages
[INFO] (MainThread) starting producer
[INFO] (rabbit 5673) sending messages
[INFO] (MainThread) starting producer
[INFO] (rabbit 5674) sending messages
[INFO] (MainThread) starting producer
[INFO] (rabbit 5675) sending messages
[INFO] (MainThread) starting producer
[INFO] (rabbit 5676) sending messages

Then, when RabbitMQ detects a partition (pause minority), clients lose their connection:

[WARNING] (rabbit 5675) Published message was returned: _delivery_confirmation=True; channel=1; method=<Basic.Return(['exchange=', 'reply_code=312', 'reply_text=NO_ROUTE', 'routing_key=jepsen.queue'])>; properties=<BasicProperties(['delivery_mode=2'])>; body_size=5; body_prefix='15310'
[WARNING] (rabbit 5676) Published message was returned: _delivery_confirmation=True; channel=1; method=<Basic.Return(['exchange=', 'reply_code=312', 'reply_text=NO_ROUTE', 'routing_key=jepsen.queue'])>; properties=<BasicProperties(['delivery_mode=2'])>; body_size=5; body_prefix='20312'
[WARNING] (rabbit 5676) Socket closed when connection was open
[WARNING] (rabbit 5676) Disconnected from RabbitMQ at 192.168.54.136:5676 (320): CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'
[CRITICAL] (rabbit 5676) Connection close detected; result=BlockingConnection__OnClosedArgs(connection=, reason_code=320, reason_text="CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'")


Then after 60 seconds the partition is healed and the clients connect again:

[INFO] (rabbit 5675) Connecting to 192.168.54.136:5675
[INFO] (rabbit 5675) Created channel=1

End result, all 25000 messages sent:

[INFO] (rabbit 5672) sent: 5000, failed: 0, total: 5000
[INFO] (rabbit 5673) sent: 5000, failed: 0, total: 5000
[INFO] (rabbit 5674) sent: 5000, failed: 0, total: 5000
[INFO] (rabbit 5675) sent: 5000, failed: 0, total: 5000
[INFO] (rabbit 5676) sent: 5000, failed: 0, total: 5000

Note that RabbitMQ reports 25020 messages, 20 more than were sent. That’s OK: the extra ones are duplicates.

[WARNING] (MainThread) RECEIVED: 25000, DUPLICATE: 20. [20123, 15121, 15630, 20644..., 22713], LOST MESSAGES 0, []
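A summary like the one above boils down to a set comparison between the acknowledged writes and what was drained from the queue. A minimal sketch, where the function name `check` and the extra `unexpected` field are assumptions for illustration:

```python
from collections import Counter


def check(acknowledged, received):
    """Compare acknowledged writes against the messages read back."""
    counts = Counter(received)
    acknowledged = set(acknowledged)
    return {
        # number of distinct messages read back
        "received": len(counts),
        # values that were delivered more than once
        "duplicates": sorted(v for v, n in counts.items() if n > 1),
        # acknowledged by the broker but never read back
        "lost": sorted(acknowledged - set(counts)),
        # read back although never acknowledged
        "unexpected": sorted(set(counts) - acknowledged),
    }
```

Duplicates are acceptable under at-least-once delivery; a non-empty `lost` list is what would indicate a dropped acknowledged write.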

Bottom line: 0 messages lost. I’ve found a great post by Balint Pato where he explains how he managed to reproduce the Jepsen results. Unfortunately, his code and setup are not publicly available, so until he publishes them, I’ll try to reproduce his results. Stay tuned, more posts coming.

5.1.17 UPDATE: second post available where I try different partitioning schemes.