Mesos, Marathon, and Docker
How Yodel solved its deployment problems using Mesos and Marathon with Docker
Goals
- Empower the developers; no dedicated ops person
- A developer can set a deployment going in the pipeline, go to lunch, and come back to find it done
Technologies used
Some of the technologies used at Yodel
Service Discovery
- HAProxy
- Qubit Bamboo (generates HAProxy configuration from Marathon state)
Container
Docker
Scheduler/keepalive
- Marathon - a Mesos framework/scheduler that restarts a process if Mesos kills it (see the sketch below)
- Monitors for processes that are down and self-heals by bringing up another instance
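As an illustration of the keepalive behaviour, here is a minimal sketch (not Yodel's actual code) of registering a long-running Docker service through Marathon's REST API. POST /v2/apps is Marathon's standard endpoint for creating an app; the Marathon host, app id, and image name are hypothetical placeholders.

    # Sketch: register a long-running service with Marathon so it is restarted
    # if the task dies or Mesos kills it. Host, app id and image are placeholders.
    import json
    import urllib.request

    app = {
        "id": "/example-service",        # hypothetical app id
        "cpus": 0.5,                     # resource limits Mesos enforces
        "mem": 256,                      # MB
        "instances": 3,                  # Marathon keeps 3 tasks running
        "container": {
            "type": "DOCKER",
            "docker": {"image": "example/service:latest", "network": "BRIDGE"},
        },
    }

    req = urllib.request.Request(
        "http://marathon.example.com:8080/v2/apps",   # placeholder Marathon host
        data=json.dumps(app).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)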
Virtual layer
Mesos
- Master/slave topology
- Clustering by realm:
  - PROD Mesos cluster
  - QA Mesos cluster
  - Staging Mesos cluster
- Each cluster includes ZooKeeper
CI
Atlassian Bamboo - CI
Monitoring
New Relic
Deploy scripts
An in-house microservice called "Cerebro" - itself a microservice in a Docker container, deployed by Marathon on Mesos
Configuration
Dockerfile and Puppet
Logs
syslog (or another sensible logging solution) - the trade-off is that there is no console log
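A minimal Python sketch of that trade-off, assuming a local syslog daemon listening on /dev/log (the service name is a hypothetical placeholder):

    # Sketch: send application logs to syslog instead of the console.
    import logging
    from logging.handlers import SysLogHandler

    logger = logging.getLogger("example-service")        # hypothetical name
    logger.setLevel(logging.INFO)
    logger.addHandler(SysLogHandler(address="/dev/log"))

    logger.info("service started")                       # goes to syslog, not stdout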
Scheduling
MPI and Hadoop schedulers
Problems faced and solutions
Problem: Bad Actors
Some processes cause starvation
Solution: dynamic scaling/customizable services
Scaling
Scales from 3 to 3,000 instances dynamically, for example through a UI (see the sketch after this list)
- A process is not allowed to consume more CPU/memory than it has been allocated
- No starvation
Resource constraints
- CPU (falls back to swap if set up): Mesos will kill the process and Marathon will take over and restart it
- The memory constraint is soft - if a process is the only one running on the host, it can consume all of the memory
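A minimal sketch of that kind of scaling, assuming it is done by updating the instance count through Marathon's PUT /v2/apps/{id} endpoint (the host and app id are placeholders; Yodel's own UI/tooling may work differently):

    # Sketch: scale a Marathon app by updating its instance count.
    import json
    import urllib.request

    def scale(app_id, instances, marathon="http://marathon.example.com:8080"):
        req = urllib.request.Request(
            marathon + "/v2/apps" + app_id,              # e.g. /v2/apps/example-service
            data=json.dumps({"instances": instances}).encode(),
            headers={"Content-Type": "application/json"},
            method="PUT",
        )
        urllib.request.urlopen(req)

    scale("/example-service", 3000)                      # e.g. from 3 up to 3,000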
Problem: explicit service discovery
- A service shouldn't need to contact ZooKeeper for discovery to figure out QA vs. PROD endpoints
Solution: implicit service discovery
Look up the service on port 8080 and HAProxy distributes requests to the slaves actually running it, e.g. service-instance-1 listening on 31003 and service-instance-2 listening on 31090
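A sketch of where that host/port information comes from, assuming the discovery layer asks Marathon for an app's tasks via GET /v2/apps/{id}/tasks (host and app id are placeholders); Bamboo feeds the same list into HAProxy so clients only ever see port 8080:

    # Sketch: discover the slaves and dynamically assigned ports for a service.
    import json
    import urllib.request

    def discover(app_id, marathon="http://marathon.example.com:8080"):
        req = urllib.request.Request(
            marathon + "/v2/apps" + app_id + "/tasks",
            headers={"Accept": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            tasks = json.load(resp)["tasks"]
        return [(t["host"], t["ports"][0]) for t in tasks]

    # e.g. [("slave-1", 31003), ("slave-2", 31090)]
    print(discover("/example-service"))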
Problem: All-or-nothing deployments
If there are problems in production, you have to roll everything back
Solution: Canary Deployments
Canary Isolated - dip a toe into PROD and see if it's working - deploy a single isolated instance to PROD and confirm the tests pass
Canary Partial - put a leg into PROD and see if it's working - direct 10% of the traffic to the new version; if errors appear, roll back (the full flow is sketched after this list)
Full Production - deploy to PROD with confidence
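A sketch of the three-stage flow above; every helper here is a hypothetical stub standing in for Yodel's real tooling (Cerebro, Marathon, HAProxy), and only the flow itself comes from the notes:

    # Stubs: placeholders for the real deploy/routing/monitoring tooling.
    def deploy(version, instances, isolated):
        print(f"deploy {version}: {instances} instance(s), isolated={isolated}")

    def smoke_tests_pass(version):
        return True                                        # canary-isolated tests

    def route_traffic(version, percent):
        print(f"route {percent}% of traffic to {version}")  # e.g. HAProxy weights

    def error_rate(version):
        return 0.0                                         # e.g. read from monitoring

    def rollback(version):
        print(f"rollback {version}")

    def canary_deploy(version):
        deploy(version, instances=1, isolated=True)        # 1) canary isolated
        if not smoke_tests_pass(version):
            return rollback(version)
        route_traffic(version, percent=10)                 # 2) canary partial
        if error_rate(version) > 0.01:                     # threshold is illustrative
            return rollback(version)
        deploy(version, instances=3, isolated=False)       # 3) full production
        route_traffic(version, percent=100)

    canary_deploy("example-service:2.0")                   # hypothetical version tag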
Configuration management
Configuration management for infrastructure and application
Infrastructure level
Puppet to manage the configuration of the Mesos masters and slaves (e.g. owned by a system admin)
Application level
Developers configure their own Dockerfile (e.g. a Java developer)
Clean up
- 3 new-version instances are brought up; once Marathon verifies they are up and running, the old versions are taken out of rotation and Mesos tears them down
- Once the deployment is finished, the 'Cerebro' daemon stops watching it and New Relic takes over monitoring (a polling sketch follows)
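A sketch of what such a deploy watcher could look like, assuming it polls Marathon's GET /v2/deployments endpoint until nothing is in flight; this is not Yodel's actual Cerebro code, and the host is a placeholder:

    # Sketch: wait until Marathon reports no in-flight deployments, then stop
    # watching and leave monitoring to New Relic.
    import json
    import time
    import urllib.request

    def wait_for_deployments(marathon="http://marathon.example.com:8080"):
        while True:
            with urllib.request.urlopen(marathon + "/v2/deployments") as resp:
                if not json.load(resp):          # empty list -> nothing in flight
                    return
            time.sleep(5)

    wait_for_deployments()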
Deployed outside of the Mesos clusters
- Databases are mapped outside of Mesos
- Legacy services, e.g. the Thrift/legacy stack that uses Curator over ZooKeeper
Alternatives to Mesos
CoreOS Fleet
Developer environment
- Docker container
- No Mesos in DEV, but the same container is deployed over to the Mesos cluster
Q & A
- Why Qubit Bamboo and HAProxy?
HAProxy is used for the canary deployments:
- traffic weighting for canary partial
- a request header for canary isolation (a client-side sketch appears at the end of this Q&A)
Alternatives to Qubit Bamboo
- consul-template could be used instead of Qubit Bamboo
- Production topology
- 4 Mesos slaves and 4 Mesos masters
- 4 machines
- 20 services, 3 instances each
- 60 Docker containers
- Chronos
- Being evaluated as a scheduler to replace cron
- Leader election is a horrible problem to solve with shell scripts
- Right now only long-running services run on Mesos
- Short-running batch processes (e.g. Hadoop) shouldn't conflict with them; the constraint is CPU/memory resources
- Reliable non-flaky deployments
- "Cerebro" - the in-house deployer itself runs as a Docker container through Mesos
- Microservice architecture - shared state is not a thing, since each service has its own database
- Case study: Production process was falling over (memory issue)
- Used canary isolated deployments to continuously deploy and roll back
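For the header-based canary isolation mentioned in the Q&A above, a client-side sketch; the notes only say a header is used, so the header name, URL, and the HAProxy rule behind it are hypothetical:

    # Sketch: a test client opting into the isolated canary instance by sending
    # a header that HAProxy is assumed to route on.
    import urllib.request

    req = urllib.request.Request(
        "http://example-service.internal:8080/health",   # placeholder URL
        headers={"X-Canary": "true"},                      # hypothetical header name
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.status, resp.read())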
Architecture