Mesos, Marathon, and Docker
How Yodel solved its deployment problems using Mesos and Marathon with Docker
Goals
- Empower the developers; no dedicated ops person
- A developer can set a deployment going in the pipeline, go to lunch, and come back to find it done
Technologies used
Some of the technologies used at Yodel
Service Discovery
- HAProxy
- Qubit Bamboo (generates HAProxy configuration from Marathon state)
Container
Docker
Scheduler/keepalive
- Marathon - a Mesos framework/scheduler that restarts a process if Mesos kills it (see the sketch below)
- Monitors for processes that are down and self-heals by bringing up another instance
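As an illustration of the keepalive behaviour, here is a minimal sketch (not Yodel's actual code) of registering a long-running Docker service through Marathon's REST API. POST /v2/apps is Marathon's standard endpoint for creating an app; the Marathon host, app id, and image name are hypothetical placeholders.

    # Sketch: register a long-running service with Marathon so it is restarted
    # if the task dies or Mesos kills it. Host, app id and image are placeholders.
    import json
    import urllib.request

    app = {
        "id": "/example-service",        # hypothetical app id
        "cpus": 0.5,                     # resource limits Mesos enforces
        "mem": 256,                      # MB
        "instances": 3,                  # Marathon keeps 3 tasks running
        "container": {
            "type": "DOCKER",
            "docker": {"image": "example/service:latest", "network": "BRIDGE"},
        },
    }

    req = urllib.request.Request(
        "http://marathon.example.com:8080/v2/apps",   # placeholder Marathon host
        data=json.dumps(app).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)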
Virtual layer
Mesos
- Master/slave topology
- Clustering by realm:
  - PROD Mesos cluster
  - QA Mesos cluster
  - Staging Mesos cluster
- Each cluster includes ZooKeeper
CI
Atlassian Bamboo - CI
Monitoring
New Relic
Deploy scripts
An in-house microservice called "Cerebro" - itself a microservice in a Docker container, deployed by Marathon on Mesos
Configuration
Dockerfile and Puppet
Logs
syslog (or another sensible logging solution) - the trade-off is that there is no console log
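A minimal Python sketch of that trade-off, assuming a local syslog daemon listening on /dev/log (the service name is a hypothetical placeholder):

    # Sketch: send application logs to syslog instead of the console.
    import logging
    from logging.handlers import SysLogHandler

    logger = logging.getLogger("example-service")        # hypothetical name
    logger.setLevel(logging.INFO)
    logger.addHandler(SysLogHandler(address="/dev/log"))

    logger.info("service started")                       # goes to syslog, not stdout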
Scheduling
MPI and Hadoop schedulers
Problems faced and solutions
Problem: Bad Actors
Some processes cause starvation
Solution: dynamic scaling/customizable services
Scaling
Scales from 3 to 3,000 instances dynamically, for example through a UI (see the sketch after this list)
- A process is not allowed to consume more CPU/memory than it has been allocated
- No starvation
Resource constraints
- CPU (falls back to swap if set up): Mesos will kill the process and Marathon will take over and restart it
- The memory constraint is soft - if a process is the only one running on the host, it can consume all of the memory
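A minimal sketch of that kind of scaling, assuming it is done by updating the instance count through Marathon's PUT /v2/apps/{id} endpoint (the host and app id are placeholders; Yodel's own UI/tooling may work differently):

    # Sketch: scale a Marathon app by updating its instance count.
    import json
    import urllib.request

    def scale(app_id, instances, marathon="http://marathon.example.com:8080"):
        req = urllib.request.Request(
            marathon + "/v2/apps" + app_id,              # e.g. /v2/apps/example-service
            data=json.dumps({"instances": instances}).encode(),
            headers={"Content-Type": "application/json"},
            method="PUT",
        )
        urllib.request.urlopen(req)

    scale("/example-service", 3000)                      # e.g. from 3 up to 3,000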
Problem: explicit service discovery
- A service shouldn't need to contact ZooKeeper for discovery to figure out QA vs. PROD endpoints
Solution: implicit service discovery
Look up the service on port 8080 and HAProxy distributes requests to the slaves actually running it, e.g. service-instance-1 listening on 31003 and service-instance-2 listening on 31090
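A sketch of where that host/port information comes from, assuming the discovery layer asks Marathon for an app's tasks via GET /v2/apps/{id}/tasks (host and app id are placeholders); Bamboo feeds the same list into HAProxy so clients only ever see port 8080:

    # Sketch: discover the slaves and dynamically assigned ports for a service.
    import json
    import urllib.request

    def discover(app_id, marathon="http://marathon.example.com:8080"):
        req = urllib.request.Request(
            marathon + "/v2/apps" + app_id + "/tasks",
            headers={"Accept": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            tasks = json.load(resp)["tasks"]
        return [(t["host"], t["ports"][0]) for t in tasks]

    # e.g. [("slave-1", 31003), ("slave-2", 31090)]
    print(discover("/example-service"))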
Problem: All-or-nothing deployments
If there are problems in production, you have to roll everything back
Solution: Canary Deployments
Canary Isolated - dip a toe into PROD and see if it's working - deploy a single isolated instance to PROD and confirm the tests pass
Canary Partial - put a leg into PROD and see if it's working - direct 10% of the traffic to the new version; if errors appear, roll back (the full flow is sketched after this list)
Full Production - deploy to PROD with confidence
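A sketch of the three-stage flow above; every helper here is a hypothetical stub standing in for Yodel's real tooling (Cerebro, Marathon, HAProxy), and only the flow itself comes from the notes:

    # Stubs: placeholders for the real deploy/routing/monitoring tooling.
    def deploy(version, instances, isolated):
        print(f"deploy {version}: {instances} instance(s), isolated={isolated}")

    def smoke_tests_pass(version):
        return True                                        # canary-isolated tests

    def route_traffic(version, percent):
        print(f"route {percent}% of traffic to {version}")  # e.g. HAProxy weights

    def error_rate(version):
        return 0.0                                         # e.g. read from monitoring

    def rollback(version):
        print(f"rollback {version}")

    def canary_deploy(version):
        deploy(version, instances=1, isolated=True)        # 1) canary isolated
        if not smoke_tests_pass(version):
            return rollback(version)
        route_traffic(version, percent=10)                 # 2) canary partial
        if error_rate(version) > 0.01:                     # threshold is illustrative
            return rollback(version)
        deploy(version, instances=3, isolated=False)       # 3) full production
        route_traffic(version, percent=100)

    canary_deploy("example-service:2.0")                   # hypothetical version tag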
Configuration management
Configuration management for infrastructure and application
Infrastructure level
Puppet to manage the configuration of the Mesos masters and slaves (e.g. owned by a system admin)
Application level
Developers configure their own Dockerfile (e.g. a Java developer)
Clean up
- 3 new-version instances are brought up; once Marathon verifies they are up and running, the old versions are taken out of rotation and Mesos tears them down
- Once the deployment is finished, the 'Cerebro' daemon stops watching it and New Relic takes over monitoring (a polling sketch follows)
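A sketch of what such a deploy watcher could look like, assuming it polls Marathon's GET /v2/deployments endpoint until nothing is in flight; this is not Yodel's actual Cerebro code, and the host is a placeholder:

    # Sketch: wait until Marathon reports no in-flight deployments, then stop
    # watching and leave monitoring to New Relic.
    import json
    import time
    import urllib.request

    def wait_for_deployments(marathon="http://marathon.example.com:8080"):
        while True:
            with urllib.request.urlopen(marathon + "/v2/deployments") as resp:
                if not json.load(resp):          # empty list -> nothing in flight
                    return
            time.sleep(5)

    wait_for_deployments()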
Deployed outside of the Mesos clusters
- Databases are mapped outside of Mesos
- Legacy services, e.g. the Thrift/legacy stack that uses Curator over ZooKeeper
Alternatives to Mesos
CoreOS Fleet
Developer environment
- Docker container
- No Mesos in DEV, but the same container is deployed over to the Mesos cluster
Q & A
- Why Qubit Bamboo and HAProxy?
HAProxy is used for the canary deployments:
- traffic weighting for canary partial
- a request header for canary isolation (a client-side sketch appears at the end of this Q&A)
Alternatives to Qubit Bamboo
- consul-template could be used instead of Qubit Bamboo
- Production topology
- 4 Mesos slaves and 4 Mesos masters
- 4 machines
- 20 services, 3 instances each
- 60 Docker containers
- Chronos
- Being evaluated as a scheduler to replace cron
- Leader election is a horrible problem to solve with shell scripts
- Right now only long-running services run on Mesos
- Short-running batch processes (e.g. Hadoop) shouldn't conflict with them; the constraint is CPU/memory resources
- Reliable non-flaky deployments
- "Cerebro" - the in-house deployer itself runs as a Docker container through Mesos
- Microservice architecture - shared state is not a thing, since each service has its own database
- Case study: Production process was falling over (memory issue)
- Used canary isolated deployments to continuously deploy and roll back
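For the header-based canary isolation mentioned in the Q&A above, a client-side sketch; the notes only say a header is used, so the header name, URL, and the HAProxy rule behind it are hypothetical:

    # Sketch: a test client opting into the isolated canary instance by sending
    # a header that HAProxy is assumed to route on.
    import urllib.request

    req = urllib.request.Request(
        "http://example-service.internal:8080/health",   # placeholder URL
        headers={"X-Canary": "true"},                      # hypothetical header name
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.status, resp.read())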
Architecture