Automation@Brainly - Polish Linux Autumn 2014
1. Automation at Brainly
… or how to enter the world of automation in a “different way”.
2. World’s largest homework help social network, connecting over 25 million users monthly
OPS stack:
About Brainly
● ~80 servers, heavy usage of LXC containers
(~1000)
● 99.9% Debian, 1 Ubuntu host :)
● Nginx / Apache2, 2k reqs per sec
● 147 million page views monthly
● 700Mbps peak traffic
● Python is dominant
DEV stack:
● PHP
- Symfony 2
- SOA projects
- 200 reqs per sec on the Russian version
● Erlang
- 45k concurrent users
- 22k events per sec
● Native Apps
- iOS
- Android
3. Starting point
● Puppet was not feasible for us
- *lots* of dependencies which make containers bigger/heavier
- problems with Puppet's declarative language, order of parsing is important
- seemed incoherent, lacking integration of orchestration
- steep learning curve
- YMMV
- we're a Python shop ;)
● "packaging as automation" as an intermediate solution
- dependency hell, installing one package could result in uninstalling others
- inflexible, lots of code duplication in debian/rules file
- LOTS of custom bash and PHP scripts, usually very hard to reuse
and not standardized
- this was a dead end :(
● Ansible
- initially used only for orchestration
- maintaining it required keeping an up-to-date inventory, which later
simplified and helped with lots of things
4. First steps with Ansible
● we decided to move forward with Ansible and use it for setting up machines as
well
● first project was the Nagios monitoring plugins setup (check-growth,
available on GitHub BTW)
● turned out to be ideal for containers and our needs in general
- very few dependencies to begin with (python2, python-apt),
and a small footprint - "configured" Python modules are transferred
directly to the machine, no need for local repositories
- very light, no compilation on the destination host is needed
- easy to understand. Tasks/playbooks map directly to the actions
an ops/devops would have taken if they were doing it by hand
- compatible with "automation by packages". We were able to
migrate from the old system in small steps.
5. Avoiding regressions
● we decided to learn from our mistakes and do it right this time
● all policies, rules, and good practices written down in automation's repo main directory
● helps with introducing new people into the team or with the devops approach
- newbies are able to start committing to repo quickly
- what's in GUIDELINES.md is law; changing it requires wider consensus
- gives examples on how to deal with certain problems in a standardized way
● few examples:
- limit the number of tags, each of them should be self-contained
with no cross-dependencies.
- each branch should be named <your-name>/<branch-name>,
with master being in production
- do not include roles/tasks inside other roles,
this creates hard-to-follow dependencies
- NEVER subset the list of hosts inside the role, do it in site.yml.
Otherwise debugging roles/hosts will become difficult
- think twice before adding a new role and esp. new groups. As the infrastructure
grows, it becomes hard to manage and/or creates "dead code/roles"
- create reusable roles by moving common code into *_base roles (e.g. apache2_base).
6. Ugly-hacks reusability
● one of the policies introduced was storing one-off scripts in a
separate directory in our automation repo.
● these scripts are usually Ansible playbooks used just for one
particular task (e.g. the Squeeze->Wheezy migration). The rest are
Python, a few of them bash.
● version-control everything!
● turned out to be very useful; some of them proved useful
enough to be rewritten into a proper role or tool
7.
8. Apache2 automation
● available on GitHub and Ansible Galaxy:
https://galaxy.ansible.com/list#/roles/940
https://galaxy.ansible.com/list#/roles/941
● “base” role:
- is reused across 8 different production roles we have ATM
- contains basic monitoring, log rotation, package installation, etc…
- includes PHP setup in modphp/prefork configuration
- PHP disabled functions control
- basic security setup
- does not include any site-specific stuff
● “site” role:
- contains all site specific stuff and dependencies
(vhosts, additional packages, etc...)
- usually very simple
- more than one site role possible, only one base role though
● It is an example of how we make our roles reusable
9. Icinga
● automatically sets up monitoring based on the inventory and host groups
● implements the devops approach - if a dev has root on a machine, they also have
access to all monitoring related to that system
● automatic host dependencies based on host groups
● provisioning new hosts is no longer so painful ("auto-discovery")
● code reuse! - uses apache2_base to serve the interface
● all service configuration is stored as YAML files and used in templates
● the role uses DNS data from a YAML hash in order to make monitoring
independent of DNS failures
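The host-dependency idea above can be sketched in a few lines of Python. All group, host, and policy names here are invented for illustration; this only shows the pattern of deriving Icinga "parents" from inventory group membership, not Brainly's actual role.

```python
# Hypothetical sketch: generate Icinga host definitions from an
# Ansible-style inventory hash, deriving "parents" (host dependencies)
# from group membership. Names are made up for illustration.

INVENTORY = {
    "routers": ["gw1"],
    "webservers": ["web1", "web2"],
}

# assumed policy: every host in "webservers" depends on every router
GROUP_PARENTS = {"webservers": "routers"}

def render_hosts(inventory, group_parents):
    """Return Icinga host blocks, one per host, with parents filled in."""
    blocks = []
    for group, hosts in sorted(inventory.items()):
        parents = inventory.get(group_parents.get(group, ""), [])
        for host in hosts:
            lines = ["define host {", f"    host_name  {host}"]
            if parents:
                lines.append(f"    parents    {','.join(parents)}")
            lines.append("}")
            blocks.append("\n".join(lines))
    return "\n\n".join(blocks)

print(render_hosts(INVENTORY, GROUP_PARENTS))
```

In the real role the same data would be fed into a Jinja2 template instead of string concatenation, but the input (inventory plus host groups) is the same.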
10. DNS migration
● at the beginning:
- dozens of authoritative name servers, each of them having
customized configuration, running ~100 zones, all created by hand
- changing anything required lots of work and was error prone
- the main reason for that was using DNS for switching between primary/secondary
servers/services
● three phases:
- slurping configuration into Ansible
- normalizing the configuration
- improving the setup
● an example of how we can interface Python and Ansible to perform more complex stuff
● Python script which uses Ansible API to fetch normalized zone configuration from each server
- results available in a neat hash, with per-host, per-zone keys!
- normalization using named-checkconf tool
● results, after parsing, are stored in one big YAML file
● parsed/included by all modules requiring DNS data
* suboptimal, see the WIP section on the next slide
● use the slurped configuration to re-generate all configs, this time using only
the data available to Ansible
● "push-button" migration, after all recipes were ready :)
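The normalization step can be sketched as follows. This assumes the canonical `zone "…" { … };` shape that `named-checkconf -p` prints; a real config needs a proper parser, and the sample zones below are invented.

```python
import re

# Hypothetical sketch: turn `named-checkconf -p` output (BIND config in
# normalized form) into a per-zone hash, as described on the slide.
# The regex only covers the simple `zone "name" { ... };` shape.

ZONE_RE = re.compile(r'zone\s+"(?P<name>[^"]+)"\s*{(?P<body>[^}]*)};', re.S)

def parse_zones(config_text):
    """Return {zone_name: {"type": ..., "file": ...}} for each zone block."""
    zones = {}
    for m in ZONE_RE.finditer(config_text):
        body = m.group("body")
        type_m = re.search(r'type\s+(\w+);', body)
        file_m = re.search(r'file\s+"([^"]+)";', body)
        zones[m.group("name")] = {
            "type": type_m.group(1) if type_m else None,
            "file": file_m.group(1) if file_m else None,
        }
    return zones

sample = '''
zone "example.com" { type master; file "db.example.com"; };
zone "example.net" { type slave; file "db.example.net"; };
'''
print(parse_zones(sample))
```

Running this per host (via the Ansible API) and keying the results by hostname yields the per-host, per-zone hash mentioned above.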
11. DNS automation
● all zone transfers are signed, all hosts have individual keys
● checks/markets use DNS data directly in templates
● ACLs are tight, and auto-generated
● changing/migrating slaves/masters is easy, NS records are auto-generated
● updates to zones automatically bump serial, while still preserving the
YYYYMMDDxx format
● CRM records are auto-generated as well
* see next slide about CRM automation
● still WIP
- streamline the use of DNS data directly in playbooks by creating
a lookup plugin, no more "chicken and egg" problem
- new hosts should be automatically added to DNS
- reduce number of DNS servers ;)
- auto-generation of reverse zones
- opensourcing!
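The serial bump described above (auto-increment while preserving the YYYYMMDDxx convention) can be sketched like this; the cap of 99 updates per day is an assumption implied by the two-digit counter, not something stated in the talk.

```python
import datetime

# Sketch of the serial-bump rule: keep the YYYYMMDDnn format,
# incrementing nn for same-day updates and resetting it to 00
# when the date changes.

def bump_serial(old_serial, today=None):
    """Return the next zone serial in YYYYMMDDnn format."""
    today = today or datetime.date.today().strftime("%Y%m%d")
    old = str(old_serial)
    if old[:8] == today:
        counter = int(old[8:]) + 1
        if counter > 99:  # two digits only - assumed daily limit
            raise ValueError("more than 99 updates in one day")
    else:
        counter = 0
    return int(f"{today}{counter:02d}")

print(bump_serial(2014101502, today="20141015"))  # 2014101503
print(bump_serial(2014101502, today="20141016"))  # 2014101600
```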
12. Corosync & Pacemaker
● we have ~130 CRM clusters
● setting them up by hand would be "difficult" at best, impossible at worst
● available on Ansible Galaxy:
- https://galaxy.ansible.com/list#/roles/956
- https://galaxy.ansible.com/list#/roles/979
● follows pattern from apache2_base
- “base” role suitable for manually set up clusters
- “cluster” role provides a service on top of the base, with a few reusable
snippets and a possibility for more complex configurations
● automatic membership based on the Ansible inventory (no multicasts!)
● the most difficult part was providing synchronous handlers which update
the configuration without screwing up the cluster
● a few simple configurations are provided, like single service / single VIP
13. User management automation
● initially we had neither the time nor the resources to set up
a full-fledged LDAP,
● we needed:
- users should be able to log in even during a network outage
- adding/removing users, ssh-keys, custom settings, etc.
all had to be supported
- it had to be reusable/accessible in other roles
(e.g. Icinga/monitoring)
- different privileges for dev, production and other environments
- UID/GID unification
● turned out to be simpler than we thought - users are managed using a few
simple tasks and group_vars data. The rest is handled via variable precedence.
● migration/standardization required some effort though
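The precedence trick can be illustrated with plain dicts. All user names and settings below are invented; this mimics the role-defaults < group_vars < host_vars layering Ansible applies, not the team's actual tasks.

```python
# Illustration (names invented) of resolving per-environment user
# settings with simple dict precedence, mimicking Ansible's variable
# precedence: role defaults < group_vars < host_vars.

ROLE_DEFAULTS = {"shell": "/bin/bash", "sudo": False}
GROUP_VARS = {"dev": {"sudo": True}, "production": {}}
HOST_VARS = {"web1": {"shell": "/bin/zsh"}}

def resolve_user(host, group):
    """Later updates win, so host_vars override group_vars override defaults."""
    settings = dict(ROLE_DEFAULTS)
    settings.update(GROUP_VARS.get(group, {}))
    settings.update(HOST_VARS.get(host, {}))
    return settings

print(resolve_user("web1", "dev"))        # sudo enabled, zsh shell
print(resolve_user("db1", "production"))  # plain defaults
```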
14. Networking
● we are leasing our servers from Hetzner,
no direct Layer 2 connectivity
● all tunnel setups are done using Ansible, a new server
is automatically added to our network
● firewalls need to be tight, set up by Ansible as well
- OPS contribute the base firewall, DEVs can open
the ports of interest for their application
- ferm at its base
- WIP
15. Backups
● based on Bareos, opensource Bacula fork
● new hosts are automatically set up for backup,
extending storage space is no longer a problem
● authentication using certificates, a PITA without Ansible
16. Scaling markets
● we are present in 20 markets (and counting!), each of them growing
at its own pace
● providing configuration for each and every one of them would be
problematic, having one-size-fits-all would be inefficient
● "Market classes" - each size has its own set of params
● scaling markets up/down requires just one change and a playbook run
● done using group variables and inheritance hierarchies
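The "market classes" pattern boils down to one level of indirection. The class names, market codes, and parameter values below are all invented; the point is that scaling a market is a one-entry change.

```python
# Sketch of "market classes": every market is assigned a size class,
# and each class carries one set of tuning parameters. All numbers
# and names here are invented for illustration.

MARKET_CLASSES = {
    "small":  {"apache_workers": 5,  "php_memory_mb": 128},
    "medium": {"apache_workers": 20, "php_memory_mb": 256},
    "large":  {"apache_workers": 50, "php_memory_mb": 512},
}

# scaling a market up/down = changing one entry here + a playbook run
MARKETS = {"pl": "large", "ru": "large", "ro": "small"}

def market_params(market):
    """Look up the parameter set for a market via its size class."""
    return MARKET_CLASSES[MARKETS[market]]

print(market_params("ro"))  # small-class parameters
```

In Ansible terms, the class dicts live in group variables and the market-to-class assignment is the group membership, so the lookup happens through the normal inheritance hierarchy.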
17. Not everything is perfect
● Jinja2 template error messages are "difficult" to say the least. There is
no information about where the problem is or what its nature might be.
Often we have to bisect
● templates sometimes grow hugely complex
● Jinja2 is designed for speed, but with tradeoffs. We *really* miss some
Python operators, and creating custom plugins/filters poses some problems
● the same goes for Ansible roles - with more complexity, it would sometimes be
really handy to have Python around. Corosync handlers are a good example
● multi-inheritance, problems with 2-headed trees
- a bit ugly, but we had no other choice
● speed, improved with "pipelining=True"
● some useful functionality requires paid subscription (Ansible Tower)
- RESTful API, useful if you want to push a new application version
to production via e.g. Jenkins
- schedules - currently we need to push the changes ourselves
18. Dev,DevOps,Ops
● developers by default have RO access to repo, RW on case-by-case basis
● changes to systems owned by developers are done by developers,
OPS only provide the platform and tools
● all non-trivial changes require a Pull Request and a review from Ops
● in order to limit access to the most sensitive data (e.g. passwords,
certificate keys, etc...), we encrypt mission critical data with Ansible Vault
and push it directly to the repo
- *strong* encryption
- available to Ansible without the need for decryption
(password still required though)
- all security sensitive stuff can be skipped by developers with
the "--skip-tags" option to ansible-playbook
19.
20. Opensource! Opensource! Opensource!
● some of the things we mentioned can be found on our GitHub account
● we are working on opensourcing more stuff
https://github.com/brainly
21. Conclusions
● time needed to deploy new markets dropped considerably
● increased productivity
● better cooperation with developers
● more manpower - Devs are no longer blocked so much, and we can push
tasks to them
● infrastructure as code
● versioning
● code-reuse, less copy-pasting