Ansible and Configuration Management at Reclaim Hosting
When we started Reclaim Hosting it was with a lot of hopes and dreams and a single server. Hard to believe we only had a single server to manage for the first ~8 months of Reclaim's existence (and Hippie Hosting was so small that it only ever had a single shared server for all customers). Those days, however, are long gone: today Reclaim Hosting manages a fleet of over 100 servers for customers and institutions across the globe. As you can likely imagine, it's been a hell of a learning curve to get to a point where we're comfortable with such a large infrastructure.
Unlike a high-availability setup where you might have a bunch of servers spinning up and down without ever logging into a particular one, with web hosting each server is managed individually. SSH keys and a shared password management system go a long way toward alleviating headaches with access. But the biggest hurdle has been configuration management, and I finally feel we're starting to get to a place where I'm comfortable with it (I think it will always be a process). When I say configuration management I mean keeping track of and managing everything from the versions of PHP installed, to which system binaries (like git) are installed and available, to upgrades of software over time, and things of that nature. With an infrastructure this large, currently split across 3 different companies, many different datacenters, and even a few different operating systems and versions, it's inevitable to find that what you thought was running on one server is actually out of date or never got installed at the time of provisioning.
There are two things I've done thus far to tackle this issue. The first was to take a hard look at our provisioning process. In the past this was completely manual: fire up a server and start working through a list of things that needed to be installed. If I normally installed something but forgot to mention it to Jim, there's a great chance it wouldn't get installed. And if the coffee hasn't kicked in, the whole system by its very nature is prone to user error and incomplete builds, not to mention the process could take anywhere from a few days to a week if other stuff got in the way. It may seem simple, but I created a bash script as a straightforward way to automate the deploy process. There are a few prerequisites that have to be installed before running it (namely git, since the script lives in a private repo, and screen, so we can run it with the session closed), but what used to be a process measured in days now completes the majority of the work in about an hour and a half. Here's everything the script does thus far (a rough sketch of its shape follows the list):
- Install git, screen, nano, and ImageMagick
- Install cPanel
- Run through some first-time setup options for cPanel configuration
- Compile PHP with all extensions and Apache modules we need
- Install Installatron and configure settings
- Install ConfigServer scripts (firewall, mail management, exploit manager) and configure them with our settings
- Update php.ini with approved values
- Install Let's Encrypt
- Install Bitninja (a distributed firewall product we use)
- Set up the custom cPanel plugin Reclaim Hosting uses for application icons
- Configure automatic SSL certificates
- Reboot server
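The actual script lives in a private repo, so what follows is only a rough sketch of its shape: the cPanel installer step comes from cPanel's public documentation, and everything else is reduced to placeholder comments.

```bash
#!/bin/bash
# deploy.sh -- rough sketch only, not the actual Reclaim Hosting script
set -e

# Base utilities on every box
yum -y install git screen nano ImageMagick

# cPanel's documented installer (this step accounts for most of the runtime)
cd /home
curl -o latest -L https://securedownloads.cpanel.net/latest
sh latest

# ... then: first-time WHM settings, the PHP/Apache build, Installatron,
# ConfigServer scripts, Let's Encrypt, Bitninja, the custom cPanel plugin,
# and AutoSSL configuration, each scripted the same way ...

# Finish with a clean reboot
reboot
```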
Mostly my process has been working backward from the things we used to install and configure by hand and finding the commands that accomplish the same result (installation is always done on the command line, but configuration often happens in the GUI, so finding the right way to script those settings was more time consuming and sometimes involved editing files directly with sed). There's more that could be done, and again I treat this all as a process we can continue to refine, but this has gone a long way toward making that initial hurdle of setting up servers a "set it and forget it" approach.
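For settings that cPanel only exposes through the GUI, the trick is usually to track down the underlying config file and edit it in place. A hypothetical example of that sed approach for the php.ini step above (the path and values are illustrative, not our actual approved settings):

```bash
# Illustrative only: raise upload limits in php.ini in place (-i), keeping a .bak copy.
# The php.ini path varies by cPanel/EasyApache version; adjust to match the server.
sed -i.bak \
  -e 's/^upload_max_filesize = .*/upload_max_filesize = 64M/' \
  -e 's/^post_max_size = .*/post_max_size = 64M/' \
  /usr/local/lib/php.ini
```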
So with our deployment process becoming more streamlined, the second piece of the puzzle was to get a handle on the long-term configuration of servers, the various changes and settings that we have to manage. If Apache is getting an upgrade, it should be tested on a dev server and, with approval, pushed to all servers to avoid any mismatch. In the past that meant me opening up a bunch of tabs and getting to work, but that's not a scalable approach. I've taken to learning more about Ansible as a configuration management tool and it has already saved me countless hours.
Ansible doesn't require any agent to be installed on our machines; it uses SSH access to run its commands. Commands are put together in "playbooks," which are nothing more than .yml text files. Built-in modules handle everything from installing software with yum and moving files back and forth to more complex tasks, and people can write and share their own roles (reusable bundles of tasks), so there is a lot out there already for popular approaches to configuring and managing servers. At this point you might be thinking "Well if it can do all that, why did you write a bash script instead of using Ansible for deploying servers?" and you're not wrong, Walter. Long term it probably does make sense to do that, but Ansible playbooks have a very specific way of doing things that's meant to be replicable across a lot of servers, and frankly it would require a lot of work to rewrite the deployment process that way, so it's a goal but not a major issue in my eyes.
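To make that concrete, here's the kind of minimal playbook I mean, using the stock yum and copy modules (the group name, package, and file paths are placeholders rather than anything from our actual setup):

```yaml
---
# example-playbook.yml -- minimal sketch, not Reclaim's actual configuration
- hosts: cpanel_servers   # placeholder inventory group
  become: yes
  tasks:
    - name: Make sure git is installed
      yum:
        name: git
        state: present

    - name: Push a standard config file to every server
      copy:
        src: files/example.conf
        dest: /etc/example.conf
        owner: root
        mode: '0644'
```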
Now with Ansible, if I decide I want to roll out HTTP/2 support to our servers, I can write a small playbook that installs it via yum and then run that against our entire fleet. If a server already has support for it, Ansible doesn't make a change, so there's no harm in running playbooks across a variety of servers that may or may not share a common configuration in order to bring them all up to date. If anything, the biggest challenge is not writing the playbooks (which I actually enjoy), it's keeping the inventory file that holds data for all of our servers up to date. A dream would be to use the Digital Ocean API to dynamically update our inventory in Ansible, so that when a new server is added there it's automatically added to our inventory.
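The inventory itself is just a text file of hostnames organized into groups, and running a playbook against it is a one-liner, which also makes the test-on-dev-first workflow mentioned above easy. A sketch with made-up hostnames and playbook name:

```bash
# hosts.ini is a plain text inventory, e.g.:
#
#   [dev]
#   dev1.example.com
#
#   [production]
#   server1.example.com
#   server2.example.com
#
# Try the change against the dev group first, then roll it out to the whole fleet
ansible-playbook -i hosts.ini enable-http2.yml --limit dev
ansible-playbook -i hosts.ini enable-http2.yml
```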
I'm confident that developing these configuration management processes will help us ensure our customers benefit from the latest software and better support, with a standard environment they can count on from one server to the next. And if it saves us time in the process to devote to new development work, even better.