Monitoring and Improving
Our newest employee, Joe McMahon, recently wrote on the Reclaim blog about the excellent monitoring setup he's been working on using a product called Observium. Until now, a lot of email notifications for various services had been coming directly to me. It got rather noisy, and it wasn't sustainable to rely on one person to notice when things were amiss. We needed to get better at anticipating issues, which is why we made better monitoring a priority when we began working with Joe. One of the great things about the new setup is that server issues now report to Slack instead of via email. Slack has proved itself invaluable to our company and really helps filter out the noise (not to mention it doubles as an excellent archive to search against).
By knowing about issues as they come up in Slack, we can proactively work on things long before someone notifies us asking if we're having problems. There's still work to be done to tune the notifications and make sure we know about everything that's happening, but we're making headway, and just today I added another integration that's going to be a huge improvement for us.
We use a product called R1Soft for off-site backups on our servers. It's been an excellent tool with low server overhead that quietly backs up all files and databases to a separate server every night. I can't tell you how many times folks are pleasantly surprised to learn that we can easily restore lost work and get them back up and running. While in theory backups are the responsibility of all of us individually, we take the approach at Reclaim that, when possible, we'd rather this kind of stuff just happen automatically so that you don't have to think about it. Luckily the solution has been affordable enough for us to absorb the cost, and it's excellent peace of mind for us as well.
Occasionally, as with all software, R1Soft will have issues with a backup. Perhaps a firewall change on the server, a full disk, or any number of other factors will cause a server to stop backing up regularly. I would log in and see a screen similar to this:
Those red numbers are never good. And to make it worse, if I wasn't checking every day, I might find out only after the fact that a server hadn't been backed up in quite a while. That's bad for everyone. We needed better notifications in order to know when things were going wrong and take action on them.
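To make the "hadn't been backed up in quite a while" problem concrete, here is a minimal sketch in Python of the kind of check that is tedious to do by eye. It is not how R1Soft works internally. The function name, the server names, and the 36-hour threshold are all assumptions made up for illustration.

```python
from datetime import datetime, timedelta


def stale_servers(last_success, now, max_age=timedelta(hours=36)):
    """Return the names of servers whose last good backup is older than max_age.

    last_success maps a server name to the time its last successful backup
    finished. The 36-hour default is an assumed threshold for nightly backups.
    """
    return sorted(
        name for name, finished in last_success.items()
        if now - finished > max_age
    )


if __name__ == "__main__":
    # Hypothetical data: one server backed up last night, one missed several nights.
    now = datetime(2024, 1, 10, 9, 0)
    last_success = {
        "server1.example.com": datetime(2024, 1, 10, 2, 0),
        "server2.example.com": datetime(2024, 1, 6, 2, 0),
    }
    print(stale_servers(last_success, now))
```

The threshold is deliberately a bit longer than 24 hours so that a nightly job that simply runs late does not trigger a false alarm.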
So today I worked on that issue. Building on the work Joe had done to write to a monitoring channel in Slack using their email integration, I set up SMTP on the R1Soft servers so they start sending us reports if any servers failed to back up the previous night.
Slack makes integrations like this incredibly easy. You do have to have a paid account in order to use email integrations, but setting one up is as simple as them handing you an email address and off you go. You can customize where the messages go and who they show up from, even the avatar of the user. No special scripts, just a simple email address to plug into whatever software you're working with.
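In our case R1Soft sends the reports itself once SMTP is configured, so no code was needed. For a tool that has no built-in email reporting, the same idea can be sketched in a few lines of Python. The addresses, the SMTP host and port, and the function names below are placeholders and assumptions, not real values. The actual address is whatever Slack hands you for the integration.

```python
from email.message import EmailMessage
import smtplib

# Hypothetical placeholders: use the address Slack generates for your integration.
SLACK_ADDRESS = "monitoring-channel@example.com"
SENDER = "backups@example.com"


def build_report(failed_servers):
    """Build a plain-text email listing the servers whose backup failed."""
    msg = EmailMessage()
    msg["From"] = SENDER
    msg["To"] = SLACK_ADDRESS
    msg["Subject"] = f"Backup failures: {len(failed_servers)} server(s)"
    msg.set_content("\n".join(f"- {name}" for name in failed_servers))
    return msg


def send_report(msg, host="localhost", port=25):
    """Hand the message to an SMTP relay (host and port are assumptions)."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(msg)


if __name__ == "__main__":
    # Only builds and prints the message; call send_report(report) to deliver it.
    report = build_report(["server1.example.com", "server2.example.com"])
    print(report)
```

Anything that can send mail can post to the channel this way, which is why the integration needs no Slack-specific scripting.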
The biggest lesson I'm learning as we continue to refine our monitoring processes is that you can't proactively make systems better until you have clear insight into what's going on. Too often I find myself in a reactive stance, putting out fires, when instead I want to know when things are starting to heat up. Turns out the age-old saying rings truer than ever: Knowledge is Power.