Managing Servers Series (2/3)

Part Two

Managing your servers is like managing a nightclub. We have things in place to keep unsavoury people out (security). We have the best DJs with the latest tunes to make sure everyone has a good time (server management). And we regularly check in to make sure things are running smoothly, the drinks are flowing and the disco ball is spinning (monitoring).

Security

We take the security of your data and service very seriously. That's why, as a Drupal hosting company, we led the charge in gaining ISO 27001 certification. In this section we describe some of our approaches to maintaining a secure infrastructure. We start with an AWS Virtual Private Cloud (VPC) in front of everything. A VPC provides a private, segregated network environment for your layouts and applications and is loaded with features, such as:

granular visibility and control over traffic rules
industry-leading intrusion protection
built-in network profiling
throughput tailored to the amount of AWS compute resource you purchased
unlimited VPCs, for secure network separation
industry-standard site-to-site VPN capabilities

AWS have advanced data center-level attack mitigation services in place. If you believe you are at high risk of malicious activity, such as distributed denial of service (DDoS) attacks, there is a suite of AWS products you can consider, including AWS WAF and AWS Shield.

At a server level, we run advanced intrusion detection software on every Virtual Machine. This is a combination of:

rkhunter (for inspecting machines for malicious changes and scripts)
OSSEC (for real-time intrusion detection and automatic blocking)
ClamAV (for virus and malware detection; it can also be run as a daemon and integrated with your website file uploads, for additional peace of mind).

We also lock down all server access to our secure, multi-factor authenticated VPN endpoint and (if requested) client VPN / gateway IP addresses. Critical services are never exposed to the wider Internet; they are only accessible to Code Enigma and client staff.
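As a rough illustration of the allow-list idea described above, the sketch below checks whether a connecting address falls inside a set of permitted networks. The network ranges are documentation-only placeholders and the function name is hypothetical. In practice a rule like this would normally live in firewall or security group configuration rather than in application code.

```python
import ipaddress

# Hypothetical allow-list: placeholder documentation ranges, not real endpoints.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.10/32"),    # stand-in for a VPN endpoint
    ipaddress.ip_network("198.51.100.0/24"),  # stand-in for a client gateway range
]


def is_allowed(source_ip: str) -> bool:
    """Return True if source_ip belongs to any allow-listed network."""
    address = ipaddress.ip_address(source_ip)
    return any(address in network for network in ALLOWED_NETWORKS)


# Example: two permitted addresses and one outsider.
for candidate in ("192.0.2.10", "198.51.100.27", "203.0.113.5"):
    print(candidate, "allowed" if is_allowed(candidate) else "blocked")
```

Anything not explicitly listed is refused, which is the "deny by default" posture the paragraph above describes.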

We subscribe to security mailing lists for all the critical elements of our software stack. If there are any critical exploits released in the interim period between scheduled upgrades, a ticket is automatically raised in our system. Our 24/7/365 team will react immediately to formulate a mitigation plan, communicate with affected clients and implement any necessary emergency patching.

Finally, all production servers are protected by four factors of authentication:

username
password
SSH private key and
a physical additional factor (e.g. YubiKey or supported phone apps like Google Authenticator).

You can't log in to the command line of a Code Enigma production server without a valid username, the corresponding password, the corresponding SSH private key for that username and a one-time password generated by the YubiKey and/or phone associated with your username.
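To give a feel for how the one-time password part works, here is a minimal sketch of the time-based one-time password algorithm (TOTP, RFC 6238) that phone apps such as Google Authenticator implement. It is an illustration only, not our server-side code: the function names are our own, the secret in the example is the published RFC test value, and a YubiKey may use a different one-time password scheme.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HMAC-based one-time password (RFC 4226) using HMAC-SHA1."""
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation: low nibble picks the offset
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret: bytes, at: Optional[float] = None,
         step: int = 30, digits: int = 6) -> str:
    """Time-based one-time password (RFC 6238): HOTP over a time-step counter."""
    now = time.time() if at is None else at
    return hotp(secret, int(now // step), digits)


def verify(secret: bytes, submitted: str, at: Optional[float] = None) -> bool:
    """Compare the submitted code with the expected one in constant time."""
    return hmac.compare_digest(totp(secret, at), submitted)


# Example with the RFC test secret (never use a published secret for real).
print(totp(b"12345678901234567890"))
```

Because the code depends on both a shared secret and the current time, a stolen password or SSH key alone is not enough to log in.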

Server Management

At Code Enigma we use a server management system called Puppet to store the core configuration and software of our servers. We apply any configuration and software changes made in the server's manifest file to the servers in a controlled way, using a centrally controlled system called Puppetmaster. Using this we can ensure our backbone services stay up, while giving you the freedom to control other aspects of the stack.

After that, we use Ansible to keep an easily retrievable record of client customisations. Every server managed and/or operated by Code Enigma has a unique manifest file. This is stored in a version control system where changes are strictly tested and peer reviewed. The manifest represents the server's "DNA": a complete profile of the installed software and settings. Any changes to configuration get added to the server's manifest file. This way, if we have a disaster recovery scenario or someone accidentally breaks something, we can just run that server's Ansible manifest on it and everything gets put back how it was.

All this has numerous advantages:

guards against human error
allows for easy and automatic duplication of servers for scaling out applications
allows for rapid disaster recovery
records historical change of server configurations for forensic analysis
makes platform configuration edits simple and safe.
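The principle behind re-running a manifest to put everything back can be sketched as a comparison between the desired state and the actual state. This is not Puppet or Ansible code, and the setting names and values are made up for illustration; it simply shows why applying the same description twice is safe.

```python
def plan_changes(desired: dict, actual: dict) -> dict:
    """List what must change for the actual settings to match the desired ones."""
    to_set = {
        key: value
        for key, value in desired.items()
        if key not in actual or actual[key] != value
    }
    to_remove = sorted(key for key in actual if key not in desired)
    return {"set": to_set, "remove": to_remove}


def apply_changes(actual: dict, plan: dict) -> dict:
    """Return a new settings dict with the plan applied."""
    converged = dict(actual)
    converged.update(plan["set"])
    for key in plan["remove"]:
        converged.pop(key, None)
    return converged


# Hypothetical example: someone changed a version and left a debug flag on.
desired = {"php_version": "8.1", "max_upload_mb": 64}
actual = {"php_version": "7.4", "max_upload_mb": 64, "debug": True}

plan = plan_changes(desired, actual)
repaired = apply_changes(actual, plan)
print(plan)
print(plan_changes(desired, repaired))  # a second run finds nothing to do
```

The second run producing an empty plan is the property that makes this approach useful for disaster recovery: the description can be applied as often as needed without side effects.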

The same is true of the actual infrastructure, whose configuration is also kept in code, using the Terraform system from HashiCorp. This affords us change control and ease of management over infrastructure as well as software.

Patching and backup schedules

Every server goes on our patching calendar, and every week gets manually upgraded by a trained engineer, to ensure we're running the very latest stable packages. On the odd occasion an upgrade requires a system reboot, highly available layouts (as described in part 1 of this series) are patched in such a way as to ensure zero downtime. Layouts with single points of failure have their downtime scheduled with the client via support ticket.

Servers are also backed up in two different ways. We take local backups of databases, should there be any corruption or data loss. And we back up your entire server offsite to Rackspace Cloud Files. This offsite backup is encrypted with an encryption key unique to your organisation and securely stored by us. We use the open source software Duplicity to provide this offsite backup functionality. All backups run nightly, and Duplicity offsite backups are restored and checked for integrity quarterly. The results are stored in our ticketing system so there's an audit trail.
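One simple way to think about an integrity check on a restored backup is to compare file digests recorded when the backup was taken with digests of the restored files. The sketch below illustrates that idea only; it is an assumption for the sake of example, not a description of how Duplicity or our quarterly checks work, and the function names and file paths are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def tree_digests(root: Path) -> dict:
    """Map each file path (relative to root) to its SHA-256 digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            digests[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def compare_digests(recorded: dict, restored: dict) -> dict:
    """Report files that are missing, unexpected or changed after a restore."""
    shared = set(recorded) & set(restored)
    return {
        "missing": sorted(set(recorded) - set(restored)),
        "unexpected": sorted(set(restored) - set(recorded)),
        "changed": sorted(p for p in shared if recorded[p] != restored[p]),
    }


# Self-contained demonstration using a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "db").mkdir()
    (root / "db" / "dump.sql").write_text("CREATE TABLE example (id INT);")
    recorded = tree_digests(root)   # digests noted at backup time
    (root / "db" / "dump.sql").write_text("corrupted")
    restored = tree_digests(root)   # digests of the simulated restored copy

report = compare_digests(recorded, restored)
print(report)
```

A report with three empty lists would mean the restored copy matches what was backed up; anything else is worth a ticket.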

Monitoring

Our engineers monitor your servers around the clock via our two-vector monitoring servers. Our monitors are based at different ISPs, to reduce the likelihood of false positives or of the system being rendered useless by an ISP outage. They monitor all Code Enigma hosted servers in real time using the open source Nagios infrastructure monitoring system. This covers all Linux systems and stack software, with sensible limits set to allow our engineers time to react. Alerting occurs via phone apps, SMS, an IRC bot in our company channel and email. Where we can't use Nagios (for example, RDS) we use AWS CloudWatch alerts, delivered by email and Pushover, instead, so we are alerted to any critical issues with AWS services as well.
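The reasoning behind two monitoring vantage points can be sketched as a small decision rule: if both monitors agree, trust them; if they disagree, suspect the monitor or its network path before blaming the server. The rule and names below are an assumed simplification for illustration, not the actual Nagios configuration.

```python
from enum import Enum


class Verdict(Enum):
    OK = "ok"
    ALERT = "alert"
    SUSPECT_MONITOR = "suspect_monitor"


def combine(monitor_a_sees_up: bool, monitor_b_sees_up: bool) -> Verdict:
    """Combine the views of two independent monitors into one verdict."""
    if monitor_a_sees_up and monitor_b_sees_up:
        return Verdict.OK
    if not monitor_a_sees_up and not monitor_b_sees_up:
        # Both vantage points agree the service is down: a genuine alert.
        return Verdict.ALERT
    # The monitors disagree: likely an ISP or monitor problem, not the server.
    return Verdict.SUSPECT_MONITOR


# Example: one monitor has lost its route to the server.
print(combine(True, False))
```

With a single monitor, the third case would be indistinguishable from a real outage.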

We use Munin and AWS CloudWatch to monitor servers over time, watching key indicators so we can advise you when your system is nearing capacity. This keeps us proactive about resource and capacity, so you're never "surprised" by additional resource requirements. It's also useful for forensic investigation. If a software package crashes inexplicably, it'll give us a detailed picture of server behaviour up to the point of the crash. Key monitoring features are:

Nagios alerting of all Linux services and server health, including regex checks across your website
CloudWatch alerting of all AWS services and their health
Munin graphical monitoring of the server resource
Two-vector monitoring set-up (so we also monitor the monitors to help eliminate false positives)
Drupal monitoring via Nagios available, if applicable.

We also monitor all websites via StatusCake, an online monitoring service, to help keep our own monitor nodes "honest".
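The capacity trending mentioned earlier in this section can be illustrated with a simple straight-line fit over recent usage samples. This is a minimal sketch with made-up numbers and a hypothetical function name; it is not how Munin or CloudWatch work internally, and real usage rarely grows in a perfectly straight line.

```python
def days_until_full(samples, capacity=100.0):
    """Estimate days from the last sample until usage reaches capacity.

    samples is a list of (day, percent_used) pairs. Returns None when there
    is too little data or usage is not growing.
    """
    count = len(samples)
    if count < 2:
        return None
    mean_x = sum(x for x, _ in samples) / count
    mean_y = sum(y for _, y in samples) / count
    variance_x = sum((x - mean_x) ** 2 for x, _ in samples)
    if variance_x == 0:
        return None
    # Least-squares slope and intercept of the usage trend.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / variance_x
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    full_day = (capacity - intercept) / slope
    return max(0.0, full_day - samples[-1][0])


# Hypothetical disk usage: growing two percentage points per day.
usage = [(0, 50.0), (1, 52.0), (2, 54.0), (3, 56.0)]
print(days_until_full(usage))
```

An estimate like this is what lets a conversation about extra resource happen weeks ahead rather than on the day the disk fills up.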

In our final part of this series, we look at the tools we provide to help you manage complex server layouts and systems…