Assisting a fellow Drupal agency to solve a deployment issue
The RCPCH (Royal College of Paediatrics and Child Health) is a professional organisation for paediatricians in the United Kingdom. It is in charge of paediatricians' postgraduate training and administers the Royal College of Paediatrics and Child Health examinations.
Code Enigma is RCPCH’s hosting provider, managing their AWS service for them. In fact, in our other RCPCH case study you can read about how we ensured the anticipated spike in traffic during exam time would not crash the site, even when their budget was tight.
NDP Studio, another specialist Drupal development agency we often partner with on projects, is responsible for the site’s Drupal development. Almost always, scheduled deployments happen smoothly without issue. However, on this rare occasion during a minor upgrade, the deployment mysteriously failed. RCPCH asked us to investigate the deployment problems and we were happy to assist.
Usefully, RCPCH has a pre-live environment that we were able to use for final testing. However, as this lives on the same app servers as the live site, any builds had to take place out of hours.
The challenge was that the pre-live environment was built on the main autoscale cluster. It was therefore not possible to run a mock deployment through to pre-live without disrupting the live site. RCPCH receives global traffic, so it was difficult to choose a specific time when this would be feasible.
Deploying to the pre-live environment would indeed have restarted the same services as were serving the production build. However, as this was a cluster and the restarts happen sequentially, outages were nearly undetectable.
The ideal situation would have been to have two separate environments, but in the immediate term the actual risk to production was fairly low.
What we did
The code in the pre-live branch differed from that in the live branch. As such, we renamed the existing pre-live branch and replaced it with a copy. This way, we were able to sync the live database down to the pre-live environment, then run an accurate test deployment before proceeding with the next live deployment.
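As a rough illustration of that preparation step, the sketch below builds the commands it might involve. It is not the exact procedure we ran: the branch names, the Drush site aliases and the choice of `source_branch` are all hypothetical, and the function only assembles and prints the commands rather than executing them.

```python
import shlex
from typing import List


def prelive_refresh_commands(
    prelive_branch: str = "prelive",  # hypothetical branch name
    source_branch: str = "live",      # assumption: which branch the copy is taken from
    live_alias: str = "@live",        # hypothetical Drush site alias
    prelive_alias: str = "@prelive",  # hypothetical Drush site alias
) -> List[List[str]]:
    """Return the commands to park the old pre-live branch, recreate it
    from another branch, and copy the live database down to pre-live."""
    backup_branch = f"{prelive_branch}-backup"
    return [
        # Keep the old pre-live branch under a new name instead of deleting it.
        ["git", "branch", "-m", prelive_branch, backup_branch],
        # Recreate the pre-live branch as a copy of the source branch.
        ["git", "branch", prelive_branch, source_branch],
        # Copy the live database into the pre-live environment.
        ["drush", "sql:sync", live_alias, prelive_alias, "--yes"],
    ]


if __name__ == "__main__":
    # Dry run: print the commands for review rather than running them.
    for command in prelive_refresh_commands():
        print(shlex.join(command))
```

Printing the commands first makes it easy to review them before anything touches a shared cluster.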
We ran the overnight testing and checked the Jenkins build output. On investigation, a few Composer packages were throwing warnings, but the main error seemed to occur when the Drupal caches were rebuilt.
From the media entity database updates in the build output, it looked like the deployment was trying to handle the switch from using the media contrib module to using media in Drupal core.
We have seen deployment issues like this before where multiple, incremental deployments have worked fine on development or stage environments, but when those changes have been merged through to the live site in one combined deployment, things fail due to the sequencing of updates and cache clearing.
As such, it's a good idea to test these deployments on a staging environment first, after syncing the live database down to give an accurate replica of the live environment.
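To make the sequencing point concrete, here is a minimal sketch of a combined deployment run as explicit, ordered steps that stops at the first failure, so the failing step is obvious in the build output. The step order, the `@prelive` alias and the helper names are assumptions for illustration, not RCPCH's actual Jenkins job, and the `--no-cache-clear` option depends on the Drush version in use.

```python
import subprocess
from typing import Callable, List, Optional, Sequence

Command = Sequence[str]


def deploy_plan(alias: str = "@prelive") -> List[List[str]]:
    """Ordered Drush steps for one combined deployment (illustrative order)."""
    return [
        # Apply pending database updates without clearing caches mid-run.
        ["drush", alias, "updatedb", "--no-cache-clear", "--yes"],
        # Rebuild caches so the updated code and schema are picked up.
        ["drush", alias, "cache:rebuild"],
        # Import the configuration shipped with the release.
        ["drush", alias, "config:import", "--yes"],
        # Rebuild caches again after the configuration has changed.
        ["drush", alias, "cache:rebuild"],
    ]


def _run(command: Command) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(command)).returncode


def run_plan(
    plan: Sequence[Command],
    runner: Callable[[Command], int] = _run,
) -> Optional[Command]:
    """Run each step in order; return the first failing step, or None."""
    for command in plan:
        if runner(command) != 0:
            return command
    return None
```

Passing a fake `runner` lets the ordering be checked without touching a real site; on a staging environment the default runner would execute the steps for real.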
We’re happy to report that we were able to fix the issue and get RCPCH upgraded to Drupal 8. More recently, this has allowed them to easily upgrade to Drupal 9.
Plus, with exam time approaching, we've increased the minimum number of servers for the main website so the site will withstand the inevitable traffic spike during the exam application period. We look forward to helping RCPCH in their mission to educate.