Businesses need IT to be dynamic and responsive, especially at this time of year, which makes it exactly the wrong time to try to prevent changes.
The last few weeks of the year are a hectic time, but in IT Operations, things should in theory be nice and quiet. This is because many businesses impose some form of change freeze across all production systems during the period from Thanksgiving until the New Year.
On paper, all changes are postponed until January to avoid disrupting operations during a pretty critical time of year. In practice, there are a number of exceptions to the freeze, but the overall rate of change should still be lower.
The reason for this is to avoid the nightmare scenario in which what should be a routine change impacts production, and the company misses out on major Black Friday sales, fails to close its books by the end of the quarter, or suffers whatever your personal holiday nightmare would be.
Changes are risky for two reasons: the change itself may be buggy, or its implementation may introduce (or reveal) issues. If you’re in retail and you roll out a new version of your shopping cart software just prior to Black Friday, you had better be absolutely certain that it works and performs under load. Maybe you should roll it out at a quieter time, watch how it behaves under moderate load, and then decide whether it can handle what is probably the highest load of your entire year.
Even assuming the code itself is perfect, what if the roll-out is imperfect? Remember Knight Capital? They botched a deployment by missing one target server out of eight, and as a result, they are no longer a going concern.
Or what if you deploy what should be a routine patch, reboot to apply it, and the database server doesn’t come back up? Now you have a problem, and it’s not just a technical problem either. ITOps is probably under-staffed at this time of year, as people take time off from work to visit their families.
This is the sort of scenario that keeps hardened sysadmins awake at night — which is why ITIL codifies change freezes to avoid those nightmares coming true.
Here’s the problem: That precautionary approach no longer works. If your organization runs on CI/CD, it’s actually not a great idea to interrupt the pipeline. Also, the business may require changes to be made on the fly, in response to customer demand patterns or surprise moves by competitors. If a new zero-day vulnerability hits, the calculus of whether to patch it promptly has to include the expectation that it will be exploited almost immediately, not sometime in the nebulous future.
Doing this sort of thing during a change freeze is technically an exception, but when every supposed freeze includes multiple exceptions, you are better off recognizing that changes will happen, and working out how to mitigate their impact — without an unrealistic attempt to freeze the entire business for weeks at a time.
There are three main areas for IT Operations to focus on in order to accommodate constant change: architecting application infrastructure for change; planning and de-risking changes so that they can be executed as safely as possible; and, when something does go wrong, catching it early so that it does not trigger cascading outages. In short:
Build change into the architecture itself;
Plan changes around the current state of the environment;
Catch any unforeseen consequences of a change early.
If making changes to your application feels like swapping engines on an airplane mid-flight, it’s reasonable to be nervous. It shouldn’t be that way; the analogy should be more like Disneyland, with service tunnels everywhere so that any issues can be dealt with immediately without disrupting visitors’ experience.
Nobody would recommend hot-patching your production server; instead, run it on a cluster, such that you can drop nodes and patch them one at a time while keeping the overall application up and running. Ensure that you have automated fail-over configured for any critical bottlenecks. Automate everything, and test the automation before you need it — perhaps even going as far as unleashing an army of monkeys in your environment every now and then, just to make sure you can cope.
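To make that one-node-at-a-time pattern concrete, here is a minimal Python sketch. The helper functions (drain_node, patch_node, node_healthy, restore_node) are hypothetical stand-ins for whatever orchestration tooling you actually use; the point is the shape of the loop, not the specific API.

```python
import time

# Hypothetical stand-ins for your orchestration/automation tooling;
# each one just simulates the step it names.
def drain_node(node):
    print(f"draining {node} from the load balancer")

def patch_node(node):
    print(f"patching {node}")

def node_healthy(node):
    print(f"health-checking {node}")
    return True

def restore_node(node):
    print(f"returning {node} to the pool")

def rolling_patch(nodes, settle_seconds=60):
    """Patch one node at a time so the application stays up throughout."""
    for node in nodes:
        drain_node(node)
        patch_node(node)
        if not node_healthy(node):
            # Halt immediately: one node out of rotation is better than
            # a bad patch rolled across the whole cluster.
            raise RuntimeError(f"{node} failed its health check; halting roll-out")
        restore_node(node)
        time.sleep(settle_seconds)  # let metrics settle before touching the next node

rolling_patch(["web-1", "web-2", "web-3"], settle_seconds=1)
```

The same structure applies whether the "patch" is an OS update, an application deploy, or a configuration change: drain, change, verify, restore, and stop the moment verification fails.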
ITIL assumes that the current state of the IT environment can be known at any given time. This may have been true once, but the rate of change of IT infrastructure, and the number of human and non-human actors involved in making those changes, have long exceeded any capability to construct an exhaustive model of enterprise IT environments.
If there is no longer an authoritative “single source of truth”, any departure from which can be tracked as a change, processes based on that assumption – such as the whole idea of a change freeze – start to break down.
A better way is to use algorithms to look for patterns in real-time events. The emerging discipline of AIOps takes this approach, avoiding any dependency on a static model that is almost certainly out of date by the time it is consulted.
The last thing you want is to make a change that should be safe, but turns out not to be, because of some undocumented change in production. Automating all changes, as per the previous point, will help — but working with real-time data ensures that required changes can be planned with a more complete and up-to-date understanding of the true state of the IT environment.
From an AIOps point of view, a change is itself just another event, which can in turn be correlated with events from availability and performance monitoring tools. The idea is to spot any unexpected events that appear in the wake of a change and rapidly determine whether they represent unforeseen consequences of that change.
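As a rough illustration of that correlation idea, here is a minimal Python sketch assuming a deliberately simplified event shape: treat the change as an event, then flag monitoring events on the same service that arrive within a short window afterwards. Real AIOps correlation uses far richer context and clustering than a fixed time window and an exact service match.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Simplified event shape; real platforms ingest much richer payloads.
@dataclass
class Event:
    timestamp: datetime
    source: str       # e.g. "deploy" or "monitoring"
    service: str
    description: str

def suspect_events(change, events, window=timedelta(minutes=30)):
    """Return monitoring events that look like consequences of a change:
    same service, occurring within `window` after the change."""
    return [
        e for e in events
        if e.source == "monitoring"
        and e.service == change.service
        and change.timestamp <= e.timestamp <= change.timestamp + window
    ]

# Example: a cart deployment followed by a latency alert on the same service.
change = Event(datetime(2019, 11, 29, 9, 0), "deploy", "cart", "cart v2.3 rolled out")
alerts = [
    Event(datetime(2019, 11, 29, 9, 12), "monitoring", "cart", "p99 latency above SLO"),
    Event(datetime(2019, 11, 29, 9, 14), "monitoring", "search", "disk usage warning"),
]
print(suspect_events(change, alerts))  # only the cart latency alert is flagged
```

Only the alert on the changed service within the window is flagged as a likely consequence; everything else stays in the general event stream.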
This holistic view enables IT Operations teams to understand very early on when a change might be causing problems so that they can move quickly to resolve or at least mitigate the issue. Ideally, a problematic deployment can simply be rolled back, but if not, maybe traffic can be rerouted away from impacted systems or other standby plans put into effect.
Business moves fast, and IT must move just as fast to avoid becoming a bottleneck, or worse, a liability. Last century it might have been reasonable to institute a change freeze, because the goal was making sure that relatively simple IT systems — cash registers and bar-code scanners — would continue to operate. By and large, if they were not interfered with, they would just keep ticking along.
These days, the business operates at a different pace, and complexity has increased by orders of magnitude, especially with online sales growing to an ever-greater share of the total. Far from being a reasonable safety precaution, a change freeze represents a business vulnerability — and an unnecessary one. Implementing new discounts or promotions or responding to outside events is going to require IT changes. The speed with which those changes can be accommodated directly determines the agility of the business. That means change must be routine, not an unexpected exception that triggers emergency processes.
Change is normal and it’s all around us, all the time. IT processes must accommodate it, and so must the tools that implement those processes. Moogsoft AIOps is ready to thaw out your frozen IT, even at this cold time of year.
Moogsoft is the AI-driven observability leader that provides intelligent monitoring solutions for smart DevOps. Moogsoft delivers the most advanced cloud-native, self-service platform for software engineers, developers, and operators to instantly see everything, know what’s wrong and fix things faster.