By Rajani Kolli, VP Global Cloud Operations
If your SaaS development is limited to client software, you may be a “box hugger.”
The biggest mistake many software companies make when embarking on a Software-as-a-Service (SaaS) business model is focusing software development efforts on the client software, while relegating the server and infrastructure to “IT” as strictly a software distribution activity. But savvy SaaS companies realize that efficiency, performance, and reliability are dictated in the server room: the infrastructure must be coded. There is no other way to make network operations, and the entire customer experience, predictable and repeatable. The opposite of coding the infrastructure is what I call “just throw people at the problem,” aka “box hugging.” That is, if network operations consists merely of logging into “boxes” to configure them, install packages, start and stop services, or do maintenance, you are a box hugger. Coding the infrastructure requires treating automation artifacts (e.g., shell scripts, Ansible playbooks, Fabric scripts) and configuration as code, not as check boxes. If there is no repeatable code to bring the bare infrastructure to a desired operational state, then you are a box hugger.
Box hugging is a bad habit, and it is bad for the business. It makes recovery from failures time-consuming. It does not scale. Most fat-finger and self-inflicted outages start with box hugging. Sure, a manual procedure may have worked 100 times, but one fat-finger mistake on the 101st could be enough to make your team’s life miserable.
So how do you cure box hugging? It takes only two steps. First, internalize the idea that the box you’ve just meticulously finished setting up could burst into flames the very next minute. Second, treat your operations with the same discipline and rigor that you apply to your software development.
Coding the infrastructure is not hard. Here are some tips for the process.
Infrastructure is ephemeral
Infrastructure is not permanent. It will fail. While we all calculate MTBF estimates, they are just that: estimates. Failures don’t follow estimates. When you treat infrastructure as ephemeral, the act of bringing up new infrastructure to a desired operational state becomes a normal and known practice. You can ignore the dead nodes, and focus on bringing up new nodes as quickly as possible.
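To make the "ignore the dead nodes" idea concrete, here is a minimal Python sketch. The function name `plan_replacements`, the node names, and the health map are all hypothetical; a real setup would get health data from its own monitoring and hand the result to its own provisioning tooling.

```python
from typing import Dict, List, Union

Plan = Dict[str, Union[List[str], int]]


def plan_replacements(desired_count: int, health: Dict[str, bool]) -> Plan:
    """Decide what to do with a pool of nodes (hypothetical helper).

    Dead nodes are not repaired; they are discarded, and the plan says how
    many fresh nodes must be provisioned to reach the desired pool size.
    """
    healthy = sorted(name for name, ok in health.items() if ok)
    dead = sorted(name for name, ok in health.items() if not ok)
    return {
        "keep": healthy,
        "discard": dead,
        "provision": max(0, desired_count - len(healthy)),
    }


# Illustrative input: one of three nodes has failed.
plan = plan_replacements(3, {"node-a": True, "node-b": False, "node-c": True})
print(plan)
```

Note that the plan never contains a "fix node-b" action. The only question it answers is how many new nodes to bring up.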
Think of system setup as a series of state changes
Start with the basic infrastructure, and apply a sequence of steps to bring it to the desired state. The steps could include installing packages, configuring them, starting servers, setting up cron jobs, and so on. This is no different from most traditional coding exercises: start from a known state, apply some computations, and arrive at a new state.
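The state-change view can be sketched in a few lines of Python. This is a toy model, not a real provisioning tool: `NodeState`, the step builders, and the nginx and cron values are assumptions made up for illustration, and the steps only update an in-memory record rather than touching a machine.

```python
from dataclasses import dataclass, replace
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class NodeState:
    """Hypothetical record of what has been set up on one node."""
    packages: FrozenSet[str] = frozenset()
    services: FrozenSet[str] = frozenset()
    cron_jobs: FrozenSet[str] = frozenset()


# A step is a function from one state to the next.
Step = Callable[[NodeState], NodeState]


def install_package(name: str) -> Step:
    return lambda s: replace(s, packages=s.packages | {name})


def start_service(name: str) -> Step:
    return lambda s: replace(s, services=s.services | {name})


def add_cron_job(entry: str) -> Step:
    return lambda s: replace(s, cron_jobs=s.cron_jobs | {entry})


def converge(state: NodeState, steps: List[Step]) -> NodeState:
    """Apply each step in order, starting from a known state."""
    for step in steps:
        state = step(state)
    return state


# Illustrative sequence for a web node (values are made up).
WEB_NODE_STEPS = [
    install_package("nginx"),
    start_service("nginx"),
    add_cron_job("0 3 * * * /usr/local/bin/rotate-logs"),
]

bare = NodeState()
ready = converge(bare, WEB_NODE_STEPS)
print(ready)
```

Because the starting state and the list of steps are both plain data, the same sequence can be applied to any number of bare nodes and will arrive at the same result each time.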
Make the steps repeatable
This is like coding any math problem. You first solve the problem on paper, arrive at an algorithm, and then code the algorithm for repeatability and scale. Operational problems should be approached in the same way. Treating operations with this level of discipline may seem tedious and time-consuming, but unless you make the steps repeatable through automation, you won’t be able to recover from failures easily. Repeatability is a way of rehearsing recovery. Node just died? No problem: run the automation to bring up a new node, and you’re back in business.
But repeatability alone is not sufficient when state changes are numerous. Each state change also needs to be idempotent. In mathematics, an idempotent element is one that is unchanged in value when multiplied or otherwise operated on by itself. In operations, an idempotent step is one that leaves the system in the same state whether it is applied once or many times. Apply the same change again, and the system should not burn up. Practicing idempotency makes the outcome certain: if something breaks in the middle, you can simply replay the whole sequence of changes, because you know that each step is idempotent.
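Here is a minimal sketch of idempotent steps in Python, run against a temporary directory so it is safe to try. The helper names `ensure_dir` and `ensure_line`, the `etc/app.conf` path, and the configuration values are hypothetical. Each helper checks the current state before acting and reports whether it changed anything.

```python
import tempfile
from pathlib import Path
from typing import List


def ensure_dir(path: Path) -> bool:
    """Create the directory if it is missing. Returns True only on change."""
    if path.is_dir():
        return False
    path.mkdir(parents=True)
    return True


def ensure_line(path: Path, line: str) -> bool:
    """Append `line` to the file unless it is already there."""
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False
    with path.open("a") as f:
        f.write(line + "\n")
    return True


def run_all(root: Path) -> List[bool]:
    """The whole sequence of changes; safe to replay from the top."""
    conf = root / "etc" / "app.conf"
    return [
        ensure_dir(conf.parent),
        ensure_line(conf, "max_connections=100"),
        ensure_line(conf, "log_level=info"),
    ]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    first_run = run_all(root)    # every step makes a change
    second_run = run_all(root)   # replay: nothing left to change
    final_config = (root / "etc" / "app.conf").read_text()

print(first_run, second_run)
print(final_config)
```

Compare this with a naive step that always appends the line: run twice, it would leave duplicate entries in the file, and replaying the sequence after a mid-run failure would no longer be safe.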
Review, test and version control
Finally, apply the same engineering rigor to automation artifacts as you would apply to software development – that is, ensure that the automation scripts are peer-reviewed, tested and maintained in source control. There should be no difference. DevOps is not just about integrating development and operations, but about treating operations as development, and development as operations.
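As one illustration of what "tested" can mean for automation code, here is a small sketch using Python's standard `unittest` module. The helper `render_cron_entry` is hypothetical, standing in for any function an automation script might use. The point is only that it gets the same kind of unit test a product feature would.

```python
import unittest


def render_cron_entry(minute: int, hour: int, command: str) -> str:
    """Build a daily crontab line (hypothetical automation helper)."""
    if not (0 <= minute <= 59 and 0 <= hour <= 23):
        raise ValueError("minute must be 0-59 and hour must be 0-23")
    if not command.strip():
        raise ValueError("command must not be empty")
    return f"{minute} {hour} * * * {command.strip()}"


class RenderCronEntryTest(unittest.TestCase):
    def test_renders_daily_entry(self):
        self.assertEqual(
            render_cron_entry(0, 3, "/usr/local/bin/backup"),
            "0 3 * * * /usr/local/bin/backup",
        )

    def test_rejects_out_of_range_hour(self):
        with self.assertRaises(ValueError):
            render_cron_entry(0, 24, "/usr/local/bin/backup")

    def test_rejects_empty_command(self):
        with self.assertRaises(ValueError):
            render_cron_entry(0, 3, "   ")
```

Saved in a file under source control, tests like these can be run with `python -m unittest` as part of the same review and build process used for application code.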