Where suck = fail to detect the problems you experience later in production.
A good staging environment is the last line of defense between your production environment and the new bugs introduced by your latest release. Without it, the first ones to test your new code under real-world conditions will be your users in production. It’s not that testing in production is such a bad practice, but testing first in production certainly is.
Because this is a complex problem, I’ve tried to break it down into a few smaller actionable ones by listing some of the more common pitfalls and shortcomings I’ve struggled with over the years. Some are more obvious than others, and some are harder to solve, but addressing all seven will ensure that you catch significantly more bugs before going live.
Failing in Production
Your staging environment is not architecturally representative of your production environment.
A strong staging environment has to have at least the same component structure — microservices, databases, message queues, and caches — as your production environment. Cost is always a factor, so the allocated resources per component don’t have to be identical. But I would at least try to match the multiplicity factor, e.g., if you have multiple instances of a specific service in production, then you should have at least two instances in your staging. This gives concurrency issues such as deadlocks and other race conditions room to surface before they reach production.
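To make this concrete, here is a minimal Python sketch of a parity check. The service names and instance counts are invented, and in real life you would pull them from your orchestrator or infrastructure-as-code definitions, but the idea is the same: flag any component that runs in multiples in production and runs alone (or not at all) in staging.

```python
# Hypothetical inventories of components and their instance counts.
# In practice you would pull these from your orchestrator or IaC definitions.
production = {"api": 6, "worker": 4, "postgres": 2, "redis": 2, "rabbitmq": 3}
staging = {"api": 1, "worker": 1, "postgres": 1, "redis": 1}

def check_parity(production, staging):
    """Report components missing from staging, or running as a single
    instance while production runs several."""
    problems = []
    for component, prod_count in production.items():
        stage_count = staging.get(component, 0)
        if stage_count == 0:
            problems.append(f"{component}: missing from staging")
        elif prod_count > 1 and stage_count < 2:
            problems.append(
                f"{component}: {prod_count} instances in production but only "
                f"{stage_count} in staging, so race conditions may never surface"
            )
    return problems

for problem in check_parity(production, staging):
    print(problem)
```

Run something like this in CI and drift between the two environments gets caught the moment it appears.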
You only keep it running for a few minutes.
Spinning up your staging servers five minutes before you deploy to production is not going to cut it. Some problems take a while to brew (memory leaks, data corruption, etc.) and leaving your staging environment running for a significant amount of time is critical to identifying these problems. You most likely never shut down your production servers, so you shouldn’t turn off your staging either. This may sound wasteful, but the cost of failing your users or being unavailable can be much higher.
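If you want a rough idea of what letting problems brew looks like in practice, here is a small soak-monitoring sketch. It assumes the third-party psutil package and a made-up process ID, and the growth threshold is arbitrary; the point is simply that a trend over hours tells you things a five-minute run never will.

```python
import time
import psutil  # third-party: pip install psutil

def watch_rss(pid, interval_s=300, samples=12):
    """Sample the resident set size of a process over a long soak run.
    A steady climb across hours is the kind of slow leak that a
    short-lived staging environment will never reveal."""
    proc = psutil.Process(pid)
    readings = []
    for _ in range(samples):
        rss_mb = proc.memory_info().rss / (1024 * 1024)
        readings.append(rss_mb)
        print(f"rss={rss_mb:.1f} MiB")
        time.sleep(interval_s)
    growth = readings[-1] - readings[0]
    if growth > 50:  # arbitrary threshold, for illustration only
        print(f"warning: RSS grew by {growth:.1f} MiB over the soak period")

# watch_rss(pid=12345)  # PID of the service under test (hypothetical)
```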
You are not monitoring the staging behavior.
Monitoring your staging environment is critical for two reasons. First, you need to know when things go wrong. To do that, you have to define your stable-state thresholds and stop promoting code when you see them being crossed. Second, monitoring tools and agents are part of the architectural similarity you want to achieve. Without naming names, I’ve seen cases where monitoring agents went haywire or added substantial overhead to the monitored server. These issues cannot be reproduced without having the same agents embedded in staging.
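As a sketch of the first point, here is roughly what a promotion gate could look like. The metrics URL, field names and thresholds are placeholders for whatever your monitoring stack actually exposes, not a real API.

```python
import json
import sys
import urllib.request

# Hypothetical endpoint exposed by your monitoring stack (placeholder URL).
STAGING_METRICS_URL = "https://staging-monitoring.example.com/api/summary"

# "Stable state" thresholds, normally derived from your production baselines.
MAX_ERROR_RATE = 0.01      # 1% of requests
MAX_P95_LATENCY_MS = 800

def staging_is_stable():
    with urllib.request.urlopen(STAGING_METRICS_URL, timeout=10) as resp:
        summary = json.load(resp)
    return (summary["error_rate"] <= MAX_ERROR_RATE
            and summary["p95_latency_ms"] <= MAX_P95_LATENCY_MS)

if __name__ == "__main__":
    # Exit non-zero so the CI pipeline can refuse to promote the build.
    sys.exit(0 if staging_is_stable() else 1)
```

Wire the exit code into your deployment pipeline and a crossed threshold blocks the promotion automatically.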
It doesn’t contain real data.
This is more common than you would expect. I can’t remember how many times I’ve logged into someone’s staging environment and found it completely empty. Either there is no data at all, or there are only some leftovers from failed automated tests. Querying empty tables will teach you nothing about the experience your users go through when using your search function, and it will never uncover slow-performing queries.

An even bigger problem is DB migrations. If you want to be sure your latest schema changes won’t break anything, you must have the same edge cases that exist in your production DB (weird characters, null values, extremely long values and other unidentified floating garbage). There are many commercial solutions to this problem; these tools can take data from production and sanitize it for use as test data (you don’t want real password hashes and credit card numbers hanging around where they shouldn’t be). I have yet to find one that I actually enjoyed using, so feel free to leave a comment if you know of a really good one.
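In the meantime, here is a rough sketch of the kind of sanitization I mean. The column names are hypothetical; the point is to mask the sensitive fields while keeping the nulls, the weird characters and the absurdly long values that make production data so educational.

```python
import hashlib

def sanitize_row(row):
    """Mask sensitive fields while preserving the shape of the data:
    nulls stay null, lengths stay comparable, and weird characters in
    non-sensitive fields are left untouched so migrations still hit them."""
    sanitized = dict(row)
    if row.get("email"):
        digest = hashlib.sha256(row["email"].encode()).hexdigest()[:12]
        sanitized["email"] = f"user_{digest}@example.test"
    if row.get("password_hash"):
        sanitized["password_hash"] = hashlib.sha256(b"not-a-real-password").hexdigest()
    if row.get("credit_card"):
        sanitized["credit_card"] = "4111111111111111"  # a standard test card number
    return sanitized

# Example: a production-style row with the kind of garbage you want to keep.
row = {
    "id": 42,
    "name": "Łukasz \u0000 O'Brien " + "x" * 500,  # weird characters, very long value
    "email": "lukasz@example.com",
    "password_hash": "$2b$12$abcdefghijklmnopqrstuv",
    "credit_card": "5500005555555559",
    "last_login": None,  # nulls survive sanitization untouched
}
print(sanitize_row(row))
```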
Nothing is happening in it.
Most staging environments I’ve tested with were super fast and responsive even when the real app was slow and glitchy. This usually happens because these environments tend to be empty, uninhabited places, since no one is using them. You can’t expect to catch performance issues, race conditions and deadlocks in an environment without multiple concurrent users. To make these environments come to life, some choose to direct synthetic traffic into them. At Loadmill, we help companies use real Internet traffic to replicate their production user behavior. It’s best not to separate load testing from your other automated tests, since you want to test under the same conditions as your production environment. You may think this approach generates too much noise and chaos, but that is precisely the point — how else are you going to find the bugs you didn’t think about?
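For a taste of what replaying recorded traffic looks like, here is a bare-bones sketch. The recorded paths and the staging URL are made up, and a real replay would preserve headers, payloads and timing, but even this much concurrency will shake out more than an idle environment ever will.

```python
import concurrent.futures
import urllib.request

# Hypothetical: a handful of request paths pulled from production access logs.
RECORDED_REQUESTS = [
    "/search?q=blue+shoes",
    "/products/1234",
    "/cart",
    "/search?q=%00%27--",  # keep the ugly ones; they are the interesting ones
]

STAGING_BASE_URL = "https://staging.example.com"  # placeholder

def replay(path):
    try:
        with urllib.request.urlopen(STAGING_BASE_URL + path, timeout=10) as resp:
            return path, resp.status
    except Exception as exc:
        return path, f"error: {exc}"

# Fire the recorded requests concurrently to approximate real user overlap.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    for path, result in pool.map(replay, RECORDED_REQUESTS * 25):
        print(path, result)
```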
It is not facing the Internet.
If your production servers serve requests from all around the world, so should your staging environment. How can you gain confidence in your cache servers, CDN and load balancer when all you are doing is sending requests from a neighboring server on AWS? That looks nothing like your real Internet traffic. If you are running a global operation, you will have to test it globally, with requests that match the usage patterns of those same locations.
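A simple way to start is to distribute your synthetic traffic across regions in the same proportions you see in your production analytics. The weights and agent hostnames below are invented placeholders.

```python
import random

# Hypothetical share of production traffic per region, e.g. from your analytics.
REGION_WEIGHTS = {"us-east": 0.40, "eu-west": 0.30, "ap-south": 0.20, "sa-east": 0.10}

# Hypothetical load-generation agents you run in each region.
REGION_AGENTS = {
    "us-east": "agent-us-east.example.com",
    "eu-west": "agent-eu-west.example.com",
    "ap-south": "agent-ap-south.example.com",
    "sa-east": "agent-sa-east.example.com",
}

def plan_batch(n_requests):
    """Assign each simulated request to a region in proportion to real usage."""
    regions = random.choices(
        population=list(REGION_WEIGHTS),
        weights=list(REGION_WEIGHTS.values()),
        k=n_requests,
    )
    plan = {}
    for region in regions:
        agent = REGION_AGENTS[region]
        plan[agent] = plan.get(agent, 0) + 1
    return plan

print(plan_batch(1000))  # roughly a 40/30/20/10 split across the agents
```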
It is missing the element of surprise.
In The Pink Panther, Inspector Clouseau instructs his manservant, Cato, to surprise and attack him at any moment in order to keep him vigilant. Our reality is chaotic, and therefore we should expect to be surprised. Servers will crash; abuse and DoS attacks will happen; hosting services will experience downtime and network outages. We need to be ready for anything, and the only way to do that is to constantly surprise ourselves. Adding elements of chaos to your staging environment while you run your test cycle is a great way to work on your system’s reliability and resilience. The solutions here range from great open-source tools like Chaos Monkey and the Simian Army to commercial solutions such as Gremlin.
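You don’t need a full chaos platform to get started. Here is a deliberately crude sketch that restarts a random staging container every few minutes while your tests run; the container names are hypothetical, and the tools above do this far more safely and systematically.

```python
import random
import subprocess
import time

# Hypothetical names of the service containers running in your staging environment.
STAGING_CONTAINERS = ["api-1", "api-2", "worker-1", "cache-1"]

def chaos_round(pause_s=600):
    """Every pause_s seconds, restart one randomly chosen container so the
    test suite keeps running while parts of the system disappear and recover."""
    while True:
        victim = random.choice(STAGING_CONTAINERS)
        print(f"chaos: restarting {victim}")
        subprocess.run(["docker", "restart", victim], check=False)
        time.sleep(pause_s)

# chaos_round()  # run this alongside your automated test cycle
```

Even something this simple forces you to notice which services recover gracefully and which ones quietly never come back.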
Realistically simulating your production environment before you deploy to it is one of the most helpful ways to achieve the ultimate goals of five nines and a fantastic user experience. That’s not to say that other, more localized testing methodologies don’t matter, but rather that they are complementary. Some problems can only be detected by testing under real-world conditions. If you think of the time and money lost to performance issues and bugs that slipped past your test cycle, the effort of addressing most of the points above doesn’t seem so big.