Throughout the conference there was a great deal of discussion about how DevOps works in the live environment. Many ideas were aired, and a considerable amount of time was dedicated to the processes and the tools that support them.
Testing in Live
Eeek! That sounds “exotic”. Parallel live systems are used in various ways – I’m not going to type it all up here, see http://blog.christianposta.com/deploy/blue-green-deployments-a-b-testing-and-canary-releases/
Most of these techniques deliver a lot of information and help deliver what the customer wants more quickly and reliably, BUT they do not do this in a “throw it over the fence to operations and never see it again” environment. Separating the live data from the data about the live system should not be so challenging if the operational logging is suitable, but that needs development and operations co-operation, and time invested. That is a total-cost-of-ownership sell, not an isolated development sell plus a separate support sell.
Bitmovin is a video transcoding company that uses a permanent canary, i.e. they always have two versions on the go and switch users over gradually, back and forth. Another variation is feature flags for A/B testing – the same version of the code configured differently. Then there is shadowing: the same traffic goes to both versions, the user sees the old one’s response and the new one is verified against it, which lets you monitor relative performance (upgrades or changes to protocols are an issue here). A major difference between approaches is whether there are separate databases or not – data migration is obviously an issue with that, although document-based NoSQL DBs would handle it.
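To make the feature-flag case concrete, here is a minimal sketch – the flag name and pricing rule are invented for the example – of the same build behaving differently purely through configuration:

```python
import os

# Illustrative only: the same image runs everywhere, and behaviour differs purely
# through configuration. The flag name and pricing rule are made up for the example.
USE_NEW_PRICING = os.environ.get("FEATURE_NEW_PRICING", "false").lower() == "true"

def price(basket_total: float) -> float:
    if USE_NEW_PRICING:
        return round(basket_total * 0.95, 2)  # “B” variant behaviour
    return basket_total                       # “A” variant behaviour

print(price(100.0))
```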
I would say the thing to do with all of these is to be in a position to replicate the infrastructure for testing and to pivot between approaches – choosing just one is a false dichotomy, whether at the product, project or sprint-by-sprint level. Is there a data change? Is there a protocol change? Is there a user interaction to be measured? Is performance the key measure? With Docker and Kubernetes the replicated infrastructure need not be up for longer than required and can be readily reconfigured if done right.
Docker in Live
Do you know where your Docker images come from?
Images should all be signed! You must run a private Docker image registry, for security and repeatability.
Specific versions of container images from controlled sources – none of the “latest” version from a hooky repo, please!
Don’t patch live containers. Containers should be rebuilt, tested and installed; otherwise here lies the madness of manually patched systems. Service discovery is therefore important because containers have to configure themselves (NB this can also be handled by Kubernetes secrets). Waiting for other services is a thing too – automated start-up is an issue if it all has to happen in order.
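As a rough sketch of the “waiting for other services” point – the hostnames and ports are just placeholders for whatever service discovery hands back – a container can block at start-up until its dependencies are accepting connections:

```python
import socket
import sys
import time

# Minimal sketch: block at start-up until dependencies accept TCP connections.
# Host/port values are placeholders for whatever service discovery returns.
DEPENDENCIES = [("postgres", 5432), ("rabbitmq", 5672)]

def wait_for(host: str, port: int, timeout: float = 60.0) -> bool:
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2):
                return True
        except OSError:
            time.sleep(2)  # dependency not up yet, retry
    return False

for host, port in DEPENDENCIES:
    if not wait_for(host, port):
        sys.exit(f"gave up waiting for {host}:{port}")
```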
Images should be impregnable and should use service discovery to acquire third-party tokens at runtime rather than having them baked in. Only the deployment environment user should have access to the master discovery keys – by proclamation.
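A minimal sketch of the idea, assuming the token is injected at deploy time (for example from a Kubernetes Secret exposed as an environment variable or a mounted file – the names here are made up) rather than built into the image:

```python
import os

# Sketch only: the image contains no credentials. The third-party API token is
# injected at deploy time, e.g. from a Kubernetes Secret exposed as an environment
# variable or mounted as a file. Variable name and path are illustrative.
def get_api_token() -> str:
    token = os.environ.get("PAYMENT_API_TOKEN")
    if token:
        return token
    # Fall back to a file mounted from a Secret volume.
    with open("/var/run/secrets/payment/api-token") as fh:
        return fh.read().strip()
```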
An Ops Tool
One thing I collaborated on was getting a web front end to score A+ on the Qualys SSL Labs testing tool. I’ve always wondered how to pen-test my own services without spending silly money on experts.
Tools to Support Testing
OpenSCAP looks like a really great tool for examining your attack surface (out-of-date auth protocols, open ports and the like). You specify the security policy you wish to apply, and it generates an Ansible script to fix your issues. I’m not sure how applicable that is if you don’t use Ansible, or whether the output is human-readable enough to use for manual patching.
Chaos Monkey – a tool for testing your infrastructure against machines and networks going up and down: it randomly terminates instances so you find out whether the rest of the system copes.
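As a toy illustration only (the container names are invented, and the real Chaos Monkey terminates cloud instances rather than local containers), the core of a chaos experiment boils down to something like:

```python
import random
import subprocess

# Toy chaos experiment: kill one randomly chosen container from a candidate list
# and rely on the orchestrator and monitoring to prove the system recovers.
# Container names are made up for the example.
CANDIDATES = ["web-1", "web-2", "worker-1"]

victim = random.choice(CANDIDATES)
print(f"chaos: killing {victim}")
subprocess.run(["docker", "kill", victim], check=True)
```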
Site Reliability
Google Site Reliability Engineering… https://www.amazon.co.uk/Site-Reliability-Engineering-Production-Systems/dp/149192912X SIGTERM, not SIGKILL – it is the responsibility of microservices to “tidy up quickly and quietly” (as my primary school teacher used to say). Especially true as Docker sends SIGTERM and then SIGKILL automatically. Likewise, “FFS don’t use latest as a config tag” was oft-repeated advice over the two days.
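A minimal sketch of tidying up on SIGTERM in a long-running service (the work loop is just a stand-in; Docker’s default grace period before it escalates to SIGKILL is about ten seconds):

```python
import signal
import sys
import time

# Sketch of "tidy up quickly and quietly": handle SIGTERM so the service can
# finish in-flight work before Docker escalates to SIGKILL.
shutting_down = False

def handle_sigterm(signum, frame):
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

while not shutting_down:
    time.sleep(1)  # stand-in for real work / request handling

# Drain connections, flush buffers, close the database, etc., then exit cleanly.
sys.exit(0)
```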
There was some interesting stuff about how to run parallel systems, switching groups and proportions of users over from one to the other. I’ve done this myself on a distributed EFTPOS system (to provide an “exponential rollout” and limit risk).
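As an illustration of the idea (the doubling schedule and IDs are invented for the example), the switch-over can be driven by a stable hash of the user or terminal so nobody flips back and forth between old and new as the proportion grows:

```python
import hashlib

# Illustrative "exponential rollout": the fraction of users on the new system
# doubles each step (1%, 2%, 4%, ... 100%), and each user is assigned by a
# stable hash so the same user always lands on the same side.
def rollout_percentage(step: int) -> int:
    return min(100, 2 ** step)  # step 0 -> 1%, step 1 -> 2%, ...

def on_new_system(user_id: str, step: int) -> bool:
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < rollout_percentage(step)

print(on_new_system("terminal-0042", 3))  # 8% of users at step 3
```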
When ‘it Hits the Fan
There was an interesting talk about how to analyse incidents. Three Steps: Observation – Response – Remediation
“Mean Time To Recover” – slicing and dicing
Accountability = being responsible for reporting, not taking the blame
Disable a service when the microservices on which it depends are unavailable
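A sketch of that behaviour, assuming a plain HTTP service and an invented dependency name: the service answers 503 while its dependency is unreachable, rather than failing part-way through each request.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import socket

# Sketch: refuse traffic (503) while a microservice we depend on is unreachable.
# The dependency host/port is illustrative.
def dependency_up(host="accounts-service", port=8080) -> bool:
    try:
        with socket.create_connection((host, port), timeout=1):
            return True
    except OSError:
        return False

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if not dependency_up():
            self.send_response(503)            # deliberately disabled
            self.send_header("Retry-After", "30")
            self.end_headers()
            return
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

HTTPServer(("", 8000), Handler).serve_forever()
```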
A status page for stakeholders, running on a different system, was considered to be essential. Certainly my experience is that this can be reassuring for users and helps keep the people fixing the system focussed on fixing rather than reporting. Automation is useful for proactive notification.
There is a free book which was recommended: Bit.ly/PIR_book
Tools for the Ops Side of DevOps
Kube-monkey is a Kubernetes take on Chaos Monkey – it randomly kills pods, which is a good way of testing that your circuit breakers and graceful degradation actually work.