GitOps (sometimes known as 'operations by pull request') almost always seems to be talked about in the context of Kubernetes. In this post I will highlight what I see as the benefits of the approach and why you ought to consider it - regardless of whether you're using Kubernetes.
What is GitOps?
The coining of the term is generally attributed to Weaveworks, and it describes a system that adheres to the following principles:
- The system is described declaratively - that is to say the desired state is defined, not the mechanism for 'making it so'
- This declarative model is stored in Git - which is considered 'the source of the truth'
- Changes to this model are made via an approval process - i.e. a pull request
- An automated process consumes this model and converges the system to the desired state
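To make the first and fourth principles concrete, a desired state can be as simple as a mapping of named items to their attributes, with an automated process that works out the operations needed to converge the real system towards it. Here is a minimal sketch - the `desired`/`actual` structures and the `converge` function are hypothetical illustrations, not part of any particular GitOps tool:

```python
# Hypothetical desired state, as it might be stored declaratively in Git.
desired = {
    "web-01": {"size": "small", "region": "eu-west"},
    "web-02": {"size": "small", "region": "eu-west"},
}

# Current state of the system, as it might be reported by the system's API.
actual = {
    "web-01": {"size": "small", "region": "eu-west"},
    "web-03": {"size": "large", "region": "eu-west"},
}

def converge(desired, actual):
    """Return the operations needed to move 'actual' towards 'desired'."""
    ops = []
    for name, spec in desired.items():
        if name not in actual:
            ops.append(("create", name, spec))
        elif actual[name] != spec:
            ops.append(("update", name, spec))
    for name, spec in actual.items():
        if name not in desired:
            ops.append(("delete", name, spec))
    return ops

for op in converge(desired, actual):
    print(op)
```

Note that the model only says *what* should exist; the mechanism for 'making it so' lives entirely inside the automation.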
If you think that all sounds pretty generic, then you're not wrong - so why is it mostly discussed in the context of managing Kubernetes clusters?
Why Always Kubernetes?
The recommended Kubernetes configuration mechanism is entirely declarative (using YAML manifests), so it lends itself to this approach particularly well - hardly surprising really, if you assume that the genesis of the approach was to help tame Kubernetes!
Whilst the principles underpinning GitOps weren't fundamentally new, it represented a very practical implementation of many Continuous Delivery concepts. For the purposes of this post, however, I want to focus on one key pillar from that seminal tome:
Keep Absolutely Everything in Version Control
When things get named they often become closely associated with the use case that coined them, and ultimately I think this is what happened with GitOps.
Other Use Cases
Hopefully, the previous definition shows that the potential use cases are much broader. I believe that the GitOps approach can work really well for scenarios where at least some of the following points are true:
- The desired state of the system can be modelled declaratively
- The changes to be managed relate to a single system/process, or can be naturally scoped in that way without arbitrarily partitioning or duplicating related configuration
- Managing the overall desired state can be more efficiently achieved by sharing responsibility across multiple teams (rather than a central 'owner')
- Different teams have their own specialist insights into specific parts of the overall desired state
- The system is not expected to be modified outside of the GitOps process, and it is acceptable for any such out-of-band changes to be reverted
- Additional value can be derived from having a definitive, computer-readable representation of the current state (e.g. generating ancillary reports, acting as an authoritative data source for other interested systems etc.)
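That last point is easy to underestimate: once the model is machine-readable, ancillary outputs fall out almost for free. As a hedged sketch, the hypothetical `render_report` below turns a model (a mapping of item name to attribute dict, as assumed here) into a Markdown summary suitable for a wiki page:

```python
def render_report(desired):
    """Render a declarative model as a Markdown table.

    'desired' is a hypothetical mapping of item name -> attribute dict;
    real models will have richer structure, but the principle is the same.
    """
    lines = ["| Name | Attributes |", "| --- | --- |"]
    for name in sorted(desired):
        attrs = ", ".join(f"{k}={v}" for k, v in sorted(desired[name].items()))
        lines.append(f"| {name} | {attrs} |")
    return "\n".join(lines)

print(render_report({"web-01": {"size": "small", "region": "eu-west"}}))
```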
These largely echo the above points, but I think it's worth highlighting some of the specific benefits:
- A natural way to collaborate and delegate responsibility, whilst still maintaining control
- Great for supporting audit requirements
- Great for disaster recovery and business continuity scenarios
- Ability to easily leverage the Git versioning semantics for the process - e.g. branching, tagging, comparing, even (dare I say) rolling-back!
- Ability to track drift from the desired state - it's one thing for the desired state to have changed, but what happens when the system is changed outside of this process?
However, it's fair to say that it isn't all plain sailing.
Having an Idempotent Change Mechanism
I think the biggest challenge is point #4 in the definition above - at least it can be if the native tooling for the system in question does not already have a declarative model or otherwise offer idempotent change mechanisms.
For this approach to work you really are reliant on having automation that can consistently and reliably:
- Process the current desired state
- Apply just the necessary changes
- Report those changes back (the last thing you need is an automated process that generates a vast amount of noise every time it runs)
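The three requirements above can be sketched as a single loop that reads the current value of each item, writes only when it differs from the model, and reports only what actually changed - staying silent (bar a summary) on a no-op run. The `read_item`/`write_item` adapters over the native tooling are hypothetical placeholders:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("converge")

def apply_desired_state(desired, read_item, write_item):
    """Idempotently apply 'desired', reporting only genuine changes.

    'read_item'/'write_item' are hypothetical adapters over whatever
    native tooling the target system provides.
    """
    changed = 0
    for key, spec in desired.items():
        current = read_item(key)
        if current == spec:
            continue  # already converged: no action, and crucially no noise
        write_item(key, spec)
        log.info("updated %s: %r -> %r", key, current, spec)
        changed += 1
    if changed == 0:
        log.info("no changes required")
    return changed
```

Running this twice against the same model should report a change the first time and nothing but "no changes required" the second - that quietness is what keeps the reporting useful.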
If the native tooling lacks such characteristics, then an option is to build your own layer of idempotent tooling that can consume the declarative model and drive the native tools accordingly. Clearly you need to understand the cost/benefit of such an undertaking before starting down that particular road - keeping the process as simple as possible will be your best bet for easing this task.
Potentially Long Convergence Times
Typically, early iterations of such an automated process will work sequentially through the declarative model, applying any changes that are necessary. At some point a single convergence run may take long enough that it starts impeding the natural cadence of the incoming changes. Suddenly, a system that has made life easier for the last few months or years falls from favour, and people find themselves trying to work around it in order to avoid such friction.
In some cases it might be possible to use parallelism to speed things up, though this largely depends on the extent to which the underlying system can accommodate parallel requests and it can also complicate logging and tracking of the overall process.
Another option is to utilise the data held within Git commits to implement a 'delta' convergence (where only those parts of the system affected by a commit are converged), falling back to a full run on a less frequent basis.
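A delta convergence can often lean directly on Git's own plumbing: asking `git diff --name-only` which files changed between the last-converged commit and the new one, then converging only the parts of the model those files describe. A minimal sketch, assuming the repository layout maps files to convergeable units:

```python
import subprocess

def changed_paths(repo_dir, old_commit, new_commit):
    """List files touched between two commits, so a 'delta' run can
    converge only the affected parts of the model."""
    result = subprocess.run(
        ["git", "diff", "--name-only", f"{old_commit}..{new_commit}"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return [path for path in result.stdout.splitlines() if path]
```

Remember that this only sees what Git saw change - it says nothing about out-of-band drift in the system itself, which is one reason the periodic full run remains important.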
Modelling a desired state also implies that anything not in the model is extraneous and should potentially be removed; but how do you track the absence of something in the model against its presence within the system?
How you handle this scenario greatly depends on the underlying system, but you absolutely need to think about the following:
- Can you efficiently query for items that exist in the system but not in the model?
- Should extraneous items actually be deleted or do a subset of them need to be retained? (e.g. for compliance reasons)
- Can you rely on the information contained within a Git commit to identify deletions?
- Can 'renames' (in whatever context this relates to the system) be treated as 'delete and create' operations or will this lose data? If not, should you even support renaming?
- If such occurrences are rare, will a simple exception report be sufficient?
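Several of the questions above reduce to a set difference between the system's inventory and the model, with a carve-out for items that must be retained regardless. A hedged sketch, where the `retain` set and the two-bucket return shape are illustrative assumptions rather than a prescribed design:

```python
def find_extraneous(model_ids, system_ids, retain=frozenset()):
    """Split items present in the system but absent from the model.

    'retain' holds ids that must survive regardless (e.g. for compliance
    reasons); those are reported for review rather than queued for deletion.
    """
    extraneous = set(system_ids) - set(model_ids)
    to_delete = sorted(extraneous - retain)
    to_report = sorted(extraneous & retain)
    return to_delete, to_report
```

For example, with a model of `{"a", "b"}`, a system containing `{"a", "b", "c", "d"}` and `"d"` under a retention hold, `"c"` is queued for deletion while `"d"` only appears on the exception report.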
Keys to Success
If you find the concept appealing, then here's my parting advice:
- Make sure you understand exactly what you need to get out of such a process
- Constrain the challenges by making the convergence process as simple as possible
- Prefer several, simple convergence processes over fewer, more complicated ones
- Consider multiple discrete GitOps repositories for systems that have no overlapping concerns (again, having several simpler repos will likely be easier to manage than a single behemoth repo)
Have you tried GitOps? What for and how did it go? Let me know in a comment below or ping me on Twitter (@James_Dawson)