There is a natural tension here between:
(a) fixing something that is wrong automatically, for example because a CVE has been published with a severity high enough to require action, and
(b) breaking the system, because the automation introduced a breaking change or a defect.
I lean towards treating systems as being under constant development: even if no features are being added, they need to be kept current to avoid exposure to vulnerabilities. The rate of change in the third-party components of most systems I cover is so high that it's actually worth rebuilding and redeploying everything daily.
This introduces a few requirements:
- It is highly automated,
- 99% of user journeys are covered by end-to-end tests, so each build has a high degree of confidence that it will work,
- We don't roll out to all users at once; we employ canary releases and a rollback mechanism,
- We use version constraints in package manifests, etc., e.g. the version of React must be >= 16.7.1 and < 17.0.0,
- We manage by exception, i.e. if the end-to-end tests fail or the canary release fails, humans get involved.
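To make the version-constraint requirement concrete, here is a minimal sketch of the range check that a package manager performs for a constraint like ">= 16.7.1 and < 17.0.0". The function names are hypothetical, and it assumes plain numeric MAJOR.MINOR.PATCH versions (no pre-release tags); real package managers handle far more cases.

```python
def parse_version(text):
    """Turn '16.7.1' into (16, 7, 1) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in text.split("."))


def satisfies(version, lower_inclusive, upper_exclusive):
    """Return True if lower_inclusive <= version < upper_exclusive."""
    candidate = parse_version(version)
    return parse_version(lower_inclusive) <= candidate < parse_version(upper_exclusive)


if __name__ == "__main__":
    # A daily rebuild may pick up any release inside the allowed range...
    print(satisfies("16.14.0", "16.7.1", "17.0.0"))
    # ...but never crosses into the next major version automatically.
    print(satisfies("17.0.0", "16.7.1", "17.0.0"))
```

The upper bound is what keeps the daily rebuild from pulling in a major release with breaking changes, while the lower bound still lets patch and minor fixes flow in without human involvement.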
This, in turn, makes tools like Trivy, OWASP ZAP and SonarQube more about generating metrics and stopping vulnerabilities from going live than about triggering automated remediation.
We then have a set of rules such as:
- The live application must not have an unmitigated CVE with a CVSS score of 7.0 or higher.
- The list of CVEs with a score of 4.0 or higher will be reviewed on a weekly basis to assess whether remediation should be undertaken (e.g. change to a different library, introduce a new safeguard, remove a defunct library, fix the CVE at source, etc.).
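The two rules above could be expressed as a small policy check over scanner output. This is only a sketch under stated assumptions: the `Finding` type, its fields, the function names, and the placeholder CVE identifiers are all hypothetical, and in practice the findings would be parsed from whatever report format your scanner produces.

```python
from dataclasses import dataclass

BLOCK_THRESHOLD = 7.0   # live application must not carry unmitigated CVEs at or above this
REVIEW_THRESHOLD = 4.0  # CVEs at or above this go on the weekly review list


@dataclass(frozen=True)
class Finding:
    """One vulnerability finding (hypothetical shape, not a real scanner schema)."""
    cve_id: str
    cvss: float
    mitigated: bool = False


def blocking_findings(findings):
    """Findings that must stop a release: high score and no mitigation in place."""
    return [f for f in findings if f.cvss >= BLOCK_THRESHOLD and not f.mitigated]


def weekly_review_list(findings):
    """Findings for the weekly review, highest score first."""
    relevant = (f for f in findings if f.cvss >= REVIEW_THRESHOLD)
    return sorted(relevant, key=lambda f: f.cvss, reverse=True)


def may_go_live(findings):
    """A build may go live only if nothing blocks it."""
    return not blocking_findings(findings)


if __name__ == "__main__":
    # Placeholder identifiers, not real CVEs.
    findings = [
        Finding("CVE-EXAMPLE-1", 9.8),
        Finding("CVE-EXAMPLE-2", 7.5, mitigated=True),
        Finding("CVE-EXAMPLE-3", 5.3),
    ]
    print(may_go_live(findings))
    print([f.cve_id for f in weekly_review_list(findings)])
```

Note that a mitigated finding no longer blocks the release but still appears on the weekly review list, which matches the idea that a safeguard buys time rather than closing the issue.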
This gives my SRE teams the operational boundaries to work within. In practice they may:
- just re-run the deployment ad hoc to fix it now rather than within 24h,
- override a default: for example, if there is a known vulnerability, deploy a version whose known vulnerability is less critical, even though it still scores above CVSS 7.0,
- introduce a new safeguard, e.g. a Web Application Firewall rule, to mitigate,
- swap out the dependency for something less vulnerable,
- disable that functionality entirely until a fix can be introduced.