How to properly automate CM based on log events?

Question

Let's say I have a number of tools such as Docker Bench for Security, some static security scanners, and so on. Assume I am also using the free edition of Ansible (no paid tooling is being considered).

I have lots of logs, but little automated, actionable information that triggers scripts. What is a proper way to automate the link between static scanners and CM tools like Ansible and Jenkins?

Should it go like:

Parse logs for set of errors and then automatically trigger "correcting scripts"?
Just run a playbook once in a while as it checks the state before doing anything?
Use tools/plugins for Ansible?
Something else?

With respect to Tools, not policy there will be more proper ways depending on the use-case. — Vladimir Botka, Aug 04 '18 at 05:14
It sounds like what you are describing here is automated remediation not automated configuration management. — Bruce Becker, Oct 22 '18 at 15:13

score 1 · Answer 1 · answered Oct 14 '19 at 10:58

There is a natural tension here between:

(a) automatically fixing something that is wrong: for example, because a CVE has been published with a severity high enough that it must be addressed.

(b) breaking the system: because the automation introduced a breaking change or a defect.

I lean towards treating systems as being under constant development: even if no features are being added to them, they need to be kept current to avoid exposure to vulnerabilities. The rate of change in the third-party components of most systems I cover is so high that it's actually worth rebuilding and redeploying everything daily.

This introduces a couple of requirements:

The process is highly automated.
99% of user journeys are covered by end-to-end tests, so each build carries a high degree of confidence that it will work.
We don't roll out to all users at once; we use canary releases and a rollback mechanism.
We use version constraints in package manifests, e.g. the version of React must be >= 16.7.1 and < 17.0.0.
We manage by exception, i.e. if the end-to-end tests or the canary release fail, humans get involved.
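As a rough illustration of the version-constraint point, here is a minimal Python sketch of a bounded range check. The function names are hypothetical and it assumes plain numeric `major.minor.patch` versions with no pre-release tags; real package managers implement much richer range syntax than this.

```python
def parse_version(text):
    """Turn '16.7.1' into (16, 7, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def within_constraint(version, minimum, upper_bound):
    """Return True when minimum <= version < upper_bound."""
    return parse_version(minimum) <= parse_version(version) < parse_version(upper_bound)


def react_allowed(version):
    # Bounds taken from the example in the list: >= 16.7.1 and < 17.0.0.
    return within_constraint(version, "16.7.1", "17.0.0")


if __name__ == "__main__":
    for candidate in ("16.7.0", "16.7.1", "16.14.0", "17.0.0"):
        print(candidate, react_allowed(candidate))
```

Comparing tuples of integers rather than raw strings matters here: as strings, "16.14.0" would sort before "16.7.1", which would wrongly reject a newer patch-level release.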

This, in turn, makes tools like Trivy, OWASP ZAP and SonarQube more about generating metrics and stopping vulnerabilities from going live than about triggering automated remediation.

We do then have a set of rules like:

The live application must not have an unmitigated CVE with a CVSS score of 7.0 or higher.
The list of CVEs with a score of 4.0 or higher will be reviewed weekly to assess whether remediation should be undertaken (e.g. changing to a different library, introducing a new safeguard, removing a defunct library, fixing the CVE at source, etc.).
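These two rules could be expressed as a small gate in a pipeline. The following is only a sketch under assumptions: the `Finding` record, its `mitigated` flag, and the placeholder CVE identifiers are hypothetical, and a real setup would populate findings from whatever report format the scanner actually emits.

```python
from dataclasses import dataclass

BLOCK_THRESHOLD = 7.0   # rule 1: no unmitigated CVE at or above this score in live
REVIEW_THRESHOLD = 4.0  # rule 2: everything at or above this goes to weekly review


@dataclass(frozen=True)
class Finding:
    """Hypothetical record for one scanner finding."""
    cve_id: str
    cvss: float
    mitigated: bool = False


def blocking_findings(findings):
    """Findings that must stop a release: high score and not mitigated."""
    return [f for f in findings if f.cvss >= BLOCK_THRESHOLD and not f.mitigated]


def weekly_review_list(findings):
    """Findings for the weekly review, highest score first."""
    candidates = [f for f in findings if f.cvss >= REVIEW_THRESHOLD]
    return sorted(candidates, key=lambda f: f.cvss, reverse=True)


def release_allowed(findings):
    """A release may go live only when nothing blocks it."""
    return not blocking_findings(findings)


if __name__ == "__main__":
    # Placeholder identifiers, not real CVEs.
    findings = [
        Finding("CVE-EXAMPLE-1", 9.8),
        Finding("CVE-EXAMPLE-2", 7.5, mitigated=True),
        Finding("CVE-EXAMPLE-3", 5.3),
        Finding("CVE-EXAMPLE-4", 2.1),
    ]
    print("release allowed:", release_allowed(findings))
    print("weekly review:", [f.cve_id for f in weekly_review_list(findings)])
```

Keeping the gate as a pure function over a list of findings keeps the policy separate from any particular scanner, so the thresholds can be changed or audited without touching the scanning step.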

This then gives my SRE teams the operational boundaries to work within. In practice they may:

Re-run the deployment ad hoc to fix it now rather than within 24h.
Override a default: for example, if there is a known vulnerability, deploy a version whose known vulnerability is less critical, even though it still scores above CVSS 7.0.
Introduce a new safeguard, e.g. a Web Application Firewall rule, to mitigate it.
Swap out the dependency for something less vulnerable.
Disable that functionality entirely until a fix can be introduced.
