5

We're Scrum teams building microservices. Our GitHub repositories are single-branch: each of us integrates their code into master several times a day, with no feature branches. Our Jenkins pipelines compile the code, run automated tests, send the code to other services for static code scans, and deploy our software across a Cloud Foundry landscape for further testing. If all the steps succeed, our pipelines automatically deploy the software into production spaces on Amazon Web Services. We live Uncle Bob's Clean Code, write reliable unit tests with > 90% mutation coverage, and honor Jez Humble's ideas from Continuous Delivery. We think we are doing everything right.

It doesn't work.

99% of our builds fail on their way through the pipelines. In many weeks our velocity is nearly 0.

The first impulse is to say we need to code cleaner, test more, stop pushing into red pipelines, roll back faster, perform root cause and post-mortem analyses, and so on. But we've been there and done all that: we've combed through our code, practices, and culture, and streamlined and upskilled our own teams as best we could.

The problem is that the builds fail for reasons that our teams have no control over: the Jenkins servers, the Maven Nexus, the npm registry, the code scanning services, and the Cloud Foundry landscape are all maintained by other teams in our company. Individually, each of the roughly 15 tools and teams involved is fine and suffers only sporadic outages that might block a pipeline for anywhere from a few minutes to a handful of days. But in combination, the failure probabilities compound into a nearly impenetrable wall of random failure.

What are strategies to improve this situation?

Florian
  • Push the issue up to higher management. – BobDalgleish Mar 10 '20 at 13:41
  • With what exact request? "Make the services of the other teams more reliable"? – Florian Mar 10 '20 at 15:45
  • @Florian: Phrased more tersely than you should phrase it to them: "either ensure that the other teams do their work OR don't blame us when the new builds can't be deployed". It's not up to you to manage the other teams nor be responsible for their issues - that's why they are defined as "another team". – Flater Mar 10 '20 at 15:51
  • I'm suuuuper confused. This whole post, more or less, should just be an email to the management teams in charge of these build services. (Strangers on the internet can't fix this for you... Unless there's something you're not telling us... Are other development teams using these build tools having the same problems?) – svidgen Mar 10 '20 at 15:57
  • If only the integration tools were themselves integrated and delivered continuously! – Steve Mar 11 '20 at 08:26
  • @Flater Let me specifically add that my question is not how to optimally play the blame game. We've made it transparent to management what the problem is and that we don't take responsibility for other teams' failures. – Florian Mar 11 '20 at 08:52
  • @svidgen There are other teams experiencing the same. However, many of them seem to suffer quietly without pointing out the severity of the issue. The points have also been escalated to management. So far, we have seen some improvement as a result of this, but progress is very slow. At that pace, the current situation is going to last several months or even years more. We want to work out how to make the best of it during that time. – Florian Mar 11 '20 at 09:03
  • @Steve The integration tools do evolve continuously. However, that process is obviously far from flawless. – Florian Mar 11 '20 at 09:06

3 Answers

10

I find it hard to believe that "only sporadic outages" of a few dependent services would result in 99% of builds failing. Maybe you need a detailed analysis of which services are the worst offenders, and then either communicate with their owners to improve their stability or replace them with more stable options.
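
As a minimal sketch of such an analysis, assuming you export each pipeline failure as a CSV line of the form timestamp,service,error (the log format and file name are made up for illustration):

```python
# failure_report.py - sketch: count pipeline failures per dependent service.
# Assumes a hypothetical CSV log with lines like "2020-03-09T14:02:11,nexus,timeout".
import csv
from collections import Counter

def worst_offenders(log_path):
    counts = Counter()
    with open(log_path, newline="") as log:
        for row in csv.reader(log):
            if len(row) >= 2:
                counts[row[1]] += 1   # column 1 holds the failing service
    return counts.most_common()

if __name__ == "__main__":
    for service, failures in worst_offenders("pipeline_failures.csv"):
        print(f"{service}: {failures} failed builds")
```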

Quick math: if each service has a probability x of failing and your build uses N services in sequence, then the probability of a random build failure is 1 - (1 - x)^N. For a failure rate of 1% (x = 0.01) and 5 services, that is 1 - (1 - 0.01)^5 ≈ 0.049, or roughly 5%.
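
A quick sketch of how that compounding plays out for more services (the per-service failure rates below are illustrative, not measured values from the question):

```python
# compound_failure.py - how per-service flakiness compounds across a pipeline.
def build_failure_probability(per_service_failure_rate, service_count):
    """P(at least one of N independent services fails) = 1 - (1 - x)^N."""
    return 1 - (1 - per_service_failure_rate) ** service_count

for rate in (0.01, 0.05, 0.10, 0.25):
    for services in (5, 10, 15):
        p = build_failure_probability(rate, services)
        print(f"x = {rate:.0%}, N = {services:2d}  ->  build fails {p:.1%} of the time")
```

With 15 services, it takes a per-service failure rate of roughly 25% per build to push the compound rate near 99%, which is why "only sporadic outages" and "99% of builds fail" are hard to reconcile.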

Another idea is to separate "necessary for deployment" dependencies from dependencies that provide useful information but do not prevent deployment. Static code analysis comes to mind: you don't need it for deployment, but it is nice information to have. Instead of running it as part of your deployment pipeline, run it asynchronously in a separate pipeline that provides reports to developers.
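
As a rough illustration of the split (a generic Python sketch rather than Jenkins configuration; the commands other than the Maven ones are placeholders):

```python
# advisory_steps.py - sketch: only blocking steps can fail the build;
# advisory steps (e.g. static analysis) merely report their outcome.
import subprocess

BLOCKING = [["mvn", "package"], ["mvn", "test"]]   # must succeed to deploy
ADVISORY = [["run-static-analysis"]]               # placeholder command

def run(cmd):
    try:
        return subprocess.run(cmd).returncode == 0
    except OSError:
        return False   # treat a missing or unreachable tool as a failed step

def pipeline():
    for cmd in BLOCKING:
        if not run(cmd):
            raise SystemExit(f"build failed at blocking step: {cmd}")
    for cmd in ADVISORY:
        if not run(cmd):
            print(f"advisory step failed, reporting but not blocking: {cmd}")

if __name__ == "__main__":
    pipeline()
```

In Jenkins terms, the simplest version of this is moving the scan into its own job that is triggered independently and only publishes a report.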

The next idea that comes to mind is that instead of a single pipeline, you could have a "phased" pipeline. Inspiration for this comes from the book A Practical Approach to Large-Scale Agile Development, which Jez Humble likes to reference. There, they describe a build pipeline with multiple phases, each able to run on its own and each restartable if it fails, without needing to restart everything that came before it.
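
A minimal sketch of that resume-from-the-failed-phase idea, assuming each phase writes a checkpoint marker once it succeeds (the phase names and commands are placeholders):

```python
# phased_pipeline.py - sketch: phases checkpoint on success, so a re-run
# resumes at the first phase that has not yet completed for this commit.
import pathlib
import subprocess

PHASES = [
    ("compile", ["mvn", "package"]),
    ("scan",    ["run-code-scan"]),        # placeholder command
    ("deploy",  ["cf", "push", "my-app"]), # "my-app" is a placeholder app name
]

def run_pipeline(commit_sha):
    checkpoint_dir = pathlib.Path(".checkpoints") / commit_sha
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    for name, cmd in PHASES:
        marker = checkpoint_dir / name
        if marker.exists():
            print(f"skipping {name}: already completed for {commit_sha}")
            continue
        if subprocess.run(cmd).returncode != 0:
            raise SystemExit(f"phase {name} failed; re-run to resume here")
        marker.touch()

if __name__ == "__main__":
    run_pipeline("abc123")   # illustrative commit id
```

If you are on declarative Jenkins pipelines, the built-in "Restart from Stage" option gives you much the same behaviour without custom scripting.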

Euphoric
  • I’d guess that your initial assumption of 1% failure rate is wrong. – gnasher729 Mar 10 '20 at 10:30
  • @gnasher729 Maybe. But then, I would question the sanity of anyone who willingly uses a service with a higher failure rate. Imagine if someone like Google had a 1 in 100 chance that a search would result in an error. – Euphoric Mar 10 '20 at 10:32
  • We kept track of each failure in a log. Over time, we identified about 15 components that sporadically failed. Though some fail more often than others, there aren't any truly black sheep. Who's currently worst also shifts over time. – Florian Mar 10 '20 at 10:41
  • @Florian Then just pick one and fix it. In Jez Humble's own words "Continuous Delivery is just Continuous Improvement." – Euphoric Mar 10 '20 at 10:42
  • "Pick one and fix it" means sitting down with another team, figuring out what they could do to reduce the failure rate of their service, and motivate them to improve things. Not an easy task, and definitely not a quick one. We are doing this, of course. However, it pushes down the failure rate one tiny bit at a time, and each step takes weeks to months to implement. Meanwhile, our pipelines are still very red. – Florian Mar 10 '20 at 12:39
  • Note that for the third paragraph, I can see a counterargument for a team/company refusing to put something in production that doesn't pass the code analysis (to each their own). But in that case, by making that decision you logically consent to blocking your production deployment for any issue that arises with the code analysis or the analyzer. You can't have your cake and eat it too. – Flater Mar 10 '20 at 13:40
  • @Florian: It seems like your team is taking flak for problems caused by other teams. That's not a healthy attitude, exactly because it makes you responsible for something that you have no direct control over. This is where management, specifically the first common manager you and the team share, should get involved. Escalate it to them (possibly via your own manager), explain you can't fix other teams' work for them, and when faced with internal complaints about a broken pipeline, redirect them to the appropriate team/manager. Not your circus, not your monkeys. – Flater Mar 10 '20 at 13:43
  • Imagine the extreme case: Let's say we extracted everything except for build and deploy into a second pipeline that runs alongside and only sends reports to the developers. Does this really improve our situation? Of course, the first pipeline might now run smoothly and would enable us to put every commit into production. But we'd be flying blind. In theory, the second pipeline should give us updates on things that our commits break in a timely manner. But now it's this second pipeline that breaks 99% of its builds, so the reports would reach us days or even weeks too late. – Florian Mar 10 '20 at 15:19
  • @Florian I never suggested being "extreme". If a service is so unstable that it makes your build unusable, then removing it from the deployment pipeline might be an acceptable trade-off, if its information is not critical for safe deployment. – Euphoric Mar 10 '20 at 15:23
  • @Florian: For your extreme case, that's a decision (and balance) left up to the discretion of management. If it's cheaper to pay for the additional development effort for some bugs caught late, compared to fixing an environment which maybe can't easily (or cheaply) be fixed, then so be it. We can have discussions about good practice indefinitely, but it's irrelevant if the company is unwilling to deal with the consequences of applying it (i.e. a robust infrastructure and highly skilled and committed teams). This is a cost vs benefit analysis left up to management. – Flater Mar 10 '20 at 15:48
8

Technical answer

The exact concept that is needed here is fault tolerance. You need a pipeline that is resilient to failure. This is an open-ended engineering problem, of course.

The most obvious brute-force solution is redundancy, i.e. having redundant nodes in the pipeline so that even if something fails, there is another server that can take over.
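
As a rough sketch of what redundancy can look like for a single flaky dependency, assuming you run a local mirror of your artifact registry (all URLs and file names below are placeholders):

```python
# fetch_with_fallback.py - sketch: try the primary registry, fall back to a mirror.
import urllib.request
import urllib.error

# Placeholder endpoints; substitute your real Nexus/npm registry and its mirror.
ENDPOINTS = [
    "https://nexus.example.com/repository/releases/my-lib-1.2.3.jar",
    "https://nexus-mirror.example.com/repository/releases/my-lib-1.2.3.jar",
]

def fetch_artifact(destination="my-lib-1.2.3.jar"):
    for url in ENDPOINTS:
        try:
            urllib.request.urlretrieve(url, destination)
            return url   # report which endpoint served the artifact
        except (urllib.error.URLError, OSError) as err:
            print(f"endpoint failed ({url}): {err}")
    raise SystemExit("all endpoints failed")

if __name__ == "__main__":
    print("fetched from", fetch_artifact())
```

The same pattern applies to other read-mostly dependencies such as the npm registry: a mirror or cache your team controls takes one point of failure off the critical path.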

Another important but often neglected aspect to consider is the recovery mode. If there is a failure, does the pipeline pick up where it left off, or does it always start again from the beginning? Are build tasks idempotent and capable of being run repeatedly without issue, or do you have to trigger a full rollback and clean the system every time? This can make a huge difference in any process that has continuous problems.
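
For transient outages in particular, a plain retry with backoff around each idempotent step already absorbs a lot of flakiness; a minimal sketch (the command and the timings are illustrative):

```python
# retry_step.py - sketch: retry an idempotent pipeline step with exponential backoff.
import subprocess
import time

def run_with_retry(cmd, attempts=3, initial_delay=30):
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        if subprocess.run(cmd).returncode == 0:
            return
        if attempt < attempts:
            print(f"attempt {attempt} failed, retrying in {delay}s: {cmd}")
            time.sleep(delay)
            delay *= 2   # back off to ride out short outages
    raise SystemExit(f"step failed after {attempts} attempts: {cmd}")

if __name__ == "__main__":
    # Only safe because the step converges to the same deployed state when repeated.
    run_with_retry(["cf", "push", "my-app"])   # "my-app" is a placeholder
```

The caveat above applies: this is only safe for steps that are genuinely idempotent.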

Business answer

The other teams need help getting visibility into the issues that are preventing them from delivering a high-quality, stable pipeline. Help them by gathering downtime metrics and estimating the actual cost to the business. This will help them make a case with their own management for additional resources (staff, hardware, training, or third-party services) to meet the business's needs. You can also help measure success by coordinating and agreeing on an internal SLA and tracking their ability to meet it.
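
A back-of-the-envelope sketch of the kind of downtime-versus-SLA report that makes this conversation concrete, assuming you log each outage window per service (the CSV format and the SLA target are illustrative):

```python
# sla_report.py - sketch: turn logged outage windows into availability vs. an SLA target.
# Assumes a hypothetical CSV with lines like "nexus,2020-03-09T10:00,2020-03-09T11:30".
import csv
from collections import defaultdict
from datetime import datetime

SLA_TARGET = 0.995            # illustrative: 99.5% monthly availability
MONTH_MINUTES = 30 * 24 * 60

def availability_report(log_path):
    downtime = defaultdict(float)
    with open(log_path, newline="") as log:
        for service, start, end in csv.reader(log):
            outage = datetime.fromisoformat(end) - datetime.fromisoformat(start)
            downtime[service] += outage.total_seconds() / 60
    for service, minutes in sorted(downtime.items(), key=lambda kv: -kv[1]):
        availability = 1 - minutes / MONTH_MINUTES
        status = "OK" if availability >= SLA_TARGET else "BELOW SLA"
        print(f"{service}: {availability:.3%} available ({minutes:.0f} min down) {status}")

if __name__ == "__main__":
    availability_report("outages_march.csv")
```

Numbers like "service X cost us N developer-days last month" are far easier to escalate than "the pipeline feels flaky".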

Needless to say, do not be adversarial about it, as this is often counterproductive.

John Wu
5

Abandon the flaky network services, and do stuff locally.

You can unit test stuff on your own PC, and you can do it before you check in buggy code. You don't need a Maven repository if the third-party code is all in a repository that your team has control over. You don't need Jenkins to build the source; instead, fire up your IDE and do a Clean and Build. If it doesn't absolutely need to be "in the cloud", do it locally within your team, and keep control over it.

Added bonus: In five years' time, when a customer comes back and asks for an update to a product that you thought was long retired, you don't want to find that you have no way of building the thing anymore. Relying on a mess of servers maintained by multiple other people is asking for trouble. They will all have been retired, or moved on to newer versions that aren't compatible with your old project. It's much better if everything you need is configured locally, so you can check it all out and build the whole lot from scratch according to your documented procedures.

Simon B
  • I assume you mean except for services and steps that absolutely require landscape interaction to make sense? For example, validating performance locally on your notebook doesn't make much sense if the hardware setup in the cloud is considerably different. – Florian Mar 10 '20 at 15:35
  • @Florian OK, so you can't get out of the messy bit at the end where you have to deploy and test it on a real system. – Simon B Mar 10 '20 at 23:10
  • I strongly agree with Simon on this one. You tried to create "it's to-ta-lly au-to-ma-tic" and now it's biting you in the ass. The members of your teams need to do, and to be able to do, the testing that needs to be done ... and you don't want to "automatically deploy(!) to production(!!) spaces." Your team needs to be able to do what it needs to do without relying on a bunch of "somebody else's software" on "somebody else's machine." These various tools sound nifty-swifty until you actually try to use them. Everybody tries for a time to "do what the books say." None succeed. – Mike Robinson Mar 11 '20 at 15:25
  • In the cloud, we always depend on someone else's software and someone else's computers; in some way that's exactly the definition of "cloud". I don't see how turning everything manual would improve things. For example, if my pipeline fails to cf push to an AWS space because of a bug in the push service, why should a manual command from my notebook's console succeed? – Florian Mar 12 '20 at 11:07
  • @Florian That's why I am essentially saying don't use the cloud. Control everything locally. Obviously, if the final target is a cloud server, that needs to be up to deploy your software to it. But every time you are relying on somebody else to provide a service to you, you are introducing a potential point of failure. Their service may be down, they may have moved on to a new version of the service that's incompatible with your development environment, or they may have permanently discontinued the service. You have no control over that. – Simon B Mar 12 '20 at 11:20