Docker managers are losing quorum on AWS

Question

I have deployed Docker EE on AWS by choosing the package directly from the AWS Marketplace: Docker EE for AWS (Standard/Advanced) - BYOL. The cluster has been successfully launched and deployed on AWS.

After I deployed several stacks on UCP, something really strange started happening: the managers can no longer see each other, they lose quorum inside the Swarm, and the docker info command reports that they are no longer part of the Swarm.

The AWS Auto Scaling group then starts launching new managers, and the newly launched managers are terminated within a couple of minutes because they are not Swarm participants either. The Auto Scaling group ends up in an infinite loop, launching new managers and terminating the old ones.

The worst thing is that I am losing the data from the Docker services, and there is no way to recover it.

Any ideas?

What instance sizes are you using? Don't use T2 or anything that caps CPU. Also be sure you're sizing resources based on the UCP/DTR requirements and the best-practice architecture. https://docs.docker.com/datacenter/ucp/2.2/guides/admin/install/system-requirements/ https://success.docker.com/article/docker-ee-best-practices — Bret Fisher, Mar 30 '18 at 05:37
@BretFisher I am using m4.large for the managers and c3.xlarge for the workers.
I don't think this is a resource issue, because the cluster is still being prepared and receives no outside requests yet. The same issue occurred on the previous two clusters with the same configuration. — ibedelovski, Mar 30 '18 at 12:11
Since you're using EE I recommend contacting docker support for that, definitely sounds like a misconfiguration. https://success.docker.com/support — Bret Fisher, Mar 30 '18 at 21:44

score 1 · Answer 1 · answered Apr 02 '18 at 13:52

It seems that the AWS Auto Scaling group is not well synchronized with the Docker Swarm configuration, and that the default timeouts in the Auto Scaling group do not suit the Docker Swarm managers.
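To illustrate why this mismatch is so damaging, here is a minimal Python sketch of the majority rule that Swarm's Raft consensus relies on. The function names are hypothetical, and the sketch assumes that managers terminated by the Auto Scaling group were never demoted, so they still count toward the manager total.

```python
def quorum_size(total_managers: int) -> int:
    """Smallest number of managers that forms a strict majority."""
    return total_managers // 2 + 1


def has_quorum(total_managers: int, reachable_managers: int) -> bool:
    """True if the reachable managers still form a majority.

    Assumption: total_managers includes managers that were terminated
    without being demoted, since the Swarm still lists them as members.
    """
    return reachable_managers >= quorum_size(total_managers)
```

With three managers, losing one still leaves a majority (`has_quorum(3, 2)` is true), but once a second one is terminated before a replacement has joined, the remaining manager can no longer form a quorum (`has_quorum(3, 1)` is false).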

In fact, the Health Check is not configured well by default in the EE CloudFormation template: even when the managers are running and stable, the Health Check fails and the Auto Scaling group tries to launch new managers. The newly launched managers are not able to join the cluster right after they reach the running state.
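One possible way to inspect and relax these settings is sketched below in Python. This is not the fix I verified, only an illustration under assumptions: the helper names and the 600-second grace period are hypothetical, and `client` is assumed to be a boto3 Auto Scaling client (`boto3.client("autoscaling")`). The calls `suspend_processes` and `update_auto_scaling_group` and the `HealthCheckType` / `HealthCheckGracePeriod` fields are part of the Auto Scaling API.

```python
def health_check_summary(group: dict) -> dict:
    """Pick the health-check fields out of one entry of
    describe_auto_scaling_groups()["AutoScalingGroups"]."""
    return {
        "name": group.get("AutoScalingGroupName"),
        "type": group.get("HealthCheckType"),  # "EC2" or "ELB"
        "grace_seconds": group.get("HealthCheckGracePeriod"),
    }


def relax_health_check(client, group_name: str, grace_seconds: int = 600) -> None:
    """Stop automatic replacement and give new managers more time to join.

    Assumptions: client is a boto3 "autoscaling" client, and 600 seconds
    is an illustrative value, not a recommendation from Docker or AWS.
    """
    # Pause replacement of instances that the health check marks unhealthy,
    # so stable managers are not terminated while the cause is investigated.
    client.suspend_processes(
        AutoScalingGroupName=group_name,
        ScalingProcesses=["ReplaceUnhealthy"],
    )
    # Lengthen the grace period before health checks apply to new instances.
    client.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        HealthCheckGracePeriod=grace_seconds,
    )

# Usage sketch (needs AWS credentials; the group name is hypothetical):
#   import boto3
#   client = boto3.client("autoscaling")
#   relax_health_check(client, "my-manager-asg")
```

Taking the client as a parameter keeps the sketch easy to try against a stub before pointing it at a real account.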
