Losing connection between nodes on private network

Question

I have 3 nodes running on a private network.

After asking this question, I've manually added node1 as peer to node2 and node3, so my network looks like this:

                 _______                    
     ---------> | node2 |
    /           |_______|
   v
 _______
| node1 |
|_______|
   ^
    \            _______
     ---------> | node3 |
                |_______|

I left the nodes running overnight, but around 10 PM (BRST) node2 and node3 lost their connection to node1 (their admin.peers was empty), while node1 still listed both of them (its admin.peers contained 2 items) but was not exchanging anything with them.

Is it possible that this is a problem with the Ethereum protocol, or would it be something else?

Edit:

I'm running the nodes on the same physical machine, but in different VMs. Those VMs have Ubuntu 12.04 installed.

The VMs are running on a CentOS host with VMware vCenter.

Are the clocks on your different nodes all synced to network time? – BokkyPooBah, Apr 07 '16 at 14:00
Yes, they are. Indeed, they are on the same physical machine, although in different VMs. – Henrique Barcelos, Apr 07 '16 at 14:04
What VM system are you using and what is the host operating system? – JackWinters, Apr 07 '16 at 14:49

BokkyPooBah · Answer 1 · 2016-04-11T15:07:24.790

7

Summary

I would have expected the --bootnodes parameter to let node 2 and node 3 find node 1 and each other, and then for all the nodes to retain their peer connections.

From Henrique Barcelos's issue, it seems that this configuration is not stable in a private network whose nodes run in VMs.

The alternative of using static-nodes.json does not seem to provide a stable connection either.

The next configuration to try is using trusted-nodes.json and specifying the --maxpeers parameter.

Below are the steps we are taking to solve the problem.

An interesting problem.

1. Probability-wise, I would say it is highly unlikely that this is a problem with the Ethereum protocol, as it is robust enough for the Internet.
2. You have stated that your clocks are synced to network time. Unsynced clocks are a regularly encountered problem, which is why I asked the question above.
3. Let's look at your configuration.
4. Question: are you running any other Ethereum mainnet nodes within your private network? I am asking because there may be a clash on network port 30303, referring to the first link in your question.

5. From the first link in your question, you are starting up your geth instances on the different nodes with the following command:

geth --verbosity 4 --autodag --nat any --genesis /opt/blockchain/genesis.json
 --datadir /opt/blockchain/data --networkid 4828 --port 30303 --rpc
 --rpcaddr 10.48.247.25 --rpcport 8545 --rpcapi db,eth,net,web3,admin
 --rpccorsdomain '*' --fast --mine --ipcdisable

Node 1 has been set as your bootnode. Nodes 2 and 3 have the file /opt/blockchain/data/static-nodes.json with the following information:

["enode://{node1publickey}@{node1ip}:30303"]

6. The information in /opt/blockchain/data/static-nodes.json in 5. above should match the enode that the admin.nodeInfo command displays in the node1 geth instance:
```
...
enode: "enode://{node1publickey}@{node1ip}:30303",
...
```
Is this correct?
7. Looking at your geth startup parameters now:
   a. You could try removing the --nat any parameter.
   b. Your --networkid differs from the mainnet ID "1".
   c. Your --port 30303 should not be a problem unless you are running other mainnet Ethereum nodes within your network, and even then it should not be, as they will be on different IPs.
   d. Your --rpcaddr should be the IP address of each machine. Node 1 should have the node 1 IP address, node 2 the node 2 IP address and node 3 the node 3 IP address.
   e. Your --rpcport, --rpcapi and --rpccorsdomain should not affect your network connectivity.
   f. You should not need the --fast parameter, as your blockchain should be small anyway, being an internal testnet blockchain.
   g. Your --mine and --ipcdisable parameters should not affect your connectivity.
8. What I would try next is to remove the --bootnodes parameter and try the /opt/blockchain/data/static-nodes.json method in 5. above on nodes 2 and 3. Run this for a while and let's hope you don't have the connectivity dropouts. If the dropouts continue, we can cross this off the list.
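
When comparing the static-nodes.json entry with the enode reported by admin.nodeInfo, it may be safer to compare parsed fields than raw strings, because admin.nodeInfo may report a wildcard host such as [::] instead of the routable IP. The following is only a rough sketch in plain ES5 JavaScript. The helper names parseEnode, sameNode and staticNodesJson are made up for this example, and the assumed URL shape (a 128-hex-character node ID followed by host and port) should be checked against your geth version.

```javascript
// Parse an enode URL into its parts; returns null if the shape is wrong.
// Assumed shape: enode://<128 hex chars>@<host>:<port>[?discport=<port>]
function parseEnode(url) {
  var re = /^enode:\/\/([0-9a-fA-F]{128})@(\[[^\]]+\]|[^:\/?#@]+):(\d+)(?:\?discport=(\d+))?$/;
  var m = re.exec(url);
  if (!m) return null;
  return { id: m[1].toLowerCase(), host: m[2], port: parseInt(m[3], 10) };
}

// True when both URLs parse and name the same node ID and TCP port.
// The host is deliberately ignored (assumption: it may be a wildcard).
function sameNode(a, b) {
  var pa = parseEnode(a);
  var pb = parseEnode(b);
  return !!pa && !!pb && pa.id === pb.id && pa.port === pb.port;
}

// Build the text of a static-nodes.json file, rejecting malformed entries
// such as an unfilled {node1publickey} placeholder.
function staticNodesJson(enodes) {
  enodes.forEach(function (e) {
    if (!parseEnode(e)) throw new Error("malformed enode URL: " + e);
  });
  return JSON.stringify(enodes, null, 2);
}
```

The point of staticNodesJson is to catch a placeholder or a truncated key before geth silently ignores the entry.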

Below added 12/04/2016:

Henrique Barcelos has been using the static-nodes.json method for the peers to find each other, but the peer connections still dropped out. I've suggested trying trusted-nodes.json, as it specifies trusted peer nodes to connect to, and these would not be blocked by the --maxpeers connection limit.
What I have found from my testing is that the static-nodes.json option would not connect unless I add a non-zero --maxpeers parameter, although --maxpeers is meant to default to 25.
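
While these configurations are being tested, a possible stopgap is to re-add a dropped peer from the geth console, much as the question describes doing by hand. The sketch below is an illustration only: reconnectMissing is a hypothetical helper, it takes the console's admin object as a parameter so it can be tried with a stand-in, and the peer field name (id, or ID in some releases) is an assumption to verify on your version.

```javascript
// Re-add any expected peer that is missing from admin.peers.
// `admin` is the geth console's admin object (or a stand-in with the same shape).
function reconnectMissing(admin, expectedEnodes) {
  var connected = {};
  admin.peers.forEach(function (p) {
    var id = p.id || p.ID; // field name is an assumption; it may differ by release
    if (id) connected[String(id).toLowerCase()] = true;
  });
  var readded = [];
  expectedEnodes.forEach(function (enode) {
    // Pull the node ID out of enode://<id>@host:port
    var m = /^enode:\/\/([0-9a-fA-F]+)@/.exec(enode);
    if (m && !connected[m[1].toLowerCase()]) {
      admin.addPeer(enode);
      readded.push(enode);
    }
  });
  return readded; // the enode URLs that were passed to admin.addPeer
}
```

In the console this would be called as reconnectMissing(admin, ["enode://{node1publickey}@{node1ip}:30303"]) with the placeholders filled in. It does not explain why the peer was dropped, so it is a workaround rather than a fix.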

edited Apr 11 '16 at 15:07

answered Apr 07 '16 at 14:58

BokkyPooBah


1

About 4. No, I'm only running the nodes in the private network. – Henrique Barcelos Apr 07 '16 at 15:23
1

About 5. node1 is the bootnode, I'm passing its enode via --bootnodes to node2 and node3, but there is no static-nodes.json – Henrique Barcelos Apr 07 '16 at 15:24
1

About 7.a. --nat defaults to "any" according to the docs – Henrique Barcelos Apr 07 '16 at 15:26
1

I've added a static-nodes.json file on node2 and node3, with an array containing the enode of node1. I'll leave the nodes running through the night and tomorrow I'll post the results here (I'm at GMT-0300) – Henrique Barcelos Apr 08 '16 at 00:09
1

Looking forward to hearing about your results. – BokkyPooBah Apr 08 '16 at 00:31
1

Apparently the connectivity problem was solved. I don't know exactly what solved it, though. – Henrique Barcelos Apr 08 '16 at 14:52
1

Probably, the static-nodes.json helped, but I can't know for sure. – Henrique Barcelos Apr 08 '16 at 14:52
1

Run it for a few more days and check if the connections are robust. You could then revert back to the old method of using the bootnodes and if there are disconnections, confirm that the bootnodes method does not suit your environment. – BokkyPooBah Apr 08 '16 at 15:30
1

Yeah, I was planning to do something like that. Thank you for your help with this issue @BokkyPooBah – Henrique Barcelos Apr 08 '16 at 16:40
1

Oh no! The nodes diverged again during the weekend :/ – Henrique Barcelos Apr 11 '16 at 13:19
1

When your nodes diverge, do they never rejoin? I have been doing some further testing and found that I needed to set the --maxpeers parameter for the connection between peers to work. I'll look into this a bit more and get back to you. I have also been setting the --verbosity a bit higher than the default 3 to trace my peer connection issue. – BokkyPooBah Apr 11 '16 at 13:26
1

I don't think they would rejoin any time because they were 1000 blocks away from each other. I'm currently running with verbosity=4, I've seen some messages like "peer has become unhealthy", but I don't know exactly what this means. – Henrique Barcelos Apr 11 '16 at 13:40
1

I'm going through the logs. The last block import was done on Apr-09 at 10:37:19 GMT-0300 (last Saturday), meaning that the nodes haven't communicated with each other for almost 2 days. – Henrique Barcelos Apr 11 '16 at 13:58
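
The two symptoms described in these comments (nodes about 1000 blocks apart, and no block import for days) could be checked with something like the sketch below. It is a minimal illustration, not a diagnostic tool: chainHealth and the snapshot shape { name, blockNumber, lastImportMs } are hypothetical, with blockNumber assumed to come from eth.blockNumber on each node and lastImportMs from the timestamp of the last import line in that node's log.

```javascript
// Summarise node snapshots taken at roughly the same moment.
// Each snapshot: { name, blockNumber, lastImportMs } (hypothetical shape).
function chainHealth(snapshots, nowMs, maxLagBlocks, maxIdleMs) {
  var heights = snapshots.map(function (s) { return s.blockNumber; });
  // Distance between the highest and lowest chain head.
  var spread = Math.max.apply(null, heights) - Math.min.apply(null, heights);
  // Nodes whose last block import is older than the allowed idle time.
  var idle = snapshots.filter(function (s) {
    return nowMs - s.lastImportMs > maxIdleMs;
  }).map(function (s) { return s.name; });
  return { spread: spread, lagging: spread > maxLagBlocks, idle: idle };
}
```

A large spread only shows that the heads are far apart. Telling a fork from a node that has simply fallen behind would still require comparing block hashes at the same height.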
1

And I've also been looking at testing out trusted-nodes.json, as that ignores the maxpeers limit. See http://ethereum.stackexchange.com/questions/2478/what-is-the-difference-between-a-static-node-and-a-trusted-node. You may want to test this configuration next. – BokkyPooBah Apr 11 '16 at 14:00
1

This is odd because --maxpeers defaults to 25 and I only have 3 running nodes :/ – Henrique Barcelos Apr 11 '16 at 14:11
1

I found that when I was using --dev --networkid xxxx --nodiscover with static-nodes.json, I could not get the peers to connect: they were rejected with node rejection messages (shown when verbosity is set to 4 or 5) until I manually set --maxpeers to a non-zero number. – BokkyPooBah Apr 11 '16 at 14:20
1

I'm not using --dev, should I be? – Henrique Barcelos Apr 11 '16 at 14:31
1

Your --networkid and --genesis will make your network instance unique. --help shows that --dev switches on the Developer mode: pre-configured private network with several debugging flags. – BokkyPooBah Apr 11 '16 at 14:48
1

I've noticed more verbose logs in --dev mode. The nodes have behaved nicely through the night, no diverging so far. – Henrique Barcelos Apr 12 '16 at 12:24
1

I have just finished writing up my test results on http://ethereum.stackexchange.com/questions/2851/how-can-i-reliably-induce-a-blockchain-fork-for-testing-purposes . There are some bits there on setting admin.verbosity(6) to trace the P2P connection attempts. – BokkyPooBah Apr 12 '16 at 12:45
2

@HenriqueBarcelos Did you ever resolve this? I am having the same problem and these steps have not solved it. – stone.212 Dec 02 '17 at 14:43
1

Where do you place trusted-nodes.json? Do you have to restart geth for it to use the list of enodes? – Nyxynyx Jul 17 '18 at 15:58
