CentOS 7.2, Ceph with 3 OSDs and 1 MON running on the same node; radosgw and all the daemons run on that node too, and everything was working fine. After rebooting the server, the OSDs apparently can no longer communicate, and radosgw does not work properly; its log says:
2016-03-09 17:03:30.916678 7fc71bbce880 0 ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403), process radosgw, pid 24181
2016-03-09 17:08:30.919245 7fc712da8700 -1 Initialization timeout, failed to initialize
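The MON itself still answers queries from this node (ceph health and ceph osd tree below both work). To get more detail out of the gateway, radosgw can be run in the foreground with raised log levels; the instance name client.radosgw.gateway is only my guess, adjust it to whatever your ceph.conf uses:

ceph -s
radosgw -d --debug-rgw 20 --debug-ms 1 -n client.radosgw.gateway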
ceph health shows:
HEALTH_WARN 1760 pgs stale; 1760 pgs stuck stale; too many PGs per OSD (1760 > max 300); 2/2 in osds are down
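For completeness, the individual stale PGs can be inspected with the commands below; since the health output already says all 1760 PGs are stale, I would expect every one of them to show up there:

ceph health detail
ceph pg dump_stuck stale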
ceph osd tree gives:
ID WEIGHT  TYPE NAME        UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 2.01999 root default
-2 1.01999     host app112
 0 1.00000         osd.0      down  1.00000          1.00000
 1 0.01999         osd.1      down        0          1.00000
-3 1.00000     host node146
 2 1.00000         osd.2      down  1.00000          1.00000
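The OSD processes themselves are up (see the service status further down), so they should also answer on their admin sockets. Assuming the stock socket path under /var/run/ceph/, their internal state can be read independently of the MON:

ceph daemon osd.0 status
# or equivalently, pointing at the socket directly:
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok status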
service ceph status returns:
=== mon.app112 ===
mon.app112: running {"version":"0.94.6"}
=== osd.0 ===
osd.0: running {"version":"0.94.6"}
=== osd.1 ===
osd.1: running {"version":"0.94.6"}
=== osd.2 ===
osd.2: running {"version":"0.94.6"}
and this is the output of service radosgw status:
Redirecting to /bin/systemctl status radosgw.service
● ceph-radosgw.service - LSB: radosgw RESTful rados gateway
Loaded: loaded (/etc/rc.d/init.d/ceph-radosgw)
Active: active (exited) since Wed 2016-03-09 17:03:30 CST; 1 day 23h ago
Docs: man:systemd-sysv-generator(8)
Process: 24134 ExecStop=/etc/rc.d/init.d/ceph-radosgw stop (code=exited, status=0/SUCCESS)
Process: 2890 ExecReload=/etc/rc.d/init.d/ceph-radosgw reload (code=exited, status=0/SUCCESS)
Process: 24153 ExecStart=/etc/rc.d/init.d/ceph-radosgw start (code=exited, status=0/SUCCESS)
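As far as I understand, Active: active (exited) here only means the LSB init script exited with status 0; for these generated sysv units, systemd does not track the actual radosgw process, so the daemon has to be checked directly:

pgrep -a radosgw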
Seeing this, I have stopped and started osd.1 with the init script a couple of times, but the result is the same as above:
sudo /etc/init.d/ceph -a stop osd.1
=== osd.1 ===
Stopping Ceph osd.1 on open-kvm-app92...kill 12688...kill 12688...done
sudo /etc/init.d/ceph -a start osd.1
=== osd.1 ===
create-or-move updated item name 'osd.1' weight 0.02 at location {host=open-kvm-app92,root=default} to crush map
Starting Ceph osd.1 on open-kvm-app92...
Running as unit ceph-osd.1.1457684205.040980737.service.
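The OSD logs are in the default location (/var/log/ceph/, assuming a stock installation); what osd.0 prints there is quoted in the EDIT below:

tail -f /var/log/ceph/ceph-osd.0.log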
Please help. Thanks.
EDIT: It seems the MON cannot talk to the OSDs, even though both kinds of daemon are running fine. The OSD log shows:
2016-03-11 17:35:21.649712 7f003c633700 5 osd.0 234 tick
2016-03-11 17:35:22.649982 7f003c633700 5 osd.0 234 tick
2016-03-11 17:35:23.650262 7f003c633700 5 osd.0 234 tick
2016-03-11 17:35:24.650538 7f003c633700 5 osd.0 234 tick
2016-03-11 17:35:25.650807 7f003c633700 5 osd.0 234 tick
2016-03-11 17:35:25.779693 7f0024c96700 5 osd.0 234 heartbeat: osd_stat(6741 MB used, 9119 MB avail, 15861 MB total, peers []/[] op hist [])
2016-03-11 17:35:26.651059 7f003c633700 5 osd.0 234 tick
2016-03-11 17:35:27.651314 7f003c633700 5 osd.0 234 tick
2016-03-11 17:35:28.080165 7f0024c96700 5 osd.0 234 heartbeat: osd_stat(6741 MB used, 9119 MB avail, 15861 MB total, peers []/[] op hist [])
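The peers []/[] part of the heartbeat line means the OSD has no heartbeat peers at all, which fits the picture of the OSDs never successfully registering with the MON after the reboot. One thing worth comparing (this is a guess at the cause, not something I have confirmed) is the addresses the MON has recorded for the OSDs versus the node's current addresses, since an IP that changed across the reboot would produce exactly these symptoms:

ceph osd dump | grep '^osd'
ip addr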
