Client banned by request, lost one CS, cluster died

Discussion in 'General Questions' started by SteveITS, Jun 2, 2015.

  1. SteveITS

    SteveITS Tera Poster

    Messages:
    271
    I am trying to decipher these log entries. pstorage get-event shows no events from 9:28am to 9:37pm, then:

    01-06-15 21:37:55.217 MDS WRN: CS#1028 is inactive
    01-06-15 21:37:55.326 MDS WRN: The cluster is degraded with 4 active, 1 inactive, 0 offline CS
    01-06-15 21:38:06.131 CLN WRN: IO requests on client #2990 serviced more than 8000 ms
    01-06-15 21:38:53.056 MDS INF: Client 2987 is banned by request from 10.254.1.101:50075
    01-06-15 21:39:13.157 CLN WRN: IO requests on client #2989 serviced more than 8000 ms
    01-06-15 21:39:33.265 CLN WRN: IO requests on client #3032 serviced more than 8000 ms
    01-06-15 21:43:06.144 CLN WRN: IO requests on client #2990 serviced more than 8000 ms
    01-06-15 21:44:13.172 CLN WRN: IO requests on client #2989 serviced more than 8000 ms
    01-06-15 21:52:51.337 MDS WRN: CS#1028 is offline
    01-06-15 21:52:56.348 MDS INF: Replication started, 1693 chunk(s) are queued (1693 chunk(s) on offline CS#1028)
    01-06-15 21:54:25.544 MDS INF: The cluster physical free space: 4.3Tb (88%), total 4.9Tb
    01-06-15 21:57:25.645 MDS INF: The cluster physical free space: 4.2Tb (86%), total 4.9Tb
    01-06-15 21:58:51.369 MDS WRN: CS#1030 is inactive
    01-06-15 21:59:01.177 CLN WRN: IO requests on client #2990 serviced more than 8000 ms
    01-06-15 21:59:03.217 CLN WRN: IO requests on client #2989 serviced more than 8000 ms
    01-06-15 21:59:03.343 CLN WRN: IO requests on client #3032 serviced more than 8000 ms
    01-06-15 21:59:12.371 MDS WRN: CS#1027 is inactive
    01-06-15 21:59:47.877 MDS INF: Client 2991 is banned by request from 10.254.1.101:51007
    01-06-15 21:59:48.979 MDS WRN: Failed to allocate 3 replicas at tier 0 since only 2 chunk servers are available for allocation - consider adding more chunk servers or reducing the replication factor
    01-06-15 21:59:48.979 MDS WRN: Failed to allocate 3 replicas at tier 0 since only 2 chunk servers are available for allocation - consider adding more chunk servers or reducing the replication factor [+...]

    A minute later all VMs and containers stopped and all nodes went offline. I could not ping any of them from "inside" our rack in the data center. CS#1028 is on a hardware node that I think crashed, because it would not respond to a keyboard plugged in; that one I powered off. Other nodes showed various network errors on their consoles. I ended up restarting them, and it's back up, but I am confused about why this happened. I suspect something happened to the storage system and the other nodes were spending so much time waiting on the network that they couldn't talk to each other? What does "banned by request" mean? It's not in the KB that I can find...
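
    In case it's useful, here's a rough Python sketch of how I'm pulling just the CS state changes and ban events out of that pstorage get-event output. It only pattern-matches the text of the lines shown above, nothing pstorage-specific, and "events.txt" is just wherever I saved the output:

    import re

    # Matches event lines in the format shown above, e.g.
    # "01-06-15 21:37:55.217 MDS WRN: CS#1028 is inactive"
    EVENT = re.compile(
        r"^(?P<ts>\d{2}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
        r"(?P<src>\w+) (?P<lvl>\w+): (?P<msg>.*)$"
    )
    INTERESTING = ("is inactive", "is offline", "banned by request", "degraded")

    with open("events.txt") as f:  # saved output of `pstorage get-event`
        for line in f:
            m = EVENT.match(line.strip())
            if m and any(key in m.group("msg") for key in INTERESTING):
                print(m.group("ts"), m.group("src"), m.group("msg"))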
     
  2. IP^__^

    IP^__^ Odin Team

    Messages:
    80
    You may find the following KB article useful: http://kb.odin.com/en/118835

    >Other nodes showed various network errors on their consoles.
    What were these errors?
     
  3. SteveITS

    SteveITS Tera Poster

    Messages:
    271
    I saw that KB. I don't think it applies directly since the nodes definitely didn't reboot on their own after two minutes. But, the description of lost network communication fits. I just searched for "banned" in pstorage-mount.log.gz.1 and didn't find any occurrences.
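
    Something along these lines works for grepping the gzipped rotated log, by the way (the filename is the one above; the path is just wherever it lives on the node):

    import gzip, re

    # Search the rotated, gzip-compressed mount log for "banned".
    pattern = re.compile(r"banned", re.IGNORECASE)
    with gzip.open("pstorage-mount.log.gz.1", "rt", errors="replace") as log:
        for lineno, line in enumerate(log, 1):
            if pattern.search(line):
                print(lineno, line.rstrip())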

    Unfortunately the network errors were scrolling by fairly quickly, I was moving back and forth between the front and back of the rack aisle, and I didn't think to take a picture until later (it was 2am). From memory they included things like access to the network being denied, transmission errors, things of that nature... not the exact wording, of course. I did power off the switch used for the storage, but that didn't seem to help.

    My best guess is that the failed node did something that knocked the others off the network. Perhaps it put the switch into a bad state that affected the other NICs, but in a way that simply powering off the switch didn't fix.
     
  4. SteveITS

    SteveITS Tera Poster

    Messages:
    271
    I should clarify that when I say I couldn't ping the nodes, I mean I couldn't ping the "front side" of their network. We have storage on a separate network, with its own NICs and switch. So if something happened on the storage network, it would have had to affect either the networking on the other network or the networking software on the nodes.
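
    If it helps to picture it: the quickest check of which side is affected is just to ping each node on both networks. The addresses below are placeholders, not our real scheme:

    import subprocess

    # Placeholder addresses for the same nodes on each network.
    NODES = {
        "node1": {"front (eth0)": "192.0.2.11", "storage (eth1)": "10.254.1.11"},
        "node2": {"front (eth0)": "192.0.2.12", "storage (eth1)": "10.254.1.12"},
    }

    def reachable(ip):
        # One ICMP echo with a 2-second timeout; True if the node answered.
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "2", ip],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0

    for node, nets in NODES.items():
        for net, ip in nets.items():
            print(node, net, ip, "up" if reachable(ip) else "unreachable")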
     
  5. IP^__^

    IP^__^ Odin Team

    Messages:
    80
    Steve,

    In such cases it's better to contact our Product Support Department. They will be able to investigate the issue urgently and tell you exactly what happened to your node.
    Of course, you should not reboot the node until they have checked it (looking at the logs alone may not be enough to say exactly what happened).
    In cases where nodes are only available via the internal network, you may want to set up a jump host that has access to both the internal and the external networks.
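
    For example, something along these lines from an external workstation, going through the jump host to a node that is only on the internal network (the host names and addresses are only placeholders):

    import subprocess

    # Placeholders: "jump.example.com" is the dual-homed jump host,
    # 10.254.1.11 stands in for a node that is only on the internal network.
    JUMP = "admin@jump.example.com"
    NODE = "root@10.254.1.11"

    # Tunnel the SSH session through the jump host (OpenSSH ProxyCommand + -W),
    # so the node itself does not need to be reachable from outside.
    subprocess.run(
        ["ssh", "-o", f"ProxyCommand=ssh -W %h:%p {JUMP}", NODE, "uptime"],
        check=False,
    )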
     
  6. SteveITS

    SteveITS Tera Poster

    Messages:
    271
    I thought I'd follow up on my post. This happened to us again about 9 months later, and it took me a bit to realize it was the same issue as in June. We suspect it is a hardware failure where one of the physical servers locks up, or gets into a bad state and then locks up, and somehow breaks network communication on our storage network (the eth1 interfaces), taking down all the storage while leaving the other physical servers running. I replaced that switch and it was up for under a minute, so I think the switches were blocking traffic. Unfortunately, whatever happened also blocked network communication on the front (non-storage) side of the network, since no physical servers were pingable on eth0 either.

    Shutting down the servers, powering off the locked-up server, and powering off the storage network switch resolved the issue. (We then left that server off pending replacement.)
     
