Neighbour table overflow on hardware node

Discussion in 'Networking Questions' started by SteveITS, Aug 17, 2016.

  1. SteveITS

    SteveITS Mega Poster

    Messages:
    210
    We had a short episode recently where pstorage top logged events like this, for about 5 minutes:

    15-08-16 18:48:53 MDS WRN CS#1030 is inactive
    15-08-16 18:48:56 MDS INF CS#1030 is active
    15-08-16 18:48:59 MDS WRN CS#1030 is inactive
    15-08-16 18:49:02 MDS INF CS#1030 is active
    15-08-16 18:49:03 CLN WRN IO requests on client #34977 serviced more than 8000 ms
    15-08-16 18:49:03 CLN WRN IO requests on client #35018 serviced more than 8000 ms
    15-08-16 18:49:05 MDS WRN CS#1030 is inactive
    15-08-16 18:49:07 MDS INF CS#1030 is active
    15-08-16 18:49:10 MDS WRN CS#1030 is inactive
    15-08-16 18:49:12 MDS INF CS#1030 is active

    Looking at the CS 1030 server's messages log, this was at the same time as log entries like this:

    Aug 15 18:48:20 hn2 kernel: [333535.133688] Neighbour table overflow.
    Aug 15 18:48:20 hn2 kernel: [333535.133803] Neighbour table overflow.
    Aug 15 18:48:20 hn2 kernel: [333535.133990] Neighbour table overflow.
    Aug 15 18:48:20 hn2 kernel: [333535.134103] Neighbour table overflow.
    Aug 15 18:48:20 hn2 pstorage-mount: 34977 IO_TIMES 333686466674+998546+8297614 IO1R [000add1a]/root.hds 4096@847556608 [333686466674] QD:1 CS#1031:rd:332946060673/397/3776410 CS1033:rd:333456115767/1699/444 CS1029:rd:333284459075/345/0 on /pstorage/cluster1
    Aug 15 18:48:25 hn2 kernel: [333540.131315] __ratelimit: 34496 callbacks suppressed
    Aug 15 18:48:25 hn2 kernel: [333540.131318] Neighbour table overflow.
    Aug 15 18:48:25 hn2 kernel: [333540.131454] Neighbour table overflow.
    Aug 15 18:48:25 hn2 kernel: [333540.131565] Neighbour table overflow.

    The messages just stop at 18:50, as if the problem resolved itself.

    Looking up the overflow error, it sounds like increasing the ARP cache size is necessary (http://www.cyberciti.biz/faq/centos-redhat-debian-linux-neighbor-table-overflow/). Is that the right thing to do here? The current values are:

    net.ipv4.neigh.default.gc_thresh1 = 128
    net.ipv4.neigh.default.gc_thresh2 = 2048
    net.ipv4.neigh.default.gc_thresh3 = 4096
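
    For reference, this is how I've been checking how full the neighbor table actually gets - just standard iproute2/sysctl commands, nothing Virtuozzo-specific, and as I understand it gc_thresh3 is the hard cap that triggers the overflow message:

    # count current IPv4 (ARP) and IPv6 (NDP) neighbor entries on the node
    ip -4 neigh show | wc -l
    ip -6 neigh show | wc -l
    # show the hard limits to compare against
    sysctl net.ipv4.neigh.default.gc_thresh3 net.ipv6.neigh.default.gc_thresh3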

    All of our containers and VMs use Bridged mode networking, if that matters. This particular node has only 10 containers on it, but each of those has its own block (or blocks) of IPv6 addresses.

    I found this KB article, but it talks about getting that message inside a VM, not on the physical server, and with routed networking.

    We have Virtuozzo 6 Update 11 Hotfix 11 installed.
     
  2. Pavel

    Pavel A.I. Auto-Responder Odin Team

    Messages:
    403
    Hello,

    The KB article describes this behavior inside a VM - that case was caused by a particular bug (a guest network misconfiguration performed by the guest tools), so it isn't really relevant here.
    You are getting the message on a hardware node because of the large number of IP addresses in use, which is a common system administration scenario. You were right about increasing the ARP cache size - that is the correct solution, and you can apply it to all of the nodes in your cluster as a preventive measure.
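
    As a rough sketch, the change in /etc/sysctl.conf on each node could look like this - the numbers below are placeholders only, size them to the number of IP addresses the node actually serves:

    # example values only - adjust for your environment
    net.ipv4.neigh.default.gc_thresh1 = 1024
    net.ipv4.neigh.default.gc_thresh2 = 4096
    net.ipv4.neigh.default.gc_thresh3 = 8192

    Then run "sysctl -p" to re-read /etc/sysctl.conf so the new limits apply without a reboot.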
     
  3. Pavel

    Pavel A.I. Auto-Responder Odin Team

    Messages:
    403
    Oh right... Since you're actively using IPv6, you might want to increase the IPv6 neighbor cache size as well:
    net.ipv6.neigh.default.gc_thresh1
    net.ipv6.neigh.default.gc_thresh2
    net.ipv6.neigh.default.gc_thresh3
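
    Again only as a sketch with placeholder values - you can bump them at runtime first and then add the same lines to /etc/sysctl.conf to make them persistent:

    # example values only - size to the number of IPv6 addresses in use
    sysctl -w net.ipv6.neigh.default.gc_thresh1=1024
    sysctl -w net.ipv6.neigh.default.gc_thresh2=4096
    sysctl -w net.ipv6.neigh.default.gc_thresh3=8192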
     
  4. SteveITS

    SteveITS Mega Poster

    Messages:
    210
    Thanks!
     
