log reports

Discussion in 'General Questions' started by Dave W, Mar 10, 2016.

  1. Dave W

    Dave W Kilo Poster

    Messages:
    67
    Howdy,

    Log reports:
    11-03-16 02:19:43 CLN WRN IO requests on client #133744 serviced more than 8000 ms
    11-03-16 02:19:58 CLN WRN IO requests on client #133744 serviced more than 8000 ms
    11-03-16 02:21:02 CLN WRN IO requests on client #101184 serviced more than 8000 ms
    11-03-16 02:21:03 CLN WRN IO requests on client #100615 serviced more than 8000 ms
    11-03-16 03:23:18 MDS WRN CS#1061 is inactive
    11-03-16 03:23:19 MDS INF CS#1061 is active


    What do these mean?

    Should we be concerned?

    Dave
     
  2. Pavel

    Pavel A.I. Auto-Responder Odin Team

    Messages:
    416
    Hello Dave,

    It means exactly what it states - some I/O operation took more than 8 seconds to complete. Generally, containers tolerate this just fine, but VMs are paused if I/O takes more than 8 seconds, for consistency's sake.
    If it is a very rare occurrence, you can disregard it. However, if these messages occur regularly, you might want to have your cluster looked at to find out what exactly is affecting performance.
    Unfortunately, analysis of such a problem is too involved to sum up in a couple of sentences, so I'd recommend contacting technical support for assistance.
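    For a rough sense of how often these warnings occur, you could save the event log to a file and tally them per day - for example (assuming the file is called events.log and uses the same DD-MM-YY timestamps as in your paste):
    # grep "serviced more than 8000 ms" events.log | awk '{count[$1]++} END {for (d in count) print d, count[d]}'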
     
  3. Dave W

    Dave W Kilo Poster

    Messages:
    67
    Thank you Pavel, will get them to take a look.
     
  4. SteveITS

    SteveITS Mega Poster

    Messages:
    211
    Dave, since you posted the inactive/active lines, this sounds relevant:
    https://kb.plesk.com/en/128537
    See bug fix 1: "Default mtime granularity could result in chunk servers switching between active/inactive states in Virtuozzo Storage"

    There was another update or hotfix last year that, as I recall, was said to fix a problem causing the "serviced more than 8000 ms" errors. However, neither fix seems to have helped us, since we have still averaged a few occurrences per week ever since we installed PCS. Similar to your post, they seem to happen in bunches: 2-4 chunk servers (out of 5, now 6) within a couple of seconds, and then not again until 12-72 hours later.

    I've (almost) never noticed any symptoms, though 8 seconds would be a short window to notice even if I were connected to a server.

    However, since Pavel mentioned VMs having an issue: for the first time ever, on Saturday we had a VM pause (one of our synced router VMs, so no one noticed). PVA showed it Running, but prlctl showed it paused and wouldn't start it. I punted, installed the 6.11 update, and vzrebooted the node, and the VM started after the boot. At about the same time (within 15 minutes of the "8000 ms" error, as I recall) we got an error backing up that VM: "Can't backup VM. The operation could not be completed because there was a problem reading the virtual machine location. The hard disk storing the virtual machine "xxxxxxxxx" is slow. The virtual machine will be paused until disk reading speed is increased." The storage seemed fine by the time I looked at it.
     
  5. Pavel

    Pavel A.I. Auto-Responder Odin Team

    Messages:
    416
    Hello Steve,

    Can you paste the output of the following command somewhere like Pastebin?
    # grep IO_TIMES /var/log/messages*

    Run this command on a host where these messages were observed.
    Also, add the output from
    # pstorage -c %clustername% stat -n

    At the very least, I can tell you whether there is a single bottleneck somewhere.
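    If the combined output is large, you could collect both into a single file before uploading, e.g. (substituting your real cluster name):
    # grep IO_TIMES /var/log/messages* > /tmp/pstorage-diag.txt
    # pstorage -c %clustername% stat -n >> /tmp/pstorage-diag.txt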
     
  6. SteveITS

    SteveITS Mega Poster

    Messages:
    211
    http://pastebin.com/JPaqZ65y

    It has a size limit, so I only posted from /var/log/messages rather than messages*, which also covered the period of the hardware failure I mentioned in another thread. The pstorage event log shows all nodes at one time or another, so I picked one. Note there have been no errors since 3/22:

    20-03-16 03:08:08 CLN WRN IO requests on client #26748 serviced more than 8000 ms
    20-03-16 03:08:23 CLN WRN IO requests on client #26748 serviced more than 8000 ms
    20-03-16 03:08:48 CLN WRN IO requests on client #26748 serviced more than 8000 ms
    20-03-16 03:08:54 CLN WRN IO requests on client #26820 serviced more than 8000 ms
    20-03-16 03:09:03 CLN WRN IO requests on client #26748 serviced more than 8000 ms
    20-03-16 03:09:18 CLN WRN IO requests on client #26748 serviced more than 8000 ms
    20-03-16 03:09:24 CLN WRN IO requests on client #26820 serviced more than 8000 ms
    20-03-16 03:09:33 CLN WRN IO requests on client #26748 serviced more than 8000 ms
    20-03-16 03:09:34 CLN WRN IO requests on client #26820 serviced more than 8000 ms
    20-03-16 04:01:10 MDS INF The cluster physical free space: 3.4Tb (58%), total 5.7Tb
    20-03-16 04:50:19 CLN WRN IO requests on client #26820 serviced more than 8000 ms
    20-03-16 04:50:34 CLN WRN IO requests on client #26820 serviced more than 8000 ms
    20-03-16 06:22:53 CLN WRN IO requests on client #26748 serviced more than 8000 ms
    20-03-16 06:22:54 CLN WRN IO requests on client #26776 serviced more than 8000 ms
    20-03-16 06:22:55 CLN WRN IO requests on client #26820 serviced more than 8000 ms
    20-03-16 06:22:55 CLN WRN IO requests on client #26709 serviced more than 8000 ms
    20-03-16 06:56:20 CLN WRN IO requests on client #26820 serviced more than 8000 ms
    20-03-16 22:04:31 MDS INF The cluster physical free space: 3.4Tb (60%), total 5.7Tb
    21-03-16 02:03:11 MDS INF The cluster physical free space: 3.5Tb (62%), total 5.7Tb
    21-03-16 04:12:41 MDS INF The cluster physical free space: 3.5Tb (60%), total 5.7Tb
    21-03-16 11:27:48 CLN WRN IO requests on client #26820 serviced more than 8000 ms
    21-03-16 11:27:50 CLN WRN IO requests on client #26709 serviced more than 8000 ms
    21-03-16 11:27:50 CLN WRN IO requests on client #26776 serviced more than 8000 ms
    21-03-16 11:27:53 CLN WRN IO requests on client #26748 serviced more than 8000 ms
    21-03-16 19:05:56 MDS INF Global configuration updated by request from 10.254.1.105:56758
    22-03-16 02:01:32 MDS INF The cluster physical free space: 3.5Tb (62%), total 5.7Tb
    22-03-16 06:03:49 CLN WRN IO requests on client #26709 serviced more than 8000 ms
    22-03-16 06:03:50 CLN WRN IO requests on client #26776 serviced more than 8000 ms
    22-03-16 06:03:51 CLN WRN IO requests on client #26820 serviced more than 8000 ms
    22-03-16 06:03:53 CLN WRN IO requests on client #26748 serviced more than 8000 ms
    22-03-16 13:45:50 CLN WRN IO requests on client #26709 serviced more than 8000 ms
    22-03-16 13:45:52 CLN WRN IO requests on client #26776 serviced more than 8000 ms
    22-03-16 13:45:52 CLN WRN IO requests on client #26820 serviced more than 8000 ms
    22-03-16 13:45:54 CLN WRN IO requests on client #26748 serviced more than 8000 ms
    24-03-16 04:13:04 MDS INF The cluster physical free space: 3.5Tb (60%), total 5.7Tb
     
  7. Pavel

    Pavel A.I. Auto-Responder Odin Team

    Messages:
    416
    Thanks for the output.
    I wrote a small parser for IO_TIMES a few days ago and tested it on your output.
    The statistics are:
    4 bottleneck(s) in disk in CS#1027
    20 bottleneck(s) in disk in CS#1029
    11 bottleneck(s) in disk in CS#1031
    21 bottleneck(s) in disk in CS#1032
    191 bottleneck(s) in fuse

    The network has not been a problem even a single time. Note that "disk" might technically be affected by host slowness; it is not necessarily an I/O throughput issue (e.g., sudden CPU scheduling problems can cause the CS/MDS processes to run slowly). The "stat" output is not complete since I forgot to add "-a" to my command, but it's crystal clear that the disks are doing fine and are quite unlikely to be the root cause.
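    For completeness, the fuller stat I'm referring to would come from adding the missing flag to the earlier command (cluster name as before):
    # pstorage -c %clustername% stat -a -n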

    Since it's mostly fuse that's slow, I'd assume it is indeed something related to host performance or process scheduling.
    Let's try the following: use "renice" on the pstorage processes (cs/mds/mount) to give them a higher priority. Let's even go for a nice value of -20, and let us know whether that helps avoid the IO_TIMES messages.
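    A minimal sketch of that step, assuming the daemons show up as csd and mdsd and the client mount process as pstorage-mount (run it on each node):
    # for p in csd mdsd pstorage-mount; do pids=$(pgrep -x $p); [ -n "$pids" ] && renice -n -20 -p $pids; done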
     
  8. Pavel

    Pavel A.I. Auto-Responder Odin Team

    Messages:
    416
    Also, make sure you're running the latest kernel with update 11 - it provides further optimizations and should be helpful, especially for cases like this one.
     
  9. SteveITS

    SteveITS Mega Poster

    Messages:
    211
    I've reniced the csd, pstorage-mount, and mdsd processes. I'll wait a few days and report back.
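    For anyone following along, the new nice values can be checked with something like:
    # ps -eo pid,ni,comm | grep -E 'csd|mdsd|pstorage-mount'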

    These servers are only a year old, though: SuperMicro dual-socket boards with one Xeon E5-2650 v3 CPU, 64 GB RAM, and an SSD for caching in each. I think the most heavily used nodes have around 10 containers as we migrate from our old hosting. CPU usage has never been anywhere near the maximum whenever I've looked. After talking with SuperMicro we plan to upgrade the LSI RAID firmware the next time we go to the data center, but that wasn't to fix anything specific.

    We did just upgrade to the update 11 fix; that (and more likely the vzreboot afterwards) got the VM I mentioned above running again.
     
  10. Pavel

    Pavel A.I. Auto-Responder Odin Team

    Messages:
    416
    Steve, neither the reboot nor the upgrade was necessary. The VM should have been unpaused automatically once it was able to write to the disk - that's the design, and so far it has never failed us, proving to be reliable enough.

    I'll be waiting for your feedback, although I might not be able to respond next week due to a planned vacation.
     
  11. SteveITS

    SteveITS Mega Poster

    Messages:
    211
    I'd agree, but like I said, PVA showed it Running and wouldn't pause it, while prlctl showed it paused and wouldn't/couldn't start it. So it was in a weird state. The other option, which I didn't try, was to power off the VM, but I thought a vzreboot was cleaner than a power-off if it stayed paused.

    I did the upgrade because it was on my to-do list, and I figured I'd take care of it if I was going to reboot anyway.
     
    Last edited: Mar 25, 2016
  12. SteveITS

    SteveITS Mega Poster

    Messages:
    211
    I am still getting the /var/log/messages entries and pstorage top events after setting the processes to a niceness of -20.
     
  13. Pavel

    Pavel A.I. Auto-Responder Odin Team

    Messages:
    416
    Steve, thanks for giving it a shot. Unfortunately, this isn't the kind of issue that can be resolved through a forum conversation; it requires hands-on troubleshooting and debugging. Please submit a support ticket for deeper analysis.
     
  14. SteveITS

    SteveITS Mega Poster

    Messages:
    211
    Dave, do you have either Intel DC S3710 SSDs or LSI 3108 RAID controllers?

    I hope I'm not jumping the gun, but on 4/20 we updated the RAID firmware and SSD firmware. In the 6 days since, we have had only one "serviced more than 8000 ms" error, on only one server, along with matching IO_TIMES log entries (all at 2:06:24 am) - FAR fewer than before. As in, zero on the other servers.

    Putting some other things together: we had a Windows server we built five months ago for a client that showed some weird behavior where the Intel management software would report one or even both drives dropped out of the RAID 1 array, yet the array's drive letter was still readable and writable from Windows. Updating the management software and SSD firmware seemed to fix that.

    Intel has released multiple firmware updates for that drive in the past 6 months, and it is a common factor, so I suspect it is the main culprit.
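    For reference, smartctl can report a drive's firmware version; the device path, and whether you need the megaraid option to reach drives behind the LSI controller, will vary per setup:
    # smartctl -i /dev/sda | grep -i firmware
    # smartctl -i -d megaraid,0 /dev/sda | grep -i firmware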
     
  15. Pavel

    Pavel A.I. Auto-Responder Odin Team

    Messages:
    416
    Nice point, Steve - thanks for the idea!
     