Hi,

I've installed Virtuozzo 7 on 2 identical servers, and both have now hit the exact same error after some days of running. The dates differ (the second server failed 20+ days after the first one), but the time was also 12:00:01 AM:

ABRT has detected 1 problem(s). For more info run: abrt-cli list --since ...

id f1e6ccf85836a9a4ecb511e12fa3d7a41d40fa63
reason: pfcached killed by SIGBUS
time: Sat 01 Apr 2017 12:00:01 AM CEST
cmdline: /sbin/pfcached
package: pfcache-7.0.20-12.vz7
uid: 0 (root)
count: 1
Directory: /var/spool/abrt/ccpp-2017-04-01-00:00:01-2038

Is this normal? Both servers run only 1 container with minimal load.

Hardware info:
DL380G9's with 2 CPUs
160 GB RAM
7x SSD disks from HP running in RAID6

If anyone has similar problems or knows the cause, please let me know.

Thanks in advance!
Hello deagan,

I've never heard of anything like that. SIGBUS might indicate problems with memory management in the code, e.g. a double free or access to foreign memory. Can you please check whether ABRT has saved a coredump of the process? A core would help us greatly.
Hi Pavel,

I'm not familiar with abrt, but in the folder I see a coredump file. I could send you the whole dir; what is the best way to send you the files?

Folder: /var/spool/abrt/ccpp-2017-04-01-00:00:01-2038

total 3.2M
drwxr-x--- 2 root abrt 4.0K Apr 3 11:54 .
drwxr-x--x 3 root abrt 4.0K Apr 1 00:00 ..
-rw-r----- 1 root abrt 6 Apr 1 00:00 abrt_version
-rw-r----- 1 root abrt 4 Apr 1 00:00 analyzer
-rw-r----- 1 root abrt 6 Apr 1 00:00 architecture
-rw-r----- 1 root abrt 249 Apr 1 00:00 cgroup
-rw-r----- 1 root abrt 14 Apr 1 00:00 cmdline
-rw-r----- 1 root abrt 7 Apr 1 00:00 component
-rw-r----- 1 root abrt 2.7K Apr 1 00:00 core_backtrace
-rw-r----- 1 root abrt 12M Apr 1 00:00 coredump
-rw-r----- 1 root abrt 1 Apr 1 00:00 count
-rw-r----- 1 root abrt 1.8K Apr 1 00:00 dso_list
-rw-r----- 1 root abrt 218 Apr 1 00:00 environ
-rw-r----- 1 root abrt 0 Apr 1 00:00 event_log
-rw-r----- 1 root abrt 18 Apr 1 00:00 executable
-rw-r----- 1 root abrt 82 Apr 1 00:00 exploitable
-rw-r----- 1 root abrt 4 Apr 1 00:00 global_pid
-rw-r----- 1 root abrt 18 Apr 1 00:00 hostname
-rw-r----- 1 root abrt 25 Apr 1 00:00 kernel
-rw-r----- 1 root abrt 10 Apr 1 00:00 last_occurrence
-rw-r----- 1 root abrt 1.3K Apr 1 00:00 limits
-rw-r----- 1 root abrt 135 Apr 1 00:00 machineid
-rw-r----- 1 root abrt 9.8K Apr 1 00:00 maps
-rw-r----- 1 root abrt 581 Apr 1 00:00 open_fds
-rw-r----- 1 root abrt 259 Apr 1 00:00 os_info
-rw-r----- 1 root abrt 29 Apr 1 00:00 os_release
-rw-r----- 1 root abrt 21 Apr 1 00:00 package
-rw-r----- 1 root abrt 4 Apr 1 00:00 pid
-rw-r----- 1 root abrt 6 Apr 1 00:00 pkg_arch
-rw-r----- 1 root abrt 1 Apr 1 00:00 pkg_epoch
-rw-r----- 1 root abrt 7 Apr 1 00:00 pkg_name
-rw-r----- 1 root abrt 6 Apr 1 00:00 pkg_release
-rw-r----- 1 root abrt 6 Apr 1 00:00 pkg_version
-rw-r----- 1 root abrt 1.2K Apr 1 00:00 proc_pid_status
-rw-r----- 1 root abrt 1 Apr 1 00:00 pwd
-rw-r----- 1 root abrt 25 Apr 1 00:00 reason
-rw-r----- 1 root abrt 4 Apr 1 00:00 runlevel
-rw-r----- 1 root abrt 10 Apr 1 00:00 time
-rw-r----- 1 root abrt 4 Apr 1 00:00 type
-rw-r----- 1 root abrt 1 Apr 1 00:00 uid
-rw-r----- 1 root abrt 5 Apr 1 00:00 username
-rw-r----- 1 root abrt 40 Apr 1 00:00 uuid
-rw-r----- 1 root abrt 22K Apr 1 00:00 var_log_messages
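For sending the whole directory as a single file, something along these lines might do. It is only a sketch: the `pack_problem_dir` helper and the archive path in the usage line are made-up names for illustration, not part of ABRT, and it assumes GNU tar is installed on the node.

```shell
#!/bin/sh
# Hypothetical helper (not part of ABRT): pack one ABRT problem directory
# into a gzip-compressed tarball so it can be sent as a single file.
pack_problem_dir() {
    dir=$1    # problem directory, e.g. /var/spool/abrt/ccpp-2017-04-01-00:00:01-2038
    out=$2    # archive to create; avoid ":" in this name, GNU tar would
              # treat "host:file" as a remote archive
    [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }
    # -C changes to the parent directory first, so the archive contains
    # only the directory name and not the full /var/spool/abrt/... path.
    tar -czf "$out" -C "$(dirname "$dir")" "$(basename "$dir")"
}
```

Usage, run as root because the files are readable only by root and the abrt group: `pack_problem_dir /var/spool/abrt/ccpp-2017-04-01-00:00:01-2038 /root/pfcached-crash.tar.gz`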
Yes, this information is exactly what we need. Could you upload it to our FTP? ftp://fe.virtuozzo.com/1fa8351f961c2c1c9780a5e77ebf0726/
Hello deagan,

We looked into the dump yesterday; unfortunately, it did not reveal the root cause. The problem itself is minor and should not affect your host's stability. Let me know if it does cause any observable issues. We've received similar reports today and got 4 more dumps. Hopefully they will shed more light on the problem.

Just for reference, the internal bug ID for this is PSBM-63453.
Hello deagan,

While we're trying to figure out the bug from the crash dump, I'd like to ask if you have any idea what the trigger could be. pfcache started to crash at some point recently, yet it hasn't been changed on our side for some time and it was quite stable, so there should be some kind of trigger. Was any update being installed at 12:00? Were any regular tasks running at 12:00, when the crashes took place?

Looking forward to your reply!
Also, I wanted to ask whether you use ReadyKernel. The output of the following command would be helpful:

Code:
# readykernel info
Hi Pavel,

------------------
# readykernel info
Patch name: readykernel-patch-20.18-17.0-1.vl7
Patch module: kpatch_cumulative_17_0_r1
File: /var/lib/kpatch/3.10.0-327.36.1.vz7.20.18/kpatch-cumulative-17-0-r1.ko
Version: 17.0

The following issues are fixed by the ReadyKernel patch:

v17.0 and newer:
* CVE-2017-7308 Integer overflows in packet_set_ring().

v16.0 and newer:
* CVE-2017-7184 Local privilege escalation in XFRM framework.

v15.0 and newer:
* CVE-2017-2647 Null pointer dereference in search_keyring().

v14.0 and newer:
* CVE-2017-6074 dccp: double free in dccp_rcv_state_process.
* CVE-2016-2053 Kernel panic and system lockup by triggering BUG_ON() in public_key_verify_signature().

v13.0 and newer:
* PSBM-57512 A privileged user inside a container can cause a host kernel crash in udp_lib_get_port().
* CVE-2017-6214 ipv4/tcp: Infinite loop in tcp_splice_read().

v12.0 and newer:
* PSBM-59964 Broken isolation for some of "ip ntable" settings.

v11.0 and newer:
* CVE-2016-3070 Null pointer dereference in trace_writeback_dirty_page().
* CVE-2016-8645 A BUG() statement can be hit in net/ipv4/tcp_input.c.
* CVE-2016-9806 Potential double free in netlink_dump().
* PSBM-59983 iptables: forwarding does not work with '--netfilter full'.
* PSBM-57499 NULL pointer dereference in write() -> netlink_sendmsg() -> netlink_unicast().
* PSBM-57511 General protection fault in sendmsg() -> netlink_sendmsg() -> netlink_unicast().

v10.0 and newer:
* CVE-2017-2583 kvm: vmx/svm potential privilege escalation inside guest.
* CVE-2017-2584 kvm: use after free in complete_emulated_mmio.

v8.0 and newer:
* CVE-2015-8539 Keys: general protection fault in trusted_update().
* PSBM-57915 fs/fadvise: a way was needed to deactivate pages after cached reads.
-------------------------
# yum update
No packages marked for update

The crontab on the host is empty. In the VPS:

MAILTO=""
0 0 * * * /usr/local/psa/admin/bin/php -dauto_prepend_file=sdk.php '/usr/local/psa/admin/plib/modules/letsencrypt/scripts/renew.php'

I had this on 2 identical servers (hardware nodes), at the exact same time (different dates, but at 12:00), with the same reports. One of them does almost nothing (it runs 1 Plesk VPS with 1 domain); the other doesn't have much to do either (1 Plesk VPS with some domains). As far as I know, I didn't run anything special at that time.

Could you move the files off the public FTP? Thanks.

If there is any more info I can provide, let me know.
Thanks for the details!

We've managed to reproduce the problem with a custom-built pfcache. There seems to be a race condition; our test binary just makes it easier to catch. The crash takes place when /etc/cron.d/pfcache_cron is executed; this task collects statistics. You may disable the task until the fix is available.

The root cause is not quite clear so far, but we suspect it's in the kernel. Since we have a way to reproduce the issue, I believe we'll find the reason soon enough.
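One way to disable the task, as a rough sketch: it assumes /etc/cron.d/pfcache_cron is an ordinary cron.d file whose entries can simply be commented out, and that GNU sed is available. The `disable_cron_file` helper and the backup path in the usage line are hypothetical names chosen for illustration. The backup is kept outside /etc/cron.d so that cron cannot pick it up as a second job file.

```shell
#!/bin/sh
# Hypothetical helper: comment out every active line of a cron.d file,
# keeping an untouched copy so the change can be reverted later.
disable_cron_file() {
    cron_file=$1    # e.g. /etc/cron.d/pfcache_cron
    backup=$2       # copy of the original, stored outside /etc/cron.d
    [ -f "$cron_file" ] || { echo "no such file: $cron_file" >&2; return 1; }
    cp -p "$cron_file" "$backup" || return 1
    # Leave blank lines and existing comments alone; prefix the rest with "#".
    sed -i -e '/^[[:space:]]*$/b' \
           -e '/^[[:space:]]*#/b' \
           -e 's/^/#/' "$cron_file"
}
```

Usage: `disable_cron_file /etc/cron.d/pfcache_cron /root/pfcache_cron.orig`. Once a fixed package is installed, copying the backup over the cron file again restores the original schedule.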