[This is a follow-up for https://status.documentfoundation.org/incidents/265 .]
Starting from the early hours of April 25, some services, notably the
wiki, blog, pad and extension sites, experienced slower response times
than usual. Unfortunately the situation got worse, and by 10AM most
requests timed out, making the aforementioned sites mostly
unreachable. In addition, outgoing emails, as well as emails to our
public mailing lists, were not delivered or accepted in a timely
manner, which caused delays.
We identified that some volumes in our distributed file system had lost
consistency. That happens from time to time and discrepancies are
normally transparently solved by the self-heal services. Occasionally
something gets stuck and manual intervention is required, which was
apparently the case here: we therefore triggered a manual heal and asked
for patience while it was underway.
A manual heal is typically an I/O intensive operation so we didn't think
much about high loads or processes racing for I/O on the backend. But
it typically completes in under 30 minutes, while this time it seemed to
be much slower… We paused some non-mission-critical VMs to free I/O and give
the healing process some slack, but that didn't seem to improve things
significantly. Then it dawned on us that the crux of the problem was
perhaps elsewhere after all, even though no hardware alert had gone off.
Inspecting per-device I/O statistics, we noticed that a specific disk in
a RAID array had a lot more queued reads than its peers. S.M.A.R.T.
attributes were hinting at a healthy disk, but it obviously wasn't: once
it was marked as faulty, the load almost immediately stabilized at
acceptable levels. (Theoretically the kernel could grab data from one of
the redundant peers instead of insisting on using the slow disk, but it
apparently didn't.) It was shortly before 2PM, and from that point it
didn't take long for the heal to finally complete; it would probably
have finished much sooner had we kicked the faulty disk earlier.
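For illustration only, here is a minimal Python sketch of the kind of
per-device comparison described above. It is not the tooling we actually
used. It assumes the Linux /proc/diskstats layout, uses the "I/Os
currently in progress" counter as a rough proxy for queued requests
(that counter does not separate reads from writes), and the device
names, sample figures, `factor` and `floor` values are all hypothetical.

```python
from statistics import median


def parse_in_flight(diskstats_text, members):
    """Return {device: I/Os currently in progress} for the given member disks."""
    in_flight = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Layout: major, minor, name, then the per-device counters;
        # the 9th counter (index 11) is "I/Os currently in progress".
        if len(fields) >= 12 and fields[2] in members:
            in_flight[fields[2]] = int(fields[11])
    return in_flight


def outliers(in_flight, factor=3.0, floor=4):
    """Flag disks whose queue exceeds `floor` and `factor` times the peers' median.

    Both thresholds are illustrative assumptions, not tuned values.
    """
    flagged = []
    for dev, depth in in_flight.items():
        peers = [v for d, v in in_flight.items() if d != dev]
        if peers and depth > floor and depth > factor * median(peers):
            flagged.append(dev)
    return sorted(flagged)


# Made-up sample in /proc/diskstats format (hypothetical array members).
SAMPLE = """\
   8       0 sda 1200 30 96000 4100 800 20 64000 3900 2 5000 8000
   8      16 sdb 1180 28 95000 4000 790 22 63000 3800 3 4900 7900
   8      32 sdc 1210 31 97000 91000 805 19 64500 4000 41 60000 250000
   8      48 sdd 1195 29 95500 4050 798 21 63800 3850 2 4950 7950
"""

MEMBERS = {"sda", "sdb", "sdc", "sdd"}
suspects = outliers(parse_in_flight(SAMPLE, MEMBERS))
print(suspects)
```

On a live Linux host the sample text could be replaced by the contents
of /proc/diskstats. Comparing each disk against the median of its peers,
rather than against a fixed number, keeps the check meaningful while the
whole array is busy, as it is during a heal.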
The slow disk is likely what triggered the issue (consistency loss) in
the first place: I/O-needy processes racing against each other isn't a
good thing when I/O is scarce… Unfortunately, while we had detailed I/O
metrics in the monitoring system, no alert threshold was defined, and
S.M.A.R.T. failed to properly identify the faulty device.
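As a rough sketch of what an alert threshold on such a metric could look
like (this is not our monitoring configuration): the Python snippet
below fires only when a value stays above a limit for several
consecutive samples, so that short bursts do not page anyone. The class
name, the 100 ms limit, the window of 3 samples and the latency figures
are all hypothetical.

```python
from collections import deque


class SustainedThreshold:
    """Fire when the last `window` samples are all above `threshold`."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.recent = deque(maxlen=window)

    def observe(self, value):
        """Record one sample and return True if the alert should fire."""
        self.recent.append(value)
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and all(v > self.threshold for v in self.recent)


# Hypothetical per-disk read latency samples, in milliseconds.
alert = SustainedThreshold(threshold=100.0, window=3)
samples = [20, 150, 180, 90, 140, 160, 210]
fired = [alert.observe(s) for s in samples]
print(fired)
```

Requiring a full window of high samples trades a little detection delay
for fewer false alarms during legitimately I/O-heavy operations such as
a manual heal.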
Once the issue was mitigated, the faulty drive was replaced later that
afternoon. Later during the week the array was rebuilt and the VMs were
moved back to better balance the load.
Apologies for the inconvenience.
- [libreoffice-projects] Postmortem April 25 incident report · Guilhem Moulin