Postmortem April 25 incident report

[This is a follow-up for https://status.documentfoundation.org/incidents/265 .]

Starting in the early hours of April 25, some services, notably the
wiki, blog, pad and extension sites, experienced slower response times
than usual. Unfortunately the situation got worse, and by 10AM most
requests timed out, making the aforementioned sites mostly unreachable.
In addition, outgoing emails, as well as emails to our public mailing
lists, were not delivered or accepted in a timely manner, causing
delays.

We identified that some volumes in our distributed file system had lost
consistency. That happens from time to time and discrepancies are
normally transparently solved by the self-heal services. Occasionally
something gets stuck and manual intervention is required, which was
apparently the case here: we therefore triggered a manual heal and asked
for patience while it was underway.
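
As a side note, here is a minimal sketch of how such a check and manual
heal can be scripted, assuming the distributed file system is GlusterFS
(whose volume/self-heal terminology matches the above); the volume name
and the wrapper itself are purely illustrative, not our actual tooling:

    #!/usr/bin/env python3
    # Hypothetical helper: count entries still pending self-heal on a
    # GlusterFS volume and trigger a manual heal if there are any.
    # The volume name is a made-up example.
    import subprocess

    VOLUME = "vm-storage"  # hypothetical volume name

    def pending_heal_entries(volume):
        """Sum the "Number of entries:" lines printed per brick by
        `gluster volume heal <volume> info`."""
        out = subprocess.run(
            ["gluster", "volume", "heal", volume, "info"],
            capture_output=True, text=True, check=True,
        ).stdout
        return sum(
            int(line.split(":")[1])
            for line in out.splitlines()
            if line.strip().startswith("Number of entries:")
        )

    if __name__ == "__main__":
        n = pending_heal_entries(VOLUME)
        print(f"{n} entries pending heal on volume {VOLUME}")
        if n > 0:
            # Heal the entries that need it (an index heal);
            # `... heal <volume> full` would rescan everything instead.
            subprocess.run(["gluster", "volume", "heal", VOLUME], check=True)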

A manual heal is typically an I/O-intensive operation, so we didn't
think much about high loads or processes racing for I/O on the backend.
But it typically completes in under 30 minutes, while this time it
seemed to be much slower… We paused some non-mission-critical VMs to
free up I/O and give the healing process some slack, but that didn't
seem to improve things significantly. Then it dawned on us that the
crux of the problem was perhaps elsewhere after all, even though no
hardware alert had gone off. Inspecting per-device I/O statistics, we
noticed that a specific disk in a RAID array had a lot more queued
reads than its peers. S.M.A.R.T. attributes hinted at a healthy disk,
but it obviously wasn't: once marked as faulty, the load almost
immediately stabilized to acceptable levels. (Theoretically the kernel
could grab data from one of the redundant peers instead of insisting on
using the slow disk, but it apparently didn't.) It was shortly before
2PM, and from that point it didn't take long for the heal to finally
complete; it would probably have finished much sooner had we kicked out
the faulty disk earlier. That's likely what triggered the issue
(consistency loss) in the first place: I/O-hungry processes racing
against each other is not a good thing when I/O is scarce…
Unfortunately, while we had detailed I/O metrics in the monitoring
system, no alert threshold was defined, and S.M.A.R.T. failed to
properly identify the faulty device.
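
As a sketch of the alert that was missing: something along the
following lines, reading the per-device queue depth straight from
/proc/diskstats, would have flagged the outlier disk much earlier
(device names and the threshold are illustrative assumptions, not our
actual layout):

    #!/usr/bin/env python3
    # Hypothetical check: compare "I/Os currently in progress" (the 9th
    # field after the device name in /proc/diskstats) across the members
    # of a RAID array and warn when one is far busier than its peers.
    MEMBERS = ["sda", "sdb", "sdc", "sdd"]  # hypothetical member disks
    THRESHOLD = 5                           # suspicious multiple of the median

    def inflight_ios():
        stats = {}
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                name, in_flight = fields[2], int(fields[11])
                if name in MEMBERS:
                    stats[name] = in_flight
        return stats

    if __name__ == "__main__":
        stats = inflight_ios()
        median = sorted(stats.values())[len(stats) // 2]
        for dev, depth in stats.items():
            if depth > THRESHOLD * max(median, 1):
                print(f"WARNING: /dev/{dev} has {depth} queued I/Os "
                      f"(median across the array: {median})")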

Once the issue was mitigated, the faulty drive was replaced later that
afternoon. Later during the week the array was rebuilt and the VMs were
moved back to better balance the load.
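
For completeness, the replacement followed the usual software RAID
procedure; a rough sketch, assuming the array is managed with mdadm
(the array and device names below are made up for illustration):

    #!/usr/bin/env python3
    # Hypothetical outline of the drive swap with mdadm; not a script we
    # actually ran, and the device names are placeholders.
    import subprocess

    ARRAY, BAD, NEW = "/dev/md0", "/dev/sdc1", "/dev/sde1"

    def run(*args):
        print("+", " ".join(args))
        subprocess.run(args, check=True)

    if __name__ == "__main__":
        # Mark the slow-but-"healthy" member as failed so the kernel stops
        # sending it I/O, then detach it before physically swapping the drive.
        run("mdadm", ARRAY, "--fail", BAD)
        run("mdadm", ARRAY, "--remove", BAD)
        # Once the new drive is in place and partitioned, add it back and
        # let the array rebuild; progress shows up in /proc/mdstat.
        run("mdadm", ARRAY, "--add", NEW)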

Apologies for the inconvenience.

Hello Guilhem,

thanks a lot indeed to you and Christian for handling the situation, and thanks a lot for the post-mortem incident report. We communicated a brief summary during the incident via IRC and on the status page; thanks for following up with more detail here.

I know you are already working on adding those metrics to the monitoring, so we can identify such a situation next time.

Florian