report about wiki outage, June 26 ~05:30 → 17:00 UTC

As you probably noticed, https://wiki.documentfoundation.org had major
hiccups on Tuesday between 05:30 UTC and 12:00 UTC, which we were not
able to fix live; we thus took it offline for unplanned maintenance, and
finally brought it back at 17:00 UTC. We apologize for the
inconvenience, and we thank all of you for your patience during this
time.

Those lurking on the #tdf-infra IRC channel could see what was going on;
if you're interested in the technical details behind the summary above,
see below.

https://wiki.documentfoundation.org (and also https://help.libreoffice.org)
were running MediaWiki 1.29.2 in a Debian 8 VM. MediaWiki's 1.29
branch will be End-of-Life by July 1st, so we had to upgrade before that,
as mentioned in the last infra call minutes [0].

At the end of last week the wiki started showing strange symptoms, with
the PHP FPM workers randomly idling and refusing to take requests, or
SIGSEGV'ing. Since we were preparing an upgrade of the whole stack (MW
1.29 → 1.31, PHP 5.6 → 7.0, MySQL 5.5 → MariaDB 10.1, nginx 1.6 → 1.10,
Debian 8 → 9) we didn't spend a lot of time trying to investigate the
issue with the production stack: restarting php5-fpm daily or so was
enough to make it work for about a day.
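
For readers who want to watch for this kind of symptom, here is a minimal
sketch (illustrative only, not taken from our actual setup) that
summarizes the JSON output of the PHP-FPM status page. It assumes the
status page has been enabled with `pm.status_path` and queried with
`?json`; the function name and the two heuristics are hypothetical.

```python
import json


def fpm_pool_summary(status_json: str) -> dict:
    """Summarize a PHP-FPM status page response (the `?json` output)."""
    status = json.loads(status_json)
    idle = int(status["idle processes"])
    active = int(status["active processes"])
    queued = int(status["listen queue"])  # requests waiting for a worker
    return {
        "pool": status.get("pool", "?"),
        "idle": idle,
        "active": active,
        "queued": queued,
        # Heuristic (assumption): every worker is busy and requests wait.
        "saturated": idle == 0 and active > 0 and queued > 0,
        # Heuristic (assumption): workers sit idle although requests are
        # waiting, similar to the "idling and refusing requests" symptom.
        "idle_but_queued": idle > 0 and queued > 0,
    }
```

A single sample can be misleading (a request may be queued for just an
instant), so a check like this would only be meaningful when the same
flag shows up over several consecutive polls.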

On Mon Jun 11 we changed the wiki's authentication method from
username/password to Single Sign-On [0], but we don't think it's related
to that issue in any way. (For one, the problem started over a week
later.) While we now have an idea what the problem was, we don't
understand why it suddenly started showing up last week: the last MW
upgrade was performed on Nov 14 last year, and no Debian package had
been upgraded since Jun 11 (in particular, the last PHP FPM upgrade was
performed on Jan 9).

We finished upgrading and testing our test instance this weekend. It
runs a database snapshot of https://wiki.documentfoundation.org dating
from about 1 year ago, but other than that the OS and MediaWiki
configuration are identical to those of the production instance. We had
to do some minor tuning but things mostly looked fine, and I scheduled
the upgrade of the production instance for the Mon → Tue European night.

Initially the plan was to upgrade both the OS and MediaWiki (first
to 1.30 then to 1.31) the same night, but it was past 04:30 UTC when
https://wiki.documentfoundation.org and https://help.libreoffice.org
were upgraded to 1.30, and as the OS upgrade causes high I/O load and
a short downtime, I deferred it to the next European night to affect as
few users as possible.

The MediaWiki instances were still running fine shortly after 05:00 UTC.
We get a lot of visits from Europe however, and with the European office
hours starting, the load quickly started to rise, as well as CPU and
memory usage, and finally brought https://wiki.documentfoundation.org
to its knees. Curiously https://help.libreoffice.org was not
affected at all, although it was running on the very same VM, in the
same PHP FPM pool, has the exact same MediaWiki code (and mostly similar
configuration), and gets 10x more visits than the TDF wiki…

Throwing more CPU and RAM at the VM didn't help solve the issue.
After spending a couple of hours(!) diagnosing it, doing a lot of
speculation, tests and debugging, we decided to bring the instance down
to relieve the VM of some load, and perform the due OS upgrade.

The OS upgrade (Debian 8 → 9), as well as the MediaWiki upgrade (1.30 →
1.31), were finished at 14:00 UTC, but unfortunately that didn't help.
Worse, every single request was now causing the PHP FPM worker to max
out CPU and ramp up memory-wise. And meanwhile the help wiki was still
running without a glitch on the brand new PHP 7 / MariaDB stack, which
we couldn't explain. Of course, we did try to reduce the delta between
the two MediaWiki instances; but removing all extensions and
configuration options that were not in common didn't help, either.

Studying the trace of a PHP FPM child that had gone wild, we saw — a bit
by chance — that it was choking on getting the parent category of links;
possibly something related to a reported MediaWiki bug [1]. So we tried
to disable all options related to Categories, and… bingo!

Of course, during that whole time it was always an option to
restore the database from backup and re-deploy MediaWiki 1.29 (in
read-only mode to avoid divergence). Had the problem persisted
until the evening we would have done that before the night, but during
the day we were all busy debugging and investigating, and deploying
another instance would have meant allocating fewer resources to
fixing the problem.
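
As a rough illustration of how such a fallback instance could be
sanity-checked, the sketch below parses the response of MediaWiki's
standard siteinfo API query
(api.php?action=query&meta=siteinfo&siprop=general&format=json) and
reports the running version and whether the wiki is read-only. The
function names and the "1.29" default are hypothetical, and this is not
something we actually ran.

```python
import json


def describe_wiki(siteinfo_json: str) -> dict:
    """Extract version and read-only state from a siteinfo API response."""
    general = json.loads(siteinfo_json)["query"]["general"]
    # "generator" looks like "MediaWiki 1.29.2".
    version = general["generator"].replace("MediaWiki", "").strip()
    # With the default format, "readonly" is only present (as "") when the
    # wiki is read-only; with formatversion=2 it is expected as a boolean.
    readonly = general.get("readonly", False) is not False
    return {
        "version": version,
        "readonly": readonly,
        "reason": general.get("readonlyreason", ""),
    }


def matches_fallback(info: dict, expected_branch: str = "1.29") -> bool:
    """True if the wiki is read-only and runs the expected release branch."""
    on_branch = (info["version"] == expected_branch
                 or info["version"].startswith(expected_branch + "."))
    return info["readonly"] and on_branch
```

Fetching the JSON is left out on purpose; any HTTP client would do, and
keeping the parsing separate makes it easy to try on a saved response.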

As to why this was affecting the production instance but not the test
instance, the culprit might be the number of requests: towards the
end of the afternoon the situation looked very bleak (a single request
to the PHP FPM pool was maxing out its worker process), while in the
early morning clients that were lucky enough to get an idling FPM child
were able to receive a response. On the other hand, it's still
mysterious why https://help.libreoffice.org was not affected. And so is
the fact that our production instance suddenly started having hiccups
last week, a bit out of the blue.

All in all, it was a tough day for everyone… We now hope you'll enjoy
the new wiki and the stack underneath it! On the positive side, it's now
using fewer resources than before and is more responsive, thus hopefully
providing a better experience for front-end users as well as for the
infra team :-)