Hello,
the infra team has met to review the past issues, and would like to make
its analysis public to the community:
I. introduction
We have three big servers running, called dauntless, excelsior and
falco. The first two were put online in October, the last one in
December. The planned platform was oVirt with Gluster, with CentOS as
the base system. All servers are comparable in their hardware setup,
with 256 GB of RAM, one internal and one external Gbit networking card,
server-grade mainboard with IPMI, several HDDs, and 64 core CPU. All
three hosts were connected internally via a Gbit link, with oVirt
managing all of them, and Gluster being the underlying network file system.
II. before going productive
Extensive tests were carried out before going live with the platform.
The exact CentOS, oVirt and Gluster versions used for production were
tested twice, on separate hardware and on the actual hardware, with the
IPMI and BIOS versions used later; that included two disaster
simulations, where one host was disconnected unannounced and oVirt
behaved exactly as expected - running the VMs on the other host without
interruption.
When the platform was ready, to avoid endangering anything, we first
migrated only non-critical VMs, i.e. mainly testing VMs where downtime
is not critical, but which still produce quite some load. The platform
ran exclusively with these for several weeks, with no problems
detected.
After that, several other VMs were migrated, including Gerrit, and the
system worked fine for weeks with no I/O issues and no downtime.
III. downtime before FOSDEM
The first issues occurred from Wednesday, January 28, around 1440 UTC,
until Thursday, January 29, early morning. Falco was disconnected for up
to a couple of minutes. The reason for this is still unclear.
-> The infra team is looking for an oVirt expert who can help us
parse the logs to better understand what has happened.
At the same time, for an unrelated reason, a file system corruption on
excelsior was discovered, a failure that had already existed since
January 5.
-> The monitoring scripts, based on snmpd, claimed everything was
OK. The scripts have since been enhanced to properly report the status.
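To illustrate the kind of check that was missing, here is a minimal
sketch in Python, assuming the corruption shows up as a file system
being remounted read-only; this is not our actual snmpd extension, just
an example of the idea:

    #!/usr/bin/env python
    # Illustrative sketch only, not our actual monitoring script: report
    # local file systems that have been remounted read-only, a typical
    # symptom of the kind of corruption that went unnoticed on excelsior.
    import sys

    # Pseudo and read-only-by-design file systems we skip (assumption).
    IGNORE_FSTYPES = {"proc", "sysfs", "devpts", "tmpfs", "cgroup",
                      "iso9660", "squashfs"}

    def read_only_mounts():
        bad = []
        with open("/proc/mounts") as mounts:
            for line in mounts:
                device, mountpoint, fstype, options = line.split()[:4]
                if fstype in IGNORE_FSTYPES:
                    continue
                if "ro" in options.split(","):
                    bad.append((device, mountpoint, fstype))
        return bad

    if __name__ == "__main__":
        problems = read_only_mounts()
        if problems:
            for device, mountpoint, fstype in problems:
                print("CRITICAL: %s (%s) is mounted read-only at %s"
                      % (device, fstype, mountpoint))
            sys.exit(2)  # Nagios-style "critical" exit code
        print("OK: all monitored file systems are writable")
        sys.exit(0)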
Each of these errors on its own, if detected in time, would have caused
no downtime at all.
With that, 2 out of 3 Gluster bricks were down, and the platform was
stopped. (Gluster is comparable to RAID 5 here.) Gluster detected a
possible split-brain situation, so a manual recovery was required.
Starting the fix was not complicated, in comparison to other network
file systems, and could easily be handled, but the recovery took a long
time due to the amount of data already on the platform and the internal
Gbit connectivity. Depending on the VM, the downtime was between 3 and
18 hours. oVirt's database also had issues, which could however be
fixed. In other words, the reason for most of the downtime was not
finding a fix, but waiting for it to be completed.
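For those curious, inspecting such a state boils down to a few Gluster
commands; here is a small Python wrapper as a sketch (the volume name
below is a placeholder, not our real one):

    #!/usr/bin/env python
    # Illustrative sketch: summarise the self-heal / split-brain state of
    # one Gluster volume. The volume name is a made-up placeholder.
    import subprocess

    VOLUME = "vmstore"  # placeholder volume name

    def gluster(*args):
        """Run a gluster CLI command and return its output as text."""
        return subprocess.check_output(("gluster",) + args).decode()

    if __name__ == "__main__":
        # Entries still waiting to be healed, listed per brick.
        print(gluster("volume", "heal", VOLUME, "info"))
        # Entries Gluster itself flags as (possible) split-brain.
        print(gluster("volume", "heal", VOLUME, "info", "split-brain"))
        # Overall brick and self-heal daemon status.
        print(gluster("volume", "status", VOLUME))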
-> Situations like these can be less time-consuming with an
(expensive) internal 10 Gbit connection, or with a (slower, but more
redundant and cheaper) internal trunked/bonded x * 1 Gbit connection,
which we will be looking into.
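To give a feeling for the numbers, here is a back-of-the-envelope
calculation in Python; the 4 TB figure is an arbitrary example, not our
actual data volume, and real heals are slower still due to protocol and
disk overhead:

    # Rough raw transfer times for copying data during a heal over
    # different internal links. Example figure only.
    DATA_TB = 4.0
    LINKS_GBIT = {"1 Gbit": 1.0,
                  "4 x 1 Gbit bonded (ideal)": 4.0,
                  "10 Gbit": 10.0}

    for name, gbit in sorted(LINKS_GBIT.items()):
        seconds = (DATA_TB * 8 * 1000) / gbit  # TB -> Gbit / link speed
        print("%-26s ~%.1f hours" % (name, seconds / 3600.0))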
-> Currently in progress is an SMS monitoring system, for which we
are looking for volunteers to be included in the notifications. SMS
notifications will be sent out in case of severe issues and can be
combined with Android tools like Alarmbox.
-> In the meantime, we have also fine-tuned the alerts and
thresholds to distinguish messages.
All infra members, including several volunteers, were working jointly
on getting the services back up. However, we experienced some issues
with individual VMs where it was unclear who was responsible for them,
and where documentation was partially missing or outdated. In the end,
it all worked out.
-> Infra will enforce a policy for new and existing services. At
least 2 responsible maintainers are required per service, including
proper documentation. That will be announced with a fair deadline.
Services not fulfilling those requirements will be moved from production
to test. A concrete policy is still to be drafted with the public.
On a side note, we discovered that oVirt does not support running
Gluster on the internal interface while the hosted engine and management
run on the external interface, although this setup had worked fine for
months and survived two disaster simulations. This is not mentioned in
the oVirt documentation; Alex discovered it during FOSDEM when he
attended an oVirt talk where it was brought up as a side note.
-> One option is to look into SAN solutions, which are not only
faster, but probably also more reliable. We might have an offer of
support here that needs looking into.
During FOSDEM, Alex also got in touch with one oVirt and one Gluster
developer. We also talked to a friendly supporter who has gained
experience with Gluster and virtualization by providing infrastructure
solutions in a larger company for a living, and who offered to help us.
While his initial comment was that Gluster is not to be recommended, it
turned out in a later phone conference that his experience is several
years old and relates to older kernels that lack a patch allowing KVM to
run on Gluster. oVirt has that patch included, and newer distributions
do as well. After we outlined the situation, he supported our approach
and finished with "It looks like you're doing the right thing, and here
simply Murphy kicked in. You should proceed, it looks like a good
solution."
All services were running fine again, with the reasons for the outage
identified and solutions proposed, when we headed to Brussels for
FOSDEM.
IV. downtime after FOSDEM
The second downtime occurred on Tuesday, February 17, around 1800 UTC,
when we discovered the virtual machines were getting slow. Gluster was
in a consistent state and there was no need to heal; we discovered the
slowness was due to oVirt migrating virtual machines from one host to
another. The reason for that is unknown - although connectivity was in
place and services were running, oVirt assumed dauntless and excelsior
were down, and so migrated all VMs to falco.
Excelsior rebooted itself the same day around 1946 UTC, which led to a
disconnect of several minutes. The reason is unclear, as the log files
give no indication.
-> The infra team is looking for an oVirt expert who can help us
parse the logs to better understand what has happened.
Gluster then began to heal itself. Despite healing, Gluster was
available all of the time and required no manual intervention, and no
split-brain situation occurred; it was just slower due to the limited
internal network bandwidth.
-> Situations like these can be less time-consuming with an
(expensive) internal 10 Gbit connection, or with a (slower, but more
redundant and cheaper) internal trunked/bonded x * 1 Gbit connection,
which we will be looking into.
The hosted engine, which is responsible for managing the virtual
machines, did not come up again and required manual intervention to get
it running.
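For reference, a minimal sketch of what such a manual check and start
can look like, wrapped in Python; the status parsing here is
deliberately crude and only an illustration:

    #!/usr/bin/env python
    # Illustrative sketch: check whether the oVirt hosted engine VM is up
    # on this host, and try to start it if it does not look like it is.
    import subprocess

    def vm_status():
        """Return the raw output of 'hosted-engine --vm-status'."""
        return subprocess.check_output(["hosted-engine", "--vm-status"]).decode()

    if __name__ == "__main__":
        status = vm_status()
        print(status)
        # Very rough check; a real script would parse the status properly.
        if "up" not in status.lower():
            subprocess.check_call(["hosted-engine", "--vm-start"])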
The infra team - again with everyone working jointly - shut down all
non-critical VMs to speed up the process and free cycles, and the
systems were stable again around 2100 UTC. Once more, kicking off the
fix was not the most time-consuming part; waiting for the fix to be
carried out was, as we had to wait for the Gluster heal to finish.
The next morning, around 0700 UTC, we discovered services were
partially down again. It turned out that dauntless had frozen at 0155
UTC for an unknown reason and eventually rebooted itself. After the
reboot, oVirt, which manages the firewall, locked it down for no
apparent reason, which led to Gluster being disconnected from the other
bricks.
After discovering that, we first revoked oVirt's access to the IPMI
management interface to keep it from rebooting the hosts, and manually
opened the firewall.
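For illustration, manually re-opening the firewall boils down to
commands along these lines; the internal peer addresses below are
placeholders, not our real ones:

    #!/usr/bin/env python
    # Illustrative sketch: re-allow traffic from the internal peers after
    # oVirt locked down the firewall. Addresses are placeholders.
    import subprocess

    INTERNAL_PEERS = ["192.168.100.2", "192.168.100.3"]  # hypothetical IPs

    for peer in INTERNAL_PEERS:
        # Insert an ACCEPT rule at the top of the INPUT chain so glusterd
        # and the brick processes can reach each other again.
        subprocess.check_call(["iptables", "-I", "INPUT",
                               "-s", peer, "-j", "ACCEPT"])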
-> Currently in progress is an SMS monitoring system, for which we
are looking for volunteers to be included in the notifications. SMS
notifications will be sent out in case of severe issues and can be
combined with Android tools like Alarmbox.
We also immediately ordered a temporary server and migrated the website
VM to it, so as not to endanger the planned release; the migration of
the website went smoothly. In the meantime, after the firewall had been
opened manually, Gluster began healing, and there was no split-brain
situation. Again, I/O was limited due to the internal connectivity, and
kicking off the fix was much faster than waiting for it to be carried
out, but the file system was available the whole time.
-> Situations like these can be less time-consuming with an
(expensive) internal 10 Gbit connection, or with a (slower, but more
redundant and cheaper) internal trunked/bonded x * 1 Gbit connection,
which we will be looking into.
We then migrated all productive VMs (the others were stopped in the
meantime) to excelsior, in order to reboot dauntless and falco.
To get to a stable state, we decided to get rid of oVirt and
reinstalled falco with plain Linux, local storage and KVM: no oVirt, no
Gluster, KISS principle. That way we could migrate several critical VMs
in time.
The migration of the last productive VM on the oVirt platform, Gerrit,
needed more time due to the sheer amount of data.
-> Situations like these can be less time-consuming with an
(expensive) internal 10 Gbit connection, or with a (slower, but more
redundant and cheaper) internal trunked/bonded x * 1 Gbit connection,
which we will be looking into.
V. gerrit downtime
We decided to take a snapshot of the VM while it was running, so as not
to interrupt development, then copy the snapshot over, and finally take
down Gerrit during an announced maintenance window, either on a weekend
or in the evening, to copy a diff over.
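As a rough sketch of one way such an "online copy plus final diff" can
be done, assuming the VM disk is a plain image file (the paths and the
target host below are made up):

    #!/usr/bin/env python
    # Rough sketch of the planned "copy while running, sync a diff later"
    # approach. Paths and target host are hypothetical placeholders.
    import subprocess

    IMAGE = "/var/lib/libvirt/images/gerrit.img"    # placeholder path
    TARGET = "root@falco:/var/lib/libvirt/images/"  # placeholder target

    def sync_image():
        # Once a copy exists on the target, rsync only transfers the
        # changed parts of the file, which keeps the second run short.
        subprocess.check_call(["rsync", "-a", "--inplace", "--progress",
                               IMAGE, TARGET])

    if __name__ == "__main__":
        # First run: bulk copy taken from a snapshot while Gerrit runs.
        sync_image()
        # Second run: during the maintenance window, with Gerrit shut
        # down, sync the remaining differences before starting the VM on
        # the new host.
        sync_image()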
Unfortunately, the platform didn't survive until then. On Tuesday,
February 24, around 0834 UTC, Gerrit went down, and the image was not
yet fully copied over to falco. Dauntless got stuck again for no
apparent reason. The platform was nearly unusable then, as oVirt did not
know the states of several VMs or showed the wrong status, and was
unable to perform any action, either via the web interface or via the
command line. Several VMs were shown as defunct on the shell, i.e. there
were zombie processes. Gluster ran without problems, but oVirt was
confused to the point that not even manually editing oVirt's database to
reflect the real VM status helped; oVirt immediately changed it back to
an unknown state.
We then decided, as Gerrit was down anyway, to migrate it directly to
falco with plain KVM instead of bringing it back up on oVirt. The move
to falco per se went smoothly; we just had to wait again due to the
internal connectivity, so once again, finding the fix was not what took
the time, but waiting for it to be carried out. The downtime in the end
was between 3 and 4 hours.
-> Situations like these can be less time-consuming with an
(expensive) internal 10 Gbit connection, or with a (slower, but more
redundant and cheaper) internal trunked/bonded x * 1 Gbit connection,
which we will be looking into.
VI. further conclusions and next steps
The status quo is that our productive VMs run on falco with plain KVM
and no network file system, and the website is on a different host,
likewise with plain KVM and no network file system.
The two other hosts have also been set up, one running testing and
non-productive VMs, the other exclusively running the crashtest VM. In
the near future, we need to schedule a downtime to redistribute the VMs
between both hosts, in terms of availability and load.
One assumption is that we hit a kernel bug related to KVM and some
specific CPUs. We were able to take a screenshot of a kernel panic on
dauntless, and research showed that such a bug exists and can occur
under certain circumstances. The CentOS version used was based on a 2.6
kernel (newer CentOS versions with newer kernels had a pending bug with
regard to oVirt). Supporting that assumption is the fact that our
website VM, which previously froze every few days (in the end we
power-cycled it before that could occur), has now been running fine for
18 days as the same VM, but on a different host environment.
Timing-wise, the problems started when the third host was added, which
carried the crashtesting VM. The crashtesting is known to regularly kill
the VMs it's running in, depending on the kernel, but so far no problems
with the host had occurred. It had also been running without problems
before on the two-server setup. However, the resources attached to it
were fewer than on the third host, so chances are this change is at
least part of the problem. As KVM uses paravirtualization for I/O and
networking, both of which became a bottleneck that led to downtime, and
as we have seen the KVM kernel panic, there is indeed a chance that this
is part of the root issue. If that is true, it should be solved now, as
crashtesting is now running on its own dedicated host.
-> As oVirt claimed internal connectivity issues, we will monitor
the internal connectivity when all hosts are set up again. A drop of the
internal connectivity right now would not affect us, as we don't run a
network file system, so this is the best chance to diagnose whether
there really are problems.
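A simple probe along the following lines would already be enough for
that; the peer addresses are again hypothetical placeholders:

    #!/usr/bin/env python
    # Illustrative sketch: periodically ping the internal interfaces of
    # the other hosts and log failures, to see whether the connectivity
    # drops oVirt claimed actually happen. Addresses are placeholders.
    import datetime
    import subprocess
    import time

    INTERNAL_PEERS = ["192.168.100.2", "192.168.100.3"]  # hypothetical IPs
    INTERVAL = 10  # seconds between probe rounds

    while True:
        for peer in INTERNAL_PEERS:
            # One ping with a 2-second timeout; non-zero exit = failure.
            result = subprocess.call(["ping", "-c", "1", "-W", "2", peer],
                                     stdout=subprocess.DEVNULL,
                                     stderr=subprocess.DEVNULL)
            if result != 0:
                print("%s unreachable at %s UTC"
                      % (peer, datetime.datetime.utcnow().isoformat()))
        time.sleep(INTERVAL)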
What we clearly need to do, and this e-mail is a first step towards
being more transparent, is to communicate more often and more regularly
with the community, to keep you up to date, and to attract new
volunteers. With most of the admins being located in the same area, it's
convenient to meet in person, but we appreciate that this is a big
barrier for those willing to contribute.
-> The infra team will communicate more openly, more transparently,
more regularly and more publicly about what it is doing, and will try to
attract new contributors.
-> Admin meetings in person do help (last time we introduced people
to oVirt in person, and their knowledge really helped us out here), but
we will have regular admin phone conferences where everything will be
communicated, and important decisions are only to be taken either in the
call or via the mailing list, not in personal meetings.
In terms of platform, we are sticking with the simplest solution for
now, i.e. local storage and KVM. oVirt is charming and has quite a few
features, but the more levels of complexity you add, the more risk it
bears. Therefore, any decision to move again to a more sophisticated
virtualization solution and/or a network file system should be
discussed, voted on and taken in public, with more people on board, and
not be rushed.
-> The next admin call will be planned soon, and one of the topics
will be how to proceed.
That being said, sorry again for the outage. In infra, anything that
does not work is immediately spotted by many people, and especially with
a worldwide, distributed virtual community such as ours, it has a heavy
impact on everyone.
Even more, I'd love to thank Alex, Cloph, Robert and Alin for their
tremendous work on getting us back online as soon as possible. I'm
really grateful for your work, your knowledge and your support!
Florian