

Hello,

the infra team has met to review the past issues, and would like to make its analysis public to the community:


I. introduction

We have three big servers running, called dauntless, excelsior and falco. The first two were put online in October, the last one in December. The planned platform was oVirt with Gluster, with CentOS as the base system. All servers are comparable in their hardware setup, with 256 GB of RAM, one internal and one external Gbit networking card, server-grade mainboard with IPMI, several HDDs, and 64 core CPU. All three hosts were connected internally via a Gbit link, with oVirt managing all of them, and Gluster being the underlying network file system.


II. before going productive

Extensive tests were carried out before going live with the platform. The exact CentOS, oVirt and Gluster versions used for production were tested twice, on separate hardware and on the actual hardware, with the IPMI and BIOS versions used later. That included two disaster simulations, where one host was disconnected unannounced and oVirt behaved exactly as expected, running the VMs on the other host without interruption.

When the platform was ready, we first migrated only the non-critical VMs, so as not to endanger anything. These were mainly testing VMs where downtime is not critical, but which still produce considerable load. We ran exclusively with these for several weeks, with no problems detected.

After that, several other VMs were migrated, including Gerrit, and the system worked fine for weeks with no I/O issues and no downtime.


III. downtime before FOSDEM

The first issues occurred from Wednesday, January 28, around 1440 UTC, until Thursday, January 29, early morning. Falco was disconnected for up to a couple of minutes. The reason for this is still unclear.

-> The infra team is looking for an oVirt expert who can help us parse the logs to better understand what has happened.

At the same time, for an unrelated reason, a file system corruption on excelsior was discovered, a failure that had already existed since January 5.

-> The monitoring scripts, based on snmpd, claimed everything was ok. Scripts have already been enhanced to properly report the status.
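
As an illustration of the kind of check that can catch a damaged file system, here is a minimal sketch. It rests on an assumption: that the affected file system gets remounted read-only, which is a common reaction to errors on Linux but not necessarily what happened on excelsior. The function name and the list of file system types are hypothetical and this is not the actual monitoring script.

```python
def read_only_mounts(mounts_text: str, fs_types=("ext3", "ext4", "xfs")) -> list[str]:
    """Return mount points of the given file system types that are mounted read-only.

    mounts_text is expected in the format of /proc/mounts:
    device, mount point, file system type, options, dump, pass.
    """
    flagged = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip malformed or empty lines
        _device, mount_point, fs_type, options = fields[:4]
        # Assumption: a data file system that shows up as "ro" is suspicious.
        if fs_type in fs_types and "ro" in options.split(","):
            flagged.append(mount_point)
    return flagged
```

On a Linux host the function could be fed with the contents of /proc/mounts, and a non-empty result reported as a critical state instead of "ok".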

Either of the errors on its own, if detected in time, would have caused no downtime at all.

With that, 2 out of 3 Gluster bricks were down, and the platform was stopped. (Gluster is comparable to RAID 5 here.) Gluster detected a possible split-brain situation, so a manual recovery was required. Starting the fix was not complicated, in comparison to other networking file systems, and could easily be handled, but the recovery took a long time due to the amount of data already on the platform and the internal Gbit connectivity. Depending on the VM, the downtime was between 3 and 18 hours. oVirt's database also had issues, which could however be fixed. In other words, the reason for most of the downtime was not finding a fix, but waiting for it to be completed.

-> Situations like these can be less time-consuming with an (expensive) internal 10 Gbit connection, or with a (slower, but more redundant and cheaper) internal trunked/bonded x * 1 Gbit connection, which we will be looking into.
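
To give a feeling for how much the link speed matters, here is a rough back-of-the-envelope sketch. The 2 TB data volume and the 70 % usable throughput are purely illustrative assumptions, not measurements from our platform, and a bonded link only reaches the sum of its members in the best case with many parallel streams.

```python
def transfer_hours(data_bytes: float, link_gbit: float, efficiency: float = 0.7) -> float:
    """Estimate the hours needed to move data_bytes over a link of link_gbit Gbit/s.

    efficiency is the assumed share of the nominal bandwidth that is usable.
    """
    usable_bytes_per_second = link_gbit * 1e9 / 8 * efficiency
    return data_bytes / usable_bytes_per_second / 3600


if __name__ == "__main__":
    data = 2e12  # hypothetical: 2 TB that has to be resynchronised
    links = [
        ("1 Gbit", 1),
        ("4 x 1 Gbit bonded (best case)", 4),
        ("10 Gbit", 10),
    ]
    for label, gbit in links:
        print(f"{label:>30}: {transfer_hours(data, gbit):.1f} h")
```

Under these assumptions the same resynchronisation takes about 6.3 hours at 1 Gbit and about 0.6 hours at 10 Gbit, which matches the experience that waiting for the copy, not finding the fix, dominates the downtime.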

-> An SMS monitoring system is currently work in progress, and we are looking for volunteers to be included in the notification. SMS notifications are to be sent out in case of severe issues and can be combined with Android tools like Alarmbox.

-> In the meantime, we have also fine-tuned the alerts and thresholds to better distinguish severe messages from less important ones.
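
How thresholds can separate messages by severity can be sketched as follows. The names, the three levels and the rule that only critical states trigger an SMS are assumptions for illustration, not the configuration actually in use.

```python
from enum import Enum


class Severity(Enum):
    OK = 0
    WARNING = 1
    CRITICAL = 2


def classify(value: float, warn_at: float, crit_at: float) -> Severity:
    """Map a measured value to a severity; higher values are worse."""
    if value >= crit_at:
        return Severity.CRITICAL
    if value >= warn_at:
        return Severity.WARNING
    return Severity.OK


def needs_sms(severity: Severity) -> bool:
    """Assumption: only critical states are sent by SMS, warnings go by e-mail."""
    return severity is Severity.CRITICAL
```

For example, with hypothetical disk usage thresholds of 80 % and 95 %, a reading of 85 % would be a warning without SMS, while 97 % would be critical and notify the volunteers.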

All infra members, including several volunteers, worked jointly on getting the services back up. However, we experienced some issues with individual VMs, where it was unclear who was responsible for them, and where documentation was partially missing or outdated. It worked out in the end, however.

-> Infra will enforce a policy for new and existing services. At least 2 responsible maintainers are required per service, including proper documentation. That will be announced with a fair deadline. Services not fulfilling those requirements will be moved from production to test. A concrete policy is still to be drafted with the public.

On a side note, we discovered that oVirt does not support running Gluster on the internal interface while the hosted engine and management run on the external interface, although this setup had worked fine for months and survived two disaster simulations. This fact is not covered in the oVirt documentation and was discovered by Alex during FOSDEM, when he attended an oVirt talk where it was mentioned as a side note.

-> An option is to look into SAN solutions, which are not only faster, but also probably more reliable. We might have some supporting offer here that needs looking into.

During FOSDEM, Alex also got in touch with one oVirt and one Gluster developer. We also talked to a friendly supporter who has gained experience with Gluster and virtualization, providing infrastructure solutions in a larger company for a living, and who offered to help us. While his initial comment was that Gluster is not to be recommended, it turned out in a later phone conference that his experience is several years old, and related to older kernels that lack a patch allowing KVM to run on Gluster. oVirt had that patch included, and newer distributions do so as well. After we outlined the situation, he supported our approach, and finished with "It looks like you're doing the right thing, and here simply Murphy kicked in. You should proceed, it looks like a good solution."

All services were running fine again, with reasons for outage identified and solutions brought forward, when we headed to Brussels for FOSDEM.


IV. downtime after FOSDEM

The second downtime occurred on Tuesday, February 17 around 1800 UTC, when we discovered the virtual machines were getting slow. Gluster was in a consistent state and there was no need to heal, and we discovered the slowness was due to oVirt migrating virtual machines from one host to another. The reason for that is unknown: although connectivity was in place and services were running, oVirt assumed dauntless and excelsior were down, and so migrated all VMs to falco.

Excelsior rebooted itself the same day around 1946 UTC, which led to a disconnect of several minutes. The cause is unclear, as the log files show no reason.

-> The infra team is looking for an oVirt expert who can help us parse the logs to better understand what has happened.

Gluster then began to heal itself. Despite healing, Gluster was available all of the time and required no manual intervention. No split-brain situation occurred; it was just slower due to the limited internal network bandwidth.

-> Situations like these can be less time-consuming with an (expensive) internal 10 Gbit connection, or with a (slower, but more redundant and cheaper) internal trunked/bonded x * 1 Gbit connection, which we will be looking into.

The hosted engine, which is responsible for managing the virtual machines, didn't come up again, and required manual intervention to get it running again.

The infra team, again with everyone working jointly, shut down all non-critical VMs to speed up the process and free cycles, and the systems were stable again around 2100 UTC. Again, kicking off the fix was not the most time-consuming part, but waiting for it to be carried out, as we had to wait for the Gluster heal to finish.

The next morning around 0700 UTC, we discovered services were partially down again. Dauntless had frozen at 0155 UTC for an unknown reason and eventually rebooted itself. After the reboot, oVirt, which manages the firewall, locked it down for no apparent reason. That led to Gluster being disconnected from the other bricks.

After discovering that, we first denied oVirt access to the IPMI management interface to prevent it from rebooting the hosts, and manually opened the firewall.

-> An SMS monitoring system is currently work in progress, and we are looking for volunteers to be included in the notification. SMS notifications are to be sent out in case of severe issues and can be combined with Android tools like Alarmbox.

We also immediately ordered a temporary server and migrated the website VM to it, so as not to endanger the planned release, and the migration of the website went smoothly. In the meantime, after the firewall had been opened manually, Gluster began healing, and there was no split-brain situation. Again, I/O was limited due to the internal connectivity, and kicking off the fix was much faster than waiting for it to be carried out, but the file system was always available.

-> Situations like these can be less time-consuming with an (expensive) internal 10 Gbit connection, or with a (slower, but more redundant and cheaper) internal trunked/bonded x * 1 Gbit connection, which we will be looking into.

We then migrated all productive VMs (the others were stopped in the meantime) to excelsior, in order to reboot dauntless and falco.

To get to a stable state, we decided to get rid of oVirt and reinstalled falco with plain Linux, local storage and KVM: no oVirt, no Gluster, following the KISS principle. That way we could migrate several critical VMs in time.

The migration of the last productive VM in the oVirt platform, Gerrit, needed more time due to the sheer amount of data.

-> Situations like these can be less time-consuming with an (expensive) internal 10 Gbit connection, or with a (slower, but more redundant and cheaper) internal trunked/bonded x * 1 Gbit connection, which we will be looking into.


V. gerrit downtime

We decided to take a snapshot of the VM while it was running, so as not to interrupt development, then copy the snapshot, then take Gerrit down in an announced maintenance window, either on a weekend or in the evening, and copy over a diff.
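
The idea behind "copy the snapshot first, then only a diff" can be sketched as follows: split both images into fixed-size chunks, hash each chunk, and transfer only the chunks whose hashes differ. This is a simplified illustration, not the tooling that was actually used; the chunk size and function names are hypothetical, and a shrinking image is not handled.

```python
import hashlib
import io
from typing import BinaryIO

CHUNK_SIZE = 4 * 1024 * 1024  # hypothetical chunk size: 4 MiB


def chunk_digests(image: BinaryIO, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Return one SHA-256 digest per fixed-size chunk of the image."""
    digests = []
    while True:
        chunk = image.read(chunk_size)
        if not chunk:
            break
        digests.append(hashlib.sha256(chunk).digest())
    return digests


def changed_chunks(old: list[bytes], new: list[bytes]) -> list[int]:
    """Return indices of chunks that differ or exist only in the new image."""
    return [
        index
        for index, digest in enumerate(new)
        if index >= len(old) or old[index] != digest
    ]


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the snapshot and the current disk image.
    snapshot = io.BytesIO(b"A" * 8 + b"B" * 8 + b"C" * 8)
    current = io.BytesIO(b"A" * 8 + b"X" * 8 + b"C" * 8 + b"D" * 4)
    old = chunk_digests(snapshot, chunk_size=8)
    new = chunk_digests(current, chunk_size=8)
    print(changed_chunks(old, new))  # prints [1, 3]
```

Only the changed chunks have to be transferred during the maintenance window, which is why the announced downtime can be much shorter than the initial copy of the full image.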

Unfortunately, the platform didn't survive until then. On Tuesday, February 24, around 0834 UTC, Gerrit went down, and the image was not yet fully copied over to falco. Dauntless got stuck again for no apparent reason. The platform was nearly unusable then, as oVirt did not know the states of several VMs or showed the wrong status, and was unable to perform any action, either via the web interface or via the command line. Several VMs were shown as defunct on the shell, i.e. there were zombie processes. Gluster ran without problems, but oVirt was confused to the point that not even manually editing oVirt's database to reflect the real VM status helped; oVirt changed this back to an unknown state immediately.

We then decided, as Gerrit was down anyway, to migrate it directly to falco with plain KVM instead of bringing it back on oVirt. The move to falco per se went smoothly; we just had to wait again due to the internal connectivity. So again, finding the fix was not what took the time, but waiting for it to be carried out. The downtime in the end was between 3 and 4 hours.

-> Situations like these can be less time-consuming with an (expensive) internal 10 Gbit connection, or with a (slower, but more redundant and cheaper) internal trunked/bonded x * 1 Gbit connection, which we will be looking into.


VI. further conclusions and next steps

The status quo is that our productive VMs run on falco with plain KVM and no network file system, and the website is on a different host, likewise with plain KVM and no network file system.

The two other hosts have also been set up, one running testing and non-productive VMs, the other one exclusively running the crashtest VM. In the near future, we need to schedule a downtime to re-arrange the VMs so that they are better distributed between both hosts, in terms of availability and load.

One assumption is that we hit a kernel bug wrt. KVM and some specific CPUs. We were able to take a screenshot of a kernel panic on dauntless, and research showed that there is such a bug that can occur under certain circumstances. The CentOS version used was based on a 2.6 kernel (newer CentOS versions with newer kernels had a pending bug wrt. oVirt). Supporting that assumption is the fact that our website VM, which previously froze every few days (in the end we power-cycled it before that occurred), has now been running fine for 18 days as the same VM, but in a different host environment.

Timing-wise, the problems started when the third host was added, which carried the crashtesting VM. The crashtesting is known to regularly kill the VMs it's running in, depending on the kernel, but so far, no problems with the host occurred. It also was running without problems before on the two-server setup. However, the resources attached to it there were not as many as with the third host, so chances are this change is at least part of the problem. As KVM uses paravirtualization for I/O and networking, and both of those became a bottleneck that led to downtime, and we have seen the KVM kernel panic, there is indeed a chance that this is part of the root issue. If that is true, it should be solved now, as crashtesting is now running on its own dedicated host.

-> As oVirt claimed internal connectivity issues, we will monitor the internal connectivity when all hosts are set up again. A drop of the internal connectivity right now would not affect us, as we don't run a network file system, so it's the best chance to diagnose whether there really are problems.
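
A very simple way to monitor connectivity between the hosts is to probe each of them at regular intervals and record every failure with a timestamp. The sketch below uses a plain TCP connection attempt as the probe; the port, the interval and the function names are assumptions for illustration, and the host names would have to be the internal ones.

```python
import socket
import time
from typing import Callable


def tcp_probe(host: str, port: int = 22, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts and DNS failures
        return False


def check_hosts(hosts: list[str],
                probe: Callable[[str], bool] = tcp_probe) -> dict[str, bool]:
    """Probe every host once and return a host -> reachable mapping."""
    return {host: probe(host) for host in hosts}


def monitor(hosts: list[str], interval: float = 10.0, rounds: int = 6,
            probe: Callable[[str], bool] = tcp_probe) -> list[tuple[float, str]]:
    """Probe the hosts repeatedly; return (timestamp, host) for each failed probe."""
    failures = []
    for round_number in range(rounds):
        for host, reachable in check_hosts(hosts, probe).items():
            if not reachable:
                failures.append((time.time(), host))
        if round_number < rounds - 1:
            time.sleep(interval)
    return failures
```

Run from each host against the other two, the recorded timestamps could later be compared with the times at which a management layer claims a host is unreachable.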

What we clearly need to do, and this e-mail is a first step towards being more transparent, is to communicate more often and more regularly with the community, to keep you up to date, and to attract new volunteers. With most of the admins being located in the same area, it's convenient to meet in person, but we appreciate it's a big barrier for those willing to contribute.

-> The infra team will communicate more openly, more transparently, more regularly and more publicly about what it is doing, and will try to attract new contributors.

-> Admin meetings in person do help (last time we introduced people to oVirt in person, and their knowledge really helped us out here), but we will have regular admin phone conferences where everything will be communicated, and important decisions are only to be taken either in the call or via the mailing list, not in personal meetings.

In terms of platform, we are sticking with the simplest solution for now, i.e. local storage and KVM. oVirt is charming and has quite a few features, but the more levels of complexity you add, the more risks you take on. Therefore, any decision to move to a more sophisticated virtualization solution and/or a networking file system again should be discussed, voted on and taken in public, with more people on board, and not be rushed.

-> The next admin call will be planned soon, and one of the topics will be how to proceed.

That being said, sorry again for the outage. In infra, anything that does not work is immediately spotted by many people and, especially with a worldwide distributed virtual community such as ours, has a heavy impact on everyone.

Even more I'd love to thank Alex, Cloph, Robert and Alin for their tremendous work on getting us back online as soon as possible. I'm really glad for your work and your knowledge and your support!


Florian

--
To unsubscribe e-mail to: projects+unsubscribe@global.libreoffice.org
Problems? http://www.libreoffice.org/get-help/mailing-lists/how-to-unsubscribe/
Posting guidelines + more: http://wiki.documentfoundation.org/Netiquette
List archive: http://listarchives.libreoffice.org/global/projects/
All messages sent to this list will be publicly archived and cannot be deleted
