Hello,
the infra team has met to review the past issues, and would like to make
its analysis public to the community:
I. introduction
We have three big servers running, called dauntless, excelsior and
falco. The first two were put online in October, the last one in
December. The planned platform was oVirt with Gluster, with CentOS as
the base system. All servers are comparable in their hardware setup,
with 256 GB of RAM, one internal and one external Gbit networking card,
server-grade mainboard with IPMI, several HDDs, and 64 core CPU. All
three hosts were connected internally via a Gbit link, with oVirt
managing all of them, and Gluster being the underlying network file system.
II. before going productive
Extensive tests were carried out before going live with the platform.
The exact CentOS, oVirt and Gluster versions used for production were
tested twice, on separate hardware and on the actual hardware, with the
IPMI and BIOS versions used later; that included two disaster
simulations, where one host was disconnected unannounced and oVirt
behaved exactly as expected - running the VMs on the other host without
interruption.
When the platform was ready, to avoid endangering anything, we first
migrated only non-critical VMs, i.e. mainly testing VMs where downtime
is not critical, but which still produce quite some load. The platform
ran exclusively with these for several weeks, with no problems
detected.
After that, several other VMs were migrated, including Gerrit, and the
system worked fine for weeks with no I/O issues and no downtime.
III. downtime before FOSDEM
The first issues occurred from Wednesday, January 28, around 1440 UTC,
until Thursday, January 29, early morning. Falco was disconnected for up
to a couple of minutes. The reason for this is still unclear.
-> The infra team is looking for an oVirt expert who can help us
parse the logs to better understand what has happened.
At the same time, for an unrelated reason, a file system corruption on
excelsior was discovered, a failure that had already existed since
January 5.
-> The monitoring scripts, based on snmpd, claimed everything was
OK. The scripts have since been enhanced to properly report the status.
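To illustrate the kind of check that was missing, here is a minimal
sketch in Python, assuming the corruption shows up as a file system
being remounted read-only; this is not our actual snmpd extension, just
an example of the idea:

    #!/usr/bin/env python
    # Illustrative sketch only, not our actual monitoring script: report
    # local file systems that have been remounted read-only, a typical
    # symptom of the kind of corruption that went unnoticed on excelsior.
    import sys

    # Pseudo and read-only-by-design file systems we skip (assumption).
    IGNORE_FSTYPES = {"proc", "sysfs", "devpts", "tmpfs", "cgroup",
                      "iso9660", "squashfs"}

    def read_only_mounts():
        bad = []
        with open("/proc/mounts") as mounts:
            for line in mounts:
                device, mountpoint, fstype, options = line.split()[:4]
                if fstype in IGNORE_FSTYPES:
                    continue
                if "ro" in options.split(","):
                    bad.append((device, mountpoint, fstype))
        return bad

    if __name__ == "__main__":
        problems = read_only_mounts()
        if problems:
            for device, mountpoint, fstype in problems:
                print("CRITICAL: %s (%s) is mounted read-only at %s"
                      % (device, fstype, mountpoint))
            sys.exit(2)  # Nagios-style "critical" exit code
        print("OK: all monitored file systems are writable")
        sys.exit(0)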
Each of these errors on its own, if detected in time, would have caused
no downtime at all.
With that, 2 out of 3 Gluster bricks were down, and the platform was
stopped. (Gluster is comparable to RAID 5 here.) Gluster detected a
possible split-brain situation, so a manual recovery was required.
Starting the fix was not complicated, in comparison to other network
file systems, and could easily be handled, but the recovery took a long
time due to the amount of data already on the platform and the internal
Gbit connectivity. Depending on the VM, the downtime was between 3 and
18 hours. oVirt's database also had issues, which could however be
fixed. In other words, the reason for most of the downtime was not
finding a fix, but waiting for it to be completed.
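For those curious, inspecting such a state boils down to a few Gluster
commands; here is a small Python wrapper as a sketch (the volume name
below is a placeholder, not our real one):

    #!/usr/bin/env python
    # Illustrative sketch: summarise the self-heal / split-brain state of
    # one Gluster volume. The volume name is a made-up placeholder.
    import subprocess

    VOLUME = "vmstore"  # placeholder volume name

    def gluster(*args):
        """Run a gluster CLI command and return its output as text."""
        return subprocess.check_output(("gluster",) + args).decode()

    if __name__ == "__main__":
        # Entries still waiting to be healed, listed per brick.
        print(gluster("volume", "heal", VOLUME, "info"))
        # Entries Gluster itself flags as (possible) split-brain.
        print(gluster("volume", "heal", VOLUME, "info", "split-brain"))
        # Overall brick and self-heal daemon status.
        print(gluster("volume", "status", VOLUME))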
-> Situations like these can be less time-consuming with an
(expensive) internal 10 Gbit connection, or with a (slower, but more
redundant and cheaper) internal trunked/bonded x * 1 Gbit connection,
which we will be looking into.
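To give a feeling for the numbers, here is a back-of-the-envelope
calculation in Python; the 4 TB figure is an arbitrary example, not our
actual data volume, and real heals are slower still due to protocol and
disk overhead:

    # Rough raw transfer times for copying data during a heal over
    # different internal links. Example figure only.
    DATA_TB = 4.0
    LINKS_GBIT = {"1 Gbit": 1.0,
                  "4 x 1 Gbit bonded (ideal)": 4.0,
                  "10 Gbit": 10.0}

    for name, gbit in sorted(LINKS_GBIT.items()):
        seconds = (DATA_TB * 8 * 1000) / gbit  # TB -> Gbit / link speed
        print("%-26s ~%.1f hours" % (name, seconds / 3600.0))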
-> Currently in progress is an SMS monitoring system, for which we
are looking for volunteers to be included in the notifications. SMS
notifications will be sent out in case of severe issues and can be
combined with Android tools like Alarmbox.
-> In the meantime, we have also fine-tuned the alerts and
thresholds to distinguish messages.
All infra members, including several volunteers, were working jointly
on getting the services back up. However, we experienced some issues
with individual VMs where it was unclear who was responsible for them,
and where documentation was partially missing or outdated. In the end,
it all worked out.
-> Infra will enforce a policy for new and existing services. At
least 2 responsible maintainers are required per service, including
proper documentation. That will be announced with a fair deadline.
Services not fulfilling those requirements will be moved from production
to test. A concrete policy is still to be drafted with the public.
On a side note, we discovered that oVirt does not support running
Gluster on the internal interface while the hosted engine and management
run on the external interface, although this setup had worked fine for
months and survived two disaster simulations. This is not mentioned in
the oVirt documentation; Alex discovered it during FOSDEM when he
attended an oVirt talk where it was brought up as a side note.
-> One option is to look into SAN solutions, which are not only
faster, but probably also more reliable. We might have an offer of
support here that needs looking into.
During FOSDEM, Alex also got in touch with one oVirt and one Gluster
developer. We also talked to a friendly supporter who has gained
experience with Gluster and virtualization by providing infrastructure
solutions in a larger company for a living, and who offered to help us.
While his initial comment was that Gluster is not to be recommended, it
turned out in a later phone conference that his experience is several
years old and relates to older kernels that lack a patch allowing KVM to
run on Gluster. oVirt has that patch included, and newer distributions
do as well. After we outlined the situation, he supported our approach
and finished with "It looks like you're doing the right thing, and here
simply Murphy kicked in. You should proceed, it looks like a good
solution."
All services were running fine again, with the reasons for the outage
identified and solutions proposed, when we headed to Brussels for
FOSDEM.
IV. downtime after FOSDEM
The second downtime occurred on Tuesday, February 17, around 1800 UTC,
when we discovered the virtual machines were getting slow. Gluster was
in a consistent state and there was no need to heal; we discovered the
slowness was due to oVirt migrating virtual machines from one host to
another. The reason for that is unknown - although connectivity was in
place and services were running, oVirt assumed dauntless and excelsior
were down, and so migrated all VMs to falco.
Excelsior rebooted itself the same day around 1946 UTC, which led to a
disconnect of several minutes. The reason is unclear, as the log files
give no indication.
-> The infra team is looking for an oVirt expert who can help us
parse the logs to better understand what has happened.
Gluster then began to heal itself. Despite healing, Gluster was
available all of the time and required no manual intervention, and no
split-brain situation occurred; it was just slower due to the limited
internal network bandwidth.
-> Situations like these can be less time-consuming with an
(expensive) internal 10 Gbit connection, or with a (slower, but more
redundant and cheaper) internal trunked/bonded x * 1 Gbit connection,
which we will be looking into.
The hosted engine, which is responsible for managing the virtual
machines, did not come up again and required manual intervention to get
it running.
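For reference, a minimal sketch of what such a manual check and start
can look like, wrapped in Python; the status parsing here is
deliberately crude and only an illustration:

    #!/usr/bin/env python
    # Illustrative sketch: check whether the oVirt hosted engine VM is up
    # on this host, and try to start it if it does not look like it is.
    import subprocess

    def vm_status():
        """Return the raw output of 'hosted-engine --vm-status'."""
        return subprocess.check_output(["hosted-engine", "--vm-status"]).decode()

    if __name__ == "__main__":
        status = vm_status()
        print(status)
        # Very rough check; a real script would parse the status properly.
        if "up" not in status.lower():
            subprocess.check_call(["hosted-engine", "--vm-start"])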
The infra team - again with everyone working jointly - shut down all
non-critical VMs to speed up the process and free cycles, and the
systems were stable again around 2100 UTC. Once more, kicking off the
fix was not the most time-consuming part; waiting for the fix to be
carried out was, as we had to wait for the Gluster heal to finish.
The next morning, around 0700 UTC, we discovered services were
partially down again. It turned out that dauntless had frozen at 0155
UTC for an unknown reason and eventually rebooted itself. After the
reboot, oVirt, which manages the firewall, locked it down for no
apparent reason, which led to Gluster being disconnected from the other
bricks.
After discovering that, we first revoked oVirt's access to the IPMI
management interface to keep it from rebooting the hosts, and manually
opened the firewall.
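For illustration, manually re-opening the firewall boils down to
commands along these lines; the internal peer addresses below are
placeholders, not our real ones:

    #!/usr/bin/env python
    # Illustrative sketch: re-allow traffic from the internal peers after
    # oVirt locked down the firewall. Addresses are placeholders.
    import subprocess

    INTERNAL_PEERS = ["192.168.100.2", "192.168.100.3"]  # hypothetical IPs

    for peer in INTERNAL_PEERS:
        # Insert an ACCEPT rule at the top of the INPUT chain so glusterd
        # and the brick processes can reach each other again.
        subprocess.check_call(["iptables", "-I", "INPUT",
                               "-s", peer, "-j", "ACCEPT"])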
-> Currently in progress is an SMS monitoring system, for which we
are looking for volunteers to be included in the notifications. SMS
notifications will be sent out in case of severe issues and can be
combined with Android tools like Alarmbox.
We also immediately ordered a temporary server and migrated the website
VM to it, so as not to endanger the planned release; the migration of
the website went smoothly. In the meantime, after the firewall had been
opened manually, Gluster began healing, and there was no split-brain
situation. Again, I/O was limited due to the internal connectivity, and
kicking off the fix was much faster than waiting for it to be carried
out, but the file system was available the whole time.
-> Situations like these can be less time-consuming with an
(expensive) internal 10 Gbit connection, or with a (slower, but more
redundant and cheaper) internal trunked/bonded x * 1 Gbit connection,
which we will be looking into.
We then migrated all productive VMs (the others were stopped in the
meantime) to excelsior, in order to reboot dauntless and falco.
To get to a stable state, we decided to get rid of oVirt and
reinstalled falco with plain Linux, local storage and KVM: no oVirt, no
Gluster, KISS principle. That way we could migrate several critical VMs
in time.
The migration of the last productive VM on the oVirt platform, Gerrit,
needed more time due to the sheer amount of data.
-> Situations like these can be less time-consuming with an
(expensive) internal 10 Gbit connection, or with a (slower, but more
redundant and cheaper) internal trunked/bonded x * 1 Gbit connection,
which we will be looking into.
V. gerrit downtime
We decided to take a snapshot of the VM while it was running, so as not
to interrupt development, then copy the snapshot over, and finally take
down Gerrit during an announced maintenance window, either on a weekend
or in the evening, to copy a diff over.
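As a rough sketch of one way such an "online copy plus final diff" can
be done, assuming the VM disk is a plain image file (the paths and the
target host below are made up):

    #!/usr/bin/env python
    # Rough sketch of the planned "copy while running, sync a diff later"
    # approach. Paths and target host are hypothetical placeholders.
    import subprocess

    IMAGE = "/var/lib/libvirt/images/gerrit.img"    # placeholder path
    TARGET = "root@falco:/var/lib/libvirt/images/"  # placeholder target

    def sync_image():
        # Once a copy exists on the target, rsync only transfers the
        # changed parts of the file, which keeps the second run short.
        subprocess.check_call(["rsync", "-a", "--inplace", "--progress",
                               IMAGE, TARGET])

    if __name__ == "__main__":
        # First run: bulk copy taken from a snapshot while Gerrit runs.
        sync_image()
        # Second run: during the maintenance window, with Gerrit shut
        # down, sync the remaining differences before starting the VM on
        # the new host.
        sync_image()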
Unfortunately, the platform didn't survive until then. On Tuesday,
February 24, around 0834 UTC, Gerrit went down, and the image was not
yet fully copied over to falco. Dauntless got stuck again for no
apparent reason. The platform was nearly unusable then, as oVirt did not
know the states of several VMs or showed the wrong status, and was
unable to perform any action, either via the web interface or via the
command line. Several VMs were shown as defunct on the shell, i.e. there
were zombie processes. Gluster ran without problems, but oVirt was
confused to the point that not even manually editing oVirt's database to
reflect the real VM status helped; oVirt immediately changed it back to
an unknown state.
We then decided, as Gerrit was down anyway, to migrate it directly to
falco with plain KVM instead of bringing it back up on oVirt. The move
to falco per se went smoothly; we just had to wait again due to the
internal connectivity, so once again, finding the fix was not what took
the time, but waiting for it to be carried out. The downtime in the end
was between 3 and 4 hours.
-> Situations like these can be less time-consuming with an
(expensive) internal 10 Gbit connection, or with a (slower, but more
redundant and cheaper) internal trunked/bonded x * 1 Gbit connection,
which we will be looking into.
VI. further conclusions and next steps
The status quo is that our productive VMs run on falco with plain KVM
and no network file system, and the website is on a different host,
likewise with plain KVM and no network file system.
The two other hosts have also been set up, one running testing and
non-productive VMs, the other exclusively running the crashtest VM. In
the near future, we need to schedule a downtime to redistribute the VMs
between both hosts, in terms of availability and load.
One assumption is that we hit a kernel bug related to KVM and some
specific CPUs. We were able to take a screenshot of a kernel panic on
dauntless, and research showed that such a bug exists and can occur
under certain circumstances. The CentOS version used was based on a 2.6
kernel (newer CentOS versions with newer kernels had a pending bug with
regard to oVirt). Supporting that assumption is the fact that our
website VM, which previously froze every few days (in the end we
power-cycled it before that could occur), has now been running fine for
18 days as the same VM, but on a different host environment.
Timing-wise, the problems started when the third host was added, which
carried the crashtesting VM. The crashtesting is known to regularly kill
the VMs it's running in, depending on the kernel, but so far no problems
with the host had occurred. It had also been running without problems
before on the two-server setup. However, the resources attached to it
were fewer than on the third host, so chances are this change is at
least part of the problem. As KVM uses paravirtualization for I/O and
networking, both of which became a bottleneck that led to downtime, and
as we have seen the KVM kernel panic, there is indeed a chance that this
is part of the root issue. If that is true, it should be solved now, as
crashtesting is now running on its own dedicated host.
-> As oVirt claimed internal connectivity issues, we will monitor
the internal connectivity when all hosts are set up again. A drop of the
internal connectivity right now would not affect us, as we don't run a
network file system, so this is the best chance to diagnose whether
there really are problems.
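A simple probe along the following lines would already be enough for
that; the peer addresses are again hypothetical placeholders:

    #!/usr/bin/env python
    # Illustrative sketch: periodically ping the internal interfaces of
    # the other hosts and log failures, to see whether the connectivity
    # drops oVirt claimed actually happen. Addresses are placeholders.
    import datetime
    import subprocess
    import time

    INTERNAL_PEERS = ["192.168.100.2", "192.168.100.3"]  # hypothetical IPs
    INTERVAL = 10  # seconds between probe rounds

    while True:
        for peer in INTERNAL_PEERS:
            # One ping with a 2-second timeout; non-zero exit = failure.
            result = subprocess.call(["ping", "-c", "1", "-W", "2", peer],
                                     stdout=subprocess.DEVNULL,
                                     stderr=subprocess.DEVNULL)
            if result != 0:
                print("%s unreachable at %s UTC"
                      % (peer, datetime.datetime.utcnow().isoformat()))
        time.sleep(INTERVAL)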
What we clearly need to do, and this e-mail is a first step towards
being more transparent, is to communicate more often and more regularly
with the community, to keep you up to date, and to attract new
volunteers. With most of the admins being located in the same area, it's
convenient to meet in person, but we appreciate that this is a big
barrier for those willing to contribute.
-> The infra team will communicate more openly, more transparently,
more regularly and more publicly about what it is doing, and will try to
attract new contributors.
-> Admin meetings in person do help (last time we introduced people
to oVirt in person, and their knowledge really helped us out here), but
we will have regular admin phone conferences where everything will be
communicated, and important decisions are only to be taken either in the
call or via the mailing list, not in personal meetings.
In terms of platform, we are sticking with the simplest solution for
now, i.e. local storage and KVM. oVirt is charming and has quite a few
features, but the more levels of complexity you add, the more risk it
bears. Therefore, any decision to move again to a more sophisticated
virtualization solution and/or a network file system should be
discussed, voted on and taken in public, with more people on board, and
not be rushed.
-> The next admin call will be planned soon, and one of the topics
will be how to proceed.
That being said, sorry again for the outage. In infra, anything that
does not work is immediately spotted by many people, and especially with
a worldwide, distributed virtual community such as ours, it has a heavy
impact on everyone.
Even more, I'd love to thank Alex, Cloph, Robert and Alin for their
tremendous work on getting us back online as soon as possible. I'm
really grateful for your work, your knowledge and your support!
Florian