
 1. davido
 2. guilhem
 3. Brett
 4. Christian

 * Upgrade Gerrit to 2.13.10
   + Issue filed in Redmine
   + Who is working on it
     - G. Do we need to go through that again?  The issue was filed and
       assigned, we'll follow up from there (no update there = no news)
   + Next steps
   + Set up staging gerrit instance and sync production data
     - Assigned target to Q1

 * Monitoring update
    + Brett: I believe Prometheus to be the better solution for infra monitoring.
      - Prometheus is actively maintained by its community, whereas TICK
        relies on InfluxDB. Exporters (data collection binaries) are already
        available in Debian stable and Debian backports (even the shiny, new
        2.0 release). TICK requires external repositories and would require
        some auditing (see: Chronograf phoning home by default).
        . G. yay, agreed :-)
        . What's up with the docker containers on vm213? Brett: I had
          just installed a bunch of throwaway services for metrics
        . grafana: http://localhost:3000/dashboard/db/node-exporter-full (forward 3000/TCP to vm213)
        . prometheus: http://localhost:9090/consoles/node-overview.html (forward 9090/TCP to vm213)
      - Prometheus does not have a useful built-in dashboard, only ad-hoc
        query input: They recommend using Grafana for that. Presently, Debian
        only has a package in Sid.
        . Package was removed from testing during the freeze due to two RC
          bugs, and was subsequently orphaned by its maintainer
        . G. Not a blocker for us: I'd refrain from using third party repos
          when possible, but we need a single installation of that package
          (installing prometheus/telegraf from a third party repo on every
          single host would be another story…)  Might even step in and adopt
          the package if its maintenance is not a burden :-)
      - Debian only has a small number of exporters available in the repos -
        We'd have to manually install/configure any additional exporters from
        . Not a blocker, we can build ourselves and tell salt to install the .deb
      - G. Confidentiality and integrity protection
        . exporters installed to vm191 and vm213 for now; they currently
          communicate via intranet (private IP range)
        . monitoring server needs to be offsite (eg, reuse monitoring.tdf),
          and metrics need to be protected: either with SSL/TLS, or with
          → go for TLS tunnels, most hosts have an nginx instance anyway
        . client auth: HTTP digest or client certs (ECDSA for minimum overhead)
        . server (exporter) auth: server certs (ECDSA for minimum overhead)
        . SSL/TLS protection needs an extra SSL termination proxy (eg nginx,
          or stunnel4); unfortunately both client and server insist on verifying
          the chain for mutual auth — instead of pinning the key material…
      - Dashboards (prometheus & grafana) protection: refrain from using SSO
        here so admins are not locked out if the SAML IdP or LDAP server are
        unavailable
      - Each exporter listens on a different INET port (91XY) and speaks HTTP
        with ‘/metrics’ as entry path.  Do we want to open a gazillion ports,
        or a single port with multiple entry paths behind the reverse proxy?
        (eg, ‘/MySQL/metrics’)
      - Status of
        . Ubuntu 14.04.5 LTS root server, hosted at filoo
        . guilhem: Would prefer a Debian (stretch) box instead for the sake of
          uniformity, ok to wipe and recycle?
        . AI guilhem: get in touch with filoo if there is no rescue boot
      - Users have requested a public dashboard / status page, with basic info
        (no graph) such as service/host up/down and a custom field we (admin
        team) can fill to tell them we're aware of the problem and are working
        on it
        . Probably doable with prometheus API(?)
        . Exposing blackbox exporter metrics from our various services would
          probably be enough (HTTP return code and timing)
        . Example: , powered by
          → Sponsored service?
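
A minimal sketch of the status-page idea above, assuming the page is fed by an instant query for the blackbox exporter's `probe_success` metric against the Prometheus HTTP API (`/api/v1/query`). The function names, the `notices` mapping (the custom field the admin team would fill in) and the sample instance names are hypothetical; fetching the JSON over HTTP is left out so the example runs offline.

```python
from typing import Dict, List, Optional


def summarize_probe_status(api_response: dict,
                           notices: Optional[Dict[str, str]] = None) -> List[dict]:
    """Turn a Prometheus instant-query response into status-page rows."""
    if api_response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    notices = notices or {}
    rows = []
    for sample in api_response["data"]["result"]:
        instance = sample["metric"].get("instance", "unknown")
        # An instant vector sample is [timestamp, "value"]; the value is a string.
        is_up = float(sample["value"][1]) == 1.0
        rows.append({
            "service": instance,
            "state": "up" if is_up else "down",
            # Free-text note maintained by admins (hypothetical field).
            "notice": notices.get(instance, ""),
        })
    return sorted(rows, key=lambda row: row["service"])


def render_text(rows: List[dict]) -> str:
    """Render the rows as a plain-text table, one service per line."""
    return "\n".join(
        f"{row['service']:<30} {row['state']:<5} {row['notice']}".rstrip()
        for row in rows
    )


# Hand-written response in the documented shape; instance names are made up.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "wiki.example.org"}, "value": [1520000000, "1"]},
            {"metric": {"instance": "gerrit.example.org"}, "value": [1520000000, "0"]},
        ],
    },
}

if __name__ == "__main__":
    notes = {"gerrit.example.org": "We are aware of the outage and working on it"}
    print(render_text(summarize_probe_status(SAMPLE_RESPONSE, notes)))
```

Keeping the notices in a separate mapping means the public page can show an admin message even while the metrics themselves only say "down".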

    + Is there a desire for just infra monitoring or application-level
      monitoring as well? If we need application-level monitoring, the ELK stack
      is recommended:
      - At least mail queue, database (MySQL, PostgreSQL, slapd) operations,
        HTTPd response code & timing
        . Brett: These can be handled by prometheus, though AFAIK apache/nginx
          need a module to get in-depth stats.
      - ELK: ElasticSearch + LogStash + Kibana
      - LogStash vs. graylog pro & cons?  (We already have an instance of the
        . Brett: I forgot about graylog, sorry. I see no reason to switch from it.
        . OK let's keep it then

    + Alerting
      - The alert system of our current Icinga-based monitoring system is
        (mostly) not working, and having working alerts is an incentive for
        refactoring the monitoring system; so we really want that one to work
      - Threshold-based is good enough
      - As discussed a few calls ago, needs to be schedulable so volunteers
        aren't awoken during their vacation
        . is a feature
          request from last year with no priority :(
        . Possible workaround at
          There are mentions that using the prometheus API/webhooks could work.
        . Wouldn't it be up to the volunteer to silence alarms when on vacation?
          → It's also about weekends and nights: we don't want to give all
            infra volunteers the feeling that they are on duty
      - Need (at least) SMS *and* mail
        . Prometheus' alertmanager can be bridged to an SMS provider

        . Brett: I've had success with using email as provided by telecoms.
          e.g. I use T-Mobile (Deutsche Telekom) and can email
          to get a text.
        . How about other countries?  Not aware of a Swedish provider offering
          a similar service
        . sipgate also has an API (RPS & REST) to send SMS
          (German entry page to various API docs; spec in English)
        . so does
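
One possible shape for the scheduling requirement above, assuming alerts reach a small webhook receiver as Alertmanager's JSON payload (an `alerts` list whose entries carry `status` and `labels`). The `Volunteer` class, its fields and `route_alerts` are hypothetical names for illustration only; actually sending the mail or SMS is left out. Mail always goes out, while SMS is limited to each volunteer's chosen hours and skipped on weekends and during vacations.

```python
from dataclasses import dataclass
from datetime import date, datetime, time
from typing import List, Tuple


@dataclass
class Volunteer:
    name: str
    # Hours (volunteer's local time) during which SMS is acceptable: (start, end).
    sms_hours: Tuple[time, time]
    accepts_weekends: bool = False
    # Inclusive (first_day, last_day) vacation ranges.
    vacations: Tuple[Tuple[date, date], ...] = ()

    def reachable_by_sms(self, now: datetime) -> bool:
        """True if an SMS at `now` respects vacations, weekends and hours."""
        if any(first <= now.date() <= last for first, last in self.vacations):
            return False
        if now.weekday() >= 5 and not self.accepts_weekends:
            return False
        start, end = self.sms_hours
        return start <= now.time() < end


def route_alerts(payload: dict, roster: List[Volunteer],
                 now: datetime) -> List[Tuple[str, str, str]]:
    """Return (channel, recipient, alertname) tuples for every firing alert."""
    deliveries = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts are ignored in this sketch
        alertname = alert.get("labels", {}).get("alertname", "unknown")
        for volunteer in roster:
            deliveries.append(("mail", volunteer.name, alertname))
            if volunteer.reachable_by_sms(now):
                deliveries.append(("sms", volunteer.name, alertname))
    return deliveries


if __name__ == "__main__":
    roster = [
        Volunteer("alice", (time(8), time(22))),
        Volunteer("bob", (time(8), time(22)), accepts_weekends=True),
    ]
    payload = {"alerts": [{"status": "firing", "labels": {"alertname": "HostDown"}}]}
    # `now` is naive and treated as every volunteer's local time (a simplification).
    for delivery in route_alerts(payload, roster, datetime(2018, 3, 24, 10, 0)):
        print(delivery)
```

Putting the schedule in the receiver rather than asking volunteers to silence alarms themselves matches the concern that nobody should feel permanently on duty.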

 * SSO adoption:
   + 572 accounts in total (72 since the last infra call)
   + Nextcloud is now using SAML (unauthenticated users are redirected to
     auth.tdf); accounts not in LDAP yet are *locked out*
   + All MC members now have an LDAP account; shared creds (HTTP digest auth)
     are now deprecated
   + governance: 1/10 board member missing; 40/190 (21%) TDF members missing
   + contributors: 84/175 (48%) recent (last 90 days) wiki editors missing
   + Need to resume the redmine migration to SAML

 * Next call: Tuesday March 20 2018 at 18:30 Berlin time (17:30 UTC)

