

On 29/12/2019 15:17, Stephan Bergmann wrote:
Still trying to track down why sometimes zombie processes survive on the (Linux) Jenkins build machines (and then make later, unrelated Jenkins builds on those machines fail when zombie soffice.bin processes still hold onto named pipes that tests from the new builds want to create too).

One such recent case on tb79 was the aborted <https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/49895/>.  It left behind a zombie python.bin -> oosplash -> soffice.bin process tree executing UITest_calc_tests3.  (Where presumably the soffice.bin process had deadlocked, which then caused the Jenkins

Build timed out (after 15 minutes). Marking the build as aborted.
Build was aborted
Finished: ABORTED

reaction.  But once I noticed, the images of the involved processes had already been overwritten by later builds, so I couldn't use gdb to get backtraces.)

I think I now understand what's going on: Assume some UITest hangs with a deadlock in soffice.bin. <https://plugins.jenkins.io/build-timeout> will kick in after the specified timeout of 900s to kill the build.

The relevant process tree is

java───sh───tb-slave-wrapper─┬─make───make───sh───sh───python.bin───oosplash───soffice.bin
                             └─tee

where:

(1) java is running Jenkins' remoting.jar.

(2) The following sh is Jenkins' way of running

${LODE_HOME}/bin/tb_slave_wrapper --real --mode=config --clean

(as specified in "Build - Execute shell - Command" at <https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/configure>) via some `/bin/sh -xe /tmp/jenkins*.sh` intermediary.

(3) tb-slave-wrapper is running the <https://gerrit.libreoffice.org/plugins/gitiles/lode/+/af02c3a9564062b4d04e457275624a7a30ba2ba2/bin/tb_slave_wrapper> script, which at line 325 does

        make ${keep_going} $target 2>&1 | tee -a ${build_log}

which explains...

(4) ...the first make and...

(5) the tee.

(6) The second make is due to gbuild calling make recursively.

(7) The following sh is running the UITest target's (heavily redacted) recipe line

/bin/sh -c 'S=... && I=... && W=.. && rm -rf $W/UITest/... && mkdir -p $W/UITest/... && ... && ( TDOC=... /bin/sh $I/program/python $S/uitest/test_main.py ... || ( RET=$?; $S/solenv/bin/gdb-core-bt.sh ...))'

and...

(8) ...the following sh is running the

TDOC=... /bin/sh $I/program/python $S/uitest/test_main.py ...

part in a subshell.

(9) python.bin is running LO's uitest/test_main.py, which...

(10) ...spawns the soffice script which then execs oosplash, which...

(11) ...spawns soffice.bin.

<https://ci.libreoffice.org/> says at the bottom "Jenkins ver. 2.212", so let's assume that <https://github.com/jenkinsci/jenkins/commit/13ab7b9909927b7afa31797097b6114bf22a2a41> "[maven-release-plugin] prepare release jenkins-2.212" contains the source code of the remoting.jar running on the slave.

I assume the relevant starting point for the killing of the Jenkins job's processes is <https://github.com/jenkinsci/jenkins/blob/13ab7b9909927b7afa31797097b6114bf22a2a41/core/src/main/java/hudson/Launcher.java#L953>

        @Override
        public void kill(Map<String, String> modelEnvVars) throws InterruptedException {
            ProcessTree.get().killAll(modelEnvVars);
        }

which will try to kill all processes it finds that have inherited the BUILD_ID env var identifying the given Jenkins job.

<https://github.com/jenkinsci/jenkins/blob/13ab7b9909927b7afa31797097b6114bf22a2a41/core/src/main/java/hudson/util/ProcessTree.java#L731>

        public void killAll(@Nonnull Map<String, String> modelEnvVars) throws InterruptedException {
            for (OSProcess p : this)
                if(p.hasMatchingEnvVars(modelEnvVars))
                    p.killRecursively();
        }

first reads the /proc file system tree to get a list of process IDs (in the ProcfsUnix constructor at <https://github.com/jenkinsci/jenkins/blob/13ab7b9909927b7afa31797097b6114bf22a2a41/core/src/main/java/hudson/util/ProcessTree.java#L741>), then iterates over the list to find processes with a matching BUILD_ID and tries to kill them.

This looks racy, in that a to-be-killed build can spawn further processes after /proc has been read to produce the list. But that should not be relevant in our scenario of killing a hung build after a timeout, as that build will no longer be spawning new processes (everything but the deadlocked soffice.bin and its parent chain has already terminated).
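To make the snapshot-then-iterate behaviour concrete, here is a minimal Python sketch of that kind of scan. It is not the Jenkins code: the function name pids_with_env and its proc_root parameter are made up for illustration, and the BUILD_ID value in the usage note is hypothetical. The directory listing is taken once, so anything spawned afterwards is not seen, and the PIDs come back in whatever order the listing yields.

```python
import os


def pids_with_env(proc_root, key, value):
    """Return the PIDs under proc_root whose environ contains key=value.

    The directory is listed exactly once, so processes created after
    os.listdir() returns are missed. The result keeps the listing order,
    which os.listdir() documents as arbitrary.
    """
    wanted = f"{key}={value}".encode()
    matches = []
    for name in os.listdir(proc_root):          # one-shot snapshot of the PID list
        if not name.isdigit():
            continue                            # skip non-PID entries such as "self"
        try:
            with open(os.path.join(proc_root, name, "environ"), "rb") as f:
                entries = f.read().split(b"\0")  # environ is NUL-separated
        except OSError:
            continue                            # process vanished or is not readable
        if wanted in entries:
            matches.append(int(name))
    return matches
```

On a Linux machine this could be called as pids_with_env("/proc", "BUILD_ID", "49895"); passing a different proc_root makes it possible to try the function against a fake directory tree.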

However, what appears to be relevant here is that processes are killed in a random order (see below).

The actual process killing appears to take place at <https://github.com/jenkinsci/jenkins/blob/13ab7b9909927b7afa31797097b6114bf22a2a41/core/src/main/java/hudson/util/ProcessTree.java#L909>

        public static void destroy(int pid) throws IllegalAccessException,
                InvocationTargetException {
            if (JAVA8_DESTROY_PROCESS != null) {
                JAVA8_DESTROY_PROCESS.invoke(null, pid, false);
            } else {
                final Optional handle = (Optional)JAVA_9_PROCESSHANDLE_OF.invoke(null, pid);
                if (handle.isPresent()) {
                    JAVA_9_PROCESSHANDLE_DESTROY.invoke(handle.get());
                }
            }
        }


For Java <= 8 it will call the internal java.lang.UNIXProcess.destroyProcess with force=false at <http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/5b5973c3db08/src/solaris/native/java/lang/UNIXProcess_md.c#l717>

JNIEXPORT void JNICALL
Java_java_lang_UNIXProcess_destroyProcess(JNIEnv *env,
                                          jobject junk,
                                          jint pid,
                                          jboolean force)
{
    int sig = (force == JNI_TRUE) ? SIGKILL : SIGTERM;
    kill(pid, sig);
}

which thus sends a SIGTERM. Similarly for Java >= 9 it will call the internal java.lang.ProcessHandle.destroy at <http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/65464a307408/src/java.base/share/classes/java/lang/ProcessHandle.java#l324> (rather than the companion java.lang.ProcessHandle.destroyForcibly), which ends up at <http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/65464a307408/src/java.base/unix/native/libjava/ProcessHandleImpl_unix.c#l312>

JNIEXPORT jboolean JNICALL
Java_java_lang_ProcessHandleImpl_destroy0(JNIEnv *env,
                                          jobject obj,
                                          jlong jpid,
                                          jlong startTime,
                                          jboolean force) {
    pid_t pid = (pid_t) jpid;
    int sig = (force == JNI_TRUE) ? SIGKILL : SIGTERM;
    jlong start = Java_java_lang_ProcessHandleImpl_isAlive0(env, obj, jpid);

    if (start == startTime || start == 0 || startTime == 0) {
        return (kill(pid, sig) < 0) ? JNI_FALSE : JNI_TRUE;
    } else {
        return JNI_FALSE;
    }
}

with force=false, which thus also sends a SIGTERM.

To summarize (and if my browsing of the Jenkins source code is correct), Jenkins will send each process, in effectively random order, a SIGTERM, and will send no process a SIGKILL.

<https://gerrit.libreoffice.org/plugins/gitiles/lode/+/bea0738dbadfe8784e5d3c00f533acf101db4e7e%5E!/> "tb_slave_wrapper: trap signal and kill -9 everything"

trap cleanup 1 2 3 6 15

cleanup()
{
  echo "Caught Signal ... killing everything...."
  # kill everything in same process group (pseudo-pid 0)
  kill -9 0
}

ensures that, if tb_slave_wrapper receives a signal asking it to quit (incl. SIGTERM=15), it will eventually (i.e., once the Bash interpreter would proceed to a new line in the script) send a SIGKILL to all processes in the process group.

The processes (9)--(11) (python.bin, oosplash, soffice.bin) will not respond to their SIGTERM, as the first two are stuck waiting on their respective child, while the last is otherwise deadlocked. They need to rely on the SIGKILL that might get sent from (3) tb-slave-wrapper.
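The difference between the two signals can be tried out with a small Python sketch. The child here simply ignores SIGTERM, which is only a stand-in I chose for illustration (the real reason python.bin, oosplash, and soffice.bin stay alive is not reproduced here); the function name term_then_kill and the one-second grace period are likewise made up.

```python
import signal
import subprocess
import sys

# Stand-in child (assumption): ignores SIGTERM, then sleeps.
CHILD = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def term_then_kill(grace=1.0):
    """Return (survived_sigterm, final_returncode) for the stand-in child."""
    proc = subprocess.Popen([sys.executable, "-c", CHILD], stdout=subprocess.PIPE)
    proc.stdout.readline()               # wait until SIGTERM is being ignored
    proc.send_signal(signal.SIGTERM)     # the polite request
    try:
        proc.wait(timeout=grace)
        survived = False
    except subprocess.TimeoutExpired:
        survived = True                  # still running after the grace period
    proc.kill()                          # SIGKILL cannot be caught or ignored
    returncode = proc.wait()
    proc.stdout.close()
    return survived, returncode
```

On POSIX a negative return code from subprocess means the child was terminated by that signal number, so the second element shows which signal actually ended the process.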

But, as processes get the SIGTERM in a random order, if e.g. (8) (the innermost sh) gets a SIGTERM first, it will terminate and cause its parent chain (7), (6), (4), (3) to terminate too. That means that (3) tb-slave-wrapper can terminate before it would receive a SIGTERM, so it will not call cleanup.

That should explain those occasional

python.bin───oosplash───soffice.bin

zombie chains (hanging off process 1) from timed-out Jenkins builds that we observe on tb75, tb76, and tb79.
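The mechanism (a parent goes away on SIGTERM while its descendant, which was never signalled successfully, keeps running) can be sketched in Python as follows. This is a toy reproduction under my own assumptions: the two helper processes are plain Python interpreters rather than sh and soffice.bin, and the name orphan_after_parent_sigterm is made up.

```python
import os
import signal
import subprocess
import sys

# Intermediate process: starts a grandchild, reports its PID, then waits for it.
PARENT = (
    "import subprocess, sys\n"
    "g = subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(60)'],\n"
    "                     stdout=subprocess.DEVNULL)\n"
    "print(g.pid, flush=True)\n"
    "g.wait()\n"
)


def orphan_after_parent_sigterm():
    """SIGTERM only the intermediate process; report whether the grandchild lives on."""
    parent = subprocess.Popen([sys.executable, "-c", PARENT], stdout=subprocess.PIPE)
    grandchild_pid = int(parent.stdout.readline())
    parent.send_signal(signal.SIGTERM)    # default action: the parent terminates
    parent_rc = parent.wait()
    parent.stdout.close()
    try:
        os.kill(grandchild_pid, 0)        # signal 0 only checks for existence
        alive = True
    except ProcessLookupError:
        alive = False
    if alive:
        os.kill(grandchild_pid, signal.SIGKILL)   # clean up the demo orphan
    return parent_rc, alive
```

The grandchild outlives its parent because a signal sent to one PID is not forwarded to that process's children.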

Which in turn means that

On 29/12/2019 15:35, Noel Grandin wrote:
If we can't fix this, I suggest we add:

kill $(ps -o pid= --ppid $$)

to the end of the Jenkins build script

would not help, as the problematic processes do not have that $$ as their (immediate) parent.

(And what would happen if the cleanup function in tb_slave_wrapper did kick in? I think it would wreak havoc, as the outer Jenkins remoting.jar process, as well as any parallel tb_slave_wrapper instances, are in the same process group, and would thus all get forcibly killed.)
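Why `kill -9 0` reaches that far can be illustrated with a short Python sketch: a child started the ordinary way stays in its parent's process group, so a signal addressed to the group hits both. The start_new_session=True variant is shown purely as a contrast; nothing in tb_slave_wrapper or Jenkins is claimed to use it, and the function name process_group_ids is made up.

```python
import os
import subprocess
import sys

SLEEP = [sys.executable, "-c", "import time; time.sleep(60)"]


def process_group_ids():
    """Return (own pgid, pgid of a plain child, pgid of a child in a new session)."""
    plain = subprocess.Popen(SLEEP)                              # inherits our group
    isolated = subprocess.Popen(SLEEP, start_new_session=True)   # setsid() in the child
    try:
        return os.getpgid(0), os.getpgid(plain.pid), os.getpgid(isolated.pid)
    finally:
        for p in (plain, isolated):
            p.kill()
            p.wait()
```

The first two values are equal, which is the situation described above: everything started without an explicit group or session change shares one process group with its ancestor.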

