Re: How are Jenkins builds killed exactly?

Stephan Bergmann <sbergman -AT- redhat.com>
Wed, 8 Jan 2020 16:23:21 +0100

On 29/12/2019 15:17, Stephan Bergmann wrote:

Still trying to track down why sometimes zombie processes survive on the (Linux) Jenkins build machines (and then make later, unrelated Jenkins builds on those machines fail when zombie soffice.bin processes still hold onto named pipes that tests from the new builds want to create too).
One such recent case on tb79 was the aborted <https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/49895/>. It left behind a zombie python.bin -> oosplash -> soffice.bin process tree executing UITest_calc_tests3. (Where presumably the soffice.bin process had deadlocked, which then caused the Jenkins
Build timed out (after 15 minutes). Marking the build as aborted.
Build was aborted
Finished: ABORTED
reaction. But once I noticed, the images of the involved processes had already been overwritten by later builds, so I couldn't use gdb to get backtraces.)

I think I now understand what's going on: Assume some UITest hangs with a deadlock in soffice.bin. <https://plugins.jenkins.io/build-timeout> will kick in after the specified timeout of 900s to kill the build.


The relevant process tree is

java───sh───tb-slave-wrapper─┬─make───make───sh───sh───python.bin───oosplash───soffice.bin
                             └─tee


where:

(1) java is running Jenkins' remoting.jar.

(2) The following sh is Jenkins' way of running

${LODE_HOME}/bin/tb_slave_wrapper --real --mode=config --clean

(as specified in "Build - Execute shell - Command" at <https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/configure>) via some `/bin/sh -xe /tmp/jenkins*.sh` intermediary.

(3) tb-slave-wrapper is running the <https://gerrit.libreoffice.org/plugins/gitiles/lode/+/af02c3a9564062b4d04e457275624a7a30ba2ba2/bin/tb_slave_wrapper> script, which at line 325 does

        make ${keep_going} $target 2>&1 | tee -a ${build_log}


which explains...

(4) ...the first make and...

(5) the tee.

(6) The second make is due to gbuild calling make recursively.

(7) The following sh is running the UITest target's (heavily redacted) recipe line

/bin/sh -c 'S=... && I=... && W=.. && rm -rf $W/UITest/... && mkdir -p $W/UITest/... && ... && ( 
TDOC=... /bin/sh $I/program/python $S/uitest/test_main.py ... || ( RET=$?; $S/solenv/bin/gdb-core-bt.sh ...))'


and...

(8) ...the following sh is running the

TDOC=... /bin/sh $I/program/python $S/uitest/test_main.py ...


part in a subshell.

(9) python.bin is running LO's uitest/test_main.py, which...

(10) ...spawns the soffice script which then execs oosplash, which...

(11) ...spawns soffice.bin.

<https://ci.libreoffice.org/> says at the bottom "Jenkins ver. 2.212", so let's assume that <https://github.com/jenkinsci/jenkins/commit/13ab7b9909927b7afa31797097b6114bf22a2a41> "[maven-release-plugin] prepare release jenkins-2.212" contains the source code of the remoting.jar running on the slave.

I assume the relevant starting point for the killing of the Jenkins job's processes is <https://github.com/jenkinsci/jenkins/blob/13ab7b9909927b7afa31797097b6114bf22a2a41/core/src/main/java/hudson/Launcher.java#L953>

        @Override
        public void kill(Map<String, String> modelEnvVars) throws InterruptedException {
            ProcessTree.get().killAll(modelEnvVars);
        }

which will try to kill all processes it finds that have inherited the BUILD_ID env var identifying the given Jenkins job.


<https://github.com/jenkinsci/jenkins/blob/13ab7b9909927b7afa31797097b6114bf22a2a41/core/src/main/java/hudson/util/ProcessTree.java#L731>

        public void killAll(@Nonnull Map<String, String> modelEnvVars) throws InterruptedException {
            for (OSProcess p : this)
                if(p.hasMatchingEnvVars(modelEnvVars))
                    p.killRecursively();
        }

first reads the /proc file system tree to get a list of process IDs (in the ProcfsUnix constructor at <https://github.com/jenkinsci/jenkins/blob/13ab7b9909927b7afa31797097b6114bf22a2a41/core/src/main/java/hudson/util/ProcessTree.java#L741>), then iterates over the list to find processes with a matching BUILD_ID and tries to kill them.

This looks racy, in that a to-be-killed build can spawn further processes after /proc has been read to produce the list. But that should not be relevant in our scenario of killing a hung build after a timeout, as that build will no longer be spawning new processes (everything but the deadlocked soffice.bin and its parent chain has already terminated).

However, what appears to be relevant here is that processes are killed in a random order (see below).
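For illustration only, the /proc scan and environment match described above can be sketched in Python. This is not Jenkins' code: the function names and the proc_root parameter are made up here, and the empty-model check is an assumption about how such a match would sensibly behave. It shows the two-step shape (snapshot the PID list first, then inspect each process), which is where the race mentioned above comes from.

```python
import os


def parse_environ(raw: bytes) -> dict:
    """Parse the NUL-separated KEY=VALUE entries of /proc/<pid>/environ."""
    env = {}
    for entry in raw.split(b"\0"):
        key, sep, value = entry.partition(b"=")
        if sep:  # skip empty or malformed entries
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def has_matching_env_vars(env: dict, model: dict) -> bool:
    """True if every variable in model is present in env with the same value."""
    if not model:
        return False  # assumption: an empty model matches nothing
    return all(env.get(key) == value for key, value in model.items())


def find_matching_pids(model: dict, proc_root: str = "/proc") -> list:
    """Snapshot the PID list once, then check each process's environment."""
    pids = [int(name) for name in os.listdir(proc_root) if name.isdigit()]
    matches = []
    for pid in pids:
        try:
            with open(os.path.join(proc_root, str(pid), "environ"), "rb") as f:
                env = parse_environ(f.read())
        except OSError:
            continue  # process already gone, or not readable by this user
        if has_matching_env_vars(env, model):
            matches.append(pid)
    return matches
```

Something like find_matching_pids({"BUILD_ID": "..."}) run on a build machine would also be a way to spot leftover processes from an aborted job, since they keep the inherited BUILD_ID even after being reparented.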

The actual process killing appears to take place at <https://github.com/jenkinsci/jenkins/blob/13ab7b9909927b7afa31797097b6114bf22a2a41/core/src/main/java/hudson/util/ProcessTree.java#L909>

        public static void destroy(int pid) throws IllegalAccessException,
                InvocationTargetException {
            if (JAVA8_DESTROY_PROCESS != null) {
                JAVA8_DESTROY_PROCESS.invoke(null, pid, false);
            } else {
                final Optional handle = (Optional)JAVA_9_PROCESSHANDLE_OF.invoke(null, pid);
                if (handle.isPresent()) {
                    JAVA_9_PROCESSHANDLE_DESTROY.invoke(handle.get());
                }
            }
        }

For Java <= 8 it will call the internal java.lang.UNIXProcess.destroyProcess with force=false at <http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/5b5973c3db08/src/solaris/native/java/lang/UNIXProcess_md.c#l717>

JNIEXPORT void JNICALL
Java_java_lang_UNIXProcess_destroyProcess(JNIEnv *env,
                                          jobject junk,
                                          jint pid,
                                          jboolean force)
{
    int sig = (force == JNI_TRUE) ? SIGKILL : SIGTERM;
    kill(pid, sig);
}

which thus sends a SIGTERM. Similarly for Java >= 9 it will call the internal java.lang.ProcessHandle.destroy at <http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/65464a307408/src/java.base/share/classes/java/lang/ProcessHandle.java#l324> (rather than the companion java.lang.ProcessHandle.destroyForcibly), which ends up at <http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/65464a307408/src/java.base/unix/native/libjava/ProcessHandleImpl_unix.c#l312>

JNIEXPORT jboolean JNICALL
Java_java_lang_ProcessHandleImpl_destroy0(JNIEnv *env,
                                          jobject obj,
                                          jlong jpid,
                                          jlong startTime,
                                          jboolean force) {
    pid_t pid = (pid_t) jpid;
    int sig = (force == JNI_TRUE) ? SIGKILL : SIGTERM;
    jlong start = Java_java_lang_ProcessHandleImpl_isAlive0(env, obj, jpid);

    if (start == startTime || start == 0 || startTime == 0) {
        return (kill(pid, sig) < 0) ? JNI_FALSE : JNI_TRUE;
    } else {
        return JNI_FALSE;
    }
}


with force=false, which thus also sends a SIGTERM.

To summarize (and if my browsing of the Jenkins source code is correct), Jenkins will send each process, in effectively random order, a SIGTERM, and will send no process a SIGKILL.
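As a rough Python analogue of the two native functions quoted above (a sketch, not anything Jenkins or the JDK contains; pick_signal and destroy are names chosen here), the force flag is the only thing that decides between SIGTERM and SIGKILL, and Jenkins' call path always passes force=false:

```python
import os
import signal


def pick_signal(force: bool) -> signal.Signals:
    """Mirror of the C ternary: SIGKILL when forced, SIGTERM otherwise."""
    return signal.SIGKILL if force else signal.SIGTERM


def destroy(pid: int, force: bool = False) -> bool:
    """Send the chosen signal to pid; return False if it could not be sent."""
    try:
        os.kill(pid, pick_signal(force))
    except OSError:
        return False  # e.g. no such process, or no permission
    return True
```

A process that is blocked in a way that never lets it act on SIGTERM is therefore never escalated to SIGKILL by this path.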

<https://gerrit.libreoffice.org/plugins/gitiles/lode/+/bea0738dbadfe8784e5d3c00f533acf101db4e7e%5E!/> "tb_slave_wrapper: trap signal and kill -9 everything"

trap cleanup 1 2 3 6 15

cleanup()
{
  echo "Caught Signal ... killing everything...."
  # kill everything in same process group (pseudo-pid 0)
  kill -9 0
}

ensures that, if tb_slave_wrapper receives a signal asking it to quit (incl. SIGTERM=15), it will eventually (i.e., once the Bash interpreter would proceed to a new line in the script) send a SIGKILL to all processes in the process group.

The processes (9)--(11) (python.bin, oosplash, soffice.bin) will not respond to their SIGTERM, as the former two are stuck waiting on their respective child, while the latter is otherwise deadlocked. They need to rely on the SIGKILL that might get sent from (3) tb-slave-wrapper.

But, as processes get the SIGTERM in a random order, if e.g. (8) (the innermost sh) gets a SIGTERM first, it will terminate and cause its parent chain (7), (6), (4), (3) to terminate too. That means that (3) tb-slave-wrapper can terminate before it would receive a SIGTERM, so it will not call cleanup.
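The order dependence can be made concrete with a deliberately simplified toy model in Python. Everything in it is an assumption for illustration: the labels make-outer, make-inner, sh-recipe and sh-subshell are just names for (4), (6), (7) and (8); the exit of a responsive process is taken to propagate up to the wrapper instantly; a trapped SIGTERM in the wrapper is taken to run cleanup as soon as its pipeline returns; and tee, java and other jobs in the same process group are left out.

```python
# Process chain from the tree above, outermost first (tee and java omitted).
CHAIN = [
    "tb-slave-wrapper", "make-outer", "make-inner", "sh-recipe",
    "sh-subshell", "python.bin", "oosplash", "soffice.bin",
]
WRAPPER = CHAIN[0]
# Processes described above as not reacting to SIGTERM.
STUCK = frozenset({"python.bin", "oosplash", "soffice.bin"})


def survivors(order):
    """Return the processes still alive after SIGTERMs are sent in `order`."""
    alive = set(CHAIN)
    trap_pending = False
    for name in order:
        if name not in alive or name in STUCK:
            continue  # already exited, or does not react to SIGTERM
        if name == WRAPPER:
            trap_pending = True  # cleanup runs once the pipeline returns
            continue
        # A responsive process exits; in this model the failure propagates
        # up through its ancestors to the wrapper's pipeline immediately.
        position = CHAIN.index(name)
        for ancestor in CHAIN[1:position + 1]:
            alive.discard(ancestor)
        if WRAPPER in alive:
            if trap_pending:
                return set()  # cleanup: kill -9 0 takes everything down
            alive.discard(WRAPPER)  # wrapper exits without ever being signalled
    return alive


print(sorted(survivors(CHAIN)))                  # wrapper signalled first
print(sorted(survivors(list(reversed(CHAIN)))))  # innermost processes first
```

In this model the stuck chain only gets cleaned up when the wrapper happens to be signalled before any of (4), (6), (7), (8); in every other order the three stuck processes remain.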


That should explain those occasional

python.bin───oosplash───soffice.bin

zombie chains (hanging off process 1) from timed-out Jenkins builds that we observe on tb75, tb76, and tb79.


Which in turn means that

On 29/12/2019 15:35, Noel Grandin wrote:

If we can't fix this, I suggest we add:

kill $(ps -o pid= --ppid $$)

to the end of the Jenkins build script

would not help, as the problematic processes do not have that $$ as their (immediate) parent.
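A small Python sketch of why a parent-PID filter misses them. All PIDs here are made up, and direct_children is a hypothetical helper standing in for what `ps --ppid` selects; the map is a hand-written stand-in for the PID-to-parent relation after the intermediate processes have exited and python.bin has been reparented to process 1.

```python
def direct_children(ppid_of: dict, parent_pid: int) -> list:
    """PIDs whose immediate parent is parent_pid (what ps --ppid selects)."""
    return sorted(pid for pid, ppid in ppid_of.items() if ppid == parent_pid)


BUILD_SHELL = 1000  # hypothetical PID of the Jenkins build script ($$)

# Hypothetical state after the build was aborted:
# python.bin (1006) reparented to 1, oosplash (1007), soffice.bin (1008).
after_abort = {1006: 1, 1007: 1006, 1008: 1007}

print(direct_children(after_abort, BUILD_SHELL))  # nothing for kill to act on
print(direct_children(after_abort, 1))            # only the head of the chain
```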

(And what would happen if the cleanup function in tb_slave_wrapper did kick in? I think it would wreak havoc, as the outer Jenkins remoting.jar process, as well as any parallel tb_slave_wrapper instances, are in the same process group, and would thus all get forcibly killed.)
