On 29/12/2019 15:17, Stephan Bergmann wrote:
> Still trying to track down why sometimes zombie processes survive on the
> (Linux) Jenkins build machines (and then make later, unrelated Jenkins
> builds on those machines fail when zombie soffice.bin processes still
> hold onto named pipes that tests from the new builds want to create too).
> One such recent case on tb79 was the aborted
> <https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/49895/>. It
> left behind a zombie python.bin -> oosplash -> soffice.bin process tree
> executing UITest_calc_tests3. (Where presumably the soffice.bin process
> had deadlocked, which then caused the Jenkins
>   Build timed out (after 15 minutes). Marking the build as aborted.
>   Build was aborted
>   Finished: ABORTED
> reaction. But once I noticed, the images of the involved processes had
> already been overwritten by later builds, so I couldn't use gdb to get
> backtraces.)
I think I now understand what's going on: Assume some UITest hangs with
a deadlock in soffice.bin. <https://plugins.jenkins.io/build-timeout>
will kick in after the specified timeout of 900s to kill the build.
The relevant process tree is
java───sh───tb-slave-wrapper─┬─make───make───sh───sh───python.bin───oosplash───soffice.bin
                             └─tee
where:
(1) java is running Jenkins' remoting.jar.
(2) The following sh is Jenkins' way of running
${LODE_HOME}/bin/tb_slave_wrapper --real --mode=config --clean
(as specified in "Build - Execute shell - Command" at
<https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/configure>)
via some `/bin/sh -xe /tmp/jenkins*.sh` intermediary.
(3) tb-slave-wrapper is running the
<https://gerrit.libreoffice.org/plugins/gitiles/lode/+/af02c3a9564062b4d04e457275624a7a30ba2ba2/bin/tb_slave_wrapper>
script, which at line 325 does
make ${keep_going} $target 2>&1 | tee -a ${build_log}
which explains...
(4) ...the first make and...
(5) the tee.
(6) The second make is due to gbuild calling make recursively.
(7) The following sh is running the UITest target's (heavily redacted)
recipe line
/bin/sh -c 'S=... && I=... && W=.. && rm -rf $W/UITest/... && mkdir -p $W/UITest/... && ... && (
TDOC=... /bin/sh $I/program/python $S/uitest/test_main.py ... || ( RET=$?; $S/solenv/bin/gdb-core-bt.sh ...))'
and...
(8) ...the following sh is running the
TDOC=... /bin/sh $I/program/python $S/uitest/test_main.py ...
part in a subshell.
(9) python.bin is running LO's uitest/test_main.py, which...
(10) ...spawns the soffice script which then execs oosplash, which...
(11) ...spawns soffice.bin.
<https://ci.libreoffice.org/> says at the bottom "Jenkins ver. 2.212",
so let's assume that
<https://github.com/jenkinsci/jenkins/commit/13ab7b9909927b7afa31797097b6114bf22a2a41>
"[maven-release-plugin] prepare release jenkins-2.212" contains the
source code of the remoting.jar running on the slave.
I assume the relevant starting point for the killing of the Jenkins
job's processes is
<https://github.com/jenkinsci/jenkins/blob/13ab7b9909927b7afa31797097b6114bf22a2a41/core/src/main/java/hudson/Launcher.java#L953>
  @Override
  public void kill(Map<String, String> modelEnvVars) throws InterruptedException {
      ProcessTree.get().killAll(modelEnvVars);
  }
which will try to kill all processes it finds that have inherited the
BUILD_ID env var identifying the given Jenkins job.
<https://github.com/jenkinsci/jenkins/blob/13ab7b9909927b7afa31797097b6114bf22a2a41/core/src/main/java/hudson/util/ProcessTree.java#L731>
  public void killAll(@Nonnull Map<String, String> modelEnvVars) throws InterruptedException {
      for (OSProcess p : this)
          if(p.hasMatchingEnvVars(modelEnvVars))
              p.killRecursively();
  }
first reads the /proc file system tree to get a list of process IDs (in
the ProcfsUnix constructor at
<https://github.com/jenkinsci/jenkins/blob/13ab7b9909927b7afa31797097b6114bf22a2a41/core/src/main/java/hudson/util/ProcessTree.java#L741>),
then iterates over the list to find processes with matching BUILD_ID and
tries to kill them.
This looks racy, in that a to-be-killed build can spawn further
processes after /proc has been read to produce the list. But that
should not be relevant in our scenario of killing a hung build after a
timeout, as that build will no longer be spawning new processes
(everything but the deadlocked soffice.bin and its parent chain has
already terminated).
However, what appears to be relevant here is that processes are killed
in a random order (see below).
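The scan-then-kill structure described above can be sketched in a few lines of Python. This is only an illustration, not Jenkins code: the names parse_environ, snapshot_pids and find_matching are made up here, and the matching rule (every model variable present with an equal value) is an assumption about what hasMatchingEnvVars does. The PID list is taken once up front, so a process spawned after the snapshot would be missed.

```python
import os

def parse_environ(raw: bytes) -> dict:
    """Parse the NUL-separated KEY=VALUE records of /proc/<pid>/environ."""
    env = {}
    for record in raw.split(b"\0"):
        key, sep, value = record.partition(b"=")
        if sep:  # skip empty or malformed records
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env

def snapshot_pids(proc_root="/proc"):
    """List the numeric entries of proc_root once (sorted only for stable output)."""
    return sorted(int(name) for name in os.listdir(proc_root) if name.isdigit())

def find_matching(model_env, proc_root="/proc"):
    """Return the snapshot PIDs whose environment contains all of model_env."""
    matches = []
    for pid in snapshot_pids(proc_root):
        try:
            with open(os.path.join(proc_root, str(pid), "environ"), "rb") as f:
                env = parse_environ(f.read())
        except OSError:
            continue  # process vanished, or not readable by this user
        if all(env.get(k) == v for k, v in model_env.items()):
            matches.append(pid)
    return matches
```

For example, find_matching({"BUILD_ID": some_id}) would return the candidate PIDs for one job, where some_id stands for whatever value Jenkins exported for that build.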
The actual process killing appears to take place at
<https://github.com/jenkinsci/jenkins/blob/13ab7b9909927b7afa31797097b6114bf22a2a41/core/src/main/java/hudson/util/ProcessTree.java#L909>
  public static void destroy(int pid) throws IllegalAccessException,
          InvocationTargetException {
      if (JAVA8_DESTROY_PROCESS != null) {
          JAVA8_DESTROY_PROCESS.invoke(null, pid, false);
      } else {
          final Optional handle = (Optional)JAVA_9_PROCESSHANDLE_OF.invoke(null, pid);
          if (handle.isPresent()) {
              JAVA_9_PROCESSHANDLE_DESTROY.invoke(handle.get());
          }
      }
  }
For Java <= 8 it will call the internal
java.lang.UNIXProcess.destroyProcess with force=false at
<http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/5b5973c3db08/src/solaris/native/java/lang/UNIXProcess_md.c#l717>
  JNIEXPORT void JNICALL
  Java_java_lang_UNIXProcess_destroyProcess(JNIEnv *env,
                                            jobject junk,
                                            jint pid,
                                            jboolean force)
  {
      int sig = (force == JNI_TRUE) ? SIGKILL : SIGTERM;
      kill(pid, sig);
  }
which thus sends a SIGTERM. Similarly for Java >= 9 it will call the
internal java.lang.ProcessHandle.destroy at
<http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/65464a307408/src/java.base/share/classes/java/lang/ProcessHandle.java#l324>
(rather than the companion java.lang.ProcessHandle.destroyForcibly),
which ends up at
<http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/65464a307408/src/java.base/unix/native/libjava/ProcessHandleImpl_unix.c#l312>
  JNIEXPORT jboolean JNICALL
  Java_java_lang_ProcessHandleImpl_destroy0(JNIEnv *env,
                                            jobject obj,
                                            jlong jpid,
                                            jlong startTime,
                                            jboolean force) {
      pid_t pid = (pid_t) jpid;
      int sig = (force == JNI_TRUE) ? SIGKILL : SIGTERM;
      jlong start = Java_java_lang_ProcessHandleImpl_isAlive0(env, obj, jpid);
      if (start == startTime || start == 0 || startTime == 0) {
          return (kill(pid, sig) < 0) ? JNI_FALSE : JNI_TRUE;
      } else {
          return JNI_FALSE;
      }
  }
with force=false, which thus also sends a SIGTERM.
To summarize (and if my browsing of the Jenkins source code is correct),
Jenkins will send each process, in effectively random order, a SIGTERM,
and will send no process a SIGKILL.
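Both JDK code paths reduce to the same choice of signal. A minimal Python sketch of that choice follows, assuming a POSIX system; destroy here is a stand-in written for illustration, not the JDK or Jenkins function.

```python
import os
import signal

def destroy(pid: int, force: bool = False) -> bool:
    """Send SIGKILL if force is set, else SIGTERM; report whether kill() succeeded."""
    sig = signal.SIGKILL if force else signal.SIGTERM
    try:
        os.kill(pid, sig)
    except OSError:
        return False  # e.g. no such process, or not permitted
    return True
```

Called as destroy(pid), this only requests termination: a target that ignores SIGTERM keeps running until something calls destroy(pid, force=True), which per the analysis above Jenkins never does.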
<https://gerrit.libreoffice.org/plugins/gitiles/lode/+/bea0738dbadfe8784e5d3c00f533acf101db4e7e%5E!/>
"tb_slave_wrapper: trap signal and kill -9 everything"
  trap cleanup 1 2 3 6 15
  cleanup()
  {
      echo "Caught Signal ... killing everything...."
      # kill everything in same process group (pseudo-pid 0)
      kill -9 0
  }
ensures that, if tb_slave_wrapper receives a signal asking it to quit
(incl. SIGTERM=15), it will eventually (i.e., once the Bash interpreter
would proceed to a new line in the script) send a SIGKILL to all
processes in the process group.
The processes (9)--(11) (python.bin, oosplash, soffice.bin) will not
respond to their SIGTERM, as the former two are stuck waiting on their
respective child, while the latter is otherwise deadlocked. They need
to rely on the SIGKILL that might get sent from (3) tb-slave-wrapper.
But, as processes get the SIGTERM in a random order, if e.g. (8) (the
innermost sh) gets a SIGTERM first, it will terminate and cause its
parent chain (7), (6), (4), (3) to terminate too. That means that (3)
tb-slave-wrapper can terminate before it would receive a SIGTERM, so it
will not call cleanup.
That should explain those occasional
python.bin───oosplash───soffice.bin
zombie chains (hanging off process 1) from timed-out Jenkins builds
that we observe on tb75, tb76, and tb79.
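The ordering argument above can be replayed with a toy model in Python. It is a deliberate simplification under stated assumptions: the process names are labels invented for this sketch, only processes (3), (4), (6), (7), (8) and (9)--(11) are modelled, any of (4), (6), (7), (8) exiting is assumed to take the rest of the parent chain including the wrapper with it, and the wrapper's trap is assumed to fire as soon as it receives SIGTERM.

```python
WRAPPER = "tb-slave-wrapper"  # process (3)
# Processes (4), (6), (7), (8): assumed to exit on SIGTERM and to take
# their parent chain down with them.
PARENT_CHAIN = ("make-outer", "make-inner", "sh-recipe", "sh-subshell")
# Processes (9)-(11): assumed not to react to SIGTERM at all.
STUCK = ("python.bin", "oosplash", "soffice.bin")

def survivors(sigterm_order):
    """Return the processes still alive after SIGTERM is delivered in this order."""
    alive = {WRAPPER, *PARENT_CHAIN, *STUCK}
    for name in sigterm_order:
        if name not in alive or name in STUCK:
            continue  # already gone, or stuck and not reacting
        if name == WRAPPER:
            return []  # trap fires: cleanup sends SIGKILL to the whole group
        # A member of the parent chain exits; the rest of the chain,
        # including the wrapper, then exits without running cleanup.
        alive -= {WRAPPER, *PARENT_CHAIN}
    return [name for name in STUCK if name in alive]

print(survivors([WRAPPER, "sh-subshell", "soffice.bin"]))
# []
print(survivors(["sh-subshell", WRAPPER, "python.bin"]))
# ['python.bin', 'oosplash', 'soffice.bin']
```

In this model the stuck chain is cleaned up only when the wrapper is signalled before every other member of its parent chain.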
Which in turn means that
On 29/12/2019 15:35, Noel Grandin wrote:
> If we can't fix this, I suggest we add:
>   kill $(ps -o pid= --ppid $$)
> to the end of the Jenkins build script
would not help, as the problematic processes do not have that $$ as
their (immediate) parent.
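The difference between immediate children and the leftover chain can be made concrete with a small Python sketch over a PID-to-parent-PID map. All PIDs below are hypothetical: 200 stands for the Jenkins sh (2) whose $$ the suggested command would use, and 900, 1000, 1100 stand for python.bin, oosplash and soffice.bin after being reparented to process 1.

```python
def children(ppid_of, parent):
    """PIDs whose immediate parent is `parent` (what `ps --ppid` selects)."""
    return sorted(pid for pid, ppid in ppid_of.items() if ppid == parent)

def descendants(ppid_of, root):
    """All PIDs below `root`, found by following parent links transitively."""
    found, frontier = [], [root]
    while frontier:
        kids = [k for p in frontier for k in children(ppid_of, p)]
        found.extend(kids)
        frontier = kids
    return sorted(found)

# Hypothetical state after the intermediate make/sh processes have exited.
after_timeout = {200: 100, 900: 1, 1000: 900, 1100: 1000}
print(children(after_timeout, 200))     # []
print(descendants(after_timeout, 200))  # []
print(descendants(after_timeout, 1))    # [900, 1000, 1100]
```

Once the chain hangs off process 1, neither the immediate-child query nor a full descendant walk from the build script's shell finds it.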
(And what would happen if the cleanup function in tb_slave_wrapper did
kick in? I think it would wreak havoc, as the outer Jenkins
remoting.jar process, as well as any parallel tb_slave_wrapper
instances, are in the same process group, and would thus all get
forcibly killed.)
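Whether a given process would be hit by a `kill -9 0` issued from some caller comes down to process group membership, which can be checked directly. The Python sketch below assumes a POSIX system; in_callers_group and spawn_sleeper are hypothetical helper names, and the child started with start_new_session=True is only there as a contrast case with its own group.

```python
import os
import subprocess
import sys

def in_callers_group(pid: int) -> bool:
    """True if `pid` shares the caller's process group (a target of `kill -9 0`)."""
    return os.getpgid(pid) == os.getpgrp()

def spawn_sleeper(new_session: bool) -> subprocess.Popen:
    """Start a short-lived child, optionally in its own session and process group."""
    return subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"],
        start_new_session=new_session,
    )
```

A child started with spawn_sleeper(False) inherits the caller's group and would be covered by pseudo-pid 0, while one started with spawn_sleeper(True) would not.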
Context
- Re: How are Jenkins builds killed exactly? · Stephan Bergmann