Re: multi-threading task under SolarMutex -> deadlock

Armin Le Grand <armin_le_grand -AT- me.com>
Thu, 19 May 2016 16:19:33 +0200

Hi Norbert,

thanks a lot for the extensive answer. Comments inline...

Am 18.05.2016 um 20:22 schrieb Norbert Thiebaud:

On Wed, May 18, 2016 at 3:43 AM, Armin Le Grand <armin_le_grand@me.com> wrote:

Hi Norbert,

thanks for also having an eye on this - I am looking for the failure reports
on ci.libreoffice.org currently, too.
Last is from http://ci.libreoffice.org/job/lo_tb_master_linux_dbg/7195/, so
last is from Friday, 13th (uhhh...)

Have you seen such or similar stacks anywhere else? In the meantime I tried
ChartTest massively locally on Linux and Win, but could never locally
reproduce.

I noticed, and I still had to cancel a hung job few minutes ago, that
things started to hang regularely.
I only look at that particular case I sent the backtrace about.


Yes, that is not acceptable for long time, I agree.

I knew that and made sure that the multithreaded 3DRenderer WorkerThreads do
not need the SolarMutex for their work. I did not know yet that the memory
fail handler tries to get the SolarMutex, too, but is logic when it wants to
bring up a dialog in some form.

yeah but that _is_ a major flaw. code under a signal handler are only
allow to call async-safe-signal functions.
Ignoring that _will_ cause trouble. it is not your doing. it just
happen that your multithreading case seems to cause memory starvation
which in turn trigger a signal that exhibit clearly why all the thing
we try to do in a signal handler are not a good idea.
Not to mention that, even if that was allowed, trying to bring about a
dialog when we have already ran out of memory is never going to end
well anyway.

for reference:

   Async-signal-safe functions
        A signal handling routine established by sigaction(2) or
signal(2) must be very careful, since processing elsewhere may be
interrupted at some arbitrary point in the execution of the program.
POSIX has the concept of "safe function".  If a signal interrupts the
execution of an unsafe function, and handler calls an  unsafe
function,  then
        the behavior of the program is undefined.

        POSIX.1-2004 (also known as POSIX.1-2001 Technical Corrigendum
2) requires an implementation to guarantee that the following
functions can be safely called inside a signal handler:

            _Exit()
            _exit()
            abort()
            accept()
            access()
            aio_error()
            aio_return()
            aio_suspend()
            alarm()
            bind()
            cfgetispeed()
            cfgetospeed()
            cfsetispeed()
            cfsetospeed()
            chdir()
            chmod()
            chown()
            clock_gettime()
            close()
            connect()
            creat()
            dup()
            dup2()
            execle()
            execve()
            fchmod()
            fchown()
            fcntl()
            fdatasync()
            fork()
            fpathconf()
            fstat()
            fsync()
            ftruncate()
            getegid()
            geteuid()
            getgid()
            getgroups()
            getpeername()
            getpgrp()
            getpid()
            getppid()
            getsockname()
            getsockopt()
            getuid()
            kill()
            link()
            listen()
            lseek()
            lstat()
            mkdir()
            mkfifo()
            open()
            pathconf()
            pause()
            pipe()
            poll()
            posix_trace_event()
            pselect()
            raise()
            read()
            readlink()
            recv()
            recvfrom()
            recvmsg()
            rename()
            rmdir()
            select()
            sem_post()
            send()
            sendmsg()
            sendto()
            setgid()
            setpgid()
            setsid()
            setsockopt()
            setuid()
            shutdown()
            sigaction()
            sigaddset()
            sigdelset()
            sigemptyset()
            sigfillset()
            sigismember()
            signal()
            sigpause()
            sigpending()
            sigprocmask()
            sigqueue()
            sigset()
            sigsuspend()
            sleep()
            sockatmark()
            socket()
            socketpair()
            stat()
            symlink()
            sysconf()
            tcdrain()
            tcflow()
            tcflush()
            tcgetattr()
            tcgetpgrp()
            tcsendbreak()
            tcsetattr()
            tcsetpgrp()
            time()
            timer_getoverrun()
            timer_gettime()
            timer_settime()
            times()
            umask()
            uname()
            unlink()
            utime()
            wait()
            waitpid()
            write()

        POSIX.1-2008 removes fpathconf(), pathconf(), and sysconf()
from the above list, and adds the following functions:

            execl()
            execv()
            faccessat()
            fchmodat()
            fchownat()
            fexecve()
            fstatat()
            futimens()
            linkat()
            mkdirat()
            mkfifoat()
            mknod()
            mknodat()
            openat()
            readlinkat()
            renameat()
            symlinkat()
            unlinkat()
            utimensat()
            utimes()

But the deeper problem is that allocation - here extending a vector of
pointers to a helper class from 1 to 2 entries - fails.

the straw that broke the camel's back....
or as the french put it: the drop that made the water overflow...
bear in mind, iirc the default allocator strategy for a vector is to
'double' on out-of-space.
iow if you had a vector of 128 capacity.. when you try to push a 129th
elements you end up with a 256-sized vector

True - I know. In the case of the stacktrace it is a vector of pointers,so not much mem involved. All objects used in 3D are ref-counted orcow-wrapped. Even the target mem for pixels is shared - thus I doubtthat rendering in paralell uses much more than rendering in one thread.I would state that it will not need double the mem as it needs in asingle thread. Still, of course, a single byte makes the water overflowat the end.


also, tests are run in parallel.. so up to 32 test maybe running in // already.
if each one want to run 32 threads.. that will seriouly over-commit
the machine.. with 32^32 = 1024 threads fighting for cpu.

That might be more a problem. Do the UnitTests run in one mem space? Idoubt so. The overhead for a single WorkerTask should not be too big initself - RegisterSet, Stack, ..?BTW, that global TreadPool and the WorkerThreads used get created whenthe office starts, and other (few for now) usages use it already, sothat problem should have shown earier...?

Sometimes. And that
only on many cores on that machine (up to now).

I do not mind the out-of-memory thing... unless it is a bug (run-away
allocation or something)
out of memory are bound to happen... what I mind is the consequences..
as bad as a crash is... a hung process is worse

fyi. tb75 has 2x8 cores, 32 threads and 64GB or RAM.

It probably indeed is a run out-of-mem in extreme situations, even whenI can hardly imagine that. The 3D rasterconversion is hard limited toupper bounds in pixels, as said, it uses referenced objects, not reallylocal memory objects of remarkable size - I have no idea yet.Interestingly, in stack 7273, the crash is probably related, but happensin freeing one object. Does this not speak against the out-of-memorytheory completely?

I checked all involved classes, their refcounting and that the used
o3tl::cow_wrapper uses the ThreadSafeRefCountingPolicy, looks good so far.
It is also not the case that the WorkerThreads need massive amounts of own
memory, so I doublt that limiting to e.g. 8 thredads would change this,
except maybe making it less probable to happen. I looked at
o3tl::cow_wrapper itself, and the basic B2D/B3DPrimitive implementations
which internally use a comphelper::OBaseMutex e.g. for creating buffered
decompositions.

I found no concrete reason until now, any tipps/help much appreciated.

I honestly did not look at all _why_ it signaled. I only investigated
why that resulted in a deadlock.

If I understand you correct, your main concern is to have a clean crashinstead of the office hanging? That would definitely be better. But howto do that? I see in the stack 7195:


#9  0x00002aea35f04670 in <signal handler called> () at /lib64/libc.so.6
#10 0x00002aea35f045f7 in raise () at /lib64/libc.so.6
#11 0x00002aea35f05ce8 in abort () at /lib64/libc.so.6
#12 0x00002aea35f4a1b8 in  () at /lib64/libc.so.6
#13 0x00002aea35f4d877 in _int_malloc () at /lib64/libc.so.6
#14 0x00002aea35f4dae6 in malloc_check () at /lib64/libc.so.6

#15 0x00002aea3570e0cd in operator new(unsigned long) () at/lib64/libstdc++.so.6

So how to come in-between and not have operator new when it fails tocall signal handler, but e.g. exit()..?I see now way to do that. In consequence that would mean that operatornew is not alowed in WorkerThreads at all - that would be a too biglimitation I guess...?

I keep watching this - at least it did not happen in all the builds since
13th and on no other machine, so the thread now is to somehow nail it to get
it reproducable.

It very likely happened again on
http://ci.libreoffice.org/job/lo_tb_master_linux_dbg/7260/
which I cancelled after more than 2 hours or runtime

Checked 7260, no trace to 'B3D' contained, this one has no 3Dmultihreaded render included. Must be someting else, of course mayberelated to the general signal handler in WorkerThread problem. Or not -'worker' is also not included.

If someone has other traces, please send them! I would hate
to take this back, esp. because we will need multithreading more and more
since Moore's law is tilting.

If we want to use multi-threading. we will have, at least, to solve
the signal handler situation, since as it stand it makes the 'promise
that this worker thread won't try to hold the solarmutex' impossible
to uphold.


True, agreed. How?

I already tried to use SolarMutexReleaser as inverse to SolarMutexGuardaround waitUntilEmpty(), but that makes the next message to beprocessed, not good.

The only thing I could think of is to change the implementation ofwaitUntilEmpty() to not just wait for asll ThreadTasks to finish, but todo this in a timer loop. This is polling with timer, but would in thatway - AFAIK - release the SolarMutex from time to time and thus allow adeadlock to be resolved and let the signal handler do it's thing -> andallow the crash. Not nice, but pragmatic.

Maybe a way around that may be to make sure that all worker thread use
a pthread_sigmask to essentially prevent any of them to be interrupted
for signal handling, by blocking all signal in these thread.
in the same vein we could have a dedicated signal handler thread
looping on a sigwait... then we can do the crazy thing we do in a
handler without as many limitation

Interesting, but too complicated as long as we have the SolarMutexmayhem...?



The whole SolarMutex craziness is yet another story altogether....


Yes, what a big can of worms, sigh...


--
--
ALG (PGP Key: EE1C 4B3F E751 D8BC C485 DEC1 3C59 F953 D81C F4A2)

Context

multi-threading task under SolarMutex -> deadlock · Norbert Thiebaud
- Re: multi-threading task under SolarMutex -> deadlock · Thorsten Behrens
  - Re: multi-threading task under SolarMutex -> deadlock · Norbert Thiebaud
    - Re: multi-threading task under SolarMutex -> deadlock · Armin Le Grand
      - Re: multi-threading task under SolarMutex -> deadlock · Norbert Thiebaud
        
        Re: multi-threading task under SolarMutex -> deadlock · Armin Le Grand
        
        Re: multi-threading task under SolarMutex -> deadlock · Stephan Bergmann
        
        Re: multi-threading task under SolarMutex -> deadlock · Noel Grandin

Privacy Policy | Impressum (Legal Info) | Copyright information: Unless otherwise specified, all text and images on this website are licensed under the Creative Commons Attribution-Share Alike 3.0 License. This does not include the source code of LibreOffice, which is licensed under the Mozilla Public License (MPLv2). "LibreOffice" and "The Document Foundation" are registered trademarks of their corresponding registered owners or are in actual use as trademarks in one or more countries. Their respective logos and icons are also subject to international copyright laws. Use thereof is explained in our trademark policy.