Date: prev next · Thread: first prev next last
2016 Archives by date, by thread · List index


On Wed, May 18, 2016 at 3:43 AM, Armin Le Grand <armin_le_grand@me.com> wrote:
Hi Norbert,

thanks for also having an eye on this - I am looking for the failure reports
on ci.libreoffice.org currently, too.
Last is from http://ci.libreoffice.org/job/lo_tb_master_linux_dbg/7195/, so
last is from Friday, 13th (uhhh...)

Have you seen such or similar stacks anywhere else? In the meantime I tried
ChartTest massively locally on Linux and Win, but could never locally
reproduce.

I noticed, and I still had to cancel a hung job few minutes ago, that
things started to hang regularely.
I only look at that particular case I sent the backtrace about.


I knew that and made sure that the multithreaded 3DRenderer WorkerThreads do
not need the SolarMutex for their work. I did not know yet that the memory
fail handler tries to get the SolarMutex, too, but is logic when it wants to
bring up a dialog in some form.

yeah but that _is_ a major flaw. code under a signal handler are only
allow to call async-safe-signal functions.
Ignoring that _will_ cause trouble. it is not your doing. it just
happen that your multithreading case seems to cause memory starvation
which in turn trigger a signal that exhibit clearly why all the thing
we try to do in a signal handler are not a good idea.
Not to mention that, even if that was allowed, trying to bring about a
dialog when we have already ran out of memory is never going to end
well anyway.

for reference:

  Async-signal-safe functions
       A signal handling routine established by sigaction(2) or
signal(2) must be very careful, since processing elsewhere may be
interrupted at some arbitrary point in the execution of the program.
POSIX has the concept of "safe function".  If a signal interrupts the
execution of an unsafe function, and handler calls an  unsafe
function,  then
       the behavior of the program is undefined.

       POSIX.1-2004 (also known as POSIX.1-2001 Technical Corrigendum
2) requires an implementation to guarantee that the following
functions can be safely called inside a signal handler:

           _Exit()
           _exit()
           abort()
           accept()
           access()
           aio_error()
           aio_return()
           aio_suspend()
           alarm()
           bind()
           cfgetispeed()
           cfgetospeed()
           cfsetispeed()
           cfsetospeed()
           chdir()
           chmod()
           chown()
           clock_gettime()
           close()
           connect()
           creat()
           dup()
           dup2()
           execle()
           execve()
           fchmod()
           fchown()
           fcntl()
           fdatasync()
           fork()
           fpathconf()
           fstat()
           fsync()
           ftruncate()
           getegid()
           geteuid()
           getgid()
           getgroups()
           getpeername()
           getpgrp()
           getpid()
           getppid()
           getsockname()
           getsockopt()
           getuid()
           kill()
           link()
           listen()
           lseek()
           lstat()
           mkdir()
           mkfifo()
           open()
           pathconf()
           pause()
           pipe()
           poll()
           posix_trace_event()
           pselect()
           raise()
           read()
           readlink()
           recv()
           recvfrom()
           recvmsg()
           rename()
           rmdir()
           select()
           sem_post()
           send()
           sendmsg()
           sendto()
           setgid()
           setpgid()
           setsid()
           setsockopt()
           setuid()
           shutdown()
           sigaction()
           sigaddset()
           sigdelset()
           sigemptyset()
           sigfillset()
           sigismember()
           signal()
           sigpause()
           sigpending()
           sigprocmask()
           sigqueue()
           sigset()
           sigsuspend()
           sleep()
           sockatmark()
           socket()
           socketpair()
           stat()
           symlink()
           sysconf()
           tcdrain()
           tcflow()
           tcflush()
           tcgetattr()
           tcgetpgrp()
           tcsendbreak()
           tcsetattr()
           tcsetpgrp()
           time()
           timer_getoverrun()
           timer_gettime()
           timer_settime()
           times()
           umask()
           uname()
           unlink()
           utime()
           wait()
           waitpid()
           write()

       POSIX.1-2008 removes fpathconf(), pathconf(), and sysconf()
from the above list, and adds the following functions:

           execl()
           execv()
           faccessat()
           fchmodat()
           fchownat()
           fexecve()
           fstatat()
           futimens()
           linkat()
           mkdirat()
           mkfifoat()
           mknod()
           mknodat()
           openat()
           readlinkat()
           renameat()
           symlinkat()
           unlinkat()
           utimensat()
           utimes()



But the deeper problem is that allocation - here extending a vector of
pointers to a helper class from 1 to 2 entries - fails.

the straw that broke the camel's back....
or as the french put it: the drop that made the water overflow...
bear in mind, iirc the default allocator strategy for a vector is to
'double' on out-of-space.
iow if you had a vector of 128 capacity.. when you try to push a 129th
elements you end up with a 256-sized vector

also, tests are run in parallel.. so up to 32 test maybe running in // already.
if each one want to run 32 threads.. that will seriouly over-commit
the machine.. with 32^32 = 1024 threads fighting for cpu.


Sometimes. And that
only on many cores on that machine (up to now).

I do not mind the out-of-memory thing... unless it is a bug (run-away
allocation or something)
out of memory are bound to happen... what I mind is the consequences..
as bad as a crash is... a hung process is worse

fyi. tb75 has 2x8 cores, 32 threads and 64GB or RAM.



I checked all involved classes, their refcounting and that the used
o3tl::cow_wrapper uses the ThreadSafeRefCountingPolicy, looks good so far.
It is also not the case that the WorkerThreads need massive amounts of own
memory, so I doublt that limiting to e.g. 8 thredads would change this,
except maybe making it less probable to happen. I looked at
o3tl::cow_wrapper itself, and the basic B2D/B3DPrimitive implementations
which internally use a comphelper::OBaseMutex e.g. for creating buffered
decompositions.

I found no concrete reason until now, any tipps/help much appreciated.

I honestly did not look at all _why_ it signaled. I only investigated
why that resulted in a deadlock.


I keep watching this - at least it did not happen in all the builds since
13th and on no other machine, so the thread now is to somehow nail it to get
it reproducable.
It very likely happened again on
http://ci.libreoffice.org/job/lo_tb_master_linux_dbg/7260/
which I cancelled after more than 2 hours or runtime

If someone has other traces, please send them! I would hate
to take this back, esp. because we will need multithreading more and more
since Moore's law is tilting.

If we want to use multi-threading. we will have, at least, to solve
the signal handler situation, since as it stand it makes the 'promise
that this worker thread won't try to hold the solarmutex' impossible
to uphold.
Maybe a way around that may be to make sure that all worker thread use
a pthread_sigmask to essentially prevent any of them to be interrupted
for signal handling, by blocking all signal in these thread.
in the same vein we could have a dedicated signal handler thread
looping on a sigwait... then we can do the crazy thing we do in a
handler without as many limitation


The whole SolarMutex craziness is yet another story altogether....

Context


Privacy Policy | Impressum (Legal Info) | Copyright information: Unless otherwise specified, all text and images on this website are licensed under the Creative Commons Attribution-Share Alike 3.0 License. This does not include the source code of LibreOffice, which is licensed under the Mozilla Public License (MPLv2). "LibreOffice" and "The Document Foundation" are registered trademarks of their corresponding registered owners or are in actual use as trademarks in one or more countries. Their respective logos and icons are also subject to international copyright laws. Use thereof is explained in our trademark policy.