musings about the dump, which must be caused by atexit()

2025-05-28 08:48:45 +00:00 · 2019-08-05 11:15:05 -04:00 · 2019-08-05 11:15:05 -04:00 · e204c1ea74
commit e204c1ea74
parent 2d8d649cdd
1 changed files with 69 additions and 0 deletions
--- a/69
+++ b/69
@@ -898,6 +898,75 @@ the manifest...
 allowBackup="false" took immediate effect and had no surprises...


+August 4, 2019.
+
+I finally got a dump from a user (Hammad), and it's quite distressing.
+The stack trace is roughly:
+   backtrace()
+   sigsegv_handler()
+   /system/bin/app_process64+0x2a90
+   __kernel_rt_sigreturn()
+   A5xContext::HwAddNop(unsigned int *, unsigned int)
+   EsxCmdMgr::IssuePendingIB1s(EsxFlushReason, int, int)
+   EsxCmdMgr::Flush(EsxFlushReason)
+   EsxContext::Destroy()
+   EglContext::DestroyEsxContext()
+   EglDisplay::MarkContextListForDestroy()
+   EglDisplay::Terminate(int)
+   EglDisplayList::Destroy()
+   EglDisplay::DestroyStaticListsMutexesAndTlsKeys()
+   EsxEntryDestruct()
+   /system/vendor/lib64/egl/libGLESv2_adreno.so+0x12780
+   [... cut off at 16 ...]
+
+So many questions!  I think app_process64 must be the actual C main() of
+a process, responsible for branching into all the Android system
+libraries?  I imagine it's involved because it has somehow intercepted the
+SIGSEGV and re-dispatched it to my handler?  I don't see any way my own
+native code could have branched into libGLESv2_adreno, so the SIGSEGV must
+come from the UI thread, I guess?  Maybe this SIGSEGV is actually the
+sort of thing we'd get if we tried to call UI code from the non-UI
+thread??
+
+It looks like GLES is busy cleaning itself up, and it crashes.  Why's it
+crash?  Why's it trying to clean itself up?
+
+Hammad says there is no problem using sshd.  I thought he meant that the
+re-start logic is working for him, but his dropbear.err has multiple dumps
+in it!  The SIGSEGVs are apparently not killing the daemon.
+
+There are no timestamps on the dumps, but it looks like they're
+associated with activity anyways.  Each dump happens between "Disconnect
+received" and "sigchld".  Some of them have "server select out"
+interleaved into the dump, which I think is the result of Hammad running:
+   while true; do ssh phone 'exit'; done
+That is, it appears he starts a new connection the very instant the old
+connection ends.  So the new connection comes into the server process
+while the child process is in the act of dying.
+
+The thing is, I don't see how it could possibly be getting signals from
+the Java side of things, because the daemon fork()s before setting up its
+signal handling.  It's not just running in a different thread, it should
+be a totally separate process.  I can test this but I don't think I'm
+wrong about that.
+
+So I guess just about the only thing that's really possible is that
+there's an atexit() handler which the child inherits across the fork()
+because the fork() isn't followed by an execve().  It's not caused by ARM,
+or even necessarily by Android 9...the reason it doesn't show up in the
+emulator is that the libGLES that registers the atexit() is
+vendor-supplied for specific hardware ("Adreno").
+
+So I need to figure out how to bypass the atexit() somehow, perhaps by
+calling _exit() directly?
+
+
+XXX - merge back into main branch, because I'll want to keep the dump facility
+XXX - make the dump go deeper in the stack
+XXX - put a crash in an atexit() to confirm it presents the same way
+XXX - test re-start mechanism, which doesn't seem to work on the first try if it crashes
+XXX - test bypassing that crash
+XXX - remove the crash, remove the debug fprintfs (select in/out, sigchld)

 --- new release
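
The atexit() hypothesis in the August 4 entry can be checked in isolation.
The following is a minimal sketch, not code from the dropbear port: every
name in it (marker_fd, inherited_handler, child_ran_inherited_handler) is
hypothetical, and the handler only writes a marker byte where the vendor
library would tear down its GL state.  It assumes a POSIX system.  A
handler is registered in the parent, the process fork()s without an
execve(), and the child leaves through either exit() or _exit().

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write end of a pipe; only valid in the child between fork() and exit. */
static int marker_fd = -1;

/* Stand-in for the vendor library's cleanup: leaves one marker byte. */
static void inherited_handler(void)
{
    if (marker_fd >= 0) {
        if (write(marker_fd, "x", 1) != 1) {
            /* nothing useful can be done about a failed write at exit */
        }
    }
}

/*
 * Forks a child that exits with exit() (use_underscore_exit == 0) or
 * _exit() (use_underscore_exit != 0).
 * Returns 1 if the handler registered before the fork ran in the child,
 * 0 if it did not, and -1 on error.
 */
int child_ran_inherited_handler(int use_underscore_exit)
{
    static int registered = 0;
    int fds[2];
    char byte;
    ssize_t got;
    pid_t pid;

    if (pipe(fds) != 0) {
        return -1;
    }
    if (!registered) {
        /* Registered once, in the parent, before any fork(). */
        if (atexit(inherited_handler) != 0) {
            close(fds[0]);
            close(fds[1]);
            return -1;
        }
        registered = 1;
    }

    marker_fd = fds[1];
    pid = fork();
    if (pid < 0) {
        marker_fd = -1;
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: no execve(), so the handler list is still in place. */
        close(fds[0]);
        if (use_underscore_exit) {
            _exit(0);   /* terminates without running atexit() handlers */
        }
        exit(0);        /* runs every handler inherited from the parent */
    }

    /* Parent: disarm the handler for its own exit, then look for the byte. */
    marker_fd = -1;
    close(fds[1]);
    got = read(fds[0], &byte, 1);   /* 0 bytes means EOF: nothing written */
    close(fds[0]);
    waitpid(pid, NULL, 0);

    if (got < 0) {
        return -1;
    }
    return got == 1 ? 1 : 0;
}
```

Calling child_ran_inherited_handler(0) exercises the exit() path and
child_ran_inherited_handler(1) the _exit() path.  The pipe is there only so
the parent can observe what the child did.  One side effect to keep in mind
if the daemon's child switches to _exit(): it also skips the stdio flush
that exit() performs, so any buffered output would need an explicit
fflush() first.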