From e204c1ea7451851baa44cbba1770ad787176a648 Mon Sep 17 00:00:00 2001
From: Greg Alexander
Date: Mon, 5 Aug 2019 11:15:05 -0400
Subject: [PATCH] musings about the dump, which must be caused by atexit()

---
 NOTES | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 69 insertions(+)

diff --git a/NOTES b/NOTES
index 847ab97..ba524f8 100644
--- a/NOTES
+++ b/NOTES
@@ -898,6 +898,75 @@ the manifest...
 allowBackup="false" took immediate effect and had no surprises...
 
+August 4, 2019.
+
+I finally got a dump from a user (Hammad), and it's quite distressing.
+The stack trace is roughly:
+    backtrace()
+    sigsegv_handler()
+    /system/bin/app_process64+0x2a90
+    __kernel_rt_sigreturn()
+    A5xContext::HwAddNop(unsigned int *, unsigned int)
+    EsxCmdMgr::IssuePendingIB1s(EsxFlushReason, int, int)
+    EsxCmdMgr::Flush(EsxFlushReason)
+    EsxContext::Destroy()
+    EglContext::DestroyEsxContext()
+    EglDisplay::MarkContextListForDestroy()
+    EglDisplay::Terminate(int)
+    EglDisplayList::Destroy()
+    EglDisplay::DestroyStaticListsMutexesAndTlsKeys()
+    EsxEntryDestruct()
+    /system/vendor/lib64/egl/libGLESv2_adreno.so+0x12780
+    [... cut off at 16 ...]
+
+So many questions! I think app_process64 must be the actual C main() of
+a process, responsible for branching into all the android system
+libraries? I imagine it's involved because it's somehow intercepted the
+SIGSEGV and re-dispatched it to my handler? I don't see any way we could
+have branched into libGLESv2_adreno from userland, so the SIGSEGV must
+come from the UI thread, I guess? Maybe this SIGSEGV is actually the
+sort of thing we'd get if we tried to call UI code from the non-UI
+thread??
+
+It looks like GLES is busy cleaning itself up, and it crashes. Why's it
+crash? Why's it trying to clean itself up?
+
+Hammad says there is no problem using sshd...I thought he meant that the
+re-start logic is working for him but his dropbear.err has multiple dumps
+in it! The SIGSEGVs are apparently not killing the daemon.
+
+There are no timestamps on the dumps, but it looks like they're
+associated with activity anyways. Each dump happens between "Disconnect
+received" and "sigchld". Some of them have "server select out"
+interleaved into the dump, which I think is the result of Hammad running:
+    while true; do ssh phone 'exit'; done
+That is, it appears he starts a new connection the very instant the old
+connection ends. So the new connection comes into the server process
+while the child process is in the act of dying.
+
+The thing is, I don't see how it could possibly be getting signals from
+the Java side of things, because it fork()s before setting up the signal
+handling. It's not just running in a different thread, it should be a
+totally separate process. I can test this but I don't think I'm wrong
+about that.
+
+So I guess just about the only thing that's really possible is that
+there's an atexit() which survives the fork() because it isn't followed
+up with an execve(). It's not caused by ARM, or even necessarily by
+Android 9...the reason it doesn't show up in the emulator is that the
+libGLES that registers the atexit() is vendor-supplied for specific
+hardware ("Adreno").
+
+So I need to figure out how to bypass the atexit() somehow, perhaps by
+calling _exit() directly?
+
+
+XXX - merge back into main branch, because I'll want to keep the dump facility
+XXX - make the dump go deeper in the stack
+XXX - put a crash in an atexit() to be sure it presents about this way
+XXX - test re-start mechanism, which doesn't seem to work on the first try if it crashes
+XXX - test bypassing that crash
+XXX - remove the crash, remove the debug fprintfs (select in/out, sigchld) --- new release