From e204c1ea7451851baa44cbba1770ad787176a648 Mon Sep 17 00:00:00 2001
From: Greg Alexander <gitgreg@galexander.org>
Date: Mon, 5 Aug 2019 11:15:05 -0400
Subject: [PATCH] musings about the dump, which must be caused by atexit()

---
 NOTES | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 69 insertions(+)

diff --git a/NOTES b/NOTES
index 847ab97..ba524f8 100644
--- a/NOTES
+++ b/NOTES
@@ -898,6 +898,75 @@ the manifest...
 allowBackup="false" took immediate effect and had no surprises...
 
 
+August 4, 2019.
+
+I finally got a dump from a user (Hammad), and it's quite distressing.
+The stack trace is roughly:
+   backtrace()
+   sigsegv_handler()
+   /system/bin/app_process64+0x2a90
+   __kernel_rt_sigreturn()
+   A5xContext::HwAddNop(unsigned int *, unsigned int)
+   EsxCmdMgr::IssuePendingIB1s(EsxFlushReason, int, int)
+   EsxCmdMgr::Flush(EsxFlushReason)
+   EsxContext::Destroy()
+   EglContext::DestroyEsxContext()
+   EglDisplay::MarkContextListForDestroy()
+   EglDisplay::Terminate(int)
+   EglDisplayList::Destroy()
+   EglDisplay::DestroyStaticListsMutexesAndTlsKeys()
+   EsxEntryDestruct()
+   /system/vendor/lib64/egl/libGLESv2_adreno.so+0x12780
+   [... cut off at 16 ...]
+
+So many questions!  I think app_process64 must be the actual C main() of
+a process, responsible for branching into all the android system
+libraries?  I imagine it's involved because it's somehow intercepted the
+SIGSEGV and re-dispatched it to my handler?  I don't see any way we could
+have branched into libGLESv2_adreno from userland, so the SIGSEGV must
+come from the UI thread, I guess?  Maybe this SIGSEGV is actually the
+sort of thing we'd get if we tried to call UI code from the non-UI
+thread??
+
+It looks like GLES is busy cleaning itself up, and it crashes.  Why's it
+crash?  Why's it trying to clean itself up?
+
+Hammad says there is no problem using sshd.  I thought he meant that the
+re-start logic was working for him, but his dropbear.err has multiple
+dumps in it!  The SIGSEGVs are apparently not killing the daemon.
+
+There are no timestamps on the dumps, but it looks like they're
+associated with activity anyway.  Each dump happens between "Disconnect
+received" and "sigchld".  Some of them have "server select out"
+interleaved into the dump, which I think is the result of Hammad running:
+   while true; do ssh phone 'exit'; done
+That is, it appears he starts a new connection the very instant the old
+connection ends.  So the new connection comes into the server process
+while the child process is in the act of dying.
+
+The thing is, I don't see how it could possibly be getting signals from
+the Java side of things, because it fork()s before setting up the signal
+handling.  It's not just running in a different thread, it should be a
+totally separate process.  I can test this but I don't think I'm wrong
+about that.
+
+So I guess just about the only thing that's really possible is that
+there's an atexit() handler which survives the fork() because the fork()
+isn't followed up with an execve().  It's not caused by ARM, or even
+necessarily by Android 9.  The reason it doesn't show up in the emulator
+is that the libGLES that registers the atexit() is vendor-supplied for
+specific hardware ("Adreno").
+
+So I need to figure out how to bypass the atexit() somehow, perhaps by
+calling _exit() directly?
+
+
+XXX - merge back into main branch, because I'll want to keep the dump facility
+XXX - make the dump go deeper in the stack
+XXX - put a crash in an atexit() to be sure it presents about this way
+XXX - test re-start mechanism, which doesn't seem to work on the first try if it crashes
+XXX - test bypassing that crash
+XXX - remove the crash, remove the debug fprintfs (select in/out, sigchld)
 
 --- new release