Heap Corruption - SEGV_MAPERR in Android Native code

Heap Corruption - SEGV_MAPERR in Android Native code - android

I am trying to create a small library for stream AES encryption, I started my work based on Facebook Conceal project (https://github.com/facebook/conceal), just changing some things and improving the wrapper around the native to support ciphers with padding.
It is working and it can decipher files without problems but I get random Heap Memory Corruptions when I work with large streams, and after a lot of time debugging I have been unable to find the error.
Here is my code:
https://gist.github.com/frisco82/9782725
I have tried to find memory allocation or free problems but there are almost no malloc or free, and jni call should be safe, the same goes for openssl ones (I have compiled my own but conceal provided ones also fail)
CheckJni does not warn about anything and while the context handling is a bit out of the box it doesn't seem broken (indeed Android conscrypt seems to use something similar).
Also if someone can point me to a Android native AES multistep (multiple update calls) library I will switch to that and forget this.
The error varies from time to time but it is usually similar to his:
03-26 10:33:02.065: A/dalvikvm(2475): ### ABORTING: DALVIK: HEAP MEMORY CORRUPTION IN mspace_malloc addr=0x0
03-26 10:33:02.065: A/libc(2475): Fatal signal 11 (SIGSEGV) at 0xdeadbaad (code=1), thread 2494 (AsyncTask #1)
03-26 10:33:02.205: I/DEBUG(933): *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
03-26 10:33:02.205: I/DEBUG(933): Build fingerprint: 'generic_x86/google_sdk_x86/generic_x86:4.4.2/KK/999428:eng/test-keys'
03-26 10:33:02.205: I/DEBUG(933): Revision: '0'
03-26 10:33:02.205: I/DEBUG(933): pid: 2475, tid: 2494, name: AsyncTask #1 >>> com.proton <<<
03-26 10:33:02.205: I/DEBUG(933): signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr deadbaad
Full stack traces:
http://pastebin.com/f6mDuQEj

It is working and it can decipher files without problems but I get
random Heap Memory Corruptions when I work with large streams.
From above line it looks to me that your program is clearly overwriting the memory which was allocated implicitly or explicitly by your code. I was trying to understand your code however it was not clear to me. But I tried to look from memory corruption scenario and found that your program does have malloc/free call which might lead to memory overrun.
EVP_CIPHER_CTX *ctx = (EVP_CIPHER_CTX*) malloc(sizeof(EVP_CIPHER_CTX));
EVP_CIPHER_CTX_init(ctx);
EVP_CIPHER_CTX *ctx = (EVP_CIPHER_CTX*) malloc(sizeof(EVP_CIPHER_CTX));
EVP_CIPHER_CTX_init(ctx);
I tried to check the layout of the EVP_CIPHER_CTX structure but it was not available in your code. But I saw that these pointers are getting used in various context within your program. Now you should check that under which scenario your buffer can be overwritten as some places you have used different keyLength and depending on this your program is executing different function. I think you may want to review these codes and see whether overflow is possible!!!....
As your application would be running on android based system where we can not run any dynamic tool(Valgrind/WinDBG/Pageheap..) so I guess you need to review your code by putting some log at important place and see where you are overwriting.
Hope above information would be useful for you to understand your problem.

After all I was able to work around this problem, EVP_CipherUpdate (or jni ReleaseByteArrayElements) sometimes overflow the output buffer causing the heap corruption, nothing in my code was wrong and also it was not a problem with the caller as replacing EVP_CipherUpdate with a memcpy call with the same parameters worked as expected and there was no heap corruption.
So the solution was adding some extra length to the output buffer sent to nativeUpdate and the error was gone.
I have made the full working version of the library for others to use at:
https://github.com/frisco82/conceal

Related

Can low memory cause seg faults in native code?

I have a group of crashes in native code that are rare but happen consistently inolving SEGV_MAPERR or SEGV_ACCERR. These crashes are almost always reported by Crashlytics with very low RAM free (1-5% typically). 'Normal' crashes (ie, ones I have debugged) have no pattern in RAM free.
Is it possible these crashes are caused by a low memory condition? What would be the mechanism for this? Is there any way to tell if these are low memory related crashes or programming errors (using pointers wrongly, etc)? In many cases, the crash is happening in a library which I can't debug and I can't replicate the crashes on my devices.
Here's some of these crashes pulled from the Developer Console because it provides a little more detail than Crashlytics in the trace in these cases:
********** Crash dump: **********
Build fingerprint: 'htc/a32eul_metropcs_us/htc_a32eul:5.1/LMY47O/637541.3:user/release-keys'
pid: 10902, tid: 10989, name: .xxx.xxxx >>> com.xxx.xxxxx <<<
signal 11 (SIGSEGV), code 2 (SEGV_ACCERR), fault addr 0x97f78000
Stack frame #00 pc 0004cd80 /data/app/xxx.xxx.xxxxx-1/lib/arm/libxxx.so: Routine xxxxxMixerInterleavedFloatOutput at libgcc2.c:?
********** Crash dump: **********
Build fingerprint: 'Xiaomi/land/land:6.0.1/MMB29M/V8.1.1.0.MALMIDI:user/release-keys'
pid: 2661, tid: 2746, name: .xxx.xxxx >>> com.xxx.xxxx <<<
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x0
Stack frame #00 pc 00016954 /system/lib/libc.so (__memcpy_base+36)
Stack frame #01 pc 0000b14c /data/app/com.xxx.xxxx-2/lib/arm/libswresample-2.so: Routine ??
??:0

There are two general possibilities:
A low memory condition in of itself is not going to somehow trigger a segfault in a running application. What can happen is that when the application asks for additional memory to be allocated to it, the memory allocation request fails. This is a well defined memory condition. It is documented that the relevant system calls can fail in allocating memory. But what often happens is that the application are not coded properly to check for a failed memory allocation request, and they crash for that reason. In that case, it is not true that a low memory condition is responsible for an application segfault, it is an application bug.
The Linux kernel overcommits the available memory. As a result of that it is possible that the kernel will have no option but to select a process to be killed, when all available RAM has been exhausted.
However, in the case of the OOM killer kicking in, the chosen victims are terminated with a SIGKILL. A SEGFAULT indicates an application bug.

Unable to open symbol file. Error (20): Not a directory

I am using ffmpeg library on android to stream live video feed. I have complied ffmpeg for android following roman10 instructions. The application is working correctly - it connects to the server, download the feed, transcode it, rescale it and displays on the device's screen. However after a certain random moment the app crashes with Fatal signal 11 (SIGSEGV), code 1. I have used ndk-stack to find the source of the problem. Here is the crash dump:
********** Crash dump: **********
Build fingerprint: 'google/hammerhead/hammerhead:5.0.1/LRX22C/1602158:user/release-keys'
pid: 25241, tid: 25317, name: AsyncTask #5 >>> com.grzebyk.streamapp <<<
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x98e74c9c
Stack frame #00 pc 00047924 /data/app/com.grzebyk.streamapp-1/lib/arm/libswscale-3.so: Unable to open symbol file /Users/grzebyk/Documents/New_Eclipse_Projects/StreamApp/libs/armeabi/libStreamApp.so/libswscale-3.so. Error (20): Not a directory
Stack frame #01 pc 00034be8 /data/app/com.grzebyk.streamapp-1/lib/arm/libswscale-3.so (sws_scale+2648): Unable to open symbol file /Users/grzebyk/Documents/New_Eclipse_Projects/StreamApp/libs/armeabi/libStreamApp.so/libswscale-3.so. Error (20): Not a directory
My native code is located in the StreamApp.cpp file. For me it looks like the app is trying to access libswscale-3.so (part of the ffmpeg) located inside the libStreamApp.so. This seems weird for me…
All the ffmpeg's .so files are located in /libs/armeabi/lib*.so. Naturally this includes the "missing" libswscale-3.so. The most disturbing thing is a fact that the app is working perfectly, but it crashes suddenly and it does not need any specific trigger to do so.
What can I do to either put libswscale-3.so inside labStreamApp.so or to avoid referencing one .so file from another?

It's rather simple, you have passed
"/Users/grzebyk/Documents/New_Eclipse_Projects/StreamApp/libs/armeabi/libStreamApp.so/"
as a parameter for ndk-stack and obviously it's not a folder :) You should pass smth like
"/Users/grzebyk/Documents/New_Eclipse_Projects/StreamApp/libs/armeabi/"

Android native application running even after segmentation fault with same PID

I'm running custom android on i.MX51 board and observed a strange issue with an application.
I got logs in n logcat of segmentation fault of an application (native, written using NDK) :
03-19 15:26:46.763 I/DEBUG ( 2234): pid: 2257, tid: 2257 >>> /usr/bin/powerMgr <<<
03-19 15:26:46.763 I/DEBUG ( 2234): signal 8 (SIGFPE), code 0 (?), fault addr 000008d1
Even after this the application continued to run with same PID (2257) which I confirmed from both logcat and ps command. Is this possible ? If yes, how ??

That's not a segmentation fault (SIGSEGV, signal 11). You got a SIGFPE, signal 8, possibly as the result of an integer divide-by-zero. The signal handling didn't kill the process, so it just continued executing.
Many ARM CPUs lack hardware division instructions, so the SIGFPEis thrown explicitly from the software divide function. As a result you don't get a meaningful value in "fault addr".
The treatment of this has changed over time; newer versions of Android are a bit better about it.

Is "Process terminated by signal (11)" always related to NDK?

I am tracking a bug in my project. From time to time, my app gets killed and in the logcat there is always the following line:
D/Zygote ( xx): Process xxxx terminated by signal (11)
I have been searching about this error and I always find mentions to NDK.
I am developing a project that uses a third-party C library. I do not know the library in great detail, but I can tell you that it does some network communication with a server. In my project there are several Services, some of which use this library.
So my question is, does this error always imply that the problem comes from the C library?
If not, could you give me examples of Java code in Android, that could cause this error as well?
Thanks.
EDIT: By the way, in the logcat output there is no stack trace before the previous line.

Most likely, SIGSEGV was reported before in your log. It may be quite a while earlier than the final message. This is an actual snapshot of one of such crashes in my logs:
01-29 16:00:39.124 F/libc ( 3033): Fatal signal 11 (SIGSEGV) at 0x0000002c (code=1)
... many more ...
01-29 16:00:48.367 D/Zygote ( 116): Process 3033 terminated by signal (11)
You can see that it took almost 10 sec for Zygote to report process termination after the fatal signal was received. Your milage may vary.
The actual reasons for SIGSEGV may be within the third-party C library, or in your Java, or in the way you call the C library. Note, for example, that it's forbidden to make most of JNI calls while a Java Exception is raised.

Android: do I have a memory leak?

I am working on an lwp where over time, the memory usage reported thru the UI (settings, apps, running) shows an ever increasing memory usage (starts at ~18M and has climbed as high as 90M before I nuked it). That seems very bad and got me hunting down what must be a memory leak. So I used ddms and it shows that while the the app shows increasing memory usage, the Dalvik allocated heap is ~13.7M and it doesn't vary more than ~500K over time. OK, so it must be on the native heap. I included a log of Debug.getNativeHeapSize() and Debug.getNativeHeapAllocatedSize() and they both report ~5M of native heap +/- ~2M, consistently as the UI shows the increasing memory usage. What other memory usage is there besides Dalvik and native? So I read Diane Hackborn's excellent post here and realized I had a lot more to learn. I used adb shell dumpsys meminfo and the result looked like:
** MEMINFO in pid 6856 **
Shared Private Heap Heap Heap
Pss Dirty Dirty Size Alloc Free
------ ------ ------ ------ ------ ------
Native 1118 1116 1076 5072 3452 59
Dalvik 4102 16204 3584 15111 14077 1034
Cursor 0 0 0
Ashmem 2 4 0
Other dev 33344 292 820
The huge 33M "Pss" for "Other Dev" plus the Native/Pss and Dalvik/Pss is roughly what the UI is reporting for memory usage. Reading Diane's post, I don't know if I should be worried about my memory usage or not. I can see that that "Pss/Other dev" value is climbing over time but my native and Dalvik heap are remaining where ddms and the Debug.getNativeHeapSize() report they are. I'm confused and don't know if I should be concerned other than that users will see that my lwp is consuming this crazy amount of "memory" when as far as I can see, it is not actually leaking.
Does anyone understand what is going on here? This on a Galaxy Nexus, ICS 4.0.2.
EDIT: I subsequently tried to crash the app by letting the Other/Pss climb until the app reached ~200M of reported usage in the UI. It did crash:
E/IMGSRV ( 8337): :0: __map: Map device memory failed
W/GraphicBufferMapper( 8337): registerBuffer(0x3364c0) failed -14 (Bad address)
F/libc ( 8337): Fatal signal 11 (SIGSEGV) at 0x00000000 (code=1)
I/DEBUG ( 114): *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
So I have something to fix, I have no idea where to start since I can't see any problem on the dalvik or native heap.

Develop Reference

The Android operating system is a mobile operating system that was developed by Google (GOOGL?) to be primarily used for touchscreen devices, cell phones, and tablets.