I know that OpenMP is included in the NDK (usage example here: http://recursify.com/blog/2013/08/09/openmp-on-android ). I've done what it says on that page, but when I use #pragma omp for on a simple for loop that scans a vector, the app crashes with the famous "fatal signal 11".
What am I missing here? Btw, I use a modified example from the Android samples (Tutorial 2 - Mixed Processing). All I want is to parallelize (multithread) some of the for loops and nested for loops that I have in the JNI C++ file while using OpenCV.
Any help/suggestion is appreciated!
Edit: sample code added:
Mat tmp(iheight, iwidth, CV_8UC1);
#pragma omp parallel for
for (int x = 0; x < iheight; x++) {
    for (int y = 0; y < iwidth; y++) {
        int value = (int) buffer[x * iwidth + y];
        tmp.at<uchar>(x, y) = value;
    }
}
Based on this: http://www.slideshare.net/noritsuna/how-to-use-openmp-on-native-activity
Thanks!
I think this is a known issue in GOMP; see Bug 42616 and Bug 52738.
The issue is that your app will crash if you try to use OpenMP directives or functions on a non-main thread. It can be traced back to the gomp_thread() function (see libgomp/libgomp.h, lines 362 and 368), which returns NULL for threads you create:
#ifdef HAVE_TLS
extern __thread struct gomp_thread gomp_tls_data;
static inline struct gomp_thread *gomp_thread (void)
{
  return &gomp_tls_data;
}
#else
extern pthread_key_t gomp_tls_key;
static inline struct gomp_thread *gomp_thread (void)
{
  return pthread_getspecific (gomp_tls_key);
}
#endif
As you can see, GOMP uses a different implementation depending on whether or not thread-local storage (TLS) is available.
If it is available, the HAVE_TLS flag is set and a thread-local global variable (gomp_tls_data) is used to track the state of each thread;
otherwise, thread-local data is managed via pthread_setspecific.
In earlier NDK versions thread-local storage (the __thread keyword) isn't supported, so HAVE_TLS isn't defined and therefore pthread_setspecific is used.
Remark: I'm not sure whether __thread is supported in the latest NDK version, but here you can read similar answers about Android TLS.
When GOMP creates a worker thread, it sets up the thread specific data in the function gomp_thread_start() (line 72):
#ifdef HAVE_TLS
  thr = &gomp_tls_data;
#else
  struct gomp_thread local_thr;
  thr = &local_thr;
  pthread_setspecific (gomp_tls_key, thr);
#endif
But when the application creates a thread independently, the thread-specific data isn't set, so gomp_thread() returns NULL and the app crashes. This isn't a problem when TLS is supported, since the thread-local variable is always available.
I remember this issue being fixed in android-ndk-r10d, but only for background processes (no Java involved). That means if you enable OpenMP and create a native thread from JNI (i.e. from code called from Java on Android), the crash remains.
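For illustration, here is a minimal sketch (with hypothetical names, not taken from the question) of the pattern described above: an application-created pthread entering an OpenMP region. On libgomp builds without TLS support this is exactly the case where gomp_thread() returns NULL and the process dies with signal 11, while the same parallel region entered from the thread that initialized libgomp is fine.
#include <pthread.h>
#include <omp.h>

static void* worker(void* /*arg*/)
{
    // This thread was created by the application, not by libgomp, so its
    // thread-specific GOMP data was never set up (the #else branch above).
    #pragma omp parallel for
    for (int i = 0; i < 1000; ++i) {
        // ... some work ...
    }
    return nullptr;
}

void reproduce_crash() // hypothetical helper
{
    pthread_t t;
    pthread_create(&t, nullptr, worker, nullptr);
    pthread_join(t, nullptr); // on affected NDK versions the app crashes before this returns
}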
Related
A library written in C++ produces a continuous stream of data, and the same library has to be ported to different platforms. While integrating the lib into an Android application, I am trying to create shared memory between the NDK and SDK sides.
Below is a working snippet.
Native code:
#include <jni.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/ashmem.h>
#include <android/log.h>
#include <string>

char *buffer;
constexpr size_t BufferSize = 100;

extern "C" JNIEXPORT jobject JNICALL
Java_test_com_myapplication_MainActivity_getSharedBufferJNI(
        JNIEnv* env,
        jobject /* this */) {
    int fd = open("/dev/ashmem", O_RDWR);
    ioctl(fd, ASHMEM_SET_NAME, "shared_memory");
    ioctl(fd, ASHMEM_SET_SIZE, BufferSize);
    buffer = (char*) mmap(NULL, BufferSize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return (env->NewDirectByteBuffer(buffer, BufferSize));
}

extern "C" JNIEXPORT void JNICALL
Java_test_com_myapplication_MainActivity_TestBufferCopy(
        JNIEnv* env,
        jobject /* this */) {
    for (size_t i = 0; i < BufferSize; i = i + 2) {
        __android_log_print(ANDROID_LOG_INFO, "native_log", "Count %zu value:%d", i, buffer[i]);
    }
    // pass `buffer` to the dynamically loaded library to update the shared memory
}
SDK code:
// MainActivity.java
public class MainActivity extends AppCompatActivity {

    // Used to load the 'native-lib' library on application startup.
    static {
        System.loadLibrary("native-lib");
    }

    final int BufferSize = 100;

    @RequiresApi(api = Build.VERSION_CODES.Q)
    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);

        ByteBuffer byteBuffer = getSharedBufferJNI();
        // update the command to shared memory here
        // byteBuffer updated with commands
        // Call JNI to inform about the update and get the response
        TestBufferCopy();
    }

    /**
     * Native methods that are implemented by the 'native-lib' native library,
     * which is packaged with this application.
     */
    public native ByteBuffer getSharedBufferJNI();
    public native void TestBufferCopy();
}
Question:
Accessing primitive arrays from Java in native code is by reference only if the garbage collector supports pinning. Is the same true the other way around?
Is it guaranteed by the Android platform that the reference is ALWAYS shared from NDK to SDK without a redundant copy?
Is this the right way to share memory?
You only need /dev/ashmem to share memory between processes. NDK and SDK (Java/Kotlin) code run in the same Linux process and have full access to the same memory space.
The usual way to define memory that can be used both from C++ and Java is to create a direct ByteBuffer. You don't need JNI for that: the Java API has ByteBuffer.allocateDirect(int capacity). If it's more natural for your logical flow to allocate the buffer on the C++ side, JNI has the NewDirectByteBuffer(JNIEnv* env, void* address, jlong capacity) function that you used in your question.
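For the Java-allocated variant, the native side only needs GetDirectBufferAddress / GetDirectBufferCapacity to reach the same memory. A minimal sketch (the native method name fillBufferJNI is hypothetical, assuming a buffer created with ByteBuffer.allocateDirect on the Java side):
#include <jni.h>
#include <cstring>

extern "C" JNIEXPORT void JNICALL
Java_test_com_myapplication_MainActivity_fillBufferJNI( // hypothetical native method
        JNIEnv* env, jobject /* this */, jobject byteBuffer) {
    char* data = static_cast<char*>(env->GetDirectBufferAddress(byteBuffer));
    jlong size = env->GetDirectBufferCapacity(byteBuffer);
    if (data == nullptr || size <= 0) {
        return; // not a direct buffer
    }
    std::memset(data, 0x2A, static_cast<size_t>(size)); // visible to Java immediately, no copy
}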
Working with a direct ByteBuffer is very easy on the C++ side, but not so efficient on the JVM side. The reason is that this buffer is not backed by an array, and the only API you have is ByteBuffer.get() and its typed variations (getting a byte array, char, int, …). You have control of the current position in the buffer, but working this way requires a certain discipline: every get() operation updates the current position. Also, random access to this buffer is rather slow, because it involves calling both positioning and get APIs. Therefore, in some cases of non-trivial data structures, it may be easier to write your custom access code in C++ and expose 'intelligent' getters called through JNI.
It's important not to forget to set ByteBuffer.order(ByteOrder.nativeOrder()). The order of a newly-created byte buffer is counterintuitively BIG_ENDIAN. This applies both to buffer created from Java and from C++.
If you can isolate the instances when C++ needs access to such shared memory, and don't really need it to be pinned all the time, it's worth considering working with a byte array. In Java, you get more efficient random access. On the NDK side, you call GetByteArrayElements() or GetPrimitiveArrayCritical(). The latter is more efficient, but its use imposes restrictions on what Java functions you can call until the array is released. On Android, neither method involves memory allocation or copying (with no official guarantee, though). Even though the C++ side uses the same memory as Java, your JNI code must call the appropriate Release…() function, and better do that as early as possible. It's good practice to handle this Get/Release via RAII.
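As a minimal sketch of that Get/Release pairing via RAII (my own wrapper, not an NDK class; shown with the plain GetByteArrayElements variant, since GetPrimitiveArrayCritical adds restrictions on what you may call while holding it):
#include <jni.h>

class ScopedByteArray {
public:
    ScopedByteArray(JNIEnv* env, jbyteArray array)
        : env_(env), array_(array),
          data_(env->GetByteArrayElements(array, nullptr)) {}

    ~ScopedByteArray() {
        if (data_ != nullptr) {
            // Mode 0: copy changes back (if it was a copy) and release the elements.
            env_->ReleaseByteArrayElements(array_, data_, 0);
        }
    }

    jbyte* data() const { return data_; }

private:
    JNIEnv* env_;
    jbyteArray array_;
    jbyte* data_;
};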
Let me summarize my findings.
Accessing primitive arrays from Java in native code is by reference only if the garbage collector supports pinning. Is the same true the other way around?
The contents of a direct buffer can, potentially, reside in native memory outside of the ordinary garbage-collected heap, and hence the garbage collector can't reclaim that memory.
Is it guaranteed by the Android platform that the reference is ALWAYS shared from NDK to SDK without a redundant copy?
Yes, as per the documentation of NewDirectByteBuffer:
jobject NewDirectByteBuffer(JNIEnv* env, void* address, jlong capacity);
Allocates and returns a direct java.nio.ByteBuffer referring to the block of memory starting at the memory address address and extending capacity bytes.
I implemented an algorithm on android using OpenCL and OpenMP. The OpenMP implementation runs about 10 times slower than the OpenCL one.
OpenMP: ~250 ms
OpenCL: ~25 ms
But overall, if I measure the time from the java android side, I get roughly the same time to call and get my values.
For example:
Java code:
// calls C implementation using JNI (Java Native Interface)
bool useOpenCL = true;
myFunction(bitmap, useOpenCL); // ~300 ms, timed with System.nanoTime() here, but omitted code for clarity
myFunction(bitmap, !useOpenCL); // ~300 ms, timed with System.nanoTime() here, but omitted code for clarity
C code:
JNIEXPORT void JNICALL Java_com_xxxxx_myFunctionNative(JNIEnv * env, jobject obj, jobject pBitmap, jboolean useOpenCL)
{
    // same before, setting some variables

    clock_t startTimer, stopTimer;
    startTimer = clock();

    if ((bool) useOpenCL) {
        calculateUsingOpenCL(); // runs in ~25 ms, timed here, using clock()
    }
    else {
        calculateUsingOpenMP(); // runs in ~250 ms
    }

    stopTimer = clock();
    __android_log_print(ANDROID_LOG_VERBOSE, APPNAME, "Time in ms: %f\n", 1000.0f * (float)(stopTimer - startTimer) / (float)CLOCKS_PER_SEC);

    // same from here on, e.g.: copying values to java side
}
The Java code executes in roughly the same time in both cases, around 300 ms. To be more precise, the elapsed time is a bit higher for OpenCL, i.e. OpenCL is slower on average.
Looking at the individual run-times of the OpenMP and OpenCL implementations, the OpenCL version should be much faster overall. But for some reason there is an overhead that I cannot find.
I also compared OpenCL against plain native code (no OpenMP) and still got the same results: roughly the same overall runtime, even though calculateUsingOpenCL ran at least 10 times faster.
Ideas:
Maybe the GPU (in the OpenCL case) is less efficient in general because it has less memory available. There are a few variables that we need to preallocate, which are used every frame. So we checked the time it takes for Android to draw a bitmap in both cases (OpenMP, OpenCL). In the OpenCL case, drawing a bitmap sometimes took longer (3 times longer), but not by an amount that would equalize the overall run time of the program.
Does JNI use GPU to accelerate some calls, which could cause the OpenCL version to be slower?
EDIT:
Is it possible that Java Garbage collection is triggered by OpenCL, causing the big overhead?
It turns out that clock() is unreliable for this purpose (it measures the CPU time consumed by the process, summed across all threads, rather than wall-clock time), so instead we used the following method to measure time. With this method, everything is consistent.
#include <time.h>

int64_t getTimeNsec() {
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    return (int64_t) now.tv_sec * 1000000000LL + now.tv_nsec;
}

int64_t startTimer, stopTimer;
startTimer = getTimeNsec();
function_to_measure();
stopTimer = getTimeNsec();
__android_log_print(ANDROID_LOG_VERBOSE, APPNAME, "Runtime in milliseconds (ms): %f", (float)(stopTimer - startTimer) / 1000000.0f);
This was suggested here:
How to obtain computation time in NDK
The Android systrace logging system is fantastic, but it only works in the Java portion of the code, through Trace.beginSection() and Trace.endSection(). In a C/C++ NDK (native) portion of the code it can only be used through JNI, which is slow or unavailable in threads without a Java environment...
Is there any way of either adding events to the main systrace trace buffer, or even generating a separate log, from native C code?
This older question mentions atrace/ftrace as being the internal system Android's systrace uses. Can this be tapped into (easily)?
BONUS TWIST: Since tracing calls would often be in performance-critical sections, it should ideally be possible to run the calls after the actual event time. i.e. I for one would prefer to be able to specify the times to log, instead of the calls polling for it themselves. But that would just be icing on the cake.
Posting a follow-up answer with some code, based on fadden's pointers. Please read his/her answer first for the overview.
All it takes is writing properly formatted strings to /sys/kernel/debug/tracing/trace_marker, which can be opened without problems. Below is some very minimal code based on the cutils header and C file. I preferred to re-implement it instead of pulling in any dependencies, so if you care a lot about correctness check the rigorous implementation there, and/or add your own extra checks and error-handling.
This was tested to work on Android 4.4.2.
The trace file must first be opened, saving the file descriptor in an atrace_marker_fd global:
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>   /* snprintf, used by the trace functions below */
#include <unistd.h>  /* write, getpid */

#define ATRACE_MESSAGE_LEN 256
int atrace_marker_fd = -1;

void trace_init()
{
    atrace_marker_fd = open("/sys/kernel/debug/tracing/trace_marker", O_WRONLY);
    if (atrace_marker_fd == -1) { /* do error handling */ }
}
Normal 'nested' traces like the Java Trace.beginSection and Trace.endSection are obtained with:
inline void trace_begin(const char *name)
{
    char buf[ATRACE_MESSAGE_LEN];
    int len = snprintf(buf, ATRACE_MESSAGE_LEN, "B|%d|%s", getpid(), name);
    write(atrace_marker_fd, buf, len);
}

inline void trace_end()
{
    char c = 'E';
    write(atrace_marker_fd, &c, 1);
}
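A minimal usage sketch (decode_frame is just a hypothetical workload): call trace_init() once at startup, then wrap the section you want to see in the systrace timeline.
void decode_frame() // hypothetical workload
{
    trace_begin("decode_frame");
    // ... expensive native work ...
    trace_end();
}

void on_startup()
{
    trace_init(); // opens /sys/kernel/debug/tracing/trace_marker once
    decode_frame();
}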
Two more trace types are available, which are not accessible to Java as far as I know: trace counters and asynchronous traces.
Counters track the value of an integer and draw a little graph in the systrace HTML output. Very useful stuff:
inline void trace_counter(const char *name, const int value)
{
    char buf[ATRACE_MESSAGE_LEN];
    int len = snprintf(buf, ATRACE_MESSAGE_LEN, "C|%d|%s|%i", getpid(), name, value);
    write(atrace_marker_fd, buf, len);
}
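For example (pending_jobs is a hypothetical value), emitting the counter once per frame produces a small graph of that value over time:
void on_frame_complete(int pending_jobs) // hypothetical per-frame hook
{
    trace_counter("pending_jobs", pending_jobs);
}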
Asynchronous traces produce non-nested (i.e. simply overlapping) intervals. They show up as grey segments above the thin thread-state bar in the systrace HTML output. They take an extra 32-bit integer argument that "distinguishes simultaneous events". The same name and integer must be used when ending traces:
inline void trace_async_begin(const char *name, const int32_t cookie)
{
    char buf[ATRACE_MESSAGE_LEN];
    int len = snprintf(buf, ATRACE_MESSAGE_LEN, "S|%d|%s|%i", getpid(), name, cookie);
    write(atrace_marker_fd, buf, len);
}

inline void trace_async_end(const char *name, const int32_t cookie)
{
    char buf[ATRACE_MESSAGE_LEN];
    int len = snprintf(buf, ATRACE_MESSAGE_LEN, "F|%d|%s|%i", getpid(), name, cookie);
    write(atrace_marker_fd, buf, len);
}
Finally, there indeed seems to be no way of specifying times to log, short of recompiling Android, so this doesn't do anything for the "bonus twist".
I don't think it's exposed from the NDK.
If you look at the sources, you can see that the android.os.Trace class calls into native code to do the actual work. That code calls atrace_begin() and atrace_end(), which are declared in a header in the cutils library.
You may be able to use the atrace functions directly if you extract the headers from the full source tree and link against the internal libraries. However, you can see from the header that atrace_begin() is simply:
static inline void atrace_begin(uint64_t tag, const char* name)
{
    if (CC_UNLIKELY(atrace_is_tag_enabled(tag))) {
        char buf[ATRACE_MESSAGE_LENGTH];
        size_t len;

        len = snprintf(buf, ATRACE_MESSAGE_LENGTH, "B|%d|%s", getpid(), name);
        write(atrace_marker_fd, buf, len);
    }
}
Events are written directly to the trace file descriptor. (Note that the timestamp is not part of the event; that's added automatically.) You could do something similar in your code; see atrace_init_once() in the .c file to see how the file is opened.
Bear in mind that, unless atrace is published as part of the NDK, any code using it would be non-portable and likely to fail in past or future versions of Android. However, as systrace is a debugging tool and not something you'd actually want to ship enabled in an app, compatibility is probably not a concern.
For anybody googling this question in the future:
Native trace events have been supported since API level 23; check out the docs here: https://developer.android.com/topic/performance/tracing/custom-events-native.
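A minimal sketch of those native trace events (requires API level 23+, <android/trace.h>, and linking against libandroid; render_frame is a hypothetical workload):
#include <android/trace.h>

void render_frame() // hypothetical workload
{
    ATrace_beginSection("render_frame"); // shows up in systrace like Trace.beginSection()
    // ... native work ...
    ATrace_endSection();
}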
I've been porting a cross-platform C++ engine to Android, and noticed that it will inexplicably (and inconsistently) block when calling pthread_mutex_lock. This engine has already been working for many years on several platforms, and the problematic code hasn't changed in years, so I doubt it's a deadlock or otherwise buggy code. It must be my port to Android...
So far there are several places in the code that block on pthread_mutex_lock. It isn't entirely reproducible either. When it hangs, there's no suspicious output in LogCat.
I modified the mutex code like this (edited for brevity... real code checks all return values):
void MutexCreate( Mutex* m )
{
#ifdef WINDOWS
    InitializeCriticalSection( m );
#else // ANDROID
    pthread_mutex_init( m, NULL );
#endif
}

void MutexDestroy( Mutex* m )
{
#ifdef WINDOWS
    DeleteCriticalSection( m );
#else // ANDROID
    pthread_mutex_destroy( m, NULL );
#endif
}

void MutexLock( Mutex* m )
{
#ifdef WINDOWS
    EnterCriticalSection( m );
#else // ANDROID
    pthread_mutex_lock( m );
#endif
}

void MutexUnlock( Mutex* m )
{
#ifdef WINDOWS
    LeaveCriticalSection( m );
#else // ANDROID
    pthread_mutex_unlock( m );
#endif
}
I tried modifying MutexCreate to create error-checking and recursive mutexes, but it didn't matter. I wasn't getting errors or log output either, so either my mutex code is fine or the errors/logs weren't being shown. How exactly does the OS notify you of bad mutex usage?
The engine makes heavy use of static variables, including mutexes. I can't see how, but is that a problem? I doubt it because I modified lots of mutexes to be allocated on the heap instead, and the same behavior occurred. But that may be because I missed some static mutexes. I'm probably grasping at straws here.
I read several references including:
http://pubs.opengroup.org/onlinepubs/7908799/xsh/pthread_mutex_init.html
http://www.embedded-linux.co.uk/tutorial/mutex_mutandis
http://linux.die.net/man/3/pthread_mutex_init
Android NDK Mutex
Android NDK problem pthread_mutex_unlock issue
The "errorcheck" mutexes will check a couple of things (like attempts to use a non-recursive mutex recursively) but nothing spectacular.
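To answer the "how does the OS notify you" part: with an error-checking mutex the notification is simply a nonzero return code from the pthread call, e.g. EDEADLK when a thread relocks a mutex it already holds. A minimal sketch (my own wrapper names, not from the question's code):
#include <pthread.h>
#include <errno.h>
#include <android/log.h>

void MutexCreateChecked( pthread_mutex_t* m )
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init( &attr );
    pthread_mutexattr_settype( &attr, PTHREAD_MUTEX_ERRORCHECK );
    pthread_mutex_init( m, &attr );
    pthread_mutexattr_destroy( &attr );
}

void MutexLockChecked( pthread_mutex_t* m )
{
    int err = pthread_mutex_lock( m );
    if ( err != 0 ) {
        // EDEADLK: this thread already holds the mutex; EINVAL: bad/corrupted mutex
        __android_log_print( ANDROID_LOG_ERROR, "Mutex", "pthread_mutex_lock failed: %d", err );
    }
}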
You said "real code checks all return values", so presumably your code explodes if any pthread call returns a nonzero value. (Not sure why your pthread_mutex_destroy takes two args; assuming copy & paste error.)
The pthread code is widely used within Android and has no known hangups, so the issue is not likely in the pthread implementation itself.
The current implementation of mutexes fits in 32 bits, so if you print the mutex value as an integer (e.g. *(uint32_t*) mut) you should be able to figure out what state it's in (technically, what state it was in at some point in the past). The definition in bionic/libc/bionic/pthread.c is:
/* a mutex is implemented as a 32-bit integer holding the following fields
*
* bits: name description
* 31-16 tid owner thread's kernel id (recursive and errorcheck only)
* 15-14 type mutex type
* 13 shared process-shared flag
* 12-2 counter counter of recursive mutexes
* 1-0 state lock state (0, 1 or 2)
*/
"Fast" mutexes have a type of 0, and don't set the tid field. In fact, a generic mutex will have a value of 0 (not held), 1 (held), or 2 (held, with contention). If you ever see a fast mutex whose value is not one of those, chances are something came along and stomped on it.
It also means that, if you configure your program to use recursive mutexes, you can see which thread holds the mutex by pulling the bits out (either by printing the mutex value when trylock indicates you're about to stall, or dumping state with gdb on a hung process). That, plus the output of ps -t, will let you know if the thread that locked the mutex still exists.
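As a minimal sketch of "pulling the bits out" (a hypothetical helper, valid only for the old bionic layout quoted above):
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

void dump_mutex_bits(const pthread_mutex_t* m)
{
    uint32_t v = *(const uint32_t*) m;      // reinterpret the 32-bit mutex word
    uint32_t state   =  v        & 0x3;     // bits 1-0: 0 free, 1 held, 2 held with contention
    uint32_t counter = (v >> 2)  & 0x7FF;   // bits 12-2: recursion counter
    uint32_t shared  = (v >> 13) & 0x1;     // bit 13: process-shared flag
    uint32_t type    = (v >> 14) & 0x3;     // bits 15-14: mutex type (0 = fast)
    uint32_t tid     =  v >> 16;            // bits 31-16: owner tid (recursive/errorcheck only)
    printf("state=%u counter=%u shared=%u type=%u tid=%u\n",
           state, counter, shared, type, tid);
}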
I'm building an Android project where I use the Android NDK with LibXtract to extract audio features. LibXtract uses the fftw3 library. The project consists of a button which runs a simple example from libxtract:
JNIEXPORT void JNICALL Java_com_androidnative1_NativeClass_showText(JNIEnv *env, jclass clazz)
{
    float mean = 0, vector[] = {.1, .2, .3, .4, -.5, -.4, -.3, -.2, -.1}, spectrum[10];
    int n, N = 9;

    float argf[4];
    argf[0] = 8000.f;
    argf[1] = XTRACT_MAGNITUDE_SPECTRUM;
    argf[2] = 0.f;
    argf[3] = 0.f;

    xtract[XTRACT_MEAN]((void *)&vector, N, 0, (void *)&mean);

    __android_log_print(ANDROID_LOG_DEBUG, "AndNat", "com_androidnative1_NativeClass.c before");
    xtract_init_fft(N, XTRACT_SPECTRUM);
    __android_log_print(ANDROID_LOG_DEBUG, "AndNat", "com_androidnative1_NativeClass.c after");

    // Comment for test purpose
    //xtract_init_bark(1, argf[1], 1);
    //xtract[XTRACT_SPECTRUM]((void *)&vector, N, &argf[0], (void *)&spectrum[0]);
}
The libxtract function xtract_init_fft, located in jni/libxtract/jni/src/init.c, executes the fftw3 function fftwf_plan_r2r_1d, located at jni/fftw3/jni/api/plan-r2r-1d.c:
__android_log_print(ANDROID_LOG_DEBUG, "AndNat", "libxtract/src/init.c before");
fft_plans.spectrum_plan = fftwf_plan_r2r_1d(N, input, output, FFTW_R2HC, optimisation);
__android_log_print(ANDROID_LOG_DEBUG, "AndNat", "libxtract/src/init.c after");
The application hangs inside fftwf_plan_r2r_1d without a crash or any other error; I have to force it to stop.
fftwf_plan_r2r_1d looks like:
X(plan) X(plan_r2r_1d)(int n, R *in, R *out, X(r2r_kind) kind, unsigned flags)
{
    __android_log_print(ANDROID_LOG_DEBUG, "AndNat", "fftw3/api/plan-r2r-1d.c");
    return X(plan_r2r)(1, &n, in, out, &kind, flags);
}
From LogCat I can see:
07-16 18:50:09.615: D/AndNat(7313): com_androidnative1_NativeClass.c before
07-16 18:50:09.615: D/AndNat(7313): libxtract/src/init.c before
07-16 18:50:09.615: D/AndNat(7313): fftw3/api/plan-r2r-1d.c
I generated config.h for fftw3 and libxtract with the gen.sh scripts located in the source folders, with success. Both libraries are built as static libraries and linked into the shared library libcom_androidnative1_NativeClass.so.
Command
nm -Ca libcom_androidnative1_NativeClass.so
shows that the used function is included.
The application builds and deploys to the device without any problems.
I built fftw3 with the flags --disable-alloca and --enable-float, and LibXtract with the flags --enable-fft and --disable-dependency-tracking.
The only change to the library source code was adding the debug prints and removing the XTRACT_FFT define from LibXtract, because it couldn't detect the fftw library.
If somebody has any idea about this (to me) strange behavior, please help.
I've put the entire project on GitHub, so maybe someone can help me track this down:
https://github.com/bl0ndynek/AndroidNative1
Thanks to the FFTW3 maintainer, the problem is solved.
The solution was to change the optimisation level passed to FFTW3 from FFTW_MEASURE to FFTW_ESTIMATE (from 1 to 0).
FFTW's planner (invoked in xtract_init_fft) actually executes and times different possible FFT algorithms in order to pick the fastest plan for a given n. To do this in as short a time as possible, the timer must have a very high resolution, and to accomplish this FFTW3 uses the hardware cycle counters that are available on most CPUs, but not in the default Android ARM configuration.
So the planner falls back to gettimeofday(), which has low resolution, and on ARM xtract_init_fft takes forever.
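A minimal sketch of the difference (make_plan is a hypothetical helper, not the actual libxtract patch): with FFTW_ESTIMATE the planner picks an algorithm heuristically instead of timing candidates with a low-resolution clock.
#include <fftw3.h>

void make_plan(int N, float* input, float* output) // hypothetical helper
{
    fftwf_plan p = fftwf_plan_r2r_1d(N, input, output, FFTW_R2HC,
                                     FFTW_ESTIMATE /* instead of FFTW_MEASURE */);
    fftwf_execute(p);
    fftwf_destroy_plan(p);
}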
It looks to me like you are missing some terminating condition in your recursive function X() which would put you in an infinite loop.