I implemented an algorithm on Android using both OpenCL and OpenMP. The OpenMP implementation runs about 10 times slower than the OpenCL one:
OpenMP: ~250 ms
OpenCL: ~25 ms
But overall, if I measure the time from the Java (Android) side, both calls take roughly the same time to run and return their values.
For example:
Java code:
// calls C implementation using JNI (Java Native Interface)
bool useOpenCL = true;
myFunction(bitmap, useOpenCL); // ~300 ms, timed with System.nanoTime() here, but omitted code for clarity
myFunction(bitmap, !useOpenCL); // ~300 ms, timed with System.nanoTime() here, but omitted code for clarity
C code:
JNIEXPORT void JNICALL Java_com_xxxxx_myFunctionNative(JNIEnv * env, jobject obj, jobject pBitmap, jboolean useOpenCL)
{
// same before, setting some variables
clock_t startTimer, stopTimer;
startTimer = clock();
if ((bool) useOpenCL) {
calculateUsingOpenCL(); // runs in ~25 ms, timed here, using clock()
}
else {
calculateUsingOpenMP(); // runs in ~250 ms
}
stopTimer = clock();
__android_log_print(ANDROID_LOG_VERBOSE, APPNAME, "Time in ms: %f\n", 1000.0f* (float)(stopTimer - startTimer) / (float)CLOCKS_PER_SEC);
// same from here on, e.g.: copying values to java side
}
The Java code executes in roughly the same time in both cases, around 300 ms. To be more precise, the elapsed time is a bit higher for OpenCL; that is, OpenCL is slower on average.
Looking at the individual run times of the OpenMP and OpenCL implementations, the OpenCL version should be much faster overall, but for some reason there is an overhead that I cannot find.
I also compared OpenCL against plain native code (no OpenMP) and got the same result: roughly the same overall runtime, even though calculateUsingOpenCL ran at least 10 times faster.
Ideas:
Maybe the GPU (in the OpenCL case) is less efficient in general because it has less memory available. There are a few variables that we need to preallocate, which are used every frame. So we checked the time it takes Android to draw a bitmap in both cases (OpenMP, OpenCL). In the OpenCL case, drawing a bitmap sometimes took longer (up to 3 times longer), but not by enough to equalize the overall run time of the program.
Does JNI use the GPU to accelerate some calls, which could cause the OpenCL version to be slower?
EDIT:
Is it possible that Java Garbage collection is triggered by OpenCL, causing the big overhead?
It turns out clock() is unreliable for this on Android: it measures CPU time rather than wall-clock time, so it doesn't count the time spent waiting (for example, on the GPU). Instead we used the following method to measure wall-clock time, and with it the numbers are consistent.
int64_t getTimeNsec() {
struct timespec now;
clock_gettime(CLOCK_MONOTONIC, &now);
return (int64_t) now.tv_sec*1000000000LL + now.tv_nsec;
}
int64_t startTimer, stopTimer;
startTimer = getTimeNsec();
function_to_measure();
stopTimer = getTimeNsec();
__android_log_print(ANDROID_LOG_VERBOSE, APPNAME, "Runtime in milliseconds (ms): %f", (float)(stopTimer - startTimer) / 1000000.0f);
This was suggested here:
How to obtain computation time in NDK
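For reference, the reason clock() misled us: it counts CPU time consumed by the process, not elapsed wall-clock time, so any interval the CPU spends waiting (for example, on the GPU) simply doesn't show up. Below is a minimal sketch, reusing the getTimeNsec() helper above, that logs both clocks around the same call so the gap becomes visible; timeBothClocks and the usage line are illustrative, not part of the original code.

// Hedged sketch: log CPU time (clock()) and wall-clock time (CLOCK_MONOTONIC)
// around the same call, to see how far apart they are when the CPU is waiting.
#include <ctime>
#include <cstdint>
#include <android/log.h>

#define APPNAME "TimingDemo"

static int64_t getTimeNsec() {
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    return (int64_t) now.tv_sec * 1000000000LL + now.tv_nsec;
}

static void timeBothClocks(void (*work)()) {
    clock_t cpuStart  = clock();
    int64_t wallStart = getTimeNsec();

    work();   // e.g. calculateUsingOpenCL from the question

    clock_t cpuStop  = clock();
    int64_t wallStop = getTimeNsec();

    __android_log_print(ANDROID_LOG_VERBOSE, APPNAME,
        "CPU time: %f ms, wall time: %f ms",
        1000.0f * (float)(cpuStop - cpuStart) / (float)CLOCKS_PER_SEC,
        (float)(wallStop - wallStart) / 1000000.0f);
}

// Usage (hypothetical): timeBothClocks(calculateUsingOpenCL);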
Related
I have a Flutter app which uses Dart FFI to connect to my custom C++ audio backend. There I allocate around 10MB of total memory for my audio buffers. Each buffer gets 10MB / 84 of that memory, since I use 84 audio players. Here is the FFI flow:
C++ bridge:
extern "C" __attribute__((visibility("default"))) __attribute__((used))
void *
loadMedia(char *filePath, int8_t *mediaLoadPointer, int64_t *currentPositionPtr, int8_t *mediaID) {
LOGD("loadMedia %s", filePath);
if (soundEngine == nullptr) {
soundEngine = new SoundEngine();
}
return soundEngine->loadMedia(filePath, mediaLoadPointer, currentPositionPtr, mediaID);
}
In my sound engine I launch a C++ thread:
void loadMedia(){
std::thread{startDecoderWorker,
buffer,
}.detach();
}
void startDecoderWorker(float* buffer){
buffer = new float[30000]; // 30000 is arbitrary; I entered a huge value just to showcase the problem. The 10MB / 84 calculation is not relevant here.
}
So here is the problem: I don't know why, but when I allocate memory with the new keyword, even inside a C++ thread, Flutter's raster thread janks and I can see that my Flutter UI drops lots of frames. This also shows in the performance overlay, which goes all red for 3 to 5 frames, each taking around 30-40 ms. Tested in profile mode.
Here is how I came to this conclusion:
If I return immediately from startDecoderWorker without running the memory allocation code, there is zero jank: everything is a smooth 60 fps and the performance overlay doesn't show any red bars.
Here are some screenshots from Profile mode:
The actual cause, after discussions in the comments of the question, is not that the memory allocation is too slow; it lies somewhere else, namely the calculations, which become heavy when the allocation is big.
For details, please refer to the comments and discussion on the question ;)
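For anyone who wants to check the same thing in their own engine, here is a rough, hypothetical sketch (startDecoderWorker is simplified, and decodeInto is a made-up stand-in for the heavy per-buffer work) that times the allocation and the follow-up calculations separately inside the detached worker:

// Sketch: time the allocation and the follow-up processing separately, to see
// which one actually correlates with the jank.
#include <thread>
#include <cstdio>
#include <cstddef>
#include <ctime>
#include <cstdint>

static int64_t nowNs() {
    timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return (int64_t)t.tv_sec * 1000000000LL + t.tv_nsec;
}

// Hypothetical stand-in for the heavy per-buffer work the real engine does.
static void decodeInto(float* buffer, size_t n) {
    for (size_t i = 0; i < n; ++i)
        buffer[i] = static_cast<float>(i) * 0.001f;
}

static void startDecoderWorker(size_t samples) {
    int64_t t0 = nowNs();
    float* buffer = new float[samples];   // the allocation under suspicion
    int64_t t1 = nowNs();
    decodeInto(buffer, samples);          // the calculations that scale with size
    int64_t t2 = nowNs();
    printf("alloc: %.2f ms, decode: %.2f ms\n",
           (t1 - t0) / 1e6, (t2 - t1) / 1e6);
    delete[] buffer;
}

void loadMediaInstrumented(size_t samples) {
    std::thread(startDecoderWorker, samples).detach();
}

If the decode number dwarfs the alloc number, the jank is coming from the work that scales with the allocation, not from new itself.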
I'm writing a simple NDK OpenSL ES audio app that records the user's touches on a virtual piano keyboard and then plays them back forever over a set loop. After much experimenting and reading, I've settled on using a separate POSIX thread running a loop to achieve this. As you can see in the code, it subtracts any processing time taken from the sleep time in order to make the interval of each loop as close to the desired sleep interval as possible (in this case 5000000 nanoseconds).
void init_timing_loop() {
pthread_t fade_in;
pthread_create(&fade_in, NULL, timing_loop, (void*)NULL);
}
void* timing_loop(void* args) {
while (1) {
clock_gettime(CLOCK_MONOTONIC, &timing.start_time_s);
tic_counter(); // simple logic gates that cycle the current tic
play_all_parts(); // for-loops through all parts and plays any notes (From an OpenSL buffer) that fall on the current tic
clock_gettime(CLOCK_MONOTONIC, &timing.finish_time_s);
timing.diff_time_s.tv_nsec = (5000000 - (timing.finish_time_s.tv_nsec - timing.start_time_s.tv_nsec));
nanosleep(&timing.diff_time_s, NULL);
}
return NULL;
}
The problem is that even though the results are better with this approach, they're still quite inconsistent; sometimes notes delay by perhaps as much as 50 ms, which makes for very wonky playback.
Is there a better way of approaching this? To debug I ran the following code:
gettimeofday(&timing.curr_time, &timing.tzp);
__android_log_print(ANDROID_LOG_DEBUG, "timing_loop", "gettimeofday: %ld %ld",
    (long) timing.curr_time.tv_sec, (long) timing.curr_time.tv_usec);
This gives a fairly consistent readout that doesn't reflect the playback inaccuracies at all. Are there other forces at work in Android preventing accurate timing? Or is OpenSL ES a potential issue? All the buffer data is loaded into memory; could there be bottlenecks there?
Happy to post more OpenSL code if needed, but at this stage I'm trying to figure out whether this thread loop is accurate or whether there's a better way to do it.
You should take the seconds into account when using clock_gettime as well: timing.start_time_s.tv_nsec can be greater than timing.finish_time_s.tv_nsec, because tv_nsec wraps back to zero whenever tv_sec increases.
timing.diff_time_s.tv_nsec =
(5000000 - (timing.finish_time_s.tv_nsec - timing.start_time_s.tv_nsec));
Try something like:
#define NS_IN_SEC 1000000000LL
(timing.finish_time_s.tv_sec * NS_IN_SEC + timing.finish_time_s.tv_nsec) -
(timing.start_time_s.tv_sec * NS_IN_SEC + timing.start_time_s.tv_nsec)
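Putting it together, here is a small self-contained sketch (variable names borrowed from the question, the actual work calls left as comments) that computes the elapsed nanoseconds across a second boundary and clamps the sleep when an iteration overruns the 5 ms budget:

#include <ctime>
#include <cstdint>

#define NS_IN_SEC 1000000000LL
#define PERIOD_NS 5000000LL   // the 5 ms loop interval from the question

static int64_t elapsed_ns(const timespec& start, const timespec& finish) {
    return (int64_t)(finish.tv_sec - start.tv_sec) * NS_IN_SEC
         + (finish.tv_nsec - start.tv_nsec);
}

void timing_iteration() {
    timespec start, finish, pause;
    clock_gettime(CLOCK_MONOTONIC, &start);

    // tic_counter(); play_all_parts();   // the work from the question

    clock_gettime(CLOCK_MONOTONIC, &finish);
    int64_t remaining = PERIOD_NS - elapsed_ns(start, finish);
    if (remaining > 0) {                  // never pass a negative value to nanosleep
        pause.tv_sec  = remaining / NS_IN_SEC;
        pause.tv_nsec = remaining % NS_IN_SEC;
        nanosleep(&pause, NULL);
    }
}

If drift over many iterations matters, sleeping until an absolute deadline with clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, ...) avoids accumulating error, though that goes beyond the original question.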
I've been porting a cross-platform C++ engine to Android and noticed that it will inexplicably (and inconsistently) block when calling pthread_mutex_lock. This engine has been working for many years on several platforms, and the problematic code hasn't changed in years, so I doubt it's a deadlock or otherwise buggy code. It must be something in my port to Android.
So far there are several places in the code that block on pthread_mutex_lock. It isn't entirely reproducible either. When it hangs, there's no suspicious output in LogCat.
I modified the mutex code like this (edited for brevity... real code checks all return values):
void MutexCreate( Mutex* m )
{
#ifdef WINDOWS
InitializeCriticalSection( m );
#else // ANDROID
pthread_mutex_init( m, NULL );
#endif
}
void MutexDestroy( Mutex* m )
{
#ifdef WINDOWS
DeleteCriticalSection( m );
#else // ANDROID
pthread_mutex_destroy( m, NULL );
#endif
}
void MutexLock( Mutex* m )
{
#ifdef WINDOWS
EnterCriticalSection( m );
#else // ANDROID
pthread_mutex_lock( m );
#endif
}
void MutexUnlock( Mutex* m )
{
#ifdef WINDOWS
LeaveCriticalSection( m );
#else // ANDROID
pthread_mutex_unlock( m );
#endif
}
I tried modifying MutexCreate to create error-checking and recursive mutexes, but it didn't matter. I wasn't getting any errors or log output either, so either my mutex code is just fine, or the errors/logs weren't being shown. How exactly does the OS notify you of bad mutex usage?
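For what it's worth, pthreads reports misuse only through return codes, not through logs or signals. Below is a minimal sketch (not the engine's real wrappers) of creating an error-checking mutex and checking those codes:

#include <pthread.h>
#include <android/log.h>

#define APPNAME "MutexDemo"

// An error-checking mutex returns EDEADLK when the owning thread relocks it
// and EPERM when a non-owner unlocks it, so every call site has to look at
// the return value; nothing is printed automatically.
void MutexCreateErrorCheck(pthread_mutex_t* m) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    int err = pthread_mutex_init(m, &attr);
    if (err != 0)
        __android_log_print(ANDROID_LOG_ERROR, APPNAME, "init failed: %d", err);
    pthread_mutexattr_destroy(&attr);
}

void MutexLockChecked(pthread_mutex_t* m) {
    int err = pthread_mutex_lock(m);
    if (err != 0)   // e.g. EDEADLK if the calling thread already owns it
        __android_log_print(ANDROID_LOG_ERROR, APPNAME, "lock failed: %d", err);
}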
The engine makes heavy use of static variables, including mutexes. I can't see how, but is that a problem? I doubt it because I modified lots of mutexes to be allocated on the heap instead, and the same behavior occurred. But that may be because I missed some static mutexes. I'm probably grasping at straws here.
I read several references including:
http://pubs.opengroup.org/onlinepubs/7908799/xsh/pthread_mutex_init.html
http://www.embedded-linux.co.uk/tutorial/mutex_mutandis
http://linux.die.net/man/3/pthread_mutex_init
Android NDK Mutex
Android NDK problem pthread_mutex_unlock issue
The "errorcheck" mutexes will check a couple of things (like attempts to use a non-recursive mutex recursively) but nothing spectacular.
You said "real code checks all return values", so presumably your code explodes if any pthread call returns a nonzero value. (Not sure why your pthread_mutex_destroy takes two args; assuming copy & paste error.)
The pthread code is widely used within Android and has no known hangups, so the issue is not likely in the pthread implementation itself.
The current implementation of mutexes fits in 32 bits, so if you print the mutex value as an integer (e.g. *(uint32_t *) mut) you should be able to figure out what state it's in (technically, what state it was in at some point in the past). The definition in bionic/libc/bionic/pthread.c is:
/* a mutex is implemented as a 32-bit integer holding the following fields
*
* bits: name description
* 31-16 tid owner thread's kernel id (recursive and errorcheck only)
* 15-14 type mutex type
* 13 shared process-shared flag
* 12-2 counter counter of recursive mutexes
* 1-0 state lock state (0, 1 or 2)
*/
"Fast" mutexes have a type of 0, and don't set the tid field. In fact, a generic mutex will have a value of 0 (not held), 1 (held), or 2 (held, with contention). If you ever see a fast mutex whose value is not one of those, chances are something came along and stomped on it.
It also means that, if you configure your program to use recursive mutexes, you can see which thread holds the mutex by pulling the bits out (either by printing the mutex value when trylock indicates you're about to stall, or dumping state with gdb on a hung process). That, plus the output of ps -t, will let you know if the thread that locked the mutex still exists.
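Based on the bit layout quoted above, here is a small hedged decoder one could drop into logging code; it assumes the mutex really is the single 32-bit word used by bionic of that era:

#include <cstdint>
#include <cstdio>

// Decode the 32-bit bionic mutex word per the comment block quoted above.
// Only valid for the old single-word implementation in bionic/libc/bionic/pthread.c.
void dump_mutex_word(uint32_t v) {
    uint32_t state   =  v        & 0x3;     // bits 1-0:   0 free, 1 held, 2 contended
    uint32_t counter = (v >> 2)  & 0x7ff;   // bits 12-2:  recursion counter
    uint32_t shared  = (v >> 13) & 0x1;     // bit 13:     process-shared flag
    uint32_t type    = (v >> 14) & 0x3;     // bits 15-14: mutex type
    uint32_t tid     =  v >> 16;            // bits 31-16: owner tid (recursive/errorcheck only)
    printf("state=%u counter=%u shared=%u type=%u tid=%u\n",
           state, counter, shared, type, tid);
}

// Usage (assuming m points at an old-style bionic mutex):
//   dump_mutex_word(*(uint32_t*) m);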
Update
After checking the time resolution, we tried to debug the problem in kernel space.
unsigned long long task_sched_runtime(struct task_struct *p)
{
unsigned long flags;
struct rq *rq;
u64 ns = 0;
rq = task_rq_lock(p, &flags);
ns = p->se.sum_exec_runtime + do_task_delta_exec(p, rq);
task_rq_unlock(rq, &flags);
//printk("task_sched runtime\n");
return ns;
}
Our new experiment shows that p->se.sum_exec_runtime is not updated instantly, but if we add a printk() inside the function, the time is updated instantly.
Old
We are developing an Android program.
However, the time measured by the function threadCpuTimeNanos() is not always correct on our platform.
After experimenting, we found that the time returned from clock_gettime is not updated instantly.
Even after several while loop iterations, the time we get still doesn't change.
Here's our sample code:
while(1)
{
test = 1;
test = clock_gettime(CLOCK_THREAD_CPUTIME_ID, &now);
printf(" clock gettime test 1 %lx, %lx , ret = %d\n",now.tv_sec , now.tv_nsec,test );
pre = now.tv_nsec;
sleep(1);
}
This code runs okay on an x86 PC, but it does not run correctly on our embedded platform, an ARM Cortex-A9 with kernel 2.6.35.13.
Any ideas?
I changed clock_gettime to use CLOCK_MONOTONIC_RAW, pinned the thread to one CPU, and now I get different values.
I am also working with a dual-core Cortex-A9.
while(1)
{
test = 1;
test = clock_gettime(CLOCK_MONOTONIC_RAW, &now);
printf(" clock gettime test 1 %lx, %lx , ret = %d\n",now.tv_sec , now.tv_nsec, test );
pre = now.tv_nsec;
sleep(1);
}
The resolution of clock_gettime is platform dependent. Use clock_getres() to find the resolution on your platform. According to the results of your experiment, the clock resolutions on the x86 PC and on your target platform are different.
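A quick way to compare the two platforms is to print the resolution each clock advertises; a minimal sketch using clock_getres():

#include <cstdio>
#include <ctime>

// Print the resolution a platform advertises for a given clock.
static void print_res(const char* name, clockid_t id) {
    timespec res;
    if (clock_getres(id, &res) == 0)
        printf("%s resolution: %ld s %ld ns\n", name, (long) res.tv_sec, res.tv_nsec);
    else
        perror(name);
}

int main() {
    print_res("CLOCK_MONOTONIC",         CLOCK_MONOTONIC);
    print_res("CLOCK_THREAD_CPUTIME_ID", CLOCK_THREAD_CPUTIME_ID);
    return 0;
}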
In the Android CTS, there is a test case that hits the same problem: it reads the timer twice but gets identical values, and fails with:
testThreadCpuTimeNanos fail junit.framework.AssertionFailedError at
android.os.cts.DebugTest.testThreadCpuTimeNanos
$man clock_gettime
...
Note for SMP systems
The CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID clocks are realized on many platforms using timers from the CPUs (TSC on i386, AR.ITC on Itanium). These registers may differ between CPUs and as a consequence these clocks may return bogus results if a process is migrated to another CPU.
If the CPUs in an SMP system have different clock sources then there is no way to maintain a correlation between the timer registers since each CPU will run at a slightly different frequency. If that is the case then clock_getcpuclockid(0) will return ENOENT to signify this condition. The two clocks will then only be useful if it can be ensured that a process stays on a certain CPU.
The processors in an SMP system do not start all at exactly the same time and therefore the timer registers are typically running at an offset. Some architectures include code that attempts to limit these offsets on bootup. However, the code cannot guarantee to accurately tune the offsets. Glibc contains no provisions to deal with these offsets (unlike the Linux Kernel). Typically these offsets are small and therefore the effects may be negligible in most cases.
The CLOCK_THREAD_CPUTIME_ID clock measures CPU time spent, not real time, and you're spending almost zero CPU time. Also, CLOCK_THREAD_CPUTIME_ID (the thread-specific CPU time) is implemented incorrectly on Linux/glibc and likely does not work at all there; CLOCK_PROCESS_CPUTIME_ID, or whatever that one's called, should work better.
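To make the distinction concrete, here is a small hedged demo (not from the question) that samples CLOCK_THREAD_CPUTIME_ID around a sleep and around a busy loop; the sleep adds essentially no CPU time, while the busy loop does:

#include <cstdio>
#include <ctime>
#include <unistd.h>

// Thread CPU time in nanoseconds.
static long long thread_cpu_ns() {
    timespec t;
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t);
    return (long long) t.tv_sec * 1000000000LL + t.tv_nsec;
}

int main() {
    long long before = thread_cpu_ns();
    sleep(1);                                    // wall time passes, CPU time barely moves
    printf("after sleep:     +%lld ns CPU\n", thread_cpu_ns() - before);

    before = thread_cpu_ns();
    volatile unsigned long spin = 0;
    for (unsigned long i = 0; i < 100000000UL; ++i)
        spin += i;                               // burn CPU, so the thread clock advances
    printf("after busy loop: +%lld ns CPU\n", thread_cpu_ns() - before);
    return 0;
}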
We ( http://www.mosync.com ) have compiled our ARM recompiler, which takes our internal byte code and generates ARM machine code, with the Android NDK. When executing recompiled code we see an enormous increase in performance, with one small exception: we can't use any Java Bitmap operations.
The native system uses a function which takes care of all the calls to the Java side that the recompiled code makes. On the Java (Dalvik) side we then have bindings to Android features. There are no problems while recompiling the code or when executing the machine code, and the exact same source code works on Symbian and Windows Mobile 6.x, so the recompiler seems to generate correct ARM machine code.
Like I said, the problem is that we can't use Java Bitmap objects. We have verified that the parameters sent from the Java code are correct, and we have tried following the execution down into Android's own JNI system. The problem is that we get an UnsupportedOperationException with "size must fit in 32 bits." The problem seems consistent from Android 1.5 to 2.3; we haven't tried the recompiler on any Android 3 devices.
Is this a bug that other people have encountered? I guess other developers have done similar things.
I found the message in dalvik_system_VMRuntime.c:
/*
* public native boolean trackExternalAllocation(long size)
*
* Asks the VM if <size> bytes can be allocated in an external heap.
* This information may be used to limit the amount of memory available
* to Dalvik threads. Returns false if the VM would rather that the caller
* did not allocate that much memory. If the call returns false, the VM
* will not update its internal counts.
*/
static void Dalvik_dalvik_system_VMRuntime_trackExternalAllocation(
const u4* args, JValue* pResult)
{
s8 longSize = GET_ARG_LONG(args, 1);
/* Fit in 32 bits. */
if (longSize < 0) {
dvmThrowException("Ljava/lang/IllegalArgumentException;",
"size must be positive");
RETURN_VOID();
} else if (longSize > INT_MAX) {
dvmThrowException("Ljava/lang/UnsupportedOperationException;",
"size must fit in 32 bits");
RETURN_VOID();
}
RETURN_BOOLEAN(dvmTrackExternalAllocation((size_t)longSize));
}
This method is called, for example, from GraphicsJNI::setJavaPixelRef:
size_t size = size64.get32();
jlong jsize = size; // the VM wants longs for the size
if (reportSizeToVM) {
// SkDebugf("-------------- inform VM we've allocated %d bytes\n", size);
bool r = env->CallBooleanMethod(gVMRuntime_singleton,
gVMRuntime_trackExternalAllocationMethodID,
jsize);
I would say it seems that the code you're calling is trying to allocate a size that is too big to fit in 32 bits. If you show the actual Java call that fails and the values of all the arguments you pass to it, it might be easier to find the reason.
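One hedged way to confirm that from the native side is to compute and log, in 64 bits, the size that will be reported to the VM before asking Java to create the bitmap; the parameter names and the bytes-per-pixel value below are illustrative, not the recompiler's actual values:

#include <cstdint>
#include <climits>
#include <android/log.h>

#define APPNAME "BitmapCheck"

// Mirrors the Dalvik check quoted above: the external-allocation size must be
// positive and no larger than INT_MAX, or trackExternalAllocation throws.
bool bitmapSizeLooksSane(int64_t width, int64_t height, int64_t bytesPerPixel) {
    int64_t size = width * height * bytesPerPixel;   // compute in 64 bits first
    __android_log_print(ANDROID_LOG_DEBUG, APPNAME,
                        "requested bitmap size: %lld bytes", (long long) size);
    return size > 0 && size <= INT_MAX;
}

Per the Dalvik code quoted above, a negative size raises IllegalArgumentException and anything above INT_MAX raises the "size must fit in 32 bits" UnsupportedOperationException.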
I managed to find a work-around: when I wrap all the Bitmap.createBitmap calls inside Activity.runOnUiThread(), it works.