On a recent SO question, I explained how calling a RenderScript kernel multiple times will effectively force all threads to be globally synchronized between calls.
I am currently working with multiple convolutions applied in sequence to image data. Since the convolution algorithm requires reading surrounding pixel data of the input image, I have implemented a workflow where my own custom kernel is called multiple times -- to make sure that at every step, all data from the previous convolution is ready and available at the correct coordinates. This technique has worked great for me so far.
However, in my constant quest for optimization, I have noticed that there is much performance to be gained by keeping intermediate values in a thread's local registers, instead of writing them back to the global memory allocation between kernel calls. If I were able to chain these convolutions in such a way, things would run much quicker. The problem, obviously, is that accessing the registers of surrounding threads is not really possible. Furthermore, this would require threads to run in sync to make sure the intermediate values between stages get calculated in the expected order.
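For reference, this is roughly what my current multi-pass setup looks like. It's only a sketch: the generated ScriptC_convolve class, its root kernel, and the gIn global it reads neighbouring pixels from are names from my own code, not anything standard.

RenderScript rs = RenderScript.create(context);
ScriptC_convolve script = new ScriptC_convolve(rs);

Allocation aIn = Allocation.createFromBitmap(rs, inputBitmap);
Allocation aOut = Allocation.createTyped(rs, aIn.getType());

for (int pass = 0; pass < NUM_PASSES; pass++) {
    script.set_gIn(aIn);             // the kernel samples neighbours via rsGetElementAt(gIn, ...)
    script.forEach_root(aIn, aOut);  // successive forEach calls on one context run in order,
                                     // so each pass sees the previous pass's complete output
    Allocation tmp = aIn;            // ping-pong the allocations for the next pass
    aIn = aOut;
    aOut = tmp;
}
aIn.copyTo(outputBitmap);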
In CUDA and OpenCL, these issues are very common, and are addressed by the well-known barrier synchronization + shared memory tiling techniques, which in turn depend on the concept of CUDA thread blocks or OpenCL work groups. I believe these concepts are non-existent in RenderScript, as this issue is very much tied to the wildly different architectures of desktop-class GPUs versus mobile SoCs.
So my obvious question here is, are such things possible in RenderScript? That is, better management of threads and possibly thread groups for quicker data sharing among them.
In the Google I/O 2013 RenderScript talk by Jason Sams and Tim Murray, it is mentioned that Script Groups may be able to perform some behind-the-scenes optimizations, such as cross-device parallelization, memory tiling, and kernel fusion, all by analyzing the dependency DAG in the group at runtime and either automatically creating allocations where needed or possibly optimizing them away. I'm assuming this last bit refers to fusing kernels so that they work off their own local data, similar to what I described above about keeping data in local registers and combining separate steps inside a single kernel.
All this seems very much in line with what I'm looking for, especially since my application is indeed a well-defined DAG of inter-dependent operations (for a Convolutional Neural Network). So if Script Groups are indeed a plausible mobile-centric alternative to these mechanisms, I'm wondering whether there is any way of influencing how and where these optimizations happen. Or, if not, how much can the runtime be trusted to make the correct inference from my data dependencies, given the hardware it's running on -- specifically regarding the "surrounding" pixel data access of the convolution algorithm?
I realize this might all still be work in progress, and methods will likely be highly hardware dependent at this point. So if there is no straight solution for such matters at the present time, I'd be very much willing to accept a speculative answer on how this kind of workflow might be approached by RenderScript in future releases.
I'd be immensely grateful for some insight on this, as it would greatly affect the development direction of my own project going forward, not to mention that there are surely many other people out there wondering how such general parallel computing tasks can be handled in RS.
Thank you very much!
As you've discovered, there's no way in RS to directly share data across threads. However, what you are describing can be done using a ScriptGroup. The catch is that each script in the group has to be unique, so you cannot feed the same script over and over -- at least, not as it is written now. You could certainly put the "core" of your script in an RS header and include it from multiple kernels.

The ScriptGroup allows you to have the output of one script become the input of another, or the output of one script become a global field in another. The documentation states that kernel to kernel (output to input) is the more efficient use case. Using this approach, your synchronization issue is resolved, as the engine will execute the first script against the entire input data set before starting the second script, and so on. The scripts themselves will be parallelized appropriately for the hardware (using either the CPU or a GPU/DSP). The engine does not have to pop back out to Java between scripts and can also manage the data allocations behind the scenes, if needed.
Something you may notice is that ScriptGroup uses Script.KernelID or Script.FieldID to identify the exact kernel or field with which to connect two kernels. Your custom scripts have these auto-generated as long as you explicitly mark your kernel function with the RS compiler kernel attribute. You can then call getKernelID_<name> (where 'name' is the kernel function name from your script) to get the kernel ID.
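As a rough sketch of how those pieces fit together (the script and kernel names here -- ScriptC_stage1 and ScriptC_stage2, each with a kernel called convolve -- are invented for illustration, and aIn/aOut are your existing input and output allocations):

ScriptC_stage1 s1 = new ScriptC_stage1(rs);
ScriptC_stage2 s2 = new ScriptC_stage2(rs);
Type connType = aIn.getType();   // Type of the intermediate data between the two stages

ScriptGroup.Builder builder = new ScriptGroup.Builder(rs);
builder.addKernel(s1.getKernelID_convolve());
builder.addKernel(s2.getKernelID_convolve());
// stage1's output feeds stage2's input -- the kernel-to-kernel case mentioned above
builder.addConnection(connType, s1.getKernelID_convolve(), s2.getKernelID_convolve());
ScriptGroup group = builder.create();

group.setInput(s1.getKernelID_convolve(), aIn);
group.setOutput(s2.getKernelID_convolve(), aOut);
group.execute();   // the engine runs the whole chain without popping back out to Java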
I have a C library which I'm cross-compiling to use in Android & iOS apps.
It makes use of memcpy() and mktime() so I want to know if these functions are implicitly thread-safe when used in multi-threaded environments.
iOS apps built with modern Xcode and Android libraries built with the modern Android NDK use Clang, an LLVM-based compiler.
I've reviewed the following questions, but have been unable to find a definitive answer:
Is memcpy process-safe?
Are functions in the C standard library thread safe?
POSIX requires of conforming implementations that all functions it standardizes be thread safe, with the exception of a relatively short list of functions. memcpy() and mktime() are both covered by POSIX, and neither is on the list of exceptions, so POSIX requires them to be thread safe (but read on).
Note well, however, that this is not a matter of the compiler used, but rather of the C library that supports your application. I recall Apple's C libraries being non-conforming in some areas. Nevertheless, there's nothing in particular about memcpy() and mktime() that makes them inherently risky from a thread safety perspective. That is, there's no reason to expect that they access any shared data, except any provided to them via their arguments.
And there's the rub. You can rely on memcpy() and mktime() not to, say, rely internally on static data, but POSIX's requirement for thread safety does not extend to working as documented in the face of data races you create through choice of arguments. Thus, for example, if two different threads call memcpy(), and the target region of one call overlaps either the source or target region of the other, then you need some flavor of synchronization between the threads.
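To make the overlapping-arguments point concrete, here is a small analogue in Java, with System.arraycopy standing in for memcpy() (the class name and buffer sizes are invented for illustration; the same reasoning applies to the C call):

import java.util.Arrays;

public class OverlapRace {
    public static void main(String[] args) throws InterruptedException {
        final byte[] shared = new byte[1024];
        final byte[] srcA = new byte[768];
        final byte[] srcB = new byte[768];
        Arrays.fill(srcA, (byte) 'A');
        Arrays.fill(srcB, (byte) 'B');

        // Thread 1 copies into bytes 0..767, thread 2 into bytes 256..1023,
        // so their destination regions overlap in bytes 256..767.
        Thread t1 = new Thread(new Runnable() {
            public void run() { System.arraycopy(srcA, 0, shared, 0, srcA.length); }
        });
        Thread t2 = new Thread(new Runnable() {
            public void run() { System.arraycopy(srcB, 0, shared, 256, srcB.length); }
        });
        t1.start(); t2.start();
        t1.join(); t2.join();

        // Each copy was individually fine, but the overlapping region now holds
        // an unpredictable mix of 'A' and 'B' bytes -- the race is yours to prevent.
        System.out.println((char) shared[300] + " " + (char) shared[600]);
    }
}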
The question of whether memcpy() is thread-safe is debatable.
I would say that memcpy() is indeed thread-safe. It doesn't rely on any (global) state that could be corrupted by multiple instances of memcpy() running at once. However, that doesn't mean there is some magic preventing a memory area that is concurrently the copy destination of multiple threads calling memcpy() from getting badly mangled; the copy process as a whole is not atomic. You would have to take care of that yourself, using mutexes to ensure atomicity.
mktime() is trivially thread-safe, since it doesn't use static buffers, global state, or the like. The man page mentions a few functions from that family as not being thread-safe (those have corresponding *_r functions), but mktime() is not among them.
So I'm trying to write some low-level code for Android, and my main concern is that I want to avoid ALL optimization by the JIT compiler (or anything else). After doing some research, the best approach seems to be to:
write Java bytecode by hand
convert it to a dex file using the "dx" command
run it on the device using the "dalvikvm" command (via adb shell) with the "-Xverify:none -Xdexopt:none" parameters specified
My question is: will this in fact avoid ALL optimization? The previous discussion here https://groups.google.com/forum/#!topic/android-platform/Y-pzP9z6xLw makes me unsure, and I can't 100% convince myself by reading the docs.
Any confirmation one way or the other is greatly appreciated.
Some of the instruction rewriting performed by dexopt cannot be disabled. For example, accesses to volatile long fields must be handled differently from access to long fields, and the specialization is handled by replacing the field-get instruction with a different instruction.
The optimizations performed by dexopt take the form of instruction replacement, usually some sort of "quickening" that allows the VM to do a little less work. All such optimizations are performed statically, ahead of time, not dynamically at run time, so you will get consistent behavior. Enabling the dexopt optimizations doesn't introduce unknowns, it just changes from one set of knowns to a different set of knowns.
The biggest source of variation is going to be Dalvik's JIT compiler, which you can disable with -Xint:fast. See this slightly outdated doc for notes on how to configure this system-wide.
Can someone explain why ashmem was created?
I'm browsing through mm/ashmem.c right now. As near as I can tell, the kernel is thinking of ashmem as file-backed memory that can be mmap'd. But then, why go to the trouble of implementing ashmem? It seems like the same functionality could be achieved by mounting a RAM fs and then using filemap/mmap to share memory.
I'm sure that ashmem can do more fancy stuff -- from looking at the code, it seems to have something to do with pinning/unpinning pages?
Ashmem allows processes that are not related by ancestry to share memory maps by name, and those maps are cleaned up automatically.
Plain old anonymous mmaps and System V shared memory lack some of these requirements.
System V shared memory segments stick around when no longer referenced by running programs (which is sometimes a feature, sometimes a nuisance).
Anonymous shared mmaps can be passed from a parent to child processes, which is inflexible since sometimes you want processes not related that way to share memory.
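At the SDK level, the closest thing an ordinary app gets to ashmem is android.os.MemoryFile, which is a thin Java wrapper around an ashmem region (sharing it with another process means passing the underlying file descriptor over Binder, which the public API doesn't really expose). A minimal single-process sketch, with a made-up region name and size:

import android.os.MemoryFile;
import java.io.IOException;

void ashmemSketch() throws IOException {
    MemoryFile region = new MemoryFile("demo-region", 4096);  // backed by an ashmem region

    byte[] payload = "hello".getBytes();
    region.writeBytes(payload, 0, 0, payload.length);   // write into the region

    byte[] readBack = new byte[payload.length];
    region.readBytes(readBack, 0, 0, readBack.length);  // read it back

    region.close();  // the region is reclaimed once nothing references it any more
}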
Can someone explain why ashmem was created?
David Turner (a regular on Android NDK) answered this in Why was bionic/libc/include/sys/shm.h removed?:
... System V IPCs have been removed for cupcake. See bionic/libc/docs/SYSV-IPC.TXT for details.

In brief, System V IPCs are leaky by design and do not play well in Android's runtime environment where killing processes to make room for other ones is just normal and very common. The end result is that any code that relies on these IPCs could end up filling up the kernel's internal table of SysV IPC keys, something that can only safely be resolved by a reboot.

We want to provide alternative mechanism in the future that don't have the same problems. One thing we provide at the moment is ashmem, which was designed specifically for Android to avoid that kind of problem (though it's not as well documented as it should). We probably need something similar for semaphores and/or message queues.
I'm trying to follow Android best practices, so in debug mode I turn all the following on:
StrictMode.setThreadPolicy(new StrictMode.ThreadPolicy.Builder().detectAll().penaltyLog().build()); //detect and log all thread violations
StrictMode.setVmPolicy(new StrictMode.VmPolicy.Builder().detectAll().penaltyLog().build()); //detect and log all virtual machine violations
Android now yells at me when I try to use any sort of file access or SQL in the main (UI) thread. But I see so many recommendations to use file access and/or SQL in the main thread. For example, the main activity should load default preference values inside onCreate() in case they haven't been set yet:
PreferenceManager.setDefaultValues(context, resId, readAgain);
Oops---that results in a file access on the first application execution, because onCreate() is called on the UI thread. The only way around it I can see is to start a separate thread---which introduces a race condition with other UI code that might read the preferences and expect the default values to already be set.
Think also of services such as the DownloadManager. (Actually, it's so buggy that it's useless in real life, but let's pretend it works for a second.) If you queue up a download, you get an event (on the main thread) telling you a download has finished. To actually get information about that download (it only gives you a download ID), you have to query the DownloadManager---which involves a cursor, giving you an error if you have a strict policy turned on.
So what's the story---is it fine to access cursors in the main thread? Or is it a bad thing, and half the Android development team and Android book authors forgot about that?
The only way around it I can see is to start a separate thread---which introduces a race condition with other UI code that might read the preferences and expect the default values to already be set.
Then use an AsyncTask, putting the setDefaultValues() call in doInBackground() and the "other UI code that might read the preferences" in onPostExecute().
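A minimal sketch of that arrangement (R.xml.preferences is a placeholder for your own preference resource, and context is your Activity or Application context):

new AsyncTask<Void, Void, Void>() {
    @Override
    protected Void doInBackground(Void... unused) {
        // disk I/O happens off the main thread, so StrictMode stays quiet
        PreferenceManager.setDefaultValues(context, R.xml.preferences, false);
        return null;
    }

    @Override
    protected void onPostExecute(Void unused) {
        // back on the main thread, and the defaults are guaranteed to be set
        SharedPreferences prefs = PreferenceManager.getDefaultSharedPreferences(context);
        // ...the "other UI code that might read the preferences" goes here...
    }
}.execute();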
To actually get information about that download (it only gives you a download ID), you have to query the DownloadManager---which involves a cursor, giving you an error if you have a strict policy turned on.
So query the DownloadManager in a background thread.
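For instance, something along these lines (downloadId being the ID the completion broadcast handed you):

// e.g., in doInBackground() of an AsyncTask, or any other non-UI thread
DownloadManager dm = (DownloadManager) context.getSystemService(Context.DOWNLOAD_SERVICE);
Cursor c = dm.query(new DownloadManager.Query().setFilterById(downloadId));
try {
    if (c.moveToFirst()) {
        int status = c.getInt(c.getColumnIndex(DownloadManager.COLUMN_STATUS));
        String localUri = c.getString(c.getColumnIndex(DownloadManager.COLUMN_LOCAL_URI));
        // hand the results back to the main thread (onPostExecute(), a Handler, etc.)
    }
} finally {
    c.close();
}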
So what's the story---is it fine to access cursors in the main thread?
That depends on your definition of "fine".
On Android 1.x and most 2.x devices, the filesystem used is YAFFS2, which basically serializes all disk access across all processes. The net effect is that while your code may appear sufficiently performant in isolation, it appears sluggish at times in production because of other things going on in the background (e.g., downloading new email).
While this is a bit less of an issue in Android 3.x and above (they switched to ext4), there's no question that flash I/O is still relatively slow -- it will just be a bit more predictably slow.
StrictMode is designed to point out where sluggishness may occur. It is up to you to determine which instances are benign and which are not. In an ideal world, you'd clean them all up; in an ideal world, I'd have hair.
Or is it a bad thing, and half the Android development team and Android book authors forgot about that?
It's always been a "bad thing".
I cannot speak for "half the Android development team". I presume that, early on, they expected developers to apply their existing development expertise to detect sluggish behavior -- this is not significantly different from performance issues on any other platform. Over time, they have been offering more patterns to steer developers down a positive path (e.g., the Loader framework), in addition to system-level changes (e.g., YAFFS2 -> ext4) to make this less of a problem. In part, they are trying to address places where Android introduces distinct performance-related challenges, such as the single-threaded UI.
Similarly, I cannot speak for all Android book authors. I certainly didn't focus on performance issues in early editions of my books, as I was focusing on Android features and functions. Over time, I have added more advice in these areas. I have also contributed open source code related to these topics. In 2012, I'll be making massive revisions to my books, and creating more open source projects, to continue addressing these issues. I suspect, given your tone, that I (and probably others) are complete failures in your eyes in this regard, and you are certainly welcome to your opinion.
A coworker and I were talking (after a fashion) about an article I read (HTC permission security risk). Basically, the argument came down to whether or not it was possible to log every action that an application was doing. Then someone (an abstract theoretical person) would go through and see if the app was doing what it was supposed to do and not trying to be all malicious like.
I have been programming in Android for a year now, and as far as I know if -- if -- that was possible, you would have to hack Dalvik and output what each process was doing. Even if you were to do that, I think it would be completely indecipherable because of the sheer amount of stuff each process was doing.
Can I get some input one way or the other? Is it completely impractical to even attempt to log what a foreign application is doing?
I have been programming in Android for a year now, and as far as I know if -- if -- that was possible, you would have to hack Dalvik and output what each process was doing.
Not so much "hack Dalvik" as "hack the android.* class library", and perhaps a few other things (e.g., java.net).
Even if you were to do that, I think it would be completely indecipherable because of the sheer amount of stuff each process was doing.
You might be able to do some fancy pattern matching or something on the output -- given that you have determined patterns of inappropriate actions. Of course, there is also the small matter of having to manually test the app (to generate the output).
Is it completely impractical to even attempt to log what a foreign application is doing?
From an SDK app? I damn well hope so.
From a device running a modded firmware with the aforementioned changes? I'd say it is impractical unless you have a fairly decent-sized development team, at which point it is merely expensive.
This is both possible and practical if you are compiling your own ROM. Android is based on Linux, and I know of several projects like this for Linux, such as the Linux Trace Toolkit. I also know of research into visualizing the results and into detecting malicious apps from those results as well.
Another thing functionality like this is often used for is performance and reliability monitoring. You can read about the DTrace functionality in Solaris to learn more about how this sort of thing is used in business rather than academia.