ARM multi-core penalty for Java programs - android

I wonder if there is a penalty for running Dalvik+JIT on a multi-core ARM chip vs. a single-core chip.
E.g., if I disable multi-core support in my Android system build and run the whole phone on a single CPU core, will I get higher performance when running a single-threaded Java benchmark?
How much do memory barriers and synchronization cost on multi-core?
I am asking because I vaguely remember seeing single-threaded benchmark scores from single-core phones vs. dual-core phones. As long as the clock speed (MHz) is about the same, there is no big difference between the two phones. I had expected a slowdown on the dual-core phone.

The simple answer is "why don't you try it and find out?"
The complex answer is this:
There are costs to multicore synchronization, but there are also benefits to having multiple cores. You can undoubtedly devise a pathological case where a program suffers so much from the overhead of synchronization primitives that it is deeply affected by their performance. This is usually caused by locking at too fine a granularity (e.g., inside your inner loop). But in the general case, the fact that the dozen other system programs can get CPU time on the other cores, and that the kernel can service interrupts and I/O there instead of interrupting your process, is likely to greatly outweigh the penalty incurred by MP synchronization.
In answer to your question, a DSB can take dozens or hundreds of cycles, and a DMB is likely more costly. Depending on the implementation, exclusive load/store instructions can be very fast or very slow. WFE can consume several microseconds, though it shouldn't be needed if you are not experiencing contention.
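If you want a rough feel for the cost at the Java level rather than raw DSB/DMB timings, you can compare plain field increments against volatile ones, since volatile writes are where the VM must emit barriers on an SMP build. This is only a sketch: the ratio depends heavily on the core, the VM, and whether it was built for SMP, and micro-benchmarks like this are easily distorted by the JIT.

```java
public class BarrierCost {
    static int plain;
    static volatile int vol;

    // Time n increments of a plain int field.
    static long timePlain(int n) {
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) plain++;
        return System.nanoTime() - start;
    }

    // Time n increments of a volatile int field; each write
    // must honor the ordering the Java Memory Model requires.
    static long timeVolatile(int n) {
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) vol++;
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int n = 10_000_000;
        timePlain(n);
        timeVolatile(n); // warm-up so the JIT compiles both loops
        System.out.println("plain:    " + timePlain(n) + " ns");
        System.out.println("volatile: " + timeVolatile(n) + " ns");
        // Read the fields so the plain loop is not dead-code eliminated.
        System.out.println(plain + " " + vol);
    }
}
```

On a single-core build the volatile loop can compile to nearly the same code as the plain one; on an SMP build the gap is where the barrier cost shows up.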

Background: http://developer.android.com/training/articles/smp.html
Dalvik built for SMP does have additional overhead. The Java Memory Model requires that certain guarantees be enforced, which means issuing additional memory barriers, particularly when dealing with volatile fields and immutable objects.
Whether or not the added overhead will be noticeable depends on what exactly you're doing and what device you're on, but generally speaking it's unlikely you'll notice it unless you're running a targeted benchmark.
If you build for UP and run Dalvik on a device with multiple cores, you may see flaky behavior -- see the "SMP failure example" appendix in the doc referenced above.

Related

Renderscript: Restrictions on writes to global variables for gpu compute

I am using a Nexus 10 with Android 4.4. I see that if I have writes to global variables in the script, then the script is executed on the CPU instead of the GPU. I can see this from the Mali driver prints in logcat.
I read somewhere that this limitation will go away in the future. I was hoping 4.4 would remove it. Does anyone know more about why this limitation exists and when it might go away?
This limitation appears overly restrictive. For instance, I am using an intermediate allocation as a global variable between kernels in a script group, and my script guarantees that the kernels write to different locations in the allocation. Due to this restriction, my script now falls back to the CPU, which causes significant performance delays in at least a few cases. For instance, the performance loss is significant if one uses the cosine or pow functions in a kernel; the CPU does a far worse job than the GPU on these functions.

typical android cpu split between app and kernel

What is the typical split between kernel CPU time and user-mode CPU time on an Android device while executing a typical CPU-bound application?
Assume a typical dual-core ARM Android phone executing a common app, and not waiting for I/O from the user or the network.
Even more helpful would be any data on the split between user-mode time spent in system libraries and time spent in the app's own code.
(I realize this is a very subjective question, complicated by the VM/JIT and other factors, but any pointers (ha!) would be helpful.)
Well, it really depends on the application. In an application that is I/O bound, the time will be spent in syscalls like read and write. In an application that is compute bound, the CPU time will be almost all userland. In an application that is RAM bound (doing a lot of manipulation of data in RAM), the CPU will spend most of its time waiting for RAM because of cache misses (I don't think ARM processors have very large caches).
On the other hand, if your app does a lot of UI work, then even though the graphics processing happens in userland, there are still plenty of I/O-bound operations waiting on the framebuffer and input devices.
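You can measure the user/kernel split for your own code directly. A minimal sketch using the standard `java.lang.management` API, assuming you are on a JVM that supports thread CPU timing (Android itself may not expose this bean, in which case you would parse the utime/stime fields of `/proc/self/stat` instead):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class CpuSplit {
    // Returns {totalCpuNs, userCpuNs} for the current thread,
    // or {-1, -1} if the JVM does not support the measurement.
    static long[] threadCpu() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        if (!bean.isCurrentThreadCpuTimeSupported()) {
            return new long[] { -1, -1 };
        }
        return new long[] { bean.getCurrentThreadCpuTime(),
                            bean.getCurrentThreadUserTime() };
    }

    public static void main(String[] args) {
        // Burn some user-mode CPU so there is something to measure.
        long acc = 0;
        for (int i = 0; i < 50_000_000; i++) acc += i;

        long[] t = threadCpu();
        long kernel = t[0] - t[1]; // total minus user = time in the kernel
        System.out.println("accumulator: " + acc); // keep the loop alive
        System.out.println("user: " + t[1] + " ns, kernel: " + kernel + " ns");
    }
}
```

For a pure compute loop like this, the kernel share should come out close to zero, matching the answer above.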

GPU clock speed in Android

I am trying to find the GPU clock speed in Android.
So far, no luck. Is that possible at all? I cannot find any instructions for getting the hardware clock speed.
Android does not provide APIs for low level interaction with the GPU. Depending on the meaning of "Android" it is not entirely clear that there has to even be a GPU - the emulator would be a common example of something that does not, and basic ports to various development boards could be another.
It is possible, though sadly unlikely, that a given device vendor might choose to publicize some low-level programming information. Unfortunately, details of how to work with the GPU tend to be things that they hold quite closely and refuse to disclose - they argue it would give an advantage to their competitors - perhaps, but what it clearly does is prevent open source implementations of accelerated graphics drivers.
Even beyond the availability of information, there is the issue of access permission. The graphics hardware in Android is owned by system components such as surfaceflinger, and on secured devices not really made available for direct interaction by 3rd party application code.
Ultimately, though, even if you could find a number, it would not mean much. The clock speed of the internal engine does not tell you the number of clock cycles needed to complete an operation, the number of parallel operations that can be in flight, the delays encountered in moving data to/from memory, what caches are available, the efficiency of the algorithms, etc. You might be better off running a benchmark of the operations you actually care about.
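A benchmark along those lines can be as simple as timing the operation you care about with `System.nanoTime()`. A minimal sketch in plain Java; the workload here is an arbitrary placeholder, and on a real device you would time the actual GPU-backed operation (e.g., rendering a frame):

```java
public class ThroughputBench {
    // Stand-in workload; substitute the operation you want to measure.
    static double work(int n) {
        double x = 0;
        for (int i = 1; i <= n; i++) x += Math.sqrt(i);
        return x;
    }

    public static void main(String[] args) {
        work(1_000_000); // warm-up so the JIT compiles the loop first

        int n = 10_000_000;
        long start = System.nanoTime();
        double result = work(n);
        long elapsed = System.nanoTime() - start;

        System.out.println("result " + result + " in " + elapsed + " ns");
        System.out.println(((long) n * 1_000_000_000L / Math.max(elapsed, 1))
                + " ops/sec");
    }
}
```

A throughput number like this tells you something actionable about the device, which a raw clock frequency would not.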

apps that uses dual core processor

There are dual-core and now quad-core phones on the market. However, I really don't know what kind of apps truly make use of this feature. Can anyone provide some information on apps that can really use the power of dual or quad cores in mobile devices?
The idea of having dual, quad, or more cores is not for specific apps to use them.
It just means more processing power is available at hand, to be used when it is actually needed.
For example, when a workload can be handled by one core, which is usually the case for most apps, the other cores aren't necessary. But for high-end games, or when several processes need lots of computation at the same time, the other cores may be used once the first core has no headroom left.
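From an app's point of view, using the extra cores is explicit: you split work across threads, typically sized by `Runtime.getRuntime().availableProcessors()`. A minimal sketch (the summing workload is an arbitrary placeholder; real apps would parcel out image tiles, physics steps, etc.):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSum {
    // Sum 0..n-1, split into one chunk per available core.
    static long parallelSum(int n) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        int chunk = Math.max(1, n / cores);

        List<Future<Long>> parts = new ArrayList<>();
        for (int lo = 0; lo < n; lo += chunk) {
            final int start = lo, end = Math.min(n, lo + chunk);
            // Each task sums its own slice independently; on a
            // multi-core chip these run on separate cores.
            parts.add(pool.submit(() -> {
                long s = 0;
                for (int i = start; i < end; i++) s += i;
                return s;
            }));
        }

        long total = 0;
        for (Future<Long> f : parts) total += f.get();
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parallelSum(10_000_000)); // prints 49999995000000
    }
}
```

On a single-core phone the same code still runs correctly; the tasks simply execute one after another.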

How would you improve Dalvik? Android's Virtual Machine

I am currently writing a paper on the Android platform. After some research, it's clear that Dalvik has room for improvement. I was wondering, what do you think would be the best use of a developer's time with this goal?
JIT compilation seems like the big one, but I've also heard it would be of limited use on such a low-resource machine. Does anyone have a resource or data that backs this up?
Are there any other options that should be considered? Aside from developing a robust native development kit to bypass the VM.
For those who are interested, there is a lecture that has been recorded and put online regarding the Dalvik VM.
Any thoughts welcome. As this question appears subjective, I'll clarify that the answer I accept must have some justification for the proposed changes. Any data to back it up, such as the improvement in the Sun JVM when JIT was introduced, would be a massive plus.
Better garbage collection: compacting at minimum (to eliminate the memory fragmentation problems experienced today), and ideally less CPU-intensive collection itself (to reduce the "my game frame rates suck" complaints)
JIT, as you cite
Enough documentation that, when coupled with an NDK, somebody sufficiently crazy could compile Dalvik bytecode to native code for an AOT compilation option
Make it separable from Android itself, such that other projects might experiment with it and community contributions might arrive in greater quantity and at a faster clip
I'm sure I could come up other ideas if you need them.
JIT. The claims about it not helping are a load of crap. You might be more selective about what code you JIT, but having 1/10th the performance of native code is always going to be limiting.
Decent GC. Modern generational garbage collectors do not have big stutters.
Better code analysis. There are a lot of cases where allocations/frees don't need to be made, locks don't need to be held, and so on. It lets you write clean code rather than doing optimizations that the machine is better at.
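As an illustration of that last point: a VM with escape analysis can prove that a short-lived object never leaves a method and eliminate its allocation, so the clean version can cost no more than the hand-tuned one. A hypothetical sketch; whether a given VM (Dalvik included) actually performs this optimization is implementation-dependent:

```java
import java.util.Arrays;
import java.util.List;

public class EscapeDemo {
    // Clean version: the enhanced for loop allocates an Iterator,
    // but it never escapes this method, so escape analysis can
    // remove the allocation entirely.
    static long sumClean(List<Integer> xs) {
        long s = 0;
        for (int x : xs) s += x;
        return s;
    }

    // Hand-"optimized" version that avoids the iterator manually,
    // at the cost of noisier code.
    static long sumIndexed(List<Integer> xs) {
        long s = 0;
        for (int i = 0; i < xs.size(); i++) s += xs.get(i);
        return s;
    }

    public static void main(String[] args) {
        List<Integer> xs = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println(sumClean(xs));   // prints 15
        System.out.println(sumIndexed(xs)); // prints 15
    }
}
```

With good enough analysis in the VM, the programmer can simply write the first version.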
In theory, most higher-level languages (Java, JavaScript, Python, ...) should be within 20% of native-code performance for most cases. But it requires the platform vendor to spend hundreds of developer man-years. Sun's Java is getting good; they have also been working on it for 10 years.
One of the main problems with Dalvik is performance, which I hear is terrible, but the thing I would like most is the addition of more languages.
The JVM has had community projects getting Python and Ruby running on the platform, and even special languages such as Scala, Groovy and Clojure developed for it. It would be nice to see these (and/or others) on the Dalvik platform as well. Sun has also been working on the Da Vinci Machine, a dynamic-language extension of the JVM, which indicates a major shift away from the "one language fits all" philosophy Sun has followed over the last 15 years.
