I am using a Nexus 10 with Android 4.4. I see that if I have writes to global variables in the script, then the script is executed on the CPU instead of the GPU. I can see this from the Mali driver prints in logcat.
I read somewhere that this limitation would go away in the future, and I was hoping 4.4 would remove it. Does anyone know more about why this limitation exists and when it might go away?
This limitation is quite restrictive. For instance, I am using an intermediate allocation as a global variable between kernels in a ScriptGroup, and my script guarantees that the kernels write to different locations in the allocation. Because of this restriction, my script now falls back to the CPU, which causes significant performance loss in at least a few cases. For instance, the loss is significant if one uses functions such as cos and pow in a kernel: CPUs do a far worse job than the GPU on those functions.
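For reference, here is a stripped-down sketch of my Java-side setup. The script classes, kernel name, and global name (ScriptC_stage1, ScriptC_stage2, root, gScratch) are placeholders for my real ones:

import android.renderscript.Allocation;
import android.renderscript.RenderScript;
import android.renderscript.ScriptGroup;
import android.renderscript.Type;

// Placeholder script classes and names; the pattern is what matters.
// stage1's kernel writes into gScratch via rsSetElementAt() and stage2's
// kernel reads it back; that global write appears to be what pushes the
// whole group onto the CPU path.
void runGroup(RenderScript rs, Allocation inAlloc, Allocation outAlloc,
              Type connectType, Type scratchType) {
    ScriptC_stage1 s1 = new ScriptC_stage1(rs);
    ScriptC_stage2 s2 = new ScriptC_stage2(rs);

    // The intermediate allocation shared between the kernels as a global.
    Allocation scratch = Allocation.createTyped(rs, scratchType);
    s1.set_gScratch(scratch);
    s2.set_gScratch(scratch);

    ScriptGroup.Builder builder = new ScriptGroup.Builder(rs);
    builder.addKernel(s1.getKernelID_root());
    builder.addKernel(s2.getKernelID_root());
    builder.addConnection(connectType, s1.getKernelID_root(), s2.getKernelID_root());
    ScriptGroup group = builder.create();

    group.setInput(s1.getKernelID_root(), inAlloc);
    group.setOutput(s2.getKernelID_root(), outAlloc);
    group.execute();
}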
I have heard that ARM processors can switch between little-endian and big-endian. What do processors need this for? Is it used on Android phones?
Depending on the processor, it can be possible to switch endianness on the fly; older processors boot up in one endian state and are expected to stay there. In that fixed case, the whole design will generally be set up for either big- or little-endian operation.
The primary reason for supporting mixed-endian operation is to support networking stacks where the underlying datasets being manipulated are native big-endian. This is significant for switches/routers and mobile base-stations where the processor is running a well-defined software stack, rather than operating as a general purpose applications device.
Be aware that there are several different implementations of big-endian behaviour across the different ARM Architectures, and you need to check exactly how this works on any specific core.
You can switch endianness, but you wouldn't do that after the OS is up and running. It would only screw things up. If you were going to do it, you'd do it very early on in the boot sequence. By the time your app is running, the endianness is chosen and won't be changed.
Why would you do it? The only real reason would be if you were writing embedded software that had to deal with a lot of big-endian data, or to run a program that was written assuming big endian and never fixed to be endian-agnostic. This kind of data tends to come from networking, where protocols write fields in big-endian (network) byte order. There aren't many other reasons to do it. You'll see ARM pretty much exclusively run in little-endian mode.
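To make the endian-agnostic point concrete, here's a minimal Java sketch (the names are mine, purely for illustration): application code normally declares the byte order of the data and lets the runtime do any swapping, rather than switching the CPU itself.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class EndianExample {
    // Parse a 32-bit length field from a big-endian (network byte order)
    // header; ByteBuffer swaps for us if the host order differs.
    static int readNetworkLength(byte[] packet) {
        ByteBuffer buf = ByteBuffer.wrap(packet);
        buf.order(ByteOrder.BIG_ENDIAN);
        return buf.getInt();
    }

    // The host order itself: little-endian on Android/ARM.
    static boolean hostIsLittleEndian() {
        return ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN;
    }
}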
I recently upgraded my old Galaxy S2 phone to a brand new Galaxy S7, and was very surprised to find that an old game I wrote seemed to be performing worse on the new phone. After cutting everything down to a bare-bones project, I discovered the problem: the GLES20.glFinish() call I was performing at the end of every onDrawFrame. With that call in place, even with just a glClear and no draw calls, the FPS hovered around 40; without the glFinish, a solid 60 FPS. My old S2 had a solid 60 FPS regardless.
I then went back to my game, and removed the glFinish method call, and sure enough performance went back to being perfect and there was no obvious downside to its removal.
Why was glFinish slowing down my frame rate on my new phone but not my old phone?
I think a speculative answer is as good as it's going to get, so — apologies for almost certainly repeating a lot of what you already know:
Commands sent to OpenGL go through three states, named relative to the GPU side of things:
unsubmitted
submitted but pending
completed
Communicating with the code running the GPU is usually expensive. So most OpenGL implementations accept your calls and just queue the work up inside your memory space for a while. At some point the implementation will decide that a communication is justified and will pay the cost to transfer all the calls at once, promoting them to the submitted state. Then the GPU will complete each one (potentially out of order, subject to not breaking the API).
glFinish:
... does not return until the effects of all previously called GL
commands are complete. Such effects include all changes to GL state,
all changes to connection state, and all changes to the frame buffer
contents.
So for some period when that CPU thread might have been doing something else, it now definitely won't be. But if you don't glFinish, your output will probably still appear; it's just unclear when. glFlush is often the correct way forwards: it advances everything to submitted but doesn't wait for completed, so everything will definitely appear shortly, you just don't wait around for it.
OpenGL bindings to the OS vary a lot; in general, though, you almost certainly want to flush rather than finish if your environment permits you to do so. If it's valid to neither flush nor finish, and the OS isn't pushing things along for you based on any criteria, then it's possible you're incurring some extra latency (e.g. the commands you issue one frame may not reach the GPU until you fill up the unsubmitted queue again during the next frame), but if you're doing GL work indefinitely then output will almost certainly still proceed.
Android sits upon EGL. Per the spec, 3.9.3:
... eglSwapBuffers and eglCopyBuffers perform an implicit flush operation
on the context ...
I therefore believe that you are not required to perform either a flush or a finish in Android if you're double buffering. A call to swap the buffers will cause a buffer swap as soon as drawing is complete without blocking the CPU.
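As a minimal sketch (assuming a standard GLSurfaceView setup, which calls eglSwapBuffers for you after each frame), the renderer can simply be:

import android.opengl.GLES20;
import android.opengl.GLSurfaceView;

import javax.microedition.khronos.egl.EGLConfig;
import javax.microedition.khronos.opengles.GL10;

// GLSurfaceView swaps buffers after onDrawFrame returns, and that swap
// performs an implicit flush, so no explicit glFlush()/glFinish() is needed.
public class GameRenderer implements GLSurfaceView.Renderer {
    @Override
    public void onSurfaceCreated(GL10 unused, EGLConfig config) {
        GLES20.glClearColor(0f, 0f, 0f, 1f);
    }

    @Override
    public void onSurfaceChanged(GL10 unused, int width, int height) {
        GLES20.glViewport(0, 0, width, height);
    }

    @Override
    public void onDrawFrame(GL10 unused) {
        GLES20.glClear(GLES20.GL_COLOR_BUFFER_BIT);
        // ... issue draw calls ...
        // No GLES20.glFinish() here: it would block this thread until the
        // GPU had completed all prior work, serialising CPU and GPU each frame.
    }
}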
As to the real question: the S7 has an Adreno 530 GPU, while the S2 has a Mali-400 MP4. The Malis are produced by ARM, the Adrenos by Qualcomm, so they're completely different architectures with completely different driver implementations. The difference that causes the blocking could therefore be almost anything, but it's permitted behaviour: glFinish isn't required and is a very blunt instrument, so it's probably not one of the driver writers' major optimisation targets.
I'm wondering if anybody has developed a RenderScript program that runs on the GPU. I've tried some simple implementations, like doing IntrinsicBlur via RS, but it turned out that it runs on the CPU rather than the GPU.
Intrinsics will always run on the processor that will do them the fastest. If an intrinsic is running on the CPU, that means the GPU is not suitable for running it quickly. One reason might be that the GPU is usually busy drawing the screen (which takes a lot of effort too), so there isn't additional compute bandwidth available there.
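For reference, a minimal sketch of driving the blur intrinsic from Java; note there is no API knob to pick a processor, the runtime decides:

import android.content.Context;
import android.graphics.Bitmap;
import android.renderscript.Allocation;
import android.renderscript.Element;
import android.renderscript.RenderScript;
import android.renderscript.ScriptIntrinsicBlur;

class BlurHelper {
    // Blur `input` with the built-in intrinsic. The RenderScript runtime,
    // not the caller, chooses where it executes (CPU, GPU, or DSP).
    static Bitmap blur(Context context, Bitmap input, float radius) {
        RenderScript rs = RenderScript.create(context);
        Bitmap output = Bitmap.createBitmap(
                input.getWidth(), input.getHeight(), input.getConfig());
        Allocation in = Allocation.createFromBitmap(rs, input);
        Allocation out = Allocation.createFromBitmap(rs, output);
        ScriptIntrinsicBlur blur = ScriptIntrinsicBlur.create(rs, Element.U8_4(rs));
        blur.setRadius(radius); // must be in (0, 25]
        blur.setInput(in);
        blur.forEach(out);
        out.copyTo(output);
        rs.destroy();
        return output;
    }
}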
On android.com they say that if you're working in Java, the maximum memory you can use is 16 MB; at least that's the minimum every device is supposed to support. If you have an older phone, you'll notice that you can't get more: you get an OutOfMemoryError instead. That doesn't seem to apply if you're doing the same thing using the NDK. In one of my applications I am trying to get 50 MB and more, and so far Android has been fine with that.
I haven't found anything related to that on android.com.
Is there any limit like in Java, too?
If yes: what's the limit?
If no: What is a good value for that?
The problem is that I have to structure my code depending on that size.
[Edit:]
I tried what Seva Alekseyev was suggesting.
root@android:/ # ulimit -a
ulimit -a
time(cpu-seconds) unlimited
file(blocks) unlimited
coredump(blocks) 0
data(KiB) unlimited
stack(KiB) 8192
lockedmem(KiB) 64
nofiles(descriptors) 1024
processes 7806
flocks unlimited
sigpending 7806
msgqueue(bytes) 819200
maxnice 40
maxrtprio 0
resident-set(KiB) unlimited
address-space(KiB) unlimited
root@android:/ # ulimit -v
ulimit -v
unlimited
root@android:/ #
The memory I am requesting (using malloc or new) is virtual memory (ulimit -v). So there's no way to figure out how much I can actually get?!
You're subject to three types of memory limits:
1) Artificial limits put in place to keep the system responsive when multitasking -- the VM heap limitation is the main example of this. ulimit is a potential mechanism for the OS to place further limitations on you, but I have not seen it used restrictively on Android devices.
2) Physical limits based on available real memory. You should have a baseline device you're developing/testing on, and should be pretty aggressive in assuming that other processes (background services, other apps) need memory too. Also remember that the memory in use by the OS varies with OS version (and tends to increase over time). Stock Android doesn't swap, so if you go too far, you're dead. One reasonable baseline scenario is a Nexus One (512 MB RAM) with an audio player and the phone app running in the background, plus a "balloon" service eating another 100 MB of physical memory to give some leeway; in that configuration you'll still find more than 100 MB available.
3) Virtual memory limits based on address space. Stock Android allows overcommitment of memory, so it won't blink if you ask for a 1 GB virtual allocation (via mmap, etc.) on a device with 512 MB of RAM, and this is often a very useful thing to do. However, when you then touch the memory, it needs to be brought into physical memory. If there are read-only pages in physical memory, they can be ejected, but soon enough you're going to run out, and without swap -- dead. (The combination of overcommit and no swap leads directly to process death in out-of-memory situations, rather than to recoverable errors like malloc returning null.)
Finally, it's worth noting that whether calloc/malloc/new require physical allocation is allocator-dependent, but it's safer to assume yes, especially for allocations less than a large number of pages. So: If you're dealing with < 100 MB of standard, well behaved allocations, you're probably in the clear -- but test! If you're dealing with large amounts of data that you'd like memory mapped, mmap is your friend, when used carefully, and is your best friend when used with PROT_READ only. And if you're dealing with > 100 MB of physical memory allocations, expect to run quite nicely on modern devices, but you'll have to define a baseline carefully and test, test, test, since detecting out-of-memory situations on the fly is not generally possible.
One more note: APP_CMD_LOW_MEMORY exists, and is a great place to purge caches, but there's no guarantee it's called in time to save your life. It doesn't change the overall picture at all.
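If it helps, here is a small Java-side sketch for probing limit types 1 and 2 at runtime (native malloc/mmap allocations are bounded only by the physical and address-space limits, not by the heap class):

import android.app.ActivityManager;
import android.content.Context;
import android.util.Log;

class MemLimits {
    static void log(Context context) {
        ActivityManager am =
                (ActivityManager) context.getSystemService(Context.ACTIVITY_SERVICE);

        // Type 1: the artificial per-app Java heap budget, in MB
        // (16 on early devices, larger on later ones).
        int heapClassMb = am.getMemoryClass();

        // What the VM will actually let this process's heap grow to, in bytes.
        long maxHeapBytes = Runtime.getRuntime().maxMemory();

        // Type 2: a snapshot of system-wide physical memory pressure.
        ActivityManager.MemoryInfo info = new ActivityManager.MemoryInfo();
        am.getMemoryInfo(info);

        Log.d("MemLimits", "heapClass=" + heapClassMb + "MB"
                + " maxHeap=" + maxHeapBytes + "B"
                + " availMem=" + info.availMem + "B"
                + " lowMemory=" + info.lowMemory);
    }
}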
I wonder if there is a penalty for running Dalvik+JIT on a multi-core ARM chip vs a single core chip?
E.g., if I disable multi-core support in my Android system build and run the entire phone on a single CPU core, will I get higher performance when running a single-threaded Java benchmark?
How much do memory barriers and synchronization cost on a multi-core chip?
I am asking because I vaguely remember seeing single-threaded benchmark scores from single-core phones vs dual-core phones. As long as the MHz are about the same, there is no big difference between the two phones. I had expected a slowdown on the dual-core phone....
The simple answer is "why don't you try it and find out?"
The complex answer is this:
There are costs to multicore synchronization, but there are also benefits to having multiple cores. You can undoubtedly devise a pathological case where a program suffers so much from the overhead of synchronization primitives that it is deeply affected by their performance; this is usually due to locking at too deep a level (inside your fast loop). But in the general case, the fact that the dozen other system programs can get CPU time on other cores, and that the kernel can service interrupts and IO there instead of interrupting your process, is likely to greatly outweigh the penalty incurred by MP synchronization.
In answer to your question, a DMB can take dozens or hundreds of cycles, and a DSB, being a stronger barrier, is likely more costly. Depending on the implementation, exclusive load/store instructions can be very fast or very slow. WFE can consume several microseconds, though it shouldn't be needed if you are not experiencing contention.
Background: http://developer.android.com/training/articles/smp.html
Dalvik built for SMP does have additional overhead. The Java Memory Model requires that certain guarantees be enforced, which means issuing additional memory barriers, particularly when dealing with volatile fields and immutable objects.
Whether or not the added overhead will be noticeable depends on what exactly you're doing and what device you're on, but generally speaking it's unlikely you'll notice it unless you're running a targeted benchmark.
If you build for UP and run Dalvik on a device with multiple cores, you may see flaky behavior -- see the "SMP failure example" appendix in the doc referenced above.
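To illustrate the volatile case with a minimal sketch (in the spirit of that doc's SMP failure example; the class names are mine):

// Without `volatile`, the reader's core may observe the reference to `box`
// before the store to `box.value`, because nothing orders the two writes.
// Marking the field volatile makes an SMP Dalvik build emit the barriers
// (e.g. DMB on ARM) that the Java Memory Model requires.
class Box {
    int value;
}

class Publisher {
    volatile Box box; // drop `volatile` and this can fail on SMP

    void writer() {
        Box b = new Box();
        b.value = 42;
        box = b; // volatile store: earlier writes become visible first
    }

    void reader() {
        Box b = box; // volatile load pairs with the store above
        if (b != null) {
            int v = b.value; // guaranteed to be 42 only thanks to volatile
        }
    }
}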