I have a big array; iterating over it and doing my work takes about 50 ms.
The app I am developing will run on a Tegra 3 or another fast CPU.
I divided the work across four threads using pthreads: I took the width of my array, divided it by the total core count found in the system, and each thread iterates over one quarter of the array. Everything works, but it now needs 80 ms to do the same work.
Any idea why the multithreaded approach is slower than the single thread? If I lower the CPU count to 1, it is back to 50 ms.
for (int y = 0; y < height; y++)
{
    for (int x = 0; x < width; x++)
    {
        int index = (y * width) + x;
        int sourceIndex = source->getIndex(vertex_points[index].position[0] / ww,
                                           vertex_points[index].position[1] / hh);
        vertex_points[index].position[0] += source->x[sourceIndex] * ww;
        vertex_points[index].position[1] += source->y[sourceIndex] * hh;
    }
}
I am dividing the first (outer) for loop of the code above into four parts based on the CPU count.
vertex_points is a vector of positions.
So the outer loop becomes

for (int y = start; y < end; y++)

and start/end vary per thread.
Thread startup time is typically on the order of milliseconds; if you create the threads anew for each pass, that alone can eat your time.
With that in mind, 50 ms is not the kind of delay I'd worry about. If we were talking 5 seconds, that would be a good candidate for parallelizing.
If the loop needs to run often, consider a solution where the threads are spun up early on and kept dormant, waiting for work to do. That will run faster.
Also, is the CPU really 4-core? Honest cores or hyperthreading?
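To make the keep-threads-dormant idea concrete, here is a minimal sketch (shown in Java for illustration; with pthreads the equivalent is a worker pool parked on a condition variable). The array size and the per-element work are stand-ins, not taken from the question:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PersistentPool {
    // Threads are created once and reused; only the work is resubmitted,
    // so the per-pass thread startup cost disappears.
    static final int WORKERS = Runtime.getRuntime().availableProcessors();
    static final ExecutorService pool = Executors.newFixedThreadPool(WORKERS);

    static void processAll(float[] data) {
        int chunk = (data.length + WORKERS - 1) / WORKERS;
        List<Callable<Void>> tasks = new ArrayList<>();
        for (int w = 0; w < WORKERS; w++) {
            final int start = w * chunk;
            final int end = Math.min(data.length, start + chunk);
            tasks.add(() -> {
                for (int i = start; i < end; i++) {
                    data[i] += 1.0f;  // stand-in for the real per-element work
                }
                return null;
            });
        }
        try {
            pool.invokeAll(tasks);  // blocks until every chunk is done
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        float[] data = new float[1000];
        processAll(data);   // can be called every frame; no threads are spawned
        System.out.println(data[999]);
        pool.shutdown();
    }
}
```

The pool is created once and `processAll` can be called every frame without paying thread-creation cost again.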
I'm trying to measure and increase the execution time of a test case. Here is what I'm doing:
Let's assume I'm testing the method abc() using testAbc().
I'm using Android Studio and JUnit for my software development.
At the very beginning I record the timestamp in nanoseconds in a start variable, and when the method finishes, it returns the difference between the current nanoseconds value and start.
testAbc() is divided into 3 parts: initialization, testing abc(), and assertion (checking the test results).
I track the time inside testAbc() the same way as I do in abc().
After executing the test I found that abc() takes about 45-50% of the test time.
I modified testAbc() as follows:

void testAbc() {
    startTime = System.nanoTime();
    // no modification to the initialization part
    // the "testing abc" part is placed in a for loop to increase its
    // execution time
    for (int i = 0; i < 100; i++) {
        // test abc code goes here ...
        abcTime += abc();
    }
    // assertion part wasn't modified
    testEndTime = System.nanoTime() - startTime;
}
By repeating the tested part I expected the ratio between abcTime and testEndTime to increase (by dramatically increasing abcTime); however, it didn't change at all: it is still 45-50%.
My questions:
Why didn't the ratio increase? In principle the execution time of the initialization and assertion parts should not be affected by the for loop, so the time for abc() should get closer to the total test time after 100 repetitions.
How can I increase the ratio between abc() and testAbc()?
thank you for your time.
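For reference, the measurement structure described above can be reproduced in a self-contained form. The abc() body below is a hypothetical stand-in (the real initialization and assertion parts are omitted); one invariant worth checking is that the summed inner intervals can never exceed the outer interval:

```java
public class TimingHarness {
    // Stand-in for the real abc(): does some work and returns its own
    // elapsed time in nanoseconds, as described in the question.
    static long abc() {
        long start = System.nanoTime();
        double x = 0;
        for (int i = 0; i < 10_000; i++) x += Math.sqrt(i);  // dummy work
        if (x < 0) System.out.println(x);  // defeat dead-code elimination
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long startTime = System.nanoTime();
        long abcTime = 0;
        for (int i = 0; i < 100; i++) {
            abcTime += abc();
        }
        long totalTime = System.nanoTime() - startTime;
        // Nested, non-overlapping intervals always sum to at most the total.
        System.out.println(abcTime <= totalTime);
        System.out.println("abc fraction: " + (100.0 * abcTime / totalTime) + "%");
    }
}
```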
In RenderScript, I am using bound pointers to iterate over a large image.
The problem is the array access performance.
...
for (int i = 0; i < channels; i++) {
    sum += input[i * input_size] * mulValue;
}
...
For example, when input_size is 12288 the script takes 1.5 seconds to complete, but when input_size is 12280 it takes ~0.5 seconds.
What can cause such mysterious behavior?
Understanding the performance implications of what you write in RenderScript (or OpenCL) is complex.
Just writing it in RenderScript does not guarantee performance.
You often run into cache problems when your memory accesses hop around.
Quite often it is better to structure the code as a series of kernels that each process memory in a cache-friendly manner.
Sorry if this is vague; your question does not have enough details.
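One common culprit with access patterns like input[i * input_size] (an assumption here, not confirmed by the question: 12288 * 4 bytes is an exact multiple of 4096) is that a stride equal to a large power of two maps successive accesses onto the same cache set, causing conflict misses, while a nearby odd stride spreads them out. A sketch of the pattern, in Java for illustration:

```java
public class StrideSum {
    // Sums 'channels' floats spaced 'stride' elements apart, mirroring the
    // input[i * input_size] access pattern in the question.
    static float strideSum(float[] input, int channels, int stride, float mulValue) {
        float sum = 0;
        for (int i = 0; i < channels; i++) {
            sum += input[i * stride] * mulValue;
        }
        return sum;
    }

    public static void main(String[] args) {
        int channels = 4;
        // A stride of 4096 floats (16 KiB) tends to land every access in the
        // same cache set on typical L1 caches; 4093 spreads them across sets.
        int strideA = 4096, strideB = 4093;
        float[] data = new float[channels * strideA];
        java.util.Arrays.fill(data, 1.0f);
        // Same numerical result either way; only the memory behavior differs.
        System.out.println(strideSum(data, channels, strideA, 2.0f));
        System.out.println(strideSum(data, channels, strideB, 2.0f));
    }
}
```

Timing the two strides on a large array (and a real device) would be the way to confirm or rule this out.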
I want to stress memory at its peak capability, consuming most of the RAM space. I came up with tasks like searching text in many files in parallel, or calling functions recursively to fill the stack space.
Searching will not fill your memory on its own. To fill the heap you need to hold many or large objects in memory.
For example (note the explicit new: adding the same interned string literal 999999 times would only grow the list's backing array, not allocate many objects):
List<String> mList = new ArrayList<String>();
for (int i = 0; i < 999999; i++) {
    mList.add(new String("Garbage"));  // 'new' forces a distinct object each time
}
I have been reading up on game loops and am having a hard time understanding the concept of interpolation. From what I've seen so far, a high-level game loop design should look something like the sample below.
ASSUME WE WANT OUR LOOP TO TAKE 50 TICKS
while (true) {
    beginTime = System.currentTimeMillis();
    update();
    render();
    cycleTime = System.currentTimeMillis() - beginTime;
    // if processing is quicker than we need, let the thread take a nap
    if (cycleTime < 50) {
        Thread.sleep(50 - cycleTime);
    }
    // if processing time is taking too long, update until we are caught up
    if (cycleTime > 50) {
        update();
        // handle max update loops here...
    }
}
Let's assume that update() and render() both take only 1 tick to complete, leaving us 49 ticks to sleep. While this is great for our target tick rate, it still results in a 'twitchy' animation because of all that sleep time. To adjust for this, instead of sleeping, I assume some kind of rendering should be going on within the first if condition. Most code samples I have found simply pass an interpolated value into the render method, like this...
while (true) {
    beginTime = System.currentTimeMillis();
    update();
    render(interpolationValue);
    cycleTime = System.currentTimeMillis() - beginTime;
    // if processing is quicker than we need, let the thread take a nap
    if (cycleTime < 50) {
        Thread.sleep(50 - cycleTime);
    }
    // if processing time is taking too long, update until we are caught up
    if (cycleTime > 50) {
        update();
        // handle max update loops here...
    }
    interpolationValue = calculateSomeRenderValue();
}
I just don't see how this can work with the 49-tick sleep. If anyone knows of an article or sample I can check out, please let me know; I am not really sure what the best approach to interpolation is...
I know it's a bit late, but hopefully this article will help:
http://gameprogrammingpatterns.com/game-loop.html
It explains game time scheduling very well. I think the main reason you are confused is the idea of passing the current elapsed time to the render function. Of course this depends on which system you are using, but conventionally render doesn't modify the scene in any way; it only draws it, so it doesn't need to know how much time has passed.
The update call, however, modifies the objects in the scene, and to keep them in time (e.g. playing animations, lerps, etc.) the update function needs to know how much time has passed, either globally or since the last update.
Anyway, no point in me going too far into it... that article is very useful.
Hope this helps
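To make the fixed-step-plus-interpolation idea concrete, here is a minimal, self-contained sketch loosely following the pattern in that article. The constants, the moving object, and the simulated frame times are all my own assumptions:

```java
public class FixedTimestepLoop {
    static final double MS_PER_UPDATE = 50.0;  // fixed simulation step
    static double previousX = 0, currentX = 0; // one moving object, for illustration

    static void update() {           // advances the simulation by exactly one step
        previousX = currentX;
        currentX += 5.0;             // the object moves 5 units per update
    }

    // Render between the last two simulated states instead of snapping
    // to the newest one; alpha is the fraction of a step left over.
    static double renderX(double alpha) {
        return previousX * (1.0 - alpha) + currentX * alpha;
    }

    public static void main(String[] args) {
        double accumulator = 0;
        double[] frameTimes = {30, 30, 30, 70, 20};  // simulated frame durations, ms
        for (double elapsed : frameTimes) {
            accumulator += elapsed;
            while (accumulator >= MS_PER_UPDATE) {   // catch up in fixed steps
                update();
                accumulator -= MS_PER_UPDATE;
            }
            double alpha = accumulator / MS_PER_UPDATE;  // always in [0, 1)
            System.out.println("render at x = " + renderX(alpha));
        }
    }
}
```

The point is that render runs every frame with a fractional position, so motion stays smooth even though update only fires once per 50 ms tick.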
In a game for Android written in Scala, I have plenty of objects that I want to pool. First I tried to keep both active (visible) and inactive instances in the same pool; this was slow due to the filtering, which both causes GC and is slow in itself.
So I moved to two data structures: when I need a free instance, I just take the first one from the passive pool and add it to the active pool. I also need fast random access to the active pool (for when I need to hide an instance). I'm using two ArrayBuffers for this.
So my question is: which data structure would be best for this situation? And how should that structure (or those structures) be used for adding and removing so as to avoid GC as much as possible and stay efficient on Android (memory and CPU constraints)?
The best data structure is an internal list, where you add
var next: MyClass
to every class. The non-active instances then become what's typically called a "free list", while the active ones become a singly-linked list a la List.
This way your overhead is exactly one pointer per object (you can't really get any less than that), and there is no allocation or GC at all. (Unless you want to implement your own by throwing away part or all of the free list if it gets too long.)
You do lose some collections niceness, but you can just make your class be an iterator:
def hasNext = (next != null)
is all you need given that var. (Well, and extends Iterator[MyClass].) If your pool sizes are really quite small, sequential scanning will be fast enough.
If your active pool is too large for sequential scanning down a linked list and elements are not often added or deleted, then you should store them in an ArrayBuffer (which knows how to remove elements when needed). Once you remove an item, throw it on the free list.
If your active pool turns over rapidly (i.e. the number of adds/deletes is similar to the number of random accesses), then you need some sort of hierarchical structure. Scala provides an immutable one that works pretty well in Vector, but no mutable one (as of 2.9); Java also doesn't have something that's really suitable. If you wanted to build your own, a red-black or AVL tree with nodes that keep track of the number of left children is probably the way to go. (It's then a trivial matter to access by index.)
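A minimal sketch of the free-list idea described above (in Java here for illustration; in Scala the link field is just the var next: MyClass mentioned earlier, and the class/field names are my own):

```java
// Intrusive pool: each object carries one 'next' pointer that links it
// into either the free list or the active list, so no collection nodes
// are ever allocated after construction.
public class IntrusivePool {
    static class Particle {
        Particle next;      // the one-pointer-per-object overhead
        float x, y;
    }

    Particle freeHead;      // inactive instances
    Particle activeHead;    // visible instances

    IntrusivePool(int capacity) {
        for (int i = 0; i < capacity; i++) {   // pre-allocate once; no GC later
            Particle p = new Particle();
            p.next = freeHead;
            freeHead = p;
        }
    }

    Particle acquire() {                // O(1): pop free list, push active list
        Particle p = freeHead;
        if (p == null) return null;     // pool exhausted
        freeHead = p.next;
        p.next = activeHead;
        activeHead = p;
        return p;
    }

    void releaseAll() {                 // move every active instance back
        while (activeHead != null) {
            Particle p = activeHead;
            activeHead = p.next;
            p.next = freeHead;
            freeHead = p;
        }
    }

    int countActive() {                 // sequential scan, as in the answer
        int n = 0;
        for (Particle p = activeHead; p != null; p = p.next) n++;
        return n;
    }

    public static void main(String[] args) {
        IntrusivePool pool = new IntrusivePool(100);
        Particle p = pool.acquire();
        p.x = 1f;
        System.out.println(pool.countActive());
        pool.releaseAll();
        System.out.println(pool.countActive());
    }
}
```

Hiding one specific instance (rather than all of them) requires a scan to find its predecessor, which is the sequential-scan trade-off the answer mentions; that is where the ArrayBuffer or tree alternatives come in.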
I guess I'll mention my idea. The filter and map methods iterate over the entire collection anyway, so you may as well simplify that and just do a naive scan over your collection (to look for active instances). See here: https://github.com/scala/scala/blob/v2.9.2/src/library/scala/collection/TraversableLike.scala
def filter(p: A => Boolean): Repr = {
  val b = newBuilder
  for (x <- this)
    if (p(x)) b += x
  b.result
}
I ran some tests comparing a naive scan (with n = 31, so I wouldn't have to keep more than a 32-bit Int bitmap), a filter/foreach scan, a filter/map scan, and a bitmap scan, randomly assigning 33% of the set to active. I kept a running counter to double-check that I wasn't cheating by not looking at the right values. By the way, this is not running on Android.
The time my loop took varied with the number of active values.
Results:
naive scanned a million times in: 197 ms (sanity check: 9000000)
filter/foreach scanned a million times in: 441 ms (sanity check: 9000000)
map scanned a million times in: 816 ms (sanity check: 9000000)
bitmap scanned a million times in: 351 ms (sanity check: 9000000)
Code here; feel free to rip it apart or tell me if there's a better way. I'm fairly new to Scala so my feelings won't be hurt: https://github.com/wfreeman/ScalaScanPerformance/blob/master/src/main/scala/scanperformance/ScanPerformance.scala