Android: benchmarking two algorithms

I have implemented two algorithms for the same problem and want to find out, in a professional way, which one is better.
The basic idea was:
final static int LOOP_COUNT = 500;
long totalTime = 0;
warmUp();
for (int i = 0; i < LOOP_COUNT; i++)
{
    long startTime = System.currentTimeMillis();
    myMethod();
    long endTime = System.currentTimeMillis();
    totalTime += endTime - startTime;
}
return totalTime / LOOP_COUNT;
And do that for both algorithms.
But:
how can I ensure that the Android system does not do any background work that skews the data?
is there a way I can also compare the memory that both methods use?

If you want professional, statistically relevant results and you want to minimize the influence of Android background processes, you will need to run your algorithm a number of times and compare the averages. That way, by the law of large numbers, your results will be representative.
How many times depends on the standard deviation of the execution time and how certain you want to be. If you're familiar with basic statistics, you can determine your sample size with standard formulas, and you can, for example, run a t-test to compare the averages of both algorithms if your sample distribution is normally distributed. This automatically accounts for your wish to minimize the influence of background processes: they appear randomly, so after enough iterations their influence cancels out.
Also take a look at the garbage collector: if you create a lot of objects during the execution of your algorithm, it will affect your results, but it should, as it will also affect the real-world usage of the algorithm.
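A minimal sketch of this approach in plain Java (the method names, warm-up count and sample size are illustrative, not from the question): time each run, discard warm-up iterations, and report mean and sample standard deviation so the two algorithms' samples can be compared, e.g. with a t-test.

```java
import java.util.Arrays;

public class Bench {
    static long timeNanos(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return System.nanoTime() - start;
    }

    static double[] sample(Runnable task, int warmup, int n) {
        for (int i = 0; i < warmup; i++) task.run();  // let the JIT settle first
        double[] times = new double[n];
        for (int i = 0; i < n; i++) times[i] = timeNanos(task);
        return times;
    }

    static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElse(0);
    }

    static double stdDev(double[] xs) {
        double m = mean(xs);
        double ss = 0;
        for (double x : xs) ss += (x - m) * (x - m);
        return Math.sqrt(ss / (xs.length - 1));       // sample standard deviation
    }

    public static void main(String[] args) {
        double[] a = sample(() -> { /* algorithm A here */ }, 50, 500);
        System.out.printf("mean=%.0f ns, sd=%.0f ns%n", mean(a), stdDev(a));
    }
}
```

Running this once per algorithm gives you the mean and standard deviation that the sample-size formulas and the t-test need.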

You could try to analyze your code and find the time complexity. If you have a nested loop:
for (int i = 0; i < max; i++) {
    for (int j = 0; j < max; j++) {
        c = i + j;
    }
}
This would have the time complexity O(n^2). The space complexity is O(1).
Another example is this:
for (int i = 0; i < max; i++) {
    list[i] = "hello";
}
for (int j = 0; j < max; j++) {
    list2[j] = "hello";
}
This would have the time complexity of O(2n), which is the same as O(n), and space complexity of O(2n), which is O(n).
The latter has a better runtime but uses more memory.

The recommended approach to measure specific inner loop performance is the Jetpack Microbenchmark library. You can find code samples on GitHub.

Related

how to get the difference between two values in a single arraylist?

I have an ArrayList like {23,45,44,78}. I have tried getting the value at a particular position and taking the difference of two, but is there any better way to get the difference of 23-45, 45-44 and so on?
The pre-Java 8 way of doing this would be to just use a for loop:
for (int i = 0; i < list.size() - 1; ++i) {
    System.out.println(list.get(i) - list.get(i + 1));
}
There might be a slightly more compact way of doing this using streams.
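Since Java 8, a more compact version is possible with IntStream over the index range (a sketch using the list from the question):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class PairwiseDiff {
    public static void main(String[] args) {
        List<Integer> list = Arrays.asList(23, 45, 44, 78);
        // Map each index i to the difference between element i and element i+1.
        List<Integer> diffs = IntStream.range(0, list.size() - 1)
                .map(i -> list.get(i) - list.get(i + 1))
                .boxed()
                .collect(Collectors.toList());
        System.out.println(diffs); // [-22, 1, -34]
    }
}
```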

Android: why is native code so much faster than Java code

In the following SO question: https://stackoverflow.com/questions/2067955/fast-bitmap-blur-for-android-sdk, user zeh claims that a port of a Java blur algorithm to C runs 40 times faster.
Given that the bulk of the code includes only calculations, and all allocations are only done "one time" before the actual algorithm number crunching - can anyone explain why this code runs 40 times faster? Shouldn't the Dalvik JIT translate the bytecode and dramatically reduce the gap to native compiled code speed?
Note: I have not confirmed the x40 performance gain myself for this algorithm, but all the serious image manipulation algorithms I encounter for Android use the NDK, which supports the notion that NDK code runs much faster.
For algorithms that operate over arrays of data, there are two things that significantly change performance between a language like Java, and C:
Array bound checking: Java will check every access, bmap[i], and confirm i is within the array bounds. If the code tries to access out of bounds, you will get a useful exception. C & C++ do not check anything and just trust your code. The best case response to an out of bounds access is a page fault. A more likely result is "unexpected behavior".
Pointers: You can significantly reduce the operations by using pointers.
Take this innocent example of a common filter (similar to blur, but 1D):
for (int i = 0; i < ndata - ncoef; ++i) {
    z[i] = 0;
    for (int k = 0; k < ncoef; ++k) {
        z[i] += coef[k] * d[i + k];
    }
}
When you access an array element, coef[k] is:
Load address of array coef into register;
Load value k into a register;
Sum them;
Go get memory at that address.
Every one of those array accesses can be improved because you know that the indexes are sequential. Neither the compiler, nor the JIT can know that the indexes are sequential so they cannot optimize fully (although they keep trying).
In C++, you would write code more like this:
int d[10000];
int z[10000];
int coef[10];
int* zptr;
int* dptr;
int* cptr;
dptr = &(d[0]); // Just being overly explicit here, more likely you would dptr = d;
zptr = &(z[0]); // or zptr = z;
for (int i = 0; i < (ndata - ncoef); ++i) {
    *zptr = 0;
    cptr = coef;    // reset to the start of the coefficients
    dptr = d + i;   // the data window starts at offset i
    for (int k = 0; k < ncoef; ++k) {
        *zptr += *cptr * *dptr;
        cptr++;
        dptr++;
    }
    zptr++;
}
When you first do something like this (and succeed in getting it correct) you will be surprised how much faster it can be. All the array address calculations of fetching the index and summing the index and base address are replaced with an increment instruction.
For 2D array operations such as blur on an image, the innocent code data[r][c] involves two fetches, a multiply and a sum. So with 2D arrays the benefit of pointers lets you remove the multiply operations entirely.
So the language allows real reduction in the operations the CPU must perform. The cost is that the C++ code is horrendous to read and debug. Errors in pointers and buffer overflows are food for hackers. But when it comes to raw number grinding algorithms, the speed improvement is too tempting to ignore.
Another factor not mentioned above is the garbage collector. The problem is that garbage collection takes time, plus it can run at any time. This means that a Java program which creates lots of temporary objects (note that some types of String operations can be bad for this) will often trigger the garbage collector, which in turn will slow down the program (app).
Following is a list of programming languages based on their levels:
Assembly Language (Machine Language, Lower Level)
C Language (Middle Level)
C++, Java, .NET (Higher Level)
A lower-level language has more direct access to the hardware; as the level increases, the access to the hardware decreases. So Assembly code runs at the highest speed, while other languages' code runs according to their levels.
This is the reason that C code runs much faster than Java code.

Android (Dalvik) member variable access performance

I just did a benchmark to compare the access performance of local variables, member variables, member variables of other objects, and getters/setters. The benchmark increments the variable in a loop with 10 million iterations. Here is the output:
BENCHMARK: local 101, member 1697, foreign member 151, getter setter 268
This was done on a Motorola XOOM tablet and Android 3.2. The numbers are milliseconds of execution time. Can anybody explain the deviation for the member variable to me? Especially when compared to the other object's member variable. Based on those figures it seems to be worthwhile to copy member variables to local variables before using their values in calculations. Btw, I did the same benchmark on an HTC One X and Android 4.1 and it showed the same deviation.
Are those numbers reasonable or is there a systematic error that I miss?
Here is the benchmark function:
private int mID;
public void testMemberAccess() {
// compare access times for local variables, members, members of other classes
// and getter/setter functions
final int numIterations = 10000000;
final Item item = new Item();
int i = 0;
long start = SystemClock.elapsedRealtime();
for (int k = 0; k < numIterations; k++) {
mID++;
}
long member = SystemClock.elapsedRealtime() - start;
start = SystemClock.elapsedRealtime();
for (int k = 0; k < numIterations; k++) {
item.mID++;
}
long foreignMember = SystemClock.elapsedRealtime() - start;
start = SystemClock.elapsedRealtime();
for (int k = 0; k < numIterations; k++) {
item.setID(item.getID() + 1);
}
long getterSetter = SystemClock.elapsedRealtime() - start;
start = SystemClock.elapsedRealtime();
for (int k = 0; k < numIterations; k++) {
i++;
}
long local = SystemClock.elapsedRealtime() - start;
// make sure the loops aren't optimized away
final int dummy = item.mID + i + mID;
Log.d(Game.ENGINE_NAME, String.format("BENCHMARK: local %d, member %d, foreign member %d, getter setter %d, dummy %d",
local, member, foreignMember, getterSetter, dummy));
}
Edit:
I put each loop in a function and called them 100 times randomly. Result:
BENCHMARK: local 100, member 168, foreign member 190, getter setter 271
Looks good, thx.
The foreign object was created as final class member, not inside the functions.
Well, I'd say that the Dalvik VM's optimizer is pretty smart ;-) I do know that the Dalvik VM is register-based. I don't know the guts of the Dalvik VM, but I would assume that the following is going on (more or less):
In the local case, you are incrementing a method local variable inside a loop. The optimizer recognizes that this variable isn't accessed until the loop is completed, so can use a register and applies the increments there until the loop is complete and then stores the value back into the local variable. This yields: 1 fetch, 10000000 register increments and 1 store.
In the member case, you are incrementing a member variable inside a loop. The optimizer cannot determine whether or not the member variable is accessed while the loop is running (by another method, object or thread), so it is forced to fetch, increment and store the value back into the member variable on each loop iteration. This yields: 10000000 fetches, 10000000 increments and 10000000 store operations.
In the foreign member case, you are incrementing a member variable of an object inside a loop. You have created that object within the method. The optimizer recognizes that this object cannot be accessed (by another object, method or thread) until the loop is completed, so can use a register and apply the increments there until the loop is complete and then store the value back into the foreign member variable. This yields: 1 fetch, 10000000 register increments and 1 store.
In the getter/setter case, I am going to assume that the compiler and/or optimizer is smart enough to "inline" getter/setters (ie: it doesn't really make a method call - it replaces item.setID(item.getID() + 1) with item.mID = item.mID + 1). The optimizer recognizes that you are incrementing a member variable of an object inside a loop. You have created that object within the method. The optimizer recognizes that this object cannot be accessed (by another object, method or thread) until the loop is completed, so it can use a register and apply the increments there until the loop is complete and then store the value back into the foreign member variable. This yields: 1 fetch, 10000000 register increments and 1 store.
I can't really explain why the getter/setter timing is twice the foreign member timing, but this may be due to the time it takes the optimizer to figure it out, or something else.
An interesting test would be to move the creation of the foreign object out of the method and see if that changes anything. Try moving this line:
final Item item = new Item();
outside of the method (ie: declare it as a private member variable of some object instead). I would guess that the performance would be much worse.
Disclaimer: I'm not a Dalvik engineer.
Besides varying their order, there are other things you can do to try to eliminate interference:
1- Eliminate the border effect by timing the first item a second time, preferably into another long variable.
2- Increase the number of iterations by a factor of 10. 10 million seems like a big number, but as you can see from the first suggestion, incrementing a variable 10 million times is so fast on a modern CPU that other things, like filling the various caches, take on importance.
3- Add spurious instructions, like inserting dummy long l = SystemClock.elapsedRealtime() - start calculations. This will help show that 10 million iterations is really a small workload.
4- Add the volatile keyword to the mID field. This is probably the best way to factor out any compiler- or CPU-related optimization.

ANDROID How to reduce String allocations

I've managed to get my allocations down to next to nothing using DDMS (great tool), this has drastically reduced my GCs to about 1 or 2 every 3 minutes. Still, I'm not happy because those usually cause a noticeable delay in the game (on some phones) when you interact with it.
Using DDMS, I know what the allocations are, they are Strings being converted from integers used to display game information to the HUD.
I'm basically doing this:
int playerScore = 20929;
String playerScoreText = Integer.toString(playerScore);
canvas.drawText(playerScoreText, xPos, yPos);
This happens once each frame update, and the HUD system is modular, so I plug things in when I need them, which can cause 4 or 5 HUD elements to allocate Strings and AbstractStringBuilders in DDMS.
Any way to reduce these further or eliminate all the String allocations and just reuse a String object?
Thanks,
Albert Pucciani
Reading your question reminded me of one of Robert Greens articles that I read quite some time ago. It discusses your problem almost identically. http://www.rbgrn.net/content/290-light-racer-20-days-32-33-getting-great-game-performance . Skip down to day 33 and start reading.
Remember the last int score and its string representation. On a new frame check if the score is the same. If the same, then no need to create a new string - just use the old one.
Here's what I've done in the past. This will eliminate string allocations.
I create a char[] of a size that will be at least as large as the maximum number of characters you will need to display on the screen. This means that you should select a maximum high score that is achievable in the game. The way you have it now lets you display a score as high as 2^31-1, which is insanely huge and not practical in the context of the game. Keep in mind, this is your game, so it's OK to limit the max score to something more reasonable; pick a number that will be virtually impossible to achieve. Setting this limit will then spare you from having to convert large integers to String objects.
Here's what's required:
First, you need to be able to separate the digits in an integer and convert them to char without creating String objects. Let's say you want to convert the integer of 324 into three separate characters '3','2','4' to be placed in the text char[]. One way you can do this is by taking the value 324 and do a mod 10 to get the lowest digit. So 324%10 = 4. Then divide the value by ten and do another mod 10 to get the next digit. So (324/10)%10 = 2, and (324/100)%10 = 3.
int score = 324;
int firstPlaceInt = score % 10;            // firstPlaceInt will equal 4
int tensPlaceInt = (score / 10) % 10;      // tensPlaceInt will equal 2
int hundredsPlaceInt = (score / 100) % 10; // hundredsPlaceInt will equal 3
You will have to do the above in a loop, but this expresses the idea of what you're trying to do here.
Next, with these digits you can then convert them to chars by referencing a character map. One way to do this is you can create this character map by making a char[] of size 10 and placing values 0 - 9 in indexes 0 - 9.
char[] charMap = {'0','1','2','3','4','5','6','7','8','9'};
So doing this:
int score = 324;
char firstPlace = charMap[score % 10];
char tensPlace = charMap[(score / 10) % 10];
char hundredsPlace = charMap[(score / 100) % 10];
Will create the chars you need for the 3 digits in score.
Now, after all that, I would limit the highest score to say 99,999 (or whatever makes sense in your game). This means the largest "string" I would need to display is "Score: xx,xxx". This would require a char[] (call it text for this example) of size 13. Initialize the first 7 characters with "Score: ", these will never need to change.
char[] text = new char[13];
text[0] = 'S';
text[1] = 'c';
text[2] = 'o';
text[3] = 'r';
text[4] = 'e';
text[5] = ':';
text[6] = ' ';
The next 6 will vary based on the score. Note, that you may not necessarily fill in all 6 of those remaining characters, therefore you need to create an int (call it scoreCount for this example) which will tell you how many characters in the text char[] are actually relevant to the current score in the game. Let's say I need to display "Score: 324", this only takes 10 chars out of the 13. Write the 3 chars for the score of 324 into char[7] to char[9], and set scoreCount to 10 to indicate the number of valid characters in the char[].
int scoreCount = 7;                    // index of the first score digit, after "Score: "
text[9] = charMap[score % 10];         // firstPlace
text[8] = charMap[(score / 10) % 10];  // tensPlace
text[7] = charMap[(score / 100) % 10]; // hundredsPlace
scoreCount = 10;                       // "Score: 324" occupies 10 valid chars
You will probably have to do the above in a loop, but this should express the general idea of what you're trying to do here.
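The loop described above can be sketched like this (the class and helper names are mine, not from the original answer): it writes the digits of the score right-aligned after the fixed "Score: " prefix and returns the number of valid characters, without allocating any String.

```java
public class ScoreFormatter {
    static final char[] CHAR_MAP = {'0','1','2','3','4','5','6','7','8','9'};

    // Fills text[7..] with the digits of score; returns scoreCount,
    // the number of valid characters in text.
    static int writeScore(char[] text, int score) {
        int digits = 1;
        for (int s = score / 10; s > 0; s /= 10) digits++;  // count the digits
        int scoreCount = 7 + digits;
        for (int pos = scoreCount - 1; pos >= 7; pos--) {   // fill right to left
            text[pos] = CHAR_MAP[score % 10];
            score /= 10;
        }
        return scoreCount;
    }

    public static void main(String[] args) {
        char[] text = new char[13];
        "Score: ".getChars(0, 7, text, 0);  // fixed prefix, written once
        int scoreCount = writeScore(text, 324);
        // In the game you would call canvas.drawText(text, 0, scoreCount, x, y, paint);
        System.out.println(new String(text, 0, scoreCount)); // prints "Score: 324" (demo only)
    }
}
```

The String in main is only there to demonstrate the result; in the game loop you pass the char[] straight to drawText, so nothing is allocated per frame.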
After that, you can just use drawText(char[] text, int index, int count, float x, float y, Paint paint). index will be 0, and count will be scoreCount, which indicates how many characters in text should be drawn. In the example above, it doesn't matter what's in text[10] to text[12]; it's considered invalid. You can continue to update text[] using the character map, and this should not create any objects.
I hope this helps. The code above isn't very robust, but I wrote it out as more of an expression of the ideas I'm trying to convey. You will have to create your own loops and manage the data properly within your code, but this sums up the mechanics of what needs to happen to avoid the use of Strings/StringBuilder/StringBuffer/etc.

loops efficiency

I came across a presentation (dalvik-vm-internals) on the Dalvik VM, in which it is mentioned that of the loops below, we should use (2) and (3) and avoid (7).
(1) for (int i = initializer; i >= 0; i--)
(2) int limit = calculate limit;
for (int i = 0; i < limit; i++)
(3) Type[] array = get array;
for (Type obj : array)
(4) for (int i = 0; i < array.length; i++)
(5) for (int i = 0; i < this.var; i++)
(6) for (int i = 0; i < obj.size(); i++)
(7) Iterable list = get list;
for (Type obj : list)
Comments: I feel that (1) and (2) are the same.
(3)
(4) every time it has to read the length of the array, so this can be avoided
(5)
(6) same as (4), calling size() every time
(7) asked to avoid because the list is of an Iterable type??
One more: in case we have infinite data (assume the data is coming as a stream), which loop should we use for better efficiency?
Request you to please comment on this...
If that's what they recommend, that's what they've optimized the compiler and VM for. The ones you feel are the same aren't necessarily implemented the same way: the compiler can use all sorts of tricks with data and path analysis to avoid naively expensive operations. For instance, the array.length value can be cached, since an array's length never changes.
They're ranked from most to least efficient, though I agree that (1) feels 'unnatural'. The trouble with (7) is that an iterator object is created and has to be GC'ed.
Note carefully when the advice should be heeded. It's clearly intended for bounded iteration over a known collection, not the stream case, and it's only relevant if the loop has a significant effect on performance and energy consumption. The first law of optimization is "Don't optimize". The second law (for experts) is "Don't optimize yet." Measure first (both execution times and CPU consumption), optimize later: this applies even to mobile devices.
What you should consider is the preceding slides: try to sleep as often and as long as possible, while responding quickly to changes. How you do that depends on what kind of stream you're dealing with.
Finally, note that the presentation is two years old, and may not fully apply to 2.2 devices where among other things JIT is implemented.
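To make the difference between patterns (6) and (2) concrete, here is a small sketch (names are illustrative); hoisting size() into a local is exactly the rewrite the slides recommend:

```java
import java.util.Arrays;
import java.util.List;

public class LoopBounds {
    // Pattern (6): the bound is re-evaluated via size() on every iteration.
    static int sumCallingSize(List<Integer> list) {
        int sum = 0;
        for (int i = 0; i < list.size(); i++) sum += list.get(i);
        return sum;
    }

    // Pattern (2): the bound is computed once, before the loop.
    static int sumHoistedLimit(List<Integer> list) {
        int sum = 0;
        int limit = list.size();
        for (int i = 0; i < limit; i++) sum += list.get(i);
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4);
        System.out.println(sumCallingSize(data));  // 10
        System.out.println(sumHoistedLimit(data)); // 10
    }
}
```

Both methods compute the same result; only the number of size() calls differs, which is what matters on a VM that can't prove the bound is loop-invariant.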
With infinite data, none of the examples are good enough. Best would be something like:
for (;;) {
    queue.take(); // to handle concurrency in Java, use a BlockingQueue: take() blocks until data arrives
}
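A self-contained sketch of consuming such a stream with a BlockingQueue (the producer thread, queue capacity, and item count are illustrative):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class StreamConsumer {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(16);
        // Producer thread simulating an endless stream of data.
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; ; i++) queue.put(i); // put() blocks when the queue is full
            } catch (InterruptedException e) {
                // exit when interrupted
            }
        });
        producer.setDaemon(true); // don't keep the JVM alive for the demo
        producer.start();
        for (int n = 0; n < 3; n++) {        // consume a few items, then stop
            System.out.println(queue.take()); // take() blocks until an item arrives
        }
    }
}
```

In a real consumer the for(;;) loop would run until shutdown; the key point is that take() sleeps instead of busy-waiting.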
(1) and (2) are really different: (2) needs an extra subtraction to compare against limit; comparing with 0 in (1) doesn't.
Even better, on most processors (with well-optimized code) no explicit comparison is needed for i >= 0 at all. The processor can use the negative flag left by the last decrement (i--).
So the end of loop (1) looks like (in pseudo-assembler):
--i
jump-if-neg
while the end of loop (2) looks like:
++i
i - limit  # sets the negative flag while i < limit
jump-if-neg
That doesn't make a big difference unless the code in your loop body is really small (like a basic C string operation).
And it might not apply at all to interpreted languages.

Categories

Resources