Programming object for processing large lists of data

I recently had to perform a cross-selection operation on several collections to produce an output collection matching my criteria. (I will omit the custom logic because it is not needed.)
What I did was create a class that takes Lists of elements as parameters, and then call a function inside that class that processes those lists and returns a value.
The point is, I'm convinced I'm not doing the right thing: writing a class that holds hundreds of elements, takes name lists as parameters, and returns another collection looks unconventional and awkward.
Is there a specific programming object or paradigm that lets you process large numbers of large collections, possibly with quite heavy custom selection/mapping logic?
I'm building for Android using Kotlin.

First of all, when it comes to performance there is only one right answer: write a benchmark and test.
About memory: a list of 1,000,000 unique Strings with an average length of 30 chars will take about 120 MB (i.e. 10^6 * 30 * 4, where the last factor is the "size of a char"; let's assume a Unicode character takes 4 bytes). Please add 1-3% for overhead such as references. Therefore, if you have hundreds of Strings, just load the whole data into memory and use a list, because this is the fastest solution (synchronous, immutable, etc.).
If you can do streaming-like operations, you can use sequences. They are lazy, like Java Streams and .NET LINQ. Please check the example below; it requires only a small amount of memory.
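import java.io.File

// Counts the lines that are identical at the same position in both files.
fun countOfEqualLinesOnTheSamePositions(path1: String, path2: String): Int {
    return File(path1).useLines { lines1 ->
        File(path2).useLines { lines2 ->
            lines1.zip(lines2)
                // count only the pairs whose lines match
                .count { (line1, line2) -> line1 == line2 }
        }
    }
}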
If you can't store the whole data in memory and can't use a stream-like scheme, you may:
Rework the algorithm from single-pass to multiple-pass, where each pass is stream-like. For example, Huffman coding is a two-pass algorithm, so it can compress 1 TB of data using a small amount of memory.
Store intermediate data on disk (this is too complex for this short answer).
For additional optimizations:
To cover the case of merging a lot of parallel streams, consider Kotlin Flow as well. It lets you work asynchronously and avoid blocking on IO. For example, this can be useful for merging ~100 network streams.
To keep a lot of non-unique items in memory, consider caching logic. It can save memory (but please benchmark first).
Try operating on ByteBuffers instead of Strings. You can get far fewer allocations (because you can deallocate objects explicitly), but the code will be much more complex.
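As a rough illustration of the caching idea for non-unique items, here is a minimal sketch, written in Java rather than Kotlin so it runs on a plain JVM. The class `StringCache` and its methods are hypothetical names made up for this example, not part of any library; whether this actually saves memory for your data is something to benchmark.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: keeps one canonical instance per distinct value,
// so repeated values in a large list share a single object.
final class StringCache {
    private final Map<String, String> canonical = new HashMap<>();

    // Returns the first instance seen that equals `value`.
    String dedupe(String value) {
        String existing = canonical.putIfAbsent(value, value);
        return existing != null ? existing : value;
    }

    // Number of distinct values seen so far.
    int distinctCount() {
        return canonical.size();
    }

    public static void main(String[] args) {
        StringCache cache = new StringCache();
        List<String> items = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            // The concatenation creates a fresh String each time, as parsing input would.
            items.add(cache.dedupe("value-" + (i % 10)));
        }
        System.out.println(cache.distinctCount());          // 10 distinct values
        System.out.println(items.get(0) == items.get(10));  // same shared instance
    }
}
```

The list still holds 1000 references, but they point at only 10 String objects.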

Related

What is the best way to use threading on a sorting algorithm, that when completed, creates a new activity and gives its data to the new activity?

I will start by saying that on iOS this algorithm takes, on average, <2 seconds to complete. Given a simpler, more specific input that is identical between my iOS and Android tests, it takes 0.09 seconds and 2.5 seconds respectively, and the Android version simply quits on me, so I have no idea whether that would be significantly longer. (The test data gives the sorting algorithm a relatively simple task.)
More specifically, I have a HashMap (an NSMutableDictionary on iOS) that maps a unique key (a string containing only digits, called its course, for example "12345") used to get the specific sections under a course title. The hash map knows what course a specific section falls under because each section has a value "Course". Once they are retrieved, these section objects are compared to see if they can fit into a schedule together, based on user input and their "timeBegin", "timeEnd", and "days" values.
For example: if I asked for schedules with only the courses ABC1234 (there are 50 different time slots or "sections" under that course title) and DEF5678 (50 sections), it will iterate through the HashMap to find every section that falls under those two courses. Then it will sort them into schedules of two classes each (one ABC1234 and one DEF5678). If no two sections conflict, then a total of 2500 (50*50) schedules are possible.
These "schedules" (Stored in ArrayLists since the number of user inputs varies from 1-8 and possible number of results varies from 1-100,000. The group of all schedules is a double ArrayList that looks like this ArrayList>. On iOS I use NSMutableArray) are then fed into the intent that is the next Activity. This Activity (Fragment technically?) will be a pager that allows the user to scroll through the different combinations.
I copied the method of search and sort exactly as it is in iOS(This may not be the right thing to do since the languages and data structures may be fundamentally different) and it works correctly with small output but when it gets too large it can't handle it.
So is multithreading the answer? Should I use something other than a HashMap? Something other than ArrayLists? I only assume multithreading because the errors indicate that too much is being done on the main thread. I've also read that there is a limit to the size of data passed using Intents but I have no idea.
If I was unclear on anything, feel free to ask for clarification. Also, I've been doing Android for ~2 weeks, so I may be completely off track, but hopefully not; this is a fully functional and complete app in the iTunes Store already, so I don't think I'm that far off. Thanks!

1) I think you should go with Android's AsyncTask. The way it splits work between the UI thread and a background thread is sufficient here: operations such as sorting run on the background thread, and once processing finishes you receive the result on the UI thread.
Follow this shorthand example:
Example to Use AsyncTask
2) Example (how to proceed):
a) Define your view in onPreExecute().
b) Do your background operation in doInBackground().
c) Get the result in onPostExecute() and start the new Activity with it.
Hope this helps.
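AsyncTask itself only exists on Android, so the sketch below is a plain-JVM stand-in that mimics the same flow with an ExecutorService: the heavy work runs on a worker thread, and a callback receives the result. The class `BackgroundSort` and its methods are hypothetical names for illustration. On Android the callback would additionally have to be delivered on the UI thread, which is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Plain-JVM stand-in for the AsyncTask flow: heavy work runs on a worker
// thread, and a callback receives the result when it is done.
final class BackgroundSort {
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    // `onDone` plays the role of onPostExecute(); here it runs on the
    // worker thread, whereas Android would post it back to the UI thread.
    Future<?> sortInBackground(List<Integer> input, Consumer<List<Integer>> onDone) {
        return worker.submit(() -> {
            List<Integer> sorted = new ArrayList<>(input); // the doInBackground() part
            Collections.sort(sorted);
            onDone.accept(sorted);
        });
    }

    void shutdown() {
        worker.shutdown();
    }

    public static void main(String[] args) throws Exception {
        BackgroundSort task = new BackgroundSort();
        // get() is only used here to keep the demo alive until the work finishes.
        task.sortInBackground(Arrays.asList(3, 1, 2), result -> System.out.println(result)).get();
        task.shutdown();
    }
}
```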

I think it's better for you to use a TreeMap instead of a HashMap; it keeps its entries sorted by key every time you mutate it. Therefore you won't have to sort your data before starting another activity; you just pass it along.
To use it, the class that represents the Map's key has to implement the Comparable interface (or you have to pass a Comparator to the TreeMap constructor).
You can also read about the TreeMap class here:
http://docs.oracle.com/javase/7/docs/api/java/util/TreeMap.html
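A small sketch of what that ordering looks like in practice. The key "12345" is taken from the question; the other keys, the section counts, and the class name `TreeMapOrder` are made up for illustration. Note that the ordering applies to keys only; values are never sorted.

```java
import java.util.Map;
import java.util.TreeMap;

final class TreeMapOrder {
    // Builds a map from course key to a (hypothetical) section count.
    // String already implements Comparable, so no Comparator is needed.
    static TreeMap<String, Integer> sectionsPerCourse() {
        TreeMap<String, Integer> map = new TreeMap<>();
        map.put("67890", 50);
        map.put("12345", 50);
        map.put("33333", 12);
        return map;
    }

    public static void main(String[] args) {
        // Iteration follows key order, regardless of insertion order.
        for (Map.Entry<String, Integer> e : sectionsPerCourse().entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```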

How to optimize views in couchdb?

I'm writing an Android app using CouchDB. I have around 1000 documents. Every DB operation invokes a view, and my view is taking a lot of time. Is there a way to optimize views in CouchDB? With fewer documents, fetching is fast.

The main things to note with views are that both map and reduce values are cached in the view index (see http://horicky.blogspot.co.uk/2008/10/couchdb-implementation.html for details), that views are only rebuilt when you look at them, and that the CouchDB JavaScript engine is not particularly fast.
There are a few options for turning all this into actual performance improvements:
Accept stale data in your views, and periodically rebuild the view index asynchronously. You can query views with ?stale=ok to immediately return the currently cached view index, from the last time the view was built, and then have some other background task querying with stale != ok to actually do the rebuild. The typical strategies for this are either to rebuild the view every X minutes or to watch /db/_changes and rebuild the view after every Y changes. It depends on your application.
Accept stale data and then always immediately rebuild the view asynchronously afterwards. This uses ?stale=update_after, which I believe will immediately return you a value and then do the view rebuild in the background. Whether to do this or the above depends on your use case and how important up-to-date values are to you; this might end up rebuilding the view far more often than is really necessary, and thereby actually slow down your queries. This does seem easier than the previous option though.
Push as much of your code into your map function as possible. This should improve performance in quickly changing databases, because map values are cached and don't need updating until the underlying document changes, whereas reduces need recalculating whenever one of a larger set of documents changes. I'm not sure exactly how reduce recalculation is tuned in CouchDB, i.e. how big the set that needs recalculating is, but it's definitely going to happen more than the map recalculations, and potentially much, much more.
Use built-in reduce functions (see http://wiki.apache.org/couchdb/Built-In_Reduce_Functions) instead of rewriting them in JavaScript. These fulfil many standard reduce cases, and are much, much faster than writing the equivalent function yourself.
Rewrite your map/reduce in Erlang. See http://wiki.apache.org/couchdb/EnableErlangViews. This does require you to learn Erlang, but should take away a big percentage of your view rebuilding time.

The map function in a view is executed only once per document (plus as many times as you update the document). This happens at the first time you query the view. After that the result of the map function does not have to be computed anymore and therefore the query to the view should be extremely fast. As views are already efficient there is no general way to optimize them further.
This is not the case for temporary views. If you are using these, please store them in a design document to turn them into regular views.

Emit as little data as possible from the map function. You can access the entire document with the include_docs=true URL parameter if you actually need it.
Good
{
  map: function(doc) {
    emit(doc._id, null)
  }
}
Bad
{
  map: function(doc) {
    emit(doc._id, doc)
  }
}

Excessive garbage collection in arithmetic evaluator

I'm attempting to create an Android app which graphs simple mathematical functions that the user inputs (essentially a graphing calculator).
Every onDraw call requires hundreds of arithmetic evaluations per second (which are plotted on screen to produce the graph). When my code evaluates the expression, the program slows down considerably; when the inbuilt methods evaluate the expression, the app runs with no issue.
According to 'LogCat', garbage collection occurs about 12 times per second, each time pausing the app for roughly 15 milliseconds, resulting in a few hundred milliseconds worth of freezes every second. I think this is the problem.
Here is a distilled version of my evaluator function. The expression to be evaluated is named "postfixEquation", the String ArrayList "list" holds the final answer at the end of the process. There are also two String arrays titled "digits" and "operators" which store the numbers and signs which are able to be used:
String evaluate(String[] postfixEquation) {
    list.clear();
    for (int i = 0; i < postfixEquation.length; i++) {
        symbol = postfixEquation[i];
        // If the first character of our symbol is a digit, our symbol is a numeral
        if (Arrays.asList(digits).contains(Character.toString(symbol.charAt(0)))) {
            list.add(symbol);
        } else if (Arrays.asList(operators).contains(symbol)) {
            // There must be at least 2 numerals to operate on
            if (list.size() < 2) {
                return "Error, Incorrect operator usage.";
            }
            // Operates on the top two numerals of the list, then removes them
            // Adds the answer of the operation to the list
            firstItem = Double.parseDouble(list.get(list.size() - 1));
            secondItem = Double.parseDouble(list.get(list.size() - 2));
            list.remove(list.size() - 1);
            list.remove(list.size() - 1);
            if (symbol.equals(operators[0])) {
                list.add(Double.toString(secondItem - firstItem));
            } else if (symbol.equals(operators[1])) {
                list.add(Double.toString(secondItem + firstItem));
            } else if (symbol.equals(operators[2])) {
                list.add(Double.toString(secondItem * firstItem));
            } else if (symbol.equals(operators[3])) {
                if (firstItem != 0) {
                    list.add(Double.toString(secondItem / firstItem));
                } else {
                    return "Error, Dividing by 0 is undefined.";
                }
            } else {
                return "Error, Unknown symbol '" + symbol + "'.";
            }
        }
    }
    // The list should contain a single item, the final answer
    if (list.size() != 1) {
        return "Error, list has " + list.size() + " items left instead of 1.";
    }
    // All is fine, return the final answer
    return list.get(0);
}
The numerals used in the operations are all Strings, as I was unsure if it was possible to hold multiple types within one array (i.e. Strings and Doubles), hence the rampant "Double.parseDouble" and "Double.toString" calls.
How would I go about reducing the amount of garbage collection that occurs here?
If it's of any help, I have been using these steps to evaluate my postfix expression: http://scriptasylum.com/tutorials/infix_postfix/algorithms/postfix-evaluation/index.htm.
I have been unable to get past this issue for weeks and weeks. Any help would be appreciated. Thanks.

The rule for tight loops in Java is don't allocate anything. The fact that you're seeing such frequent GC collections is proof of this.
You appear to be doing calculations with Double, then converting to a String. Don't do that, it's terrible for performance because you create tons and tons of strings then throw them out (plus you are converting back and forth between strings and doubles a lot). Just maintain an ArrayDeque<Double> and use it as a stack -- this also saves you from doing the array resizes that are probably also killing performance.
Precompile the input equations. Convert all the input operations to enum instances -- they are faster to compare (just takes a switch statement), and may even use less memory. If you need to handle doubles, either use a generic Object container and instanceof, or a container class that contains both an operation enum and a double. Precompiling saves you from having to do expensive tests in your tight loop.
If you do these things, your loop should positively fly.
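To make the precompiling suggestion concrete, here is a minimal sketch under a few assumptions: the class `CompiledPostfix` and the enum `Op` are hypothetical names, only the four operators from the question are supported, and a preallocated double[] is used as the stack in place of an ArrayDeque<Double> so that the hot loop does not box doubles either. Unlike the code in the question, division by zero is not reported as an error here; it simply yields Infinity or NaN.

```java
// Sketch: parse the postfix tokens once, then evaluate as often as needed
// without creating Strings or boxed Doubles in the hot loop.
final class CompiledPostfix {
    enum Op { PUSH, ADD, SUB, MUL, DIV }

    private final Op[] ops;         // one entry per token
    private final double[] values;  // operand for PUSH entries, unused otherwise
    private final double[] stack;   // preallocated evaluation stack

    // All String handling happens here, once per expression.
    CompiledPostfix(String[] postfix) {
        ops = new Op[postfix.length];
        values = new double[postfix.length];
        stack = new double[postfix.length];
        for (int i = 0; i < postfix.length; i++) {
            switch (postfix[i]) {
                case "+": ops[i] = Op.ADD; break;
                case "-": ops[i] = Op.SUB; break;
                case "*": ops[i] = Op.MUL; break;
                case "/": ops[i] = Op.DIV; break;
                default:
                    ops[i] = Op.PUSH;
                    values[i] = Double.parseDouble(postfix[i]);
            }
        }
    }

    // Allocation-free on the normal path; throws on malformed input.
    double evaluate() {
        int top = 0; // number of items currently on the stack
        for (int i = 0; i < ops.length; i++) {
            if (ops[i] == Op.PUSH) {
                stack[top++] = values[i];
                continue;
            }
            if (top < 2) throw new IllegalStateException("operator needs two operands");
            double right = stack[--top];
            double left = stack[--top];
            switch (ops[i]) {
                case ADD: stack[top++] = left + right; break;
                case SUB: stack[top++] = left - right; break;
                case MUL: stack[top++] = left * right; break;
                case DIV: stack[top++] = left / right; break;
                default: throw new IllegalStateException("unexpected op");
            }
        }
        if (top != 1) throw new IllegalStateException("expected one result, got " + top);
        return stack[0];
    }

    public static void main(String[] args) {
        // (3 + 4) * 2 in postfix form
        CompiledPostfix expr = new CompiledPostfix(new String[] {"3", "4", "+", "2", "*"});
        System.out.println(expr.evaluate()); // 14.0
    }
}
```

A graphing calculator would also need a variable token for x; that is left out to keep the sketch short.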

Your list manipulation is probably the source of this problem. Lists internally hold arrays, which are expanded or shrunk depending on how much data is in the list, so lots of random adds and removes will put heavy pressure on the garbage collector.
A way to avoid this is to use the right List implementation for your problem, allocate enough space up front to avoid resizing the internal array, and mark unused elements instead of removing them.
The freezing symptoms occur because you're doing your calculations on the UI thread. If you don't want your app to freeze, you might want to look at AsyncTask to do the calculations on a separate thread.
PS: also looks like you're doing some useless operations in there... why parseDouble() secondItem?

The 15ms pauses are not occurring in your UI thread, so they should not be affecting performance too much. If your UI is pausing while your method is executing, consider running it on another thread (with AsyncTask).
To reduce your garbage collection you need to reduce the amount of memory allocated within the loop.
I would suggest looking at:
Performing the Arrays.asList calls outside the loop (ideally somewhere that is only executed once, such as your constructor or a static initializer)
If your list is a LinkedList, consider changing it to an ArrayList
If your List is an ArrayList, make sure you initialise it with enough capacity so it won't need to be resized
Consider making your List store Objects rather than Strings; then you can store both your symbols and Doubles in it, and don't need to convert back and forth between Double and String as much
Consider writing a proper parser (but this is a lot more work)

However, you are using a lot of strings. While this may not be the case, it's always one of those things you can check out because Java does funky stuff with String. If you are having to convert the string to double as you are outputting, then there's quite a bit of overhead going on.
Do you need to store the data as String? (Note that the answer may actually be yes) Heavy use of temporary strings can actually cause the garbage collector to get fired off often.
Be careful about premature optimization. Profilers and running through the function line-by-line can help

Pooling with least amount of GC on Scala

In a game for Android written in Scala, I have plenty of objects that I want to pool. First I tried to have both active (visible) and non active instances in the same pool; this was slow due to filtering that both causes GC and is slow.
So I moved to using two data structures, so when I need to get a free instance, I just take the first from the passive pool and add it to the active pool. I also need fast random access to the active pool (when I need to hide an instance). I'm using two ArrayBuffers for this.
So my question is: which data structure would be best for this situation? And how should that (or those) specific data structure(s) be used to add and remove to avoid GC as much as possible and be efficient on Android (memory and cpu constraints)?

The best data structure is an internal list, where you add
var next: MyClass
to every class. The non-active instances then become what's typically called a "free list", while the active ones become a singly-linked list a la List.
This way your overhead is exactly one pointer per object (you can't really get any less than that), and there is no allocation or GC at all. (Unless you want to implement your own by throwing away part or all of the free list if it gets too long.)
You do lose some collections niceness, but you can just make your class be an iterator:
def hasNext = (next != null)
is all you need given that var. (Well, and extends Iterator[MyClass].) If your pool sizes are really quite small, sequential scanning will be fast enough.
If your active pool is too large for sequential scanning down a linked list and elements are not often added or deleted, then you should store them in an ArrayBuffer (which knows how to remove elements when needed). Once you remove an item, throw it on the free list.
If your active pool turns over rapidly (i.e. the number of adds/deletes is similar to the number of random accesses), then you need some sort of hierarchical structure. Scala provides an immutable one that works pretty well in Vector, but no mutable one (as of 2.9); Java also doesn't have something that's really suitable. If you wanted to build your own, a red-black or AVL tree with nodes that keep track of the number of left children is probably the way to go. (It's then a trivial matter to access by index.)
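A minimal sketch of the free-list part of this idea, written in Java rather than Scala and using made-up names (`SpritePool`, `Sprite`). Each pooled object carries its own `next` link, so after the pool is built, taking and returning instances allocates nothing. Tracking the active instances (as a second linked list or in an ArrayBuffer, as described above) is left out.

```java
// Sketch of a free list threaded through the pooled objects themselves.
final class SpritePool {
    static final class Sprite {
        Sprite next;     // link used only while the sprite sits in the free list
        boolean active;
    }

    private Sprite free; // head of the free list

    // All allocation happens up front, when the pool is created.
    SpritePool(int capacity) {
        for (int i = 0; i < capacity; i++) {
            Sprite s = new Sprite();
            s.next = free;
            free = s;
        }
    }

    // Pops the head of the free list, or returns null when the pool is exhausted.
    Sprite acquire() {
        Sprite s = free;
        if (s == null) return null;
        free = s.next;
        s.next = null;
        s.active = true;
        return s;
    }

    // Pushes the sprite back onto the free list.
    void release(Sprite s) {
        s.active = false;
        s.next = free;
        free = s;
    }

    public static void main(String[] args) {
        SpritePool pool = new SpritePool(2);
        Sprite a = pool.acquire();
        Sprite b = pool.acquire();
        System.out.println(pool.acquire() == null); // pool is empty now
        pool.release(a);
        System.out.println(pool.acquire() == a);    // the released instance is reused
    }
}
```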

I guess I'll mention my idea. The filter and map methods iterate over the entire collection anyway, so you may as well simplify that and just do a naive scan over your collection (to look for active instances). See here: https://github.com/scala/scala/blob/v2.9.2/src/library/scala/collection/TraversableLike.scala
def filter(p: A => Boolean): Repr = {
val b = newBuilder
for (x <- this)
if (p(x)) b += x
b.result
}
I ran some tests with n=31 (so I wouldn't have to keep more than a 32-bit Int bitmap), comparing a naive scan, a filter/foreach scan, a filter/map scan, and a bitmap scan, with a random 33% of the set marked as active. I had a running counter to double-check that I wasn't cheating by not looking at the right values or something. By the way, this is not running on Android.
Depending on the number of active values, my loop took more time.
Results:
naive scanned a million times in: 197 ms (sanity check: 9000000)
filter/foreach scanned a million times in: 441 ms (sanity check: 9000000)
map scanned a million times in: 816 ms (sanity check: 9000000)
bitmap scanned a million times in: 351 ms (sanity check: 9000000)
Code here--feel free to rip it apart or tell me if there's a better way--I'm fairly new to scala so my feelings won't be hurt: https://github.com/wfreeman/ScalaScanPerformance/blob/master/src/main/scala/scanperformance/ScanPerformance.scala

Moving instances from Set to another in Scala

In a game I need to keep tabs on which of my pooled sprites are in use. When activating multiple sprites at once, I want to transfer them from my passivePool to activePool, both of which are immutable HashSets (OK, I'll be creating new sets each time, to be exact). So my basic idea is along the lines of:
activePool ++= passivePool.take(5)
passivePool = passivePool.drop(5)
but reading the Scala documentation, I'm guessing that the 5 that I take might be different from the 5 I then drop, which is definitely not what I want. I could also say something like:
val moved = passivePool.take(5)
activePool ++= moved
passivePool --= moved
but as this is something I need to do pretty much every frame in realtime on a limited device (Android phone) I guess this would be much slower as I will have to search one by one each of the moved sprites from the passivePool.
Any clever solutions? Or am I missing something basic? Remember, efficiency is a primary concern here. And I can't use Lists instead of Sets because I also need random-access removal of sprites from activePool when the sprites are destroyed in the game.

There's nothing like benchmarking for getting answers to these questions. Let's take 100 sets of size 1000 and drop them 5 at a time until they're empty, and see how long it takes.
passivePool.take(5); passivePool.drop(5) // 2.5 s
passivePool.splitAt(5) // 2.4 s
val a = passivePool.take(5); passivePool --= a // 0.042 s
repeat(5){ val a = passivePool.head; passivePool -= a } // 0.020 s
What is going on?
The reason things work this way is that immutable.HashSet is built as a hash trie with optimized (effectively O(1)) add and remove operations, but many of the other methods are not re-implemented; instead, they are inherited from collections that don't support add/remove and therefore can't get the efficient methods for free. They therefore mostly rebuild the entire hash set from scratch. Unless your hash set has only a handful of elements in it, this is a bad idea. (In contrast to the 50-100x slowdown with sets of size 1000, a set of size 100 has "only" a 6-10x slowdown....)
So, bottom line: until the library is improved, do it the "inefficient" way. You'll be vastly faster.

I think there may be some mileage in using splitAt here, which will give you back both the five sprites to move and the trimmed pool in a single method invocation:
val (moved, newPassivePool) = passivePool.splitAt(5)
activePool ++= moved
passivePool = newPassivePool
Bonus points if you can assign directly back to passivePool on the first line, though I don't think it's possible in a short example where you're defining the new variable moved as well.
