I have a Lucene index with 50571 documents in it from 1740 books. I have two processes that create this index. The first creates the index on the device, document by document; this process is very slow. The other creates a per-book index on the server (the exact same way I create it on the device), then downloads it and merges it into the master index. This is a much quicker way to create the master index. Creating the index works fine either way.
The problem is that when I search on the download-merge index I get an OutOfMemoryException, but when I search with the index that was created on the device I don't get that error. I built the index book by book (download-merge) and searched after each book was indexed; based on that, at around book ~450 I start getting the OutOfMemoryException.
What is causing me to run out of memory?
Lucene is a memory hog. When merging indices together, it needs temporary space of up to twice the combined size of all the input indices. As quoted from the Lucene documentation:
Note that this requires temporary free space in the Directory up to 2X
the sum of all input indexes (including the starting index).
If readers/searchers are open against the starting index, then temporary free
space required will be higher by the size of the starting index
That is a lot of memory. To mitigate this, we have to shrink the index by calling forceMerge(int) on the IndexWriter. It is a slow process, but it does shrink the size of the index. I call it with an argument of 1 every time there are 50 or more files in the index directory.
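A minimal sketch of that mitigation, assuming Lucene 5.x-style APIs over an FSDirectory (the path and the 50-file threshold are just the ones described above; exact constructors vary between Lucene versions):

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

class IndexCompactor {
    // Collapse the index into a single segment once it accumulates 50+ files.
    static void compactIfNeeded(String indexPath) throws Exception {
        FSDirectory dir = FSDirectory.open(Paths.get(indexPath));
        if (dir.listAll().length >= 50) {
            try (IndexWriter writer =
                     new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                writer.forceMerge(1);   // slow, but merges everything down to one segment
                writer.commit();
            }
        }
    }
}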
In my project I have a requirement to show the number of pages in Word documents (.doc, .docx) and the number of sheets in Excel documents (.xls, .xlsx). I tried reading the .docx file using Docx4j, but the performance is very poor given that I need just the count, so I tried Apache POI instead. I am getting an error, something like:
"trouble writing output: Too many methods: 94086; max is 65536. By package:"
I want to know whether there is any paid or open-source library available for Android.
There is just no way to show the exact number of pages in an MS Word file, because it will be different for different users. The exact number depends on printer settings, paper settings, fonts, embedded images, etc.
Still, you can do the following for binary files:
open the file using POIFSFileSystem or NPOIFSFileSystem
extract only the FileInformationBlock, as is done in the HWPFDocumentCore constructor
create DocumentProperties using information from the FileInformationBlock, as is done in the HWPFDocument constructor
get the value of the cPg property of the DOP: DocumentProperties::getCPg()
The description of this field is: "A signed integer value that specifies the last calculated or estimated count of pages in the main document, depending on the values of fExactCWords and fIncludeSubdocsInStats."
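A rough sketch of reading that value with POI's HWPF (for brevity this lets HWPFDocument parse the whole file rather than extracting only the FileInformationBlock as the steps above describe; the file path is hypothetical):

import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.model.DocumentProperties;

class DocPageCount {
    static int estimatedPageCount(String path) throws Exception {
        try (FileInputStream in = new FileInputStream(path)) {
            HWPFDocument doc = new HWPFDocument(in);
            DocumentProperties dop = doc.getDocProperties();
            return dop.getCPg();   // cPg: last calculated/estimated page count
        }
    }
}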
For DOCX/XLSX documents you will need to access the same (I assume) property but using SAX or StAX methods.
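For .docx, one low-overhead route (an assumption on my part) is that the estimated page count lives in the extended-properties part docProps/app.xml as a <Pages> element, so you can open the file as a plain zip and stream just that one part with StAX:

import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class DocxPageCount {
    static int estimatedPageCount(String path) throws Exception {
        try (ZipFile zip = new ZipFile(path)) {
            ZipEntry appXml = zip.getEntry("docProps/app.xml");
            if (appXml == null) return -1;   // part missing
            XMLStreamReader xml = XMLInputFactory.newInstance()
                    .createXMLStreamReader(zip.getInputStream(appXml));
            while (xml.hasNext()) {
                if (xml.next() == XMLStreamConstants.START_ELEMENT
                        && "Pages".equals(xml.getLocalName())) {
                    return Integer.parseInt(xml.getElementText());
                }
            }
        }
        return -1;   // element missing
    }
}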
I have XML data that came from the server. Now I want to store it in a database, and it should load on a button click. How should I do this?
<qst_code> 7 </qst_code>
<qst_prg_code> 1 </qst_prg_code>
<qst_mod_code> 2 </qst_mod_code>
<qst_Question>What is not true about left dominant cardiology circulation? </qst_Question>
<qst_opt1>It is seen in 20% of the population</qst_opt1>
<qst_opt2>Left circumflex artery supplies the Posterior descending artery</qst_opt2>
<qst_opt3>Left circumflex artery terminates as obtuse marginal branch</qst_opt3>
<qst_opt4>Left circumflex artery may originate from right coronary sinus</qst_opt4>
<qst_opt01>1</qst_opt01>
<qst_opt02>1</qst_opt02>
<qst_opt03>1</qst_opt03>
<qst_opt04>1</qst_opt04>
<qst_CorctOpt>1</qst_CorctOpt>
<qst_Marks>10</qst_Marks>
<qst_company_code>1</qst_company_code>
<user_code>1</user_code>
One option is to store it as a string if the data is not too large; otherwise, break it into a schema that maps to SQLite and recreate it while loading.
If your XML data is large, I would rather change the data exchange format to JSON. Parsing XML and then inserting it is a very expensive, time-consuming operation.
Some issues you will face with XML parsing and insertion:
a. XML parsing is memory intensive, so your heap size will grow; keep an eye on this, as it might cause a crash.
b. Inserts into a SQLite DB take around ~100ms per tuple (row), so you can calculate the time it will take to pump in thousands of rows of data.
If your data is not too large, don't bother with SQLite.
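If you do go the SQLite route, wrapping the inserts in a single transaction avoids paying that per-row commit cost. A hypothetical sketch (the "questions" table and the shape of the rows are made up to match the XML above):

import java.util.List;
import android.content.ContentValues;
import android.database.sqlite.SQLiteDatabase;

class QuestionDao {
    // Insert all parsed rows inside one transaction instead of one commit per row.
    static void insertAll(SQLiteDatabase db, List<ContentValues> rows) {
        db.beginTransaction();
        try {
            for (ContentValues row : rows) {
                db.insert("questions", null, row);   // "questions" table is hypothetical
            }
            db.setTransactionSuccessful();
        } finally {
            db.endTransaction();
        }
    }
}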
I have a list of 1000 words. I need to load an array with n randomly chosen words from that list (no repeats allowed). What is the best way to go about that?
My ideas:
1) Load the words into R.arrays to create a String array. Use Collections.shuffle to shuffle the array, then pull the first n entries from it. Right now I am having memory issues loading the initial array with the 1000 words using this method.
2) Put the words in a text file and read each word into a String array. Use the same method to get the first n entries.
3) Hard-code the words into a String array (I'd use a script to generate that output, of course). Use the same method to get the first n entries.
Is there a better way?
If you're mainly worried about memory usage and you're willing to give up computation speed, here's an algorithm that will get you there.
Keep your words in a text file, one word per line, with a fixed number of characters per word, padding each word with trailing spaces to ensure a fixed word size; call it s. Then:
1. Create an array of max size n; call it w.
2. Open a stream reader to the file containing the 1000 words.
3. Get a random number between 0 and 999; call it k (0-based so the first word sits at position 0).
4. Seek to position k*s in the file stream and grab the next s characters.
5. Add the word to w if it does not exist in the array yet.
6. If the w array is full (i.e. size = n), we're done; otherwise go back to step 3.
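A sketch of those steps, assuming a words.txt with 1000 records each padded to a fixed s characters plus a newline (n must be at most 1000, or the loop never fills):

import java.io.RandomAccessFile;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

class RandomWords {
    static String[] pick(int n) throws Exception {
        final int s = 16;                       // assumed fixed record width
        Set<String> w = new HashSet<>();        // a set gives the no-repeats check
        Random rnd = new Random();
        try (RandomAccessFile file = new RandomAccessFile("words.txt", "r")) {
            byte[] buf = new byte[s];
            while (w.size() < n) {
                int k = rnd.nextInt(1000);      // 0..999
                file.seek((long) k * (s + 1));  // +1 for the newline
                file.readFully(buf);
                w.add(new String(buf, "US-ASCII").trim());
            }
        }
        return w.toArray(new String[0]);
    }
}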
Let us know how it goes. Happy coding!
I recently created a program that gets medium-to-large amounts of XML data and converts it into arrays of Strings, then displays the data.
The program works great, but it freezes while it is building the arrays (for around 16 seconds, depending on the size).
Is there any way I can optimize my program (alternatives to String arrays, etc.)?
3 optimizations that should help:
Threading
If the program freezes it most likely means that you're not using a separate thread to process the large XML file. This means that your app has to wait until this task finishes to respond again.
Instead, create a new thread to process the XML and notify the main thread via a Handler when it's done, or use AsyncTask. This is explained in more detail here.
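A minimal AsyncTask sketch (parseXml() and displayData() are hypothetical stand-ins for your own parsing and UI code):

import java.io.InputStream;
import java.util.Collections;
import java.util.List;
import android.os.AsyncTask;

// Parse off the main thread; hand the result back on the UI thread.
class ParseXmlTask extends AsyncTask<InputStream, Void, List<String>> {
    @Override
    protected List<String> doInBackground(InputStream... streams) {
        return parseXml(streams[0]);     // runs on a background thread
    }

    @Override
    protected void onPostExecute(List<String> result) {
        displayData(result);             // runs on the UI thread
    }

    // Hypothetical stubs for the app's existing parsing and display code.
    private List<String> parseXml(InputStream in) { return Collections.emptyList(); }
    private void displayData(List<String> data) { }
}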
Data storage
Additionally, a local SQLite database might be more appropriate for storing large amounts of data, especially if you don't have to show it all at once. This can be achieved with the cursors provided by the platform.
Configuration changes
Finally, make sure that your data doesn't have to be reconstructed when a configuration change occurs (such as an orientation change). A persistent SQLite database can help with that, and so can these methods.
You can use SAX to process the stream of XML, rather than trying to parse the whole file and generating a DOM in memory.
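A bare-bones SAX sketch: the handler sees elements as they stream past, so no DOM is ever built (the element name and the println are hypothetical placeholders for your own handling):

import java.io.File;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class StreamingParse {
    static void parse(File xml) throws Exception {
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();

            @Override public void startElement(String uri, String local,
                                               String qName, Attributes attrs) {
                text.setLength(0);                  // reset per element
            }

            @Override public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);     // text may arrive in chunks
            }

            @Override public void endElement(String uri, String local, String qName) {
                if ("item".equals(qName)) {         // hypothetical element name
                    System.out.println(text);       // handle one record at a time
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(xml, handler);
    }
}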
If you find that you really are using too much memory, and you have a reason to keep the strings in memory rather than caching them on disk, there are certainly ways to reduce the memory requirements. It's a sad fact that Java strings use a lot of space. They require two objects (the String itself and an underlying char array) and use two bytes per char. If your data is mostly 7-bit ASCII, you may be better off leaving it as a UTF-8 encoded byte stream, using one byte per character in the typical case.
A very effective scheme is to maintain an array of 32k byte buffers, and append the UTF-8 representation of each new string onto the first empty space in one of those arrays. Your reference to the string becomes a simple integer: PTR = (buffer index * 32k) + (buffer offset). "PTR/32k" yields the index of the desired byte buffer, and "PTR % 32k" yields the location within the buffer. Use either an initial length byte or a null terminator to keep track of how long the string is. When you need to access one of the strings, don't allocate a new String object: unpack it into a mutable StringBuilder or work directly with the UTF-8 byte representation.
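A sketch of that scheme (this variant uses a two-byte length prefix rather than the single length byte or null terminator mentioned above, and assumes no single string exceeds 32k):

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class PackedStrings {
    private static final int BUF = 32 * 1024;
    private final List<byte[]> buffers = new ArrayList<>();
    private int offset = BUF;   // forces allocation of the first buffer

    // Appends s and returns its integer handle: (buffer index * 32k) + offset.
    int add(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        if (offset + 2 + utf8.length > BUF) {   // no room: start a fresh buffer
            buffers.add(new byte[BUF]);
            offset = 0;
        }
        byte[] buf = buffers.get(buffers.size() - 1);
        int ptr = (buffers.size() - 1) * BUF + offset;
        buf[offset++] = (byte) (utf8.length >> 8);   // length prefix, high byte
        buf[offset++] = (byte) utf8.length;          // length prefix, low byte
        System.arraycopy(utf8, 0, buf, offset, utf8.length);
        offset += utf8.length;
        return ptr;
    }

    // Unpacks the string stored at handle ptr: ptr/32k picks the buffer,
    // ptr%32k the location within it.
    String get(int ptr) {
        byte[] buf = buffers.get(ptr / BUF);
        int off = ptr % BUF;
        int len = ((buf[off] & 0xFF) << 8) | (buf[off + 1] & 0xFF);
        return new String(buf, off + 2, len, StandardCharsets.UTF_8);
    }
}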
The above approach is obviously a lot more work, but can save you between a factor of 2 and 6 in memory usage (depending on the length of your strings). However, you should beware of premature optimization. If your problem is with the processing time to parse your input, or is somewhere else in your program, you could find that you've done a lot of work to fix something that isn't your bottleneck and thus get no improvement at all.
I am wondering how I would be able to run a SQLite ORDER BY in this manner:
select * from contacts order by jarowinkler(contacts.name,'john smith');
I know Android has a bottleneck with user-defined functions; do I have an alternative?
Step #1: Do the query minus the ORDER BY portion
Step #2: Create a CursorWrapper that wraps your Cursor, calculates the Jaro-Winkler distance for each position, sorts the positions, then uses the sorted positions when overriding all methods that require a position (e.g., moveToPosition(), moveToNext()).
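A sketch of step #2 (jaroWinkler() is a hypothetical stub for whatever implementation you bring; only a couple of the position-based overrides are shown, the rest follow the same remapping pattern):

import java.util.Arrays;
import android.database.Cursor;
import android.database.CursorWrapper;

class JaroWinklerSortedCursor extends CursorWrapper {
    private final Integer[] order;   // order[i] = underlying row for sorted position i
    private int pos = -1;

    JaroWinklerSortedCursor(Cursor c, String target, int nameColumn) {
        super(c);
        final double[] score = new double[c.getCount()];
        order = new Integer[c.getCount()];
        for (int i = 0; i < order.length; i++) {
            c.moveToPosition(i);
            order[i] = i;
            score[i] = jaroWinkler(c.getString(nameColumn), target);
        }
        // best match first
        Arrays.sort(order, (a, b) -> Double.compare(score[b], score[a]));
    }

    @Override public boolean moveToPosition(int position) {
        pos = position;
        return position >= 0 && position < order.length
                && super.moveToPosition(order[position]);
    }

    @Override public boolean moveToNext() { return moveToPosition(pos + 1); }
    @Override public int getPosition() { return pos; }

    private static double jaroWinkler(String a, String b) {
        return 0;   // hypothetical: plug in a real Jaro-Winkler implementation
    }
}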
Pre-calculate the string lengths and add them to a separate column, then sort the entire table by that length and add indexes if you can. Then add extra filters. For example, you don't want to compare "Srivastava Brahmaputra" to "John Smith": the lengths are out of whack by way too much, so exclude those kinds of comparisons by length as a percentage of the total length. So if your word is 10 characters, compare it only to words with 10±2 or 10±3 characters.
This way you will significantly reduce the number of times this algorithm needs to run.
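For example, the length pre-filter could look like this (the name_length column is something you would maintain yourself, e.g. on insert):

import android.database.Cursor;
import android.database.sqlite.SQLiteDatabase;

class ContactFilter {
    // Only rows within +/-2 characters of the query's length get scored.
    static Cursor candidates(SQLiteDatabase db, String query) {
        int len = query.length();
        return db.rawQuery(
                "SELECT * FROM contacts WHERE name_length BETWEEN ? AND ?",
                new String[] { String.valueOf(len - 2), String.valueOf(len + 2) });
    }
}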
Typically, in a vocabulary of 100,000 entries, such filters reduce the number of comparisons to about 300. Unless you are doing full-blown record linkage, in which case I would wonder why you'd use Android for that: you would still need to apply probabilistic methods and calculate scores, and that is not a job for Android (at least not for now).
Also, in MS SQL Server, a Jaro-Winkler string distance wrapped in a CLR function performs much better, since SQL Server doesn't support arrays natively and much of the processing revolves around arrays. An implementation in T-SQL adds too much overhead, but SQL-CLR works extremely fast.