EDIT:
I followed your good advice and now use a trie data structure to hold my dictionary. For anyone interested, the structure I have chosen is this one.
But now I have another issue: building the trie every time I launch my application takes far too long! Maybe my dictionary is too big, or maybe the trie implementation I chose is not appropriate for a simple dictionary.
So is there a way to keep this structure after the app is closed, like a stored database? Or, if you think the issue is caused by the implementation, can you recommend another one?
I've got a serious issue with my Android project.
The goal is to find all the words that can be made from a series of 6 letters.
To do that, I have two tables in my database:
'words', with two columns: '_id' and 'mots',
and 'temp', a temporary table
with the same columns.
'words' contains every word in the vocabulary (it's huge) and 'temp' contains all the possible combinations that can be made from the 6 letters (using at least 3 letters).
I'm trying to select the entries of 'temp' that are real words, i.e. the ones that are also in 'words'. Here is my code to do that:
First I select the words that contain the right letters (at least 3 letters are used):
db.execSQL("CREATE TABLE temp2 (_id integer primary key autoincrement, mots text not null);");
db.execSQL("INSERT INTO temp2 (_id, mots) SELECT * FROM words WHERE mots like '%"+lettres.tab_char.get(0)+"%' OR mots like '%"+lettres.tab_char.get(1)+"%' "
+ "OR mots like '%"+lettres.tab_char.get(2)+"%' OR mots like '%"+lettres.tab_char.get(3)+"%' OR mots like '%"+lettres.tab_char.get(4)+"%' "
+ "OR mots like '%"+lettres.tab_char.get(5)+"%';");
(lettres.tab_char is an ArrayList<Character> that contains the letters used to make the combinations in temp.)
Then I join the tables 'temp2' and 'temp':
String MY_QUERY = "SELECT temp2._id, temp2.mots FROM temp2 INNER JOIN temp ON temp2.mots = temp.mots;";
Cursor test = db.rawQuery(MY_QUERY, null);
After that I put the values into a ListView.
It works, but it's really, really slow. Can you help me, please?
In general the algorithm that you're using is really quite inefficient. First you're searching through every entry 6 times using a wildcard match, and then you're joining this gigantic result with your entire dataset again.
SQL is probably not the right place to do this. SQL is good at queries, this is more of a calculation. Do the matching in code.
There are lots of ways you can go about accomplishing this, but finding the right solution depends on your requirements. Can the letters repeat? How big of a vocabulary is "huge"? Does it still fit in a few MB? Does this lookup need to happen near-instantaneously?
Update:
Given your requirements, I have to agree with Joe. It's really more of a data structure than an algorithm, but a trie is the way to go. You should be able to build the trie once while loading the app and then each "match" will be a fairly simple lookup walking down the trie.
The algorithm you're looking for is actually called a "trie" (the name comes from "retrieval"). Tries are extremely well suited to this type of calculation (Android actually uses them in the SMS and mail apps to do things like emoticon replacements). If done properly, you will be surprised by the performance you can get. I agree with Paul: you definitely should not run the query the way you currently do. In fact, many implementations load the entire dictionary file into an in-memory trie and use that trie for word lookup and verification throughout the application's lifetime. The Scrabble word list (the link is also contained in the question below: twl06.zip) is only 1.9MB and contains 178k words. A trie stores shared prefixes only once (e.g., "stair" and "stare" share the S-T-A prefix, which then branches into "I" and "R", and so on...), so a compact trie representation can be smaller than the 1.9MB list, although a naive implementation with one object per node may well use more memory than that.
Here's a good place to start: Algorithm to generate anagrams
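To make the trie idea concrete, here is a minimal sketch in plain Java. The class name LetterTrie and its methods are made up for illustration and are not taken from any implementation mentioned in this thread. It assumes the dictionary is inserted once at startup and that each of the 6 letters may be used at most as often as it appears. The search walks down the trie and only follows branches for letters that are still available, so it never generates combinations that cannot lead to a word.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Illustrative trie sketch; all names here are hypothetical. */
class LetterTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord; // true if the path from the root to this node spells a word
    }

    private final Node root = new Node();

    /** Adds one dictionary word, creating nodes only for the part not already stored. */
    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    /** Returns, in sorted order, every stored word of at least minLength
     *  that can be spelled with the given letters (each used at most once). */
    List<String> wordsFrom(String letters, int minLength) {
        TreeSet<String> found = new TreeSet<>(); // a set, so repeated letters do not yield duplicates
        collect(root, letters.toCharArray(), new boolean[letters.length()],
                new StringBuilder(), minLength, found);
        return new ArrayList<>(found);
    }

    private void collect(Node node, char[] letters, boolean[] used,
                         StringBuilder prefix, int minLength, TreeSet<String> found) {
        if (node.isWord && prefix.length() >= minLength) {
            found.add(prefix.toString());
        }
        for (int i = 0; i < letters.length; i++) {
            if (used[i]) continue;
            Node child = node.children.get(letters[i]);
            if (child == null) continue; // no word continues with this letter: prune
            used[i] = true;
            prefix.append(letters[i]);
            collect(child, letters, used, prefix, minLength, found);
            prefix.setLength(prefix.length() - 1); // backtrack
            used[i] = false;
        }
    }

    public static void main(String[] args) {
        LetterTrie trie = new LetterTrie();
        for (String w : new String[] {"stair", "stare", "star", "rat", "art"}) {
            trie.insert(w);
        }
        System.out.println(trie.wordsFrom("starei", 3));
    }
}
```

With this approach the 'temp' table of all letter combinations is not needed at all: the words are found directly during the walk.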
Related
For an Android word game (with minSdkLevel=9 meaning SQLite version 3.6.22) -
I would like to deliver the dictionary as a prefilled SQLite table within the APK file (with the help of SQLiteAssetHelper).
In the SQLite database there will be just 1 table:
create table dict ( /* contains 700 000 unique words */
word text not null
);
My question please:
How to declare the table for the best performance and which kind of SQL-query to use?
(When checking if a word entered by player is present in the dict table or not - that will be the main usage of the SQLite database in the app).
Should I create an index (is it even possible to index text columns)?
Or should I declare the word column as primary key?
Also, some SQLite for Android guides suggest to have an _id column in each table (probably to enable fetching the last inserted record? - which I don't really need here). Should I maybe use
create table dict (
_id integer primary key,
word text unique not null
);
create index word_index on dict(word);
or will that be a waste of 4 x 700 000 bytes? (Or is it added as _rowid_ anyway?)
Quick answer: yes, you can create an index on a text column.
However, this may not be the option with the best performance.
An index created by SQLite is a B-tree (a balanced multiway tree, not a binary tree), which speeds up the search in roughly the way a binary search does. With 700k words, a plain binary search needs about 20 comparisons; a B-tree needs fewer page reads than that because each page holds many keys. This may well be fast enough; you need to test it to actually know the performance.
An alternative would be to create multiple tables (buckets), e.g. tables named wordA, wordB, wordC, etc.,
and use the first character of the word to determine which table it goes into.
This reduces each table to about 27k records on average (of course, the buckets will not all be the same size),
which reduces the number of steps the search has to perform.
Better still, use a hash function to determine the bucket: that makes the bucket sizes more balanced and lets you freely choose the number of buckets.
You will have to experiment to find the optimal bucket size.
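A rough sketch of the hashing idea, in plain Java. The table prefix "dict_" and the class and method names are assumptions made up for this example. It also includes a small helper that estimates the number of comparisons a plain binary search needs, which gives a feel for how much (or how little) bucketing saves.

```java
/** Illustrative sketch; the names and the "dict_" table prefix are hypothetical. */
class DictBuckets {
    /** Picks the bucket table for a word from its hash code. */
    static String bucketTable(String word, int bucketCount) {
        // floorMod keeps the result in [0, bucketCount) even for negative hash codes
        int bucket = Math.floorMod(word.hashCode(), bucketCount);
        return "dict_" + bucket;
    }

    /** Worst-case comparisons of a plain binary search over `rows` sorted entries:
     *  the ceiling of log2(rows), for rows >= 1. */
    static int binarySearchSteps(int rows) {
        return 32 - Integer.numberOfLeadingZeros(rows - 1);
    }

    public static void main(String[] args) {
        System.out.println(bucketTable("afternoon", 26));
        System.out.println(binarySearchSteps(700_000)); // whole dictionary
        System.out.println(binarySearchSteps(27_000));  // one of about 26 buckets
    }
}
```

By this estimate, going from 700k rows to about 27k rows per bucket saves only around five comparisons per lookup, so it is worth measuring a single indexed table first before adding the complexity of buckets.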
I have a biggish database ~32mb which has lots of text in 4 languages. Including Arabic and Urdu. I need to search this text in the most efficient way (speed & size).
I am considering FTS, and trying to find out how to implement it. Right now I am reading http://www.sqlite.org/fts3.html#section_1_2 about it.
It seems to me, an FTS table is just like a normal table used to index all the different words. So my questions are:
1) If to populate FTS I have to do all the inserts myself, then why not make my own indexed word table, what is the difference?
Answer : Yes there are many advantages, many built in functions that help. For example with ranking etc, searching of stems and the transparent nature of how it all works in android makes the FTS approach more appealing.
2) On the google docs I read its a virtual in memory table, now this would be massive right... but it doesnt mention this on the SQLite website. So which is it?
3) Is there an easy way to generate all the different words from my columns?
4) Will the FTS handle arabic words properly?
FTS allows for fast searching of words; normal indexes only allow to search for entire values or for the beginning of the value.
If your table has only one word in each field, using FTS does not make sense.
FTS is a virtual table, but not an in-memory table.
You can get individual terms from the full-text index with the fts4aux table.
The default tokenizer works only with ASCII text.
You have to test whether the ICU or UNICODE61 tokenizers work with your data.
1) If to populate FTS I have to do all the inserts myself, then why
not make my own indexed word table, what is the difference?
With your own indexed word table, you would have to parse the sentences into words yourself. You would then need one table for sentences and another for words, and you would have to do all of this efficiently.
2) On the google docs I read its a virtual in memory table, now this
would be massive right... but it doesnt mention this on the SQLite
website. So which is it?
I don't understand your question. The data is accessed through the virtual table extension, but the backing storage is in the database itself (FTS4 creates 5 tables for each virtual table). Check this:
sqlite> CREATE VIRTUAL TABLE docs USING fts4();
sqlite> .schema
CREATE VIRTUAL TABLE docs USING fts4();
CREATE TABLE 'docs_content'(docid INTEGER PRIMARY KEY, 'content');
CREATE TABLE 'docs_segments'(blockid INTEGER PRIMARY KEY, block BLOB);
CREATE TABLE 'docs_segdir'(level INTEGER,idx INTEGER,start_block INTEGER,leaves_end_block INTEGER,end_block INTEGER,root BLOB,PRIMARY KEY(level, idx));
CREATE TABLE 'docs_docsize'(docid INTEGER PRIMARY KEY, size BLOB);
CREATE TABLE 'docs_stat'(id INTEGER PRIMARY KEY, value BLOB);
sqlite>
3) Is there an easy way to generate all the different words from my
columns?
It can be done, but it is not easy. That is exactly what FTS does for you.
4) Will the FTS handle arabic words properly?
I'm not sure. Does Arabic follow the ICU word-boundary rules? From the Tokenizer documentation:
The ICU tokenizer implementation is very simple. It splits the input
text according to the ICU rules for finding word boundaries and
discards any tokens that consist entirely of white-space. This may be
suitable for some applications in some locales, but not all. If more
complex processing is required, for example to implement stemming or
discard punctuation, this can be done by creating a tokenizer
implementation that uses the ICU tokenizer as part of its
implementation.
I have an SQLite3 database on Android whose data are sentences like "good afternoon" or "have a nice day". I want to add a search box to search through them, and I use something like this:
Cursor cursor = sqliteDB.rawQuery("SELECT id FROM category WHERE sentences LIKE '"+ s.toString().toLowerCase()+ "%' LIMIT 10", null);
But it only shows "good afternoon" as a result if the user starts the search with "g", "go", "goo", etc. How can I retrieve "good afternoon" if the user searches for "a", "af" or "afternoon"?
I mean, I want "good afternoon" to be found when the user searches for something in the middle of the stored text, not only at the beginning.
Thanks!
Just put the percent sign in front of your query string: LIKE '%afternoon%'. However, your approach has two flaws:
1. It is susceptible to SQL injection attacks because you insert unfiltered user input directly into your SQL query string. Use the query parameter syntax instead by rewriting your query as follows:
SELECT id FROM category WHERE sentences LIKE ? LIMIT 10. Pass the user input string (with the percent signs added) as a selection argument to your query method call.
2. It will become dead slow as your database grows, because LIKE queries are not optimized for quick string matching and lookups.
To solve number 2, you should use SQLite's FTS3 extension, which greatly speeds up text-related searches. Instead of LIKE you would use the MATCH operator, which has a different query syntax:
SELECT id FROM category WHERE sentences MATCH 'afternoon' LIMIT 10
As you can see the MATCH operator does not need percent signs. It just tries to find any occurrence of a word in the whole text that is being searched (in your case the sentences column). Read through the documentation of FTS3 I've linked to. The MATCH query syntax provides some more pretty handy and powerful options for finding text in your database table which are pretty similar to early search engine query syntax such as:
MATCH 'afternoon OR evening'
The only (minor) downside to the FTS3 extension is that it blows up the database file size by creating additional search index tables and meta-data. But I think it's well worth it for this use case.
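For the parameterized LIKE variant, here is a small sketch in plain Java of how the selection argument could be built. The class and method names are hypothetical. It goes one step beyond the answer above by also escaping the LIKE wildcards (% and _) in the user's input, which requires adding an ESCAPE clause to the query; that extra step is an assumption about what you want, so drop it if users should be able to type wildcards themselves.

```java
/** Illustrative helper; the class and method names are hypothetical. */
class LikePatterns {
    // Parameterized query: the user input is never concatenated into the SQL text.
    // The ESCAPE clause declares backslash as the escape character for this LIKE.
    static final String SQL =
            "SELECT id FROM category WHERE sentences LIKE ? ESCAPE '\\' LIMIT 10";

    /** Builds a "contains" pattern: %input%, with %, _ and \ in the input escaped. */
    static String containsPattern(String userInput) {
        StringBuilder sb = new StringBuilder("%");
        for (char c : userInput.toCharArray()) {
            if (c == '%' || c == '_' || c == '\\') {
                sb.append('\\'); // make the character match literally
            }
            sb.append(c);
        }
        return sb.append('%').toString();
    }

    public static void main(String[] args) {
        // On Android this would be used roughly as:
        //   sqliteDB.rawQuery(LikePatterns.SQL,
        //           new String[] { LikePatterns.containsPattern(s.toString()) });
        System.out.println(SQL);
        System.out.println(containsPattern("afternoon"));
    }
}
```

Note that SQLite's LIKE is already case-insensitive for ASCII characters by default, so the toLowerCase() call from the question is not needed for plain English text.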
This is more of a question of theory than anything else. I am writing an Android app that uses a pre-packaged database. The purpose of the app is solely to search through this database and return values. I'll provide some abstract examples to illustrate my implementation and my quandary. The user can search by "Thing Name", and what I want returned to the user is the values a, b, and c. I initially designed the database so that everything is contained in a single sheet, with column 1 being key_index, column 2 being name, column 3 being a, and so on. When the user searches, the cursor returns the key_index, which is then used to pull the values a, b and c.
However, in my database "Thing alpha" can have a value a = 4 or a = 6. I do not want to repeat data in the database, i.e. have multiple rows for the same "Thing alpha" that differ only in their "a" values. So what is the best way to organize the data given this situation? Do I keep all the "Thing Names" in a single sheet and all the data separately? This is really a question of proper database design, which is definitely something foreign to me. Thanks for your help!
There's a thing called database normalization: http://en.wikipedia.org/wiki/Database_normalization. You usually want to avoid redundancy and dependency in the DB entities by using a corresponding design with surrogate keys, foreign keys and so on. Your "Thing alpha" looks like a case for a many-to-many table, e.g. one or many songs belonging to the same or different genres. You may want to create dictionary tables to hold your id/name pairs and have foreign keys referencing these tables. In your case it will be a mostly read-only DB, so you might consider creating indexes with a high FILLFACTOR percentage, although I don't think SQLite allows that. There are many ways to design the database; everything depends on the purpose of the DB. You can even start with the design of your hardware, such as RAIDs, file systems and DB block sizes matched to the file system's block sizes to keep the I/O optimal, and where to put your tablespaces/filegroups/indexes to balance the I/O load. DB design theory is a deep subject that should not be underestimated and cannot be covered in a few sentences of a Stack Overflow answer. :)
without understanding your data better here is my guess at what you are looking for.
table: product
- _id
- name
table: attribute
- product_id
- a
We have about 7-8 tables in our Android application, each having about 8 columns on average. Both read and write operations are performed on the database, and I am experimenting to find ways to improve the performance of the data access layer. So far, I have tried the following:
Use positional arguments in WHERE clauses (Reason: so that SQLite makes use of the same execution plan)
Enclose inserts and updates in transactions (Reason: every DB operation is enclosed within a transaction by default. Doing this will remove that overhead)
Indexing: I have not created any explicit index other than those created by default on the primary key and unique key columns. (Reason: indexing will improve seek time)
I have mentioned my assumptions in parentheses; please correct me if I am wrong.
Questions:
Can I add anything else to this list? I read somewhere that avoiding the use of the db-journal can improve the performance of updates. Is this a myth or a fact? How can this be done, if recommended?
Are nested transactions allowed in SQLite3? How do they affect performance?
The thing is, I have a function that runs an update in a loop, so I have enclosed the loop within a transaction block. Sometimes this function is called from another loop inside some other function. The calling function also encloses its loop within a transaction block. How does such nesting of transactions affect performance?
The WHERE clauses of my queries use more than one column to build the predicate. These columns are not necessarily primary key or unique columns. Should I create indices on these columns too? Is it a good idea to create multiple indices for such a table?
Pin down exactly which queries you need to optimize. Grab a copy of a typical database and use the REPL to time queries. Use this to benchmark any gains as you optimize.
Use ANALYZE to allow SQLite's query planner to work more efficiently.
For SELECTs and UPDATEs, indexes can speed things up, but only if the indexes you create can actually be used by the queries that you need to speed up. Use EXPLAIN QUERY PLAN on your queries to see which index would be used or whether the query requires a full table scan. For large tables, a full table scan is bad and you probably want an index. In general, only one index is used per table in any given query. If you have multiple predicates, the index that will be used is the one expected to reduce the result set the most (based on ANALYZE). You can have indexes that contain multiple columns (to assist queries with multiple predicates). Multi-column indexes are usable only if the predicates fit the index from left to right with no gaps (but unused columns at the end are fine). If you use an ordering predicate (<, <=, > etc.), it needs to be on the last used column of the index. WHERE predicates and ORDER BY each need an index and SQLite can only use one, so that can be a point where performance suffers. The more indexes you have, the slower your INSERTs will be, so you will have to work out the best trade-off for your situation.
If you have more complex queries that can't make use of any indexes that you might create, you can de-normalize your schema, structuring your data in such a way that the queries are simpler and can be answered using indexes.
If you are doing a large number of INSERTs, try dropping indexes and recreating them at the end. You will need to benchmark this.
SQLite does support nested transactions using savepoints, but I'm not sure that you'll gain anything there performance-wise.
You can gain lots of speed by compromising on data integrity. If you can recover from database corruption yourself, then this might work for you. You could perhaps only do this when you're doing intensive operations that you can recover from manually.
I'm not sure how much of this you can get to from an Android application. There is a more detailed guide for optimizing SQLite in general in the SQLite documentation.
Here's a bit of code to get EXPLAIN QUERY PLAN results into Android logcat from a running Android app. I'm starting with an SQLiteOpenHelper dbHelper and an SQLiteQueryBuilder qb.
String sql = qb.buildQuery(projection, selection, selectionArgs, groupBy, having, sortOrder, limit);
android.util.Log.d("EXPLAIN", sql + "; " + java.util.Arrays.toString(selectionArgs));
Cursor c = dbHelper.getReadableDatabase().rawQuery("EXPLAIN QUERY PLAN " + sql, selectionArgs);
if (c.moveToFirst()) {
    do {
        // Log every column of the current plan row as "name:value, "
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < c.getColumnCount(); i++) {
            sb.append(c.getColumnName(i)).append(":").append(c.getString(i)).append(", ");
        }
        android.util.Log.d("EXPLAIN", sb.toString());
    } while (c.moveToNext());
}
c.close();
I dropped this into my ContentProvider.query() and now I can see exactly how all the queries are getting performed. (In my case it looks like the problem is too many queries rather than poor use of indexing; but maybe this will help someone else...)
I would add these :
Using rawQuery() instead of building the statement with ContentValues can speed things up in certain cases. Of course, writing raw queries is a little more tedious.
If you have a lot of string/text data, consider creating virtual tables using full-text search (FTS3), which can make queries run faster. You can search on Google for the exact speed improvements.
A minor point to add to Robie's otherwise comprehensive answer: the VFS in SQLite (which is mostly concerned with locking) can be swapped out for alternatives. You may find one of the alternatives like unix-excl or unix-none to be faster but heed the warnings on the SQLite VFS page!
Normalization (of table structures) is also worth considering (if you haven't already) simply because it tends to provide the smallest representation of the data in the database; this is a trade-off, less I/O for more CPU, and one that is usually worthwhile in medium-scale enterprise databases (the sort I'm most familiar with), but I'm afraid I've no idea whether the trade-off works well on small-scale platforms like Android.