Unable to retrieve special characters from SQLite FTS3 on Android

I am having trouble with special characters.
I have an SQLite database whose table was created with FTS3.
When I run
SELECT col_1, col_2, offsets(table) FROM table WHERE table MATCH 'h*' LIMIT 50;
I get the words that start with h.
But when I run
SELECT col_1, col_2, offsets(table) FROM table WHERE table MATCH '#*' LIMIT 50;
I do not get the strings that start with #.
Where am I going wrong? Any pointers on the approach would be great.

I think the behavior you describe happens because SQLite FTS3 uses the tokenizer called "simple" by default. The character # is discarded because it is not alphanumeric and its Unicode code point is not greater than 127, so it never becomes part of an indexed term. My interpretation is that FTS is meant for searching natural-language text, not special characters.
The fix I suggest is not to use FTS for this kind of query and to use the LIKE operator instead. Alternatively, you could look for another available tokenizer or write your own in C.
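To illustrate the LIKE suggestion, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for Android's SQLite API (the SQL itself carries over). The table name notes, the sample rows, and the helper starts_with are assumptions made up for the example; only the column names col_1 and col_2 come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical ordinary (non-FTS) table with the question's column names.
conn.execute("CREATE TABLE notes (col_1 TEXT, col_2 TEXT)")
conn.executemany(
    "INSERT INTO notes VALUES (?, ?)",
    [("#android", "tag"), ("hello", "word"), ("#sqlite", "tag"), ("a #mid", "word")],
)


def starts_with(prefix, limit=50):
    """Return col_1 values that begin with the literal prefix."""
    # Escape the LIKE wildcards so characters such as % and _ match literally.
    escaped = prefix.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    rows = conn.execute(
        "SELECT col_1 FROM notes WHERE col_1 LIKE ? ESCAPE '\\' "
        "ORDER BY col_1 LIMIT ?",
        (escaped + "%", limit),
    )
    return [row[0] for row in rows]


result = starts_with("#")
print(result)
```

The # character has no special meaning in a LIKE pattern, so it needs no escaping; the escaping step only matters if the prefix can contain % or _.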

Related

Natural sorting of alphanumeric values in sqlite using android

I have a list of names that start with letters and end with numbers, for example:
ka1, ka10, ka 2, ka, sa2, sa1, sa10, p1a10, 1kb, p1a2, p1a11, p1a.
I want to sort it in natural order, that is:
1kb, ka, ka1, ka 2, ka10, p1a, p1a2, p1a10, p1a11, sa1, sa2, sa10.
The main problem is that there is no delimiter between the text and the numeric part, and the numeric part may be missing altogether.
I am using SQLite on Android. I could sort in Java after fetching the points and caching the cursor data, but I am using (and it is recommended to use) a cursor adapter.
Please suggest a query for sorting, or is there any way to apply sorting to a cursor?
I tried the query below for natural sorting:
SELECT item_no
FROM items
ORDER BY LENGTH(item_no), item_no;
It worked for me in an SQLite database too. Please see this link for more details.
I can propose using a regex replacement that pads the numbers with zeros, creating a temporary table of the original values and the corresponding padded values, and then following this link to sort it: http://www.saltycrane.com/blog/2007/12/how-to-sort-table-by-columns-in-python/
Tip for the regex: add as many zeros after the last letter as needed, but limit the total number of digits to the predicted maximum. If you need help with the regex as well, provide exact examples of valid and invalid values so I can help with that too.
PS: if you want to be sure that the zeros go right before the trailing digits, search for the letter from the end of the string.
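The zero-padding idea can be sketched as follows, again with Python's sqlite3 module standing in for Android. The sort_key column, the natural_key helper, and the pad width of 10 digits are assumptions for illustration; on Android the key would be computed in Java when rows are inserted, so that the cursor adapter can still rely on a plain ORDER BY. Unlike ORDER BY LENGTH(item_no), item_no, which puts every 3-character value such as sa1 ahead of ka10, the padded key handles mixed prefixes.

```python
import re
import sqlite3

PAD = 10  # assumed upper bound on the number of digits in any numeric run


def natural_key(text):
    """Build a string that sorts naturally under plain byte-wise comparison."""
    # Drop whitespace so "ka 2" is keyed like "ka2".
    compact = re.sub(r"\s+", "", text)
    # Left-pad every run of digits with zeros to a fixed width.
    return re.sub(r"\d+", lambda m: m.group().zfill(PAD), compact)


names = ["ka1", "ka10", "ka 2", "ka", "sa2", "sa1", "sa10",
         "p1a10", "1kb", "p1a2", "p1a11", "p1a"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (item_no TEXT, sort_key TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(name, natural_key(name)) for name in names],
)

# The query itself stays trivial, which suits a cursor adapter.
ordered = [row[0] for row in conn.execute(
    "SELECT item_no FROM items ORDER BY sort_key")]
print(ordered)
```

The trade-off is one extra column per row; in exchange no custom collation or post-processing of the cursor is needed.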
Updated
You can do this in different ways. Some of them are mentioned below. (Note that these queries use MySQL syntax; SQLite has no built-in BIN() function.)
BIN way
SELECT
tbl_column,
BIN(tbl_column) AS binary_not_needed_column
FROM db_table
ORDER BY binary_not_needed_column ASC, tbl_column ASC
Cast Way
SELECT
tbl_column,
CAST(tbl_column as SIGNED) AS casted_column
FROM db_table
ORDER BY casted_column ASC , tbl_column ASC
Or try this solution:
There are a whole lot of solutions out there if you hit up Google, and
you can, of course, just use the natsort() function in PHP, but it's
simple enough to accomplish natural sorting in MySQL: sort by length
first, then the column value.
Query: SELECT alphanumeric, integer FROM sorting_test ORDER BY LENGTH(alphanumeric), alphanumeric
(quoted from a linked source)

Declaring SQLite table for Android word game with 700 000 words

For an Android word game (minSdkLevel=9, which means SQLite version 3.6.22), I would like to deliver the dictionary as a prefilled SQLite table within the APK file (with the help of SQLiteAssetHelper).
In the SQLite database there will be just 1 table:
create table dict ( /* contains 700 000 unique words */
word text not null
);
My question:
How should I declare the table for the best performance, and which kind of SQL query should I use?
(Checking whether a word entered by the player is present in the dict table will be the main usage of the SQLite database in the app.)
Should I create an index (is it possible to have an index on text columns at all)?
Or should I declare the word column as the primary key?
Also, some SQLite for Android guides suggest having an _id column in each table (probably to enable fetching the last inserted record, which I don't really need here). Should I maybe use
create table dict (
_id integer primary key,
word text unique not null
);
create index word_index on dict(word);
or will that be a waste of 4 x 700 000 bytes? (Or is it added as _rowid_ anyway?)
Quick answer: yes, you can create an index on a text column.
However, for best performance this may not be the best option.
An index created by SQLite is a B-tree (a balanced multi-way tree, not a binary tree), which reduces a search to a logarithmic number of steps: with 700k words, a lookup needs roughly 20 key comparisons (log2 of 700,000 is about 19.4). That could well be fast enough; you need to test it to know the actual performance.
An alternative would be to create multiple tables (buckets), e.g. wordA, wordB, wordC and so on,
and use the first character to determine which table a word goes into.
This reduces each table to about 27k records on average (the buckets will not be of equal size, of course).
Doing so reduces the number of steps needed for each search.
Better still, use a hash function to determine the bucket: that makes the bucket sizes more balanced and lets you freely choose the number of buckets.
You would have to experiment to find the optimal bucket size.
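Before trying buckets, it may be worth measuring the plain indexed lookup. Below is a minimal sketch with Python's sqlite3 module standing in for Android; the three sample words and the is_word helper are made up for the example. Declaring word as PRIMARY KEY (or UNIQUE) makes SQLite create the index automatically, so a separate CREATE INDEX on the same column would be redundant.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A TEXT PRIMARY KEY gets an automatic unique index.
conn.execute("CREATE TABLE dict (word TEXT PRIMARY KEY NOT NULL)")
conn.executemany(
    "INSERT INTO dict VALUES (?)",
    [("apple",), ("banana",), ("cherry",)],  # stand-in for the 700 000 words
)


def is_word(candidate):
    """Return True if the candidate is present in the dictionary table."""
    row = conn.execute(
        "SELECT 1 FROM dict WHERE word = ? LIMIT 1", (candidate,)
    ).fetchone()
    return row is not None


# Ask SQLite how it would run the lookup; the last column is the description.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT 1 FROM dict WHERE word = ?", ("apple",)
).fetchall()
uses_index = any("INDEX" in str(step[-1]).upper() for step in plan)

print(is_word("banana"), is_word("durian"), uses_index)
```

If uses_index is true, the lookup is an index search rather than a table scan, and timing it against the full word list will show whether anything more elaborate is needed.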

Android Sqlite FTS3 how to select words that starts with?

For example, if I have these records:
word
AAA
AAB
AAC
BAA AA
With a normal table I would use SQL like
select * from table where word like 'AA%' order by H collate nocase asc
How do I select with an FTS3 table instead?
Also, I would like to know whether FTS3 will still perform better than a normal table with this kind of query.
How do i select with FTS3 table instead?
Quoting the documentation:
An FTS table may be queried for all documents that contain a specified term (the simple case described above), or for all documents that contain a term with a specified prefix. As we have seen, the query expression for a specific term is simply the term itself. The query expression used to search for a term prefix is the prefix itself with a '*' character appended to it.
The documentation also gives a sample:
-- Query for all documents containing a term with the prefix "lin". This will match
-- all documents that contain "linux", but also those that contain terms "linear",
--"linker", "linguistic" and so on.
SELECT * FROM docs WHERE docs MATCH 'lin*';
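One difference from LIKE 'AA%' is worth seeing on the sample records from the question: a prefix query matches a prefix of any term in the row, not only the start of the column. The sketch below uses Python's sqlite3 module and assumes the underlying SQLite library was built with FTS3 enabled; the table name words is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Requires an SQLite build with FTS3 enabled (an assumption about the runtime).
conn.execute("CREATE VIRTUAL TABLE words USING fts3(word)")
conn.executemany(
    "INSERT INTO words (word) VALUES (?)",
    [("AAA",), ("AAB",), ("AAC",), ("BAA AA",)],
)

# Prefix query: matches rows containing any term that starts with "aa".
fts_hits = sorted(row[0] for row in conn.execute(
    "SELECT word FROM words WHERE words MATCH 'AA*'"))

# LIKE: matches only rows whose whole value starts with "AA".
like_hits = sorted(row[0] for row in conn.execute(
    "SELECT word FROM words WHERE word LIKE 'AA%'"))

print(fts_hits)
print(like_hits)
```

Here the row "BAA AA" is returned by MATCH because its second term is "AA", while LIKE 'AA%' skips it.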

What is the advantage of FTS over custom solution?

I have a biggish database (~32 MB) which has lots of text in 4 languages, including Arabic and Urdu. I need to search this text in the most efficient way (speed and size).
I am considering FTS and trying to find out how to implement it. Right now I am reading http://www.sqlite.org/fts3.html#section_1_2 about it.
It seems to me that an FTS table is just like a normal table used to index all the different words. So my questions are:
1) If to populate FTS I have to do all the inserts myself, then why not make my own indexed word table, what is the difference?
Answer: Yes, there are many advantages and many built-in functions that help, for example with ranking and searching for stems, and the transparent way it all works on Android makes the FTS approach more appealing.
2) On the google docs I read its a virtual in memory table, now this would be massive right... but it doesnt mention this on the SQLite website. So which is it?
3) Is there an easy way to generate all the different words from my columns?
4) Will the FTS handle arabic words properly?
FTS allows fast searching for words inside a text; normal indexes only allow searching for entire values or for the beginning of a value.
If your table has only one word in each field, using FTS does not make sense.
FTS is a virtual table, but not an in-memory table.
You can get individual terms from the full-text index with the fts4aux table.
The default tokenizer works only with ASCII text.
You have to test whether the ICU or UNICODE61 tokenizers work with your data.
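As a rough illustration of the fts4aux remark, the sketch below lists the distinct terms of a small FTS4 table using Python's sqlite3 module. It assumes an SQLite build with FTS4 enabled; the table names docs and docs_terms and the sample sentences are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Requires an SQLite build with FTS4 enabled (an assumption about the runtime).
conn.execute("CREATE VIRTUAL TABLE docs USING fts4(body)")
conn.executemany(
    "INSERT INTO docs (body) VALUES (?)",
    [("good afternoon",), ("have a nice day",), ("good day",)],
)

# fts4aux exposes the terms stored in the full-text index of docs.
conn.execute("CREATE VIRTUAL TABLE docs_terms USING fts4aux(docs)")

# Rows with col = '*' aggregate over all columns of the FTS table.
terms = conn.execute(
    "SELECT term, documents FROM docs_terms WHERE col = '*' ORDER BY term"
).fetchall()
print(terms)
```

Each row gives a term and the number of documents that contain it, which answers the "generate all the different words" question without parsing the text yourself.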
1) If to populate FTS I have to do all the inserts myself, then why
not make my own indexed word table, what is the difference?
With your own indexed word table, you would have to parse the sentences into words yourself. You would then need one table for sentences and another for words, and you would have to do all of this efficiently.
2) On the google docs I read its a virtual in memory table, now this
would be massive right... but it doesnt mention this on the SQLite
website. So which is it?
I don't understand your question. The data is accessed through the virtual table extension, but the backing storage lives in the database itself (FTS4 creates 5 tables for each virtual table). Check this:
sqlite> CREATE VIRTUAL TABLE docs USING fts4();
sqlite> .schema
CREATE VIRTUAL TABLE docs USING fts4();
CREATE TABLE 'docs_content'(docid INTEGER PRIMARY KEY, 'content');
CREATE TABLE 'docs_segments'(blockid INTEGER PRIMARY KEY, block BLOB);
CREATE TABLE 'docs_segdir'(level INTEGER,idx INTEGER,start_block INTEGER,leaves_end_block INTEGER,end_block INTEGER,root BLOB,PRIMARY KEY(level, idx));
CREATE TABLE 'docs_docsize'(docid INTEGER PRIMARY KEY, size BLOB);
CREATE TABLE 'docs_stat'(id INTEGER PRIMARY KEY, value BLOB);
sqlite>
3) Is there an easy way to generate all the different words from my
columns?
For sure. But that's not easy. That's what FTS does.
4) Will the FTS handle arabic words properly?
I'm not sure. Does Arabic follow the ICU word-boundary rules? From the tokenizer documentation:
The ICU tokenizer implementation is very simple. It splits the input
text according to the ICU rules for finding word boundaries and
discards any tokens that consist entirely of white-space. This may be
suitable for some applications in some locales, but not all. If more
complex processing is required, for example to implement stemming or
discard punctuation, this can be done by creating a tokenizer
implementation that uses the ICU tokenizer as part of its
implementation.

Search data from sqlite3 database in android

I have an SQLite3 database on Android whose data are sentences like "good afternoon" or "have a nice day". I want a search box to search among them, and I use something like this:
Cursor cursor = sqliteDB.rawQuery("SELECT id FROM category WHERE sentences LIKE '"+ s.toString().toLowerCase()+ "%' LIMIT 10", null);
But it only shows "good afternoon" as a result if the user starts searching with "g", "go", "goo" and so on. How can I retrieve "good afternoon" if the user searches for "a", "af" or "afternoon"?
I mean I want to show the "good afternoon" result when the user searches from the middle of a value in the SQLite3 database, not only from the beginning.
Thanks!
Just put a percent sign in front of your query string as well: LIKE '%afternoon%'. However, your approach has two flaws:
1) It is susceptible to SQL injection attacks because you insert unfiltered user input directly into your SQL query string. Use the query parameter syntax instead by rewriting your query as follows:
SELECT id FROM category WHERE sentences LIKE ? LIMIT 10
and pass the user input string (with the percent signs added) as a selection argument to your query method call.
2) It will become very slow as your database grows, because a LIKE pattern that starts with % cannot use an index, so every row has to be scanned.
To solve number 2 you should use SQLite's FTS3 extension, which greatly speeds up text searches. Instead of LIKE you would use the MATCH operator, which has a different query syntax:
SELECT id FROM category WHERE sentences MATCH 'afternoon' LIMIT 10
As you can see, the MATCH operator does not need percent signs. It finds any occurrence of the word in the text being searched (in your case the sentences column). Note that it matches whole words; to match the beginning of a word, append an asterisk, as in MATCH 'af*'. Read through the FTS3 documentation I've linked to. The MATCH query syntax provides some more handy and powerful options for finding text in your database table, similar to early search engine query syntax, such as:
MATCH 'afternoon OR evening'
The only (minor) downside to the FTS3 extension is that it blows up the database file size by creating additional search index tables and meta-data. But I think it's well worth it for this use case.
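The parameterized form of the LIKE query could look roughly like this. The sketch uses Python's sqlite3 module as a stand-in; on Android the same SQL would be passed to rawQuery together with a String[] of selection arguments. The search helper and the sample rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE category (id INTEGER PRIMARY KEY, sentences TEXT)")
conn.executemany(
    "INSERT INTO category (sentences) VALUES (?)",
    [("good afternoon",), ("have a nice day",)],
)


def search(user_input, limit=10):
    """Return ids of rows whose sentence contains the user input anywhere."""
    # The wildcards are added to the bound value, never to the SQL text,
    # so the input cannot change the structure of the statement.
    pattern = "%" + user_input.lower() + "%"
    rows = conn.execute(
        "SELECT id FROM category WHERE sentences LIKE ? LIMIT ?",
        (pattern, limit),
    )
    return [row[0] for row in rows]


print(search("af"))
print(search("x' OR '1'='1"))
```

The second call shows that a quote in the input is treated as ordinary text and simply matches nothing.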
