SQLite: Efficient substring search in large table - android

I'm developing an Android application that has to perform substring search in a large table (about 500'000 entries with street and location names, so just a few words per entry).
CREATE TABLE Elements (elementID INTEGER, type INTEGER, name TEXT, data BLOB)
Note that only 20% of all entries contain strings in the "name" column.
Performing the following query takes almost 2 minutes:
SELECT elementID, name FROM Elements WHERE name LIKE '%foo%'
I then tried FTS3 to speed up the query. That was quite successful: query time dropped to 1 minute (surprisingly, the database file size increased by only 5%, which is also quite good for my purpose).
The problem is, FTS3 seemingly doesn't support substring search, i.e. if I want to find "bar" in "foo bar" and "foobar", I only get "foo bar", although I need both results.
So actually I have two questions:
Is it possible to further speed up the query? My goal is 30 seconds for the query, but I don't know if that's realistic...
How can I get real substring search using FTS3?

Solution 1:
If you can store every character in your database as an individual word, you can use phrase queries to search for a substring.
For example, assume "my_table" contains a single column "person":
person
------
John Doe
Jane Doe
you can change it to
person
------
J o h n D o e
J a n e D o e
To search for the substring "ohn", use a phrase query:
SELECT * FROM my_table WHERE person MATCH '"o h n"'
Beware that "JohnD" will match "John Doe", which may not be desired.
To fix it, change the space character in the original string into something else.
For example, you can replace the space character with "$":
person
------
J o h n $ D o e
J a n e $ D o e
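A minimal Python sketch of Solution 1, using the standard-library sqlite3 module and assuming its SQLite build includes FTS4. The table names (people, people_fts), the helper names and the placeholder character are made up for illustration. One caveat, as far as I can tell: the default "simple" tokenizer discards ASCII punctuation such as "$", so this sketch substitutes a non-ASCII placeholder ("·"), which that tokenizer keeps as a token.

```python
import sqlite3

# Hypothetical placeholder standing in for the space character. It is
# non-ASCII on purpose: the default "simple" tokenizer drops ASCII punctuation.
PLACEHOLDER = "\u00b7"


def spaced(text):
    """Turn 'John Doe' into 'J o h n · D o e' (one token per character)."""
    return " ".join(PLACEHOLDER if ch == " " else ch for ch in text)


def build_index(names):
    """Create an in-memory database with the names and their spaced-out form."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, person TEXT)")
    conn.execute("CREATE VIRTUAL TABLE people_fts USING fts4(spaced)")
    for name in names:
        cur = conn.execute("INSERT INTO people (person) VALUES (?)", (name,))
        conn.execute(
            "INSERT INTO people_fts (docid, spaced) VALUES (?, ?)",
            (cur.lastrowid, spaced(name)),
        )
    return conn


def substring_search(conn, needle):
    """Return the names containing needle, found via an FTS phrase query."""
    if not needle:
        return []
    # Double quotes would end the phrase early, so they are dropped here.
    phrase = '"' + spaced(needle.replace('"', "")) + '"'
    rows = conn.execute(
        "SELECT person FROM people WHERE id IN "
        "(SELECT docid FROM people_fts WHERE spaced MATCH ?) ORDER BY id",
        (phrase,),
    )
    return [row[0] for row in rows]


conn = build_index(["John Doe", "Jane Doe"])
print(substring_search(conn, "ohn"))    # names containing "ohn"
print(substring_search(conn, "JohnD"))  # the placeholder keeps this from matching
```

Keeping the original names in a regular table and only the spaced-out form in the FTS table is a design choice of this sketch, not part of the answer above.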
Solution 2:
Following the idea of solution 1, you can make every character an individual word with a custom tokenizer and use phrase queries to search for substrings.
The advantage over solution 1 is that you don't have to add spaces to your data, which can unnecessarily increase the size of the database.
The disadvantage is that you have to implement the custom tokenizer. Fortunately, I have one ready for you. The code is in C, so you have to figure out how to integrate it with your Java code.

You could add an index to the name column, but at best SQLite can use it for prefix patterns such as LIKE 'foo%', not for LIKE '%foo%'.
FTS3 supports prefix matching (not arbitrary substring matching) like so:
SELECT * FROM Elements WHERE name MATCH 'foo*';
http://www.sqlite.org/fts3.html#section_3

I am facing something similar to your problem. Here is my suggestion: try creating a translation table that maps every word to a number, then search for the numbers instead of the words.
Please let me know if this is helping.

Not sure about speeding it up since you're using SQLite, but for substring searches I have done things like
SET @foo_bar = 'foo bar'
SELECT * FROM table WHERE name LIKE '%' + REPLACE(@foo_bar, ' ', '%') + '%'
Of course, this only returns records that have the word "foo" before the word "bar".

Related

Android sqlite fts - using Offsets function with exact phrase search

I have a book-reader type app, and I am using SQLite to store texts and provide a search function that highlights the returned search results. My problem is that when, for example, I have the following text:
"Assuming N is a positive value, if no fragments can be found that
contain a phrase match corresponding to each matchable phrase, the
snippet function attempts to find two fragments of approximately N/2
tokens that between them contain at least one phrase match for each
matchable phrase matched by the current row. "
and I am searching for exact phrase "snippet function attempts", then I expect to get 1 search result, but I get 3 -> first is 'snippet', second is 'function' and third is 'attempts'.
My sqlite query is following:
'SELECT col1,col2, col3, offsets(index_table) FROM index_table WHERE col3 MATCH "snippet function attempts" '
How can I tell the offsets() function to return the offset for the whole phrase I am searching for rather than for the individual parts of the phrase?
Try using:
SELECT col1, col2, col3, offsets(index_table) FROM index_table WHERE col3 MATCH '"snippet function attempts"'
i.e. keep the double quotes around the phrase, but place them inside a single-quoted SQL string so that they reach FTS as part of the query. Without them, FTS receives three separate terms.
Note that offsets() still reports one entry per term of the phrase, but only for occurrences that are part of a phrase match.
Enclosing the phrase in double quotes tells FTS to treat it as a
phrase, as per Phrase queries: A phrase query is a query that
retrieves all documents that contain a nominated set of terms or term
prefixes in a specified order with no intervening tokens. Phrase
queries are specified by enclosing a space separated sequence of terms
or term prefixes in double quotes ("). SQLite FTS3 and FTS4 Extensions
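To highlight a phrase as one unit, the per-term entries returned by offsets() can be merged into a single span. The following is a rough Python sketch, assuming the standard-library sqlite3 module was built with FTS4. The function name phrase_spans and the sample row are invented for illustration, and the sketch does not handle overlapping phrase matches.

```python
import sqlite3


def phrase_spans(conn, phrase):
    """Return (docid, start, end) byte spans, one per whole-phrase match in col3."""
    # Double quotes inside the FTS query string mark a phrase query.
    query = '"' + phrase.replace('"', "") + '"'
    rows = conn.execute(
        "SELECT docid, offsets(index_table) FROM index_table WHERE col3 MATCH ?",
        (query,),
    ).fetchall()
    spans = []
    for docid, raw in rows:
        nums = [int(n) for n in raw.split()]
        # offsets() yields four integers per hit: column, term, byte offset, size.
        hits = sorted(
            (nums[i + 2], nums[i + 3], nums[i + 1]) for i in range(0, len(nums), 4)
        )
        last_term = max(term for _, _, term in hits)
        start = None
        for offset, size, term in hits:
            if term == 0:
                start = offset  # first term of the phrase opens a span
            if term == last_term and start is not None:
                spans.append((docid, start, offset + size))  # last term closes it
                start = None
    return spans


conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE index_table USING fts4(col1, col2, col3)")
text = "the snippet function attempts to find two fragments"
conn.execute(
    "INSERT INTO index_table (col1, col2, col3) VALUES ('a', 'b', ?)", (text,)
)

for docid, start, end in phrase_spans(conn, "snippet function attempts"):
    # The sample text is ASCII, so byte offsets equal character offsets here.
    print(docid, text[start:end])
```

For non-ASCII text the byte offsets would have to be applied to the UTF-8 encoded bytes rather than to the string.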

How can I select the last index of a column split with bigquery

There are a lot of questions about splitting a BigQuery, MySQL column, but I can't find one that fits my situation.
I am processing a large dataset (3rd party) that includes a freeform location field to normalize it for my Android app. When I run a select I'd like to split the column data by commas, take only the last segment and trim it of whitespace.
So far I've come up with the following by Googling documentation:
SELECT RTRIM(LOWER(SPLIT(location, ',')[OFFSET(-1)])) FROM `users` WHERE location <> ''
But the -1 trick to split at last element does not work (with either offset or ordinal). I can't use ARRAY_LENGTH with the same array inline and I'm not exactly sure how to structure a nested query and know the last column index of the row.
I might be approaching this from the wrong angle; I work with Android and NoSQL now, so I haven't used MySQL in a long time.
How do I structure this query correctly?
I'd like to split the column data by commas, take only the last segment ...
You can use the approach below (BigQuery Standard SQL):
SELECT ARRAY_REVERSE(SPLIT(location))[SAFE_OFFSET(0)]
Below is an example illustrating it:
#standardSQL
WITH `project.dataset.table` AS (
SELECT '1,2,3,4,5' location UNION ALL
SELECT '6,7,8'
)
SELECT location, ARRAY_REVERSE(SPLIT(location))[SAFE_OFFSET(0)] last_segment
FROM `project.dataset.table`
with result
Row  location   last_segment
1    1,2,3,4,5  5
2    6,7,8      8
For trimming - you can use LTRIM(RTRIM()) - like in
SELECT LTRIM(RTRIM(ARRAY_REVERSE(SPLIT(location))[SAFE_OFFSET(0)]))
To get the last part of the split string, I use the len(string) - len(replace(string,delimiter,'')) trick to count the number of delimiters:
split(<string>,'-')[OFFSET(length(<string>)-length(replace(<string>,'-','')))]
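The delimiter-counting logic can be checked locally with a small Python mirror of the expression. This is only an illustration of the arithmetic, not BigQuery code. The function name last_segment is invented, and the counting trick as written assumes a single-character delimiter.

```python
def last_segment(location, delimiter=","):
    """Return the trimmed, lower-cased text after the last delimiter."""
    parts = location.split(delimiter)
    # Removing every delimiter shortens the string by the number of
    # delimiters, which is also the index of the last split element.
    last_index = len(location) - len(location.replace(delimiter, ""))
    return parts[last_index].strip().lower()


print(last_segment("1,2,3,4,5"))
print(last_segment("Berlin, Germany "))
```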

Natural sorting of alphanumeric values in sqlite using android

I have a list of names that start with characters and end with numbers, like:
ka1, ka10, ka 2, ka, sa2, sa1, sa10, p1a10, 1kb, p1a2, p1a11, p1a.
I want to sort it in natural order, that is:
1kb, ka, ka1, ka 2, ka10, p1a, p1a2, p1a10, p1a11, sa1, sa2, sa10.
The main problem I see here is that there is no delimiter between the text and the numeric part, and the numeric part may also be missing entirely.
I am using SQLite on Android. I could sort in Java after fetching the rows by caching the cursor data, but I am using (and it is recommended to use) a cursor adapter.
Please suggest a query for sorting or is there any way to apply sorting in cursor?
I tried the query below for natural sorting:
SELECT
item_no
FROM
items
ORDER BY
LENGTH(item_no), item_no;
It worked for me in an SQLite database too. Please see this link for more details.
I can propose using a regex replacement that pads the numbers with zeros, creating a temporary table of the original and corresponding padded values, and then following this link to sort it: http://www.saltycrane.com/blog/2007/12/how-to-sort-table-by-columns-in-python/
Tip for the regex: add as many zeros after the last letter as needed, but cap the total number of digits at the predicted maximum number of digits. If you need help with the regex as well, provide exact info on valid and invalid values, so I can help with that too.
PS: if you want to be sure that the zeros go before the last digits, search for the character from the end.
Updated
You can use different ways; some of them are mentioned below:
BIN Way
SELECT
tbl_column,
BIN(tbl_column) AS binary_not_needed_column
FROM db_table
ORDER BY binary_not_needed_column ASC , tbl_column ASC
Cast Way
SELECT
tbl_column,
CAST(tbl_column as SIGNED) AS casted_column
FROM db_table
ORDER BY casted_column ASC , tbl_column ASC
or try the solution:
There are a whole lot of solutions out there if you hit up Google, and
you can, of course, just use the natsort() function in PHP, but it's
simple enough to accomplish natural sorting in MySQL: sort by length
first, then the column value.
Query: SELECT alphanumeric, integer FROM sorting_test ORDER BY LENGTH(alphanumeric), alphanumeric
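Sorting by LENGTH first only gives a natural order when the text prefixes have the same length, which is not the case for the names in the question. Where the SQLite binding allows registering a custom collation, the comparison can be done in application code instead. Here is a hedged Python sketch with the standard-library sqlite3 module. The collation name natsort and the helper names are invented, and ignoring whitespace (so that "ka 2" sorts like "ka2") is an assumption based on the expected order given in the question.

```python
import re
import sqlite3


def natural_key(value):
    """Split 'p1a10' into [(1, 'p'), (0, 1), (1, 'a'), (0, 10)]."""
    compact = "".join(value.split())  # ignore whitespace: 'ka 2' acts like 'ka2'
    return [
        (0, int(chunk)) if "0" <= chunk[0] <= "9" else (1, chunk.lower())
        for chunk in re.findall(r"[0-9]+|[^0-9]+", compact)
    ]


def natural_compare(a, b):
    """Collation callback: negative, zero or positive, like strcmp."""
    ka, kb = natural_key(a), natural_key(b)
    return (ka > kb) - (ka < kb)


conn = sqlite3.connect(":memory:")
conn.create_collation("natsort", natural_compare)
conn.execute("CREATE TABLE items (item_no TEXT)")
names = ["ka1", "ka10", "ka 2", "ka", "sa2", "sa1", "sa10",
         "p1a10", "1kb", "p1a2", "p1a11", "p1a"]
conn.executemany("INSERT INTO items VALUES (?)", [(n,) for n in names])

ordered = [
    row[0]
    for row in conn.execute("SELECT item_no FROM items ORDER BY item_no COLLATE natsort")
]
print(ordered)
```

Tagging numeric chunks with 0 and text chunks with 1 makes numbers sort before letters and avoids comparing integers with strings.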

Searching Sqlite Database using Exact Keyword

I'm trying to search a SQLite database with this condition: I want to find a string using an exact keyword. Let me explain.
I have 3 rows as follows:
catching cold
i have a cat
two cats was seen in your house yesterday
I want to search these rows with the keyword "cat" and I expect this result:
i have a cat
I am using this SQL code so far:
Select * FROM MyTable WHERE Mycolumn Like '%cat%'
But the returned result is all 3 rows:
catching cold
i have a cat
two cats was seen in your house yesterday
What can I do to get my expected result?
Thank you in advance.
The % character in the argument of a LIKE clause matches any string, including the empty string. Unfortunately, SQLite doesn't have the REGEXP function built in (and Android's SQLite doesn't have it).
What you can do instead is use FTS (full text search). How to do so is described here: https://www.sqlite.org/fts3.html#section_1_2
Using your example, you would set it up like so:
create virtual table textsearch using fts4(content);
insert into textsearch (content) values ('catching cold'), ('i have a cat'), ('two cats was seen in your house yesterday');
Then you can do a simple text query with the MATCH operator:
select * from textsearch where content match 'cat';
If you try the above in a sqlite3 shell, you'll see it returns only 'i have a cat'. There's a lot more you can do with the match operator, explained on the page I linked above.
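You can use a regular expression with a special pattern for word boundaries, provided a REGEXP function is available in your SQLite build:
Select * FROM MyTable WHERE Mycolumn REGEXP '\bcat\b'
Note that a plain equality test (Mycolumn = 'cat') only matches rows whose entire value is 'cat'.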
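Where the SQLite binding lets you register user functions, the word-boundary idea can be tried out as follows. This is a sketch with Python's standard-library sqlite3 module, not something that works on stock Android. The helper names regexp and find_word are invented for illustration, and the match is case-sensitive.

```python
import re
import sqlite3


def regexp(pattern, value):
    """Backs the SQL expression `value REGEXP pattern`."""
    if value is None:
        return 0
    return 1 if re.search(pattern, value) else 0


conn = sqlite3.connect(":memory:")
# SQLite evaluates "X REGEXP Y" by calling regexp(Y, X).
conn.create_function("REGEXP", 2, regexp)
conn.execute("CREATE TABLE MyTable (Mycolumn TEXT)")
conn.executemany(
    "INSERT INTO MyTable VALUES (?)",
    [("catching cold",), ("i have a cat",),
     ("two cats was seen in your house yesterday",)],
)


def find_word(conn, word):
    """Return rows containing word as a whole word."""
    pattern = r"\b" + re.escape(word) + r"\b"
    rows = conn.execute(
        "SELECT Mycolumn FROM MyTable WHERE Mycolumn REGEXP ?", (pattern,)
    )
    return [row[0] for row in rows]


print(find_word(conn, "cat"))
```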

Search data from sqlite3 database in android

I have a SQLite3 database on Android whose data are sentences like "good afternoon" or "have a nice day". Now I want a search box to search among them, and I use something like this:
Cursor cursor = sqliteDB.rawQuery("SELECT id FROM category WHERE sentences LIKE '"+ s.toString().toLowerCase()+ "%' LIMIT 10", null);
But it only shows "good afternoon" as a result if the user starts searching with "g", "go", "goo", etc. How can I retrieve "good afternoon" if the user searches for "a", "af" or "afternoon"?
I mean I want to show the "good afternoon" result if the user searches from the middle of a value in the SQLite3 database, not only from the beginning.
Thanks!
Just put the percent sign in front of your query string: LIKE '%afternoon%'. However, your approach has two flaws:
1. It is susceptible to SQL injection attacks because you insert unfiltered user input directly into your SQL query string. Use the query parameter syntax instead by rewriting your query as follows:
SELECT id FROM category WHERE sentences LIKE ? LIMIT 10
and pass the user input string (wrapped in percent signs) as the selection argument to your query method call.
2. It will get slower the bigger your database grows, because LIKE queries are not optimized for quick string matching and lookups.
In order to solve number 2 you should use SQLite's FTS3 extension which greatly speeds up any text-related searches. Instead of LIKE you would be using the MATCH operator that uses a different query syntax:
SELECT id FROM category WHERE sentences MATCH 'afternoon' LIMIT 10
As you can see the MATCH operator does not need percent signs. It just tries to find any occurrence of a word in the whole text that is being searched (in your case the sentences column). Read through the documentation of FTS3 I've linked to. The MATCH query syntax provides some more pretty handy and powerful options for finding text in your database table which are pretty similar to early search engine query syntax such as:
MATCH 'afternoon OR evening'
The only (minor) downside to the FTS3 extension is that it blows up the database file size by creating additional search index tables and meta-data. But I think it's well worth it for this use case.
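For the parameterized LIKE variant, the wildcards go into the bound argument rather than into the SQL text. The sketch below uses Python's standard-library sqlite3 module to show the idea. On Android the same pattern string would be passed as a selection argument. The helper names escape_like and search and the third sample row are invented for illustration. Escaping % and _ in the user input is an extra step not mentioned in the answer.

```python
import sqlite3


def escape_like(text):
    """Escape LIKE wildcards so user input is matched literally."""
    return text.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")


def search(conn, user_input, limit=10):
    """Return sentences containing user_input anywhere in the text."""
    pattern = "%" + escape_like(user_input) + "%"
    rows = conn.execute(
        # The user input is bound as a parameter, never spliced into the SQL.
        "SELECT sentences FROM category WHERE sentences LIKE ? ESCAPE '\\' "
        "ORDER BY id LIMIT ?",
        (pattern, limit),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE category (id INTEGER PRIMARY KEY, sentences TEXT)")
conn.executemany(
    "INSERT INTO category (sentences) VALUES (?)",
    [("good afternoon",), ("have a nice day",), ("100% sure",)],
)

print(search(conn, "af"))
print(search(conn, "x' OR '1'='1"))  # treated as plain text, not as SQL
```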
