Sqlite fts4 search html encoded character - android

I am coding a dictionary project. It can translate English to Arabic or Arabic to English. The words are stored in an SQLite FTS4 database.
Arabic letters are stored in the database HTML-encoded, like
&#1594;&#1610;&#1585;
When I use the FTS4 query syntax for English to Arabic, for example => stor
SELECT * FROM fts_dic WHERE english MATCH '"^stor*"';
The returned results are what I expect, like
store
stored
storage
But when I search Arabic to English for => &#1594;&#1610;&#1585;
SELECT * FROM fts_dic WHERE english MATCH '"^&#1594;&#1610;&#1585;*"';
The returned results are
ظغير׾
֎׾غيرظ
But I want to see only results that start with my searched HTML-encoded text, like
غيرخ
غيرٗ
As you can see, I use "^" at the beginning of the word to get this result. English to Arabic works fine, but Arabic to English does not work properly.

The FTS documentation says:
A term is a contiguous sequence of eligible characters, where eligible characters are all alphanumeric characters and all characters with Unicode codepoint values greater than or equal to 128. All other characters are discarded when splitting a document into terms. Their only contribution is to separate adjacent terms.
In other words, punctuation characters like &#; are completely ignored; what FTS sees are the three words 1594, 1610, and 1585.
In the FTS table, you should not HTML-encode anything; just use the plain Unicode characters.
Furthermore, ^ works only in FTS4 tables (which may not be available in all Android versions).
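If the source data already arrives HTML-encoded, one option is to decode it before inserting it into the FTS table. The following is only a minimal sketch in plain Java; the class and method names are made up for illustration, and it assumes the data contains only decimal references of the form &#NNNN; (named or hexadecimal entities are not handled):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Android or SQLite.
final class EntityDecoder {
    // Matches decimal numeric character references such as &#1594;
    private static final Pattern DECIMAL_REF = Pattern.compile("&#(\\d{1,7});");

    /** Replaces each decimal reference with the character it stands for. */
    static String decodeDecimalRefs(String encoded) {
        Matcher m = DECIMAL_REF.matcher(encoded);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group(); // leave references to invalid code points untouched
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The decoded string is what would go into the FTS column, and the search term would be passed as plain Unicode text as well.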

Related

What's the difference between Android's Html.escapeHtml and TextUtils.htmlEncode ? When should I use one or the other?

Android has two different ways to escape / encode HTML characters / entities in Strings:
Html.escapeHtml(String), added in API 16 (Android 4.1). The docs say:
Returns an HTML escaped representation of the given plain text.
TextUtils.htmlEncode(String) For this one, the docs say:
Html-encode the string.
Reading the docs, they both seem to do pretty much the same thing, but, when testing them, I get some pretty mysterious (to me) output.
Eg. With the input: <p>This is a quote ". This is a euro symbol: €. <b>This is some bold text</b></p>
Html.escapeHtml gives:
&lt;p&gt;This is a quote ". This is a euro symbol: &#8364;. &lt;b&gt;This is some bold text&lt;/b&gt;&lt;/p&gt;
Whereas TextUtils.htmlEncode gives:
&lt;p&gt;This is a quote &quot;. This is a euro symbol: €. &lt;b&gt;This is some bold text&lt;/b&gt;&lt;/p&gt;
So it seems that the second escapes / encodes the quote ("), but the first doesn't, although the first encodes the Euro symbol, but the second doesn't. I'm confused.
So what's the difference between these two methods ? Which characters does each escape / encode ? What's the difference between encoding and escaping here ? When should I use one or the other (or should I, gasp, use them both together ?) ?
You can compare their sources:
This is what Html.escapeHtml uses underneath:
https://github.com/android/platform_frameworks_base/blob/d59921149bb5948ffbcb9a9e832e9ac1538e05a0/core/java/android/text/Html.java#L387
This is TextUtils.htmlEncode:
https://github.com/android/platform_frameworks_base/blob/d59921149bb5948ffbcb9a9e832e9ac1538e05a0/core/java/android/text/TextUtils.java#L1361
As you can see, the latter only quotes certain characters that are reserved for markup in HTML, while the former also encodes non-ASCII characters, so they can be represented in ASCII.
Thus, if your input contains only Latin characters (which is unlikely nowadays), or you have set up Unicode in your HTML page properly, you can go with TextUtils.htmlEncode. If instead you need to ensure that your text works even when transmitted via 7-bit channels, use Html.escapeHtml.
As for the different treatment of the quote character ("): it only needs to be escaped inside attribute values (see the spec), so if you are not putting your text there, you should be fine.
Thus, my personal choice would be Html.escapeHtml, as it seems to be more versatile.
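To make the difference concrete, here is a simplified sketch of the two strategies in plain Java. It is not the platform source: the class and method names are invented for illustration, and the real methods handle more cases than this approximation does.

```java
// Simplified approximations for illustration only, not the Android implementations.
final class EscapeSketch {
    /** Escapes only markup-reserved characters, in the spirit of TextUtils.htmlEncode. */
    static String escapeMarkupOnly(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '&': sb.append("&amp;"); break;
                case '"': sb.append("&quot;"); break;
                case '\'': sb.append("&#39;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Escapes <, >, & and everything outside printable ASCII, in the spirit of Html.escapeHtml. */
    static String escapeMarkupAndNonAscii(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == '<') sb.append("&lt;");
            else if (cp == '>') sb.append("&gt;");
            else if (cp == '&') sb.append("&amp;");
            else if (cp > 0x7E || cp < ' ') sb.append("&#").append(cp).append(';');
            else sb.append((char) cp);
        }
        return sb.toString();
    }
}
```

With this sketch, the quote is rewritten only by the first method and the euro sign only by the second, which matches the behaviour described in the question.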

SQLite unicode slavic accented words Android

I'm trying to filter out accented words if user searches for them in local database. But I have problems, namely with slavic letters ČŠŽ. In my SQLite database I have a field "title" with value: "Želodček"
If I try to select LOWER(title) I always get back the same value "Želodček" whilst other words are correctly lower cased. Only if the word begins with ČŽŠ then it doesn't get lower cased. This only persists with words which have leading accented letters.
Database records
Stomach
Želodček
Uppercase with UPPER()
STOMACH
ŽELODčEK
Lowercase with LOWER()
stomach
Želodček
I've already tried setting localization with setLocale() with no luck. I also tried different collation like NOCASE, UNICODE, LOCALIZED but nothing worked. I'm wondering why when lower cased the first letter is not lower cased and when upper cased other accented words are lowercase.
I've solved the problem for LIKE searches by replacing the accented uppercase letters with their lowercase counterparts. But I still have a problem with full-text (FTS3) searching, because I can't use the same trick with MATCH.
-- works but it's a hack
SELECT title FROM articles WHERE REPLACE(LOWER(title),'Ž','ž') LIKE '%želodček%'
-- can't seem to get it work
SELECT title FROM articles WHERE title MATCH 'želodček' COLLATE NOCASE
Is there any solution to this or is there a bigger problem?
Update:
No optimal solution yet.
Un-optimal solution 1:
I decided to deal with the problem directly by changing data in the select query. While this doesn't work for all cases (and I would have to cover all accents) it suits my case for now. So I'm posting it:
-- LIKE query
SELECT title FROM articles WHERE (REPLACE(REPLACE(REPLACE(LOWER(title),'Č','č'),'Š','š'),'Ž','ž') LIKE ? COLLATE NOCASE)
-- MATCH query (FTS)
-- In this case I programmatically replace searched word with 2 word variation (one that starts with lowercase and one that starts with uppercase) ie: title='želodček OR Želodček'
SELECT title FROM articles WHERE title MATCH ? COLLATE UNICODE
Un-optimal solution 2:
As suggested by user CL., insert the title in normalized form (this didn't work for me because the normalized form was basically the original Unicode form). I took it further and inserted the title stripped of accents (basically the ASCII form). This may be a better general solution than solution 1, since the first covers only some accents.
But there are downsides:
data doubles (one Unicode title and one ASCII title), which can be a problem if you have a lot of data.
some characters are not supported (for example, Chinese characters will be gone after normalization and stripping)
ambiguity which you get by stripping accents (ie. two words "zelo" and "želo" have different meanings but will both turn up when searching).
Here's the Java code for it:
// Requires: import java.text.Normalizer;
// Gets you the ASCII version of the Unicode title, which you insert into a different column
String titleAsciiName = Normalizer.normalize(title, Normalizer.Form.NFD)
        .replaceAll("[^\\p{ASCII}]", "");
LIKE never uses a custom collation.
FTS can use a custom tokenizer, but you have to check whether unicode61 is available in all Android versions you want to support.
The Android database API does not allow creating custom implementations of LIKE or of an FTS tokenizer.
You might want to store a normalized version of your strings in the database.
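One possible shape for such a normalized version, as a sketch only: decompose the string, drop the combining marks rather than everything non-ASCII, and lower-case in Java, where case mapping is Unicode-aware. The class and method names are hypothetical. The same function would be applied to the stored titles and to the search term.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper for building a search key stored in an extra column.
final class SearchKey {
    static String of(String text) {
        // NFD splits an accented letter into base letter + combining mark(s)
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // Remove only the combining marks, so non-Latin scripts survive
        String withoutMarks = decomposed.replaceAll("\\p{M}+", "");
        // Java lower-casing handles non-ASCII letters, unlike SQLite's LOWER()
        return withoutMarks.toLowerCase(Locale.ROOT);
    }
}
```

Because only combining marks are removed, Chinese text is kept. The ambiguity between words such as "zelo" and "želo" remains, and letters without a canonical decomposition (for example đ) pass through unchanged apart from lower-casing.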

What are my options for displaying characters that Android can't?

I discovered today that Android can't display a small handful of Japanese characters that I'm using in my Japanese-English dictionary app.
The problem comes when I attempt to display the character via TextView.setText(). All of the characters below show up as blank when I attempt to display them in a TextView. It doesn't appear to be an issue with encoding, though - I'm storing the characters in a SQLite database and have verified that Android can understand the characters. Casting the characters to (int) retrieves proper Unicode decimal escapes for all but one of the characters:
String component = cursor.getString(cursor.getColumnIndex("component"));
Log.i("CursorAdapterGridComponents", "Character Code: " + (int) component.charAt(0) + "(" + component + ")");
I had to use Character.codePointAt() to get the decimal escape for the one problematic character:
int codePoint = Character.codePointAt(component, 0);
I don't think I'm doing anything wrong, and as Strings are by default UTF-16 encoded, there should be nothing preventing them from displaying the characters.
Below are all of the decimal escapes for the seven problematic characters:
⺅ Character Code: 11909(⺅)
⺌ Character Code: 11916(⺌)
⺾ Character Code: 11966(⺾)
⻏ Character Code: 11983(⻏)
⻖ Character Code: 11990(⻖)
⺹ Character Code: 11961(⺹)
𠆢 Character Code: 131490(𠆢)
Plugging the first six values into http://unicode-table.com/en/ revealed their corresponding Unicode numbers, so I have no doubt that they're valid Unicode characters.
The seventh character could only be retrieved from a table of UTF-16 characters: http://www.fileformat.info/info/unicode/char/201a2/browsertest.htm. I could not use its 5-character Unicode number in setText() (as in "\u201a2") because, as I discovered earlier today, Android has no support for Unicode strings past 0xFFFF. As a result, the string was evaluated as "\u201a" + "2". That still doesn't explain why the first six characters won't show up.
What are my options at this point? My first instinct is to just make graphics out of the problematic characters, but Android's highly variable DPI environment makes this a challenging proposition. Is using another font in my app an option? Aside from that, I really have no idea how to proceed.
Is using another font in my app an option?
Sure. Find a font that you are licensed to distribute with your app and has these characters. Package the font in your assets/ directory. Create a Typeface object for that font face. Apply that font to necessary widgets using setTypeface() on TextView.
Here is a sample application demonstrating applying a custom font to a TextView.
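On the side issue of the seventh character: in Java source, a \u escape takes exactly four hex digits, which is why "\u201a2" was read as U+201A followed by the digit 2. A minimal sketch of building such a string from its code point instead (the class and method names are made up for illustration):

```java
// Hypothetical helper; Character.toChars is standard Java.
final class SupplementaryChar {
    /** Builds a String from any Unicode code point, including those above 0xFFFF. */
    static String fromCodePoint(int codePoint) {
        // For code points above 0xFFFF this yields a surrogate pair (two chars)
        return new String(Character.toChars(codePoint));
    }
}
```

For example, SupplementaryChar.fromCodePoint(0x201A2) produces the same string as the literal "\uD840\uDDA2". Whether the glyph actually renders still depends on the font, as described above.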

SQLite upper() alike function for international characters

This question has actually been asked several times, but I didn't manage to find an answer.
There is a set of SQLite tables which are read-only: I can't change their structure or redefine collation rules. The tables contain some international characters (Russian, Chinese, etc.).
I would like to get some case-insensitive selection like:
select name from names_table where upper(name) glob "*"+constraint.toUpperCase()+"*"
It works only when name uses the Latin/ASCII charset; for international characters it doesn't work.
SQLite's manual reads:
The upper(X) function returns a copy of input string X in which all
lower-case ASCII characters are converted to their upper-case
equivalent.
So the question is: how to resolve this issue and make international chars in upper/lower case?
This is a known problem in SQLite. You can redefine the built-in functions via the Android NDK, but this is not a simple way. Look at this question
Note that the indexes on your tables will not be used (for a UDF), so the query can be very slow.
Instead, you can store the data you search on in another column in ASCII form.
For example:
"insert into names_table (name, name_ind) values ('"+name+"',"+"'"+toAsciiEquivalent(name)+"')"
name name_ind
----------------
Имя imya
Name name
ыыы yyy
and search by the name_ind column:
select name from names_table where name_ind glob "*"+toAsciiEquivalent(constraint)+"*"
This solution requires more space for data, but it is simple and fast.
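If transliteration to ASCII is more than you need, a variation on the same extra-column idea is to store a case-folded copy produced in Java, whose case mapping covers non-ASCII letters. This is only a sketch: the class, its methods, and the name_folded column mentioned below are hypothetical, and GLOB wildcards inside the constraint are not escaped.

```java
import java.util.Locale;

// Hypothetical helper for a case-folded search column.
final class FoldedSearch {
    /** Case-folds a value before it is stored or searched. */
    static String fold(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    /** Builds the GLOB argument for a "contains" search on the folded column. */
    static String globArgument(String constraint) {
        return "*" + fold(constraint) + "*";
    }
}
```

The result of globArgument(constraint) would then be bound as a parameter, for example in select name from names_table where name_folded glob ?, instead of being concatenated into the SQL string.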
Instead of providing full Unicode case support by default, SQLite provides the ability to link against external Unicode comparison and conversion routines. The application can overload the built-in NOCASE collating sequence (using sqlite3_create_collation()) and the built-in like(), upper(), and lower() functions (using sqlite3_create_function()). The SQLite source code includes an "ICU" extension that does these overloads. Or, developers can write their own overloads based on their own Unicode-aware comparison routines already contained within their project.
Reference: http://www.sqlite.org/faq.html

How to replace special characters with their equivalent in android?

While retrieving data from a remote server I get some special characters. When I recognize a special character I replace it with its original character, but when I don't know the special character, how can I replace it? Some of the special characters are as follows:
â €œ €™ € €“ â; ' & € ü Ü Û Ù
û ù Ø ß
These are only some of them; I get others as well. How can I replace these with their original characters in Android?
It seems your question is how to decode HTML-encoded text:
Html.fromHtml(server_response).toString();
