How do the Unicode control characters work? - android

I'm trying to display a phone number correctly in a right-to-left layout. I want +111111111, but it currently appears as 111111111+. I found a solution that uses LRM (left-to-right mark), the Unicode control character '\u200E'.
There may be several formats for phone numbers in different parts of the world, such as XXX-XXX-XXXX. To prevent further bugs, I need to understand how these control characters work, especially the ones that change the direction of strings.
In my understanding, for common characters:
strings are stored as bytes in memory.
the editor/textview loads the bytes and looks them up in Unicode.
the editor/textview shows those Unicode characters using fonts.
So, at which step do control characters like LRM take effect? How can I make sure that using them does not cause further bugs?
I hope I've made myself clear.
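One way to see where the mark takes effect is to run the Unicode Bidirectional Algorithm directly. The following is a minimal sketch in plain Java, assuming that java.text.Bidi resolves directions the same way the text view does; the class LrmDemo and its helpers are made-up names for illustration. LRM is stored in the string like any other character, and it only matters at the layout step, when each character is assigned an embedding level (odd means right-to-left, even means left-to-right).

```java
import java.text.Bidi;

final class LrmDemo {
    // U+200E LEFT-TO-RIGHT MARK: invisible, but counts as a strong left-to-right character.
    static final char LRM = '\u200E';

    // Hypothetical helper: prefix the text with LRM so that the neutral '+'
    // and the digits after it can resolve as left-to-right.
    static String withLrm(String text) {
        return LRM + text;
    }

    // Resolved embedding level of the character at the given index, assuming
    // the surrounding paragraph is right-to-left (as in an RTL layout).
    static int levelInRtlParagraph(String text, int index) {
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_RIGHT_TO_LEFT);
        return bidi.getLevelAt(index);
    }

    public static void main(String[] args) {
        String phone = "+111111111";
        String marked = withLrm(phone);
        // Index 0 is '+' in the plain string; index 1 is '+' in the marked string.
        System.out.println("'+' level without LRM: " + levelInRtlParagraph(phone, 0));
        System.out.println("'+' level with LRM:    " + levelInRtlParagraph(marked, 1));
    }
}
```

An odd level for '+' in the plain string is consistent with the sign being drawn on the right of the digits; with the mark in front, the sign and the digits share an even level and stay in their logical order.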

Related

How to get the current locale's alphabet?

Background
Today I noticed that in Google's Contacts app, if you have both English and Hebrew contacts and you switch to the English locale as the main one, the first contacts are in English.
But if you switch to the Hebrew locale as the main one, the first contacts are in Hebrew.
The problem
I don't see which functions are used to do that. I tried to search over the Internet about this behavior and how it's done, but couldn't find it.
Comparing the values of characters will always return the same result, so the order here should be more dynamic.
What I've found
I thought this would help me:
val unicodeLocaleKeys = Locale.getDefault().unicodeLocaleKeys
But it always returns an empty set.
I also searched for such a function in classes such as Character, Unicode*, and String. I don't think it exists there.
The question
How does the Google Contacts app sort contacts according to the current locale?
Is it perhaps possible to get the whole set of characters used by a specific locale?
Maybe it's possible to compare characters while giving an order of priority to locales (users can choose multiple locales)?
Maybe you are looking at the wrong thing.
The Contacts app does not seem to have a built-in alphabet per locale; it just uses a collation (locale-aware sort) and displays the first character. Possibly it detects "symbols" (via Unicode categories) and puts all symbols in the same bin.
If needed, you can get the script name (and the direction) from Unicode, and you can find the alphabet in a few places (e.g. Wikipedia). That will fail for Chinese and other scripts with very large character sets. The problem is that the "alphabet" is language-specific: in some European countries, (some) accented characters, or groups of characters, are treated as a single letter (also in phone books).
So, if you want to keep things simple:
use collation and just the first character
the same, but remove the accent and check whether the letter has the same priority in alphabetical order: if so, ignore the accent; otherwise keep it (see e.g. "Å - place in alphabet"). Maybe do the same with two-letter groups, e.g. ll in the past.
find a library that handles such complex cases (and that is updated regularly). This will probably help for Chinese and other languages with a huge number of characters.
EDIT: in short, instead of sorting strings normally with str1.compareTo(str2), you should use:
Collator.getInstance().compare(str1,str2)
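As a sketch of how that comparison plugs into a sort, here is a small plain-Java example. The class LocaleSort and the method sortForLocale are made-up names. Which script comes first under a given locale depends on the collation data shipped with the platform, so the example sticks to a case difference, where the contrast with compareTo is easy to see.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class LocaleSort {
    // Returns a new list sorted by the collation rules of the given locale.
    static List<String> sortForLocale(List<String> names, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(collator); // Collator is a Comparator, so it can be passed directly
        return sorted;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("cherry", "Banana", "apple");

        // Plain compareTo order: by UTF-16 code unit, so uppercase sorts first.
        List<String> byCodeUnit = new ArrayList<>(names);
        Collections.sort(byCodeUnit);
        System.out.println(byCodeUnit);

        // Collation order: letters first, case only as a tie-breaker.
        System.out.println(sortForLocale(names, Locale.ENGLISH));
    }
}
```

Passing Locale.getDefault() instead of a fixed locale would follow whatever the user has selected in the system settings.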

How to limit the use of certain character sets

I hope this question isn't going to be flagged for not showing any actual code, but that's the core of the situation: I simply have no clue where to start, even after trying several combinations of keywords on both Google and here on SO.
My client suddenly decided that half of the Android app I'm developing for him has to be in Chinese. I have changed the database so that some fields accept Simplified Chinese characters, and now I need to make sure that my client (who lives in Holland) uses only those characters in one particular EditText field in the app. (There are more database fields that now allow only Simplified Chinese, but their values come from a dropdown list in the app, so I don't need to worry about wrong characters there.)
So how would one make sure that only Simplified Chinese is used in an EditText field?
Here is a project in Ruby that attempts to detect whether characters are Traditional Chinese, Simplified Chinese, or Japanese (maybe others?): https://github.com/jpatokal/script_detector
This detection is based on the Unihan Database, in which there is a file called Unihan_Variants.txt. (Download zip file containing this text file here.)
Conceivably, you could parse the txt file into a lookup table and check the Unicode value of the text as it is entered, in onTextChanged() for your EditText. However, the readme of the project linked above states: "It is important to understand that this requires long sections of text to work reliably, since a single character or even several characters may be valid Japanese, traditional Chinese and simplified Chinese simultaneously." So weeding out characters on an individual basis might prove difficult.
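A coarser per-character check is possible without the Unihan data. The sketch below, in plain Java with the made-up names HanFilter and isAllHan, only tests whether every code point belongs to the Han script. It cannot tell Simplified from Traditional Chinese or from Japanese kanji, for the reason quoted above; that distinction would still need a lookup table built from Unihan_Variants.txt.

```java
final class HanFilter {
    // True if the text is non-empty and every code point is in the Han script.
    // Note: this accepts Simplified, Traditional and Japanese Han characters alike.
    static boolean isAllHan(String text) {
        if (text.isEmpty()) {
            return false;
        }
        return text.codePoints()
                .allMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
    }

    public static void main(String[] args) {
        // U+6C49 U+5B57 are two common Han characters, written as escapes
        // so the source file stays ASCII-only.
        System.out.println(isAllHan("\u6C49\u5B57"));
        System.out.println(isAllHan("abc"));
    }
}
```

Such a check could be called on the entered text to reject Latin input early, leaving the Simplified-versus-Traditional question to a separate step.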

What is the purpose of "[Developer] Accented English" (zz-ZZ) in Android?

In Android KitKat, if I choose Settings > Language & Input > Language, the first choice I am offered is [Developer] Accented English. This replaces each Roman letter with an accented version. You can find a list of all the character mappings here. (It helps if you can read French).
What is the purpose of this setting? Is it just to show how characters can be mapped to other characters? Or can it be used productively (to create specific phonemes in text-to-speech output, for example)?
It's a technique called 'Pseudolocalization', and it's used to help test that an app is handling aspects of localization correctly.
The idea is that instead of waiting for an app's string resources to be translated into other languages, which could take some time, a "fake" pseudo-language is used instead. If the app behaves well with this fake translation, then chances are it will perform well with actual translations. There are different variations of pseudolocalization out there, but most tend to do some of the following:
Add brackets [ ... ] or other delimiters around the string: this makes it easier to ensure that strings are not getting clipped at either end.
Replace regular characters with accented characters: if you see a string without accented characters, then that's a sign that it might be hardcoded instead of being treated as a localizable resource. (In the past, this was also used to ensure that apps could handle non-ASCII characters correctly and didn't lose data in code page translation, though this is less of an issue now that modern platforms support Unicode.)
Add padding to the string: this simulates languages such as German, which often have longer translations for the corresponding English string. If the padded string gets truncated instead of wrapping or flowing, then the German string will likely do the same.
Add known-to-be-tricky characters to act as 'canaries': on some platforms, symbols from specific parts of the Unicode range may be added to ensure that they are handled or supported properly. For example, a Chinese character might be added to ensure that Chinese fonts are supported: if this ends up showing as an empty square, then that would indicate a problem. Other common 'canary' characters include code points from outside the BMP, or combining characters.
One advantage of using pseudolocalization over actual translation is that the testing can be performed by someone who does not understand the target language: "[Àççôûñţ Šéţţîñĝš___]" still looks similar to the original English text "Account Settings". If you try using it with a screen reader such as TalkBack, or otherwise send pseudolocalized text to text-to-speech, you'll likely get nonsense, since it will try to treat the accented characters as actual accented characters.
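The first three transformations can be sketched in a few lines of plain Java. This is an illustration only, not how Android generates its pseudo-locale: the class Pseudolocalizer, the tiny mapping table (just the letters of "Account Settings") and the fixed three-underscore padding are all assumptions made for the example.

```java
final class Pseudolocalizer {
    // Illustrative mapping only: position i in PLAIN maps to position i in ACCENTED.
    // A real pseudo-locale would cover the whole alphabet.
    private static final String PLAIN = "AcountSeigs";
    private static final String ACCENTED =
            "\u00C0\u00E7\u00F4\u00FB\u00F1\u0163\u0160\u00E9\u00EE\u011D\u0161";

    // Assumed fixed padding; real tools usually scale it with the string length.
    private static final String PADDING = "___";

    static String pseudolocalize(String source) {
        StringBuilder out = new StringBuilder("["); // opening delimiter
        for (char ch : source.toCharArray()) {
            int i = PLAIN.indexOf(ch);
            // Replace mapped letters with their accented form, keep everything else.
            out.append(i >= 0 ? ACCENTED.charAt(i) : ch);
        }
        return out.append(PADDING).append(']').toString(); // padding, then closing delimiter
    }

    public static void main(String[] args) {
        System.out.println(pseudolocalize("Account Settings"));
    }
}
```

With this table, "Account Settings" becomes the bracketed, accented and padded string quoted above. The accented letters are written as Unicode escapes so that the source file stays ASCII-only.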

How To Detect Is Text Human Readable?

I am wondering if there's a way to tell whether a given text is human readable. By human readable, I mean that it has some meaning and is formatted like an article written by somebody, or at least generated by a software translator with the intention of being read by a human.
Here's the background: I recently built an app that lets users upload a short text to a database. Early in the deployment I noticed that some users were uploading corrupted text because of an encoding problem. That problem was fixed later, but it left me wondering whether there is a way to detect non-human-readable text before serving it back to users.
Any advice will be appreciated. The scope might be too large to include other languages, so at the moment let's limit the discussion to English only.
You can try a language identification tool, or something similar.
Basically, you count the characters, or groups of characters (character n-grams), and compare the distribution of letters in the submitted text with the distribution in a collection of texts written in good English. (Make sure that the collection is representative of the expected input.)
As a continuation of the n-gram approach, you might want to try a dictionary-based approach and check for the presence of 'stop words' (e.g. 'the', 'a', 'an', 'of') in the input text.
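The stop-word idea can be sketched as a simple ratio. The class StopWordCheck, the method stopWordRatio and the short word list are assumptions for illustration: the list extends the four examples above with a few more common words, and a real check would use a fuller list and a threshold tuned on representative input.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class StopWordCheck {
    // Assumed word list: the four examples from the text plus a few more.
    private static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("the", "a", "an", "of", "and", "to", "in", "is"));

    // Fraction of alphabetic tokens that are stop words (0.0 for text with no tokens).
    static double stopWordRatio(String text) {
        int total = 0;
        int stops = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            total++;
            if (STOP_WORDS.contains(token)) {
                stops++;
            }
        }
        return total == 0 ? 0.0 : (double) stops / total;
    }

    public static void main(String[] args) {
        System.out.println(stopWordRatio("The cat sat on a mat"));
        System.out.println(stopWordRatio("xq9 zzk ##@ lpw"));
    }
}
```

Ordinary English prose tends to score well above zero, while corrupted text usually contains no stop words at all; very short inputs are unreliable either way.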
Most NLP libraries will do the job (spaCy is a very common one). You can also go for language detection: langdetect (https://pypi.org/project/langdetect/) supports this, as do many others. If you need to be less specific (more math than language), look into phonotactics (with BLICK for Python: https://github.com/mmcauliffe/python-BLICK), which examines how the characters in a string are ordered.
Do a hexdump and make sure each character is less than or equal to 0x7f.
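The same test can be done in code instead of with a hexdump. A minimal sketch in plain Java, with the made-up names AsciiCheck and isAscii; it works on the decoded string, which for UTF-8 data is equivalent to checking that every byte is at most 0x7f.

```java
final class AsciiCheck {
    // True if every UTF-16 code unit is in the 7-bit ASCII range.
    static boolean isAscii(String text) {
        return text.chars().allMatch(c -> c <= 0x7F);
    }

    public static void main(String[] args) {
        System.out.println(isAscii("plain English text"));
        // U+00E9 is e with acute accent, as in the word "cafe" with an accent.
        System.out.println(isAscii("caf\u00E9"));
    }
}
```

Note that this also rejects legitimate English text containing accented letters or typographic quotes, so it is a strict filter.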

Is there any unicode character whose glyph is missing in all fonts? [duplicate]

This question already has answers here:
Is there a "glyph not found" character?
(8 answers)
Closed 7 years ago.
On Android, I want to be able to detect if the font used can display a certain character or not, but as I understand it this is not possible with conventional means as indicated by Check if custom font can display character
To detect this I'm writing the character I want to check to a bitmap and then I write another character that I know is missing to another bitmap and compare the content of the bitmaps. If they are equal the character is missing.
The question is: is there any Unicode character whose glyph is (more or less) guaranteed to be missing from the fonts typically used on Android phones?
The Unicode replacement character sounds promising when reading about it on Wikipedia:
It is used to indicate problems when a system is not able to render a
stream of data to a correct symbol. It is most commonly seen when a
font does not contain a character, but is also seen when the data is
invalid and does not match any character
However after doing a bit of testing I see that this character is not used to represent missing glyphs on either my Windows 7 computer or the Android phone I've tested with (Motorola Atrix).
There isn't any designated Unicode value for the glyph that is used to render characters missing from the font. In the font itself, glyph ID 0 should always be the .notdef glyph, which is used for all characters that lack a glyph. However, it is not possible to get this information from the fonts on Android, so it's not possible to use the .notdef glyph directly.
In Unicode there are many reserved/unassigned code points, and my limited testing indicates that these code points are rendered using the .notdef glyph. So by using U+0978, a reserved code point in the middle of the Devanagari block, as the known-missing character, I can detect whether some other valid, known character exists in the font I want to test.
This is not a future-proof solution, since the Unicode Consortium may assign characters to reserved code points in the future. But it's good enough for my needs, since what I want to do is temporary and will not be relevant in the near future.
Update:
The solution based on U+0978 did not work for long: that character was added in the Unicode 7.0 release in June 2014. Another option is to use a character that exists in Unicode but is very unlikely to be included in a normal font.
U+124AB in the Early Dynastic Cuneiform block probably doesn't exist in many fonts at all.
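Whether a probe code point is still reserved can at least be checked against the Unicode tables of the runtime. The sketch below, in plain Java with the made-up names ProbeCheck and isUnassigned, does only that. It says nothing about whether a particular font has a glyph, and the answer depends on the Unicode version the runtime ships with.

```java
final class ProbeCheck {
    // True if the runtime's Unicode tables have no character at this code point.
    static boolean isUnassigned(int codePoint) {
        return Character.getType(codePoint) == Character.UNASSIGNED;
    }

    public static void main(String[] args) {
        // U+0978 was reserved when the answer was written and assigned in Unicode 7.0,
        // so the result here depends on the runtime's Unicode version.
        System.out.println("U+0978 unassigned: " + isUnassigned(0x0978));
        // U+0378 is used here as an example of a code point that is assumed
        // to be still reserved.
        System.out.println("U+0378 unassigned: " + isUnassigned(0x0378));
    }
}
```

A check like this could guard the bitmap comparison: if the chosen probe is no longer unassigned on the device, the comparison result should not be trusted.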
