sprintf() handling of %s extended ASCII (ISO 8859-1) on some runtimes? - android

I'm using ISO 8859-1 (the Latin-1 extended ASCII character set) in my C application. When I strcpy/strcat the portions of the string together, it works fine. But when I use sprintf("%s %s"), on some runtimes (particularly certain versions of Android), the string gets truncated when an extended ASCII character (specifically é, although I haven't tried others) is hit.
I thought %s was just supposed to copy the bytes until '\0' was hit. I suspect that strcpy/strcat works because it does do just that, without any formatting. What could possibly be going on here?
I should note that I'm not viewing the text using printf(), but rather with my own text rendering engine, which handles ISO-8859-1 just fine.
UPDATE:
To clarify, I have an NDK app, which is keeping the string in C, and passing it to my OpenGL based text rendering engine. If I pass the full string as a char* literal, it displays fine. If I sprintf() the portions together, it gets truncated at the é character.
For example:
char buffer[1024];
strcpy(buffer, "This is ");
strcat(buffer, "the string I want to diésplay.");
That shows up fine. But this:
sprintf(buffer, "%s%s", "This is ", "the string I want to diésplay.");
Prints as:
This is the string I want to di

The behavior of s[n]printf() is specified differently than the behavior of string-manipulation functions such as strcpy() and strcat(). The printf-family functions are all required to produce the same byte sequences when presented identical formats and print items. The only difference is in where those bytes are sent. Thus, if your C library were built such that it performed a transformation on string data (maybe a transcoding) when printing to the standard streams via printf(), then it would perform that same transformation when printing to a string via sprintf().
The "f" in "printf" is for "formatted". The standard neither says nor implies that formatting a string must mean dumping its bytes to the output verbatim, so a transcoding or other transformation such as I hypothesized above is not out of the question. In fact, the docs for some versions of these functions indicate locale-dependence ("Note that the length of the strings produced is locale-dependent and difficult to predict"), so transcoding in particular is a real possibility.
Any specific explanation of the third-party observations you describe would necessarily be speculative, as you have not presented nearly enough code or data to make a confident diagnosis. I am inclined to suspect an issue revolving around running the program in a locale that uses a character encoding different from the one used internally by the program. If so, then you may be able to reproduce the problem locally by varying the locale in which you run, and you may be able to address it by ensuring one way or another that your program always runs in a suitable locale. Among other things, you might use the setlocale() function to help here (it also queries the current locale when passed NULL), especially if you want to limit the scope in which you exercise locale control.
Since ultimately you are relying on printf-family functions only for string manipulation, however, I think it would be better to use the workaround presented in the question: as much as possible, use C's dedicated string-manipulation functions, such as strcpy() and strncat(), to perform your string building. Since you are not relying on the stdio functions for your actual output, this should be fine.

Related

mXparser result rounding

I am trying out mXparser in an android app and I almost have it working. But if I parse the following expression "10/3" then it returns 3.33333333335. Why this rounding at the end? And how do I tell mXparser to return 3.33333333333 instead?
I am writing the app using Kotlin and have added mXparser through Maven.
Alternatively, do you know of a better/more used/more maintained math parser library for Android?
The reason is that computers calculate in base 2, not base 10. The number 10/3 has an infinite expansion in both base 2 and base 10 meaning it must be truncated. The decimal expansion of 10/3 is 3.333..., which when you cut it off simplifies to a bunch of 3's; while the binary expansion is 11.010101010101... and when you cut it off and convert back to decimal, it's totally believable that you could get the 5 at the end.
I'm not sure you can get around that when using a computer since computers have to use binary and they also have to truncate the binary expansion.
Any system based around IEEE 754 double precision will give the same answer. That includes all major programming languages. This is a very frequent SO question. See for example Is floating point math broken?
The solution is to never use the default Double.toString() method for your output. Format the output with a specific number of decimal places and the problem goes away.
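For example, in Java (the same calls work from Kotlin); the value 10.0 / 3.0 here stands in for whatever the parser returns:

double v = 10.0 / 3.0;
System.out.println(v);                                       // 3.3333333333333335
System.out.println(String.format(java.util.Locale.ROOT, "%.11f", v)); // 3.33333333333

The first line shows the raw Double.toString() artifact; the second rounds to 11 decimal places, which is the output the question asks for.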
A more complex solution is to use a rational representation of your numbers, so the result of 10/3 is stored internally as the rational number {numerator: 10, denominator: 3}. This works for basic arithmetic but can't work with functions like cos(x) or sqrt(x). The Jep parsing/evaluation library does have options to allow rational numbers. (Disclaimer: I'm one of the authors of Jep.)
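To make the idea concrete, here is a toy rational type in Java (an illustration only, not Jep's API; written as a Java 16+ record for brevity):

record Rational(long num, long den) {
    Rational times(Rational o) { return reduce(num * o.num, den * o.den); }
    Rational div(Rational o)   { return reduce(num * o.den, den * o.num); }
    static Rational reduce(long n, long d) {
        long g = gcd(Math.abs(n), Math.abs(d));
        return new Rational(n / g, d / g);
    }
    static long gcd(long a, long b) { return b == 0 ? a : gcd(b, a % b); }
    @Override public String toString() { return num + "/" + den; }
}

With this, new Rational(10, 1).div(new Rational(3, 1)) prints as 10/3, exactly, because nothing is ever converted to binary floating point.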

How to speed up searching alphabetized word list for leading wildcard matches

I'm a word puzzle junky in my spare time, so I've spent a LOT of other spare time working on a helper program that allows wildcards in search patterns. It works great. On my Dell Laptop (i5, 8GB RAM) the search of a 140,000-word "dictionary" for wildcard matches for words has an almost imperceptible and definitely acceptable delay that occurs only if tens of thousands of words are returned. Java rules. So does its implementation of regex and match().
I was hoping to port it to Android. I worked all day getting a more-or-less equivalent app to compile. No chance with the given code architecture.
The problem is that leading wildcard characters can (must) be allowed. E.g., ???ENE returns 15 matches--from achENE to xylENE and *RAT returns 22 matches--from aristocRAT through zikuRAT--i.e., all 140,000 words must (?) be searched, which is going to take aaaaaaaaawhiiiiiiiiile on most (all?) Android devices. (Each took less than a second on my laptop.) (It takes my PC 3 seconds to return all 140,000 words and a little longer to eyeball them all.)
Since some word puzzles allow variable numbers of letters in words, disallowing leading wildcards cuts the heart out of the app for such puzzles. But if the search pattern had to start with a letter it would be easy enough to then do a binary search (or something quicker). (And it still might be unacceptably slow.)
Anyway, I was wondering if anybody might know some algorithm or can think of some approach that might be applied to speed up searches with leading wildcard characters.
I believe that the optimized version of what you are trying to do is widely known as the Unix/Linux utility "grep", which, if I remember correctly, uses the Boyer-Moore search algorithm.
Under the covers, Java's Pattern class uses Boyer-Moore. And it supports regex, so if you can write something to turn your wildcard search patterns into regular expressions, you can use Pattern.
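For example, a rough sketch of that conversion (the '?' = one letter, '*' = any run semantics, and the WildcardSearch name, are my assumptions about your program):

import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class WildcardSearch {
    // '?' matches exactly one character, '*' matches any run (possibly empty).
    static Pattern toPattern(String puzzle) {
        StringBuilder regex = new StringBuilder("(?i)");   // ignore case
        for (char c : puzzle.toCharArray()) {
            if (c == '?')      regex.append('.');
            else if (c == '*') regex.append(".*");
            else               regex.append(Pattern.quote(String.valueOf(c)));
        }
        return Pattern.compile(regex.toString());
    }

    static List<String> matches(List<String> words, String puzzle) {
        Pattern p = toPattern(puzzle);
        return words.stream()
                    .filter(w -> p.matcher(w).matches())   // whole-word match
                    .collect(Collectors.toList());
    }
}

Then matches(words, "???ENE") returns the 15 achENE-to-xylENE words from your example.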
There's an interesting Java implementation of grep at http://www.java2s.com/Code/Java/Regular-Expressions/AnotherGrep.htm
It uses memory-mapped files. I'm guessing that you won't be able to fit your entire word list into memory, but you could split it up into a bunch of smaller files - the implementation above memory-maps one file at a time. You'd have to do some testing to find the optimal size of a file.
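A minimal sketch of that idea, assuming UTF-8 and a word-per-line layout (the AnotherGrep example linked above does roughly this, but with its own charset and option handling):

import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MappedGrep {
    // Map one chunk file and regex-scan the decoded text in place.
    static void grep(Path file, Pattern wordPattern) throws Exception {
        try (FileChannel ch = FileChannel.open(file)) {
            CharBuffer text = StandardCharsets.UTF_8.newDecoder()
                    .decode(ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()));
            Matcher m = wordPattern.matcher(text);   // CharBuffer is a CharSequence
            while (m.find())
                System.out.println(m.group());
        }
    }
}

A call like grep(Path.of("words-aa.txt"), Pattern.compile("(?im)^.{3}ene$")) would print the six-letter ...ENE words in that chunk.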
I just Googled and found that having a second list, reverse-alphabetized, might be a way to make a leading wildcard become trailing, opening the door to a binary search for the pattern start. Interesting. But *a???ene* is also a legal search pattern in the program. What then? (Yeah. How often would you need such a search?)
I just found this about Apache Lucene:
Leading wildcards (e.g. *ook) are not supported by the QueryParser by default. As of Lucene 2.1, they can be enabled by calling QueryParser.setAllowLeadingWildcard( true ). Note that this can be an expensive operation: it requires scanning the list of tokens in the index in its entirety to look for those that match the pattern.
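For what it's worth, here is a minimal sketch of the reversed-list idea mentioned above (assuming a consistently lower-cased word list; a pattern with wildcards at both ends, like *a???ene*, would still need a full scan or a smarter index):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class ReverseIndex {
    static String reverse(String s) { return new StringBuilder(s).reverse().toString(); }

    // Sort the reversed forms once, up front.
    static List<String> build(List<String> words) {
        List<String> rev = new ArrayList<>(words.size());
        for (String w : words) rev.add(reverse(w));
        Collections.sort(rev);
        return rev;
    }

    // "*rat" becomes a binary search for the prefix "tar" in the reversed list.
    static List<String> endingWith(List<String> revIndex, String suffix) {
        String key = reverse(suffix);
        int i = Collections.binarySearch(revIndex, key);
        if (i < 0) i = -i - 1;                     // insertion point
        List<String> out = new ArrayList<>();
        for (; i < revIndex.size() && revIndex.get(i).startsWith(key); i++)
            out.add(reverse(revIndex.get(i)));
        return out;
    }
}

A pattern like ???ENE works the same way: prefix-search for "ene" in the reversed list, then filter the hits to length 6.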

What is the purpose of "[Developer] Accented English" (zz-ZZ) in Android?

In Android KitKat, if I choose Settings > Language & Input > Language, the first choice I am offered is [Developer] Accented English. This replaces each Roman letter with an accented version. You can find a list of all the character mappings here. (It helps if you can read French).
What is the purpose of this setting? Is it just to show how characters can be mapped to other characters? Or can it be used productively (to create specific phonemes in text-to-speech output, for example)?
It's a technique called 'Pseudolocalization', and it's used to help test that an app is handling aspects of localization correctly.
The idea is that instead of waiting for an app's string resources to be translated into other languages - which could take some time - a "fake" pseudo-language is used instead. If the app behaves well against this fake translation, then chances are it will perform well with actual translations. There are different variations of pseudolocalization out there, but most tend to do some of the following:
Add brackets [ ... ] or other delimiters around the string: this makes it easier to ensure that strings are not getting clipped at either end.
Replace regular characters with accented characters: if you see a string without accented characters, then that's a sign that it might be hardcoded instead of being treated as a localizable resource. (In the past, this was also used to ensure that apps could handle non-ASCII characters correctly and didn't lose data in code page translation, though this is less of an issue now that modern platforms support Unicode.)
Add padding to the string: this is to simulate languages such as German which often have longer translations for the corresponding English string. If the padded string gets truncated instead of wrapping or flowing, then likely the German string will do the same.
Add known-to-be-tricky characters to act as 'canaries': on some platforms, symbols from specific parts of the Unicode range may be added to ensure that they are handled or supported properly. For example, a Chinese character might be added to ensure that Chinese fonts are supported: if this ends up showing as an empty square, then that would indicate a problem. Other common 'canary' characters include code points from outside the BMP, or Combining Characters.
One advantage of using pseudolocalization over actual translation is that the testing can be performed by someone who does not understand the target language: "[Àççôûñţ Šéţţîñĝš___]" still visually appears similar to the original English text "Account Settings". If you try using it with a screen reader such as TalkBack, or otherwise send pseudolocalized text to text-to-speech, you'll likely get nonsense, since it will try to treat the accented characters as actual accented characters.
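As a toy illustration of the bracketing, accenting, and padding techniques in Java (a handful of hand-picked mappings, not Android's actual zz-ZZ table, which covers every letter):

import java.util.Map;

class Pseudolocalizer {
    static final Map<Character, Character> ACCENTS = Map.of(
            'a', 'à', 'e', 'é', 'i', 'î', 'o', 'ô', 'u', 'û', 'c', 'ç');

    static String pseudo(String s) {
        StringBuilder sb = new StringBuilder("[");        // delimiters catch clipping
        for (char c : s.toCharArray())
            sb.append(ACCENTS.getOrDefault(c, c));        // accents catch hardcoded strings
        sb.append("_".repeat(s.length() * 3 / 10));       // padding simulates longer languages
        return sb.append(']').toString();
    }
}

Here pseudo("Account Settings") yields "[Aççôûnt Séttîngs____]": still readable as English, but visibly localized.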

Android pocketsphinx & Fsg model

Context
I am currently building an SDK/service with which applications can access voice-based commands.
For the moment I'm using Android PocketSphinx to detect a keyword (which is "wake"), and then analyse the whole sentence with Google voice recognition.
But my problem is I want to make it all offline! So I'm on my way to replacing Google voice recognition with full use of PocketSphinx...
My Problem
The user defines the word he wants to detect, and previously I just compared that word with what Google voice speech-to-text returned...
So now I want to update the grammar that PocketSphinx uses with just the word given by the user, which is problematic because (according to the javadoc of Android PocketSphinx) it can only take grammar files!
Question
Is there any way I can update the Android PocketSphinx grammar on the fly?
Edit
I forgot to talk about this method:
public void addFsgSearch(String searchName, FsgModel fsgModel) (in github pocketsphinx)
which seems not to take a grammar file like the other grammar setter methods, but rather a class/struct? The problem is that it isn't documented...
If you need to detect just one word, consider using addKeywordSearch.
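A sketch along the lines of the pocketsphinx-android demo (the model file names, the assetsDir variable, and the "wakeup" search name are assumptions; adjust to your setup):

import java.io.File;
import java.io.IOException;
import edu.cmu.pocketsphinx.SpeechRecognizer;
import edu.cmu.pocketsphinx.SpeechRecognizerSetup;

class WakeWord {
    // assetsDir: wherever the pocketsphinx model assets were synced on the device.
    static SpeechRecognizer build(File assetsDir, String userWord) throws IOException {
        SpeechRecognizer recognizer = SpeechRecognizerSetup.defaultSetup()
                .setAcousticModel(new File(assetsDir, "en-us-ptm"))
                .setDictionary(new File(assetsDir, "cmudict-en-us.dict"))
                .setKeywordThreshold(1e-45f)       // tune recognizer sensitivity
                .getRecognizer();
        // The keyphrase is a plain String, not a grammar file, so it can be
        // re-registered whenever the user picks a new word.
        recognizer.addKeyphraseSearch("wakeup", userWord);
        return recognizer;
    }
}

After recognizer.startListening("wakeup"), detection arrives via the listener's onPartialResult callback, as in the demo.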
I had the same issue, and more. Perhaps these undocumented discoveries can help you.
Using the overloaded method "addGrammarSearch(String name, String fsgString)" allows you to put your entire FSG or JSGF grammar definition in a string, rather than sourcing it from a file if you wish (only a small file open/read time advantage).
"addKeyphraseSearch(String name, String keyphrase)" // only accommodates ONE WORD or PHRASE, no threshold, no grammar.
"addKeywordSearch(String name, File keywordList)" // accommodates MULTIPLE key WORDS or PHRASES, adding thresholds for each.
Several caveats include:
The grammar searches use JSGF format, parsing the defined syntax correctly. However:
1.1 Tags are not implemented
1.2 It is unclear whether weights (though they use the same // syntax as in keyword lists) actually act as recognizer thresholds (they have different meanings in PocketSphinx versus Sun Microsystems).
1.3 Rule names are not implemented either.
1.4 In other words, you provide a grammar in JSGF, and your Hypothesis as well as FinalResult strings still give you the recognized lowest-level phrase detected in the grammar -- NOT the grammar tags, nor even rule metasymbols.
1.4.1 IMHO, that makes grammars pointless, and actually less efficient and less flexible than keyword list files (which are actually words or phrases) due to the option to provide a threshold for recognizer scrutiny, per phrase. Further, if the RULE & TAG names are not returned, then there is zero information regarding the structure of the grammar that was recognized. So as syntactically complex and flexible as it is, I do not see the advantage of bothering with a grammar definition at all in PocketSphinx; the best multiple keyphrase approach is simply to expand your grammar into a keyword list file. Please correct me if I am mistaken.
Search methods, whether containing the word "phrase" or the word "word", actually accommodate both phrases or single words.
I have assumptions re: the undocumented fsgModel class, but we're not allowed to give assumptions.
Though this may help clarify some aspects, the above fails to add any functionality to the package. Lastly, the C source code has methods getRuleName() and getTagName(). But discussions regarding this topic between users and developers seem to stonewall -- there is no motivation to add tag or rule-name associations to recognized words or phrases in a defined grammar, apparently because the developers believe grammars are old-school and nobody uses them anymore.

non ASCII SSIDs in android

Apparently SSIDs can contain UTF-8 chars and also control characters, etc. IIUC, to contain UTF-8 chars they must specify the SSIDEncoding field. Until now I was under the impression that I could only get ASCII bytes.
How should I handle the situation in Android? Namely, how can I check the SSIDEncoding field from the ScanResult? Do I need to? Also, what does ScanResult.SSID contain in these cases (including the case where an SSID includes non-printable characters)?
Related
Why can't I detect a wifi SSID with unicode characters on Android?
The short answer is that you can't detect it, but you don't need to.
Android only returns a String in ScanResult and WifiConfiguration (there is no documented encoding field). Since Java Strings can accept different encodings but internally store Unicode (How to check the charset of string in Java?), the original encoding is lost in translation. But if all you need is a String, and you are using APIs and storage mechanisms that accept a Java String, then all of those things should already support encoding that String to whatever format they require, and you probably have nothing to worry about.
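A contrived Java snippet to illustrate (the framework does the decoding for you; this just shows why the original encoding stops mattering once you hold a String):

import java.nio.charset.StandardCharsets;

class SsidBytes {
    public static void main(String[] args) {
        byte[] raw = {(byte) 0xC3, (byte) 0xA9};                // "é" as UTF-8 bytes
        String ssid = new String(raw, StandardCharsets.UTF_8);  // decoded once, now Unicode
        byte[] again = ssid.getBytes(StandardCharsets.UTF_8);   // re-encode to any charset later
        System.out.println(ssid);                               // é
    }
}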
I can't speak to what Android does under the covers with respect to giving you that String, but the link you provided provides some ideas. If Android does or does not support different encoding types for the SSID, there's nothing you can do about it.
