Android PocketSphinx & FSG model

Context
I am currently building an SDK/service through which applications can access voice-based commands.
At the moment I'm using Android PocketSphinx to detect a keyword (which is "wake") and then analyse the whole sentence with Google voice recognition.
But I want it all to work offline, so I'm replacing Google voice recognition with PocketSphinx for everything...
My Problem
The user defines the word they want to detect; previously I just compared that word with what Google's speech-to-text returned.
So now I want to update the grammar that PocketSphinx uses with just the word given by the user, which is problematic because (according to the Javadoc of Android PocketSphinx) it can only take grammar files!
Question
Is there any way I can update the Android PocketSphinx grammar on the fly?
Edit
I forgot to mention this method:
public void addFsgSearch(String searchName, FsgModel fsgModel) (in the PocketSphinx GitHub sources)
which doesn't seem to take a grammar file like the other grammar setter methods, but rather a class/struct? The problem is that it isn't documented...

If you need to detect just one word, consider using addKeywordSearch.
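For a single user-chosen word, the closely related addKeyphraseSearch is the simplest variant, since it takes the phrase directly rather than a file. A minimal sketch based on the pocketsphinx-android demo; the model directory names ("en-us-ptm", "cmudict-en-us.dict") and the threshold value are assumptions taken from the demo setup and may differ in your project:

```java
import android.content.Context;
import edu.cmu.pocketsphinx.Assets;
import edu.cmu.pocketsphinx.RecognitionListener;
import edu.cmu.pocketsphinx.SpeechRecognizer;
import edu.cmu.pocketsphinx.SpeechRecognizerSetup;
import java.io.File;
import java.io.IOException;

public class WakeWordHelper {
    private static final String KWS_SEARCH = "wakeup";

    /** Builds a recognizer that listens for a single word/phrase chosen by the user. */
    public static SpeechRecognizer startWakeWordSearch(Context context,
                                                       String userWord,
                                                       RecognitionListener listener) throws IOException {
        Assets assets = new Assets(context);
        File assetsDir = assets.syncAssets();

        SpeechRecognizer recognizer = SpeechRecognizerSetup.defaultSetup()
                .setAcousticModel(new File(assetsDir, "en-us-ptm"))
                .setDictionary(new File(assetsDir, "cmudict-en-us.dict"))
                .setKeywordThreshold(1e-20f)  // tune per word to balance missed detections vs false alarms
                .getRecognizer();
        recognizer.addListener(listener);

        // The search is defined at runtime from whatever word the user picked; no grammar file needed.
        recognizer.addKeyphraseSearch(KWS_SEARCH, userWord);
        recognizer.startListening(KWS_SEARCH);
        return recognizer;
    }
}
```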

I had the same issue, and more. Perhaps these undocumented discoveries can help you.
Using the overloaded method "addGrammarSearch(String name, String fsgString)" allows you to put your entire FSG or JSGF grammar definition in a string, rather than sourcing it from a file if you wish (only a small file open/read time advantage).
"addKeyphraseSearch(String name, String keyphrase)" // only accommodates ONE WORD or PHRASE, no threshold, no grammar.
"addKeywordSearch(String name, File keywordList)" // accommodates MULTIPLE key WORDS or PHRASES, adding thresholds for each.
Several caveats include:
The grammar searches use JSGF format, parsing the defined syntax correctly. However:
1.1 Tags are not implemented
1.2 It is unclear whether weights (though they use the same /…/ syntax as in keyword lists) actually apply recognizer thresholds (they have different meanings in PocketSphinx versus Sun Microsystems' JSGF).
1.3 Rule names are not implemented either.
1.4 In other words, you provide a grammar in JSGF, and your Hypothesis as well as FinalResult strings still give you the recognized lowest-level phrase detected in the grammar -- NOT the grammar tags, nor even rule metasymbols.
1.4.1 IMHO, that makes grammars pointless, and actually less efficient and less flexible than keyword list files (which are likewise words or phrases), because keyword lists let you provide a threshold for recognizer scrutiny per phrase. Further, if the RULE and TAG names are not returned, then there is zero information regarding the structure of the grammar that was recognized. So, as syntactically complex and flexible as it is, I do not see the advantage of bothering with a grammar definition at all in PocketSphinx; the best multiple-keyphrase approach is simply to expand your grammar into a keyword list file (see the sketch after this answer). Please correct me if I am mistaken.
Search methods, whether they contain the word "phrase" or the word "word", actually accommodate both phrases and single words.
I have assumptions re: the undocumented fsgModel class, but we're not allowed to give assumptions.
Though this may help clarify some aspects, the above fails to add any functionality to the package. Lastly, the C source code has methods getRuleName() and getTagName(), but discussions on this topic between users and developers seem to stonewall: there is no motivation to add tag or rule-name associations to recognized words or phrases in a defined grammar, apparently because the developers believe grammars are old-school and nobody uses them anymore.
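To make the keyword-list route concrete, here is a minimal sketch; the recognizer is assumed to be set up as usual for pocketsphinx-android, the file name commands.list and the thresholds are only placeholders, and the optional /threshold/ per line follows the keyword-list syntax described above:

```java
// commands.list -- one word or phrase per line, optional /threshold/ per entry, e.g.:
//   wake /1e-20/
//   open browser /1e-30/
//   turn off the lights /1e-40/

import edu.cmu.pocketsphinx.SpeechRecognizer;
import java.io.File;

public class CommandSearch {
    /** Registers a keyword-list search on an already configured recognizer and starts listening. */
    static void listenForCommands(SpeechRecognizer recognizer, File syncedAssetsDir) {
        File keywordList = new File(syncedAssetsDir, "commands.list");
        recognizer.addKeywordSearch("commands", keywordList);
        recognizer.startListening("commands");
        // The matched phrase comes back as the hypothesis string in onPartialResult()/onResult().
    }
}
```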

Related

Is there anywhere I can upload two strings.xml files to be translated?

My app is going to be translated by several amateur translators into several languages. I can send them the XML file with all the strings that need to be translated. But is there a cleaner way to upload two files, the one in English and the one to be translated, so that it's easy to identify the strings that are still missing? Basically it would be like having the Translation Editor of Android Studio, but online.
Maybe using Google Docs? How do you do this?
You can use Google Docs, but that's quite an outdated way to handle this.
The major cons:
it would be cumbersome to update strings this way
no easy way to make sure the newly added strings get translated (as opposed to the old ones), etc.
no good way to provide context, if needed (typically translators have questions). You can create a column with context and take any discussions into comments, but it can get messy
A few pros:
it's fast to create (although slow to keep up-to-date)
you cooperate online and have shared access
Most developers use localization platforms, which makes updating content and online cooperation much faster.
Main pros:
it's easy to identify strings that are missing
any number of translators can translate simultaneously
you can track the work done by each translator
you can add a review/proofreading step to the process to ensure the quality of translations
you can leverage machine translation and then just have translators review it (saves lots of time)
you can update content continuously, as most platforms support an agile workflow
you can see who the top translator is (give out some rewards, invite them to other projects, etc.)
integrations (with your Git tool, Android Studio, etc.), so you can automate content updates with no manual copy-pasting
Cons:
some of them are paid (still, if you're open source, you can expect a free plan)
Regarding the tools, I can suggest looking at Crowdin or Poedit.
There are many alternatives you can research, some are listed on Wikipedia.
At my work we had to translate English into Norwegian. We did that with a Python script that generated a UI from a CSV file; afterwards the file could be exported in several formats as well. But your question indicates that you want to deploy only on Android, so this might be overkill.
A simple Python XML filter would fit your approach, and you could also work with Git as long as the lines stay in the same order.
If you need a quick example please comment, and I'll edit this answer as soon as I get time.
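A rough sketch of the same idea, written here in Java rather than the promised Python to match the Android context of this page; the class name and file paths are only illustrative. It lists the keys present in the English strings.xml that are missing from a translated copy:

```java
import java.io.File;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class MissingStrings {
    /** Collects all "name" attributes of <string> elements from a strings.xml file. */
    static Set<String> stringNames(File xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(xml);
        NodeList nodes = doc.getElementsByTagName("string");
        Set<String> names = new HashSet<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            names.add(((Element) nodes.item(i)).getAttribute("name"));
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        // args[0] = values/strings.xml (English), args[1] = values-xx/strings.xml (translation)
        Set<String> missing = stringNames(new File(args[0]));
        missing.removeAll(stringNames(new File(args[1])));
        missing.forEach(name -> System.out.println("Missing translation: " + name));
    }
}
```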
At one point I also had the same question. I needed translations for my vernacular app, and I also had the requirement to maintain them in a way that made it easy to compare the translations. Here are a few things that worked out for me.
First, take the strings XML file and convert it into an Excel sheet. You may generate multiple Excel sheets, then copy, paste and merge all the translations into a single sheet.
Going forward it will be easy to maintain all the translations: just share a single sheet that has the string key and one column per language, so you can easily look over all the language translations.
In the long run, it will be helpful to you.
A few links for converting XML to Excel:
Convert string XML to Excel sheet
The online tool below works for me. It's free, open source, and easy to use.
https://asrt.gluege.boerde.de/

How to limit the use of certain character sets

I hope this question isn't going to be down-flagged for not showing any actual code, but that's the core of this situation. I simply have no clue where to start solving this issue, even after trying several combinations of keywords both on Google and here on SO.
My client suddenly decided that half of the Android app I'm developing for him has to be in Chinese. After making some changes in the database so that certain fields can store Simplified Chinese character sets, I need to make sure that my client (living in Holland) only uses those characters in the corresponding EditText field in the app. (There are more database fields that now only allow Simplified Chinese, but those values come from a dropdown list in the app, so I don't need to worry about wrong characters for them.)
So how would one make sure that only Simplified Chinese is used in an EditText field?
Here is a project in Ruby that attempts to detect whether characters are Traditional Chinese, Simplified Chinese, or Japanese (maybe others?): https://github.com/jpatokal/script_detector
This detection is based on the Unihan Database, in which there is a file called Unihan_Variants.txt. (Download zip file containing this text file here.)
Conceivably, you could parse the txt file into a lookup table and check the unicode value as the text is entered during onTextChanged() for your EditText. However, the readme on the project linked above states: "It is important to understand that this requires long sections of text to work reliably, since a single character or even several characters may be valid Japanese, traditional Chinese and simplified Chinese simultaneously." So, weeding out characters on an individual basis might prove difficult.
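As a first, coarse gate before any Unihan-based simplified-vs-traditional check, an InputFilter can at least reject anything outside the CJK Unified Ideographs block. A minimal sketch; the class name is just illustrative, and note that, as the readme above warns, this cannot tell Simplified from Traditional on its own:

```java
import android.text.InputFilter;
import android.text.Spanned;

/** Drops any character that is not a CJK Unified Ideograph (extend the check for punctuation, extensions, etc.). */
public class HanOnlyFilter implements InputFilter {
    @Override
    public CharSequence filter(CharSequence source, int start, int end,
                               Spanned dest, int dstart, int dend) {
        StringBuilder kept = new StringBuilder();
        for (int i = start; i < end; ) {
            int cp = Character.codePointAt(source, i);
            if (Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) {
                kept.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        // Returning null keeps the input unchanged; otherwise substitute the filtered text.
        return kept.length() == end - start ? null : kept;
    }
}

// Usage: editText.setFilters(new InputFilter[] { new HanOnlyFilter() });
```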

How to speed up searching alphabetized word list for leading wildcard matches

I'm a word puzzle junkie in my spare time, so I've spent a LOT of other spare time working on a helper program that allows wildcards in search patterns. It works great. On my Dell laptop (i5, 8 GB RAM), searching a 140,000-word "dictionary" for wildcard matches has an almost imperceptible and definitely acceptable delay, which occurs only if tens of thousands of words are returned. Java rules. So does its implementation of regex and matches().
I was hoping to port it to Android. I worked all day getting a more-or-less equivalent app to compile; no chance with the given code architecture.
The problem is that leading wildcard characters can (must) be allowed. E.g., ???ENE returns 15 matches--from achENE to xylENE--and *RAT returns 22 matches--from aristocRAT through zikuRAT--i.e., all 140,000 words must (?) be searched, which is going to take aaaaaaaaawhiiiiiiiiile on most (all?) Android devices. (Each took less than a second on my laptop.) (It takes my PC 3 seconds to return all 140,000 words and a little longer to eyeball them all.)
Since some word puzzles allow variable numbers of letters in words, disallowing leading wildcards cuts the heart out of the app for such puzzles. But if the search pattern had to start with a letter it would be easy enough to then do a binary search (or something quicker). (And it still might be unacceptably slow.)
Anyway, I was wondering if anybody might know some algorithm or can think of some approach that might be applied to speed up searches with leading wildcard characters.
I believe that the optimized version of what you are trying to do is widely known as the Unix/Linux utility "grep", which, if I remember correctly, uses the Boyer-Moore search algorithm.
Under the covers, Java's Pattern class uses Boyer-Moore. And it supports regex, so if you can write something to turn your wildcard search patterns into regular expressions, you can use Pattern.
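A sketch of that conversion, assuming ? means exactly one letter and * means any run of letters (which matches the patterns in the question); the class and method names are only illustrative:

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class WildcardRegex {
    /** Turns a puzzle pattern like "???ENE" or "*RAT" into an anchored, case-insensitive regex. */
    static Pattern toPattern(String wildcard) {
        StringBuilder rx = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?')      rx.append('.');
            else if (c == '*') rx.append(".*");
            else               rx.append(Pattern.quote(String.valueOf(c)));
        }
        return Pattern.compile(rx.toString(), Pattern.CASE_INSENSITIVE);
    }

    static List<String> matches(List<String> words, String wildcard) {
        Pattern p = toPattern(wildcard);
        return words.stream()
                .filter(w -> p.matcher(w).matches())  // matches() anchors over the whole word
                .collect(Collectors.toList());
    }
}
```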
There's an interesting Java implementation of grep at http://www.java2s.com/Code/Java/Regular-Expressions/AnotherGrep.htm
It uses memory-mapped files. I'm guessing that you won't be able to fit your entire word list into memory, but you could split it up into a bunch of smaller files - the implementation above memory-maps one file at a time. You'd have to do some testing to find the optimal size of a file.
I just Googled and found that having a second, reverse-alphabetized list might be a way to turn a leading wildcard into a trailing one, opening the door to a binary search on the start of the pattern. Interesting. But *a???ene* is also a legal search pattern in the program. What then? (Yeah. How often would you need such a search.)
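A sketch of that reverse-list idea, assuming the word list is lowercase; patterns with wildcards at both ends (like *a???ene*) have an empty literal prefix and simply degrade to a full scan. The class name is only illustrative:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

public class TwoListSearch {
    private final List<String> forward = new ArrayList<>();   // words sorted a..z
    private final List<String> backward = new ArrayList<>();  // reversed words, sorted a..z

    public TwoListSearch(List<String> words) {
        for (String w : words) {
            forward.add(w);
            backward.add(new StringBuilder(w).reverse().toString());
        }
        Collections.sort(forward);
        Collections.sort(backward);
    }

    /** Letters before the first wildcard; used to narrow the scan via binary search. */
    private static String literalPrefix(String pattern) {
        int i = 0;
        while (i < pattern.length() && pattern.charAt(i) != '?' && pattern.charAt(i) != '*') i++;
        return pattern.substring(0, i);
    }

    /** Patterns contain only letters and the wildcards ? and *. */
    public List<String> search(String wildcard) {
        String fwdPattern = wildcard.toLowerCase();
        String revPattern = new StringBuilder(fwdPattern).reverse().toString();
        // Scan whichever orientation gives the longer literal run to narrow on.
        boolean useReversed = literalPrefix(revPattern).length() > literalPrefix(fwdPattern).length();
        List<String> list = useReversed ? backward : forward;
        String pattern    = useReversed ? revPattern : fwdPattern;
        String prefix     = literalPrefix(pattern);
        Pattern rx = Pattern.compile(pattern.replace("?", ".").replace("*", ".*"));

        int from = Collections.binarySearch(list, prefix);
        if (from < 0) from = -from - 1;                        // insertion point
        List<String> hits = new ArrayList<>();
        for (int i = from; i < list.size() && list.get(i).startsWith(prefix); i++) {
            String w = list.get(i);
            if (rx.matcher(w).matches()) {
                hits.add(useReversed ? new StringBuilder(w).reverse().toString() : w);
            }
        }
        return hits;
    }
}
```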
I just found this about Apache Lucene:
Leading wildcards (e.g. *ook) are not supported by the QueryParser by default. As of Lucene 2.1, they can be enabled by calling QueryParser.setAllowLeadingWildcard( true ). Note that this can be an expensive operation: it requires scanning the list of tokens in the index in its entirety to look for those that match the pattern.

Tesseract - OCR issues with typewriter style fonts

We are using Tesseract.NET (and the Android version too) to recognize and extract document data. It worked really well with Arial and Cambria fonts, but now we have to recognize documents like this:
Tesseract cannot recognize it. Absolutely nothing (except the big sized serial number on the right upper corner).
We tried to train it, but - maybe it's our fault - it's still unstable.
What can we do?
(Btw the font is used by national offices; we cannot get it as TrueType or any other font format.)
In the current form it is very hard for an OCR tool to recognize any letters.
Serif fonts are hard to OCR.
Letters are very close together. Some are joined.
A dictionary is not of any help.
You might be able to improve the result with the following:
As this looks like a vehicle registration certificate, you should be able to predict the positions of the text strings of interest and then OCR them separately.
When doing so, use the -psm 7 or 8 option (assume a single line or a single word); see the sketch after this answer.
As some strings seem to be numbers only, you can help Tesseract by using the digits argument.
For the alphanumeric strings it might help to reduce the dictionary pruning (or to remove the DAWG files completely).
If strings like 'ETZ' or 'MZ' are abbreviations, you could also build a dictionary with those.
Reducing the yellow and green color is also an (easy) option you could test.
Use the barcode instead of trying to OCR the string.
For tesseract questions it always helps if you specify the version used and, if you do image preprocessing, provide a sample image of the processed input.
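The question uses Tesseract.NET, but just to illustrate a couple of the settings above, here is roughly how the page segmentation mode and a digits-only whitelist could be set with the tess4j Java wrapper (the .NET binding exposes equivalent options); paths and the class name are only illustrative:

```java
import java.awt.image.BufferedImage;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class RegistrationFieldOcr {
    /** OCR a pre-cropped field that is known to contain only digits on a single line. */
    static String readDigitsField(BufferedImage croppedField, String tessdataPath) throws TesseractException {
        Tesseract tesseract = new Tesseract();
        tesseract.setDatapath(tessdataPath);   // folder containing the traineddata files
        tesseract.setLanguage("eng");
        tesseract.setPageSegMode(7);           // 7 = treat the image as a single text line
        tesseract.setTessVariable("tessedit_char_whitelist", "0123456789");
        return tesseract.doOCR(croppedField).trim();
    }
}
```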

How To Detect Is Text Human Readable?

I am wondering if there's a way to tell whether a given text is human-readable. By human-readable, I mean: it has some meaning and is formatted like an article written by somebody, or at least generated by a software translator, with the intent that a human reads it.
Here's the background story: recently I made an app that allows users to upload a short text to a database. At an early stage of deployment I noticed that some users always uploaded corrupted text due to a problem with encoding. That problem has since been fixed, but it left me wondering whether there's a way to pick out non-human-readable text before serving the text back to users.
Any advice will be appreciated. The scope might be too large to include other languages, so at the moment let's limit the discussion to English only.
You can try a language identification tool, or something similar.
Basically you have to count the characters, or groups of characters (character n-grams), and compare the distribution of letters in the submitted text with the distribution of letters in a collection of texts written in good English. (Make sure that such a collection of texts is representative of the expected input.)
In the spirit of the n-gram approach, you might also want to try a dictionary-based approach and check for the presence of 'stop words' (e.g. 'the', 'a', 'an', 'of') in the input text.
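A crude sketch of that stop-word check; the word list and the 6% threshold are arbitrary assumptions, and a real check would use a larger list with a threshold tuned on known-good text:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class StopWordCheck {
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "the", "a", "an", "of", "and", "to", "in", "is", "it", "that", "for", "on"));

    /** Returns true if a reasonable share of the tokens are common English stop words. */
    static boolean looksLikeEnglish(String text) {
        String[] tokens = text.toLowerCase().split("[^a-z']+");
        int total = 0, hits = 0;
        for (String t : tokens) {
            if (t.isEmpty()) continue;
            total++;
            if (STOP_WORDS.contains(t)) hits++;
        }
        return total > 0 && (double) hits / total >= 0.06;
    }
}
```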
Most NLP libraries will do the job (spaCy is a very common one). You can also go for language detection: langdetect (https://pypi.org/project/langdetect/) will support you on this, as will many others. If you need to be less language-specific (more math than language), you should look into phonotactics (with BLICK for Python: https://github.com/mmcauliffe/python-BLICK), which looks at how character order is constructed within a string.
Do a hexdump and make sure each character is less than or equal to 0x7f.
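Programmatically, the same check is a one-liner; note it only catches bytes outside 7-bit ASCII (typical of encoding corruption), not gibberish made of ordinary letters:

```java
/** True if every character is plain 7-bit ASCII (0x00-0x7F). */
static boolean isPlainAscii(String text) {
    return text.chars().allMatch(c -> c <= 0x7F);
}
```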
