Identify type of regex pattern - android

In my application I add EditText fields based on a response from the server. For each EditText the server also provides a regex pattern to match. I am able to match the pattern and run validations successfully, but I want to identify the type of the regex pattern so that I can open a keyboard suited to the value the EditText should accept.
For example, if the EditText should accept an email address, a keyboard with the @ sign should open, and if it accepts numeric values, the numeric keypad should open.
Is there any library that can return a type such as "Email" or "Number" from a regular expression, given that there can be several different types of regex pattern?
EDIT: I know how to set the input type for an EditText, but I need to find out the type from the regex pattern. I cannot make changes on the server; I have to handle this on the client side.

There most probably isn't one. The reason is that there is no way to tell for sure. Anything you can come up with will be a heuristic.
Heuristic one:
If the pattern looks for something containing a dot, followed by an @ sign, followed by something containing a dot - it's an email validation.
If the pattern contains only \d, or number ranges ([1-5]), or single numbers (7) plus repetition meta characters (?, *, +, {4, 12}), it's a number validation.
If the pattern contains \w and no @ sign, it's regular text.
Continue in the same spirit.
+ high control. You can always add new guesses when you see that your results aren't accurate in some case
- requires more code to implement
- requires very good knowledge of regexes
Heuristic two:
Use a list of strings whose type you know and try to match them against the regex. For example, for emails try example@gmail.com.
+ easy to implement. Small chance of problematic logic
- least amount of control. If the server gives you email patterns for specific domains, you can't guess that this is an email pattern unless you know all possible domains
Heuristic three:
Use a library that can generate example strings from regex and match them with your own regexes to determine the type. Here is one for Java and another one for JavaScript.
+ gives a good combination of high control and easy implementation
- you still have to write your own regexes (not as trivial as the 2nd heuristic)
- people sometimes write regexes that allow some false positives. Therefore, generated strings might not be in the perfect format (not as much control as the 1st heuristic)
Are the regexes static?
If yes - you should make a mapping and use that.
If no - use a heuristic like one of the above and improve it over time as you gain more statistics about how the generated regexes usually look.
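Heuristic two can be sketched in plain Java as below. This is only a minimal illustration: the class name, the FieldType categories and the probe strings are all made up for the example, and a real app would map the result to an EditText input type and tune the probe list against the patterns the server actually sends.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class RegexTypeGuesser {
    /** Hypothetical categories; the app would map these to keyboard types. */
    enum FieldType { EMAIL, NUMBER, TEXT, UNKNOWN }

    // Probe strings whose type is known in advance (assumed sample values).
    private static final Map<String, FieldType> PROBES = new HashMap<>();
    static {
        PROBES.put("example@gmail.com", FieldType.EMAIL);
        PROBES.put("john.doe@example.org", FieldType.EMAIL);
        PROBES.put("12345", FieldType.NUMBER);
        PROBES.put("7", FieldType.NUMBER);
        PROBES.put("hello", FieldType.TEXT);
    }

    static FieldType guess(String regex) {
        Pattern pattern;
        try {
            pattern = Pattern.compile(regex);
        } catch (PatternSyntaxException e) {
            return FieldType.UNKNOWN; // not a valid regex at all
        }
        // Collect the types of every probe the pattern fully matches.
        EnumSet<FieldType> matched = EnumSet.noneOf(FieldType.class);
        for (Map.Entry<String, FieldType> probe : PROBES.entrySet()) {
            if (pattern.matcher(probe.getKey()).matches()) {
                matched.add(probe.getValue());
            }
        }
        if (matched.isEmpty()) {
            return FieldType.UNKNOWN;
        }
        if (matched.size() == 1) {
            return matched.iterator().next();
        }
        // A pattern that accepts several kinds of probe is treated as free text.
        return FieldType.TEXT;
    }
}
```

A pattern such as \d{10} matches none of these probes and comes back as UNKNOWN, which is exactly the kind of gap the "improve it over time" advice is about.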

Related

How to limit the use of certain character sets

I hope this question isn't going to be down-flagged for not showing actual code, but that's the core of this situation: I simply have no clue where to start to solve this issue, even after trying several combinations of keywords on both Google and here on SO.
My client suddenly decided that half of the Android app I'm developing for him has to be in Chinese. I have made some changes in the database so that some fields can take Simplified Chinese character sets, and now I need to make sure that my client (living in Holland) only uses those characters in that particular EditText field in the app. (There are more database fields that now only allow Simplified Chinese, but those values come from a dropdown list in the app, so I don't need to worry about wrong characters for them.)
So how would one make sure that only Simplified Chinese is used in an EditText field?

Here is a project in Ruby that attempts to detect whether characters are Traditional Chinese, Simplified Chinese, or Japanese (maybe others?): https://github.com/jpatokal/script_detector
This detection is based on the Unihan Database, in which there is a file called Unihan_Variants.txt. (Download zip file containing this text file here.)
Conceivably, you could parse the txt file into a lookup table and check the unicode value as the text is entered during onTextChanged() for your EditText. However, the readme on the project linked above states: "It is important to understand that this requires long sections of text to work reliably, since a single character or even several characters may be valid Japanese, traditional Chinese and simplified Chinese simultaneously." So, weeding out characters on an individual basis might prove difficult.
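The lookup-table idea could look roughly like the sketch below. It is an assumption-laden illustration: the class name is hypothetical, and the two hard-coded code points merely stand in for a table that would really be built by parsing Unihan_Variants.txt. It first checks that each character belongs to the Han script and then rejects characters known to have a distinct simplified form. As the readme quoted above warns, many characters are shared between scripts, so this can only filter out obvious mismatches.

```java
import java.util.HashSet;
import java.util.Set;

final class SimplifiedChineseFilter {
    // Illustrative stand-in for a table parsed from Unihan_Variants.txt:
    // code points of traditional forms that have a different simplified form.
    private static final Set<Integer> TRADITIONAL_ONLY = new HashSet<>();
    static {
        TRADITIONAL_ONLY.add(0x570B); // traditional "country"; simplified is U+56FD
        TRADITIONAL_ONLY.add(0x9AD4); // traditional "body"; simplified is U+4F53
    }

    /** True if every code point is a Han character not listed as traditional-only. */
    static boolean isAcceptable(CharSequence text) {
        for (int i = 0; i < text.length(); ) {
            int cp = Character.codePointAt(text, i);
            boolean isHan = Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN;
            if (!isHan || TRADITIONAL_ONLY.contains(cp)) {
                return false;
            }
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return true;
    }
}
```

In an app, a check like this would be called from the text-changed callback of the EditText; that wiring is omitted here because it depends on the Android framework.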

Android pocketsphinx & Fsg model

Context
I am currently building an SDK/service through which applications can access voice-based commands.
For the moment I'm using Android pocketsphinx to detect a keyword (which is "wake") and then analysing the whole sentence with Google voice recognition.
But my problem is that I want to make it all offline! So I'm on my way to replacing Google voice recognition with full use of pocketsphinx.
My Problem
The user defines the word he wants to detect; previously I just compared that word with what Google speech-to-text returned.
So now I want to update the grammar that pocketsphinx uses with just the word given by the user, which is problematic because (according to the Javadoc of Android pocketsphinx) it can only take grammar files!
Question
Is there any way I can update the Android pocketsphinx grammar on the fly?
Edit
I forgot to talk about this method:
public void addFsgSearch(String searchName, FsgModel fsgModel) (in github pocketsphinx)
which doesn't seem to take a grammar file like the other grammar setter methods, but rather a class/struct? The problem is that it isn't documented.

If you need to detect just one word, consider using addKeywordSearch.

I had the same issue, and more. Perhaps these undocumented discoveries can help you.
Using the overloaded method "addGrammarSearch(String name, String fsgString)" allows you to put your entire FSG or JSGF grammar definition in a string, rather than sourcing it from a file if you wish (only a small file open/read time advantage).
"addKeyphraseSearch(String name, String keyphrase)" // only accommodates ONE WORD or PHRASE, no threshold, no grammar.
"addKeywordSearch(String name, File keywordList)" // accommodates MULTIPLE key WORDS or PHRASES, adding thresholds for each.
Several caveats include:
The grammar searches use JSGF format, parsing the defined syntax correctly. However:
1.1 Tags are not implemented
1.2 It is unclear whether weights (which use the same // syntax as in keyword lists) actually apply recognizer thresholds (they have different meanings in PocketSphinx versus Sun Microsystems).
1.3 Rule names are not implemented either.
1.4 In other words, you provide a grammar in JSGF, and your Hypothesis as well as FinalResult strings still give you the recognized lowest-level phrase detected in the grammar -- NOT the grammar tags, nor even rule metasymbols.
1.4.1 IMHO, that makes grammars pointless, and actually less efficient and less flexible than keyword list files (which are actually words or phrases) due to the option to provide a threshold for recognizer scrutiny, per phrase. Further, if the RULE & TAG names are not returned, then there is zero information regarding the structure of the grammar that was recognized. So as syntactically complex and flexible as it is, I do not see the advantage of bothering with a grammar definition at all in PocketSphinx; the best multiple keyphrase approach is simply to expand your grammar into a keyword list file. Please correct me if I am mistaken.
Search methods, whether containing the word "phrase" or the word "word", actually accommodate both phrases and single words.
I have assumptions re: the undocumented fsgModel class, but we're not allowed to give assumptions.
Though this may help clarify some aspects, the above fails to add any functionality to the package. Lastly, the C source code has methods getRuleName() and getTagName(). But discussions on this topic between users and developers seem to stonewall -- there is no motivation to add tags or rule name associations to recognized words or phrases in a defined grammar, apparently because the developers believe grammars are old-school and nobody uses them anymore.
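To make the "grammar in a string" idea concrete, here is a small sketch that builds a one-rule JSGF grammar at runtime from words supplied by the user. The class name, method name and the rule name <command> are made up for the example; the resulting string is the kind of value that would be handed to the addGrammarSearch(String name, String fsgString) overload described above, which is not called here.

```java
final class GrammarBuilder {
    /** Builds a one-rule JSGF grammar that accepts any one of the given phrases. */
    static String jsgfFor(String grammarName, String... phrases) {
        if (phrases.length == 0) {
            throw new IllegalArgumentException("at least one phrase is required");
        }
        StringBuilder sb = new StringBuilder();
        sb.append("#JSGF V1.0;\n");                        // JSGF header line
        sb.append("grammar ").append(grammarName).append(";\n");
        sb.append("public <command> = ");                  // hypothetical rule name
        for (int i = 0; i < phrases.length; i++) {
            if (i > 0) {
                sb.append(" | ");                          // alternatives
            }
            sb.append(phrases[i].trim());
        }
        sb.append(";\n");
        return sb.toString();
    }
}
```

For a user-chosen word "wake", jsgfFor("wake", "wake") yields a three-line grammar whose public rule is just that word, so no grammar file has to be written to disk.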

How To Detect Is Text Human Readable?

I am wondering if there's a way to tell whether a given text is human readable. By human readable I mean that it has some meaning and is formatted like an article written by somebody, or at least generated by a software translator with the intention of being read by a human.
Here's the background story: recently I have been making an app that allows users to upload a short text to a database. At the early stage of deployment I noticed that some users always uploaded corrupted text due to a problem with encoding. This problem was fixed later, but it left me wondering whether there's a way to pick out non-human-readable text before serving the text back to users.
Any advice will be appreciated. The scope might be too large if other languages are included, so for the moment let's limit the discussion to English only.

You can try a language identification tool, or something similar.
Basically you have to count the characters, or groups of characters (character n-grams), and compare the distribution of the letters in the submitted text with the distribution of the letters in a collection of texts written in good English. (Make sure that this collection of texts is representative of the expected input.)
As a continuation of the n-gram approach, you might want to try a dictionary-based approach and check for the presence of 'stop words' (e.g. 'the', 'a', 'an', 'of') in the input text.
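The stop-word check can be sketched as follows. The word list is a tiny sample and the 0.15 threshold is an arbitrary assumption; both would need tuning on real input, and the class and method names are invented for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class ReadabilityCheck {
    // Small sample of English stop words (assumed list, extend as needed).
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "the", "a", "an", "of", "and", "to", "in", "is", "it", "that"));

    /** Fraction of whitespace-separated tokens that are common stop words. */
    static double stopWordRatio(String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        if (tokens.length == 0 || tokens[0].isEmpty()) {
            return 0.0; // empty or whitespace-only input
        }
        int hits = 0;
        for (String token : tokens) {
            // Strip leading/trailing punctuation such as commas and quotes.
            String word = token.replaceAll("^\\W+|\\W+$", "");
            if (STOP_WORDS.contains(word)) {
                hits++;
            }
        }
        return (double) hits / tokens.length;
    }

    static boolean looksLikeEnglish(String text) {
        return stopWordRatio(text) >= 0.15; // assumed threshold
    }
}
```

Very short texts are the weak spot of this approach: a two-word upload may legitimately contain no stop words at all.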

Most NLP libraries will do the job (spaCy is a very common one). You can also go for language detection: langdetect (https://pypi.org/project/langdetect/) will support you on this, as will many others. If you need to be less specific (more math than language), you should look into phonotactics (with BLICK for Python: https://github.com/mmcauliffe/python-BLICK), which looks at how characters are ordered in a string.

Do a hexdump and make sure each character is less than or equal to 0x7f.
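In code, the same check is a loop over the characters rather than a hexdump. A minimal sketch (the class name is hypothetical):

```java
final class AsciiCheck {
    /** True if every char is in the 7-bit ASCII range (0x00 to 0x7f). */
    static boolean isAscii(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) > 0x7f) {
                return false;
            }
        }
        return true;
    }
}
```

Note that this catches typical encoding damage but also rejects legitimate text containing accented letters or typographic quotes, and it lets ASCII control characters through.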

Handling phone numbers properly (storing, ideally using a unique form)

This question is not specific to Android but I have included the tag.
I need to be able to store phone numbers in some sort of standard form (ideally a string) so that equality can be tested quickly.
I found some answers already; the best ones pointed to http://developer.android.com/reference/android/telephony/PhoneNumberUtils.html (I'm fine with using a library to do it for me).
BUT this isn't really good enough. I've tried a variety of number formats and learnt about the Editable factory in order to use some of the static methods in that class, but they don't seem to return the form I was expecting.
I was expecting something like a phone-number hash: two inputs representing the same number would yield the same "standard form", and one could dial this standard form and be fine. I thought that all the various +s and whatnot would be shorthands for this standard form.
I'm not sure if such a thing exists now.
I understand that some things mean "current area" (or country), which is why land lines can omit area codes. I expected a function that would return the format for the current location (this doesn't apply to mobiles, but for a land line it would prepend the area code, for example, which would be closer to the "standard form" I keep assuming exists).
I am pretty sure that some full form for phone numbers exists. Thinking about how the telephone system works (which, I admit, I infer), there ought to be a form that identifies a number uniquely across the whole planet, and where this is not the case (such as local calls from land lines without area codes) it is an optimisation.
So I have two questions:
How can I "expand" a phone number to a unique string for that number, such that any alternate forms of writing that number (with spaces, a 0 or +44....) "expand" to this unique string?
Are there any ISO(/IEC?) (what's the O stand for?) standard documents with drafts open to the public? I've read the Wikipedia page (ages ago; I've spent so many hours wiki-browsing and opened hundreds of tabs), but it covers history and some information on formatting. I'd like to know more about the thing I've taken for granted for some 8 years or so.
Additionally, why is Windows Phone 8 a tag? To make the 12 proud Lumia owners not feel left out? (It was suggested as a tag!)
Addendum
Unfortunately, there are no solutions in "Any API in android to normalize phone number" (this includes libphonenumber), and my quest to find out has led to some interesting reads:
http://en.wikipedia.org/wiki/Panel_switch
http://en.wikipedia.org/wiki/Nonblocking_minimal_spanning_switch
http://en.wikipedia.org/wiki/Telephone_exchange
and I still cannot conclude there isn't some "full form" for numbers.
I dare not create a solution that simply swaps +44 for a 0 and such.

After reading your question, I was reminded of Google's library called libphonenumber. It's Google's common library for parsing, formatting, storing and validating international phone numbers. It does the following things (some of which you might be able to use):
Parsing/formatting/validating phone numbers for all countries/regions of the world.
getNumberType - gets the type of the number based on the number itself; able to distinguish Fixed-line, Mobile, Toll-free, Premium Rate, Shared Cost, VoIP and Personal Numbers (whenever feasible).
isNumberMatch - gets a confidence level on whether two numbers could be the same.
isPossibleNumber - quickly guesses whether a number is a possible phone number by using only the length information; much faster than a full validation.
isValidNumber - full validation of a phone number for a region using length and prefix information.
AsYouTypeFormatter - formats phone numbers on-the-fly when users enter each digit.
PhoneNumberOfflineGeocoder - provides geographical information related to a phone number.
As far as the international format of phone numbers is concerned, E.164 is the format recommended by the International Telecommunication Union. It defines a numbering plan for the worldwide public switched telephone network and is a general format for international telephone numbers (usually starts with + followed by the country code, area code and the number).
Using the above library, the validity of any phone number can be checked if you supply the international code along with the phone number (for example, 1 for the US and Canada). If you don't have the code but you know the country the number belongs to, you can still validate it. You can also convert all valid numbers to one standard E.164 format using this library. You can also 'expand' a number into the local national format of that particular country. You can save it as a String as well. Note, though, that it does use the PhoneNumberUtils that you mentioned in your question.
I am not sure if this is what you are looking for but I hope this information helps you.
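If E.164 strings are used as the stored "standard form", a cheap sanity check on their shape can be written in plain Java. The sketch below is not a replacement for libphonenumber: it only strips visual separators and tests for a + followed by at most 15 digits (the E.164 length limit) with a non-zero leading digit. Turning a national number such as one starting with 0 into E.164 needs per-country rules, which is what the library provides. The class and method names are made up for the example.

```java
import java.util.regex.Pattern;

final class E164 {
    // "+" followed by 2 to 15 digits, the first of which is not 0.
    private static final Pattern SHAPE = Pattern.compile("\\+[1-9]\\d{1,14}");

    /** Removes visual separators only; does not add or change any digits. */
    static String stripSeparators(String raw) {
        return raw.replaceAll("[\\s().-]", "");
    }

    /** True if the input, once separators are removed, has the E.164 shape. */
    static boolean hasE164Shape(String raw) {
        return SHAPE.matcher(stripSeparators(raw)).matches();
    }
}
```

Two stored strings that both pass this check can then be compared with plain String equality, which is the fast equality test asked for in the question.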

Best Character for splitting Android

I started an Android project, something like a chat program.
Data downloaded from my server looks like this:
1~my name~my username~message
Now, my question is: is there any character compatible with Android that could replace the delimiter (~) above? I'm afraid that if, some day, a user types the character ~, the program will crash.
I tried the character ÷, but my Android device can't read it; it turned into '?'.
Has anyone had the same problem?

First of all, it is almost always a bad idea to create your own format for client-server communication. My best advice is to give JSON or XML a shot. There are lots of libraries available on both the client side and the server side to build and parse them; all you have to do is use your back-end language to return one of these formats.
For python : http://docs.python.org/library/json.html
For php : http://php.net/manual/en/book.json.php
For Android : http://developer.android.com/reference/org/json/JSONObject.html
You can easily find other languages with simple search.

If you're also using Java on the server side, you could define an object like ChatMessage and just send it to the server over a Socket with an object stream.
As Burak noted, your way is the wrong way... but there are several other ways; IMHO an object stream might be the easiest solution for you.

If you use a delimiter which is a possible content of the data put into the flow you are delimiting, you will have a problem.
To prevent that, you need to prevent the character from occurring in a way that could be misinterpreted.
At the input side, detect occurrences and replace them either with a special code or with an escaping prefix character, or quote the contents (though then you have to handle literal occurrences of the quote characters).
If you use an escaping character, your splitting code must ignore any delimiter following an escape character or within a quoted sequence.
At the output side you should replace the codes or escape sequences with a literal instance of the encoded character or remove any quoting characters.
As others have mentioned, there are a number of standard schemes and functions for handling them.
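The escaping-prefix variant described above can be sketched like this, keeping ~ as the delimiter from the question and choosing backslash as the escape character (that choice, and all names in the class, are assumptions for the example):

```java
import java.util.ArrayList;
import java.util.List;

final class DelimitedCodec {
    private static final char DELIM = '~';
    private static final char ESC = '\\';

    /** Prefixes every delimiter or escape character in a field with ESC. */
    static String escape(String field) {
        StringBuilder sb = new StringBuilder();
        for (char c : field.toCharArray()) {
            if (c == DELIM || c == ESC) {
                sb.append(ESC);
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Escapes each field and joins the results with the delimiter. */
    static String join(List<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                sb.append(DELIM);
            }
            sb.append(escape(fields.get(i)));
        }
        return sb.toString();
    }

    /** Splits on unescaped delimiters and removes the escape prefixes. */
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean escaped = false;
        for (char c : line.toCharArray()) {
            if (escaped) {
                current.append(c);      // take the escaped character literally
                escaped = false;
            } else if (c == ESC) {
                escaped = true;         // next character is literal
            } else if (c == DELIM) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }
}
```

With this in place a user can type ~ in a message without breaking the record: the sender calls join on the raw fields and the receiver calls split to get them back unchanged.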

Develop Reference

The Android operating system is a mobile operating system that was developed by Google to be primarily used for touchscreen devices, cell phones, and tablets.