Create regex pattern for a specified String


I want to check whether a String has a specific structure. I think a regex would be the best way to test it, but I have never used regex before and have no idea how it works. I read some explanations on Stack Overflow, but I couldn't find a good explanation of how the regex pattern was built.
My String is returned from a DataMatrix scanner. For example:
String contained = "~ak4,0000D"
Now I want to test whether this String matches the regex pattern.
The String always starts with "~".
After this, two lowercase characters follow, in this example "ak".
After this comes a six-character value, "4,0000". This is the main problem: the comma can sit anywhere in this value, but the value must contain it. For example, it can be ",16000" or "150,00" or "2,8000".
The last position must hold one of these uppercase characters: A B C D E F G H J K L M.
I hope some of you can help me.

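The regex would be
~[a-z]{2}(?=[\d,]{6}[A-HJ-M]$)\d*,\d*[A-HJ-M]$
You can create and test expressions like this in an online regex tester.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
boolean isMatch(String STRING_YOU_WANT_TO_MATCH)
{
    // In a Java string literal, write each regex backslash as "\\".
    Pattern patt = Pattern.compile(YOUR_REGEX_PATTERN);
    Matcher matcher = patt.matcher(STRING_YOU_WANT_TO_MATCH);
    // matches() succeeds only if the whole input matches the pattern.
    return matcher.matches();
}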
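As a minimal sketch of how such a check might be wired up, the class below compiles one pattern that satisfies the stated rules and reuses it for every scan. The class name ScanCodeValidator and the method isValid are hypothetical, and the lookahead variant used here (which also pins the value to exactly six characters) is one possible choice, not the only one.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative and not from the original post.
final class ScanCodeValidator {
    // "~", two lowercase letters, six characters made of digits and exactly one comma,
    // then one letter from A-H or J-M. Regex backslashes are doubled for the Java literal.
    private static final Pattern CODE =
            Pattern.compile("~[a-z]{2}(?=[\\d,]{6}[A-HJ-M]$)\\d*,\\d*[A-HJ-M]");

    static boolean isValid(String scanned) {
        // matches() requires the whole input to match, so no leading ^ is needed.
        return scanned != null && CODE.matcher(scanned).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("~ak4,0000D")); // true
        System.out.println(isValid("~ak40000D"));  // false: no comma, value too short
    }
}
```

Compiling the pattern once in a static field avoids recompiling it for every scanned code.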

You need to use a positive lookahead based regex like below.
System.out.println("~ak4,0000D".matches("~[a-z]{2}(?=\\d*,\\d*.$)[\\d,]{6}[A-HJ-M]"));
System.out.println("~fk,10000D".matches("~[a-z]{2}(?=\\d*,\\d*.$)[\\d,]{6}[A-HJ-M]"));
System.out.println("~jk400,00D".matches("~[a-z]{2}(?=\\d*,\\d*.$)[\\d,]{6}[A-HJ-M]"));
System.out.println("~ak4,0000D".matches("~[a-z]{2}(?=\\d*,\\d*.$)[\\d,]{6}[A-HJ-M]"));
System.out.println("~fk10000,D".matches("~[a-z]{2}(?=\\d*,\\d*.$)[\\d,]{6}[A-HJ-M]"));
System.out.println("~jk400,00I".matches("~[a-z]{2}(?=\\d*,\\d*.$)[\\d,]{6}[A-HJ-M]"));
System.out.println("~ak40000,Z".matches("~[a-z]{2}(?=\\d*,\\d*.$)[\\d,]{6}[A-HJ-M]"));
System.out.println("~fky,10000D".matches("~[a-z]{2}(?=\\d*,\\d*.$)[\\d,]{6}[A-HJ-M]"));
System.out.println("~,jk40000D".matches("~[a-z]{2}(?=\\d*,\\d*.$)[\\d,]{6}[A-HJ-M]"));
Output:
true
true
true
true
true
false
false
false
false

One thing you need to know about regular expressions is that they are a family of things, not one specific thing. There are rather a lot of distinct but similar regular expression languages, and the facilities supporting them vary from programming language to programming language.
Here is a regex pattern that will work in most regex languages to match your strings:
"^~[a-z][a-z]((,[0-9][0-9][0-9][0-9][0-9])|([0-9],[0-9][0-9][0-9][0-9])|([0-9][0-9],[0-9][0-9][0-9])|([0-9][0-9][0-9],[0-9][0-9])|([0-9][0-9][0-9][0-9],[0-9])|([0-9][0-9][0-9][0-9][0-9],))[A-HJ-M]$"
The '^' anchors the pattern to the beginning of the string, and the '$' anchors it to the end, so that the pattern must match the whole string as opposed to a substring. Characters enclosed in square brackets represent "character classes" matching exactly one character from among a set, with two characters separated by a '-' representing a range of characters. The '|' separates alternatives, and parentheses serve to group subpatterns. For some regex engines, the parentheses and '|' symbols need to be escaped via a preceding '\' character to have these special meanings instead of representing themselves.
A more featureful regex language can allow that to be greatly simplified; for example:
"^~[a-z]{2}[0-9,]{6}(?<=[a-z][0-9]{0,5},[0-9]{0,5})[A-HJ-M]$"
The quantifiers "{2}" and "{6}" designate that the preceding subpattern must match exactly the specified number of times (instead of once), and the quantifier "{0,5}" designates that the preceding subpattern may match anywhere from zero to five times. Additionally, the "(?<= ...)" is a zero-length look-behind assertion, which tests whether the characters just before the current position match the given sub-pattern (in addition to having already matched the preceding sub-pattern); it consumes no input itself. Here it verifies that the six characters after the two lowercase letters contain exactly one comma. Character classes and the '*' quantifier are supported in pretty much all regex languages, but assertions and curly-brace quantifiers are less widely supported, and many engines that do support look-behind restrict it to fixed-length or bounded-length sub-patterns. Java's regular expression language accepts the bounded form shown here.
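To make the comparison concrete, here is a small Java sketch that compiles both the explicit-alternation pattern and a look-behind variant and runs them on the same inputs. The class and method names are hypothetical. The look-behind here uses bounded {0,5} quantifiers, an assumption made so that its maximum length stays bounded, which java.util.regex has historically required.

```java
import java.util.regex.Pattern;

// Hypothetical comparison harness; names are illustrative only.
final class CommaPatternComparison {
    // Explicit alternation: one branch per possible comma position.
    private static final Pattern EXPLICIT = Pattern.compile(
            "^~[a-z][a-z]("
            + "(,[0-9][0-9][0-9][0-9][0-9])"
            + "|([0-9],[0-9][0-9][0-9][0-9])"
            + "|([0-9][0-9],[0-9][0-9][0-9])"
            + "|([0-9][0-9][0-9],[0-9][0-9])"
            + "|([0-9][0-9][0-9][0-9],[0-9])"
            + "|([0-9][0-9][0-9][0-9][0-9],)"
            + ")[A-HJ-M]$");

    // Look-behind variant: consume six digit-or-comma characters, then look back
    // to confirm they hold exactly one comma (bounded quantifiers are an assumption).
    private static final Pattern LOOK_BEHIND = Pattern.compile(
            "^~[a-z]{2}[0-9,]{6}(?<=[a-z][0-9]{0,5},[0-9]{0,5})[A-HJ-M]$");

    static boolean explicit(String s) {
        return EXPLICIT.matcher(s).matches();
    }

    static boolean lookBehind(String s) {
        return LOOK_BEHIND.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"~ak4,0000D", "~ak,16000D", "~ak150,00D", "~ak400000D", "~ak4,,000D"};
        for (String s : samples) {
            System.out.println(s + " explicit=" + explicit(s) + " lookBehind=" + lookBehind(s));
        }
    }
}
```

Both patterns should agree on every input; the alternation is more portable, while the look-behind form is shorter.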

~[a-z]{2}[\d,]{6}[A-HJ-M]
This checks only that the six characters are digits or commas; it does not by itself require exactly one comma.
I'm no pro at regex, but I use this site every time I build a pattern:
RegExr
Use it like this in your code:
Pattern pattern = Pattern.compile(yourPatternAsAString);
Matcher matcher = pattern.matcher(yourInputToMatch);
if (matcher.matches()) {
    // gogogo
}

Related

Kotlin regex not working for polish char ("ł") which I get at runtime

I've declared a regex like this:
"(^\\d{1,}\\,\\d{2}|^0) zł$"
Unfortunately, it doesn't match the value below (but it should):
508,00 zł
NOTE1: I've discovered that the problem is probably the ł character.
NOTE2: I get this String from an API and check it at runtime (it has exactly the value I described).
NOTE3: I've also tried to match my pattern manually in the debugger evaluation (when I just typed the "508, 00zł" by hand) and it matched. Unfortunately, the string I actually get doesn't match at runtime. What could be the problem?
Code:
val value = getFromApi() // 508,00 zł
val regex = "(^\\d{1,}\\,\\d{2}|^0) zł$".toRegex()
regex.matches(value) // returns false

The letter ł is not the culprit here since there is only one Unicode representation for it.
The most common issue is the whitespace: it can be any Unicode whitespace character, and you cannot tell which one just by looking at it.
To match any ASCII whitespace, you may use \s. Here, you had this kind of whitespace, so my top comment below the question worked for you.
To match any Unicode whitespace, you may use \p{Z} to match any one whitespace character, or \p{Z}* to match 0 or more of their occurrences:
val value = "508,00 zł"
val regex = """^(\d+,\d{2}|0)\p{Z}zł$""".toRegex()
// val regex = """^(\d+,\d{2}|0)\p{Z}*zł$""".toRegex()
println(regex.matches(value)) // => true
See Kotlin demo
Also, note the use of the raw string literals (delimited with triple double quotation marks), they enable the use of a single backslash as the regex escape char.
Note {1,} is the same as + quantifier that matches 1 or more repetitions.
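The effect can be reproduced outside Kotlin as well. The Java sketch below assumes, purely for illustration, that the API returns a no-break space (U+00A0) between the amount and the currency; the class name, method names, and sample strings are hypothetical. It also includes a small helper that lists the code points of a string, which is one way to find out which whitespace character is really there.

```java
import java.util.regex.Pattern;

// Hypothetical demo; the no-break space in the sample input is an assumption.
final class PriceWhitespaceDemo {
    // "z\u0142" is "zł"; the escape avoids depending on the source file encoding.
    private static final Pattern LITERAL_SPACE =
            Pattern.compile("^(\\d+,\\d{2}|0) z\u0142$");
    private static final Pattern UNICODE_SEPARATOR =
            Pattern.compile("^(\\d+,\\d{2}|0)\\p{Z}z\u0142$");

    static boolean matchesLiteralSpace(String s) {
        return LITERAL_SPACE.matcher(s).matches();
    }

    static boolean matchesSeparator(String s) {
        return UNICODE_SEPARATOR.matcher(s).matches();
    }

    // Lists every code point in the string, e.g. "U+0020" for an ordinary space.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String fromApi = "508,00\u00A0z\u0142"; // assumed: no-break space
        System.out.println(matchesLiteralSpace(fromApi)); // false
        System.out.println(matchesSeparator(fromApi));    // true
        System.out.println(codePoints(fromApi));
    }
}
```

Printing the code points of the runtime value is usually the quickest way to confirm which character sits between the number and the currency.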

Custom Regular Expression in Java

I have to implement a function that checks whether a string conforms to a regular expression. I have written a method that parses a list of file names, and for each file name I need to check whether it matches the regexp.
The file name is composed as follows (just an example):
verbale.pdf.001.001
image.jpg.002.001
The string is always composed of:
an extension (only jpg or pdf), ".", a group of three digits, ".", a group of three digits
With this regexp I need to check whether the input string ends as described above. I have currently implemented this:
Pattern rexExp = Pattern.compile("((\\.jpg)|(\\.pdf))\\.[0-9]{3}\\.[0-9]{3}");
But it does not work properly. Is it a good idea to use a regexp to check whether a file name ends in a certain way?

Less greedy than the other answer, think it suits you:
\\w+\\.(jpg|pdf)(\\.\\d{3}){2}
file name, only composed of letters, numbers and _
dot
jpg or pdf formats
another dot
three digits
the dot and the three digits repeated

This should work :
.*\\w{3}\\.\\d{3}\\.\\d{3}
.* = any Characters (like "verbale123")
\\w{3} = any 3 word characters (letters, digits, or underscore)
\\. = a dot
\\d{3} = any three numeric characters

To check if a string ends with pdf or jpg and two sequences of . and 3 digits, you may use
(?i)(?:jpg|pdf)(?:\.[0-9]{3}){2}$
See the regex demo
Details
(?i) - case insensitive flag
(?:jpg|pdf) - either jpg or pdf
(?:\.[0-9]{3}){2} - 2 repetitions of a . and 3 digits
$ - end of string.
Use with Matcher#find() (as matches() anchors the match at the start and end of the string, while a partial match is required when using this pattern), example demo:
String s = "verbale.pdf.001.001";
Matcher matcher = Pattern.compile("(?i)(?:jpg|pdf)(?:\\.[0-9]{3}){2}$").matcher(s);
if (matcher.find()) {
    System.out.println("Valid!");
}
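The difference between find() and matches() with this pattern can be seen in a short Java sketch. The class and method names are hypothetical; the pattern is the one from the answer above.

```java
import java.util.regex.Pattern;

// Hypothetical helper contrasting find() and matches() on the same pattern.
final class FileSuffixCheck {
    private static final Pattern SUFFIX =
            Pattern.compile("(?i)(?:jpg|pdf)(?:\\.[0-9]{3}){2}$");

    // find() looks for the pattern anywhere; the trailing $ pins it to the end.
    static boolean endsWithSuffix(String name) {
        return SUFFIX.matcher(name).find();
    }

    // matches() requires the entire name to match, so any prefix makes it fail.
    static boolean wholeNameMatches(String name) {
        return SUFFIX.matcher(name).matches();
    }

    public static void main(String[] args) {
        String name = "verbale.pdf.001.001";
        System.out.println(endsWithSuffix(name));   // true
        System.out.println(wholeNameMatches(name)); // false: "verbale." is not covered
    }
}
```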

Combination of rules with Regex

In an Android project, I'm trying to validate a password that the user enters. It must follow some rules.
The rules are:
it must have at least 7 characters and meet 3 of the following conditions:
- One lowercase character
- One uppercase character
- One number
- One special character
For example:
asd123!!!
PPPppp000
TTT999###
I was trying with this regex:
^(?=.*?[A-Z])(?=.*?[a-z])(?=.*?[0-9])(?=.*?[#?!#$%^&*-]).{7,}+$
but this enforces all four rules at the same time.

The approach is wrong here. The regex you created looks like a monster from under the bed, and is hard to read even for someone regex-literate.
Why not split it into 4 regexes (or as many as there are rules) and check whether at least 3 of them match? Not only will your regexes be cleaner, but you will also be able to add more rules if need be without changing the whole regex.
You can also use built-in methods for the checks (if available in the Android development kit).
Some pseudocode would look like this:
result1 = Regex.IsMatch(password, rule1regex)
result2 = Regex.IsMatch(password, rule2regex)
...
resultN = Regex.IsMatch(password, ruleNregex)
if(three_out_of_four_rules_apply)
    password_valid = true
You can also apply the method suggested in the comments by @pskink: iterate over each character of the password and set the output accordingly.
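The split-rule idea above can be sketched in Java as follows. This is only an illustration: the class name PasswordRules is hypothetical, the special-character set is taken from the question's regex, and the minimum length of 7 is an assumption based on the question's ".{7,}".

```java
import java.util.regex.Pattern;

// Hypothetical validator: one small regex per rule, then count how many apply.
final class PasswordRules {
    private static final int MIN_LENGTH = 7;      // assumed from ".{7,}" in the question
    private static final int REQUIRED_RULES = 3;

    private static final Pattern[] RULES = {
        Pattern.compile("[a-z]"),        // one lowercase character
        Pattern.compile("[A-Z]"),        // one uppercase character
        Pattern.compile("[0-9]"),        // one number
        Pattern.compile("[#?!$%^&*-]")   // one special character (set from the question)
    };

    static boolean isValid(String password) {
        if (password == null || password.length() < MIN_LENGTH) {
            return false;
        }
        int satisfied = 0;
        for (Pattern rule : RULES) {
            // find() succeeds if the rule's character appears anywhere in the password.
            if (rule.matcher(password).find()) {
                satisfied++;
            }
        }
        return satisfied >= REQUIRED_RULES;
    }

    public static void main(String[] args) {
        System.out.println(isValid("asd123!!!")); // true
        System.out.println(isValid("aaaaaa1"));   // false: only two rules apply
    }
}
```

Adding a fifth rule then means adding one more entry to the array, without touching the others.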

Without going into the details of your lookaheads (which seem correct), here's how you would need to implement "three out of four criteria" in pure regex :
(?=.*A)(?=.*B)(?=.*C)|(?=.*A)(?=.*B)(?=.*D)|(?=.*A)(?=.*C)(?=.*D)|(?=.*B)(?=.*C)(?=.*D)
You can test it here.
Factorizing doesn't really make it better :
(?=.*A)(?:(?=.*B)(?=.*(?:C|D))|(?=.*C)(?=.*D))|(?=.*B)(?=.*C)(?=.*D)
I obviously recommend using a higher level language to implement these sorts of constraints.
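For completeness, here is a Java sketch of the pure-regex form with the placeholders A, B, C and D replaced by concrete character classes. The substitutions (lowercase, uppercase, digit, and the special set from the question), the added ".{7,}" length check, and the class name are all assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical: the "three out of four" alternation with concrete lookaheads.
final class ThreeOfFourRegex {
    private static final String A = "(?=.*[a-z])";        // lowercase
    private static final String B = "(?=.*[A-Z])";        // uppercase
    private static final String C = "(?=.*[0-9])";        // digit
    private static final String D = "(?=.*[#?!$%^&*-])";  // special (set from the question)

    // Any three of the four lookaheads must hold, followed by at least 7 characters.
    private static final Pattern THREE_OF_FOUR = Pattern.compile(
            "^(?:" + A + B + C
            + "|" + A + B + D
            + "|" + A + C + D
            + "|" + B + C + D
            + ").{7,}$");

    static boolean isValid(String password) {
        return password != null && THREE_OF_FOUR.matcher(password).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("TTT999###")); // true
        System.out.println(isValid("abcdefgh"));  // false: only one rule applies
    }
}
```

Even when assembled from named pieces, the pattern needs one branch per combination, which is why it grows quickly as rules are added.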

Google Sheets App Script to remove Bracketed word from String

Hmm, I can't find the man page for 'replace' in Google's Apps Script; I only see 'replaceText'. Anyway, from what I gather from the SO posts, the code below should work; hopefully someone can spot the problem easily.
The String in the Cell is "[pro] all, everybody" and I want to remove the bracketed word '[pro]' so the result is 'all, everybody'.
It does work just fine with:
Cell = Cell.toString().replace("\[pro\]","");
but when I try to make it generic, it fails with all these (not sure what the pattern matching rules are, thus the question for the man page):
Cell = Cell.toString().replace("\[pr.\]","");
Cell = Cell.toString().replace("\[pr.*\]","");
Cell = Cell.toString().replace("\[.*\]","");
They should work, no? What am I missing?
Also, how would I use 'replaceText', I can't seem to apply it directly to the 'Cell' object.

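String#replace is a JavaScript method. When its first argument is a string, it is matched as literal text, not as a pattern, which is why the generic versions fail. To match a pattern, pass a regex in regex literal notation or build one with the new RegExp("pattern", "modifiers") constructor notation: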
Cell = Cell.toString().replace(/\[pr[^\]]*]/,"");
When using a regex literal, backslashes are treated as literal backslashes, and /\d/ matches a digit. The constructor notation equivalent is new RegExp("\\d").
The /\[pr[^\]]*]/ regex matches the first instance of:
\[pr - literal substring [pr
[^\]]* - 0+ chars other than ]
] - a literal ] symbol.
And replaces with an empty string.

What regex can be used to filter out dalvikvm AND dalvikvm-heap messages from the logcat

Using this link I was able to create a filter using the regex (?!dalvikvm\b)\b\w+ to filter out messages with the tag dalvikvm, but I have tried several variations of the regex such as (?!dalvikvm-heap\b)\b\w+, (?!dalvikvm\\-heap\b)\b\w+, (?!dalvikvm[-]heap\b)\b\w+, and many others and I can't seem to get rid of the dalvikvm-heap messages. Ideally I would like to filter them both, but I haven't figured that part out yet either.
Any help would be appreciated.

Use ^(?!dalvikvm) in the tag field instead. That will show only messages whose tag doesn't start with "dalvikvm".
Below are notes about how this works; you can skip them if you're not interested. To start, you have to remember that the question, "Does this string match the regex?" really means, "Is there any position in this string where the regex matches?"
The tricky thing about (?!x) negative assertions is that they match wherever the next part of the string doesn't match x: but that's true of every place in the string "dalvikvm" except the start. The blog post you linked to adds a \b at the end so that the expression matches only at a place that's not just before "dalvikvm" and is a word boundary. But this would still match, because the end of the string is a word boundary, and it doesn't have "dalvikvm" after it. So the blog post adds the \w+ after it, to say that after the word boundary there have to be more word characters.
It works for exactly that case, but it's a bit of an odd way of making a regex, and it's relatively expensive to evaluate. And as you've noticed, you can't adapt it to (?!dalvikvm-heap\b)\b\w+. "-" is a non-word character, so immediately after it, there is a word boundary, followed by word characters, and not followed by "dalvikvm-heap", so the regex matches at that point.
Instead, I use ^, which only matches at the start of the string, along with the negative assertion. Overall, the regex only matches at the start of the string, and only then if the start of the string isn't followed by "dalvikvm". That means it won't match "dalvikvm" or "dalvikvm-heap". It's also cheaper to evaluate, because the regex engine knows it can only possibly match at the start.
Making the regex this way, you can filter out multiple tags by just putting them together. For example, ^(?!dalvikvm)(?!IInputConnectionWrapper) will filter out tags that start with "dalvikvm" or "IInputConnectionWrapper", because the start of the string has to not be followed by the first and not be followed by the second.
BTW, thanks for your link. I didn't realise that you could use the logcat filters that way, so I wouldn't have come up with my answer without it.
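The reasoning above can be checked with a short Java sketch. It assumes, as the explanation does, that a tag is shown when the filter regex matches anywhere in it, which corresponds to Matcher#find(). The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo of the two filter styles discussed above.
final class TagFilterDemo {
    // Assumption: a tag is shown when the regex matches at any position in it.
    static boolean shown(String filterRegex, String tag) {
        return Pattern.compile(filterRegex).matcher(tag).find();
    }

    public static void main(String[] args) {
        String blogStyle = "(?!dalvikvm-heap\\b)\\b\\w+"; // adapted blog-post pattern
        String anchored = "^(?!dalvikvm)";                 // pattern from this answer
        String[] tags = {"dalvikvm", "dalvikvm-heap", "ActivityManager"};
        for (String tag : tags) {
            System.out.println(tag
                    + ": blogStyle shown=" + shown(blogStyle, tag)
                    + ", anchored shown=" + shown(anchored, tag));
        }
    }
}
```

With the blog-style pattern, "dalvikvm-heap" is still shown because the regex finds "heap" after the hyphen; the anchored pattern can only match at position 0, so both dalvikvm tags are hidden.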
