I've done a bunch of searching, but I'm terrible with regex and my google-fu in this instance has not been strong.
Scenario:
In push notifications, we're passed a URL that contains a 9-digit content ID.
Example URL: http://www.something.com/foo/bar/Some-title-Goes-here-123456789.html (123456789 is the content ID in this scenario)
Current regex to parse the content ID:
public String getContentIdFromPathAndQueryString(String path, String queryString) {
String contentId = null;
if (StringUtils.isNonEmpty(path)) {
Pattern p = Pattern.compile("([\\d]{9})(?=.html)");
Matcher m = p.matcher(path);
if (m.find()) {
contentId = m.group();
} else if (StringUtils.isNonEmpty(queryString)) {
p = Pattern.compile("(?:contentId=)([\\d]{9})(?=.html)");
m = p.matcher(queryString);
if (m.find()) {
contentId = m.group();
}
}
}
Log.d(LOG_TAG, "Content id " + (contentId == null ? "not found" : (" found - " + contentId)));
if (StringUtils.isEmpty(contentId)) {
Answers.getInstance().logCustom(new CustomEvent("eid_url")
.putCustomAttribute("contentId", "empty")
.putCustomAttribute("path", path)
.putCustomAttribute("query", queryString));
}
return contentId;
}
The problem:
This does the job but there's a specific error scenario that I need to account for.
Whoever creates the push may put in the wrong length content ID and we need to grab it regardless of that, so assume it can be any number of digits... the title can also contain digits, which is annoying. The content ID will ALWAYS be followed by ".html"
While the basic answer here would be just "replace {9} limiting quantifier matching exactly 9 occurrences with a + quantifier matching 1+ occurrences", there are two patterns that can be improved.
The unescaped dot should be escaped in the pattern to match a literal dot.
If you have no overlapping matches, no need to use a positive lookahead with a capturing group before it, just keep the capturing group and grab .group(1) value.
A non-capturing group (?:...) is still a consuming pattern, and the (?:contentId=) equals contentId= (you may remove (?: and )).
There is no need wrapping a single atom within a character class, use \\d instead of [\\d]. That [\\d] is actually a source of misunderstandings, some may think it is a grouping construct, and might try adding alternative sequences into the square brackets, while [...] matches a single char.
So, your code can look like
Pattern p = Pattern.compile("(\\d+)\\.html"); // No lookahead, + instead of {9}
Matcher m = p.matcher(path);
if (m.find()) {
contentId = m.group(1); // (1) refers to Group 1
} else if (StringUtils.isNonEmpty(queryString)) {
p = Pattern.compile("contentId=(\\d+)\\.html");
m = p.matcher(queryString);
if (m.find()) {
contentId = m.group(1);
}
}
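For anyone who wants to try the pattern outside the app, here is a minimal, self-contained sketch. The class name ContentIdDemo, the method extractContentId, and the second sample URL are made up for illustration; the StringUtils and logging calls from the question are left out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContentIdDemo {
    // One or more digits immediately followed by a literal ".html"
    private static final Pattern PATH_ID = Pattern.compile("(\\d+)\\.html");

    // Returns the digits right before ".html", or null if there are none.
    static String extractContentId(String path) {
        if (path == null) {
            return null;
        }
        Matcher m = PATH_ID.matcher(path);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // 9-digit ID, as in the question
        System.out.println(extractContentId(
                "http://www.something.com/foo/bar/Some-title-Goes-here-123456789.html")); // 123456789
        // Wrong-length ID and digits in the title (hypothetical URL)
        System.out.println(extractContentId(
                "http://www.something.com/foo/bar/Top-10-things-4567.html")); // 4567
    }
}
```

Digits in the title are skipped because they are not directly followed by ".html", so only the trailing run of digits is captured.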
I have these strings: https://regex101.com/r/7Er0Ch/6
I want to put all of the http://esupb.tabriz.ir:808x/srvSC.svc URLs into an array list. To do that, I used a Matcher like below:
String regx= "#\\d+#";
Pattern pattern = Pattern.compile(regx);
Matcher matcher = pattern.matcher(url);
String[] metadata = new String[4];
while (matcher.find()) {
metadata[0] = matcher.group(1);
metadata[1] = matcher.group(2);
metadata[2] = matcher.group(3);
metadata[3] = matcher.group(4);
}
but I did not get the expected result. What is my mistake?
Based on your requirement, the regex (before Java string escaping) will be
"(#\d+#)(http[^#]*svc)(#\d+#)"
and its groups are:
group(0): the whole match, (#\d+#)(http[^#]*svc)(#\d+#)
group(1): (#\d+#)
group(2): (http[^#]*svc), the URL
group(3): (#\d+#)
Change your code to
List<String> urls = new ArrayList<>();
String url =
"#1#http://test.com:8080/srv.svc#1# " +
"#2#http://test.com:8081/srv.svc#2# " +
"#3#http://test.com:8082/srv.svc#3# " +
"#4#http://test.com:8083/srv.svc#4# " +
"#5#http://test.com:8084/srv.svc#5# ";
String regx = "(#\\d+#)(http[^#]*svc)(#\\d+#)";
Pattern pattern = Pattern.compile(regx);
Matcher matcher = pattern.matcher(url);
// find() resumes after the end of the previous match, so no manual offset is needed
while (matcher.find()) {
urls.add(matcher.group(2)); // group 2 is the URL
}
Your regex #\\d+# matches a #, followed by one or more digits, followed by another #. It does not use capturing groups, so matcher.group(1) through matcher.group(4) do not exist.
For your example data you could instead remove those matches from the string, which leaves just the URL without having to write a pattern for it. Note, however, that #\\d+# can match anywhere inside the string, not only at the start and the end.
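To make the point about capturing groups concrete, here is a small sketch. The class GroupCountDemo and the helper firstGroupOrError are hypothetical names, and the input is one of the test.com strings used elsewhere in this thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupCountDemo {
    // Returns group(1) of the first match, or a short note when it is unavailable.
    static String firstGroupOrError(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return "no match";
        }
        try {
            return m.group(1);
        } catch (IndexOutOfBoundsException e) {
            // Thrown when the pattern has no capturing group with that index
            return "no group 1 (groupCount=" + m.groupCount() + ")";
        }
    }

    public static void main(String[] args) {
        String input = "#1#http://test.com:8080/srv.svc#1#";
        // No parentheses in the pattern, so there is no group 1
        System.out.println(firstGroupOrError("#\\d+#", input));
        // prints: no group 1 (groupCount=0)
        // With a capturing group around the URL
        System.out.println(firstGroupOrError("#\\d+#(http[^#]*svc)#\\d+#", input));
        // prints: http://test.com:8080/srv.svc
    }
}
```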
To match your example string(s) like http://esupb.tabriz.ir:808x/srvSC.svc, you might use your regex to match the start and the end, and capture in a group what is in between (shown here with test.ir as the host):
^#\d+#(https?://test\.ir:808\d/srvSC\.svc)#\d+#$
In Java
^#\\d+#(https?://test\\.ir:808\\d/srvSC\\.svc)#\\d+#$
Explanation
^ Assert the start of the string
#\d+# Match #, one or more digits and another #
( Start capturing group
https?://test\.ir:808\d Match the start of the URL with an optional s (s?) and a single digit after 808. Use \d+ instead to match one or more digits.
/srvSC\.svc Match /srvSC.svc
) Close capturing group
#\d+# Match #, one or more digits and another #
$ Assert the end of the string
The regex
.*([0-9]{3}\\.[0-9]{2}).*
finds one match in "some short sentence 111.01 ", but it fails to match the first occurrence "111.01" in "some short sentence 111.01 & 222.02 ".
I tried the lazy quantifier, .*([0-9]{3}\\.[0-9]{2})?.* or .*([0-9]{3}\\.[0-9]{2}).*?, to no avail.
Please help, I need to get both occurrences. Here is my code.
Thank you
Pattern myPattern = Pattern.compile(".*([0-9]{3}\\.[0-9]{2}).*");
Matcher m = myPattern.matcher(mystring);
while (m.find()) {
String found = m.group(1);
}
You need to remove the ".*"s. Try this:
String mystring = "some short sentence 111.01 & 222.02 ";
Pattern myPattern = Pattern.compile("([0-9]{3}\\.[0-9]{2})");
Matcher m = myPattern.matcher(mystring);
while(m.find()) {
System.out.println("Found value: " + m.group(1) );
}
output:
Found value: 111.01
Found value: 222.02
The leading and trailing ".*" cause the pattern to match the entire string in one match, so find() has nothing left to match afterwards. All a lazy quantifier on the leading ".*" does in your case is control whether you get the first, rather than the last, occurrence in the subject.
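The three variants can be compared side by side with a small sketch. The class GreedyDemo and the helper findAll are hypothetical names; the patterns and the input are the ones from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    // Collects group(1) of every match that find() reports.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String s = "some short sentence 111.01 & 222.02 ";
        // Greedy leading .* : one match spanning the whole string, last number captured
        System.out.println(findAll(".*([0-9]{3}\\.[0-9]{2}).*", s));  // [222.02]
        // Lazy leading .*? : still one match, but the first number is captured
        System.out.println(findAll(".*?([0-9]{3}\\.[0-9]{2}).*", s)); // [111.01]
        // No surrounding .* : each number is its own match
        System.out.println(findAll("([0-9]{3}\\.[0-9]{2})", s));      // [111.01, 222.02]
    }
}
```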
I am trying to get only the part "9916-4203" from "Region Code:9916-4203 " on Android. How can I do this?
I tried the code below, using the substring method, but it doesn't work:
firstNumber = Integer.parseInt(message.substring(11, 19));
If you know that the string contains "Region Code:", couldn't you do a replace?
message = message.replace("Region Code:", "");
Assuming that you have only one number in your String, the following will remove any non-digit characters and parse the resulting number:
public static int getNumber(String num){
String tmp = "";
for(int i=0;i<num.length();i++){
if(Character.isDigit(num.charAt(i)))
tmp += num.charAt(i);
}
return Integer.parseInt(tmp);
}
Output in your case: 99164203
And as already mentioned, you won't be able to parse a String to an Integer if it contains any non-digit characters.
I'm going to guess that what you want to extract is the full region code text minus the title. So maybe a regex would be a good, simple fit for you?
String myString = "Region Code:9916-4203";
String match = "";
String pattern = ":(.*)"; // a colon needs no escaping; "\:" is not a valid Java string escape
Pattern regEx = Pattern.compile(pattern);
// Find instance of pattern matches
Matcher m = regEx.matcher(myString);
if (m.find()) {
match = m.group(1); // group(1) is the text after the colon; group(0) would include the colon
}
Variable match will contain "9916-4203"
This should work for you.
Java code sourced from http://android-elements.blogspot.in/2011/04/regular-expressions-in-android.html
In Java, the substring() method treats its first parameter as inclusive and its second parameter as exclusive, meaning "Hello".substring(0, 2) results in the string He.
Besides not parsing something that isn't a number, as @Opiatefuchs mentioned, your substring call should instead be message.substring(12, 21).
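The index arithmetic can be checked with a short sketch. The class SubstringDemo and the helper regionCode are hypothetical; the helper looks up the prefix instead of hard-coding 12 and 21, which is an assumption about what is wanted rather than something the question states.

```java
class SubstringDemo {
    // Returns the text after "Region Code:", trimmed, or null if the prefix is absent.
    static String regionCode(String message) {
        String prefix = "Region Code:";
        int start = message.indexOf(prefix);
        if (start < 0) {
            return null;
        }
        return message.substring(start + prefix.length()).trim();
    }

    public static void main(String[] args) {
        // Begin index is inclusive, end index is exclusive
        System.out.println("Hello".substring(0, 2)); // He

        String message = "Region Code:9916-4203 ";
        // "Region Code:" occupies indices 0..11, so the code starts at index 12
        System.out.println(message.substring(12, 21)); // 9916-4203

        String code = regionCode(message);
        System.out.println(code); // 9916-4203

        // The hyphen must go before Integer.parseInt can accept the text
        System.out.println(Integer.parseInt(code.replace("-", ""))); // 99164203
    }
}
```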
I am trying to get substrings from the string which are between apostrophes using regex.
Format of the string: Duplicate entry 'bla#bla.bl' for key 'email'.
The regex I am using: '([^']*).
Code:
Pattern pattern = Pattern.compile("'([^']*)");
Matcher matcher = pattern.matcher(duplicated);
Log.d(TAG, matcher.group()));
I am not also sure about matcher.group(), which returns a single string, that matched the whole regex. In my case, it should return two substrings.
Can somebody correct this regex and give me an explanation?
Thanks in advance
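It may be simpler to use .split() instead of pattern matching, although it is a hard-coded approach. Do as below:
String[] strSplitted = <Your String>.split("'");
Then the strSplitted array contains the pieces of the string that were separated by ', so the quoted substrings are at the odd indices (1, 3, ...).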
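A rough sketch of the split approach, using the apostrophe as the delimiter since that is what the question's string contains. The class QuoteSplitDemo and the method quotedParts are hypothetical names, and the sketch assumes the apostrophes are balanced.

```java
import java.util.ArrayList;
import java.util.List;

class QuoteSplitDemo {
    // Splits on apostrophes and keeps every second piece, i.e. the text inside quotes.
    // Assumes balanced quotes; a stray apostrophe would shift the pieces.
    static List<String> quotedParts(String text) {
        String[] parts = text.split("'");
        List<String> quoted = new ArrayList<>();
        // Pieces alternate: outside, inside, outside, inside, ...
        for (int i = 1; i < parts.length; i += 2) {
            quoted.add(parts[i]);
        }
        return quoted;
    }

    public static void main(String[] args) {
        String duplicated = "Duplicate entry 'bla#bla.bl' for key 'email'";
        System.out.println(quotedParts(duplicated)); // [bla#bla.bl, email]
    }
}
```

Because an unbalanced apostrophe silently produces wrong pieces, a regex with an explicit closing quote is the more defensive choice when the input is not fully controlled.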
I would use this regex. It is almost exactly like yours but I include the closing single quote. This is to prevent the closing single quote from being used in the next match.
'([^']*)'
And to get the contents inside the single quotes use a line similar to this:
matcher.group(1)
Here is a Java example:
Pattern regex = Pattern.compile("'([^']*)'"); // MULTILINE is not needed: the pattern has no ^ or $
Matcher matcher = regex.matcher(duplicated);
while (matcher.find()) {
Log.d(TAG, matcher.group(1)); // group 1 is the text between the quotes
}
Here's my tested solution. You have to call find():
Pattern pattern = Pattern.compile("'([^']*)'");
String duplicated = "Duplicate entry 'bla#bla.bl' for key 'email'";
Matcher matcher = pattern.matcher(duplicated);
String a = "";
while (matcher.find()) {
a += matcher.group(1) + "\n";
}
Result:
bla#bla.bl
email
I came up with my own solution, as follows.
int first_index; // index of the opening apostrophe
int second_index = 0; // index of the closing apostrophe
String str = "Duplicate entry 'bla#bla.bl' for key 'email'";
while (true) {
if (second_index == 0)
first_index = str.indexOf("'", second_index);
else
first_index = str.indexOf("'", second_index + 1);
if (first_index == -1)
break;
second_index = str.indexOf("'", first_index + 1);
if (second_index == -1)
break;
String temp = str.substring(first_index + 1, second_index);
Log.d("TAG",temp);
}
Output
06-25 17:25:17.689: bla#bla.bl
06-25 17:25:17.689: email
I've been trying to solve this issue for some time but still haven't found the answer. The aim is to get some data from an HTML webpage. I can do all the internet-related part, but I've got a problem. This is the string I have:
class="datastream-graph-value">
496
The problem is those quotation marks: without them my app would be able to get the "496", which is the important data, but with them there I can't get my data.
Which would be a good way to get that data? (Note that after the ">" symbol there is a "\n")
Thank you mates!
I don't normally recommend regular expressions for reading XML, but parsing HTML with an XML parser can be a nightmare.
Take the sample below.
<a class="datastream-graph-value" href="http=blah" > 496</a>
<a class="other"> 496</a>
Use the regular expression below; it should handle it well.
(class=["][^>"]*["])
This article gives a great example of how to use a regex like that:
http://www.vogella.com/articles/JavaRegularExpressions/article.html
If you need a code sample, reply back and we will see what we can work out.
edit:
I was bored so I thought why not put a sample together
package temp;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexTestPatternMatcher {
public static final String EXAMPLE_TEST = "<a class=\"datastream-graph-value\" href=\"http=blah\" > 496</a> <a class=\"other\"> 496</a>";
public static void main(String[] args) {
Pattern pattern = Pattern.compile("(class=[\"][^>\"]*[\"])");
// In case you would like to ignore case sensitivity you could use this
// statement
// Pattern pattern = Pattern.compile("\\s+", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(EXAMPLE_TEST);
// Check all occurrences
while (matcher.find()) {
System.out.print("Start index: " + matcher.start());
System.out.print(" End index: " + matcher.end() + " ");
String match = matcher.group();
match = match.replace("class=", "");
System.out.println(match);
}
// Now create a new pattern and matcher to replace whitespace with tabs
Pattern replace = Pattern.compile("\\s+");
Matcher matcher2 = replace.matcher(EXAMPLE_TEST);
System.out.println(matcher2.replaceAll("\t"));
}
}
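The sample above finds the class attribute but not the number itself. A possible sketch for pulling out the value that follows the tag is shown here; the class GraphValueDemo, the method extractValue, and the pattern are assumptions based only on the snippet in the question, not on the real page, so adjust them to the actual markup.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GraphValueDemo {
    // The class attribute, the rest of the tag up to '>', optional whitespace
    // (including a newline), then the digits to capture.
    private static final Pattern VALUE =
            Pattern.compile("class=\"datastream-graph-value\"[^>]*>\\s*(\\d+)");

    // Returns the first number following the datastream-graph-value tag, or null.
    static String extractValue(String html) {
        Matcher m = VALUE.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The string from the question: '>' followed by a newline and the value
        System.out.println(extractValue("class=\"datastream-graph-value\">\n496")); // 496
        // The sample from the answer, with an extra attribute before '>'
        System.out.println(extractValue(
                "<a class=\"datastream-graph-value\" href=\"http=blah\" > 496</a> <a class=\"other\"> 496</a>")); // 496
    }
}
```

The quotation marks only need escaping inside the Java string literal; \\s* takes care of the newline after the ">".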