Is the code below correct? I get an error when I print matcher.group(0). Am I printing the values the correct way? Please help. The Logcat error is below.
String GetCddata=<p><a href="http://myimagefactorycollection.files.wordpress.com/2014/09/2db83fcf95c5fc036a00abfb412f50e4.jpg">
<img class="alignnone size-full wp-image-12" src="http://myimagefactorycollection.files.wordpress.com/2014/09/2db83fcf95c5fc036a00abfb412f50e4.jpg?w=529" alt="2db83fcf95c5fc036a00abfb412f50e4" />
</a><a href="https://myimagefactorycollection.files.wordpress.com/2014/09/0e397a47f88e18f8fb91d17db18c7edd-copy.jpg"><img class="alignnone size-full wp-image-4" src="http://myimagefactorycollection.files.wordpress.com/2014/09/0e397a47f88e18f8fb91d17db18c7edd-copy.jpg?w=529" alt="0e397a47f88e18f8fb91d17db18c7edd - Copy" />
</a></p><br /> <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/myimagefactorycollection.wordpress.com/3/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/myimagefactorycollection.wordpress.com/3/" />
</a> <img alt="" border="0" src="http://pixel.wp.com/b.gif?host=myimagefactorycollection.wordpress.com&blog=75018866&post=3&subd=myimagefactorycollection&ref=&feed=1" width="1" height="1" />
]];
Pattern pattern = Pattern.compile("(?<=\\<a href=)(.*?)\\>");
Matcher matcher = pattern.matcher(GetCddata);
Log.v("dd",matcher.group(0));
http://i.stack.imgur.com/GKpOD.png
group:
public String group()
Returns the input subsequence matched by the previous match. For a
matcher m with input sequence s, the expressions m.group() and
s.substring(m.start(), m.end()) are equivalent.
Note that some patterns, for example a*, match the empty string. This
method will return the empty string when the pattern successfully
matches the empty string in the input.
Specified by: group in interface MatchResult
Returns: The (possibly empty) subsequence matched by the previous match, in string form
Throws: IllegalStateException - If no match has yet been attempted, or if the previous match operation failed
Source: docs.oracle.com
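Suggestion: call matcher.find() before matcher.group(0), because group() throws IllegalStateException until a match has been attempted and has succeeded. Also check your regex pattern, and consider enclosing your Log.v() call in a try...catch block.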
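To illustrate the documentation quoted above, here is a minimal sketch of calling find() before group(). The class name HrefExtractor, the method extractHrefs and the pattern are my own assumptions for illustration, not taken from the question; the pattern assumes double-quoted href attributes and is not a general HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects the href values of <a> tags with a regex.
public class HrefExtractor {
    // Group 1 captures the text between the quotes of href="...".
    private static final Pattern HREF =
            Pattern.compile("<a\\s+[^>]*href=\"([^\"]*)\"");

    public static List<String> extractHrefs(String html) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        // find() attempts the next match; group() is only legal after it returns true.
        while (matcher.find()) {
            urls.add(matcher.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://example.com/a.jpg\"><img src=\"x\" /></a>"
                + "<a rel=\"nofollow\" href=\"http://example.com/b.jpg\">b</a></p>";
        for (String url : extractHrefs(html)) {
            System.out.println(url);
        }
    }
}
```

On Android you would pass each value to Log.v() instead of System.out.println(). If the input contains no link, the loop body simply never runs, so no exception is thrown.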
Using jsoup's select I've managed to extract the following HTML. I'm trying to get all the HTML between <a id="dd_start"></a> and <a id="dd_end"></a>.
I've used obj.first().getElementsByClass("div.dd_outer").remove() with no luck.
Any suggestions?
<div class="entry-content" itemprop="text">
<a id="dd_start"></a>
<p><img class="size-full wp-image-21501 aligncenter" src="http://blablabla.com/wp-content/uploads/16/01/google1.jpg" alt="google-icon" width="100%"></p>
<p>blablabla.<br> <span id="more-21499"></span><br> blablabla.</p>
<p>blablabla blablabla. </p>
<a id="dd_end"></a>
<div class="dd_outer">
<div class="dd_inner">
<div id="dd_ajax_float">
<div class="dd_button_v">
</div>
</div>
</div>
</div>
This works for the snippet you posted. You might want to make some changes to handle edge cases, errors etc.
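import java.nio.file.Files;
import java.nio.file.Paths;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ExtractSection {
    public static void main(String[] args) throws Exception {
        String html = new String(Files.readAllBytes(Paths.get("input.html")));
        Document doc = Jsoup.parse(html);
        Elements section = new Elements();
        // Walk the siblings after dd_start; stop at dd_end or when the siblings run out.
        Element sibling = doc.getElementById("dd_start").nextElementSibling();
        while (sibling != null && !sibling.id().equals("dd_end")) {
            section.add(sibling);
            sibling = sibling.nextElementSibling();
        }
        System.out.println(section);
    }
}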
As far as the question of removing a certain section, you can do this:
Document doc = Jsoup.parse(html);
doc.select("div.dd_outer").first().remove();
System.out.println(doc);
This removes the section from your Document object. first() returns an Element, and Element.remove() detaches that element from its parent node, so only the first match is removed. If you drop first() and write
doc.select("div.dd_outer").remove();
you call Elements.remove() instead, which removes every matched element from the DOM, so the Document changes in that case too. Your original call had no effect for a different reason: getElementsByClass() expects a bare class name such as "dd_outer", not a CSS selector such as "div.dd_outer".
Hi, I am trying to get the values of two hidden inputs (__VIEWSTATE and __EVENTVALIDATION).
<form name="FormLogin" method="post" action="Same.aspx" id="FormLogin">
<div>
<input type="hidden" name="__OTHER" id="__OTHER" value="SOME not importent value" />
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/someLongValuewhatIwant=" />
</div>
<div>
<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="/someOtherValueWhatIwant" />
</div>
</form>
My code
doc = Jsoup.connect("http://example.com/index.aspx").get();
Elements input = doc.select("input[type=hidden]");
Element viewst = input.select("#__VIEWSTATE").get(0);
Element eventvd = input.select("#__EVENTVALIDATION]").get(0);
viewstate = viewst.val();
eventvalidation = eventvd.val();
But I always get only the __VIEWSTATE value, and my app crashes when I try to get the __EVENTVALIDATION value. Can someone please explain why, and how to make it work?
select() returns an empty Elements when the selector matches nothing, and calling get(0) on an empty result throws an IndexOutOfBoundsException, which crashes the app. In your case #__EVENTVALIDATION is not being matched among your input elements; note the stray ] at the end of that selector.
Check whether #__EVENTVALIDATION exists in your Elements input before calling get(0).
By the way, you can access any element directly by its id. For example:
doc = Jsoup.connect("http://instantgram.ic.cz/index.html").get();
Elements eventvd = doc.select("input[id=__EVENTVALIDATION]");
I am trying to make a "Highlighter" for an EPUB reader in one of my Android projects, using a WebView.
I am using Rangy to get the selected text.
The serialize function gives me this value after I select text from the sample HTML below:
2/5/3/5/1/2:0,13/5/3/5/1/2:24
I am storing this in the DB. When the user returns to this page, I retrieve the same selection and try to deserialize it, but the deserialize function throws the following error:
Error in Rangy Serializer module: deserializePosition failed: node <DIV>[7] has no child with index 13, 0
Why is this happening?
I even tried doing the same thing using XPath, but I get the same issue.
<html>
<head>
<script></script>
</head>
<body>
<div id="mainpage" class="highlighter-context">
<div> Some text here also....... </div>
<div> Some text here also.........</div>
<div>
<h1 class="heading"></h1>
<div class="left_side">
<ol></ol>
<h1></h1>
<div class="text_bio">
In human beings, height, colour of eyes, complexion, chin, etc. are
some recognisable features. A feature that can be recognised is known as
character or trait. Human beings reproduce through sexual reproduction. In this
process, two individuals one male and another female are involved. Male produces
male gamete or sperm and female produces female gamete or ovum. These gametes fuse
to form zygote which develops into a new young one which resembles to their parent.
During the process of sexual reproduction
</div>
</div>
<div class="righ_side">
Some text here also.........
</div>
<div class="clr">
Some text here also.......
</div>
</div>
</div>
</body>
</html>
Any guesses??
You are probably using the following:
highlighter.highlightSelection();
rangy.serializeSelection();
If you run highlightSelection() before serializing, it will not work.
This is because highlighting wraps the text in tags, which means that the DOM is manipulated.
Deserializing against the original DOM will then fail, because the serialized positions refer to the modified DOM.
Try changing the order of the commands so that you serialize first and only then highlight.
Correct order:
rangy.serializeSelection();
highlighter.highlightSelection();
I am making an Android application that can fetch the new announcements from the website of my university.
This is the HTML code in the website:
sample_html_code http://img690.imageshack.us/img690/1079/88210050.png
Text version:
<table border="1" width="90%" class="duyuru">
<tbody>
<tr>
<td>
<h3 class="duyuru">Additional Quotas for the Technical Electives</h3>
"19/09/2012"
<h4 class="duyuru">"Additional Quotas for Technical Electives offered in...</h4>
<span class="duyuru"></span>
<br>
Download
</td>
</tr>
</tbody>
</table>
I can get the first and third lines "Additional Quotas for Technical Electives" and "Additional Quotas for ..." by using the piece of code below. However, I cannot get the date information (19/09/2012) located between h3 and h4 lines.
String patternStr ="\\<h3 class=\"duyuru\".*?\\>(.*?)\\</h3\\>";
patternStr+="(.*?)"; // This line is problematic
patternStr+=".*?\\<h4 class=\"duyuru\".*?\\>(.*?)\\</h4\\>";
Pattern pattern = Pattern.compile(patternStr, Pattern.DOTALL);
Matcher matcher = pattern.matcher(content);
String name = "";
String date = "";
String details = "";
while (matcher.find()){
name = matcher.group(1);
date = matcher.group(2);
details = matcher.group(3);
Announcement announcement = new Announcement();
announcement.setName(name);
announcement.setDate(date);
announcement.setDetails(details);
announcements.add(announcement);
}
I tried using
.*?\"(.*?)\"
but it didn't work. When I do this, it gets the string "duyuru" from the line starting with h4 tag instead of the date information.
Does anyone have an idea how I can grab the date information?
Thanks in advance.
Your regular expression misses the newlines and whitespace in the input.
The simplest possible match I could come up with is:
"\\<h3 class=\"duyuru\".*?\\>\\n?\\s*(.*?)\\n?\\s*\\</h3\\>"
But keep in mind that such a regular expression is highly specific to your HTML.
My advice would be to have a look at a real HTML parser for Java, such as TagSoup. Once you start using one of those, parsing this type of HTML document becomes a breeze.
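If you stay with a regular expression for now, here is a minimal sketch that captures all three values from the snippet in the question. It assumes the date always has the form dd/mm/yyyy and may or may not be wrapped in double quotes in the real page source. The class name AnnouncementParser and the method parse are hypothetical, and the pattern is just as tied to this particular HTML as the one above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts {name, date, details} triples from the announcement HTML.
public class AnnouncementParser {
    private static final Pattern ANNOUNCEMENT = Pattern.compile(
            "<h3 class=\"duyuru\"[^>]*>\\s*(.*?)\\s*</h3>"     // group 1: name
            + "\\s*\"?(\\d{2}/\\d{2}/\\d{4})\"?\\s*"           // group 2: date (assumed dd/mm/yyyy)
            + "<h4 class=\"duyuru\"[^>]*>\\s*(.*?)\\s*</h4>",  // group 3: details
            Pattern.DOTALL);

    public static List<String[]> parse(String html) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = ANNOUNCEMENT.matcher(html);
        // \\s* between the tags absorbs the newlines and indentation of the page source.
        while (m.find()) {
            rows.add(new String[] { m.group(1), m.group(2), m.group(3) });
        }
        return rows;
    }

    public static void main(String[] args) {
        String html = "<td>\n<h3 class=\"duyuru\">Additional Quotas for the Technical Electives</h3>\n"
                + "\"19/09/2012\"\n"
                + "<h4 class=\"duyuru\">Additional Quotas for Technical Electives offered in...</h4>\n</td>";
        for (String[] row : parse(html)) {
            System.out.println(row[0] + " | " + row[1] + " | " + row[2]);
        }
    }
}
```

Matching the date by its shape, rather than with a bare (.*?), avoids the problem described in the question where the group picked up "duyuru" from the h4 tag.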
I am using Jsoup in my application and I am attempting to parse information from a few input tags in order to add them to a URL and post the data automatically.
The portion of HTML I am attempting to parse is as follows:
<div class='theDivClass'>
<form method="post" id="handlePurchase" name="makePurchase" action="/shop.php">
<input type="hidden" name="ProductCode" value="A1223MN" />
<input type="hidden" name="SystemVersion" value="3" >
<input type="hidden" name="ProductClass" value="BOOK" />
</form>
</div>
The desired output would be
x = A1223MN
y = 3
z = BOOK
I am halfway familiar with JSOUP in the sense that I am able to parse out text, images, and urls but for some reason this is not clicking for me.
Any help would be greatly appreciated.
You should be able to use this:
Elements hidden = doc.select("input[type=hidden]");
And then just pull the attr values from each element in hidden. I've just tried it and it seems to work as expected.
For completeness:
Map<String,String> hiddenList = new HashMap<String, String>();
Elements hidden = doc.select("input[type=hidden]");
for (Element el1 : hidden){
hiddenList.put(el1.attr("name"), el1.attr("value"));
}
Will give you a Map of all hidden input fields in the document.
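Since the goal in the question is to post these values, here is a rough sketch of turning such a map into a URL-encoded form body using only the standard library. The class name FormBodyBuilder and the method encode are hypothetical, and how you actually send the body (for example with HttpURLConnection) is left out.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: joins name/value pairs as name1=value1&name2=value2.
public class FormBodyBuilder {
    public static String encode(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, String> field : fields.entrySet()) {
                if (body.length() > 0) {
                    body.append('&');
                }
                // URLEncoder escapes characters such as '/', '=' and spaces.
                body.append(URLEncoder.encode(field.getKey(), "UTF-8"))
                    .append('=')
                    .append(URLEncoder.encode(field.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available, so this should not happen in practice.
            throw new IllegalStateException(e);
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // A LinkedHashMap keeps the fields in insertion order.
        Map<String, String> hidden = new LinkedHashMap<>();
        hidden.put("ProductCode", "A1223MN");
        hidden.put("SystemVersion", "3");
        hidden.put("ProductClass", "BOOK");
        System.out.println(encode(hidden));
    }
}
```

A HashMap works as well if the order of the fields does not matter to the server.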
You can also select each input by its name, where doc is your parsed Document:
doc.select("input[name=ProductCode]").attr("value");
doc.select("input[name=SystemVersion]").attr("value");
doc.select("input[name=ProductClass]").attr("value");
There's another way I found:
FormElement f = (FormElement) doc.select("form#handlePurchase").first();
System.out.println(f.formData());
Result:
[ProductCode=A1223MN, SystemVersion=3, ProductClass=BOOK]
Closing this question as it appears from all of the research I have done, you cannot pull data from "hidden" input types.