Using jsoup's select, I've managed to extract the following HTML. I'm trying to get all the HTML between <a id="dd_start"></a> and <a id="dd_end"></a>.
I've used obj.first().getElementsByClass("div.dd_outer").remove() with no luck.
Any suggestions?
<div class="entry-content" itemprop="text">
<a id="dd_start"></a>
<p><img class="size-full wp-image-21501 aligncenter" src="http://blablabla.com/wp-content/uploads/16/01/google1.jpg" alt="google-icon" width="100%"></p>
<p>blablabla.<br> <span id="more-21499"></span><br> blablabla.</p>
<p>blablabla blablabla. </p>
<a id="dd_end"></a>
<div class="dd_outer">
<div class="dd_inner">
<div id="dd_ajax_float">
<div class="dd_button_v">
</div>
</div>
</div>
</div>
This works for the snippet you posted. You might want to make some changes to handle edge cases, errors etc.
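import java.nio.file.Files;
import java.nio.file.Paths;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ExtractSection { // class name is arbitrary
    public static void main(String[] args) throws Exception {
        String html = new String(Files.readAllBytes(Paths.get("input.html")));
        Document doc = Jsoup.parse(html);
        Elements section = new Elements();
        // Walk the element siblings after dd_start until dd_end (or the end of the parent) is reached.
        Element sibling = doc.getElementById("dd_start").nextElementSibling();
        while (sibling != null && !"dd_end".equals(sibling.id())) {
            section.add(sibling);
            sibling = sibling.nextElementSibling();
        }
        System.out.println(section);
    }
}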
As for removing a certain section, you can do this:
Document doc = Jsoup.parse(html);
doc.select("div.dd_outer").first().remove();
System.out.println(doc);
This removes the section from your Document object. first() returns an Element, and Element.remove() detaches that element from its parent node, so the Document itself changes. Keep in mind that first() returns null when nothing matches, so check for that if the div may be absent. To remove every match rather than only the first one, call remove() on the result of select() directly:
doc.select("div.dd_outer").remove();
Elements.remove() removes each matched element from the DOM, so this changes the Document as well. Your original call had no effect because getElementsByClass() expects a bare class name such as "dd_outer", not a CSS selector such as "div.dd_outer", so it matched nothing and there was nothing to remove.
I'm trying to parse data from HTML. I need to get specific content from the HTML, where the ordering of the elements or the content itself may differ from page to page.
<h1>Latest Deals</h1>
</div>
</div>
</div>

<div class="breadcrumb-wrapper">
  <ul class="breadcrumb">
    <li>Home</li>
    <li>Deals</li>
    <li class="active">Mau Mudik Hemat? Nikmati Diskon Hingga 20%</li>
  </ul>
</div>

<div class="article outer clearfix">
  <div class="col-sm-12">
    <img alt="Mau Mudik Hemat? Nikmati Diskon Hingga 20%" title="Mau Mudik Hemat? Nikmati Diskon Hingga 20%" src="/images/slider/id/special-raya-offer-id-v2.jpg">
    <h1>Mau Mudik Hemat? Nikmati Diskon Hingga 20%</h1>
    <p class="date">May 18th, 2018</p>
    <p><strong class="text-red"></strong></p>

    <p>This is the first paragraph</p>

    <p>This is the second paragraph.</p>

    <p>This is the third paragraph</p>

    <p>Below is the point form start:</p>

    <ol>
      <li>Point form A</li>
      <li>Point form B</li>
      <li>Point form C</li>
      <li>Point form D</li>
    </ol>

    <div class="m-top30 m-bottom20">
      Home
    </div>
Previously I successfully got the content I wanted via:
Document doc = Jsoup.parse(content);
Element eTitle = doc.getElementsByTag("h1").get(1);
Elements eBody = doc.getElementsByTag("p");
for (Element body : eBody) {
    detailContent += "<p>" + body.html() + "</p>";
}
The code above gets the second "h1" (index 1, the article title) and every "p" element from my long HTML. However, in some cases there may now be "ol" elements in between those "p" elements. For example:
<div class="col-sm-12">
  <img alt="abc" title="abcd" src="/images/slider/id/abcd.jpg">
  <h1>This is the header</h1>
  <p class="date">November 4th, 2015</p>
  <p><strong class="text-red">Sorry, this promotion has expired.</strong></p>
  <p> Paragraph 1 </p>
  <p> Paragraph 2 </p>
  <ol>
    <li> Point 1 </li>
    <li> Point 2 </li>
  </ol>
  <p> Paragraph 3 </p>
  <p> Paragraph 4 </p>
  <ol>
    <li> Point 1 </li>
    <li> Point 2 </li>
  </ol>
  <div class="m-top30 m-bottom20">
How should I write my code to get all of these items?
P.S. All I want to do is:
1) Get the elements in the "col-sm-12" div, up to the last element before "m-top30 m-bottom20"
2) Ignore certain elements contained in "col-sm-12"
Switching to CSS selectors and restricting the match to the 'p' and 'ol' children of the first div should help. However, from the HTML above it is not clear whether the first div ends before the second div starts. If you share more details about the HTML, maybe we can refine the selectors. I have stated my assumptions in the code comments.
String eTitle = doc.select("div.col-sm-12 > h1").text(); // I'm assuming you are trying to fetch the title text.
// Both alternatives are scoped to the div; a bare ", ol" would match every 'ol' in the document.
Elements eBody = doc.select("div.col-sm-12 > p, div.col-sm-12 > ol");
for (Element body : eBody) {
    // work with the 'body' element here.
}
My website contains 149 of these blocks:
<!-- Begin Module Image -->
<div class="module-img">
<a href="http://prodigy.co.id/news/events/youtube-viewer-event/" >
<img src="http://prodigy.co.id/wp-content/uploads/Prodigy_Sticky_YoutubeViewer.png" width="280" height="150" alt="Youtube Viewer Event!" />
<span></span>
</a>
<div class="lightboxLink">
<a class="popLink boxLink" href="http://prodigy.co.id/wp-content/uploads/Prodigy_Sticky_YoutubeViewer.png" data-rel="prettyPhoto[Youtube Viewer Event!]" title="Youtube Viewer Event!"></a>
</div>
<div class="thumbLink">
<a class="popLink" href="http://prodigy.co.id/news/events/youtube-viewer-event/" title="Full Post"></a>
</div>
</div>
<!-- End Module Image -->
Here's my parser:
Document document = Jsoup.connect(Server.EXPLORE_LINK).timeout(10 * 1000).get();
Elements divs = document.select("div[class=module-img] a[href]");
for (Element div : divs) {
try {
href = div.attr("href");
Elements a = document.select("a[href=" + href + "] img[src]");
src = a.attr("src");
if (!src.startsWith("http://"))
src = src.substring(src.indexOf("http://"));
hrefs.add(href);
srcs.add(src);
} catch (Exception any) {
any.printStackTrace();
}
}
I want href to be http://prodigy.co.id/news/events/youtube-viewer-event/ and src to be http://prodigy.co.id/wp-content/uploads/Prodigy_Sticky_YoutubeViewer.png, once for each of the 149 blocks. What confuses me is that divs contains 444 elements, not 149 as it should.
Forgive my laziness, but I'm new to this JSON thing and I've been googling for hours looking for answers.
Are you sure the size is 444? It would make sense if it were 447 (3 × 149).
Your selector matches all three links in each block of your HTML. A space between two selectors is the descendant combinator, which means any number of elements may sit in between. If you want to select direct children only, you have to put '>' in between:
Elements divs = document.select("div[class=module-img] > a[href]");
PS: you could use
.classname
instead of
div[class=classname]
I've never used the jsoup API, but looking at your selector, it seems you're querying for ALL <a> tags that are DESCENDANTS of <div class="module-img">. Note that there are 3 <a> tags inside each module. This would explain the number 444, since 148 × 3 = 444. (You said there are 149, but perhaps the first or the last occurrence is not being counted.)
Anyway, try this:
Elements divs = document.select("div[class=module-img] > a[href]");
It should list only the <a> elements that are DIRECT CHILDREN of the given <div>.
Here's more about selectors and combinators.
I am working on an Android application that uses Jsoup to extract text from a website. In the HTML below I want to get the text of the parent div only: the date and time, which sit directly inside the parent div with class "fr".
<div class="fr">
<div id="newssource">
<a href="http://nhl.com" class="newssourcelink" target="_blank">
Philadelphia Flyers
</a>
</div>
April 15, 2014, 11:13 a.m.
</div>
What I have tried.
for(Element detailsDate:document.getElementsByClass("fr")){
newsDate.add(detailsDate.clone().children().remove().last().text().trim());
}
It only gets the text from the child div, i.e. "Philadelphia Flyers" inside the "a" tag, but I want to display only the date and time.
Use the jQuery code below to get only the date and time from the "fr" div:
<script>
jQuery('.fr').each(function(){
var textValue = jQuery(this).clone() //clone the element
.children() //select all the children
.remove() //remove all the children
.end() //again go back to selected element
.text();
alert(textValue);
});
</script>
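I found the answer myself. Posting it here for anyone else with the same issue.
for (Element detailsDate : document.getElementsByClass("fr")) {
    // The date is a text node, not an element, so nextSibling() is needed here;
    // nextElementSibling() would return null. detailsDate.ownText() is a shorter alternative.
    newsDate.add(detailsDate.getElementById("newssource").nextSibling().toString().trim());
}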
I have multiple pages in one HTML file.
I am trying to implement the pageinit event handler on the second data-role="page",
so I declared pageinit inside that specific data-role="page".
<div data-role="page" id="foo3" data-dom-cache="false">
<script>
$(document).on('pageinit','#foo3' , function(){
abcsong_file_path = '/android_asset/www/audio/abcsong.mp3';
my_abc = new Media(abcsong_file_path);
my_abc.play();
var i =0;
var time;
function my_loop(){
setTimeout(function (){
var my_alphabets = ["a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"];
$('#content_loop2').append('<img src="img/alphabets/'+my_alphabets[i]+'.png" />');
i++;
time = 700;
if(i<26)
{
my_loop();
}
}, time)
}
my_loop();
});
</script>
<div data-role="header" data-theme="b">
</div>
<div data-role="content" >
<div id="content_loop2" data-inset="true">
</div>
</div>
<div data-role="footer" >
</div>
</div>
I expected it to initialize every time I visit this page, but it runs correctly only the first time I open it. On every later visit it just shows the output of the previously executed code.
Please help me figure out how to go about this.
Pageinit is meant to run only once per page; it was designed to work just like document ready.
If you want your code to run every time the page is visited, use pageshow or pagebeforeshow instead.
Read more about it here.
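As a rough sketch of that suggestion, the animation from the question could be restarted on each visit by binding to pageshow and clearing the container first. The helper name startAlphabetLoop, its parameters, and the injectable schedule function are assumptions made for illustration (they are not part of jQuery Mobile). The binding at the bottom assumes jQuery Mobile 1.3-style page events and reuses the #foo3 and #content_loop2 ids from the question; audio playback is left out.

```javascript
// Hypothetical helper: calls appendImage(src) once per letter, spaced delayMs apart.
// `schedule` defaults to setTimeout; it is injectable so the loop can be driven manually.
function startAlphabetLoop(appendImage, delayMs, schedule) {
  var run = schedule || setTimeout;
  var letters = "abcdefghijklmnopqrstuvwxyz".split("");
  var i = 0; // local state, so every call starts again from "a"

  function step() {
    if (i >= letters.length) {
      return; // all 26 images have been appended
    }
    appendImage("img/alphabets/" + letters[i] + ".png");
    i++;
    run(step, delayMs);
  }

  run(step, delayMs);
}

// Only bind when jQuery is present (i.e. in the browser, with jQuery Mobile loaded).
if (typeof jQuery !== "undefined") {
  jQuery(document).on("pageshow", "#foo3", function () {
    var container = jQuery("#content_loop2");
    container.empty(); // drop the images left over from the previous visit
    startAlphabetLoop(function (src) {
      container.append('<img src="' + src + '" />');
    }, 700);
  });
}
```

Keeping the counter inside the helper means each visit starts from a clean state. The sketch does not cancel a loop that is still running when the page is left mid-animation; that would need a clearTimeout call, for example in a pagehide handler.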
Pageinit should fire only once, according to the docs, but at least in previous versions that was not the actual truth.
I am currently using jQM 1.3.2 and no longer experiencing this problem on Android or in desktop browser. Pay attention to it though, especially if you are also using Phonegap.
Document pageinit fires more than once on iOS (jQueryMobile)
I am new to jQuery Mobile and a bit confused about sending data to another page. My HTML file contains a listview. When I click a list item, it needs to redirect to another page, and at the same time I want to send that item's data to the second page.
My code is:
Polls.html
<body>
<section id="pollsscreen" data-role="page">
<header data-role="header" >
<h1> Video </h1>
Back
</header>
<div data-role="content">
<ul id="pollslist" data-role="listview" data-theme="e" >
</ul>
</div>
</section>
</body>
Polls.js (my javascript file)
var addPollsList = function addPollsList() {
$("#pollslist").prepend("<li><a href='pollsdetails.html'>What is your name?</a></li>");
$("#pollslist").prepend("<li><a href='pollsdetails.html'>Give review of new movie..</a></li>");
$("#pollslist").listview("refresh");
}
pollsdetails.html
<section id="pollsdetailspage" data-role="page">
<header data-role="header">
<h1> Polls </h1>
Back
</header>
<div data-role="content" class="pollsdetailsscreen">
<p>Polls details page</p>
</div>
</section>
</body>
With this code everything works: when I click a list item, it redirects to "pollsdetails.html". But I have no idea how to send the data to that HTML page, and I want to set that data on a tag in the pollsdetails.html page. Can anyone help me with this?
Thanks in advance.
You can use global variables for this.
In Polls.js
var addPollsList = function addPollsList() {
$("#pollslist").prepend('<li>What is your name?</li>');
$("#pollslist").prepend('<li>Give review of new movie..</li>');
$("#pollslist").listview("refresh");
}
function test1(data) {
    // set your global variable here
    global_variable = data;
    $.mobile.changePage("pollsdetails.html");
}
function test2(data) {
    // set your second global variable here
    global_variable_2 = data;
    $.mobile.changePage("pollsdetails.html");
}
In pollsdetails.js you can then access global_variable and global_variable_2. This works only while jQuery Mobile loads the next page via Ajax into the same document; a full page reload discards the globals.
You could use Query string parameters if you are trying to pass simple name-value pairs to the next page.
var addPollsList = function addPollsList() {
$("#pollslist").prepend("<li><a href='pollsdetails.html?myparam=value1'>What is your name?</a></li>");
$("#pollslist").prepend("<li><a href='pollsdetails.html?myparam=value2'>Give review of new movie..</a></li>");
$("#pollslist").listview("refresh");
}
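On the receiving side, the parameter still has to be read back out of the URL. Here is a minimal sketch, assuming a runtime that provides the standard URL API (modern browsers, recent Node.js). The function name getQueryParam and the example.com base URL are made up for illustration; myparam is the parameter name used above. The href is passed in explicitly so the sketch does not assume where the URL comes from.

```javascript
// Hypothetical helper: returns the value of query parameter `name` in `href`,
// or null when the parameter is absent.
function getQueryParam(href, name) {
  // The base URL is only there to resolve relative links such as "pollsdetails.html?...".
  var url = new URL(href, "http://example.com/");
  return url.searchParams.get(name);
}

// In the browser this might be called as getQueryParam(window.location.href, "myparam").
var value = getQueryParam("pollsdetails.html?myparam=value1", "myparam");
console.log(value);
```

Query strings suit short, simple values; anything structured is probably easier to pass another way.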
I would go with localStorage:
var userReview = {
"Name":"Bob",
"Review":"Awesome party movie!!",
"MovieTitle":"Project X"
};
//Store the object in local storage
localStorage.setItem('UserReview',JSON.stringify(userReview));
Then you can always read from localStorage to get the info back when you navigate to a different page:
var userReview = JSON.parse(localStorage.getItem("UserReview"));
var userName = userReview.Name;
var review = userReview.Review;
var movieTitle = userReview.MovieTitle;
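One thing to watch for: getItem returns null when nothing has been stored yet, and JSON.parse(null) yields null, so reading userReview.Name would then throw. A small defensive sketch follows. The helper names saveReview and loadReview and the in-memory stand-in createMemoryStorage are assumptions for illustration; the stand-in exists only so the sketch runs outside a browser, where you would pass window.localStorage instead.

```javascript
// Hypothetical in-memory stand-in exposing the getItem/setItem subset of the Storage API.
function createMemoryStorage() {
  var data = {};
  return {
    getItem: function (key) {
      return Object.prototype.hasOwnProperty.call(data, key) ? data[key] : null;
    },
    setItem: function (key, value) {
      data[key] = String(value);
    }
  };
}

// Serializes the review object under the same key used above.
function saveReview(storage, review) {
  storage.setItem("UserReview", JSON.stringify(review));
}

// Returns the stored review object, or null if nothing (or invalid JSON) is stored.
function loadReview(storage) {
  var raw = storage.getItem("UserReview");
  if (raw === null) {
    return null;
  }
  try {
    return JSON.parse(raw);
  } catch (e) {
    return null;
  }
}

// In a browser, pass window.localStorage here instead of the stand-in.
var storage = createMemoryStorage();
saveReview(storage, { Name: "Bob", Review: "Awesome party movie!!", MovieTitle: "Project X" });
var review = loadReview(storage);
console.log(review ? review.Name : "no review stored");
```

Passing the storage object in as a parameter keeps the helpers independent of the global, which also makes the null path easy to exercise.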