My website contains 149 of these blocks:
<!-- Begin Module Image -->
<div class="module-img">
<a href="http://prodigy.co.id/news/events/youtube-viewer-event/" >
<img src="http://prodigy.co.id/wp-content/uploads/Prodigy_Sticky_YoutubeViewer.png" width="280" height="150" alt="Youtube Viewer Event!" />
<span></span>
</a>
<div class="lightboxLink">
<a class="popLink boxLink" href="http://prodigy.co.id/wp-content/uploads/Prodigy_Sticky_YoutubeViewer.png" data-rel="prettyPhoto[Youtube Viewer Event!]" title="Youtube Viewer Event!"></a>
</div>
<div class="thumbLink">
<a class="popLink" href="http://prodigy.co.id/news/events/youtube-viewer-event/" title="Full Post"></a>
</div>
</div>
<!-- End Module Image -->
Here's my parser:
Document document = Jsoup.connect(Server.EXPLORE_LINK).timeout(10 * 1000).get();
Elements divs = document.select("div[class=module-img] a[href]");
for (Element div : divs) {
try {
href = div.attr("href");
Elements a = document.select("a[href=" + href + "] img[src]");
src = a.attr("src");
if (!src.startsWith("http://"))
src = src.substring(src.indexOf("http://"));
hrefs.add(href);
srcs.add(src);
} catch (Exception any) {
any.printStackTrace();
}
}
I want href to be http://prodigy.co.id/news/events/youtube-viewer-event/ and src to be http://prodigy.co.id/wp-content/uploads/Prodigy_Sticky_YoutubeViewer.png, once for each of the 149 blocks. What confuses me is that divs contains 444 elements, not 149 as it should.
Forgive my laziness, but I'm new to this JSON thing and I've been googling for hours looking for answers.
Are you sure the size is 444? It would make sense if it were 447 (149 x 3).
Your selector matches all three links in each module of your HTML. A space in a selector means that any number of elements may sit in between. If you want to select direct child nodes only, you have to put '>' in between:
Elements divs = document.select("div[class=module-img] > a[href]");
PS: you could use
.classname
instead of
div[class=classname]
I've never used the jsoup API, but looking at your selector, it seems you're querying for ALL <a> tags that are DESCENDANTS of <div class="module-img">. Note that there are 3 <a> elements inside each module. This would explain the number 444, as 148x3=444. (You said there are 149, but perhaps the first or the last occurrence is not being counted.)
Anyway, try this:
Elements divs = document.select("div[class=module-img] > a[href]");
It should list only the <a> elements that are DIRECT CHILDREN of the given <div>.
Here's more about selectors and combinators.
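The difference between the two combinators can be checked on a single module. The sketch below is only a stand-in: so that it runs without jsoup on the classpath, it uses the JDK's XPath engine, where `//` plays the role of the CSS space (any descendant) and `/` the role of `>` (direct child). The class name `CombinatorDemo`, the helper `count`, and the trimmed, XML-well-formed copy of the module markup are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class CombinatorDemo {
    // One module from the question, trimmed to the parts that matter here.
    static final String MODULE =
        "<div class=\"module-img\">"
      + "<a href=\"http://prodigy.co.id/news/events/youtube-viewer-event/\">"
      + "<img src=\"http://prodigy.co.id/wp-content/uploads/Prodigy_Sticky_YoutubeViewer.png\"/>"
      + "<span></span></a>"
      + "<div class=\"lightboxLink\">"
      + "<a class=\"popLink boxLink\" href=\"http://prodigy.co.id/wp-content/uploads/Prodigy_Sticky_YoutubeViewer.png\"></a>"
      + "</div>"
      + "<div class=\"thumbLink\">"
      + "<a class=\"popLink\" href=\"http://prodigy.co.id/news/events/youtube-viewer-event/\"></a>"
      + "</div>"
      + "</div>";

    // Hypothetical helper: number of nodes the XPath expression selects in MODULE.
    static int count(String xpathExpr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hits = (NodeList) xpath.evaluate(
                xpathExpr, new InputSource(new StringReader(MODULE)), XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Like "div[class=module-img] a[href]": every <a> anywhere below the div.
        System.out.println(count("//div[@class='module-img']//a[@href]")); // 3
        // Like "div[class=module-img] > a[href]": only <a> that is a direct child.
        System.out.println(count("//div[@class='module-img']/a[@href]"));  // 1
    }
}
```

Three matches per module with the descendant form and one with the child form is what produces a total three times larger than the number of modules.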
I'm trying to parse data from HTML. I need to extract specific content from the HTML code, but the order of the elements in it may vary.
<h1>Latest Deals</h1>\r\n </div>\r\n </div>\r\n</div>\r\n\r\n
<div class=\"breadcrumb-wrapper\">\r\n
<ul class=\"breadcrumb\">\r\n
<li>Home</li>\r\n
<li>Deals</li>\r\n
<li class=\"active\">Mau Mudik Hemat? Nikmati Diskon Hingga 20%</li>\r\n
</ul>\r\n</div>\r\n\r\n
<div class=\"article outer clearfix\">\r\n
<div class=\"col-sm-12\">\r\n
<img alt=\"Mau Mudik Hemat? Nikmati Diskon Hingga 20%\" title=\"Mau Mudik Hemat? Nikmati Diskon Hingga 20%\" src=\"/images/slider/id/special-raya-offer-id-v2.jpg\">\r\n
<h1>Mau Mudik Hemat? Nikmati Diskon Hingga 20%</h1>\r\n
<p class=\"date\">May 18th, 2018</p>\r\n
<p><strong class=\"text-red\"></strong></p>\r\n\r\n
<p>This is the first paragraph</p>\r\n\r\n
<p>This is the second paragraph.</p>\r\n\r\n
<p>This is the third paragraph</p>\r\n\r\n
<p>Below is the point form start:</p>\r\n\r\n
<ol>\r\n
<li>Point form A</li>\r\n
<li>Point form B</li>\r\n
<li>Point form C</li>\r\n
<li>Point form D</li>\r\n
</ol>\r\n\r\n\r\n\r\n
<div class=\"m-top30 m-bottom20\">\r\n
Home\r\n\r\n \r\n\r\n\r\n</div>\r\n\r\n\r\n
Previously I successfully got the content I wanted via:
Document doc = Jsoup.parse(content);
Element eTitle = doc.getElementsByTag("h1").get(1);
Elements eBody = doc.getElementsByTag("p");
for (Element body : eBody) {
detailContent += "<p>" + body.html() + "</p>";
}
The code above gets the title "h1" and every "p" element from my long HTML code. However, in some cases there may now be "ol" elements in between those "p" elements. For example:
<div class=\"col-sm-12\">\r\n <img alt=\"abc\" title=\"abcd\" src=\"/images/slider/id/abcd.jpg\">\r\n
<h1>This is the header</h1>\r\n
<p class=\"date\">November 4th, 2015</p>\r\n
<p><strong class=\"text-red\">Sorry, this promotion has expired.</strong></p>\r\n
<p> Paragraph 1 </p>\r\n
<p> Paragraph 2 </p>\r\n
<ol>\r\n
<li> Point 1 </li>\r\n
<li> Point 2 </li>\r\n
</ol>\r\n
<p> Paragraph 3 </p>\r\n
<p> Paragraph 4 </p>\r\n
<ol>\r\n
<li> Point 1 </li>\r\n
<li> Point 2 </li>\r\n
</ol>\r\n
<div class=\"m-top30 m-bottom20\">
How should I write my code to get all these items?
P.S. All I want to do is:
1) Get the elements inside the "col-sm-12" div, up to the last element before "m-top30 m-bottom20"
2) Ignore certain elements contained in "col-sm-12"
Switching to CSS selectors and restricting the match to 'p' and 'ol' elements directly under the first div can help you. However, from the HTML above it is not clear whether the first div ends before the second div starts. If you share more details about the HTML, we may be able to refine the selectors. I have stated my assumptions in the code comments.
String eTitle = doc.select("div.col-sm-12 > h1").text(); // I'm assuming you are trying to fetch the title text.
Elements eBody = doc.select("div.col-sm-12 > p, div.col-sm-12 > ol"); // Each alternative repeats the div prefix; a bare ", ol" would match every 'ol' in the document.
for (Element body : eBody) {
// work with the 'body' element here.
}
I am creating a very simple WebView application on android. However, I want to edit the html file before displaying it in the WebView.
For example, if the original html source looked like :
<html>
<body>
<h1> abc </h1>
<h2> abc </h2>
......
<h6> abc </h6>
</body>
</html>
And I want to change it to:
<html>
<body>
<h1> cba </h1>
<h2> cba </h2>
......
<h6> cba </h6>
</body>
</html>
(all "abc" become "cba")
And then, I want to display that new code in my WebView. How can I do this? thanks
I am not sure why you need this or what kind of app requires it, but if you have to do it, check the following code:
$(function() {
for(var i =0;i<101;i++) {
if(jQuery('h'+i).length)
jQuery('h'+i).html(jQuery('h'+i).html().split("").reverse().join(""));
}
});
First, a note on your header tags: <h100> is a common misconception among newcomers. <h_> tags are simply a way of organizing your page, and they only go up to <h6>. You can have multiple <h1> tags on the same page, which are just headings for that section of content (with <h2> implying a subsection of <h1>, etc).
From there, when you say "original source", I assume you mean this is your own code, correct? Not a WebView sourced from another site? If this is the case, and you are only looking to change a specific instance of a specific string in your own code, a Find and Replace should be sufficient via any text or code editor you are using.
But if this is the case, you might want to look into first learning HTML and being able to render it in a basic web browser before moving on to also trying to learn Android.
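If the HTML is not your own and has to be changed at run time, one option is to rewrite the markup as a string before handing it to the WebView. The sketch below assumes the app already holds the page source as a `String` (how it is fetched is not shown), and the names `HtmlRewriter` and `rewrite` are made up for illustration. It is a plain literal replacement, so it would also touch tag names or attribute values that happen to contain the search text.

```java
final class HtmlRewriter {
    // Hypothetical helper: replaces every literal occurrence of `from` with `to`.
    // No HTML parsing is done, so markup containing `from` is changed as well.
    static String rewrite(String html, String from, String to) {
        return html.replace(from, to);
    }

    public static void main(String[] args) {
        String original = "<html><body><h1> abc </h1><h2> abc </h2></body></html>";
        String changed = rewrite(original, "abc", "cba");
        System.out.println(changed);
        // On Android the result could then be shown with, for example,
        // webView.loadDataWithBaseURL(null, changed, "text/html", "UTF-8", null);
    }
}
```

For anything beyond a fixed string, parsing the document (for instance with jsoup, which the other threads here use) and editing text nodes is likely to be more robust than string replacement.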
I am working on an Android application that uses Jsoup to extract text from a website. In the HTML below I want to get the text of the parent div only: the date and time, which sit directly inside the parent div with class "fr".
<div class="fr">
<div id="newssource">
<a href="http://nhl.com" class="newssourcelink" target="_blank">
Philadelphia Flyers
</a>
</div>
April 15, 2014, 11:13 a.m.
</div>
What I have tried.
for(Element detailsDate:document.getElementsByClass("fr")){
newsDate.add(detailsDate.clone().children().remove().last().text().trim());
}
It only gets the text of the child div, i.e. "Philadelphia Flyers" inside the "a" tag, but I want the date and time only.
Use the jQuery code below to get only the date and time from the "fr" div:
<script>
jQuery('.fr').each(function(){
var textValue = jQuery(this).clone() //clone the element
.children() //select all the children
.remove() //remove all the children
.end() //again go back to selected element
.text();
alert(textValue);
});
</script>
I found the answer myself. I'm posting it here for anyone else with the same issue.
for(Element detailsDate:document.getElementsByClass("fr")){
newsDate.add(detailsDate.getElementById("newssource").nextSibling().toString().trim()); // the date is a text node, so use nextSibling(), not nextElementSibling()
}
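What is wanted here is the text that belongs directly to the "fr" div, without the text of its child elements. jsoup exposes this as `Element.ownText()`. The sketch below shows the same idea with the JDK's built-in DOM so that it runs without jsoup: it walks the direct children and keeps only the text nodes. The class `OwnTextDemo`, the helpers `ownText` and `dateOf`, and the compacted copy of the markup are assumptions for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class OwnTextDemo {
    // Concatenates only the text nodes that are direct children of `element`.
    static String ownText(Element element) {
        StringBuilder text = new StringBuilder();
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.TEXT_NODE) {
                text.append(child.getNodeValue());
            }
        }
        return text.toString().trim();
    }

    // Hypothetical helper: parses a well-formed snippet and returns the root's own text.
    static String dateOf(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
            return ownText(root);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String snippet =
            "<div class=\"fr\">"
          + "<div id=\"newssource\">"
          + "<a href=\"http://nhl.com\" class=\"newssourcelink\" target=\"_blank\">Philadelphia Flyers</a>"
          + "</div>"
          + "April 15, 2014, 11:13 a.m."
          + "</div>";
        System.out.println(dateOf(snippet)); // April 15, 2014, 11:13 a.m.
    }
}
```

With jsoup, the loop in the question would reduce to something like `newsDate.add(detailsDate.ownText())`, assuming `newsDate` is a list of strings.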
I am creating "Highlighter" for Android in WebView.
I am getting XPath expression for the selected Range in HTML through a function as follows
/HTML[1]/BODY[1]/DIV[1]/DIV[3]/DIV[1]/DIV[1]/text()[5]
Now i am evaluating the above XPath expression through this function in javascript
var resNode = document.evaluate('/HTML[1]/BODY[1]/DIV[1]/DIV[3]/DIV[1]/DIV[1]/text()[5]',document,null,XPathResult.FIRST_ORDERED_NODE_TYPE ,null);
var startNode = resNode.singleNodeValue;
but I am getting the startNode 'null'.
But, here is the interesting point:
if I evaluate this '/HTML[1]/BODY[1]/DIV[1]/DIV[3]/DIV[1]/DIV[1]' XPath expression using the same function, it gives the proper node i.e. a 'div'.
The difference between the two XPaths is that the former ends in a text node and the latter ends in a div.
But the same thing is working fine on Desktop browsers.
Edited
Sample HTML
<html>
<head>
<script></script>
</head>
<body>
<div id="mainpage" class="highlighter-context">
<div> Some text here also....... </div>
<div> Some text here also.........</div>
<div>
<h1 class="heading"></h1>
<div class="left_side">
<ol></ol>
<h1></h1>
<div class="text_bio">
In human beings, height, colour of eyes, complexion, chin, etc. are
some recognisable features. A feature that can be recognised is known as
character or trait. Human beings reproduce through sexual reproduction. In this
process, two individuals one male and another female are involved. Male produces
male gamete or sperm and female produces female gamete or ovum. These gametes fuse
to form zygote which develops into a new young one which resembles to their parent.
During the process of sexual reproduction
</div>
</div>
<div class="righ_side">
Some text here also.........
</div>
<div class="clr">
Some text here also.......
</div>
</div>
</div>
</body>
</html>
getting XPath:
var selection = window.getSelection();
var range = selection.getRangeAt(0);
var xpJson = '{startXPath :"'+makeXPath(range.startContainer)+
'",startOffset:"'+range.startOffset+
'",endXPath:"'+makeXPath(range.endContainer)+
'",endOffset:"'+range.endOffset+'"}';
function to make XPath:
function makeXPath(node, currentPath) {
currentPath = currentPath || '';
switch (node.nodeType) {
case 3:
case 4:return makeXPath(node.parentNode, 'text()[' + (document.evaluate('preceding-sibling::text()', node, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null).snapshotLength + 1) + ']');
case 1:return makeXPath(node.parentNode, node.nodeName + '[' + (document.evaluate('preceding-sibling::' + node.nodeName, node, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null).snapshotLength + 1) + ']' + (currentPath ? '/' + currentPath : ''));
case 9:return '/' + currentPath;default:return '';
}
}
I am working with HTML in a WebView, not with XML.
I also tried Rangy: its "serialize" works properly, but its "deserialize" does not.
Any ideas what's going wrong?
UPDATE
Finally found the root cause of the problem (no solution yet :( ).
What exactly is happening in the Android WebView: somehow the WebView changes the DOM structure of the loaded HTML page. Even though the DIV does not contain any separate TEXTNODEs, when I select text in that DIV I get a TEXTNODE for every single line in it. For example, for the same HTML page and the same text selection, the XPath obtained from the WebView is entirely different from the one a desktop browser gives.
XPath from Desktop Browser:
startXPath /HTML[1]/BODY[1]/DIV[1]/DIV[3]/DIV[1]/DIV[1]/text()[1]
startOffset: 184
endXPath: /HTML[1]/BODY[1]/DIV[1]/DIV[3]/DIV[1]/DIV[1]/text()[1]
endOffset: 342
Xpath from webview:
startXPath :/HTML[1]/BODY[1]/DIV[1]/DIV[3]/DIV[1]/DIV[1]/text()[3]
startOffset:0
endXPath:/HTML[1]/BODY[1]/DIV[1]/DIV[3]/DIV[1]/DIV[1]/text()[4]
endOffset:151
Well, in your sample the path /HTML[1]/BODY[1]/DIV[1]/DIV[3]/DIV[1]/DIV[1]/text()[5] selects the fifth text child node of this div element:
<div class="text_bio">
In human beings, height, colour of eyes, complexion, chin, etc. are
some recognisable features. A feature that can be recognised is known as
character or trait. Human beings reproduce through sexual reproduction. In this
process, two individuals one male and another female are involved. Male produces
male gamete or sperm and female produces female gamete or ovum. These gametes fuse
to form zygote which develops into a new young one which resembles to their parent.
During the process of sexual reproduction
</div>
That div has a single text child node so I don't see why text()[5] should select anything.
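One reading that fits both this observation and the question's update (where the WebView reports text()[3] and text()[4]) is that the same characters are held as several adjacent text nodes in one environment and as a single node in the other. The DOM defines `normalize()` to merge adjacent text nodes, and the sketch below shows its effect using the JDK's W3C DOM; browsers expose the same method on nodes. The class and method names are made up for illustration, and the thread does not confirm that normalizing before building or evaluating the XPath fixes the WebView case.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class TextNodeSplitDemo {
    // Appends each fragment as its own text node, then reports the number of
    // child nodes before and after normalize() merges adjacent text nodes.
    static int[] childCountsBeforeAndAfterNormalize(String... fragments) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element div = doc.createElement("div");
            doc.appendChild(div);
            for (String fragment : fragments) {
                div.appendChild(doc.createTextNode(fragment));
            }
            int before = div.getChildNodes().getLength();
            div.normalize();
            int after = div.getChildNodes().getLength();
            return new int[] {before, after};
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int[] counts = childCountsBeforeAndAfterNormalize(
            "In human beings, height, colour of eyes, ",
            "complexion, chin, etc. are ",
            "some recognisable features.");
        System.out.println(counts[0] + " -> " + counts[1]); // 3 -> 1
    }
}
```

An index such as text()[5] is only meaningful against the same split of text nodes that existed when the path was generated, which is why a path recorded in one state of the DOM can come back null in another.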
My problem appears on the internal Android browser in combination with JQuery Mobile. When I reload the current page the content shrinks to fit text into the listview.
More in Detail:
The code works fine on iPhone, in desktop mobile-emulation tools and in Firefox on Android. However, on the internal Android browser I have this weird issue with the code below. See my edit below.
What I've tried so far:
I've played a lot with the viewport meta tag. Still, I don't think that's the problem, because the content is displayed correctly on every other page of my app.
<meta name='viewport' content='width=device-width,initial-scale=1,maximum-scale=1'>
$('meta[name=viewport]').attr('content','width='+$(window).width()+',user-scalable=no');
like these posts suggest:
JQuery Mobile Device Scaling
Full webpage and disabled zoom viewport meta tag for all mobile browsers
My Code:
<html>
<head>
<meta name="viewport" content="width=650">
<!-- CSS and Scripts-->
</head>
<body>
<!-- Page Wrapper -->
<div data-role="page">
<section data-role="content">
<h2>
Code Sample
</h2>
<div class="ui-grid-solo">
<p style="margin-bottom: 38px;">
A
B
C
</p>
</div>
<!-- Dynamic content-->
<ul data-role="listview" data-inset="false">
<!-- Use ?id to grab and display data (CodeBehind.vb)-->
</ul>
</section>
</div>
</body>
</html>
Does anyone have an idea, or has anyone fought with a similar problem?
Edit:
I'm on to something: the problem appears to happen in this piece of code:
<!-- Dynamic content-->
<ul data-role="listview" data-inset="false">
<!-- Use ?id to grab and display data (CodeBehind.vb)-->
</ul>
Normally the listView cuts off text items that are too big and puts "..." at the end so that they fit on the screen. In my case it still does that, but the text has way too many characters before the shortening happens. The result is that everything scales down. How should I solve this?
Since I got no answers on this one, I'm posting my fix:
Only on mobile Safari browsers do listView items not seem to get shortened. I now call a function that does this manually on pageinit:
fixListView: function () {
var brokenAgent = "Safari";
var currentUserAgent = navigator.userAgent;
if (currentUserAgent.indexOf(brokenAgent) != -1) {
var listItemList = $('.long-text');
for (var i = 0; i < listItemList.length; i++) {
var text = listItemList[i].innerText;
if (text.length > 40) {
var newText = text.substr(0, 40);
listItemList[i].innerText = newText + "...";
}
}
}
}
I'm still not that happy with my fix; any ideas for improvement are welcome!