Jsoup : How to escape from this simple img selection nightmare? - android

I am using the jsoup library.
I want to extract the image from a web page, so I used select() on the parsed document. My code looks like this:
Elements news2 = document.select("div.contentcolumn");
// Elements title2 = news2.select("div.catitems");
Log.d("MainActivity", "This is news = " + title);
for (Element el : news2) {
    news_object = new item();
    news_object.setTitle(el.select("h1").text());
    news_object.setauther(el.select("a").attr("abs:href"));
    news_object.setimg(el.select("ImageArea").attr("abs:src"));
    Log.d("newsdetail", "humam" + news_object.getimg());
}
Here is the relevant part of the website's HTML source:
<div id="ImageArea">
    <a href="/filestorage/contentfiles/2016/04_16/090416102307_140_1.jpg"
       target="_blank">
        <img src="/filestorage/contentfiles/2016/04_16/090416102307_140_1.jpg"
             alt="المالكي: الإصلاح محاولة لإفشال المشروع الإسلامي وضرب المتدينين"
             style="max-width:620px;">
    </a>
</div>
I want to select the img and show it in an image view, and also select the text.

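The problem lies in this line: el.select("ImageArea").attr("abs:src").
Let's see what happens:
el.select("ImageArea") // selects elements whose tag name is ImageArea
  .attr("abs:src")     // reads the absolute src from the matched elements
However, ImageArea is not a tag name. It is the id of a div that contains an anchor, which in turn contains the target image. The selector matches nothing, so attr() returns an empty string.
Try this instead: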
news_object.setimg(el.select("#ImageArea img").attr("abs:src"));
If the ImageArea element may not have an img, use the code below:
Element img = el.select("#ImageArea img").first();
if (img != null) {
    news_object.setimg(img.attr("abs:src"));
} else {
    // ...
}
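To see the difference between the two selectors in isolation, here is a minimal sketch that parses the markup from the question as a string. The base URI http://example.com/ and the class name ImageSelectorDemo are assumptions made for this example; the real site address is not part of the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ImageSelectorDemo {
    // Trimmed copy of the markup from the question.
    static final String HTML =
        "<div id=\"ImageArea\">"
      + "<a href=\"/filestorage/contentfiles/2016/04_16/090416102307_140_1.jpg\" target=\"_blank\">"
      + "<img src=\"/filestorage/contentfiles/2016/04_16/090416102307_140_1.jpg\">"
      + "</a></div>";

    // Hypothetical base URI, needed so that abs:src can resolve the relative path.
    static final String BASE_URI = "http://example.com/";

    // "ImageArea" is read as a tag name, so nothing matches and attr() returns "".
    static String byTagName() {
        Document doc = Jsoup.parse(HTML, BASE_URI);
        return doc.select("ImageArea").attr("abs:src");
    }

    // "#ImageArea img" selects img elements inside the element with id ImageArea.
    static String byId() {
        Document doc = Jsoup.parse(HTML, BASE_URI);
        Element img = doc.select("#ImageArea img").first();
        return img == null ? "" : img.attr("abs:src");
    }

    public static void main(String[] args) {
        System.out.println("tag selector: '" + byTagName() + "'");
        System.out.println("id selector:  '" + byId() + "'");
    }
}
```

In the Android code above, the base URI normally comes from the URL passed to Jsoup.connect(), so abs:src should resolve without setting it by hand.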

Related

Android: how to search a word or a phrase with Jsoup

My problem is: how can I search for a word or a phrase in a page parsed with jsoup?
For example, if the word or phrase is in a span, how can I find the content next to that <span>, such as a link?
Html example code:
...
<div class="div">
<span>my y favourite text </span>
<a href="http://www.mylink.com">my link</a>
</div>
....
From this example, how can I find that the span contains my word favourite, and also retrieve the link in the <a href>?
Target: get the text of a span and the href attribute of a sibling a element, if the span contains a specified search word.
One way is to look for an a element that has an href attribute and a preceding sibling span. Then take its parent element and check the content of the span inside it. jsoup is a good option for parsing the DOM tree.
Example Code
// The first two href values are inferred from the output below; the third one is a hypothetical placeholder.
String source = "<div class=\"div\"><span>my y favourite text </span><a href=\"http://www.mylink.com\">my link</a></div>" +
        "<div class=\"div\"><span>my y favourite 2 text </span><a href=\"some-link.html\">my link 1</a></div>" +
        "<div class=\"div\"><span>my y text </span><a href=\"other-link.html\">my link 2</a></div>";
String searchWord = "favourite";
// For a String input, the second argument of Jsoup.parse is the base URI, not a charset.
// It is only needed here to resolve relative links in this local example.
Document doc = Jsoup.parse(source, "http://some-source.com");
for (Element el : doc.select("span ~ a[href]")) {
    Element parent = el.parent();
    if (parent.select("span").text().contains(searchWord)) {
        String spanContent = parent.select("span").first().text();
        String link = parent.select("a[href]").first().absUrl("href");
        System.out.println(spanContent + " -> " + link); // do something useful with the matches
    }
}
Output
my y favourite text -> http://www.mylink.com
my y favourite 2 text -> http://some-source.com/some-link.html
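
The second output line has the same host as the base URI, which suggests that its href was relative and was resolved against that base. As a rough illustration of that resolution step, here is a sketch that uses the JDK's java.net.URI rather than jsoup itself. The class name AbsUrlSketch, the relative path some-link.html, and the trailing slash on the base are assumptions for this example.

```java
import java.net.URI;

class AbsUrlSketch {
    // Resolve a possibly relative href against a base URI.
    // An href that is already absolute is returned unchanged.
    static String resolve(String base, String href) {
        return URI.create(base).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Trailing slash added for this sketch so the path merge is unambiguous.
        String base = "http://some-source.com/";
        System.out.println(resolve(base, "some-link.html"));
        System.out.println(resolve(base, "http://www.mylink.com"));
    }
}
```

In jsoup the same job is done by absUrl("href") or the abs:href attribute prefix, using the base URI of the document.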

How to get data from instagram profile page by Jsoup in android

An Instagram profile page has a "LOAD MORE" button that loads more posts.
I want to get the href attribute of this button with jsoup on Android. When I view the page source I can't find its HTML, but it is visible in the browser's Inspect Element view.
jsoup can only parse the source code (right click > view source) as retrieved from the server. Your button, however, is added to the DOM (right click > inspect) by JavaScript.
To get the URL, you need to render the page first and then pass the resulting HTML to jsoup.
Here is an example of how to do it with HtmlUnit:
page.html - source code
<html>
<head>
<script src="loadData.js"></script>
</head>
<body onLoad="loadData()">
<div class="container">
<table id="data" border="1">
<tr>
<th>col1</th>
<th>col2</th>
</tr>
</table>
</div>
</body>
</html>
loadData.js
// append rows and cols to table#data in page.html
function loadData() {
    var data = document.getElementById("data");
    for (var row = 0; row < 2; row++) {
        var tr = document.createElement("tr");
        for (var col = 0; col < 2; col++) {
            var td = document.createElement("td");
            td.appendChild(document.createTextNode(row + "." + col));
            tr.appendChild(td);
        }
        data.appendChild(tr);
    }
}
page.html when loaded to browser
| Col1 | Col2 |
| ------ | ------ |
| 0.0 | 0.1 |
| 1.0 | 1.1 |
Using jsoup to parse page.html for col data
// load source from file
Document doc = Jsoup.parse(new File("page.html"), "UTF-8");
// iterate over rows and cols
for (Element row : doc.select("table#data > tbody > tr")) {
    for (Element col : row.select("td")) {
        // print results
        System.out.println(col.ownText());
    }
}
Output
(empty)
What happened?
Jsoup parses the source code as delivered from the server (or in this case loaded from file). It does not invoke client-side actions such as JavaScript or CSS DOM manipulation. In this example, the rows and cols are never appended to the data table.
How to parse my page as rendered in the browser?
// load page using HtmlUnit and fire scripts
WebClient webClient = new WebClient();
HtmlPage myPage = webClient.getPage(new File("page.html").toURI().toURL());
// convert page to generated HTML and parse it into a jsoup document
doc = Jsoup.parse(myPage.asXml());
// iterate over rows and cols
for (Element row : doc.select("table#data > tbody > tr")) {
    for (Element col : row.select("td")) {
        // print results
        System.out.println(col.ownText());
    }
}
// clean up resources
webClient.close();
Output
0.0
0.1
1.0
1.1

Android Remove First and Last <div> tag from html text using Jsoup

I want to remove the first and last div tags from an HTML text. I use the jsoup library to parse it, and I tried a few things, shown in the code below. The HTML may or may not contain more than one div tag, but I want to remove just the first and last div tags if they are present. Please help me. Thanks in advance.
public String divremove(String html) {
    Document doc = Jsoup.parse(html);
    for (Element e : doc.select("div")) {
        if (e != null) {
            Log.e("LOG", "link >> " + e.text());
        }
    }
    /* Element link = doc.removeClass("div");
    if (link != null)
    {
    }
    Integer in = doc.select("div").first().elementSiblingIndex();*/
    Element link = doc.select("div").first();
    Log.e("LOG", "link >> " + link);
    Element link2 = doc.select("div").last();
    Log.e("LOG", "link2 >> " + link2.text());
    return html; // formatted
}
Here's an example:
final String html = "<div>A</div><div>B</div><div>C</div><div>D</div>";
Document doc = Jsoup.parse(html);
// (1) - Remove from html
doc.select("div").first().remove();
doc.select("div").last().remove();
System.out.println(doc.body());
// (2) - Remove from list
Elements divs = doc.select("div");
divs.remove(0);
divs.remove(divs.size()-1);
System.out.println(divs);
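(1) removes the first and last div from the HTML, so doc won't contain them anymore. If you just want to drop them from your selection, use (2) instead: the elements stay in the HTML (= doc) and are only removed from divs. This describes the jsoup release the answer was written for; newer releases may also detach elements removed through Elements.remove(int) from the document, so check the behavior of the version you use.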
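
If the goal is to drop only the first and last div tags while keeping their content, Element.unwrap() may be a better fit than remove(). The following is a sketch under that assumption; the class name UnwrapDivs and the method name unwrapFirstAndLast are made up for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class UnwrapDivs {
    // Removes the first and last <div> tags but keeps their children in place.
    static String unwrapFirstAndLast(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        // Disable pretty printing so the returned HTML has no added whitespace.
        doc.outputSettings().prettyPrint(false);
        Elements divs = doc.select("div");
        if (!divs.isEmpty()) {
            Element first = divs.first();
            Element last = divs.last();
            first.unwrap();
            // With a single div, first and last are the same element.
            if (last != first) {
                last.unwrap();
            }
        }
        return doc.body().html();
    }

    public static void main(String[] args) {
        System.out.println(unwrapFirstAndLast("<div>A</div><div>B</div><div>C</div><div>D</div>"));
    }
}
```

Unlike the divremove method in the question, which returns the unchanged input string, this returns the HTML of the modified document body.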

Issue on parsing Html with jsoup

I am trying to parse this HTML using jsoup.
My code is:
doc = Jsoup.connect(htmlUrl).timeout(1000 * 1000).get();
Elements items = doc.select("item");
Log.d(TAG, "Items size : " + items.size());
for (Element item : items) {
    Log.d(TAG, "in for loop of items");
    Element titleElement = item.select("title").first();
    mTitle = titleElement.text().toString();
    Log.d(TAG, "title is : " + mTitle);
    Element linkElement = item.select("link").first();
    mLink = linkElement.text().toString();
    Log.d(TAG, "link is : " + mLink);
    Element descElement = item.select("description").first();
    mDesc = descElement.text().toString();
    Log.d(TAG, "description is : " + mDesc);
}
I am getting following output:
in for loop of items
D/HtmlParser( 6690): title is : Indonesian president: Some multinationals "take too much"
D/HtmlParser( 6690): link is :
D/HtmlParser( 6690): description is : April 23 - Indonesian President Susilo Bambang Yudhoyono tells a Thomson Reuters Newsmaker event that the country welcomes foreign investment in its resources sector, but must receive a "fair share" of benefits.<div class="feedflare"> <img src="http://feeds.feedburner.com/~ff/reuters/audio/newsmakerus/rss/mp3?d=yIl2AUoC8zA" border="0"></img> <img src="http://feeds.feedburner.com/~ff/reuters/audio/newsmakerus/rss/mp3?i=NX3AY96GfGk:hAtGeOq2ESs:V_sGLiPBpWU" border="0"></img> <img src="http://feeds.feedburner.com/~ff/reuters/audio/newsmakerus/rss/mp3?i=NX3AY96GfGk:hAtGeOq2ESs:F7zBnMyn0Lo" border="0"></img> </div><img src="http://feeds.feedburner.com/~r/reuters/audio/newsmakerus/rss/mp3/~4/NX3AY96GfGk" height="1" width="1"/>
But I want output as:
in for loop of items
D/HtmlParser( 6690): title is : Indonesian president: Some multinationals "take too much"
D/HtmlParser( 6690): link is : http://feeds.reuters.com/~r/reuters/audio/newsmakerus/rss/mp3/~3/KDcQe4gF-3U/62828262.mp3
D/HtmlParser( 6690): description is : April 23 - Indonesian President Susilo Bambang Yudhoyono tells a Thomson Reuters Newsmaker event that the country welcomes foreign investment in its resources sector, but must receive a "fair share" of benefits.
What should I change in my code?
How to achieve my goal. Please help me!!
Thank you in advance!!
There are two problems in the RSS content you fetched:
1. The link text is not inside the <link/> tag but after it, because the HTML parser treats <link> as an empty (void) element.
2. There is some escaped HTML content within the description tag.
The modified code is below.
Also, when I viewed the URL in a browser I got cleaner HTML, which would make the desired fields easier to extract. You can get that version by setting a browser user agent in jsoup, but it is up to you to decide how to fetch the content.
doc = Jsoup.connect("http://feeds.reuters.com/reuters/audio/newsmakerus/rss/mp3/").timeout(0).get();
System.out.println(doc.html());
System.out.println("================================");
Elements items = doc.select("item");
for (Element item : items) {
    Element titleElement = item.select("title").first();
    String mTitle = titleElement.text();
    System.out.println("title is : " + mTitle);
    /*
     * The link in the RSS is as follows:
     * <link />http://feeds.reuters.com/~r/reuters/audio/newsmakerus/rss/mp3/~3/NX3AY96GfGk/59621707.mp3
     * The URL is not inside the <link> element; it is a text node directly under <item>.
     */
    String mLink = item.ownText();
    System.out.println("link is : " + mLink);
    Element descElement = item.select("description").first();
    /*
     * Unescape the HTML content, parse it into a document, and keep only the text, dropping all HTML tags.
     * "/" is a dummy base URI, as we don't care about resolving links within the parsed content.
     */
    String mDesc = Parser.parse(Parser.unescapeEntities(descElement.text(), false), "/").text();
    System.out.println("description is : " + mDesc);
}
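An alternative worth trying is jsoup's XML parser, which does not apply HTML rules to <link> and so keeps the URL inside the element. The sketch below uses a shortened, made-up feed item (not the real Reuters feed); the class name RssLinkDemo and the example.com URL are illustrative assumptions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

class RssLinkDemo {
    // Made-up feed item; the description carries escaped HTML, as in the question.
    static final String RSS =
        "<rss><channel><item>"
      + "<title>Sample title</title>"
      + "<link>http://example.com/audio/1.mp3</link>"
      + "<description>Plain text.&lt;div&gt;&lt;img src='x'/&gt;&lt;/div&gt;</description>"
      + "</item></channel></rss>";

    // Parsed as XML, <link> is an ordinary element and keeps its text.
    static String link() {
        Document doc = Jsoup.parse(RSS, "", Parser.xmlParser());
        return doc.select("item > link").first().text();
    }

    // text() decodes the entities into markup; parsing that markup as HTML
    // and calling text() again leaves only the readable text.
    static String description() {
        Document doc = Jsoup.parse(RSS, "", Parser.xmlParser());
        String markup = doc.select("item > description").first().text();
        return Jsoup.parse(markup).text();
    }

    public static void main(String[] args) {
        System.out.println("link is : " + link());
        System.out.println("description is : " + description());
    }
}
```

For a live feed, the same parser can be set on the connection with Jsoup.connect(url).parser(Parser.xmlParser()).get().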

How can I extract something specific using Jsoup?

How can I extract the full names from this sample HTML code?
I only want to get the following.
Full name1
Full name2
Full name3
<div class="readerP">
<p><a href="link1_english.html" title="Complete" >Full name1</a><br>[ other info ]</br> </p>
</div>
<div class="readerP">
<p><a href="link2_english.html" title="Complete" >Full name2</a><br>[ other info ]</br> </p>
</div>
<div class="readerP">
<p><a href="link1_english.html" title="Complete" >Full name3</a><br>[ other info ]</br> </p>
</div>
I am using this code, but it looks at all the 'a' tags in the page, so I get extra text such as:
Home Page
About
Contact
Full name1
Full name2
Full name3
and so on ...
try {
    doc = Jsoup.connect("http://www.somesite.com").get();
    Elements links = doc.getElementsByTag("a");
    for (Element el : links) {
        linkText = el.ownText();
        arr_linkText.add(linkText);
    }
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
How can I look at the 'div' tag and if class="readerP" look at the 'a' tags inside the 'div'?
Use an appropriate selector instead of just searching by tag:
Elements links = doc.select("div.readerP a");
Here div.readerP (no space) matches div elements whose class is readerP, and the trailing a matches the links inside them. With a space, div .readerP a would instead look for a readerP element nested inside some other div.
Read more about selectors in the jsoup documentation.
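
Note that a single space changes what the selector means. The sketch below runs both spellings against a shortened copy of the sample HTML; the Home Page link, the class name ReaderLinks, and the method name names are assumptions added for the example.

```java
import java.util.List;
import org.jsoup.Jsoup;

class ReaderLinks {
    // Two readerP blocks from the question plus one hypothetical navigation link.
    static final String HTML =
        "<a href=\"index.html\">Home Page</a>"
      + "<div class=\"readerP\"><p><a href=\"link1_english.html\" title=\"Complete\">Full name1</a>"
      + "<br>[ other info ]</p></div>"
      + "<div class=\"readerP\"><p><a href=\"link2_english.html\" title=\"Complete\">Full name2</a>"
      + "<br>[ other info ]</p></div>";

    // Returns the text of every element matched by the given selector.
    static List<String> names(String selector) {
        return Jsoup.parse(HTML).select(selector).eachText();
    }

    public static void main(String[] args) {
        // div.readerP a: links inside a div that itself has class readerP.
        System.out.println(names("div.readerP a"));
        // div .readerP a: links inside a readerP element nested in another div.
        System.out.println(names("div .readerP a"));
    }
}
```

With this markup the second selector matches nothing, because the readerP divs are not nested inside another div. On a page where they are wrapped in an outer div, both spellings would find the links.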
