Issue on parsing Html with jsoup

I am trying to parse this HTML using jsoup.
My code is:
doc = Jsoup.connect(htmlUrl).timeout(1000 * 1000).get();
Elements items = doc.select("item");
Log.d(TAG, "Items size : " + items.size());
for (Element item : items) {
    Log.d(TAG, "in for loop of items");
    Element titleElement = item.select("title").first();
    mTitle = titleElement.text().toString();
    Log.d(TAG, "title is : " + mTitle);
    Element linkElement = item.select("link").first();
    mLink = linkElement.text().toString();
    Log.d(TAG, "link is : " + mLink);
    Element descElement = item.select("description").first();
    mDesc = descElement.text().toString();
    Log.d(TAG, "description is : " + mDesc);
}
I am getting the following output:
in for loop of items
D/HtmlParser( 6690): title is : Indonesian president: Some multinationals "take too much"
D/HtmlParser( 6690): link is :
D/HtmlParser( 6690): description is : April 23 - Indonesian President Susilo Bambang Yudhoyono tells a Thomson Reuters Newsmaker event that the country welcomes foreign investment in its resources sector, but must receive a "fair share" of benefits.<div class="feedflare"> <img src="http://feeds.feedburner.com/~ff/reuters/audio/newsmakerus/rss/mp3?d=yIl2AUoC8zA" border="0"></img> <img src="http://feeds.feedburner.com/~ff/reuters/audio/newsmakerus/rss/mp3?i=NX3AY96GfGk:hAtGeOq2ESs:V_sGLiPBpWU" border="0"></img> <img src="http://feeds.feedburner.com/~ff/reuters/audio/newsmakerus/rss/mp3?i=NX3AY96GfGk:hAtGeOq2ESs:F7zBnMyn0Lo" border="0"></img> </div><img src="http://feeds.feedburner.com/~r/reuters/audio/newsmakerus/rss/mp3/~4/NX3AY96GfGk" height="1" width="1"/>
But I want output as:
in for loop of items
D/HtmlParser( 6690): title is : Indonesian president: Some multinationals "take too much"
D/HtmlParser( 6690): link is : http://feeds.reuters.com/~r/reuters/audio/newsmakerus/rss/mp3/~3/KDcQe4gF-3U/62828262.mp3
D/HtmlParser( 6690): description is : April 23 - Indonesian President Susilo Bambang Yudhoyono tells a Thomson Reuters Newsmaker event that the country welcomes foreign investment in its resources sector, but must receive a "fair share" of benefits.
What should I change in my code?
How to achieve my goal. Please help me!!
Thank you in advance!!

There are two problems in the RSS content you fetched:
1. The link text is not inside the <link/> tag but after it. jsoup's HTML parser treats <link> as an empty (void) element, so the URL ends up as a text node of the enclosing <item>.
2. There is some escaped HTML content within the description tag.
The same URL also returns cleaner HTML when it is opened in a browser, which makes the desired fields easier to extract. You can get that version by setting a browser user agent on the jsoup connection with userAgent(...); which variant you fetch is up to you.
Here is the modified code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
import org.jsoup.select.Elements;

// timeout(0) disables the timeout, so the request may wait indefinitely.
Document doc = Jsoup.connect("http://feeds.reuters.com/reuters/audio/newsmakerus/rss/mp3/").timeout(0).get();
System.out.println(doc.html());
System.out.println("================================");
Elements items = doc.select("item");
for (Element item : items) {
    Element titleElement = item.select("title").first();
    String mTitle = titleElement.text();
    System.out.println("title is : " + mTitle);
    /*
     * After parsing, the link in the RSS looks like this:
     * <link />http://feeds.reuters.com/~r/reuters/audio/newsmakerus/rss/mp3/~3/NX3AY96GfGk/59621707.mp3
     * The URL is not inside the <link> element; it is a text node of <item>.
     */
    String mLink = item.ownText(); // text that belongs directly to <item>
    System.out.println("link is : " + mLink);
    Element descElement = item.select("description").first();
    /*
     * Unescape the HTML content, parse it into a document, and keep only the text,
     * leaving all HTML tags behind. "/" is a dummy base URI, since links inside
     * the parsed content do not need to be resolved.
     */
    String mDesc = Parser.parse(Parser.unescapeEntities(descElement.text(), false), "/").text();
    System.out.println("description is : " + mDesc);
}
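Since an RSS feed is XML rather than HTML, another option is to read it with an XML parser, which keeps the URL inside the link element. The sketch below deliberately swaps jsoup for the JDK's built-in DOM parser (javax.xml.parsers), purely as an illustration. The class name RssItemSketch, its helper methods, and the sample feed string are made up for this example, and cutting the description at the first '<' assumes the feed appends its markup after the plain text, as in the output shown in the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of jsoup or of the original answer.
final class RssItemSketch {
    /** Text of the first descendant element with the given tag, or "" if there is none. */
    static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    /** Keeps only the text in front of the first embedded HTML tag (assumption: markup comes last). */
    static String plainPrefix(String description) {
        int firstTag = description.indexOf('<');
        return (firstTag < 0 ? description : description.substring(0, firstTag)).trim();
    }

    /** Returns {title, link, description} of the first item in the feed. */
    static String[] parseFirstItem(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element item = (Element) doc.getElementsByTagName("item").item(0);
            return new String[] {
                childText(item, "title"),
                childText(item, "link"),
                plainPrefix(childText(item, "description"))
            };
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse feed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample feed; the real feed would be downloaded first.
        String xml = "<rss><channel><item>"
                + "<title>Sample title</title>"
                + "<link>http://example.com/audio.mp3</link>"
                + "<description>Plain summary.&lt;div class=\"feedflare\"&gt;&lt;/div&gt;</description>"
                + "</item></channel></rss>";
        String[] fields = parseFirstItem(xml);
        System.out.println("title is : " + fields[0]);
        System.out.println("link is : " + fields[1]);
        System.out.println("description is : " + fields[2]);
    }
}
```

With an XML parser the escaped markup in the description is decoded into plain characters, so it still has to be stripped; parsing it with jsoup, as in the answer above, is the more robust way to do that.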

Related

How to get bullet points when parsing text with Jsoup?

I am using Jsoup to get the text from an html doc and display it in my android app.
The text contains a list (<ul><li>).
If I do it like this I get only the text:
val doc = Jsoup.parse(someHtml)
return doc.text()
I tried using wholeText:
val doc = Jsoup.parse(removeImages)
return doc.wholeText()
In this way it keeps some formatting, but still it ignores the bullet points. Is there any way to get the bullet points in the text?

The bullets are rendered by the browser, so they are not part of the text.
You have to add them yourself, as in this example:
String html = "<html>" +
        "<head>" +
        "<title>List</title>" +
        "</head>" +
        "<body>" +
        "<ul>" +
        "<li>Item 1</li>" +
        "<li>Item 2</li>" +
        "<li>Item 3</li>" +
        "</ul> " +
        "</body>" +
        "</html>";
Document doc = Jsoup.parse(html);
Element list = doc.select("ul").first();
Elements item = list.children();
for (Element e : item) {
    System.out.println("\u2022" + e.text());
}
The output is:
•Item 1
•Item 2
•Item 3
You can replace the bullet with any other character that you like, by replacing the \u2022 code with any other valid code/character.
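If the result has to be a single string, for example to hand to one text view, the per-item texts can be joined into one block. The following is a minimal sketch under that assumption; the class BulletTextSketch and the method toBulletText are hypothetical names, and the list of strings stands in for the e.text() values collected from the li elements.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of jsoup.
final class BulletTextSketch {
    /** Prefixes every item with a bullet and joins the items with line breaks. */
    static String toBulletText(List<String> items) {
        return items.stream()
                .map(item -> "\u2022 " + item)
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        // In the app these strings would come from e.text() of each <li> element.
        System.out.println(toBulletText(Arrays.asList("Item 1", "Item 2", "Item 3")));
    }
}
```

This variant puts a space after the bullet; drop it from the prefix to match the output shown above exactly.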

Android: how to search a word or a phrase with Jsoup

My problem is: how can I search for a word or a phrase in a page loaded with Jsoup?
For example, if the word or phrase is in a span, how can I find the content next to this <span>, such as a link?
Html example code:
...
<div class="div">
<span>my y favourite text </span>
<a href="http://www.mylink.com">my link</a>
</div>
....
In this example, how can I detect that the span contains my word favourite, and how do I retrieve the link from the <a href>?

Target: get the text of a span and the href attribute of a sibling a element, if the span contains a specified search word.
One way is to look for a elements that have an href attribute and a preceding sibling span element. Then select the parent element and, within it, the span element to compare its content. jsoup is a good option for parsing the DOM tree.
Example Code
String source = "<div class=\"div\"><span>my y favourite text </span><a href=\"http://www.mylink.com\">my link</a> </div>" +
        "<div class=\"div\"><span>my y favourite 2 text </span><a href=\"/some-link.html\">my link 1</a></div>" +
        // The third href is a placeholder; this div does not contain the search word, so it is never resolved.
        "<div class=\"div\"><span>my y text </span><a href=\"...\">my link 2</a></div>";
String searchWord = "favourite";
// The second argument of Jsoup.parse is the base URI (not a charset);
// it is only needed here to resolve the relative link in this local example.
Document doc = Jsoup.parse(source, "http://some-source.com");
for (Element el : doc.select("span ~ a[href]")) {
    Element parent = el.parent();
    if (parent.select("span").text().contains(searchWord)) {
        String spanContent = parent.select("span").first().text();
        String link = parent.select("a[href]").first().absUrl("href");
        System.out.println(spanContent + " -> " + link); // do something useful with the matches
    }
}
Output
my y favourite text -> http://www.mylink.com
my y favourite 2 text -> http://some-source.com/some-link.html

Jsoup : How to escape from this simple img selection nightmare?

I am using the jsoup library.
I want to get the image from a web page, so I used select. My code looks like this:
Elements news2 = document.select("div.contentcolumn");
// Elements title2 = news2.select("div.catitems");
Log.d("MainActivity", "This is news = " + title);
for (Element el : news2) {
    news_object = new item();
    news_object.setTitle(el.select("h1").text());
    news_object.setauther(el.select("a").attr("abs:href"));
    news_object.setimg(el.select("ImageArea").attr("abs:src"));
    Log.d("newsdetail", "humam" + news_object.getimg());
Here is the relevant source of the web page:
<div id="ImageArea">
<a href="/filestorage/contentfiles/2016/04_16/090416102307_140_1.jpg"
target="_blank">
<img src="/filestorage/contentfiles/2016/04_16/090416102307_140_1.jpg"
alt="المالكي: الإصلاح محاولة لإفشال المشروع الإسلامي وضرب المتدينين"
style="max-width:620px;">
</a>
</div>
I want to select the img so I can show it in an image view, and I also want to select the text.

The problem lies in this line: el.select("ImageArea").attr("abs:src").
Let's see what happens:
el.select("ImageArea") // selects elements whose tag name is ImageArea; there are none
    .attr("abs:src")   // so there is no element to read src from
ImageArea is not a tag name. It is the id of a div that contains an anchor, which in turn contains the target image. An id is selected with a leading #.
Try this instead:
news_object.setimg(el.select("#ImageArea img").attr("abs:src"));
If the ImageArea element may not have an img, use the code below:
Element img = el.select("#ImageArea img").first();
if (img != null) {
    news_object.setimg(img.attr("abs:src"));
} else {
    // ...
}

Android Remove First and Last <div> tag from html text using Jsoup

I want to remove the first and last div tags from an HTML text. I use the jsoup library to parse the HTML, and the code below shows what I have tried. The HTML may contain more than one div tag, or none; I want to remove just the first and the last div tag if they are present. Please help. Thanks in advance.
public String divremove(String html) {
    Document doc = Jsoup.parse(html);
    for (Element e : doc.select("div")) {
        if (e != null) {
            Log.e("LOG", "link >> " + e.text());
        }
    }
    /* Element link = doc.removeClass("div");
    if (link != null)
    {
    }
    Integer in = doc.select("div").first().elementSiblingIndex();*/
    Element link = doc.select("div").first();
    Log.e("LOG", "link >> " + link);
    Element link2 = doc.select("div").last();
    Log.e("LOG", "link2 >> " + link2.text());
    return html; //formatted
}

Here's an example:
final String html = "<div>A</div><div>B</div><div>C</div><div>D</div>";
Document doc = Jsoup.parse(html);
// (1) - Remove from html
doc.select("div").first().remove();
doc.select("div").last().remove();
System.out.println(doc.body());
// (2) - Remove from list
Elements divs = doc.select("div");
divs.remove(0);
divs.remove(divs.size()-1);
System.out.println(divs);
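(1) removes the first and last div from the HTML, so doc won't contain them anymore. If you only want to drop them from your selected divs, use (2) instead: this keeps them in your HTML (= doc) but removes them from divs. Treat (1) and (2) as alternatives; run one after the other as written, (2) works on whatever (1) left behind. Newer jsoup releases may also detach elements from the document when they are removed through the Elements list methods, so check how Elements.remove(int) behaves in the version you use.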

Html Checkbox in android webview

In my project I display HTML content in an Android WebView (I am using the Eclipse IDE). Here is a small piece of the code:
"<form name =\"frm\">"+
"<input type=\"checkbox\" name =\"First\" value =\"xyz\">xyz<br>"+
"<input type=\"checkbox\" name =\"First\" value =\"abc\">abc<br>"+
"</form>"
My question is: how can I get the state of a checkbox (checked or unchecked),
and how can I catch that state in my Java code?
Update:
public String html = "<form name =\"frm\">"+
"<input type=\"checkbox\" name =\"First\" value =\"xyz\">as<br>"+
"<input type=\"checkbox\" name =\"Second\" value =\"zyx\">as<br>"+
"<input type =\"button\" onclick =\"callDoSomething()\"><br>"+
"</form>" +
"<script type=\"text/javascript\">"+
"function callDoSomething() {"+
" var theName = document.frm.First.value;"+
"alert('theName ')"+
"}"+
"</script>";

First, both of your checkboxes are named "First"; you should probably name the second one "Second". If you want to find checkboxes by value, just add a simple JS for loop.
Assuming you want to get the result from your Android code (as opposed to a JS event such as a button click), here's how to get a Java boolean value for your checkbox by name:
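// assuming your activity is MyActivity, the target checkbox name
// is in the targetCheckboxName var, JavaScript is enabled
// (webView.getSettings().setJavaScriptEnabled(true)) and webView
// has the document loaded already
Object jsi = new Object() {
    @JavascriptInterface
    public void reportCheckboxState(final String name, final boolean isChecked) {
        // Interface methods are called on a WebView background thread,
        // so switch to the UI thread before showing the dialog.
        MyActivity.this.runOnUiThread(new Runnable() {
            @Override
            public void run() {
                new AlertDialog.Builder(MyActivity.this).setMessage(name + " is " +
                        isChecked).create().show();
            }
        });
    }
};
webView.addJavascriptInterface(jsi, "injection");
webView.loadUrl(
        "javascript:injection.reportCheckboxState(frm." + targetCheckboxName +
        ".name, frm." + targetCheckboxName + ".checked);"
);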
But really, it's a very simple trick. Judging by the comments on the question, you should probably read up on JavaScript and WebView.addJavascriptInterface().
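Because the javascript: URL is assembled by string concatenation, it can help to build it in one place and to reject names that are not plain identifiers. The sketch below is one possible way to do that in plain Java; the class CheckboxProbeSketch and the method buildProbeUrl are hypothetical names, and the interface name injection and the form name frm are taken from the snippets above.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Android SDK.
final class CheckboxProbeSketch {
    // Accept only plain identifiers so a name cannot smuggle extra JavaScript into the URL.
    private static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z_$][A-Za-z0-9_$]*");

    /** Builds the javascript: URL that reports one checkbox's name and checked state. */
    static String buildProbeUrl(String formName, String checkboxName) {
        if (!IDENTIFIER.matcher(formName).matches()
                || !IDENTIFIER.matcher(checkboxName).matches()) {
            throw new IllegalArgumentException("not a plain identifier");
        }
        String field = formName + "." + checkboxName;
        return "javascript:injection.reportCheckboxState(" + field + ".name, "
                + field + ".checked);";
    }

    public static void main(String[] args) {
        // The resulting string would be passed to webView.loadUrl(...).
        System.out.println(buildProbeUrl("frm", "First"));
    }
}
```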
