I am trying to parse the following link:
http://rate-exchange.appspot.com/currency?from=USD&to=EUR&q=1
As you can see, it is a very simple page, and I am just trying to extract the text on the page with Jsoup. My current implementation returns the wrong HTML and I am not sure why. Here is my code:
public class RetreiveCurrencies extends AsyncTask<String, Void, String> {
    @Override
    protected String doInBackground(String... arg0) {
        Document html = null;
        try {
            Log.i("wbbug", arg0[0]);
            html = Jsoup.parse(arg0[0]);
        } catch (Exception e) {
            e.printStackTrace();
        }
        Log.i("wbbug", html.toString());
        return null;
    }
}
Which is called with:
AsyncTask<String, Void, String> rc = new RetreiveCurrencies().execute("http://rate-exchange.appspot.com/currency?from=USD&to=EUR&q=1");
However, instead of returning the correct HTML with the text you see when clicking the link, my Log.i returns:
<html>
<head></head>
<body>
http://rate-exchange.appspot.com/currency?from=USD&to=EUR&q=1
</body>
</html>
What am I doing wrong and how can I extract the text you see when clicking the link?
Jsoup.parse(String html) treats its argument as HTML markup, so your code is currently parsing the URL string itself as if it were HTML source.
To parse a Document from a remote URL you should use Jsoup.connect(), for example:
Document doc = Jsoup.connect("URL").get();
For your specific example (which appears to be returning JSON, not HTML):
Document doc = Jsoup.connect("http://rate-exchange.appspot.com/currency?from=USD&to=EUR&q=1").ignoreContentType(true).get();
System.out.println(doc.text());
This will output:
{"to": "EUR", "rate": 0.73757499999999998, "from": "USD", "v": 0.73757499999999998}
The ignoreContentType(true) call is needed because the endpoint serves JSON rather than HTML, and Jsoup otherwise throws an UnsupportedMimeTypeException.
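Dropped into your AsyncTask, a minimal sketch of the fix could look like this (a sketch, not a drop-in: since the response is JSON, you would normally hand doc.text() to a JSON parser next):

public class RetreiveCurrencies extends AsyncTask<String, Void, String> {
    @Override
    protected String doInBackground(String... arg0) {
        try {
            // Fetch the page over the network instead of parsing the URL string itself
            Document doc = Jsoup.connect(arg0[0])
                    .ignoreContentType(true) // the endpoint serves JSON, not text/html
                    .get();
            Log.i("wbbug", doc.text());
            return doc.text();
        } catch (IOException e) {
            e.printStackTrace();
            return null;
        }
    }
}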
Related
I am trying to debug an issue I am having. I am using the following code to try to get the link to an image from a page.
private class DownloadWebpageTask extends AsyncTask<String, Void, String> {
    @Override
    protected String doInBackground(String... args) {
        String urls = args[0];
        Document doc = null;
        try {
            doc = Jsoup.connect(urls).ignoreContentType(true).get();
            image = doc.select("img[src~=(?i)\\.(png|jpe?g|gif)]").last();
            theurlstring = "test " + image.attr("src"); // "test " prefix confirms this line is executed
        } catch (IOException e) {
            e.printStackTrace();
        }
        return urls;
    }
}
Every way I try to get the link from the Element image, I get an error. It says:
Attempt to invoke virtual method 'java.lang.String org.jsoup.nodes.Element.attr(java.lang.String)' on a null object reference
Given that error, I now think image is not being selected properly. Does anyone see anything that looks wrong, or how could I pinpoint the problem better?
Your query is not working; see http://try.jsoup.org/~I4Y0POaloHUtrNTMJO7IAiAUIRY
You could use:
image = doc.select("img[src$=.png],img[src$=.gif],img[src$=.jpg],img[src$=.jpeg]").last();
Not as compact, but it does select the images (see http://try.jsoup.org/~kjnlfvCzrxiqaGQqwcszLZswSNg).
If the error persists, use try.jsoup.org with your source URL to verify that the expected output is present in the received HTML, to rule out issues with JavaScript-generated content.
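Independent of the selector, it is worth guarding against an empty result, since select(...).last() returns null when nothing matches; that alone explains the "null object reference" error. A minimal sketch using the names from your task:

Element image = doc.select("img[src$=.png],img[src$=.gif],img[src$=.jpg],img[src$=.jpeg]").last();
if (image != null) {
    theurlstring = "test " + image.attr("src");
} else {
    Log.w("DownloadWebpageTask", "no matching <img> element found on " + urls);
}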
I am trying to convert an iOS application to Android, but I only started learning Java a few days ago. I'm trying to get a value from a tag inside HTML.
Here is my Swift code:
if let url = NSURL(string: "http://www.example.com/") {
    let htmlData: NSData = NSData(contentsOfURL: url)!
    let htmlParser = TFHpple(HTMLData: htmlData)
    // the value which I want to parse
    let nPrice = htmlParser.searchWithXPathQuery("//div[@class='round-border']/div[1]/div[2]") as NSArray
    let rPrice = NSMutableString()
    // appending
    for element in nPrice {
        rPrice.appendString("\n\(element.raw)")
    }
    let raw = String(NSString(string: rPrice))
    // the value without trimming
    let stringPrice = raw.stringByReplacingOccurrencesOfString("<[^>]+>", withString: "", options: .RegularExpressionSearch, range: nil)
    // result
    let trimPrice = stringPrice.stringByReplacingOccurrencesOfString("^\\n*", withString: "", options: .RegularExpressionSearch)
}
Here is my Java code using Jsoup:
public class Quote extends Activity {
    TextView price;
    String tmp;
    Document doc;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_quote);
        price = (TextView) findViewById(R.id.textView3);
        try {
            doc = Jsoup.connect("http://example.com/").get();
            Element content = doc.getElementsByTag("//div[@class='round-border']/div[1]/div[2]");
        } catch (IOException e) {
            // e.printStackTrace();
        }
    }
}
My problems are the following:
I got a NetworkOnMainThreadException whenever I tried any code.
I'm not sure that using getElementsByTag with this structure is correct.
Please help,
Thanks.
I got a NetworkOnMainThreadException whenever I tried any code.
This exception means you are doing network I/O on the main (UI) thread, so the Jsoup call must run on a background thread. You could also use Volley to fetch the page; it is a fast and efficient networking library. See this answer for some sample code.
I'm not sure that using getElementsByTag with this structure is correct.
Element content = doc.getElementsByTag("//div[@class='round-border']/div[1]/div[2]");
Jsoup doesn't understand XPath; it works with CSS selectors instead.
The above line of code can be corrected like this:
Elements divs = doc.select("div.round-border > div:nth-child(1) > div:nth-child(2)");
for (Element div : divs) {
    // Process each div here...
}
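Putting both fixes together, a minimal sketch (assuming your layout and the round-border structure from your Swift code) might look like:

private class FetchPriceTask extends AsyncTask<Void, Void, String> {
    @Override
    protected String doInBackground(Void... params) {
        try {
            // Network I/O now happens off the main thread
            Document doc = Jsoup.connect("http://example.com/").get();
            Element div = doc.select("div.round-border > div:nth-child(1) > div:nth-child(2)").first();
            return div != null ? div.text() : null;
        } catch (IOException e) {
            e.printStackTrace();
            return null;
        }
    }

    @Override
    protected void onPostExecute(String text) {
        if (text != null) {
            price.setText(text); // safe to touch the UI here, on the main thread
        }
    }
}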
When I try to parse the document from an HTML file, it works fine.
But when I try to parse a document from a URL, it gives the following error:
java.lang.IndexOutOfBoundsException: Invalid index 3, size is 2
I am sure the content from the file is the same as from the URL, and I also tried using threads.
Here is the website:
http://pucminas.br/relatorio_atividades_2014/arquivos/ensino_graduacao.htm
Here is the code:
class MyTask extends AsyncTask<Void, Void, String> {
    @Override
    protected String doInBackground(Void... params) {
        String title = "";
        try {
            URL url = new URL(getString(R.string.url));
            Document doc = Jsoup.parse(url, 3000);
            Element table = doc.select("table").get(3);
        } catch (IOException e) {
            e.printStackTrace();
        }
        return title;
    }
}
You should know that Jsoup applies a maximum body size and a request timeout by default, so on a large page not every table makes it into the parsed document.
Fortunately, there's a way to change this when connecting to the site and building your Document object.
Solution
Document doc = Jsoup.connect(url)
        .maxBodySize(0)
        .timeout(0)
        .followRedirects(true)
        .get();
Jsoup API docs:
Connection#maxBodySize(int bytes)
Update the maximum body size, in bytes; a value of 0 is treated as unlimited.
Connection#timeout(int millis)
Update the request timeout; a timeout of 0 is treated as infinite.
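Applied to your task, a minimal sketch (switching from Jsoup.parse(URL, timeout) to connect(), and guarding the index) might look like:

class MyTask extends AsyncTask<Void, Void, String> {
    @Override
    protected String doInBackground(Void... params) {
        String title = "";
        try {
            Document doc = Jsoup.connect(getString(R.string.url))
                    .maxBodySize(0)      // do not truncate large pages
                    .timeout(0)          // no request timeout
                    .followRedirects(true)
                    .get();
            Elements tables = doc.select("table");
            if (tables.size() > 3) {     // guards against the IndexOutOfBoundsException
                title = tables.get(3).text();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return title;
    }
}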
I want to save an entire web page, including its .css and .js files, programmatically on Android.
So far I have tried an HTTP GET, Jsoup, and the WebView content, but none of them save the whole page with its CSS and JS; these methods only save the HTML part of the page. Once the whole page is saved, I want to open it offline.
Thanks in advance
You have to take the HTML, parse it, extract the URLs of the resources, and then make requests for those URLs too.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Stack {
    private static final String USER_AGENT = "";
    private static final String INITIAL_URL = "";

    public static void main(String[] args) throws Exception {
        Document doc = Jsoup
                .connect(INITIAL_URL)
                .userAgent(USER_AGENT)
                .get();

        Elements scripts = doc.getElementsByTag("script");
        Elements css = doc.getElementsByTag("link");

        // Fetch every external script referenced by the page
        for (Element s : scripts) {
            String url = s.absUrl("src"); // resolves relative src attributes against the page URL
            if (!url.isEmpty()) {
                System.out.println(url);
                Document docScript = Jsoup
                        .connect(url)
                        .userAgent(USER_AGENT)
                        .ignoreContentType(true) // scripts are not text/html
                        .get();
                System.out.println(docScript);
                System.out.println("--------------------------------------------");
            }
        }

        // Fetch every stylesheet referenced by the page
        for (Element c : css) {
            String url = c.absUrl("href");
            String rel = c.attr("rel") == null ? "" : c.attr("rel");
            if (!url.isEmpty() && rel.equals("stylesheet")) {
                System.out.println(url);
                Document docScript = Jsoup
                        .connect(url)
                        .userAgent(USER_AGENT)
                        .ignoreContentType(true)
                        .get();
                System.out.println(docScript);
                System.out.println("--------------------------------------------");
            }
        }
    }
}
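The example above only prints the resources; to open the page offline you would write each response body to local storage instead. A minimal sketch of such a helper (the method name and output directory are illustrative, not part of the original code):

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import org.jsoup.Jsoup;

// Hypothetical helper: fetch one resource URL as raw bytes and write it into outDir
static void saveResource(String url, File outDir) throws IOException {
    byte[] body = Jsoup.connect(url)
            .ignoreContentType(true) // the resource may be css/js/image, not text/html
            .maxBodySize(0)
            .execute()               // returns a Connection.Response
            .bodyAsBytes();
    String name = url.substring(url.lastIndexOf('/') + 1);
    try (FileOutputStream out = new FileOutputStream(new File(outDir, name))) {
        out.write(body);
    }
}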
I have a similar problem...
Using this code we can get the images, .css, and .js files, but some HTML content is still missing.
For instance, when we save a web page via Chrome, there are two options:
Complete html
html only
Beyond the .css, .js, and .php resources, "Complete html" consists of more elements than "html only". The requirement is to download the HTML as completely as Chrome does with the first option.
I am developing an Android application in which I am parsing HTML content from a website using Jsoup.
<meta name="title" content="Notices for the week - Holy Family Church, Pestom Sagar" />
For this I've written:
@Override
protected Void doInBackground(Void... params) {
    try {
        // Connect to the web site
        org.jsoup.nodes.Document document = Jsoup.connect(url).get();
        // Get the html document title
        title = document.select("meta[name=title]");
        desc = title.attr("content");
    } catch (IOException e) {
        e.printStackTrace();
    } catch (NullPointerException ex) {
        System.out.println(ex);
    }
    return null;
}
@Override
protected void onPostExecute(Void result) {
    // Set title into TextView
    t1.setText(desc);
}
This works fine and displays the value in the Activity's TextView. Now I want to parse an h3 tag from that website.
<h3 xmlns="http://www.w3.org/1999/xhtml" id="sites-page-title-header" style="" align="left">
    <span id="sites-page-title" dir="ltr">Notices for the week</span>
</h3>
I have no idea how to do this and display it in a TextView in an Android activity. Please advise. Also, how would I parse a whole div tag and display it in the activity using a TextView?
You can select the h3 tag directly:
String h3 = document.select("h3").text();
textView.setText(h3);
Or:
textView.setText(document.select("span[id=sites-page-title]").first().text());
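Either way, guard against first() returning null when nothing matches; the same pattern covers the whole h3 (or any div) as well. A sketch, using the ids from your snippet:

Element span = document.select("span#sites-page-title").first();
if (span != null) {
    textView.setText(span.text());
}

// The whole h3 block, including nested spans:
Element h3 = document.select("h3#sites-page-title-header").first();
if (h3 != null) {
    textView.setText(h3.text());
}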