Parsing HTML with Jsoup lib - android

I'm trying to parse HTML with the Jsoup library. Everything works fine, except that one thing isn't displayed.
Code:
protected ArrayList<Order> doInBackground(String... urls) {
listItems.clear();
myAdapterDouble.notifyDataSetChanged();
String url = null;
try {
Document doc = Jsoup.connect(URL).timeout(0).userAgent("Mozilla/5.0 (Windows; U; WindowsNT 5.1; en-US; rv1.8.1.6) Gecko/20070725 Firefox/2.0.0.6").get();
Elements days = doc.select("div.day_now");
for (Element day : days) {
dd = day.select("div.tooltip");
for (Element d : dd) {
title = d.select("td.tooltip_title h4").text();
time = d.select("td.tooltip_info h4").text();
img = d.select("td.tooltip_desc img[src]");
Order o = new Order();
o.setLink(URL + img.attr("src"));
o.setTextName(title);
o.setTextTime(time
.replace("on", getResources().getString(R.string.on))
.replace("at", getResources().getString(R.string.at))
.replace("Ep:", getResources().getString(R.string.episode))
.replace("Final", getResources().getString(R.string.final_ep)));
o.setDetailsUrl(URL + url); //set urls text in list
listItems.add(o);
}
Elements links = day.select("h3");
for (Element link : links) {
url = link.select("a").attr("href"); // parse page urls
System.out.println(url); //display urls in LogCat
}
}
} catch (IOException e) {
e.printStackTrace();
}
return listItems;
}
In LogCat I can see the URLs that the code above parses:
01-20 12:13:17.671: I/System.out(23390): /show/678/AKB0048_next_stage
01-20 12:13:17.671: I/System.out(23390): /show/668/Battle_Spirits%3A_Sword_Eyes
01-20 12:13:17.671: I/System.out(23390): /show/694/Beast_Saga
01-20 12:13:17.671: I/System.out(23390): /show/660/Cross_Fight_B-Daman_eS
But these links are not displayed on the screen; instead I get null.
What am I doing wrong?
Thanks.

Currently you are not adding the url to listItems. Change your code as follows to get the url:
ArrayList<Order> newarraylist = new ArrayList<Order>();
Elements links = day.select("h3");
int urlcount=0;
for (Element link : links) {
url = link.select("a").attr("href"); // parse page urls
System.out.println(url); //display urls in LogCat
if(urlcount < listItems.size()){
Order o = (Order)listItems.get(urlcount);
o.setDetailsUrl(URL + url); //set urls text in list
newarraylist.add(o);
}
urlcount++;
}
Now return newarraylist from doInBackground instead of listItems.
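To illustrate why the screen shows null: in the question's code, url is still null when setDetailsUrl(URL + url) runs, because the h3 loop that assigns it only runs afterwards, and Java concatenates a null reference as the text "null". The sketch below reproduces this with plain lists instead of Jsoup. The class name, the method names and the http://example.com base are made up for the illustration; the hrefs are copied from the LogCat output above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the question's loops; plain strings replace Jsoup elements.
class UrlOrderDemo {

    // Same order as the question: items are built first, links are read afterwards.
    static List<String> buildInQuestionOrder(String base, List<String> hrefs) {
        List<String> detailsUrls = new ArrayList<>();
        String url = null;
        for (int i = 0; i < hrefs.size(); i++) {
            detailsUrls.add(base + url); // url is still null here, so "null" is appended
        }
        for (String href : hrefs) {
            url = href; // assigned too late to affect the items built above
        }
        return detailsUrls;
    }

    // Second pass, as in the answer: the i-th link is paired with the i-th item.
    static List<String> pairByIndex(String base, List<String> hrefs, int itemCount) {
        List<String> detailsUrls = new ArrayList<>();
        for (int i = 0; i < hrefs.size() && i < itemCount; i++) {
            detailsUrls.add(base + hrefs.get(i));
        }
        return detailsUrls;
    }

    public static void main(String[] args) {
        List<String> hrefs = new ArrayList<>();
        hrefs.add("/show/678/AKB0048_next_stage");
        hrefs.add("/show/694/Beast_Saga");
        System.out.println(buildInQuestionOrder("http://example.com", hrefs));
        System.out.println(pairByIndex("http://example.com", hrefs, hrefs.size()));
    }
}
```

Pairing by index only works if the links and the items appear in the same order and in the same number, which is an assumption about the page layout.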

Related

Scraping google search first page using Jsoup with AsyncTask fails?

I've been using Jsoup to fetch certain words from a Google search, but as far as I can tell it fails in the Jsoup query step.
Execution gets into the doInBackground method, but it doesn't print the title and body of each link in the search results.
My guess is that the list I'm getting from doc.select (links) is empty,
which points to a problem with the query syntax.
value is the search keyword; in my case it's a barcode, and the search itself works. Here's the link
Here is the async call from another class:
String url = "https://www.google.com/search?q=";
if (!value.isEmpty())
{
url = url + value + " price" + "&num10";
Scrape_Asynctasks task = new Scrape_Asynctasks();
task.execute(url);
}
and here is the async task itself:
public class Scrape_Asynctasks extends AsyncTask<String, Integer, String>
{
@Override
protected void onPreExecute() {
super.onPreExecute();
}
@Override
protected String doInBackground(String... strings) {
try
{
Log.i("IN", "ASYNC");
final Document doc = Jsoup
.connect(strings[0])
.userAgent("Jsoup client")
.timeout(5000).get();
Elements links = doc.select("li[class=g]");
for (Element link : links)
{
Elements titles = link.select("h3[class=r]");
String title = titles.text();
Elements bodies = link.select("span[class=st]");
String body = bodies.text();
Log.i("Title: ", title + "\n");
Log.i("Body: ", body);
}
}
catch (IOException e)
{
Log.i("ERROR", "ASYNC");
}
return "finished";
}
@Override
protected void onProgressUpdate(Integer... values) {
super.onProgressUpdate(values);
}
@Override
protected void onPostExecute(String s) {
super.onPostExecute(s);
}
}

Don't use "Jsoup client" as your user agent string; some sites (including Google) don't like it. Use the same string as your browser, e.g. "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0".
Your first selector should be .g: Elements links = doc.select(".g");
The site uses JavaScript, so you will not get all the results that you see in your browser.
You can disable JS in your browser and see the difference.
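Separately, the URL in the question is built by plain concatenation, so it contains a raw space and ends in "&num10". A minimal sketch of building it with java.net.URLEncoder follows. It assumes "&num=10" was what the question intended; the class name, method name and sample barcode are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper; not part of the original question or answer.
class SearchUrlBuilder {

    // Encodes the keyword plus " price" as one query value and appends num=<resultCount>.
    static String buildSearchUrl(String keyword, int resultCount) {
        try {
            String query = URLEncoder.encode(keyword + " price", "UTF-8");
            return "https://www.google.com/search?q=" + query + "&num=" + resultCount;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on standard JVMs, so this is not expected to happen.
            throw new IllegalStateException("UTF-8 is not supported", e);
        }
    }

    public static void main(String[] args) {
        // Sample barcode value, assumed only for this illustration.
        System.out.println(buildSearchUrl("0123456789012", 10));
    }
}
```

URLEncoder writes the space as "+", which is the form-encoding convention for query strings.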

asp.NET login using HTTP post method with jsoup

I have recently been trying to develop an Android app for my school friends, so that they can check their updated grades and exam schedule from a simple app instead of a web browser. Since the school won't give permission to use their DB, the only option is HTML parsing.
So I found the Jsoup library and an example, and started writing my own code, but it always returns the page source of the login page (it doesn't log in at all).
public Document getHTMLsoure() {
Document doc=null;
try {
doc = Jsoup.connect("http://karinca.meliksah.edu.tr")
.data("ctl00$ContentPlaceHolder1$txtKullaniciAdi","usernm")
.data("ctl00$ContentPlaceHolder1$txtSifre", "passwd")
.data("ctl00$ContentPlaceHolder1$btnLogin", "Giriş")
.userAgent("Mozilla")
.post();
} catch (IOException e1) {
e1.printStackTrace();
}
return doc;
}

Please check this.
With the placeholder credentials below, the result is "Kullanıcı adı yada şifre hatası !" (Turkish for "Username or password error!").
Response res = Jsoup
.connect("https://karinca.meliksah.edu.tr/View/Login")
.userAgent("Mozilla")
.execute();
Document doc = res.parse();
String eventArgument = doc.select("input[name=__EVENTARGUMENT]").val();
String viewState = doc.select("input[name=__VIEWSTATE]").val();
String viewStateGenerator = doc.select("input[name=__VIEWSTATEGENERATOR]").val();
String eventValidation = doc.select("input[name=__EVENTVALIDATION]").val();
String asyncPost = "true";
String ct = "";
String body = doc.body().html();
int indexOf = body.indexOf("Sys.WebForms.PageRequestManager._initialize(");
if(indexOf > -1){
int indexEnd = body.substring(indexOf).indexOf("');");
if(indexEnd > -1){
String temp = body.substring(indexOf, indexOf+indexEnd);
int indexStart = temp.lastIndexOf("'");
ct = temp.substring(indexStart+1,temp.length());
}
}
Document doc1 = Jsoup.connect("https://karinca.meliksah.edu.tr/View/Login.aspx")
.referrer("https://karinca.meliksah.edu.tr/View/Login")
.cookies(res.cookies())
.data(ct+"$ContentPlaceHolder1$ScriptManager2",ct+"$ContentPlaceHolder1$UpdatePanel1|"+ct+"$ContentPlaceHolder1$btnLogin")
.data(ct+"$ContentPlaceHolder1$txtKullaniciAdi","usernm")
.data(ct+"$ContentPlaceHolder1$txtSifre", "passwd")
.data("__EVENTTARGET",ct+"$ContentPlaceHolder1$btnLogin")
.data("__EVENTARGUMENT",eventArgument)
.data("__VIEWSTATE",viewState)
.data("__VIEWSTATEGENERATOR",viewStateGenerator)
.data("__EVENTVALIDATION",eventValidation)
.data("__ASYNCPOST",asyncPost)
.userAgent("Mozilla")
.post();
System.out.println(doc1.html());
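The string handling in the middle of this code takes the last quoted argument of the Sys.WebForms.PageRequestManager._initialize(...) call and uses it as the prefix of the form field names. Here is a standalone sketch of just that step, so it can be checked without a network call. The class name, the method name and the sample script line are assumptions; the real page may pass different arguments.

```java
// Hypothetical helper that isolates the prefix extraction from the answer above.
class ScriptPrefixExtractor {

    private static final String MARKER = "Sys.WebForms.PageRequestManager._initialize(";

    // Returns the text of the last single-quoted argument of the _initialize(...) call,
    // or an empty string when the call (or its closing "');") is not found.
    static String extractLastQuotedArgument(String html) {
        int start = html.indexOf(MARKER);
        if (start < 0) {
            return "";
        }
        int end = html.indexOf("');", start);
        if (end < 0) {
            return "";
        }
        String call = html.substring(start, end);  // stops just before the closing quote
        int openingQuote = call.lastIndexOf('\''); // opening quote of the last argument
        return openingQuote < 0 ? "" : call.substring(openingQuote + 1);
    }

    public static void main(String[] args) {
        // Made-up script line in the shape the answer's code expects.
        String sample = "<script>Sys.WebForms.PageRequestManager._initialize("
                + "'ctl00$ScriptManager1', 'form1', [], [], [], 90, 'ctl00');</script>";
        System.out.println(extractLastQuotedArgument(sample));
    }
}
```

Returning an empty string when the marker is missing mirrors the answer, where ct stays "" in that case.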

how to get an image from HTML page Android studio

I have a JPG image on a site (I suppose the page is HTML, but I am not sure). When I open the page source, I can see the image I need written this way.
If I open that link, it shows me the image.
But I don't know how to get this link from the page URL. It does not look like JSON.
How can I get it?
Thanks
Bar.

After some experimenting I got to this:
The meta tags are the elements, and property (here og:image) and content are their attributes.
So I do the following to get the image URL string:
String imageLink=null;
try {
Log.d(TAG, "Connecting to [" + strings[0] + "]");
Document doc = Jsoup.connect(strings[0]).get(); // put all the HTML page in Document
// Get meta info
Elements metaElems = doc.select("meta");
for (Element metaElem : metaElems) {
String property = metaElem.attr("property");
if(property.equals("og:image"))// if find the line with the image
{
imageLink = metaElem.attr("content");
Log.d(TAG, "Image URL" + imageLink );
}
}
} catch (Exception e) {
e.printStackTrace();
exception =e;
return null;
}

Here is a small code snippet for integrating this kind of functionality; it may help you.
Step 1: Add the Gradle dependency below
compile 'org.jsoup:jsoup:1.10.2'
Step 2:
Use the AsyncTask below to get all the meta information from any URL.
public class MainActivity extends AppCompatActivity {
private ImageView imgOgImage;
private TextView text;
String URL = "https://www.youtube.com/watch?v=ufaK_Hd6BpI";
String UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36";
@Override
protected void onCreate(Bundle savedInstanceState) {
super.onCreate(savedInstanceState);
setContentView(R.layout.activity_main);
text = (TextView) findViewById(R.id.text);
imgOgImage = (ImageView) findViewById(R.id.imgOgImage);
new FetchMetadataFromURL().execute();
}
private class FetchMetadataFromURL extends AsyncTask<Void, Void, Void> {
String websiteTitle, websiteDescription, imgurl;
@Override
protected void onPreExecute() {
super.onPreExecute();
}
@Override
protected Void doInBackground(Void... params) {
try {
// Connect to website
Document document = Jsoup.connect(URL).get();
// Get the html document title
websiteTitle = document.title();
//Here It's just print whole property of URL
Elements metaElems = document.select("meta");
for (Element metaElem : metaElems) {
String property = metaElem.attr("property");
Log.e("Property", "Property =" + property + " \n Value =" + metaElem.attr("content"));
}
// Locate the content attribute
websiteDescription = metaElems.attr("content");
Elements metaOgImage = document.select("meta[property=og:image]");
// select() never returns null, so check for an empty result before calling first()
if (!metaOgImage.isEmpty()) {
imgurl = metaOgImage.first().attr("content");
System.out.println("src :<<<------>>> " + imgurl);
}
} catch (IOException e) {
e.printStackTrace();
}
return null;
}
@Override
protected void onPostExecute(Void result) {
text.setText("Title : " + websiteTitle + "\n\nImage Url :: " + imgurl);
//t2.setText(websiteDescription);
Picasso.with(getApplicationContext()).load(imgurl).into(imgOgImage);
}
}
}
Note: this is only a rough demo written for learning purposes and does not follow any coding standard, so take care when you integrate this code into your application.
I used a YouTube URL here to display the meta data; you can use any URL you need.
I hope the logic is clear.
Good luck.

Issue on Parsing image url from image tag

I have written code for parsing the image URL from the <image> tag.
The code is below:
NodeList imageLink = docElement.getElementsByTagName("image");
String nUrl;
if (imageLink.toString() != null) {
Element element = (Element) imageLink.item(0);
NodeList imageUrl = element.getElementsByTagName("url");
if (imageUrl.toString() != null) {
Element imageFirst = (Element) imageUrl.item(0);
nUrl = imageFirst.getFirstChild().getNodeValue();
Log.d(TAG,
"<<<<<<<<<<<<<<<<<<<<<<..............Image Url is : "
+ nUrl
+ ".....................>>>>>>>>>>>>>>>>>.....");
} else {
Log.d(TAG,
"<<<<<<<<<<<<<<<<<<<<<<..............Image Url is null : .....................>>>>>>>>>>>>>>>>>.....");
nUrl = "http://static.dnaindia.com/images/710/logo_dna_rss.gif";
}
} else {
Log.d(TAG,
"<<<<<<<<<<<<<<<<<<<<<<..............Image tag is not found.....................>>>>>>>>>>>>>>>>>.....");
nUrl = "http://static.dnaindia.com/images/710/logo_dna_rss.gif";
}
It works fine with RSS feeds that have an <image> tag. I want to set a default image for feeds that do not have an <image> URL.
But my code throws java.lang.NullPointerException at this line: NodeList imageUrl = element.getElementsByTagName("url");
How do I check a NodeList for null?
Please give me any idea for fixing it.
Thank you in advance!

Surround the line NodeList imageUrl = element.getElementsByTagName("url"); with try-catch and catch the exception thrown when no url tag is found.
Like this:
try{
NodeList imageUrl = element.getElementsByTagName("url");
} catch(NullPointerException e){
nUrl = "http://static.dnaindia.com/images/710/logo_dna_rss.gif";
e.printStackTrace();
}
Hope it helps...
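As an alternative to catching the exception, the lookup can test getLength() before calling item(0). In the org.w3c.dom API, getElementsByTagName returns an empty NodeList rather than null when nothing matches, and item(0) on an empty list returns null, which is where the NullPointerException comes from. This is a minimal sketch that uses only the JDK's XML parser; the class name, method names and sample feeds are hypothetical, and the default image URL is the one from the question.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper; falls back to the default image when <image> or <url> is missing.
class FeedImageResolver {

    static final String DEFAULT_IMAGE = "http://static.dnaindia.com/images/710/logo_dna_rss.gif";

    static String imageUrlOrDefault(Element root) {
        NodeList images = root.getElementsByTagName("image");
        if (images.getLength() == 0) {
            return DEFAULT_IMAGE; // no <image> tag in the feed
        }
        NodeList urls = ((Element) images.item(0)).getElementsByTagName("url");
        if (urls.getLength() == 0) {
            return DEFAULT_IMAGE; // <image> present but without <url>
        }
        Node text = urls.item(0).getFirstChild();
        return text == null ? DEFAULT_IMAGE : text.getNodeValue();
    }

    // Parses an XML string and returns its root element.
    static Element parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String withImage = "<rss><channel><image><url>http://example.com/a.png</url></image></channel></rss>";
        String withoutImage = "<rss><channel><title>t</title></channel></rss>";
        System.out.println(imageUrlOrDefault(parse(withImage)));
        System.out.println(imageUrlOrDefault(parse(withoutImage)));
    }
}
```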

Incorrect parse with Jsoup

I'm parsing the site http://animecalendar.net with Jsoup. Everything is parsed fine, but I have one problem: the URLs are parsed correctly (see the logs), yet the list I get on the device is mixed up.
Code:
@Override
protected ArrayList<Order> doInBackground(String... urls) {
listItems.clear();
myAdapter.notifyDataSetChanged();
String dates = null;
String url = null;
try {
Document doc = Jsoup.connect(URL).get();
Elements main = doc.select("div.day");
for (Element m : main) {
titles = m.select("div.tooltip");
for (Element tts : titles) {
title = tts.select("td.tooltip_title h4").text();
time = tts.select("td.tooltip_info h4").text();
img = tts.select("td.tooltip_desc img[src]");
Order o = new Order();
o.setLink(URL + img.attr("src"));
o.setTextName(title);
o.setTextTime(time);
o.setTextDate(dates);
o.setDetailsUrl(URL + url); // incorrect (mixed) displayed urls list in device
listItems.add(o);
}
Elements date = m.select("h2");
for (Element m1 : date) {
dates = m1.select("a").attr("href");
}
Elements links = m.select("h3");
for (Element link : links) {
url = link.select("a").attr("href"); // parse urls from site
System.out.println(url); // in LogCat displayed correct urls list
}
}
} catch (IOException e) {
e.printStackTrace();
}
return listItems;
}
LogCat:
01-21 12:55:55.429: I/System.out(8036): /show/596/Cardfight%21%21_Vanguard%3A_Asia_Circuit_Hen
01-21 12:55:55.429: I/System.out(8036): /show/583/Inazuma_Eleven_GO_2%3A_Chrono_Stone
01-21 12:55:55.445: I/System.out(8036): /show/671/Ai_Mai_Mi_
01-21 12:55:55.445: I/System.out(8036): /show/697/Mangirl%21
etc...
As a result, I get a mixed list of URLs.
How can I resolve it?
Thanks.

I resolved the problem like this:
Elements epBox = doc.select("div.ep_box h3");
int urlcount = 0;
for (Element ep : epBox) {
url = ep.select("a").attr("href");
if (urlcount < listItems.size()) {
Order o = (Order) listItems.get(urlcount);
o.setDetailsUrl(URL + url);
newarraylist.add(o);
}
urlcount++;
}
