How to get Google search headings with Jsoup

I am trying to get the headings of Google search with Jsoup.
Here is my code:
String request = "https://www.google.com/search?q=" + query + "&num=5";
try {
    Document doc = Jsoup
            .connect(request)
            .userAgent("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
            .timeout(5000)
            .get();
    Elements headings = doc.select("h3");
    // headings is empty here
} catch (IOException e) {
    e.printStackTrace();
}
I get no results from doc.select("h3"). What am I doing wrong?

Check the content of your Document first, for example by printing doc.html(). The request may not have gone through properly, or the HTML that Google returns to Jsoup may differ from what you see in your browser.
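One thing worth ruling out along those lines is the request URL itself: the question concatenates query into the URL unencoded, so spaces or characters such as & would change the request. A minimal sketch of building the same URL with the JDK's URLEncoder follows. The class and method names (SearchUrl, build) are made up for illustration, and encoding the query does not by itself guarantee that the returned page contains h3 elements.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SearchUrl {

    // Builds the search URL from the question, with the query percent-encoded.
    public static String build(String query, int num) {
        try {
            return "https://www.google.com/search?q="
                    + URLEncoder.encode(query, "UTF-8")
                    + "&num=" + num;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform
            throw new IllegalStateException("UTF-8 is not supported", e);
        }
    }

    public static void main(String[] args) {
        // Spaces become '+', so the query stays a single parameter value
        System.out.println(build("jsoup select h3", 5));
    }
}
```

The resulting string can be passed to Jsoup.connect(...) in place of the hand-concatenated request.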

Related

Can't scrape elements by class name using JSOUP

This code returns nothing when I try to scrape data from Airbnb:
try {
    doc = Jsoup.connect("https://www.airbnb.com")
            .header("Accept", "text/html")
            .header("Accept-Encoding", "gzip,deflate")
            .header("Accept-Language", "it-IT,en;q=0.8,en-US;q=0.6,de;q=0.4,it;q=0.2,es;q=0.2")
            .header("Connection", "keep-alive")
            .userAgent("Mozilla")
            .get();
} catch (IOException e) {
    e.printStackTrace();
}
Elements els = doc.getElementsByClass("cy5jw6o dir dir-ltr");
System.out.println(els);
I tried the code above and also this:
Elements els = doc.getElementsByClass("div.cy5jw6o.dir.dir-ltr");
How can I get all elements with this class name, and also access the links or other divs nested under them?

Parsing an XML string into a kXML Element

I'm writing an Android app that connects to a SOAP webservice using kSOAP2, and I have a kXML element where I would like to inject a child based on an XML string I got from elsewhere (a REST API). I have the following code:
Element samlHeader = new Element().createElement("http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd", "Security");
samlHeader.setPrefix("wsse", "http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd");
samlHeader.setPrefix("wsu", "http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd");
String samlTokenString = ...; //I got this from elsewhere
Element samlTokenElement = ...; //I don't know how to build this
samlHeader.addChild(Node.ELEMENT, samlTokenElement);
So I'm trying to figure out how to build my Element based on the XML string I'm getting from elsewhere.

This is the solution that we ended up implementing:
try {
    KXmlParser parser = new KXmlParser();
    parser.setInput(new StringReader(samlTokenString));
    parser.setFeature(XmlPullParser.FEATURE_PROCESS_NAMESPACES, true);
    // Parse the string into its own kXML Document, then attach its root element
    Document samlTokenDocument = new Document();
    samlTokenDocument.parse(parser);
    samlHeader.addChild(Node.ELEMENT, samlTokenDocument.getRootElement());
} catch (XmlPullParserException e) {
    Log.e(TAG, "Could not parse SAML assertion", e);
} catch (IOException e) {
    Log.e(TAG, "Could not parse SAML assertion", e);
}
We're still validating if it produces the right result but it seems to work.
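For comparison, the same parse-then-attach idea can be sketched with the JDK's built-in DOM API rather than kXML. This is not the kSOAP2 code above: the class name, the method names, and the urn:example namespaces are placeholders chosen for illustration. The one DOM-specific step is importNode, since a node parsed into one document has to be copied into the target document before it can be appended.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class XmlFragmentInjector {

    private static DocumentBuilder newBuilder() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // keep prefixes bound to their namespaces
            return factory.newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Creates a new document with a single namespaced root element (stand-in for the header).
    public static Element newRoot(String namespace, String qualifiedName) {
        Document doc = newBuilder().newDocument();
        Element root = doc.createElementNS(namespace, qualifiedName);
        doc.appendChild(root);
        return root;
    }

    // Parses the XML string and appends a copy of its root element under parent.
    public static Element appendParsedChild(Element parent, String xml) {
        try {
            Document parsed = newBuilder().parse(new InputSource(new StringReader(xml)));
            // Nodes belong to the document that created them, so copy before appending
            Node copy = parent.getOwnerDocument().importNode(parsed.getDocumentElement(), true);
            parent.appendChild(copy);
            return (Element) copy;
        } catch (SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML fragment", e);
        }
    }

    public static void main(String[] args) {
        Element header = newRoot("urn:example:security", "wsse:Security");
        Element token = appendParsedChild(header,
                "<saml:Assertion xmlns:saml=\"urn:example:saml\">token</saml:Assertion>");
        System.out.println(token.getLocalName() + " attached to " + header.getLocalName());
    }
}
```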

Retrieving information from google page

I'm planning my first Android app, which will be about movies. I found an excellent data source, "http://www.google.com/movies?", but I'd like to know how to extract this information and put it in my app.
I've searched, but I don't know the best way to do this. Does Google have an API for it? Is that what I want, or is it better to work from the page source? What could I read or watch to learn how to do this?
Thanks a lot, guys. This is also my first time writing code that retrieves information from the cloud.
Cheers

Yup. Here is one way to do it.
First, you need to find a source for the data. The Yahoo Developer Console is a great place to look for this sort of thing; it covers a very wide range of data sources. These resources work through a long link, like this:
developer.yahoo.com/blah/this . . . &q=KEYWORD_HERE+blah/ . . .
To access the information you are looking for, you put the correct keyword where "KEYWORD_HERE" is, and the link returns the data as XML. I'll use a stocks app as the example.
First, create an Activity and define the two halves of your link as strings. It'll look a bit like this:
public class InfoActivity extends Activity {
    String firstHalf = "http://query.yahooapis.com/v1/public/blahblahblah&q=";
    String secondHalf = "+blah/blah&blah . . . ";
Then, in your onCreate, start an AsyncTask to do the actual fetching and parsing:
    protected void onCreate(Bundle bundle) {
        super.onCreate(bundle);
        // setContentView takes a layout resource (R.layout.*), not a view id
        setContentView(R.layout.layout_name);
        // KEYWORD_HERE stands for whatever keyword you are querying
        final String yqlURL = firstHalf + KEYWORD_HERE + secondHalf;
        new MyAsyncTask().execute(yqlURL);
    }
Then define MyAsyncTask:
    private class MyAsyncTask extends AsyncTask<String, String, String> {
        protected String doInBackground(String... args) {
            try {
                URL url = new URL(args[0]);
                HttpURLConnection httpConnection = (HttpURLConnection) url.openConnection();
                int responseCode = httpConnection.getResponseCode();
                // Only parse the body if the server answered 200 OK
                if (responseCode == HttpURLConnection.HTTP_OK) {
                    InputStream in = httpConnection.getInputStream();
                    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
                    DocumentBuilder db = dbf.newDocumentBuilder();
                    Document dom = db.parse(in);
                    Element docEle = dom.getDocumentElement();
                    NodeList nl = docEle.getElementsByTagName("nodeName1");
                    if (nl != null && nl.getLength() > 0) {
                        for (int i = 0; i < nl.getLength(); i++) {
                            // Parse nl.item(i) here with a helper you write yourself,
                            // such as getTextValue(element, "Name of element")
                            // ex: String movieName = getTextValue(element, "MovieName");
                        }
                    }
                }
            } catch (MalformedURLException e) {
                Log.d(TAG, "MalformedURLException", e);
            } catch (IOException e) {
                Log.d(TAG, "IOException", e);
            } catch (ParserConfigurationException e) {
                Log.d(TAG, "Parser Configuration Exception", e);
            } catch (SAXException e) {
                Log.d(TAG, "SAX Exception", e);
            }
            return null;
        }
    }
I hope that gives you some idea of how to do this sort of thing. I'll see if I can quickly find a good resource in the Yahoo APIs for getting movie times at a given location.
Good luck :) Let me know if you need anything clarified.
EDIT:
Looks like this is exactly the resource you need:
https://developer.yahoo.com/yql/console/?q=show%20tables&env=store://datatables.org/alltableswithkeys#h=select+*+from+google.igoogle.movies+where+movies%3D'68105'%3B
Check that out. Using it, the two halves of your link would be:
String firstHalf = "https://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20google.igoogle.movies%20where%20movies%3D'";
String secondHalf = "'%3B&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys";
To get your final link, you would then do:
String yqlURL = firstHalf + "ZIP CODE OF YOUR LOCATION" + secondHalf;
That returns all of the movies playing near you.
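The parsing loop above refers to a getTextValue helper that the answer never defines. A minimal sketch of what such a helper might look like with the standard DOM API is shown here. The element names movie and name and the sample XML are invented for illustration; they are not the real response format of any Yahoo or Google service.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class MovieXmlSketch {

    // Returns the text of the first descendant called tagName, or null if there is none.
    public static String getTextValue(Element parent, String tagName) {
        NodeList matches = parent.getElementsByTagName(tagName);
        if (matches.getLength() == 0) {
            return null;
        }
        return matches.item(0).getTextContent();
    }

    // Collects the <name> text of every <movie> element (hypothetical tag names).
    public static List<String> parseNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nl = dom.getDocumentElement().getElementsByTagName("movie");
            for (int i = 0; i < nl.getLength(); i++) {
                names.add(getTextValue((Element) nl.item(i), "name"));
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return names;
    }

    public static void main(String[] args) {
        String xml = "<results><movie><name>First</name></movie>"
                + "<movie><name>Second</name></movie></results>";
        System.out.println(parseNames(xml)); // prints [First, Second]
    }
}
```

In the AsyncTask above, the body of the for loop would call the helper once per node and collect the results.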

Make your life a lot easier and choose the api that is right for you. Choose one of these:
http://www.programmableweb.com/news/52-movies-apis-rovi-rotten-tomatoes-and-internet-video-archive/2013/01/22
Make your decision not only based on the content, but also ease of use and documentation. Documentation is a biggy.
Good luck!

Well, I would rather advise you to use the TheMovieDB.com API. It is simple and provides all kinds of movie information.

Android rss feed parsing

I am new to Android. In my application I have to parse some data and display it on screen. However, I can't parse the data in one particular tag, because some special characters also appear inside that tag. My code is below.
My parser function:
protected ArrayList<String> doInBackground(Context... params) {
    // context = params[0];
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    test = new ArrayList<String>();
    try {
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document document = builder.parse(new java.net.URL("input URL_confidential").openConnection().getInputStream());
        // Document document = builder.parse(new URL("http://www.gamestar.de/rss/gamestar.rss").openConnection().getInputStream());
        Element root = document.getDocumentElement();
        NodeList docItems = root.getElementsByTagName("item");
        Node nodeItem;
        for (int i = 0; i < docItems.getLength(); i++) {
            nodeItem = docItems.item(i);
            if (nodeItem.getNodeType() == Node.ELEMENT_NODE) {
                NodeList element = nodeItem.getChildNodes();
                Element entry = (Element) docItems.item(i);
                name = element.item(0).getFirstChild().getNodeValue();
                // System.out.println("description = " + element.item(2).getFirstChild().getNodeValue().replaceAll("&lt;div&gt;&lt;p&gt;", " "));
                System.out.println("Description" + Jsoup.clean(org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4(element.item(2).getFirstChild().getNodeValue()), new Whitelist()));
                items.add(name);
            }
        }
    } catch (ParserConfigurationException e) {
        e.printStackTrace();
    } catch (MalformedURLException e) {
        e.printStackTrace();
    } catch (SAXException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return items;
}
Input:
<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
<channel>
<title>my application</title>
<link>http:// some link</link>
<atom:link href="http:// XXXXXXXX" rel="self"></atom:link>
<language>en-us</language>
<lastBuildDate>Thu, 20 Dec 2012</lastBuildDate>
<item>
<title>lllegal settlements</title>
<link>http://XXXXXXXXXXXXXXXX</link>
<description> &lt;div&gt;&lt;p&gt;
India was joined by all members of the 15-nation UN Security Council except the US to condemn Israelâ€™s announcement of new construction activity in Palestinian territories and demand immediate dismantling of the â€œillegalâ€ settlements.
&lt;/p&gt;
&lt;p&gt;
UN Secretary General Ban Ki-moon also expressed his deep concern by the heightened settlement activity in West Bank, saying the move by Israel â€œgravely threatens efforts to establish a viable Palestinian state.â€
&lt;/p&gt;
&lt;p&gt;
</description>
</item>
</channel>
Output:
lllegal settlements ----> title tag text
India was joined by all members of the 15-nation UN Security Council except the US to condemn Israel announcement of new construction activity in Palestinian territories and demand immediate dismantling of the illegal settlements. -----> description tag text
UN Secretary General Ban Ki-moon also expressed his deep concern by the heightened settlement activity in West Bank, saying the move by Israel gravely threatens efforts to establish a viable Palestinian state. ----> description tag text.

Your text node contains both escaped HTML entities (&gt; is >, greater than) and garbage characters (â€œgrosslyâ€). You should first fix the encoding to match your input source; then you can unescape the HTML with Apache Commons Lang StringEscapeUtils.unescapeHtml4(String).
This method (hopefully) returns XML that you can query (for example with XPath) to extract the wanted text node, or you can give the whole string to JSOUP or to the Android Html class:
// JSOUP: "html" is the unescaped string; text() returns a plain String
Jsoup.parse(html).text();
// Android: fromHtml returns a Spanned, so convert it to a String
android.text.Html.fromHtml(html).toString();
Test program (JSOUP and Commons-Lang required)
package stackoverflow;
import org.apache.commons.lang3.StringEscapeUtils;
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;
public class EmbeddedHTML {
    public static void main(String[] args) {
        String src = "<description> &lt;div&gt;&lt;p&gt; An independent" +
                " inquiry into the September 11 attack on the US Consulate" +
                " in Benghazi that killed the US ambassador to Libya and" +
                " three other Americans has found that systematic failures" +
                " at the State Department led to â€œgrosslyâ€ inadequate" +
                " security at the mission. &lt;/p&gt;</description>";
        // Turn the escaped entities back into real markup
        String unescaped = StringEscapeUtils.unescapeHtml4(src);
        // An empty Whitelist strips every tag and keeps only the text
        System.out.println(Jsoup.clean(unescaped, new Whitelist()));
    }
}
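The test program above covers the unescaping step but not the "fix the encoding" step. Sequences such as â€™ are what UTF-8 bytes look like when they are decoded as windows-1252, so one possible repair, assuming that is what happened to the feed, is to reverse that mistake. The sketch below uses only the JDK; the class and method names are hypothetical. Decoding the stream with the correct charset in the first place is preferable where possible, and bytes with no windows-1252 mapping (which may be why the closing quotes in the sample appear truncated) may not be recoverable this way.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeRepair {

    // Assumption: the text was UTF-8 that got decoded as windows-1252 somewhere upstream
    private static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    // Re-encodes the garbled text to recover the original bytes, then decodes them as UTF-8.
    public static String repair(String garbled) {
        byte[] originalBytes = garbled.getBytes(WINDOWS_1252);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "Israel" + a-circumflex, euro sign, trademark sign + "s", written with escapes
        String garbled = "Israel\u00e2\u20ac\u2122s";
        // The three garbage characters collapse into one right single quotation mark (U+2019)
        System.out.println(repair(garbled));
    }
}
```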

Is there anything wrong with simply replacing the offending characters?
string = string.replaceAll("&lt;", "");
string = string.replaceAll("div&gt;", "");
string = string.replaceAll("p&gt;", "");

Run the node value through Html.fromHtml() two or three times and it will be fine.
EXPLANATION: The built-in Html.fromHtml() method converts wild and broken HTML into usable content. Each pass also undoes one level of entity escaping, which is why repeating it helps. Pseudocode:
String sHtml = node.getNodeValue();
sHtml = Html.fromHtml(sHtml).toString();
sHtml = Html.fromHtml(sHtml).toString();
sHtml = Html.fromHtml(sHtml).toString();
By the third or fourth pass, unreadable content becomes readable again. You can display it in a TextView or load it into a WebView with loadData.
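Why repeated passes can help is easy to illustrate without Android: if the text was escaped more than once, each unescaping pass peels off one layer. The sketch below uses a deliberately tiny, hand-written unescaper that knows only four entities (it is not Html.fromHtml and not Commons Lang) and repeats it until the string stops changing, with a pass limit as a safeguard. All names in it are hypothetical.

```java
public class RepeatedUnescape {

    // Undoes exactly one level of escaping for four common entities.
    // "&amp;" is handled last so that "&amp;lt;" becomes "&lt;" rather than "<" in a single pass.
    public static String unescapeOnce(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");
    }

    // Repeats unescapeOnce until nothing changes or maxPasses is reached.
    public static String unescapeFully(String s, int maxPasses) {
        String current = s;
        for (int pass = 0; pass < maxPasses; pass++) {
            String next = unescapeOnce(current);
            if (next.equals(current)) {
                break; // stable: no escaped entities left
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        String doubleEscaped = "&amp;lt;p&amp;gt;Hello&amp;lt;/p&amp;gt;";
        System.out.println(unescapeOnce(doubleEscaped));      // prints &lt;p&gt;Hello&lt;/p&gt;
        System.out.println(unescapeFully(doubleEscaped, 5));  // prints <p>Hello</p>
    }
}
```

Stopping when the string is stable avoids guessing whether two, three, or four passes are needed.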

Android - How to parse RSS Feed from .net website?

I'm using this method to read RSS feeds from a URL. Everything works fine, except that it fails to get feeds from a .NET web server (e.g. http://www.dotnetnuke.com/Resources/Blogs/rssid/99.aspx).
public String getRSSLinkFromURL(String url) {
    // RSS url
    String rss_url = null;
    try {
        // Use the JSoup library to parse the HTML source
        org.jsoup.nodes.Document doc = Jsoup.connect(url).get();
        // Find RSS links: link[type=application/rss+xml]
        org.jsoup.select.Elements links = doc.select("link[type=application/rss+xml]");
        Log.d("No of RSS links found", " " + links.size());
        // Check whether any URLs were found
        if (links.size() > 0) {
            rss_url = links.get(0).attr("href");
        } else {
            // Fall back to Atom links: link[type=application/atom+xml]
            org.jsoup.select.Elements links1 = doc.select("link[type=application/atom+xml]");
            if (links1.size() > 0) {
                rss_url = links1.get(0).attr("href");
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    // Return the RSS url (null if none was found)
    return rss_url;
}

Your RSS feed is broken: transfer closed with outstanding read data remaining.
curl will return that message when the socket has been closed before
the final terminating chunk of a chunky transfer is read. It sure
sounds like a server bug to me.
Source: Re: transfer closed with outstanding read data remaining with Expect: 100-continue
A fix (workaround) for JSoup is here:
https://github.com/jhy/jsoup/pull/323
