How to organize extracted values when working with jsoup?

How do you store values extracted with jsoup so that they are easy to read back later? Suppose you have HTML like the following.
<td width="200">country1 </td>
<td width="200">country2 </td>
<td width="200">country3 </td>
I want to save each country together with its href link and be able to look them up easily later. Currently I have two ListViews, one for the countries and one for the href links. If the user selects, for example, country2, I find its index and use it to get the href link from the other ListView. I feel this approach is not good. How do you do it?
This is my jsoup code, by the way, in case it needs improvement too.
try {
    doc = Jsoup.connect("http://somesite.com").get();
    // Get the text inside each <a> tag
    Elements links = doc.select("a");
    for (Element el : links) {
        // Use a separate String; the original reassigned the Elements variable
        String linkName = el.ownText();
        // Save each link text into the list
        array_link.add(linkName);
    }
    // Get the text inside each <td> tag
    Elements linktwo = doc.select("td");
    for (Element eltwo : linktwo) {
        String linkText = eltwo.ownText();
        // Save each country into the list
        array_countries.add(linkText);
    }
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
Thank you!

Is this what you want?
try {
    Document doc = Jsoup.connect("http://somesite.com").get();
    // Select the <a> and <td> elements
    Elements links = doc.select("a");
    Elements linktwo = doc.select("td");
    // Pair entries by index; stop at the shorter list to avoid an IndexOutOfBoundsException
    int count = Math.min(links.size(), linktwo.size());
    for (int i = 0; i < count; i++) {
        // Save the link text and the country found at the same index
        array_link.add(links.get(i).text());
        array_countries.add(linktwo.get(i).text());
    }
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
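One way to avoid keeping two parallel lists in sync is to store each country name together with its link in a single ordered map. The following is a minimal sketch rather than the approach from the answer above. It assumes each name and href have already been extracted (for example with el.text() and el.attr("abs:href") in jsoup); the class name CountryLinks, its methods, and the example.com URLs are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: keeps each country name paired with its link.
public class CountryLinks {
    // LinkedHashMap preserves insertion order, so names keep their page order.
    private final Map<String, String> hrefByCountry = new LinkedHashMap<>();

    // Store one extracted pair; trim() removes stray whitespace such as "country1 ".
    public void add(String country, String href) {
        hrefByCountry.put(country.trim(), href);
    }

    // Names in insertion order, e.g. to hand to a ListView adapter.
    public List<String> countries() {
        return new ArrayList<>(hrefByCountry.keySet());
    }

    // Look up the link for the selected name; returns null if the name is unknown.
    public String hrefFor(String country) {
        return hrefByCountry.get(country);
    }

    public static void main(String[] args) {
        CountryLinks links = new CountryLinks();
        links.add("country1 ", "http://example.com/country1");
        links.add("country2 ", "http://example.com/country2");
        System.out.println(links.countries());
        System.out.println(links.hrefFor("country2"));
    }
}
```

With this layout the selected name is the lookup key, so no index bookkeeping between two lists is needed.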

Related

Can't scrape elements by class name using JSOUP

This code returns nothing when I try to scrape data from Airbnb:
try {
    doc = Jsoup.connect("https://www.airbnb.com")
            .header("Accept", "text/html")
            .header("Accept-Encoding", "gzip,deflate")
            .header("Accept-Language", "it-IT,en;q=0.8,en-US;q=0.6,de;q=0.4,it;q=0.2,es;q=0.2")
            .header("Connection", "keep-alive")
            .userAgent("Mozilla")
            .get();
} catch (IOException e) {
    e.printStackTrace();
}
Elements els = doc.getElementsByClass("cy5jw6o dir dir-ltr");
System.out.println(els);
I tried the code above and also this:
Elements els = doc.getElementsByClass("div.cy5jw6o.dir.dir-ltr");
How can I get all elements with this class name, and also access the links or other divs under them?

String.contains doing opposite Jsoup Android

I may be misunderstanding what String.contains does. I am trying to pull a specific link using jsoup on Android, taking the Facebook link as an example. I've tried a few things. This version seems to print "got it" for the links that do not contain the Facebook URL and leaves the Facebook ones blank. How do I get only the Facebook links and then stop the loop?
protected Void doInBackground(Void... params) {
    Document doc = null;
    try {
        doc = Jsoup.connect("http://www.homedepot.com").get();
        Elements link = doc.select("a[href]");
        String stringLink = null;
        for (int i = 0; i < link.size(); i++) {
            stringLink = link.toString();
            if (stringLink.contains("https://www.facebook.com/")) {
                System.out.println(stringLink + "got it");
            } else {
                //System.out.println(stringLink+"not it");
            }
        }
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return null;
}
}

The following line is causing the problem:
stringLink = link.toString();
The link variable is a collection of Elements (in this case, every link on the page), so calling link.toString() gives you the String representation of every single link on the page at once. That means stringLink contains the Facebook link on every iteration, and the check is always true.
Change the line to:
stringLink = link.get(i).toString();
This line gets only the link at index i on each iteration and checks whether or not it contains the facebook link.
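The difference can be reproduced without jsoup. The sketch below uses a plain List<String> as a stand-in for the Elements collection (an assumption made for illustration; the class name, method names, and sample links are hypothetical). It shows that the string form of the whole collection always matches, while checking one element at a time finds only the Facebook link and stops there.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the jsoup loop: a List<String> plays the role of Elements.
public class ContainsCheck {
    public static final String TARGET = "https://www.facebook.com/";

    // Buggy variant: converts the whole collection to one string before checking.
    public static boolean wholeListContains(List<String> links) {
        return links.toString().contains(TARGET);
    }

    // Fixed variant: inspects one element at a time and stops at the first match.
    public static String firstMatch(List<String> links) {
        for (int i = 0; i < links.size(); i++) {
            String link = links.get(i);
            if (link.contains(TARGET)) {
                return link; // returning here ends the loop
            }
        }
        return null; // no element matched
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
                "<a href=\"http://www.homedepot.com/c/about\">About</a>",
                "<a href=\"https://www.facebook.com/homedepot\">Facebook</a>");
        System.out.println(wholeListContains(links));
        System.out.println(firstMatch(links));
    }
}
```

In the original loop, a break statement inside the if block would stop the iteration after the first match in the same way the return does here.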

Get random words from android dictionary

I am learning Android, and I would like to know whether there is a way to get 3-letter words, 4-letter words, or some other specific type of word at random from the Android UserDictionary class. Since Android has an autocorrect feature, I'm guessing it also has a dictionary. How do I use it, and where can I find a proper tutorial?
I have no idea about the code and have searched around a lot. Please help me with the code and, if possible, an explanation :)

I don't know how to access the Android dictionary, but you can ship a "custom" dictionary as a txt file in the app's assets folder. This link has several word lists ranging from around 20,000 to 200,000 words. You can find more lists with Google.
Afterwards, you can read the txt file and add each word to an ArrayList if it matches the wanted word length. A random word can then be selected from that list. The following code creates the dictionary and selects a random word from it.
private ArrayList<String> dictionary;
private int wordLength; // Set elsewhere

private void createDictionary() {
    dictionary = new ArrayList<String>();
    BufferedReader dict = null; // Holds the dictionary file
    AssetManager am = this.getAssets();
    try {
        // dictionary.txt should be in the assets folder.
        dict = new BufferedReader(new InputStreamReader(am.open("dictionary.txt")));
        String word;
        while ((word = dict.readLine()) != null) {
            // Keep only words of the wanted length
            if (word.length() == wordLength) {
                dictionary.add(word);
            }
        }
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    try {
        // dict is still null if opening the file failed
        if (dict != null) {
            dict.close();
        }
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}

// Precondition: the dictionary has been created and is not empty.
private String getRandomWord() {
    return dictionary.get((int) (Math.random() * dictionary.size()));
}

Android rss feed parsing

I am new to Android. In my application I have to parse the data and display it on screen. However, I can't parse the data in one particular tag, because some special characters also appear inside that tag. My code is shown below.
My parser function:
protected ArrayList<String> doInBackground(Context... params)
{
// context = params[0];
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
test = new ArrayList<String>();
try {
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse(new java.net.URL("input URL_confidential").openConnection().getInputStream());
//Document document = builder.parse(new URL("http://www.gamestar.de/rss/gamestar.rss").openConnection().getInputStream());
Element root = document.getDocumentElement();
NodeList docItems = root.getElementsByTagName("item");
Node nodeItem;
for(int i = 0;i<docItems.getLength();i++)
{
nodeItem = docItems.item(i);
if(nodeItem.getNodeType() == Node.ELEMENT_NODE)
{
NodeList element = nodeItem.getChildNodes();
Element entry = (Element) docItems.item(i);
name=(element.item(0).getFirstChild().getNodeValue());
// System.out.println("description = "+element.item(2).getFirstChild().getNodeValue().replaceAll("<div><p>"," "));
System.out.println("Description"+Jsoup.clean(org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4(element.item(2).getFirstChild().getNodeValue()), new Whitelist()));
items.add(name);
}
}
}
catch (ParserConfigurationException e)
{
// TODO Auto-generated catch block
e.printStackTrace();
}
catch (MalformedURLException e)
{
// TODO Auto-generated catch block
e.printStackTrace();
}
catch (SAXException e)
{
// TODO Auto-generated catch block
e.printStackTrace();
}
catch (IOException e)
{
// TODO Auto-generated catch block
e.printStackTrace();
}
return items;
}
Input:
<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
<channel>
<title>my application</title>
<link>http:// some link</link>
<atom:link href="http:// XXXXXXXX" rel="self"></atom:link>
<language>en-us</language>
<lastBuildDate>Thu, 20 Dec 2012</lastBuildDate>
<item>
<title>lllegal settlements</title>
<link>http://XXXXXXXXXXXXXXXX</link>
<description> <div><p>
India was joined by all members of the 15-nation UN Security Council except the US to condemn Israelâ€™s announcement of new construction activity in Palestinian territories and demand immediate dismantling of the â€œillegalâ€ settlements.
</p>
<p>
UN Secretary General Ban Ki-moon also expressed his deep concern by the heightened settlement activity in West Bank, saying the move by Israel â€œgravely threatens efforts to establish a viable Palestinian state.â€
</p>
<p>
</description>
</item>
</channel>
Output:
lllegal settlements ----> title tag text
India was joined by all members of the 15-nation UN Security Council except the US to condemn Israel announcement of new construction activity in Palestinian territories and demand immediate dismantling of the illegal settlements. -----> description tag text
UN Secretary General Ban Ki-moon also expressed his deep concern by the heightened settlement activity in West Bank, saying the move by Israel gravely threatens efforts to establish a viable Palestinian state. ----> description tag text.

Your text node contains both escaped HTML entities (&gt; is >, greater than) and garbage characters (â€œgrosslyâ€). You should first adjust the encoding to match your input source; then you can unescape the HTML with Apache Commons Lang StringEscapeUtils.unescapeHtml4(String).
This method (hopefully) returns XML that you can query (for example with XPath) to extract the wanted text node, or you can give the whole string to JSOUP or to the Android Html class:
// JSOUP; "html" is the unescaped string. Returns a string.
Jsoup.parse(html).text();
// Android
android.text.Html.fromHtml(html).toString();
Test program (JSOUP and Commons-Lang required):
package stackoverflow;

import org.apache.commons.lang3.StringEscapeUtils;
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class EmbeddedHTML {
    public static void main(String[] args) {
        // The embedded HTML arrives entity-escaped inside the XML text node
        String src = "<description> &lt;div&gt;&lt;p&gt; An independent" +
                " inquiry into the September 11 attack on the US Consulate" +
                " in Benghazi that killed the US ambassador to Libya and" +
                " three other Americans has found that systematic failures" +
                " at the State Department led to â€œgrosslyâ€ inadequate" +
                " security at the mission. &lt;/p&gt;</description>";
        // Turn the entities back into real tags
        String unescaped = StringEscapeUtils.unescapeHtml4(src);
        // An empty whitelist strips every tag and keeps only the text
        System.out.println(Jsoup.clean(unescaped, new Whitelist()));
    }
}
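The encoding step can be illustrated separately. Sequences such as â€™ typically appear when UTF-8 bytes were decoded as Windows-1252. The sketch below rests on that assumption, which is not a confirmed diagnosis of this particular feed, and reverses that one specific mistake. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: undoes "UTF-8 bytes decoded as Windows-1252" garbling.
public class MojibakeFix {
    private static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    // Re-encode the garbled text to recover the original bytes, then decode them as UTF-8.
    public static String repair(String garbled) {
        byte[] originalBytes = garbled.getBytes(WINDOWS_1252);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Garbled form of "Israel's" with a curly apostrophe, written with Unicode escapes.
        String garbled = "Israel\u00e2\u20ac\u2122s";
        String repaired = repair(garbled);
        // Check that the curly apostrophe U+2019 was recovered.
        System.out.println(repaired.equals("Israel\u2019s"));
    }
}
```

Some characters may not survive this round trip, because a few byte values have no Windows-1252 mapping. The more robust fix is to read the feed with the correct charset in the first place, so that no repair is needed.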

Is there anything wrong with simply replacing the offending characters?
string = string.replaceAll("&lt;", "");
string = string.replaceAll("div&gt;", "");
string = string.replaceAll("p&gt;", "");

Run the node value through Html.fromHtml() two or three times and it will be fine.
EXPLANATION: The built-in Html.fromHtml() method converts wild and broken HTML into usable content. Each pass also decodes one level of escaped entities, which is why repeating it helps. Pseudo code:
sHTML = node.getNodeValue()
sHTML = Html.fromHtml(sHTML)
sHTML = Html.fromHtml(sHTML)
sHTML = Html.fromHtml(sHTML)
By the third or fourth pass, unreadable content will become readable again. You can display it in a TextView, or load it into a WebView with loadData().

Android: Need help, trying to parse a HTML page using JSoup parser

Here is the code I have tried so far, but it gives me an error:
URL url = null;
try {
url = new URL("http://wap.nastabuss.se/its4wap/QueryForm.aspx?hpl=Teleborg+C+(V%C3%A4xj%C3%B6)");
} catch (MalformedURLException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
System.out.println("1");
Document doc = null;
try {
System.out.println("2");
doc = Jsoup.parse(url, 3000);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
System.out.println("3");
Element table = doc.select("table[title=Avgångar:]").first();
System.out.println("4");
Iterator<Element> it = table.select("td").iterator();
//we know the third td element is where we wanna start so we call .next twice
it.next();
it.next();
while(it.hasNext()){
// do what ever you want with the td element here
System.out.println(it.next());
//iterate three times to get to the next td you want. checking after the first
// one to make sure
// we're not at the end of the table.
it.next();
if(!it.hasNext()){
break;
}
it.next();
it.next();
}
It prints "3" and then stops at this line:
Element table = doc.select("table[title=Avgångar:]").first();
How can I solve this problem?
Thanks

It looks like the page you're trying to parse has an error and doesn't contain any tables. This is what's causing the null pointer exception: doc.select("table[title=Avgångar:]") finds no matching element, so .first() gives you null, and then you call a method on it. To prevent this error, you could do something like this:
Elements foundTables = doc.select("table[title=Avgångar:]");
Element table = null;
if (!foundTables.isEmpty()) {
    table = foundTables.first();
}
Now the table variable is non-null only if a table was found. You'll still have to check it for null and adapt the rest of the code for the case where no table is found.

You're not checking the result of doc.select(...).first() before using it. If no elements in the document match the query, doc.select() returns an empty Elements collection (it does not return null), and .first() on that collection returns null. Calling table.select("td") on the null reference then throws a NullPointerException. Note also that doc itself stays null if Jsoup.parse() threw an IOException, in which case the doc.select() call fails immediately. There is no table tag with the title you specified in the document used in your example, so the result is not surprising.
