I need to extract information from an unstructured web page in Android. The information I want is embedded in a table that doesn't have an id.
<table>
<tr><td>Description</td><td></td><td>I want this field next to the description cell</td></tr>
</table>
Should I use:
Pattern matching?
BufferedReader to extract the information?
Or is there a faster way to get that information?
I think in this case it makes no sense to look for a fast way to extract the information as there is virtually no performance difference between the methods already suggested in answers when you compare it to the time it will take to download the HTML.
So assuming that by fastest you mean most convenient, readable and maintainable code, I suggest you use a DocumentBuilder to parse the relevant HTML and extract data using XPathExpressions:
Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().parse(new InputSource(new StringReader(html)));
XPathExpression xpath = XPathFactory.newInstance()
        .newXPath().compile("//td[text()=\"Description\"]/following-sibling::td[2]");
String result = (String) xpath.evaluate(doc, XPathConstants.STRING);
If you happen to retrieve invalid HTML, I recommend isolating the relevant portion (e.g. using substring(indexOf("<table")..) and, if necessary, correcting the remaining HTML errors with String operations before parsing. If this gets too complex, however (i.e. very bad HTML), just go with the hacky pattern matching approach suggested in other answers.
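To illustrate the "isolate, then parse" idea, here is a minimal sketch. The class name TableIsolator and both method names are made up for this example, and it assumes the page contains a single, non-nested table whose markup is itself well-formed XML:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class TableIsolator {

    /** Returns the first <table>...</table> fragment, or null if none is found. */
    static String isolateFirstTable(String html) {
        int start = html.indexOf("<table");
        if (start < 0) {
            return null;
        }
        int end = html.indexOf("</table>", start);
        if (end < 0) {
            return null;
        }
        return html.substring(start, end + "</table>".length());
    }

    /** Parses only the isolated fragment and evaluates the XPath from the answer. */
    static String descriptionValue(String html) {
        String table = isolateFirstTable(html);
        if (table == null) {
            return "";
        }
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(table)));
            return XPathFactory.newInstance().newXPath()
                    .evaluate("//td[text()=\"Description\"]/following-sibling::td[2]", doc);
        } catch (Exception e) {
            // The fragment was not well-formed XML after all.
            throw new IllegalStateException("Could not parse table fragment", e);
        }
    }
}
```

The rest of the page can then be as broken as it likes, since only the fragment between the table tags is handed to the XML parser.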
Remarks
XPath is available since API Level 8 (Android 2.2). If you develop for lower API levels, you can use DOM methods and conditionals to navigate to the node you want to extract.
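A rough sketch of the DOM-only route, without javax.xml.xpath. The names DomCellFinder, cellAfterLabel and text are invented for this example, and it assumes each cell holds plain text only:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class DomCellFinder {

    /** Text of a cell, assuming its first child (if any) is a text node. */
    static String text(Node cell) {
        Node child = cell.getFirstChild();
        if (child == null || child.getNodeValue() == null) {
            return "";
        }
        return child.getNodeValue();
    }

    /** Returns the text of the offset-th <td> sibling after the labelled cell, or null. */
    static String cellAfterLabel(String xml, String label, int offset) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Input is not well-formed XML", e);
        }
        NodeList cells = doc.getElementsByTagName("td");
        for (int i = 0; i < cells.getLength(); i++) {
            Node cell = cells.item(i);
            if (!label.equals(text(cell))) {
                continue;
            }
            // Walk forward over the siblings, counting only <td> elements.
            int seen = 0;
            for (Node n = cell.getNextSibling(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.ELEMENT_NODE && "td".equals(n.getNodeName())) {
                    seen++;
                    if (seen == offset) {
                        return text(n);
                    }
                }
            }
        }
        return null;
    }
}
```

For the table in the question, cellAfterLabel(tableXml, "Description", 2) plays the role of the following-sibling::td[2] step in the XPath expression.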
The fastest way will be parsing the specific information yourself. You seem to know the HTML structure precisely beforehand. The BufferedReader, String and StringBuilder methods should suffice. Here's a kickoff example which displays the first paragraph of your own question:
public static void main(String... args) throws Exception {
    URL url = new URL("http://stackoverflow.com/questions/2971155");
    BufferedReader reader = null;
    StringBuilder builder = new StringBuilder();
    try {
        reader = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
        for (String line; (line = reader.readLine()) != null;) {
            builder.append(line.trim());
        }
    } finally {
        if (reader != null) try { reader.close(); } catch (IOException logOrIgnore) {}
    }
    String start = "<div class=\"post-text\"><p>";
    String end = "</p>";
    String part = builder.substring(builder.indexOf(start) + start.length());
    String question = part.substring(0, part.indexOf(end));
    System.out.println(question);
}
Parsing is in practically all cases definitely faster than pattern matching. Pattern matching is easier, but there is a certain risk that it may yield unexpected results, certainly when using complex regex patterns.
You can also consider using a more flexible 3rd party HTML parser instead of writing one yourself. It will not be as fast as parsing yourself with beforehand known information. It will however be more concise and flexible. With decent HTML parsers the difference in speed is pretty negligible. I strongly recommend Jsoup for this. It supports jQuery-like CSS selectors. Extracting the first paragraph of your question would then be as easy as:
public static void main(String... args) throws Exception {
Document document = Jsoup.connect("http://stackoverflow.com/questions/2971155").get();
String question = document.select("#question .post-text p").first().text();
System.out.println(question);
}
It's unclear what web page you're talking about, so I can't give a more detailed example of how you could select the specific information from the specific page using Jsoup. If you still can't figure it out on your own using Jsoup and CSS selectors, then feel free to post the URL in a comment and I'll suggest how to do it.
When you scrape an HTML web page, there are two things you can do: use regex, or use an HTML parser.
Regex is not preferred by everyone, because it can cause logic errors at runtime.
Using an HTML parser is more complicated. You cannot be sure the proper output will come out, and in my experience it also throws runtime exceptions.
So it is better to convert the response from the URL into an XML file; XML parsing is then easy and effective.
Why don't you just write
int start=data.indexOf("Description");
After that take the required substring.
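For the table in the question, "taking the required substring" could look roughly like this. SubstringExtractor and cellAfter are hypothetical names, and the sketch assumes plain <td> tags without attributes:

```java
// Hypothetical helper, not part of any library.
final class SubstringExtractor {

    /** Returns the content of the n-th <td> that opens after the label, or null. */
    static String cellAfter(String data, String label, int n) {
        int pos = data.indexOf(label);
        if (pos < 0) {
            return null;
        }
        // Jump from one opening <td> to the next, n times.
        int open = pos;
        for (int i = 0; i < n; i++) {
            open = data.indexOf("<td>", open + 1);
            if (open < 0) {
                return null;
            }
        }
        int contentStart = open + "<td>".length();
        int close = data.indexOf("</td>", contentStart);
        if (close < 0) {
            return null;
        }
        return data.substring(contentStart, close);
    }
}
```

With the question's markup, cellAfter(data, "Description", 2) skips the empty cell and returns the text of the one after it.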
Why don't you create a script that does the scraping with cURL and Simple HTML DOM Parser and just grab the value you need from that page? These tools work with PHP, but other tools exist for any language you need.
One way of doing this is to put the HTML into a String and then manually search and parse through the String. If you know that the tags will come in a specific order, then you should be able to crawl through it and find the data. This however is kinda sloppy, so it's a question of: do you want it to work now, or work well?
int position = html.indexOf("<table>"); // html being the String holding the HTML code
// As written, both nested searches land on the first <td>...</td> after the table tag.
// To reach a later cell, start each search one character past the previous match.
String field = html.substring(html.indexOf("<td>", html.indexOf("<td>", position)) + 4, html.indexOf("</td>", html.indexOf("</td>", position)));
Like I said... really sloppy. But if you're only doing this once and you need it to work, this just might do the trick.
Related
I am making an Android app that displays stored HTML data using a WebView. The problem I am trying to overcome is how to ignore HTML/CSS tags and elements when searching for a user-input string. My DB is already 110MB, and I think adding another field with only text and no HTML will just add more size to the DB. Regex will be expensive too and may not be reliable.
Is there any other way to do it?
Maybe you can do an additional filtering in your program on the queried records. You can use HTML parsers like Jsoup to strip HTML tags, then you can search in the remaining text. Simple Java example with Jsoup:
List<String> records = ... // your queried records - potential results
List<String> results = new ArrayList<String>();
for (String r : records) {
    Document d = Jsoup.parse(r); // parse HTML
    String text = d.text(); // extract text
    if (text.contains(searchTerm)) { // or do your search here
        results.add(r);
    }
}
return results; // you got real results here
It may not be the best solution but is an option. I think it's expensive too, but more reliable than regular expressions (which you try to avoid).
Update: the regex way
I think the only way to strip HTML tags while fetching is to use regex in SQLite. For example, the following pattern should work to match string outside HTML tags:
(^|>)[^<]*(searchterm)[^<]*(<|$)
In the following example text it will match only the 1st, 3rd and 4th searchterm and not the 2nd:
searchterm <tag searchterm> searchterm </tag> searchterm
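In SQLite you can use regular expressions this way (note that SQLite only supports the REGEXP operator once a regexp() user function has been registered; none is defined by default):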
WHERE column-name REGEXP 'regular-expression'
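The pattern can be tried out in plain Java before relying on it in a query. This is only a sketch; OutsideTagSearch and its methods are names invented here, and the search term is quoted so that it is matched literally:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class OutsideTagSearch {

    /** Builds the pattern from the answer for a literal search term. */
    static Pattern outsideTags(String term) {
        return Pattern.compile("(^|>)[^<]*(" + Pattern.quote(term) + ")[^<]*(<|$)");
    }

    /** True if the term occurs somewhere in the text between tags. */
    static boolean containsOutsideTags(String html, String term) {
        return outsideTags(term).matcher(html).find();
    }

    /** Counts the text segments (runs between tags) that contain the term. */
    static int countSegments(String html, String term) {
        Matcher m = outsideTags(term).matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }
}
```

Note that countSegments counts text runs, not individual occurrences: a run that contains the term twice still produces a single match.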
The string below is what I retrieve from one of the fields of a JSON response. How do I get the value of src from it? I would really appreciate any help. Thanks in advance.
Getting better doesn’t stop because it’s getting colder. The best athletes don’t just overcome the elements, they embrace them with Nike Hyperwarm. Gear up for winter: http://www.nike.com/hyperwarm<br/><br/><img class="img" src="http://vthumb.ak.fbcdn.net/hvthumb-ak-ash3/t15/1095964_10151882078663445_10151882076318445_40450_2013_b.jpg" alt="" style="height:90px;" /><br/>Winning in a Winter Wonderland
Try using the jsoup HTML parsing API. It has dedicated functionality for HTML parsing and would also give you an extensible solution.
For your case (I escaped the quotes with an additional \ to make it a valid Java string):
String str = "Getting better doesn’t stop because it’s getting colder. The best athletes don’t just overcome the elements, they embrace them with Nike Hyperwarm. Gear up for winter: http://www.nike.com/hyperwarm<br/><br/><img class=\"img\" src=\"http://vthumb.ak.fbcdn.net/hvthumb-ak-ash3/t15/1095964_10151882078663445_10151882076318445_40450_2013_b.jpg\" alt=\"\" style=\"height:90px;\" /><br/>Winning in a Winter Wonderland\"";
Document doc = Jsoup.parse(str);
Element element = doc.select("img").first();
System.out.println(element.attr("src"));
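Element element2 = doc.select("a").first(); // Get the first anchor tag, or null if there is none
if (element2 != null) { // the sample string contains no <a>, so guard against null
    System.out.println(element2.attr("onclick")); // onclick attribute of the anchor tag
}
Output: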
http://vthumb.ak.fbcdn.net/hvthumb-ak-ash3/t15/1095964_10151882078663445_10151882076318445_40450_2013_b.jpg
Hi Everyone,
I am fetching a text from my DB. Before inserting the text into the DB I know that its encoding is ISO-8859-1, but after fetching it from the DB and before loading it, I check the encoding with this code:
InputStreamReader is = new InputStreamReader(new ByteArrayInputStream(body.getBytes()));
is.getEncoding();
Log.v("encoding", ""+is.getEncoding());
// String body = fetched from db
and the log reports the encoding of the text as UTF-8. The text also does not load in the WebView with this method:
mailView.loadDataWithBaseURL(null, body, "text/html", "UTF-8", null);
Please suggest a correct way to solve this problem.
This reply is terribly late, but I stumbled on the question via Google and so thought I'd answer.
As described in the JavaDoc, new InputStreamReader(InputStream) will create a reader with the system default (apparently UTF-8). is.getEncoding() is simply returning that default which may or may not match your stored data.
In general, it is a good idea to specify the encoding of your stream explicitly. The implication is that you need to store the encoding along with the content. You can use out of band knowledge (e.g., my application only uses ISO-8859-1) but this will be brittle in the event that you change your chosen encoding in the future.
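A small sketch of what "specify the encoding explicitly" means in practice. ExplicitCharset and decode are illustrative names, and the byte array stands in for whatever raw bytes your storage layer hands back:

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of any library.
final class ExplicitCharset {

    static final Charset LATIN1 = Charset.forName("ISO-8859-1");
    static final Charset UTF8 = Charset.forName("UTF-8");

    /** Decodes stored bytes with the charset that was used to write them. */
    static String decode(byte[] stored, Charset charset) {
        return new String(stored, charset);
    }

    public static void main(String[] args) {
        String original = "V\u00e4xj\u00f6"; // contains non-ASCII characters
        byte[] stored = original.getBytes(LATIN1); // encoded explicitly as ISO-8859-1

        // Same charset on both sides: the text survives the round trip.
        System.out.println(original.equals(decode(stored, LATIN1)));
        // Decoding the same bytes as UTF-8 corrupts the non-ASCII characters.
        System.out.println(original.equals(decode(stored, UTF8)));
    }
}
```

A Java String itself carries no charset; only the bytes do, which is why wrapping body.getBytes() in a reader reports the platform default rather than anything about the stored data.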
Since the world isn't always a nice place, and strings get separated from their charsets, you might look into a charset detector. See http://userguide.icu-project.org/conversion/detection as an example.
I have a question:
I have a link: http://wap.nastabuss.se/its4wap/QueryForm.aspx?hpl=Teleborg+C+(V%C3%A4xj%C3%B6)
and I want to take only some specific data from this link and show it in a TextView in Android.
Is this possible in Android? I mean, is there any way to do it by parsing, or some other approach you can suggest?
For example I just want to take this column Nästa tur (min) from that site.
Regards
JSoup is pretty nice and getting popular. Here's how you could just parse the whole table:
URL url = new URL("http://www.nseindia.com/content/equities/niftysparks.htm"); // replace with the URL of the page you want to parse
Document doc = Jsoup.parse(url, 3000); // 3000 ms timeout
Element table = doc.select("table[title=Avgångar:]").first();
Iterator<Element> it = table.select("td").iterator();
// We know the third td element is where we want to start, so we skip the first two.
it.next();
it.next();
while (it.hasNext()) {
    Element td = it.next(); // the cell we are interested in
    // Do whatever you want with td here.
    // Then skip two cells to reach the next td we want, checking first
    // that we're not at the end of the table.
    if (!it.hasNext()) {
        break;
    }
    it.next();
    it.next();
}
If the parsing seems simple enough and you want to, you could also use regular expressions to find the correct part of the HTML. Regular expressions will be useful to know at some point anyway. Using an XML/HTML parsing library is the more flexible way to do it (XMLReader, for example).
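As a rough idea of the regular-expression route, the sketch below pulls one column out of a table by taking every n-th cell. ColumnByRegex and column are hypothetical names, and it assumes simple, non-nested cells and a fixed number of columns per row:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class ColumnByRegex {

    // Matches one <td ...>content</td> cell; DOTALL lets the content span lines.
    private static final Pattern CELL =
            Pattern.compile("<td[^>]*>(.*?)</td>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Collects the cells of one column, given the number of columns per row. */
    static List<String> column(String tableHtml, int columnIndex, int columnsPerRow) {
        List<String> values = new ArrayList<String>();
        Matcher m = CELL.matcher(tableHtml);
        for (int i = 0; m.find(); i++) {
            if (i % columnsPerRow == columnIndex) {
                values.add(m.group(1).trim());
            }
        }
        return values;
    }
}
```

With three columns per row, column(tableHtml, 2, 3) corresponds to the "start at the third td, then every third one" loop above. It breaks as soon as the markup gets less regular, which is where a parser pays off.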
I am making an application for Android, and part of its functionality is to return results from an online search of a library's catalogue. The application needs to display the results of the search, which is carried out by way of a custom HTML form, in a manner in keeping with the rest of the application. I.e., the results of the search need to be parsed and the useful elements displayed. I was just wondering if/how this could be achieved in Android?
You would use an HTML parser. One that I use and that works VERY well is Jsoup.
This is where you will need to begin with parsing HTML. Jericho HTML Parser is another good one.
You would retrieve the HTML document as a DOM, and use the Jsoup select() method to select any tags that you would like to get, either by tag, id, or class.
Solution
Use the Jsoup.connect(String url) method:
Document doc = Jsoup.connect("http://example.com/").get();
This connects to the HTML page at the given URL and stores the parsed result as the Document doc, a DOM-like tree. You can then read from it using the select() method.
Description
The connect(String url) method creates a new Connection, and get() fetches and parses an HTML file. If an error occurs whilst fetching the URL, it will throw an IOException, which you should handle appropriately.
The Connection interface is designed for method chaining to build specific requests:
Document doc = Jsoup.connect("http://example.com")
        .data("query", "Java")
        .userAgent("Mozilla")
        .cookie("auth", "token")
        .timeout(3000)
        .post();
If you read through the documentation on Jsoup you should be able to achieve this.
EDIT: Here is how you would use the select() method
// Once the Document is retrieved above, use these selectors to extract the data you want by tag, id, or CSS class
Elements links = doc.select("a[href]"); // a with href
Elements pngs = doc.select("img[src$=.png]"); // img with src ending .png
Element masthead = doc.select("div.masthead").first(); // div with class=masthead
Elements resultLinks = doc.select("h3.r > a"); // direct a after h3
EDIT: Using Jsoup you could use this to get attributes, text, and HTML:
// The values in the comments assume the page body is:
// <p>An <a href='http://example.com/'><b>example</b></a> link.</p>
Document doc = Jsoup.connect("http://example.com").get();
Element link = doc.select("a").first();
String text = doc.body().text(); // "An example link."
String linkHref = link.attr("href"); // "http://example.com/"
String linkText = link.text(); // "example"
String linkOuterH = link.outerHtml(); // "<a href="http://example.com/"><b>example</b></a>"
String linkInnerH = link.html(); // "<b>example</b>"
You can use XmlPullParser for parsing XML.
For example, see http://developer.android.com/reference/org/xmlpull/v1/XmlPullParser.html
Since the search results are HTML, and HTML is a markup language, you can use Android's XmlPullParser to parse the results. This only works reliably if the HTML is well-formed (XHTML); for typical real-world HTML, a lenient HTML parser is the safer choice.