I'm developing a cross-platform mobile app with Qt 5.3.1. I need to load various HTML pages and parse DOM element values from them. So far I have successfully loaded a page with QNetworkAccessManager and stored it in a QByteArray, but I hit a wall trying to parse the useful data out of it.
A couple of points:
I can't use Qt WebKit, since it's not supported on Android in Qt 5.
The HTML can't be assumed to be strict markup, so Qt's XML readers or DOM parsers won't work on their own.
I'm only parsing text from the pages. The information is all I need, not the visual style.
What options do I have? It seems a bit silly that WebKit would be the only way to do this, since I don't need to display any graphical data from the web pages. Is writing my own DOM parser for HTML the way to go?
http://qt-project.org/wiki/Handling_HTML
That page has a pretty good list of the HTML parsers that are available.
Sometimes a good regular expression can catch what you need, but it isn't as robust as a good HTML parser.
The first link on the page looks pretty promising:
http://tidy.sourceforge.net/libintro.html
I don't know how difficult it would be to build the libraries for Qt Android, but it looks do-able, and works with standard tools.
Hope that helps.
Related
I want to get data from a web page, then style and display it in my Android app. Is there an effective way to do so? I've already tried XML parsers, but the app becomes extremely slow.
I think an HTML parser is effective enough if you want to restyle a web page (remove things, add things, etc.).
Have you tried JSoup?
JSoup HTML parser
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.
jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.
scrape and parse HTML from a URL, file, or string
find and extract data, using DOM traversal or CSS selectors
manipulate the HTML elements, attributes, and text
clean user-submitted content against a safe white-list, to prevent XSS attacks
output tidy HTML
jsoup is designed to deal with all varieties of HTML found in the wild; from pristine and validating, to invalid tag-soup; jsoup will create a sensible parse tree.
Adding to project:
Download lib
Latest JSoup jar
Right click on your project > Properties > Java Build Path > Libraries > Add [external] Jars
I am trying to develop an app to get the RSS feeds from http://xxx.xxx.com/xxxxxblog .
Can someone help me with the HTML parsing to get the feeds?
You can try JSoup to parse the HTML.
It is very simple to use and well documented, so you should not have too much trouble parsing your page.
You can find out how to do that on this page:
http://jsoup.org/cookbook/extracting-data/selector-syntax
It explains how to use selectors to pick out HTML tags and extract the data inside them.
The feeds on this web page seem to be clearly delimited by the <dc:subject> tag.
Since you only need the feeds, the shortest route may be to find the feed boundaries with a regular expression that also captures the header (something like <dc:subject>(.*?)</dc:subject>). Read line by line; once the expression matches, you have found the start of a feed. It may not be the most principled approach, and arguably we should parse the whole HTML instead, but why run unnecessary code?
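As a rough sketch of the regex idea above, using only java.util.regex: the class and method names here are made up for illustration, and unlike the line-by-line reading described, this version matches against the whole page as one string (with DOTALL so a tag body spanning several lines is still captured).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class SubjectExtractor {
    // Reluctant (.*?) stops at the first closing tag; DOTALL lets '.' match newlines.
    private static final Pattern SUBJECT =
            Pattern.compile("<dc:subject>(.*?)</dc:subject>", Pattern.DOTALL);

    // Returns the trimmed text of every <dc:subject> element found in the input.
    static List<String> extractSubjects(String html) {
        List<String> subjects = new ArrayList<>();
        Matcher m = SUBJECT.matcher(html);
        while (m.find()) {
            subjects.add(m.group(1).trim());
        }
        return subjects;
    }

    public static void main(String[] args) {
        String sample = "<item><dc:subject>News</dc:subject></item>"
                + "<item><dc:subject>\n Sport \n</dc:subject></item>";
        System.out.println(extractSubjects(sample));
    }
}
```

This only works while the tag is written exactly as in the pattern (no attributes, no different casing), which is the usual trade-off of regex against a real parser.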
There is no shortage of parsers for Java either, from Java's built-in HTML parser to various alternative libraries that may fit better in some cases. Some also suggest using an XML parser (XPath). Various solutions are discussed here.
Please try this example code to create an RSS reader that can actually handle namespace extensions:
https://github.com/dodyg/AndroidRivers/blob/master/src/com/silverkeytech/android_rivers/xml/RssParser.kt
The library underlying this code is https://github.com/thebuzzmedia/simple-java-xml-parser
It works very well in Android as well.
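For comparison, here is a minimal sketch of namespace-aware extraction that does not use the library linked above. It relies only on the javax.xml.parsers and org.w3c.dom classes, assumes the feed is well-formed XML, and assumes it uses the standard Dublin Core namespace for dc:subject. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not the API of the linked parser.
final class RssSubjects {
    // Dublin Core element namespace, normally bound to the "dc" prefix in RSS feeds.
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Returns the text of every dc:subject element, whatever prefix the feed uses.
    static List<String> subjects(String rssXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the *NS lookup below
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(rssXml)));
            NodeList nodes = doc.getElementsByTagNameNS(DC_NS, "subject");
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(nodes.item(i).getTextContent().trim());
            }
            return out;
        } catch (Exception e) {
            // Malformed input ends up here; a real app would handle it more carefully.
            throw new IllegalArgumentException("Could not parse feed", e);
        }
    }
}
```

Looking elements up by namespace URI rather than by the literal "dc:" prefix keeps this working even if a feed binds the namespace to a different prefix.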
I need to parse about 100 kB of HTML data and this simply causes huge performance issues on Android. I've tried both the built-in XML parser and JTidy.
The built-in XML parser gives me a parsing time of about half a second, which I can easily live with. The problem is that it's a bad idea to use an XML parser on messy HTML, so this is not an option. (I tried preprocessing, but it even started complaining about valid HTML, so...)
I googled a bit, and JTidy was suggested for cleaning up the code before passing it to an XML parser. This was an absolute nightmare: with JTidy for preprocessing, parsing now takes approximately 7 seconds.
So now my only alternative really is regex. What do you think?
It depends on whether you own the HTML.
If (as I understand) you do not own the HTML data and cannot influence how it is formatted, then you will probably find this useful: Parse HTML in Android
But if the HTML is really bad, the result can't be guaranteed, and you may prefer working with regex.
Even browsers switch to quirks mode when working with "bad" HTML, with no guarantee that it renders correctly.
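To illustrate why a strict XML parser is a poor fit for "bad" HTML, here is a small sketch using the standard javax.xml.parsers API. The helper name is hypothetical; it simply reports whether a snippet parses as XML at all.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for demonstration only.
final class StrictXmlCheck {
    // True if the markup is well-formed XML, false if the parser rejects it.
    static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors as exceptions, which we catch below.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Typical browser-tolerated HTML: <br> and <p> are never closed.
        System.out.println(parsesAsXml("<p>one<br><p>two"));
        // The same content written as well-formed markup.
        System.out.println(parsesAsXml("<div><p>one<br/></p><p>two</p></div>"));
    }
}
```

A single unclosed tag is enough for the whole parse to fail, which is why a tolerant HTML parser or a targeted regex tends to be the practical choice for pages you do not control.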
I want to write a program that gets the match dates from this link: http://www.goal.com/en/teams/germany/148/fc-bayern-munich-news
and use them in my program. I just want the dates and the matches. How can I do this on Android?
I'd write an Activity to display the data, which calls an AsyncTask to connect to the site and download the HTML. I'd then use some kind of parser to grab the data I want and save it to a database.
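The download step could look roughly like the plain-Java sketch below, which you would call from the background part of the AsyncTask (never from the UI thread). The class and method names are invented for illustration, the timeouts are arbitrary, and it assumes the page is UTF-8 encoded.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; names and timeout values are assumptions.
final class HtmlFetcher {
    // Reads the whole stream into a String, assuming UTF-8 content.
    static String readAll(InputStream in) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new RuntimeException("Failed to read response", e);
        }
    }

    // Downloads the page at the given URL and returns its HTML as a String.
    static String fetch(String pageUrl) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(pageUrl).openConnection();
            conn.setConnectTimeout(10_000); // milliseconds
            conn.setReadTimeout(10_000);
            try (InputStream in = conn.getInputStream()) {
                return readAll(in);
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            throw new RuntimeException("Failed to download " + pageUrl, e);
        }
    }
}
```

The returned string is what you would then hand to whichever parser you pick.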
Have you written Java before? If not, I'd start out by learning the language. Download Eclipse and write a simple program that can connect to the site and grab the HTML. Then add the parser.
Once you are that far, do the Hello World tutorial, then work your way through the other tutorials. Also learn about the Android Application Lifecycle. At that point you can start thinking about moving your code over to the Android framework.
EDIT
Here are some links to information about potential parsers & parsing approaches.
Tag Soup
What HTML parsing libraries do you recommend in Java
Two HTML parsing links
You could also consider using (hushed voice) regex/pattern matching.
I have read the example for RSS parsing from the IBM site (http://www.ibm.com/developerworks/opensource/library/x-android/).
In this example, the RSS items are shown in a ListView, and if you press an announcement you can see it in the device's web browser. How could I view them inside the app, without using the device browser?
Thanks a lot
Create a layout with a WebView then load the URL from each "announcement" using WebView.loadUrl.
I'm a little confused but you seem to have answered your own question.
You say you don't want to use the web browser on the device but the example in your question doesn't use the browser. It does exactly what you're asking for.
The idea is that you download the HTML from the website, then use the parser to break it up into separate "announcements" and store them as list view items in your program.
I have done a bit of this type of thing myself on Android. I used the jsoup Java library, which makes breaking the HTML into the bits you want to display really easy.
If you want some more help, I can give you an example: an app I made that pulls movie times from google.com/movies. Here are links to the classes where I did the HTML download and parse:
ScreenScraper.java
HtmlParser.java