I need to parse about 100 kB of HTML data and this simply causes huge performance issues on Android. I've tried both the built-in XML parser and JTidy.
The built-in XML parser gives me a parsing time of about half a second, which I can easily live with. The problem is that it's a bad idea to use an XML parser to parse messy HTML code, so this is not an option. (I tried preprocessing, but it even started complaining about valid HTML, so...)
I googled a bit and JTidy was suggested for cleaning up the code before passing it to an XML parser. This was an absolute nightmare: with JTidy for preprocessing, parsing now takes approximately 7 seconds.
So now my only alternative really is regex. What do you think?
It depends on whether you own the HTML.
If (as I understood) you don't own the HTML data and can't influence how it is formatted, then you will probably find this useful: Parse HTML in Android
But if the HTML is really bad, the result can't be guaranteed, and you may prefer working with regex.
Even browsers switch to quirks mode when working with "bad" HTML, with no guarantee that it renders correctly.
Related
I'm developing a cross-platform mobile app with Qt 5.3.1. I need to load various HTML pages and parse DOM element values from them. At the moment I have successfully loaded a page with QNetworkAccessManager and stored it in a QByteArray, but I hit a wall trying to parse the valuable data out of it.
A couple of points:
I can't use QWebkit since it's not supported on Android on Qt 5
The HTML can't be assumed to be strict markup, so Qt's XML readers or DOM parsers won't work on their own
I'm only parsing text from pages. The information is all I need, not the visual style
What options do I have? It sounds a little bit stupid that WebKit would be the only way of doing this, since I don't need to display any graphical data from webpages. Is writing my own DOM parser for HTML the way to go?
http://qt-project.org/wiki/Handling_HTML
Has a pretty good list of html parsers that are available.
Sometimes a good regular expression can catch what you need, but it isn't as robust as a good HTML parser.
The first link on the page looks pretty promising:
http://tidy.sourceforge.net/libintro.html
I don't know how difficult it would be to build the libraries for Qt Android, but it looks do-able, and works with standard tools.
Hope that helps.
I would like to know which is more efficient for getting data from the server: XML or JSON.
Another question:
Is XmlPullParser only for parsing XML data that comes from a web service? So if I am using JSON I don't need XmlPullParser, or does it have other uses?
thank you very much
What I've found extremely useful for parsing JSON is Google's gson library. For xml, you can use gson underneath to do the same thing with gson-xml. With a single line of code you can map your JSON/XML to your objects without having to write a single line of parsing code.
If you find performance to be an issue (I'm making this suggestion because these libs make you super productive), there are mechanisms in both to allow you finer grained control. I doubt you'll have problems with performance though.
For a very thoroughly researched answer to the headline question (though focussed on browsers, not android apps), see David Lee's Balisage 2013 paper:
http://www.balisage.net/Proceedings/vol10/html/Lee01/BalisageVol10-Lee01.html
His conclusion, in one line, is that the choice between XML and JSON makes very little difference in itself - though the details of how you do XML or how you do JSON can make a big difference.
I am trying to develop an app to get the RSS feeds from http://xxx.xxx.com/xxxxxblog .
Can someone help me with the HTML parsing to get the feeds?
You can try JSoup to parse the HTML.
It is very simple to use and well documented, so you should not have too much trouble parsing your page.
You can find out how to do that on this page:
http://jsoup.org/cookbook/extracting-data/selector-syntax
It uses HTML tag selectors to pull out the data inside those tags.
The feeds on this web page seem clearly delimited by the <dc:subject> tag.
As you only need the feeds, the shortest way may be to find the feed boundaries with a regular expression that also captures the header (something like <dc:subject>(.*?)</dc:subject>). Read line by line; once you detect the expression, that is the start of a feed. Maybe it is philosophically not the most correct way and we should parse all the HTML instead, but why run unnecessary code ...
There is no lack of Java parsers either, starting from Java's built-in HTML parser and continuing to various alternative libraries that may fit better in some cases; some also suggest using an XML parser (XPath). Various solutions are discussed here.
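A minimal sketch of that regular-expression approach, assuming the page has already been downloaded into a string. The class and method names (SubjectExtractor, extractSubjects) are made up for illustration, and the sketch scans the whole string in one pass rather than line by line:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class SubjectExtractor {
    // Reluctant (.*?) stops at the first closing tag; DOTALL lets a value span line breaks.
    private static final Pattern SUBJECT =
            Pattern.compile("<dc:subject>(.*?)</dc:subject>", Pattern.DOTALL);

    // Returns the trimmed text of every <dc:subject> element found in the input.
    static List<String> extractSubjects(String html) {
        List<String> subjects = new ArrayList<>();
        Matcher matcher = SUBJECT.matcher(html);
        while (matcher.find()) {
            subjects.add(matcher.group(1).trim());
        }
        return subjects;
    }

    public static void main(String[] args) {
        String sample = "<item><dc:subject>Android</dc:subject></item>"
                + "<item><dc:subject> Parsing </dc:subject></item>";
        System.out.println(extractSubjects(sample));
    }
}
```

This only holds up while the tags stay this regular; attributes on the tag or nested markup would need a real parser.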
please try
Use this example code to create an RSS reader that can actually handle namespace extensions
https://github.com/dodyg/AndroidRivers/blob/master/src/com/silverkeytech/android_rivers/xml/RssParser.kt
The library underlying this code is https://github.com/thebuzzmedia/simple-java-xml-parser.
It works very well in Android as well.
I want to create an app that uses a potentially large XML file. It will also modify the file and, ideally, be able to traverse it in reverse.
I know there is SAX, DOM, and the XML pull parser. The pull parser is out, unless I spend memory on creating my own tree of objects which does not seem feasible.
That leaves SAX and DOM unless there is another parser out there that can do what I want. Highly improbable, I know.
Yes, I saw this answer: https://stackoverflow.com/questions/7498616/which-xml-parser-should-i-use-for-android
Thoughts on having tree like usability without having to use DOM?
There are a lot of options when it comes to parsing XML, but which parser to use depends on your own requirements. For that you need to know the basic differences between the parsers. Here is some basic information.
SAX parser is one where your code is notified as the parser walks through the XML tree,
and you are responsible for keeping track of state and constructing any objects you might want to keep track of the data as the parser marches through.
DOM parser reads the entire document and builds up an in-memory representation that you can query for different elements. Often, you can even construct XPath queries to pull out particular pieces.
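As a rough illustration of the DOM-plus-XPath style just described, here is a small sketch using the standard javax.xml.parsers and javax.xml.xpath packages. The class name DomXPathQuery and the sample document are invented for the example:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: loads the whole document into memory, then queries it.
final class DomXPathQuery {
    // Returns the text content of every node matched by the XPath expression.
    static List<String> select(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or query the document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<books><book id='1'><title>A</title></book>"
                + "<book id='2'><title>B</title></book></books>";
        System.out.println(select(xml, "/books/book[@id='2']/title"));
    }
}
```

The whole tree stays in memory for as long as the Document is referenced, which is the trade-off against SAX.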
As you said you have a large file, and if you also want faster performance, I suggest that you use a StAX parser. Here is a link for that.
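For reference, a minimal StAX sketch on a standard JVM using javax.xml.stream. As far as I know that package is not part of the Android SDK, where XmlPullParser plays the same pull-style role, so treat this as an illustration of the streaming idea rather than drop-in Android code. The class name StaxCounter is made up:

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: pulls events one at a time and never builds a tree.
final class StaxCounter {
    // Counts start tags with the given local name.
    static int countElements(String xml, String name) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            int count = 0;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && name.equals(reader.getLocalName())) {
                    count++;
                }
            }
            reader.close();
            return count;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read the XML stream", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countElements("<a><b/><b>x</b><c><b/></c></a>", "b"));
    }
}
```

Because the caller asks for the next event itself, it can stop reading as soon as it has what it needs.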
Hope this will help you...
Also refer to this link.
DOM is better in most cases; it loads the whole XML document at once. But if the XML is very big, we should go for a SAX parser, which reads the document sequentially from the start, tag by tag.
If the XML is really big, it is better to filter on the server side by sending the requirements in the request, or else we can go for pagination, which is advisable.
My application shall parse XML received via HTTP. As far as I understand there are three major ways of parsing XML:
SAX
DOM
XmlPullParser
It is said that SAX is the fastest of these while DOM is not optimal for larger XML documents. But what is a large XML document in terms of parsing? What would be a recommended parser for the following?
XML document size between 1-5 kB
Easy traversing through the document, i.e. I need to know not only the current element but also the parent elements.
As far as I understand there are three major ways of parsing XML:
- SAX
- DOM
- XmlPullParser
Wrong! None of those is the best way. What you really want is annotation-based parsing using the Simple XML Framework. To see why, follow this logic:
Java works with objects.
XML can be represented using Java objects. (see JAXB)
Annotations could be used to map that XML to your Java objects and vice versa.
The Simple XML Framework uses Annotations to allow you to map your Java and XML together.
Simple XML is capable of running on Android (unlike JAXB).
You should use Simple XML for all of your XML needs on Android.
And to help you do exactly that I will point you to my own blog post that explains exactly how to use the Simple library on Android.
Unless you have a 100 MB XML file, Simple will be more than fast enough for you. It is for me; I use it on all of my Android XML projects.
N.B. I should point out that if you require the user to download XML files that are more than 1MB on Android then you may want to rethink your strategy. You might be doing it wrong.
I'm afraid this is a case of, it depends ...
As a rule of thumb, using Java to build a DOM tree from an XML document will consume between 4 and 10 times that document's native size (assuming Western text and UTF-8 encoding), depending on the underlying implementation. So if speed and memory-use are not critical it will not be a problem for the small documents you mention.
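For documents in the 1-5 kB range from the question, a DOM tree also makes the "know the parent elements" requirement easy, since every node keeps a link to its parent. A small sketch, with an invented class name (DomPath) and sample document:

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: walks from an element up to the root via getParentNode().
final class DomPath {
    // Returns a path such as /order/customer/name for the first element with
    // the given tag name, or null if there is no such element.
    static String pathOfFirst(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Node node = doc.getElementsByTagName(tagName).item(0);
            if (node == null) {
                return null;
            }
            Deque<String> parts = new ArrayDeque<>();
            // The loop stops at the Document node, which is not an element.
            for (Node n = node; n != null && n.getNodeType() == Node.ELEMENT_NODE;
                    n = n.getParentNode()) {
                parts.addFirst(n.getNodeName());
            }
            return "/" + String.join("/", parts);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<order><customer><name>Ann</name></customer></order>";
        System.out.println(pathOfFirst(xml, "name"));
    }
}
```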
DOM is generally regarded as quite an unpleasant way to work with XML. For background you might want to look at Elliotte Rusty Harold's presentation: What's Wrong with XML APIs (and how to fix them).
However, using SAX can be even more tedious as the document is processed one item at a time. SAX however is fast and consumes very little memory. If you can find a pull parser you like then by all means try that.
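To show what the item-at-a-time processing means in practice, here is a minimal SAX sketch in which the handler has to remember for itself whether it is currently inside the element of interest. The class name SaxTitleReader and the "title" element are assumptions made for the example:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects the text of every <title> element.
final class SaxTitleReader {
    static List<String> readTitles(String xml) {
        List<String> titles = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inTitle;                          // state we must track ourselves
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if ("title".equals(qName)) {
                    inTitle = true;
                    text.setLength(0);
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times for one text node, so append.
                if (inTitle) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("title".equals(qName)) {
                    titles.add(text.toString());
                    inTitle = false;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the document", e);
        }
        return titles;
    }

    public static void main(String[] args) {
        System.out.println(readTitles("<feed><item><title>One</title></item>"
                + "<item><title>Two</title></item></feed>"));
    }
}
```

Nothing of the document is retained beyond what the handler chooses to store, which is where the low memory use comes from.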
Another approach (not super-efficient, but clean and maintainable) is to build an in-memory tree of your XML (using DOM, say) and then use XPath expressions to select the information you are interested in.