How to parse the XML which contains HTML contents

I am new to Android. Could you help me parse this XML, which contains HTML content like the following?
<title>Jeff Mayweather: Floyd Sr showed a Sign of finally letting go of his Son, Passing Torch to Roger</title>
<summary type="html">
<p>By Shawn Craddick</p><p></p>
<p>Boxingsocialist had a chance to catch up with Floyd Mayweather's other uncle Jeff Mayweather. While Jeff stays busy at the gym he gave us some updates on his fighters as well as his thoughts on Brandon Rios, Gamboa, Floyd Mayweather Sr and Floyd Jr. meeting back together. Also he talked to us about a surprise boxing veteran he might be working with. Check out the interview below.</p>
<p><br></br> <span style="color: #ff6600;">BoxingSocialist</span>- What did you…</p> </summary>
I can parse the title field. To parse the summary field, I check localname.equals("summary") in my RSS handler, but I cannot get the content of the summary field. Can anyone help me with this?

You can use jsoup to parse the HTML content in Java; its site has a tutorial and examples.
Cheers

Try this:
android.text.Html.fromHtml(text).toString();
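Html.fromHtml needs the summary as one string. In the feed from the question the markup appears as real child elements, so a SAX handler's characters callback fires once per text fragment rather than once for the whole summary, which may be why matching only on "summary" yields nothing useful. A minimal sketch of collecting all text inside summary with the plain JDK SAX parser follows; the class name SummaryTextHandler and the sample feed are assumptions made for illustration, and the feed must be well-formed XML (an unclosed <br> would fail).

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: gathers every text fragment found inside <summary>,
// including text nested in child elements such as <p> and <span>.
class SummaryTextHandler extends DefaultHandler {
    private final StringBuilder summary = new StringBuilder();
    private int depth = 0; // > 0 while the parser is inside <summary>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // localName is empty when the parser is not namespace-aware (the JDK default).
        String name = localName.isEmpty() ? qName : localName;
        if (depth > 0) {
            depth++;
        } else if (name.equals("summary")) {
            depth = 1;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth > 0) {
            depth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            summary.append(ch, start, length);
        }
    }

    // Parses the feed and returns the trimmed text content of <summary>.
    static String parse(String xml) {
        SummaryTextHandler handler = new SummaryTextHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("feed is not well-formed XML", e);
        }
        return handler.summary.toString().trim();
    }

    public static void main(String[] args) {
        String xml = "<entry><title>T</title><summary type=\"html\">"
                + "<p>By <span>Shawn</span> Craddick</p></summary></entry>";
        System.out.println(parse(xml));
    }
}
```

Tracking depth rather than a boolean flag keeps the handler collecting until the matching </summary>, however deeply the HTML inside it is nested.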

I once had a feed like this, with HTML data inside the tags. My solution was to ask the data provider to wrap the HTML in CDATA. So, if you have control over how the XML is generated, consider this option.
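To illustrate why CDATA helps: once the HTML is wrapped, the XML parser treats it as plain text, so the whole markup comes back as one string that can be handed to an HTML parser or to Html.fromHtml. The sketch below uses the JDK DOM parser; the class name CdataSummaryDemo and the sample feed are assumptions, not part of the original feed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: reads a CDATA-wrapped <summary> as raw HTML text.
class CdataSummaryDemo {
    static String readSummary(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // The CDATA section is text, so the markup inside it is returned verbatim.
            return doc.getElementsByTagName("summary").item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse feed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<entry><summary type=\"html\">"
                + "<![CDATA[<p>By Shawn Craddick</p>]]></summary></entry>";
        System.out.println(readSummary(xml));
    }
}
```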

Related

How do I pick selective html content in my webview in android?

I am trying to extract selected headlines from the HTML content in my WebView. I am open to a wide variety of options, such as JSON parsing; any hack will do. Has anyone had experience with this, or a rough idea of how to go about it?
Here's my example:
This is my html file content:
<div><h1><span class = "headline"> Some depressing title </span> <span class = "source" > ABCD </span> </h1> <br/> <span class = "body"> crappy body content which I do not need </span></div>
I just want to retrieve "headline" and "source" from this HTML in my WebView, and nothing else (not the body). How do I define a parameter to retrieve these? Any clues on how to do it?
Thanks!

Step 1: Get the HTML source from your WebView (see this question). You basically create a JS interface that passes your HTML source to a Java String.
Step 2: Use an HTML parser (for example, jsoup) to parse the Java String into a format that you can handle easily.
Step 3: Use the parser to extract the relevant information. Here, you could use getElementsByTag("span") to get all your spans, then filter by class; or you could directly use getElementsByClass("headline") and getElementsByClass("source").
In general, you can retrieve the HTML source and parse the DOM in all cases.
Edit: if you don't want to use a parser, you can extract your information by searching the HTML source string (finding the correct classes, then finding the indexes of the '<' and '>' characters to pull out the information). This way is harder, less efficient, and less flexible, but it can be done.
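The parser-free variant from the edit can be sketched in plain Java. The class and method names are hypothetical, and the code assumes the class value is written in double quotes and the wanted text follows the tag directly, as in the HTML from the question.

```java
// Hypothetical helper illustrating the string-search approach.
class SpanExtractor {
    // Returns the text that follows the first tag carrying class="cssClass",
    // or null if no such tag is found.
    static String textAfterClass(String html, String cssClass) {
        int classPos = html.indexOf("\"" + cssClass + "\"");
        if (classPos < 0) {
            return null;
        }
        int open = html.indexOf('>', classPos);   // end of the opening tag
        if (open < 0) {
            return null;
        }
        int close = html.indexOf('<', open);      // start of the next tag
        if (close < 0) {
            return null;
        }
        return html.substring(open + 1, close).trim();
    }

    public static void main(String[] args) {
        String html = "<div><h1><span class = \"headline\"> Some depressing title </span> "
                + "<span class = \"source\" > ABCD </span> </h1> <br/> "
                + "<span class = \"body\"> crappy body content which I do not need </span></div>";
        System.out.println(textAfterClass(html, "headline"));
        System.out.println(textAfterClass(html, "source"));
    }
}
```

This breaks as soon as the span contains nested tags or the class string appears elsewhere in the page, which is the fragility the answer warns about.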

Parsing content which contains html tags using XMLPullParser

I am building an app in android using XmlPullParser.
How can I get the content from an html formatted like this?
<div class="content">
"Some text is here."
<br>
"some more text "<a class="link" href="adress">continues here</a>
<br>
</div>
I want to parse all the content like this:
"Some text is here.
some more text continues here"
"continues here" part should also be hyperlinked.
ADDITION after some comments: The HTML is first passed through Yahoo YQL, which generates XML. I use the generated XML file in my code. The part I want to parse, shown above, comes from that generated XML.

HTML and XML share some syntax, but they are different formats. I don't think using an XmlPullParser for this purpose is a good idea. I recommend using one of the several Java HTML parsers instead.

XmlPullParser is meant to deal with XML. It's really rare to encounter well-structured XHTML pages on the web. An XML parser expects strictly well-formed data and is not supposed to be fault tolerant. HTML, on the other hand, is usually loosely organized.
So, no, it's not a good idea. You should prefer other libraries like TagSoup or Jericho.
PS: The best approach when asking a Stack Overflow question is to try something yourself first and then ask if you get stuck, not the other way around.
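That said, the addition to the question states that the input is XML generated by YQL, so it should be well-formed, and a pull-style loop can walk it. The sketch below uses the JDK's StAX XMLStreamReader as a stand-in for Android's XmlPullParser, since both expose a similar next-event loop. The class name ContentFlattener is hypothetical, and the input is assumed to contain <br/> rather than an unclosed <br>.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: flattens the content fragment into plain text and
// records which character ranges were hyperlinks.
class ContentFlattener {
    final StringBuilder text = new StringBuilder();
    final List<int[]> linkRanges = new ArrayList<>();   // {start, end} offsets into text
    final List<String> linkTargets = new ArrayList<>(); // href for each range

    static ContentFlattener flatten(String xml) {
        ContentFlattener out = new ContentFlattener();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Deliver each run of text as a single CHARACTERS event.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            int linkStart = -1;
            String href = null;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.CHARACTERS) {
                    out.appendText(reader.getText());
                } else if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if (name.equals("br")) {
                        out.text.append('\n');           // line break
                    } else if (name.equals("a")) {
                        linkStart = out.text.length();   // link text starts here
                        href = reader.getAttributeValue(null, "href");
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && reader.getLocalName().equals("a") && linkStart >= 0) {
                    out.linkRanges.add(new int[] {linkStart, out.text.length()});
                    out.linkTargets.add(href);
                    linkStart = -1;
                }
            }
            reader.close();
        } catch (Exception e) {
            throw new IllegalStateException("input is not well-formed XML", e);
        }
        return out;
    }

    // Collapses whitespace runs and drops indentation at the start of a line.
    private void appendText(String chunk) {
        String collapsed = chunk.replaceAll("\\s+", " ");
        boolean atLineStart = text.length() == 0 || text.charAt(text.length() - 1) == '\n';
        if (atLineStart && collapsed.startsWith(" ")) {
            collapsed = collapsed.substring(1);
        }
        text.append(collapsed);
    }
}
```

On Android, each recorded range could then be turned into a URLSpan on a SpannableString so that "continues here" stays clickable; that step is left out because it depends on Android classes.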

Best way to dynamically display news articles (text, images)? Other than WebView?

I'm writing an Android app that reads RSS feeds, fetches HTML articles, processes the article's HTML to only store the important stuff (story body, including paragraphs and images/image captions, etc), and display it to the user.
I've done everything except for the final step.
The articles will obviously have varying text, varying image positions, etc. and I want to be able to preserve the order of those elements (as they were when fetched).
What is the best way to implement this? I don't really want to use a WebView...
Thanks in advance.
EDIT
Please see comments of accepted answer for my solution.

The best way I can see to do something like this would be to pick out each of the HTML tags and handle each appropriately. Assuming you're not interested in the head element and metadata, you could do something like the pseudocode below for the following HTML page:
<html>
...
<head>
...
</head>
<body>
<h1> some text probably your title </h1>
<p1> first paragraph </p1>
<p2> second paragraph </p2>
<img src='/some_url' title='some_title'>
</body>
</html>
Note that how the HTML page is actually structured depends on the website or RSS feed, so modifications will probably be needed for many sites. Nonetheless, you'll want to do something like the following. When I say "find", I mean a substring search (in Java if on the device, or anything you wish off the device).
find("<body>"): everything before it can be thrown away
find("<img" or "<p1" or "<h1" or "<div"): handle each accordingly
(more than likely this will change depending on the source of the page)
For example, when "<p1" is found:
find(">"), which marks the end of the tag's attributes, then pull everything up to the closing tag "</p1>"
There you've got your first paragraph.
For an image tag:
find("<img")
then find("title=") or find("src=")
The substrings after these are the image title and the image source file, respectively. Note that these values will be wrapped in either ' or ".
This isn't a complete solution, but hey, I have seen what you've tried, so it's a starting point.
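The pseudocode above could be turned into Java roughly as follows. This is only a sketch under the same assumptions as the sample page: the class and method names are made up, attribute values are quoted with ' or ", and tags are not nested inside each other.

```java
// Hypothetical helper implementing the substring searches described above.
class TagScanner {
    // Returns the text between <tagName ...> and </tagName>, or null if absent.
    static String elementText(String html, String tagName) {
        int open = html.indexOf("<" + tagName);
        if (open < 0) {
            return null;
        }
        int contentStart = html.indexOf('>', open); // end of the tag's attributes
        if (contentStart < 0) {
            return null;
        }
        int close = html.indexOf("</" + tagName + ">", contentStart);
        if (close < 0) {
            return null;
        }
        return html.substring(contentStart + 1, close).trim();
    }

    // Returns the value of the named attribute in the first <img> tag, or null.
    static String imgAttribute(String html, String name) {
        int tagStart = html.indexOf("<img");
        if (tagStart < 0) {
            return null;
        }
        int tagEnd = html.indexOf('>', tagStart);
        if (tagEnd < 0) {
            return null;
        }
        String tag = html.substring(tagStart, tagEnd);
        int attr = tag.indexOf(name + "=");
        if (attr < 0) {
            return null;
        }
        int quotePos = attr + name.length() + 1;
        if (quotePos >= tag.length()) {
            return null;
        }
        char quote = tag.charAt(quotePos);          // either ' or "
        if (quote != '\'' && quote != '"') {
            return null;
        }
        int close = tag.indexOf(quote, quotePos + 1);
        return close < 0 ? null : tag.substring(quotePos + 1, close);
    }

    public static void main(String[] args) {
        String html = "<html><head></head><body>"
                + "<h1> some text probably your title </h1>"
                + "<p1> first paragraph </p1><p2> second paragraph </p2>"
                + "<img src='/some_url' title='some_title'></body></html>";
        System.out.println(elementText(html, "h1"));
        System.out.println(elementText(html, "p1"));
        System.out.println(imgAttribute(html, "src"));
        System.out.println(imgAttribute(html, "title"));
    }
}
```

A plain prefix search such as "<p" would also match "<pre", and "src=" would also match an attribute like "data-src=", so real pages need stricter matching or a proper HTML parser.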

RSS feed parse from html file

I have the URL of an RSS feed (click here). In the content at that URL you can see a title and a description. Now I need to parse it in Android. I have tried searching, but all the help I find is about the XML format. Here I want something like an "HTML parse", so that I can parse each particular news description. Is there any way to do this kind of parsing? If so, please help me.
One more thing: while searching I found that this link may be useful to me, and it points me toward "Apache Feedparser". Is this the right way?

My advice is to use JSoup to parse the HTML.
It is very simple to use and well documented, so you should not have too much trouble parsing your page.
EDIT, to point you in the right direction for parsing your page:
You should take a look at this documentation page.
You should be able to parse the title and article content with something like this:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
Document doc = Jsoup.connect("http://.....").get();
String title = doc.select("h3").first().text(); // text of the first <h3> tag
String articleContent = doc.select("div.articleLeft p").text(); // combined text of the <p> elements nested in <div class="articleLeft">

Extracting data using JSoup

I am trying to extract product name information from Google Shopping (http://www.google.co.uk/m/products?q=5010459007289, the mobile site).
The product name always appears inside a span with class "owb63p", for example:
"<span class="owb63p">Highland Spring Sports Bottle 750 Ml</span>"
I am new to JSoup. I can connect to the URL and get the whole document, but I need help setting it up so that I get only the piece of information I need.

In JSoup it will be like:
Document doc = Jsoup.connect("http://www.google.co.uk/m/products?q=5010459007289").get();
Element title = doc.select("span.owb63p").first();
System.out.println(title.text());

I don't like JSoup that much, but with the Jericho HTML Parser it would look like:
Source source=new Source(new URL(sourceUrlString));
String content=source.getFirstElementByClass( "owb63p" ).getContent().toString();

It looks like the JSoup examples have what you are looking for.

You could try
doc.select("span").get(0).text();
or you can simply iterate over multiple span tags...
