I'm writing an Android app that reads RSS feeds, fetches HTML articles, processes the article's HTML to only store the important stuff (story body, including paragraphs and images/image captions, etc), and display it to the user.
I've done everything except for the final step.
The articles will obviously have varying text, varying image positions, etc. and I want to be able to preserve the order of those elements (as they were when fetched).
What is the best way to implement this? I don't really want to use a WebView...
Thanks in advance.
EDIT
Please see comments of accepted answer for my solution.
The best way I can see to do something like this would be to pick out each of the HTML tags and handle each one appropriately. Assuming you're not interested in the head element and metadata, you could do something like the pseudocode below for the following HTML page:
<html>
...
<head>
...
</head>
<body>
<h1> some text probably your title </h1>
<p1> first paragraph </p1>
<p2> second paragraph </p2>
<img src='/some_url' title='some_title'>
</body>
</html>
Now for what you need to do. Note that how the HTML page is actually set up will depend on the webpage/RSS feed, so modifications will probably need to be made for many sites. Nonetheless, you'll want to do something like this. Note: when I say "find", I mean somehow search for a substring (in Java if on the device, in anything you wish off the device).
find("<body>") everything before can be thrown away
find ("<img" or "<p1" or "<h1" or "<div") handle accordingly
(more than likely this will change depending on the source of the page)
but say "<p1" was found:
find(">"), which marks the end of the tag's attributes, then pull everything up to the closing tag "</p1>"
there you've got your first paragraph
for image tag
ie. find("<img")
then find("title=") or find("src=")
the substrings after these will be the image title and the source file for the image respectively; note that these values will be wrapped in either ' or "
This isn't a complete solution, but hey, I haven't seen what you've tried, so it's a starting point.
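The scanning steps above can be sketched in plain Java roughly as follows. This is only a minimal illustration, not a robust parser: it assumes standard <p> tags rather than the <p1>/<p2> placeholders, it assumes attribute values are quoted and contain no ">" character, and the class and method names (ArticleScanner, scan, tagName, attribute) are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

class ArticleScanner {

    // Returns the interesting elements in document order,
    // e.g. "h1: ...", "p: ...", "img: src=..., title=...".
    static List<String> scan(String html) {
        List<String> parts = new ArrayList<>();
        int bodyStart = html.indexOf("<body");
        int pos = bodyStart < 0 ? 0 : bodyStart; // everything before <body> is ignored
        while (true) {
            int open = html.indexOf('<', pos);
            if (open < 0) break;
            int close = html.indexOf('>', open);
            if (close < 0) break;
            String tag = html.substring(open + 1, close).trim(); // e.g. "img src='/x'"
            String name = tagName(tag);
            if (name.equals("img")) {
                parts.add("img: src=" + attribute(tag, "src")
                        + ", title=" + attribute(tag, "title"));
                pos = close + 1;
            } else if (name.equals("h1") || name.equals("p")) {
                String endTag = "</" + name + ">";
                int end = html.indexOf(endTag, close);
                if (end < 0) break; // unterminated element: stop scanning
                parts.add(name + ": " + html.substring(close + 1, end).trim());
                pos = end + endTag.length();
            } else {
                pos = close + 1; // closing tags and anything else are skipped
            }
        }
        return parts;
    }

    // Leading letters/digits of the tag content, lower-cased ("" for closing tags).
    static String tagName(String tag) {
        int i = 0;
        while (i < tag.length() && Character.isLetterOrDigit(tag.charAt(i))) i++;
        return tag.substring(0, i).toLowerCase();
    }

    // Value of attr inside the tag, assuming it is wrapped in ' or "; "" if absent.
    static String attribute(String tag, String attr) {
        int at = tag.indexOf(attr + "=");
        if (at < 0) return "";
        int valueStart = at + attr.length() + 1;
        if (valueStart >= tag.length()) return "";
        char quote = tag.charAt(valueStart);
        if (quote != '\'' && quote != '"') return "";
        int valueEnd = tag.indexOf(quote, valueStart + 1);
        return valueEnd < 0 ? "" : tag.substring(valueStart + 1, valueEnd);
    }

    public static void main(String[] args) {
        String html = "<html><head></head><body><h1> Title </h1><p> first </p>"
                + "<img src='/some_url' title='some_title'><p> second </p></body></html>";
        for (String part : scan(html)) {
            System.out.println(part);
        }
    }
}
```

Because the elements are appended to a list as they are encountered, the order of paragraphs and images in the original page is preserved, which is what the question asks for.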
Related
I'm using Jsoup and need to know the text position (character offset) of an Element or Node in the original input. Example: I have the HTML <p><span>1</span></p> and I need to know that the position of <p> is 0, <span> is 3, </span> is 10... How can I do that?
Currently you can't do this in Jsoup, for it does not keep track of the positions of tags in the original input. There was some discussion going on about this earlier (JSOUP HTML Parser)
The solution is to use another parser that explicitly supports this feature. The other post suggested Jericho.
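If pulling in another parser is not an option and the markup is simple, one workaround is to compute the offsets yourself on the raw string. The sketch below is only an assumption-laden illustration: it uses java.util.regex rather than Jsoup or Jericho, it does not understand comments, scripts, or ">" inside attribute values, and the class name TagOffsets is invented for the example. Offsets are zero-based character indices.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagOffsets {

    // Matches an opening or closing tag such as <p>, <span class="x"> or </p>.
    private static final Pattern TAG = Pattern.compile("</?[A-Za-z][^>]*>");

    // Maps the zero-based start offset of every tag to the tag text, in input order.
    static Map<Integer, String> find(String html) {
        Map<Integer, String> offsets = new LinkedHashMap<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            offsets.put(m.start(), m.group());
        }
        return offsets;
    }

    public static void main(String[] args) {
        // For this input the tags start at offsets 0, 3, 10 and 17.
        for (Map.Entry<Integer, String> e : find("<p><span>1</span></p>").entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```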
I am currently trying to extract selected headlines from the HTML content in my WebView. I am looking at a wide variety of options, such as JSON parsing, and any hack will do. I was wondering if anyone has had experience with this or has a rough idea of how to go about it?
Here's my example:
This is my html file content:
<div><h1><span class = "headline"> Some depressing title </span> <span class = "source" > ABCD </span> </h1> <br/> <span class = "body"> crappy body content which I do not need </span></div>
I just want to retrieve "headline" and "source" from this html in my webview, nothing else(not the body ). How do I go about defining a parameter to retrieve these? Any clues on how to do it?
Thanks!
Step 1: get the HTML source from your WebView - see this question. You basically create a JS interface that extracts your HTML source to a Java String.
Step 2: Use an HTML Parser (for example JSOUP) to parse the JAVA String into a format that you can handle easily.
Step 3: Use the parser to extract your relevant information. Here, you could use getElementsByTag("span") to get all your spans, then filter by class; or you could directly use getElementsByClass("headline") and getElementsByClass("source").
In general, you can retrieve the HTML source and parse the DOM in all cases.
Edit: if you don't want to use a parser, you can extract your information by searching the HTML source string (finding the correct classes, then finding the indexes of the '<' and '>' characters to pull out the information). This way is harder, less efficient, and less flexible, but it can be done.
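For the parser-free route described in the edit, a rough sketch in plain Java might look like this. It is only meant to show the idea: the class and method names (SpanExtractor, textOfSpan) are hypothetical, it assumes double-quoted class attributes holding a single class name, and it assumes the wanted spans contain no nested spans.

```java
class SpanExtractor {

    // Returns the trimmed text of the first <span> whose class equals cssClass,
    // or null if there is no such span.
    static String textOfSpan(String html, String cssClass) {
        int from = 0;
        while (true) {
            int open = html.indexOf("<span", from);
            if (open < 0) return null;
            int tagEnd = html.indexOf('>', open);
            if (tagEnd < 0) return null;
            // Drop spaces so that both class="x" and class = "x" are recognised.
            String tag = html.substring(open, tagEnd).replace(" ", "");
            if (tag.contains("class=\"" + cssClass + "\"")) {
                int close = html.indexOf("</span>", tagEnd);
                return close < 0 ? null : html.substring(tagEnd + 1, close).trim();
            }
            from = tagEnd + 1; // not the span we want, keep searching
        }
    }

    public static void main(String[] args) {
        String html = "<div><h1><span class = \"headline\"> Some depressing title </span> "
                + "<span class = \"source\" > ABCD </span> </h1> <br/> "
                + "<span class = \"body\"> crappy body content which I do not need </span></div>";
        System.out.println(textOfSpan(html, "headline"));
        System.out.println(textOfSpan(html, "source"));
    }
}
```

As the answer says, this is more fragile than a real parser: any change in quoting style or nesting on the page would require changing the search logic.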
I am building an app in android using XmlPullParser.
How can I get the content from an html formatted like this?
<div class="content">
"Some text is here."
<br>
"some more text "<a class="link" href="adress">continues here</a>
<br>
</div>
I want to parse all the content like this:
"Some text is here.
some more text continues here"
"continues here" part should also be hyperlinked.
ADDITION after some comments: HTML is first put into Yahoo YQL and YQL generates an XML. I use the generated XML file in the code. Above mentioned part that i want to parse is from the generated XML.
HTML and XML, although they share common syntax in some cases, are different. I think using an XmlPullParser for that purpose is not a good idea. I recommend using one of the several Java HTML parsers for that.
XmlPullParser is meant to deal with XML. It's really rare to encounter well-structured XHTML pages on the web. An XML parser expects well-formed data and is not supposed to be fault tolerant. HTML, on the other hand, is usually loosely organized.
So, no, it's not a good idea. You should prefer other libraries like tagsoup or geronimo.
PS: the best thing to do when you ask a Stack Overflow question is to try something by yourself first and, if blocked, then ask. Not the other way around.
I am new to Android. Could you help me parse this XML, which contains HTML content like the following?
<title>Jeff Mayweather: Floyd Sr showed a Sign of finally letting go of his Son, Passing Torch to Roger</title>
<summary type="html">
<p>By Shawn Craddick</p><p></p>
<p>Boxingsocialist had a chance to catch up with Floyd Mayweather's other uncle Jeff Mayweather. While Jeff stays busy at the gym he gave us some updates on his fighters as well as his thoughts on Brandon Rios, Gamboa, Floyd Mayweather Sr and Floyd Jr. meeting back together. Also he talked to us about a surprise boxing veteran he might be working with. Check out the interview below.</p>
<p><br></br> <span style="color: #ff6600;">BoxingSocialist</span>- What did you…</p> </summary>
I can parse the title field. For parsing the summary field, I check localname.equals("summary") in the RSS handler, but I cannot parse the content in the summary field. Can anyone help me with this?
You can use jsoup to parse the HTML content in Java.
tutorial Link example
Cheers
try this one
android.text.Html.fromHtml(text).toString();
Once I had such feed with html data inside tags. My solution was to ask data provider to wrap html with CDATA. So, if you have access to how xml is made, consider this option.
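To illustrate why the CDATA wrapping helps, here is a small SAX sketch that runs on a plain JVM. The sample feed strings and the class name SummaryHandler are made up for the example, and the handler is deliberately minimal. With CDATA, the parser hands the HTML to characters() as ordinary text, tags included; without it, the inner <p> is reported as a separate XML element and only the bare text reaches characters().

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SummaryHandler extends DefaultHandler {

    private final StringBuilder summary = new StringBuilder();
    private boolean inSummary;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if (qName.equals("summary")) inSummary = true;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("summary")) inSummary = false;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Collect every piece of character data seen inside <summary>.
        if (inSummary) summary.append(ch, start, length);
    }

    // Parses the XML string and returns the character data found inside <summary>.
    static String parse(String xml) {
        try {
            SummaryHandler handler = new SummaryHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.summary.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse feed", e);
        }
    }

    public static void main(String[] args) {
        String wrapped = "<entry><summary type=\"html\">"
                + "<![CDATA[<p>By Shawn Craddick</p>]]></summary></entry>";
        String unwrapped = "<entry><summary type=\"html\">"
                + "<p>By Shawn Craddick</p></summary></entry>";
        System.out.println(parse(wrapped));   // HTML markup survives as text
        System.out.println(parse(unwrapped)); // only the bare text is collected
    }
}
```

The string obtained from the CDATA version can then be handed to an HTML parser or to Html.fromHtml() as suggested in the other answers.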
Android's TextView class can display formatted text via Html.fromHtml(), as explained for example here: HTML tags in string for TextView
The TextView class can only deal with a small subset of HTML, but I do not know which tags and attributes are supported and which are not. The summary given here: http://commonsware.com/blog/Android/2010/05/26/html-tags-supported-by-textview.html does not seem to be correct. E.g. <div align="..."> does NOT work for me using Android 2.2
Looked it up for everyone searching for it.
Date: July 2017
Source: https://android.googlesource.com/platform/frameworks/base/+/master/core/java/android/text/Html.java
Html.fromHtml supports:
p
ul
li
div
span
strong
b
em
cite
dfn
i
big
small
font
blockquote
tt
a
u
del
s
strike
sup
sub
h1
h2
h3
h4
h5
h6
img
br
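One way to put the list above to use is to pre-check a piece of HTML for tags that are not on it before handing the text to a TextView. The following is only a sketch: the tag set is copied from the July 2017 list above and may not match your Android version, and the class and method names (SupportedTags, unsupportedIn) are invented for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SupportedTags {

    // Tags from the list above (Html.java as of July 2017); check your own version.
    static final Set<String> SUPPORTED = new HashSet<>(Arrays.asList(
            "p", "ul", "li", "div", "span", "strong", "b", "em", "cite", "dfn",
            "i", "big", "small", "font", "blockquote", "tt", "a", "u", "del",
            "s", "strike", "sup", "sub", "h1", "h2", "h3", "h4", "h5", "h6",
            "img", "br"));

    // Captures the name of every opening tag, e.g. "table" in <table border="1">.
    private static final Pattern OPEN_TAG = Pattern.compile("<([A-Za-z][A-Za-z0-9]*)");

    // Returns the sorted names of tags used in html that are not in SUPPORTED.
    static Set<String> unsupportedIn(String html) {
        Set<String> result = new TreeSet<>();
        Matcher m = OPEN_TAG.matcher(html);
        while (m.find()) {
            String name = m.group(1).toLowerCase();
            if (!SUPPORTED.contains(name)) result.add(name);
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<p>Hi <b>there</b></p><table><tr><td>x</td></tr></table>";
        System.out.println(unsupportedIn(html)); // prints [table, td, tr]
    }
}
```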
I noticed that this article:
https://web.archive.org/web/20171118200650/http://daniel-codes.blogspot.com/2011/04/html-in-textviews.html
lists <div> as being supported by Html.fromHtml(), but it doesn't show support for the "align" attribute.
(Other supported attributes are shown for tags on that page.)
The author says he constructed the reference by looking at code in the git repositories for Android.
Edit:
Over time, it appears the list of supported tags has changed. See this later post for example: http://www.grokkingandroid.com/android-quick-tip-formatting-text-with-html-fromhtml/ .
Based on both those articles, I'd suggest that examining the source code seems to be the most reliable way to get the recent information.
The best approach is to use a CDATA section for the string in the strings.xml file so that the HTML content is displayed as intended in the TextView. The code snippet below should give you a fair idea.
//in strings.xml file
<string name="welcome_text"><![CDATA[<b>Welcome,</b> to the forthetyroprogrammers blog Logged in as:]]> %1$s.</string>
Java code
String welcomeStr = String.format(getString(R.string.welcome_text), username);
tvWelcomeUser.setText(Html.fromHtml(welcomeStr));
CData section in string text keeps the html tag data intact even after formatting text using String.format method. So, Html.fromHtml(str) works fine and you’ll see the bold text in Welcome message.
Output:
Welcome, to the forthetyroprogrammers blog Logged in as: username.
Since it is constantly being updated, the best way to track which HTML tags are supported in Android is to check the source code of Html.java