How to get the text position of an Element or Node in jsoup - Android

I'm using jsoup and need to know the position of an Element or Node within the source text. For example, given the HTML <p><span>1</span></p>, I need to know that <p> starts at offset 0, <span> at 3, </span> at 10, and so on. How can I do that?

Currently you can't do this in jsoup, because it does not keep track of the positions of tags in the original input. This was discussed earlier (JSOUP HTML Parser).
The solution is to use another parser that explicitly supports this feature. The other post suggested Jericho.
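If pulling in another parser is not an option, a rough workaround is to scan the raw string yourself. The sketch below is not jsoup or Jericho API; it uses a naive regular expression from the Java standard library and assumes simple markup without comments, scripts, or ">" inside attribute values. The class and method names (TagOffsets, tagOffsets) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagOffsets {
    // Naive pattern for opening and closing tags; this is NOT a real HTML parser.
    private static final Pattern TAG = Pattern.compile("</?[A-Za-z][^>]*>");

    /** Returns one "tag@offset" entry per tag, using zero-based offsets into the raw input. */
    static List<String> tagOffsets(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            result.add(m.group() + "@" + m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        for (String entry : tagOffsets("<p><span>1</span></p>")) {
            System.out.println(entry);
        }
    }
}
```

For the input from the question this gives zero-based offsets 0, 3, 10 and 17 for <p>, <span>, </span> and </p>.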

Related

Parsing content which contains html tags using XMLPullParser

I am building an Android app using XmlPullParser.
How can I get the content from HTML formatted like this?
<div class="content">
"Some text is here."
<br>
"some more text "<a class="link" href="adress">continues here</a>
<br>
</div>
I want to parse all the content like this:
"Some text is here.
some more text continues here"
"continues here" part should also be hyperlinked.
ADDITION after some comments: the HTML is first passed to Yahoo YQL, which generates XML. I use the generated XML file in the code. The part I want to parse, shown above, is from the generated XML.
HTML and XML are different, although they share common syntax in some cases. I think using an XmlPullParser for that purpose is not a good idea. I recommend using one of the several Java HTML parsers instead.
XmlPullParser is meant to deal with XML. It's really rare to encounter well-structured XHTML pages on the web. An XML parser expects well-formed data and is not supposed to be fault tolerant. HTML, on the other hand, is usually loosely organized.
So, no, it's not a good idea. You should prefer other libraries like tagsoup or geronimo.
PS: The best approach when asking a Stack Overflow question is to try something yourself first and ask only if you get stuck, not the other way around.

JSoup Screen Scraping With Many Divs

I have a page I want to scrape with Android, and the contents I want are located like this:
body
div#wrapper
div#mainContentArea
div#scheduleModule
div#scheduleDayView
div#scheduleDayViewScroll
div#scheduleItemContainer
div#eventContainer
div#SSPP_o090570*A*
div.eventInfo
p.eventText
span.eventInfoDefault
How can I access the span using jsoup?
If you don't want to be taken out in the streets and whipped for your transgressions, you will split up that block of text there.
Anyway, you want to find the span whose class is eventInfoDefault? Well:
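// connect() only builds the request; get() fetches and parses the page (it can throw IOException).
Document site = Jsoup.connect("http://www.example.com").get();
// first() returns null when nothing matches the selector.
Element span = site.select("span.eventInfoDefault").first();
// Proceed to do whatever you want with the span below.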
Source: http://jsoup.org/cookbook/extracting-data/selector-syntax

Best way to dynamically display news articles (text, images)? Other than WebView?

I'm writing an Android app that reads RSS feeds, fetches HTML articles, processes the article's HTML to only store the important stuff (story body, including paragraphs and images/image captions, etc), and display it to the user.
I've done everything except for the final step.
The articles will obviously have varying text, varying image positions, etc. and I want to be able to preserve the order of those elements (as they were when fetched).
What is the best way to implement this? I don't really want to use a WebView...
Thanks in advance.
EDIT
Please see comments of accepted answer for my solution.
The best way I can see to do something like this would be to step through each of the HTML tags and handle each appropriately. Assuming you're not interested in the head element and metadata, you could do something like the pseudocode below for the following HTML page:
<html>
...
<head>
...
</head>
<body>
<h1> some text probably your title </h1>
<p1> first paragraph </p1>
<p2> second paragraph </p2>
<img src='/some_url' title='some_title'>
</body>
</html>
Now for what you need to do. Note that how the HTML page is actually set up depends on the web page or RSS feed, so modifications will probably be needed for many sites. Nonetheless, you'll want to do something like the following. (When I say "find", I mean search for a substring somehow: in Java if on the device, or with anything you wish off the device.)
find("<body>"): everything before it can be thrown away
find("<img" or "<p1" or "<h1" or "<div") and handle each accordingly
(more than likely this list will change depending on the source of the page)
For example, when "<p1" is found:
find(">"), which marks the end of the opening tag and its attributes, then pull everything up to the closing tag "</p1>"
There you've got your first paragraph.
For an image tag:
find("<img")
then find("title=") or find("src=")
The substrings after these are the image title and the image source file respectively; note that these values will be wrapped in either ' or ".
This isn't a complete solution, but hey, I have seen what you've tried, so it's a starting point.
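The substring search described above could be sketched in plain Java roughly as follows. This is only an illustration under assumptions: the class and method names (NaiveArticleScanner, scan, attribute) are invented, it looks for standard <p> tags rather than the <p1>/<p2> placeholders, and it will break on nested or malformed markup, so a real HTML parser is the safer choice for anything beyond a prototype.

```java
import java.util.ArrayList;
import java.util.List;

class NaiveArticleScanner {

    /** Returns the article parts in document order, e.g. "h1: ...", "p: ...", "img: src | title". */
    static List<String> scan(String html) {
        List<String> parts = new ArrayList<>();
        // Everything before <body> can be thrown away.
        int bodyStart = html.indexOf("<body>");
        int pos = bodyStart < 0 ? 0 : bodyStart + "<body>".length();
        while (true) {
            int open = html.indexOf('<', pos);
            if (open < 0) break;
            int close = html.indexOf('>', open); // end of the opening tag and its attributes
            if (close < 0) break;
            String tag = html.substring(open + 1, close).trim();
            pos = close + 1;
            if (tag.startsWith("/")) continue; // skip closing tags such as </body>
            String name = tag.split("\\s+")[0];
            if (name.equals("h1") || name.equals("p")) {
                // Pull everything up to the matching closing tag.
                int end = html.indexOf("</" + name + ">", pos);
                if (end < 0) break;
                parts.add(name + ": " + html.substring(pos, end).trim());
                pos = end + name.length() + 3; // length of "</" + name + ">"
            } else if (name.equals("img")) {
                parts.add("img: " + attribute(tag, "src") + " | " + attribute(tag, "title"));
            }
        }
        return parts;
    }

    /** Reads a quoted attribute value (wrapped in ' or ") from the text of an opening tag. */
    static String attribute(String tag, String attr) {
        int at = tag.indexOf(attr + "=");
        if (at < 0) return "";
        int quoteAt = at + attr.length() + 1;
        if (quoteAt >= tag.length()) return "";
        char quote = tag.charAt(quoteAt);
        if (quote != '\'' && quote != '"') return "";
        int end = tag.indexOf(quote, quoteAt + 1);
        return end < 0 ? "" : tag.substring(quoteAt + 1, end);
    }

    public static void main(String[] args) {
        String html = "<html><head></head><body><h1> Title </h1><p> first </p>"
                + "<img src='/some_url' title='some_title'><p>second</p></body></html>";
        for (String part : scan(html)) {
            System.out.println(part);
        }
    }
}
```

Because the parts come back in document order, each entry can be mapped to a TextView or ImageView in the same order to preserve the article layout.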

How to parse the XML which contains HTML contents

I am new to Android. Could you help me parse this XML, which contains HTML content like the following?
<title>Jeff Mayweather: Floyd Sr showed a Sign of finally letting go of his Son, Passing Torch to Roger</title>
<summary type="html">
<p>By Shawn Craddick</p><p></p>
<p>Boxingsocialist had a chance to catch up with Floyd Mayweather's other uncle Jeff Mayweather. While Jeff stays busy at the gym he gave us some updates on his fighters as well as his thoughts on Brandon Rios, Gamboa, Floyd Mayweather Sr and Floyd Jr. meeting back together. Also he talked to us about a surprise boxing veteran he might be working with. Check out the interview below.</p>
<p><br></br> <span style="color: #ff6600;">BoxingSocialist</span>- What did you…</p> </summary>
I can parse the title field. To parse the summary field, I check localname.equals("summary") in the RSS handler, but I cannot parse the content of the summary field. Can anyone help me with this?
You can use jsoup to parse the HTML content in Java.
tutorial Link example
Cheers
Try this one:
android.text.Html.fromHtml(text).toString();
I once had a feed like this, with HTML data inside the tags. My solution was to ask the data provider to wrap the HTML in CDATA. So, if you have any say in how the XML is generated, consider this option.
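To illustrate the CDATA suggestion: once the HTML is wrapped in a CDATA section, a SAX handler receives it as ordinary character data. Below is a minimal sketch using the standard javax.xml SAX parser, assuming the feed has a summary element as in the question. The class name SummaryHandler, the parseSummary helper, and the sample feed are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SummaryHandler extends DefaultHandler {
    private final StringBuilder summary = new StringBuilder();
    private boolean inSummary;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        // The default parser is not namespace-aware, so the element name arrives in qName.
        if ("summary".equals(qName)) inSummary = true;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per element, so accumulate the pieces.
        if (inSummary) summary.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("summary".equals(qName)) inSummary = false;
    }

    /** Parses the feed and returns the raw text of the summary element(s). */
    static String parseSummary(String xml) {
        try {
            SummaryHandler handler = new SummaryHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.summary.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse feed", e);
        }
    }

    public static void main(String[] args) {
        String feed = "<entry><title>Some title</title>"
                + "<summary type=\"html\"><![CDATA[<p>By Shawn Craddick</p>]]></summary></entry>";
        System.out.println(parseSummary(feed));
    }
}
```

The returned string still contains the HTML markup, so it can then be handed to jsoup or to android.text.Html.fromHtml as suggested in the other answers.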

Android JSoup Example

I was just wondering whether anyone has a sample Eclipse project with a working implementation of jsoup? I'm trying to use it to pull information from websites and have searched all over Google trying to get it to work, but can't. If anyone could help, I'd really appreciate it.
jsoup is really easy to use; look at these examples from the jsoup cookbook: here
First, you have to connect to the webpage you want to parse using:
Document doc = Jsoup.connect("http://example.com/").get();
Then, you can select page elements using the JSoup selector syntax.
For instance, say you want to select all the content of the div tags with the id attribute set to test, you just have to use:
Elements divs = doc.select("div#test");
to retrieve the divs, then you can iterate on them using:
for (Element div : divs) {
    System.out.println(div.text());
}
