JSoup Screen Scraping With Many Divs - android

I have a page I want to scrape with Android, and the contents I want are located like this:
body
div#wrapper
div#mainContentArea
div#scheduleModule
div#scheduleDayView
div#scheduleDayViewScroll
div#scheduleItemContainer
div#eventContainer
div#SSPP_o090570*A*
div.eventInfo
p.eventText
span.eventInfoDefault
How can I access the span using jsoup?

If you don't want to be taken out in the streets and whipped for your transgressions, you will split up that block of text there.
Anyway, you want to find the span whose class is eventInfoDefault? Well:
Document site = Jsoup.connect("http://www.example.com").get();
Element span = site.select("span.eventInfoDefault").first();
//Proceed to do whatever you want with that below.
Source: http://jsoup.org/cookbook/extracting-data/selector-syntax

Related

How to get text point of Element or Node in jsoup

I'm using Jsoup and need to know the text point (character offset) of an Element or Node. Example: given the HTML <p><span>1</span></p>, I need to know that the text point of <p> is 0, <span> is 3, </span> is 10, and so on. How can I do that?
Currently you can't do this in Jsoup, for it does not keep track of the positions of tags in the original input. There was some discussion going on about this earlier (JSOUP HTML Parser)
The solution is to use another parser that explicitly supports this feature. The other post suggested Jericho.
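For very simple, well-formed snippets, the offsets can also be approximated without any parser by scanning the raw input. The following is only a rough sketch of that idea, not something Jsoup or Jericho provides: the class name TagOffsets and its find method are made up for illustration, and the regular expression is deliberately naive (it ignores comments, CDATA and any '>' inside attribute values).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: reports where each tag starts in the raw HTML string.
final class TagOffsets {
    // Naive pattern for an opening or closing tag.
    private static final Pattern TAG = Pattern.compile("</?[A-Za-z][^>]*>");

    /** Returns "tag@offset" for every tag found; offsets are zero-based char positions. */
    static List<String> find(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            out.add(m.group() + "@" + m.start());
        }
        return out;
    }

    public static void main(String[] args) {
        for (String entry : find("<p><span>1</span></p>")) {
            System.out.println(entry);
        }
    }
}
```

For real-world pages a parser that tracks source positions is still the safer choice, since a regex scan like this breaks easily on malformed markup.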

Best way to dynamically display news articles (text, images)? Other than WebView?

I'm writing an Android app that reads RSS feeds, fetches HTML articles, processes the article's HTML to only store the important stuff (story body, including paragraphs and images/image captions, etc), and display it to the user.
I've done everything except for the final step.
The articles will obviously have varying text, varying image positions, etc. and I want to be able to preserve the order of those elements (as they were when fetched).
What is the best way to implement this? I don't really want to use a WebView...
Thanks in advance.
EDIT
Please see comments of accepted answer for my solution.
The best way I could see to do something like this would be to pick out each of the HTML tags and handle each appropriately. Assuming you're not interested in the head element and metadata, you could do something like the pseudocode below for the following HTML page:
<html>
...
<head>
...
</head>
<body>
<h1> some text probably your title </h1>
<p1> first paragraph </p1>
<p2> second paragraph </p2>
<img src='/some_url' title='some_title'>
</body>
</html>
Now for what you need to do. Note that how the HTML page is actually set up will depend on the webpage or RSS feed, so modifications will probably need to be made for many sites. Nonetheless, you'll want to do something like the following. When I say "find" I mean somehow search for a substring (in Java if on the device, anything you wish off the device).
find("<body>") everything before can be thrown away
find("<img" or "<p1" or "<h1" or "<div") handle accordingly
(more than likely this will change depending on the source of the page)
but for, say, "<p1" found:
find(">"), which represents the end of the tag's attributes, then pull everything up to the closing tag "</p1>"
there you've got your first paragraph
for an image tag
i.e. find("<img")
then find("title=") or find("src=")
the substrings after these will be the image title and the source file for the image respectively; note that these values will be wrapped in either ' or "
This isn't a complete solution, but hey, I have seen what you've tried, so it's a starting point
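The find steps above could be sketched in Java with plain String.indexOf calls, roughly as below. This is only an illustration under the same assumptions as the pseudocode: the class NaiveHtmlScan and its method names are made up, the tag names (p1, h1) come from the sample page, and the matching is prefix-based, so it will misfire on tags that share a prefix or on attributes such as data-title.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that mirrors the find(...) pseudocode using substring search.
final class NaiveHtmlScan {

    /** Returns the trimmed text between each <tag ...> and </tag> pair in html. */
    static List<String> textBetween(String html, String tag) {
        List<String> out = new ArrayList<>();
        String open = "<" + tag;
        String close = "</" + tag + ">";
        int from = 0;
        while (true) {
            int start = html.indexOf(open, from);
            if (start < 0) break;
            int gt = html.indexOf('>', start);      // end of the opening tag
            if (gt < 0) break;
            int end = html.indexOf(close, gt + 1);  // start of the closing tag
            if (end < 0) break;
            out.add(html.substring(gt + 1, end).trim());
            from = end + close.length();
        }
        return out;
    }

    /** Returns the quoted value of attribute name in the first <img ...> tag, or null. */
    static String firstImgAttr(String html, String name) {
        int start = html.indexOf("<img");
        if (start < 0) return null;
        int tagEnd = html.indexOf('>', start);
        if (tagEnd < 0) return null;
        String tag = html.substring(start, tagEnd);
        int at = tag.indexOf(name + "=");
        int q = at + name.length() + 1;             // index of the opening quote
        if (at < 0 || q >= tag.length()) return null;
        char quote = tag.charAt(q);
        if (quote != '\'' && quote != '"') return null;
        int endQuote = tag.indexOf(quote, q + 1);
        return endQuote < 0 ? null : tag.substring(q + 1, endQuote);
    }

    public static void main(String[] args) {
        String html = "<html><head></head><body><h1> some title </h1>"
                + "<p1> first paragraph </p1><img src='/some_url' title='some_title'></body></html>";
        String body = html.substring(html.indexOf("<body>")); // throw away everything before <body>
        System.out.println(textBetween(body, "p1"));
        System.out.println(firstImgAttr(body, "src"));
    }
}
```

To preserve the order of paragraphs and images, the same scan would need to walk the body once and emit items as they are found, rather than collecting each tag type separately as this sketch does.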

Adding a Resource Bundle Link to load a new Activity

I have a paragraph of text with a URL at the end of it. I have the text and link in strings.xml. Is there any way to get it to load a new Activity from the strings.xml file? I'm assuming I'll have to break up the paragraph text and link, but thought I'd check.
strings.xml:
The quick brown fox can be found at: http://thequickbrownfox.com\n more text here
I need to change the hardcoded url "http://thequickbrownfox.com" to load a screen inside my app instead of a page on the web.
strings.xml is purely an abstraction mechanism used for string lookup to facilitate multi-language support etc.; you cannot use it to load activities or do anything else programmatically. It sounds like you are actually talking about parsing the URL out of a particular paragraph stored within strings.xml and then, depending on what that URL is, invoking a corresponding activity.
If this is the case then you can either parse out the url from the paragraph and respond accordingly.
OR
you can store your paragraph as one item in strings.xml and your url as another item and combine them programmatically in your code.
Either approach can be fine depending on what you are doing.
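The first approach (parsing the URL out of the paragraph) might look roughly like the sketch below. The class LinkSplitter, the method names, and the screen name "FoxActivity" are all made up for illustration; only the URL itself comes from the question. The regular expression simply takes everything from http:// or https:// up to the next whitespace.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the URL out of a paragraph and maps it to an in-app screen.
final class LinkSplitter {
    // Matches an http(s) URL up to the next whitespace character.
    private static final Pattern URL = Pattern.compile("https?://\\S+");

    /** Returns the first URL in text, or null if there is none. */
    static String firstUrl(String text) {
        Matcher m = URL.matcher(text);
        return m.find() ? m.group() : null;
    }

    /** Returns a made-up screen name for a known URL, or null for anything else. */
    static String screenFor(String url) {
        if ("http://thequickbrownfox.com".equals(url)) {
            return "FoxActivity"; // assumption: stands in for the Activity to launch
        }
        return null;
    }

    public static void main(String[] args) {
        String paragraph = "The quick brown fox can be found at: http://thequickbrownfox.com\n more text here";
        String url = firstUrl(paragraph);
        System.out.println(url + " -> " + screenFor(url));
    }
}
```

In the app itself, the lookup result would be used to start the matching Activity through an Intent instead of returning a name.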

How to get a specific tag from a div class with this html page?

I am trying to retrieve the image URL from this HTML page.
The image is inside the editions box on the webpage. How would I go about getting it using the JSoup selector method?
Such as
Document doc = Jsoup.connect(url).get();
Element png = doc.select(//What would the tag be?);
I have an idea of how to set it up, just not how to retrieve the tag.
String imageUrl = doc.select("div.box-art").select("img").attr("abs:src");
From the docs it looks like doc.select(".box-art img") should do the trick as well. (It selects any img element that is a descendant of an element of class box-art.) Note that this can match multiple imgs: select returns an Elements collection, and calling attr on it reads from the first match that has the attribute.

Android JSoup Example

I was just wondering whether anyone has a sample Eclipse project with a working implementation of JSoup? I'm trying to use it to pull information from websites and have gone all over Google trying to get it to work, but can't. If anyone could help I'd really appreciate it.
JSoup is really easy to use; look at the examples in the JSoup cookbook.
First, you have to connect to the webpage you want to parse using:
Document doc = Jsoup.connect("http://example.com/").get();
Then, you can select page elements using the JSoup selector syntax.
For instance, say you want to select all the content of the div tags with the id attribute set to test, you just have to use:
Elements divs = doc.select("div#test");
to retrieve the divs, then you can iterate on them using:
for (Element div : divs) {
    System.out.println(div.text());
}
