I would like to provide the users of my Android app a preview of each article given to them by an RSS feed. Since the user enters the URL of the site with the RSS feed, I do not control the source of the RSS (I must assume the page is using the RSS 2.0 specification).
I know there is an <image> tag for the RSS channel, but how do I get a preview image for each article?
In contemplating this problem I have considered parsing through the website, finding an <img> tag, and using that... but the possibility of selecting an ad or an irrelevant image is high.
Right now I am doing all my XML parsing on a server to remove the restriction of phone processing - so parsing for the image would be easy-ish... but not fun :)
Any ideas? (I'm thinking of something like how Facebook adds thumbnails to your posts - I've looked into Facebook's Open Graph, but it is not widely enough adopted to be of any use.)
Feed: http://www.modernizedmedia.com/blog/feed
There's no easy answer here. The RSS spec doesn't have a field for a thumbnail preview that I'm aware of.
The Media RSS spec does define a <media:thumbnail> element, but that's neither widely used nor really what you're looking for.
I'd use the <image> tag first, if it's present. If it's not, you can scan through the article and look for an <img> tag -- but you're right, it might not be the correct one. To compensate for that, you could filter out known ad sizes, then select the largest image on the page.
In all cases, you'll want to resize the image down to thumbnail size; the exact dimensions are application-dependent.
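A rough sketch of the "filter known ad sizes, then pick the largest" heuristic described above. The `Candidate` class and the method names are my own illustration, not a standard API; the blocklist uses common IAB banner dimensions as an assumption about what "known ad sizes" means:

```java
import java.util.*;

// Sketch: filter out images matching common ad banner dimensions (IAB sizes),
// then pick the largest remaining image by pixel area as the thumbnail source.
public class ThumbnailPicker {
    static final Set<String> AD_SIZES = new HashSet<>(Arrays.asList(
        "728x90", "468x60", "300x250", "336x280", "160x600", "120x600", "320x50"));

    public static class Candidate {
        public final String url;
        public final int width, height;
        public Candidate(String url, int width, int height) {
            this.url = url; this.width = width; this.height = height;
        }
    }

    // Returns the URL of the largest image (by area) whose dimensions do not
    // match a known ad banner size, or null if no image qualifies.
    public static String pickThumbnail(List<Candidate> images) {
        Candidate best = null;
        for (Candidate c : images) {
            if (AD_SIZES.contains(c.width + "x" + c.height)) continue; // likely an ad
            if (best == null || c.width * c.height > best.width * best.height) best = c;
        }
        return best == null ? null : best.url;
    }
}
```

The heuristic is far from perfect (ads come in non-standard sizes too), but combined with a minimum-size cutoff it filters out most banners and icons.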
Related
Good day. I want to create a share field like Facebook's. On Facebook, if you insert a link, it automatically shows a short preview of the link with a title and an image. I don't think Facebook is in touch with every website's database; rather, it must be parsing the URL somehow. Anyway, imagine I have an EditText where the user inputs some URL. How can I achieve this kind of technique? Googling did not turn up anything close to what I want.
Facebook downloads the URL and displays the page title in the header, a few lines of text, and a prominent image from the page if it finds one. You'd have to do something similar: download the URL, parse the HTML, and grab enough info to build your preview.
Most websites have og (Open Graph) tags in them. Check out Facebook's Open Graph documentation to understand more about these tags. You can fetch the data corresponding to these tags; that is what most social media sites and WhatsApp do to provide a preview.
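As a minimal sketch of reading an Open Graph tag out of raw HTML: in a real app you should use a proper HTML parser (e.g. Jsoup) rather than a regex, and this only handles the common `property`-before-`content` attribute order — both simplifying assumptions on my part:

```java
import java.util.regex.*;

// Sketch: extract the content of a <meta property="og:..." content="..."> tag.
// Only handles the property-then-content attribute order; a real implementation
// should use an HTML parser instead of a regular expression.
public class OgTags {
    public static String ogContent(String html, String property) {
        Pattern p = Pattern.compile(
            "<meta\\s+property=[\"']og:" + Pattern.quote(property) +
            "[\"']\\s+content=[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null; // null if the tag is absent
    }
}
```

With this, `ogContent(html, "image")` gives you the preview image URL and `ogContent(html, "title")` the title, whenever the page provides Open Graph metadata.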
Currently I am creating an Android application that extracts the main content and picture from a website. Right now I am using the Jsoup API to extract all <p> tags from the HTML, but that is not a good solution. Any suggestion or better solution that would enable me to extract the main content and picture from a website on Android?
I didn't find anything that works for me, so I published Goose for Android, here: https://github.com/milosmns/goose
Some description follows...
Document cleaning
When you pass a URL to Goose, the first thing it starts to do is clean
up the document to make it easier to parse. It will go through the
whole document and remove comments, common social network sharing
elements, convert em and other tags to plain text nodes, try to
convert divs used as text nodes to paragraphs, as well as do a general
document cleanup (spaces, new lines, quotes, encoding, etc).
Content / Images Extraction
When dealing with random article links you're bound to come across the
craziest of HTML files. Some sites even like to include 2 or more HTML
files per site. Goose uses a scoring system based on clustering of
English stop words and other factors that you can find in the code.
Goose also scores nodes in descending order: the further down the page a node is, the lower its score. The goal is to find the strongest grouping
of text nodes inside a parent container and assume that's the relevant
group of content, as long as it sits high enough on the page.
Image extraction is the part that takes the longest. Finding the
most important image on a page proved to be challenging and required
downloading all the images to inspect them manually with external
tools (not all images are considered; Goose checks MIME types,
dimensions, byte sizes, compression quality, etc.). Java's Image
functions were just too unreliable and inaccurate. On Android, Goose
uses the BitmapFactory class, which is well documented, tested,
fast, and accurate. Images are analyzed from the top node where Goose
finds the content, followed by a recursive run outwards looking for
good images - Goose also checks whether those images are ads, banners,
or author logos, and ignores them if so.
Output Formatting
Once Goose has the top node where we think the content is, Goose will
try to format the content of that node for the output. For example,
for NLP-type applications, Goose's output formatter will just suck all
the text and ignore everything else, and other (custom) extractors can
be built to offer a more Flipboardy-type experience.
Why do you think it's not a good solution to use Jsoup?
I've written many web scrapers for different webpages, and in my experience Jsoup is the way to go for that task. You should study the Jsoup selector syntax; it is very powerful, and with the right selectors you can extract most information from HTML documents very easily. Generally, extraction becomes harder when the document has no id or class attributes or other unique features.
Other HTML parsers that might be interesting for you are JTidy and TagSoup
You could try the textracto api; it automatically identifies the main content of HTML documents. It can also parse Open Graph metadata, so you would be able to extract a picture (og:image) as well.
I've not been successful in searching for this answer, but basically, how can I find the image file name from a website that blocks saving of the images? Normally, the url source will have the image filename and it is easily searchable.
However, some sites let you hover over the picture, which then zooms to a larger image. The source shows,
for example,
http://example.com/pictures.aspx?ImagePath=ABCDEFGHIJ1234567989KLMNOPQRST==.....
That is, a random long string of characters, with no .jpg or .png indicating the file name. When right-clicking to save, it shows the image is blocked.
How can I get the images using Android code?
Tks
If the website isn't allowing the images to be taken, then they probably shouldn't be taken. That being said, there are a couple of options if you are the owner of the website.
If they are in a database, then you can get their URL addresses. Also, right-clicking anywhere on the page and choosing "View source" should get you some sort of URL for the image.
This seems more like a web service implementation issue than anything Android-specific. Many websites have this feature where you are not meant to retrieve information from the URL.
However, there should be a corresponding supported web service interface for fetching the metadata of a particular resource. If that is not supported, then it's meant to be blocked.
The other, trial-and-error sort of way is to invoke a third-party API like oEmbed (or Embed.ly: http://embed.ly/) that returns the metadata of the resource whose URL you have. These services cover a wide range of websites and their resources.
You can search around in their site or find similar such services that would get you this information. Embed.ly is more of a personal preference due to its exhaustive list of supported sites.
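An oEmbed consumer request is just an HTTP GET against the provider's endpoint with the target page passed as a `url` parameter; the JSON response carries metadata fields such as `title` and `thumbnail_url`. A minimal sketch of building such a request — the endpoint below is a placeholder, so check the provider's documentation for the real one:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Sketch: build an oEmbed consumer request URL. The oEmbed protocol takes the
// target page as a url= query parameter and returns JSON metadata about it.
public class OEmbedRequest {
    public static String buildUrl(String endpoint, String pageUrl)
            throws UnsupportedEncodingException {
        // The page URL must be percent-encoded so it survives as a parameter.
        return endpoint + "?format=json&url=" + URLEncoder.encode(pageUrl, "UTF-8");
    }
}
```

You would then fetch this URL (off the UI thread on Android) and read the thumbnail and title out of the returned JSON.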
I am trying to retrieve just the text and the image from this wiki page.
http://en.wikipedia.org/wiki/Where_the_Red_Fern_Grows
I will have a news feed, and when an item is clicked the URL will be fetched. Instead of going to the webpage through the browser, I would like to get the text and images and feed them into a TextView and ImageViews.
You have a couple of options.
Host an intermediary site that parses the link and passes back the data you want.
Get all of the page, and parse on the device.
Obviously, parsing a huge page on the device will be far slower than parsing it on a web server and serving just what you need.
Of course, if you are really in need of just the text and image, there is some help by using the mobile version of Wikipedia:
http://mobile.wikipedia.org
OR
http://en.m.wikipedia.org
The "mobile" version splits pages up and contains no graphics, but the "m" version is probably more along the lines of what you are looking for.
Here is the formatted page for "Where The Red Fern Grows":
http://en.m.wikipedia.org/wiki/Where_the_Red_Fern_Grows
Well, I am learning Android RSS feed parsing, and I have a question. Suppose I am using the Goal.com RSS feed and displaying it on my Android phone.
Goal.com RSS Feed : http://www.goal.com/en-us/feeds/news?fmt=rss
But as you can see from the RSS feed, the items contain only the headlines of the articles with 2-3 lines of description. I was wondering, is there any way I can get the complete article after parsing the feed? Any pointers to guide me would be helpful. Thanks
Unfortunately, if the RSS feed doesn't contain the full article, there isn't an easy way to get it. You can look at the link tag for each item to find it on the web and then do some screen scraping, but that gets ugly fast.
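The first step of that approach — pulling the <link> out of each <item> so the full article page can be fetched and scraped — can be sketched with the JDK's built-in XML parser (a plain RSS 2.0 feed is assumed; the class name is my own):

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Sketch: collect the <link> of every <item> in an RSS 2.0 feed. Each link
// points to the full article page, which can then be fetched and scraped.
public class ItemLinks {
    public static List<String> extract(String rssXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(rssXml.getBytes("UTF-8")));
        List<String> links = new ArrayList<String>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            NodeList link = ((Element) items.item(i)).getElementsByTagName("link");
            if (link.getLength() > 0)
                links.add(link.item(0).getTextContent().trim());
        }
        return links;
    }
}
```

The ugly part is what comes after: downloading each link and scraping the article body out of arbitrary HTML, which is where the content-extraction libraries discussed above come in.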
@Clifton is correct. This same limitation is even what protects posts from autoposter plugins. That said, you can find the common pattern of your target web page where the content resides, and extract that area of content programmatically.