Android: Extract article main content

I am currently building an Android application that extracts the main content and pictures from a website. Right now I am using the Jsoup API to extract all p tags from the HTML, but this is not a good solution. Can anyone suggest a better way to extract the main content and pictures of a web page on Android?

I didn't find anything that works for me, so I published Goose for Android here: https://github.com/milosmns/goose
A short description follows.
Document cleaning
When you pass a URL to Goose, the first thing it does is clean up the document to make it easier to parse. It goes through the whole document and removes comments and common social-network sharing elements, converts em and other tags to plain text nodes, tries to convert divs used as text nodes to paragraphs, and performs a general document cleanup (spaces, new lines, quotes, encoding, etc.).
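Goose's actual cleaners live in the repository linked above; as a rough illustration of the same kind of pass using Jsoup (the selectors here are examples, not Goose's real rules):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DocumentCleaner {
    // Rough illustration only; Goose's real cleaning rules are in its source.
    public static Document clean(String html) {
        Document doc = Jsoup.parse(html);
        // Drop scripts, styles, and common share-widget containers.
        doc.select("script, style, .share, .social, #comments").remove();
        // Unwrap emphasis tags so their text joins the surrounding text node.
        doc.select("em, strong").unwrap();
        // Promote divs that hold only text (no child elements) to paragraphs.
        for (Element div : doc.select("div")) {
            if (div.children().isEmpty() && !div.text().trim().isEmpty()) {
                div.tagName("p");
            }
        }
        return doc;
    }
}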
Content / Images Extraction
When dealing with random article links you're bound to come across the craziest of HTML files. Some sites even like to include two or more HTML documents per page. Goose uses a scoring system based on clustering of English stop words and other factors that you can find in the code. Goose also applies descending scoring: the further down the document a node sits, the lower its score. The goal is to find the strongest grouping of text nodes inside a parent container and assume that is the relevant group of content, as long as it sits high enough on the page.
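As a toy illustration of the stop-word signal (the word list is truncated and the scoring is much simpler than what Goose actually does):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class StopWordScorer {
    // Truncated stop-word list; Goose ships a much longer one.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
            "has", "he", "in", "is", "it", "its", "of", "on", "that", "the",
            "to", "was", "were", "will", "with"));

    // More stop words in a block usually means natural prose rather than
    // navigation links, captions, or boilerplate.
    public static int score(String text) {
        int score = 0;
        for (String word : text.toLowerCase().split("\\W+")) {
            if (STOP_WORDS.contains(word)) {
                score++;
            }
        }
        return score;
    }
}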
Image extraction is the step that takes the longest. Finding the most important image on a page proved challenging and required downloading the images to inspect them manually with external tools (not all images are considered; Goose checks mime types, dimensions, byte sizes, compression quality, etc.). Java's Image functions were just too unreliable and inaccurate. On Android, Goose uses the BitmapFactory class, which is well documented, tested, fast, and accurate. Images are analyzed starting from the top node where Goose finds the content, followed by a recursive run outwards looking for good images. Goose also checks whether those images are ads, banners, or author logos, and ignores them if so.
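The bounds-only decode that makes this cheap on Android looks roughly like this (the 200-pixel minimum is an arbitrary example threshold, not Goose's real rule):

import android.graphics.BitmapFactory;
import java.io.InputStream;

public final class ImageInspector {
    // Reads only the image header, so no full bitmap is allocated.
    public static boolean looksLikeContentImage(InputStream stream) {
        BitmapFactory.Options opts = new BitmapFactory.Options();
        opts.inJustDecodeBounds = true; // decode bounds and mime type only
        BitmapFactory.decodeStream(stream, null, opts);
        // outWidth/outHeight are -1 if decoding failed; outMimeType is
        // e.g. "image/jpeg". The 200-pixel minimum is an arbitrary example.
        return opts.outMimeType != null
                && opts.outWidth >= 200 && opts.outHeight >= 200;
    }
}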
Output Formatting
Once Goose has the top node where we think the content is, it will try to format that node's content for the output. For example, for NLP-type applications, Goose's output formatter will just pull out all the text and ignore everything else, and other (custom) extractors can be built to offer a more Flipboard-like experience.

Why do you think it's not a good solution to use Jsoup?
I've written many web scrapers for different web pages, and in my experience Jsoup is the way to go for that task. You should study the Jsoup selector syntax; it is very powerful, and with the right selectors you can extract most information from HTML documents very easily. It generally becomes harder to extract information when the document has no id or class attributes or other unique features to select on.
Other HTML parsers that might be interesting for you are JTidy and TagSoup.

You could try the textracto API; it automatically identifies the main content of HTML documents. There is also the option to parse OpenGraph metadata, so you would be able to extract a picture as well (og:image).
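If you stay with Jsoup, OpenGraph metadata is also straightforward to read yourself; a minimal sketch (the URL is a placeholder):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class OpenGraphExample {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; point this at a real article page.
        Document doc = Jsoup.connect("https://example.com/article").get();
        String image = doc.select("meta[property=og:image]").attr("content");
        String title = doc.select("meta[property=og:title]").attr("content");
        System.out.println(title + " -> " + image);
    }
}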

Related

Interact with a website built with Knockout.js in Android Kotlin

My goal is to interact with a website (not mine), getting data from it and posting data to it from my Android app, written in Kotlin. The interaction is to be done in the background, and the result is to be shown in a RecyclerView in my app.
The website in question uses Knockout.js; its responsiveness and dynamically changing data seem to make it impossible to use libraries such as Jsoup for my goal.
I am an aspiring app developer (n00b), and my question for the more senior devs here is:
Is my project impossible? I have read that it is "complex" to interact with a dynamic website, and I have also heard it is impossible. Is it? If not, could you point me to the libraries I should be using? It is OK if they are in Java; I could probably adapt them to Kotlin.
If the site you need to extract data from returns a predictable result when you make a request to a URL, then it is easy to extract the data you need using a library like Jsoup, which you've mentioned. Based on the Jsoup docs, that would be something like:
Document doc = Jsoup.connect("https://en.wikipedia.org/").get();
System.out.println(doc.title());
// Select anchors inside bold tags under the element with id "mp-itn".
Elements newsHeadlines = doc.select("#mp-itn b a");
for (Element headline : newsHeadlines) {
    System.out.printf("%s%n\t%s%n",
            headline.attr("title"), headline.absUrl("href"));
}
Here doc.select takes a CSS-style selector (in this case an id, plus the descendant elements under it) matching the elements whose contents you're looking to extract.
One caveat: Jsoup only parses the string contents of the response (basically what the server sends before any script runs); it does not execute JavaScript. So whether the site uses Knockout or another JS library matters insofar as the data is rendered client-side after page load. If the data you need is present in the initial HTML response, Jsoup will see it; if Knockout fills it in later (for example from an AJAX call), you would need to request the same endpoint the page itself calls.
But doing all of this is rather irregular, as @Gushan indicates: unless you're doing some sort of scraping activity (which would be unusual for an Android app), a site that wants to give you data, and from which you want to get data, will normally provide an API (usually some sort of REST API) that documents and simplifies how to get that data. But I guess things aren't always like that. :)

Android: store large amounts of text (HTML) and search through them

I am making a framework in order to easily "appify" books.
This framework will need to automatically detect chapters and headings to build a table of contents. The idea is to also be able to easily search through the text and find what you are looking for.
Now what I still need to figure out is:
how to store the data in such a way that I can easily detect the chapters and headings,
and still be able to search through the text.
The text that is stored needs to be formatted, so I thought I would store it as HTML or Markdown (which would be translated to HTML). I don't think the text would be very searchable if it is stored as HTML.
P.S. it does not have to be HTML if there are other more efficient ways to format the text.
Do you really want to do such a thing on the device itself?
I suggest using a separate SQLite database for every book, with separate tables for the table of contents, chapters, summarized chapter keywords (for faster search), and other service info.
You can also find full-text search examples for SQLite on Android; a minimal sketch follows below.
I also recommend bundling your own SQLite build with your app.
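A minimal sketch of the full-text-search idea using Android's built-in FTS4 support (the database, table, and column names are made up for the example):

import android.content.ContentValues;
import android.content.Context;
import android.database.Cursor;
import android.database.sqlite.SQLiteDatabase;
import android.database.sqlite.SQLiteOpenHelper;

public class BookSearchHelper extends SQLiteOpenHelper {
    public BookSearchHelper(Context context) {
        super(context, "book.db", null, 1); // hypothetical database name
    }

    @Override
    public void onCreate(SQLiteDatabase db) {
        // FTS4 virtual table: every column is indexed for full-text search.
        db.execSQL("CREATE VIRTUAL TABLE chapter_fts USING fts4(title, body)");
    }

    @Override
    public void onUpgrade(SQLiteDatabase db, int oldVersion, int newVersion) {
        db.execSQL("DROP TABLE IF EXISTS chapter_fts");
        onCreate(db);
    }

    public void insertChapter(String title, String body) {
        ContentValues values = new ContentValues();
        values.put("title", title);
        values.put("body", body);
        getWritableDatabase().insert("chapter_fts", null, values);
    }

    public Cursor search(String query) {
        // MATCH runs a full-text query against all indexed columns.
        return getReadableDatabase().rawQuery(
                "SELECT title FROM chapter_fts WHERE chapter_fts MATCH ?",
                new String[]{query});
    }
}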
Now let's talk about your main problem: the book scraping.
I have no expertise here, but I believe this problem is the same as scraping websites.
Update: please do not store book contents as HTML. You can store it as Markdown, for example; it takes less storage, is easier to sanitize, and you can always apply your own styles later.
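For example, rendering the stored Markdown back to HTML at display time could look roughly like this (assuming the commonmark-java library; any Markdown library would do):

import org.commonmark.node.Node;
import org.commonmark.parser.Parser;
import org.commonmark.renderer.html.HtmlRenderer;

public class MarkdownRenderer {
    private final Parser parser = Parser.builder().build();
    private final HtmlRenderer renderer = HtmlRenderer.builder().build();

    // Convert a stored Markdown chapter to HTML for display.
    public String toHtml(String markdown) {
        Node document = parser.parse(markdown);
        return renderer.render(document);
    }
}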

Storing information in PNG and JPG

I have found a number of resources, but nothing that has quite helped me with what I am looking for. I am trying to understand the .png and .jpg file formats well enough to be able to modify and/or read the EXIF or other metadata in the files, or to create my own metadata if possible.
I want to do this in the context of an Android application, so we can keep it there, but it is really not exclusive to that. I am trying to figure out how to do this using a simple input stream or byte array and go from there.
Android itself has to extract the RGB pixel information at some point when it creates a bitmap from the stream. I took a look at the BitmapFactory source to try to understand it, but I got lost somewhere after delving into the native code.
I assume the bitmaps lose any EXIF/metadata in the files, based on my research. So I guess I want to break the input streams down into byte arrays and remove the metadata. For .pngs I know there is no 'standard', but based on this page it seems there is some organization to the metadata you can store.
With all that said, I wouldn't mind leaving the EXIF/PNG standards behind and storing my own information in some standardized way, but I need to know more about how image readers identify files as either JPG, PNG, etc., and then determine where the pixel information is located.
So I guess my first question is: has anyone done something similar to this before who can fill me in? If not, does anyone know of any good libraries that might be useful for figuring out how to locate and extract this data?
Or even more basically, what is a good way to find the metadata and/or EXIF data, or even the RGB data, programmatically using something like a byte array?
There are a lot of things to address in your question, but first I should clarify that when you say "Android itself has to at least extract the RGB pixel information," what you're referring to is the act of decompression, which is complicated in the case of JPEG and nontrivial even for PNG. I think it would be very useful for you to read through the Wikipedia articles on JPEG and PNG before attempting to go any further (especially the sections on headers, syntax, file structure, etc.).
That being said, you've got the right idea. It shouldn't be too difficult to read in the header of an image as a byte array/stream, make some changes, and replace the old file. A PNG file can be identified by its first 8 bytes, and there should be a similar way to identify a JPEG - I can't remember it off the top of my head.
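For what it's worth, a sketch of both signature checks (the PNG signature is the standard 8 bytes 89 50 4E 47 0D 0A 1A 0A, and JPEG files begin with the SOI marker FF D8 followed by FF):

import java.io.IOException;
import java.io.InputStream;

public final class ImageFormatSniffer {
    private static final int[] PNG_SIGNATURE =
            {0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A};

    // Identify PNG vs JPEG from the first few bytes of the stream.
    public static String sniff(InputStream in) throws IOException {
        byte[] header = new byte[8];
        int read = in.read(header);
        if (read >= 8 && matchesPng(header)) {
            return "png";
        }
        // JPEG streams begin with the SOI marker FF D8, then FF.
        if (read >= 3 && (header[0] & 0xFF) == 0xFF
                && (header[1] & 0xFF) == 0xD8
                && (header[2] & 0xFF) == 0xFF) {
            return "jpeg";
        }
        return "unknown";
    }

    private static boolean matchesPng(byte[] header) {
        for (int i = 0; i < PNG_SIGNATURE.length; i++) {
            if ((header[i] & 0xFF) != PNG_SIGNATURE[i]) {
                return false;
            }
        }
        return true;
    }
}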
To modify PNG metadata, you'll have to understand "chunks": types/names, ordering, format, CRC, etc. The libpng website has good resources for this, including general PNG info as well as the chunk specifications. Make sure you don't forget to recalculate the CRC if you change anything.
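A rough sketch of walking the chunks and verifying each CRC (the CRC32 covers the type and data bytes but not the length field):

import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

public final class PngChunkWalker {
    // Each chunk: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
    public static void walk(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        in.skipBytes(8); // skip the 8-byte PNG signature
        while (true) {
            int length = in.readInt();
            byte[] type = new byte[4];
            in.readFully(type);
            byte[] data = new byte[length];
            in.readFully(data);
            long storedCrc = in.readInt() & 0xFFFFFFFFL;
            CRC32 crc = new CRC32();
            crc.update(type);
            crc.update(data);
            String name = new String(type, "US-ASCII");
            System.out.printf("%s: %d bytes, crc %s%n", name, length,
                    crc.getValue() == storedCrc ? "ok" : "MISMATCH");
            if ("IEND".equals(name)) {
                break; // IEND is always the final chunk
            }
        }
    }
}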
JPEG sections a file off using "markers," which are two bytes long and always start with FF. Exif is just a regular JPEG file with a more specific structure for its metadata, and this seems like a reasonable introduction: Exif/TIFF
There are probably libraries for Android/Java that conveniently take care of this for you, but I've never used any myself. A quick Google search turns up this, and I'm sure there are many other options if you don't want to take the time to write a parser yourself.

How can I search content in HTML and not the tags

I have a database of content, the majority of which is HTML pages that are then used for display purposes in an app.
We are looking to build out a search feature, but I have some concerns about false positives appearing because the results include HTML code.
E.g. searching for "title" will return any content pages that have a title HTML tag.
We are currently using NSPredicate to perform the query on a Core Data database.
Are there any easy/efficient ways to prevent these results being returned?
I have the same problem on Windows and Android as well!
One idea for iOS is to actually store a separate text version apart from the HTML version. You could then use a very simple (even if not very efficient) predicate like
[NSPredicate predicateWithFormat:@"text CONTAINS[cd] %@", searchText];
A more performant way would be to strip out the words and store them in lowercase in an indexed attribute of another entity.
In both cases, the parsing should be done beforehand via one of the available libraries (see, e.g., the link in the comment).
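On Android/Java the pre-parsing step can be as simple as Jsoup's built-in text extraction; a minimal sketch:

import org.jsoup.Jsoup;

public final class HtmlStripper {
    // Strip tags so the search index only sees visible text.
    public static String toPlainText(String html) {
        // Jsoup.parse(...).text() returns the document's visible text,
        // with tags removed and whitespace normalized.
        return Jsoup.parse(html).text();
    }
}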

Best method for XML data storage

I am a PHP/MySQL developer learning Android. I am creating an Android app that receives info from my PHP app to create list views of different products, each of which opens a web view of that product's details.
Currently my PHP CMS web application outputs XML lists for an iPhone app (and separately outputs HTML). I have full control of the PHP app, so if there is a better way to output the data for the Android app, please let me know.
I have written code that reads the XML from the web and creates the list view. The list may be refreshed daily, so the data does not need to be read from the online XML every time the app starts.
So I was thinking of storing the retrieved data locally to improve my app's responsiveness. There may be up to 500 product descriptions stored at any given time, in up to 30 different XML lists. I am starting development with one XML list of about 30 products.
For best performance, should I store the product info in a SQLite DB, or should I store the actual XML file in the cache/DB, or use some other method like the application cache?
I was also thinking of running the data update as a service; would this be a good idea?
The most efficient place to keep data is RAM, but if you want to cache it persistently, then a database is the most efficient option.
I recommend storing your data in Android's SQLite database.
You could also consider zipping your XML for faster network transfer and unzipping it with the java.util.zip package classes. You could even consider a simpler format for transmitting the data, less verbose than XML, using a DataInputStream/DataOutputStream.
(I do that in one of my apps and it works great.)
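The unzipping side could look roughly like this on the client (assuming the PHP side serves a gzipped XML file at some URL):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.zip.GZIPInputStream;

public class GzipXmlFetcher {
    // Download a gzipped XML list and decompress it on the fly.
    public static String fetch(String urlString) throws Exception {
        URL url = new URL(urlString); // e.g. a .xml.gz your PHP side produces
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(url.openStream()), "UTF-8"))) {
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
            return sb.toString();
        }
    }
}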
Here are some details on the data input/output stream method:
imagine a proprietary protocol for your data, with only what you need: no tags, no attributes, just raw values in order.
on the client side, get an input stream on your data using URL.getContent() and cast it to an InputStream.
on the client side still, build a DataInputStream wrapping your input stream and read the data in order. Use readInt, readDouble, readUTF, and so on.
on the server side, from PHP, you need to find a way to write your data in a format compatible with the format the client expects. I can't tell you much about PHP; I only program in Java.
The advantage of this technique is that you save bandwidth, as there is only data and none of XML's verbose decoration. You should read the Java specs to understand how doubles, ints, and strings are written to a DataOutputStream. But it can be hard to get the data format right across two languages.
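A minimal sketch of the client side (the endpoint and field order are hypothetical; the server must write exactly the same sequence, with big-endian ints and doubles, and strings as a 2-byte length followed by modified UTF-8 bytes):

import java.io.DataInputStream;
import java.net.URL;

public class ProductFetcher {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint; the field order below is made up for
        // the example: count, then id/price/name per product.
        URL url = new URL("https://example.com/products.bin");
        try (DataInputStream in = new DataInputStream(url.openStream())) {
            int count = in.readInt();            // number of products
            for (int i = 0; i < count; i++) {
                int id = in.readInt();           // product id
                double price = in.readDouble();  // product price
                String name = in.readUTF();      // product name
                System.out.println(id + " " + name + " " + price);
            }
        }
    }
}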
If PHP can't write the format in a suitable way, use XML; it will be much simpler. First try plain XML, then try serving a zipped version of the XML file.
But all of this is about the speed gain during the network transfer.
The second part of what you have to do is store each row of your list in a SQL table. Then you can retrieve it pretty fast using a CursorAdapter for your list view (it breaks the charming MVC model, but it is quite fast!).
Sorry about this, but it became too long to write as a comment. This is not intended to be an answer to your question, because in my opinion Stéphane answered it very well. The best solution is indeed to store the data in an SQLite database. Then you need to create the class to be used as a connection between the data, the database, and the app. I don't want to take credit for what has already been said here (I, too, voted it up).
I'm concerned about the other suggestion (the use of low-level raw streams for data exchange, the listed steps in that answer). I strongly recommend that you avoid creating your own proprietary protocol. It goes like this:
I need to exchange data.
I don't want to deal with the hassle of integrating external APIs into my code.
I know I can write two 5-minute routines to read and write the data back and forth.
Therefore, I just created my own proprietary format for exchanging data!
It makes me cry whenever I need to deal with unknown, obscure, and arbitrary sequences of data blobs. It's always good to remember why we should not use unknown formats:
Reinventing the wheel is counter-productive. It may not seem so, but in the medium term it is. With well-known formats you can adapt your project to other mediums (server-side, other platforms) easily.
Using off-the-shelf components helps you scale your code later.
Whenever you need to adapt your solution to other technologies and mediums, you'll work faster. Otherwise, you would probably end up with ad hoc code that is not (easily) extensible or interoperable.
Using off-the-shelf components enables you to leverage advances in that particular technology. That's particularly important when you are using Android APIs, as they are frequently optimized for performance later down the road (see Android's Designing for Performance). Rolling your own standards may result in a performance penalty.
Unless you document your protocol, it's extremely easy to forget the protocol you created yourself. Just give it enough time and it will happen: you'll need to relearn or remember it. And if you do document it, you are spending your brain's computational time on that, too.
You think you don't need to scale your work, but chances are you will, most of the time.
When you do, you will wish you had learned how to easily and seamlessly integrate well-known formats.
The learning curve is needed anyway. In my experience, once you have learned them, you can actually integrate well-known formats faster than you can invent your own way of doing things.
Finally, trust your data to the people who have devoted their careers to creating cohesive and intelligent standards. They know it better!
In the end, if the purpose is to avoid the typical verbosity of XML, for whatever reason, you have several options. Right now I can think of CSV, but I'm no expert in data storage, so if you're not comfortable with it, I'm sure you can find good alternatives with plenty of ready-to-use APIs.
Good luck!
