I know how to fetch a website's source code with URLConnection and BufferedReader, but sometimes a website gets its data from elsewhere and only then shows it on the page.
E.g. I am now working on this page:
http://bet.hkjc.com/marksix/userinfo.aspx?file=lucky_ocbs.asp&lang=en
The names of the 10 branches and the other details in the table on that page are not in the page's source code.
Question:
Instead of extracting data from the source code, is there any way to extract the text as it is finally displayed on the page? If yes, how could it be done?
Thanks a lot.
Yes, there is a way to extract the information even if the website does some client-side work, such as loading data from an external source, before displaying it. It is a tricky solution, though; if you have the opportunity to reach an agreement with the website's owner and ask them to provide an API for your application, I'd choose that option.
OK, for your question: you can try using Android's WebView to render the website first, then get the HTML content using one of the methods described here. The trickiest part is making this user friendly: you have to cover the WebView with a progress bar while your app waits for the onPageFinished callback from the WebView. I'm not sure the WebView behaves properly in that case, but it's worth a try.
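To make the WebView idea a bit more concrete, here is a minimal sketch, not a definitive implementation. It assumes you read the rendered page with WebView.evaluateJavascript (API level 19+), which is only one possible way of getting the HTML. That callback delivers its result as a JSON value, so a string typically arrives wrapped in quotes with escape sequences; the class below decodes it. JsResultDecoder and handleHtml are made-up names for illustration.

```java
final class JsResultDecoder {

    /**
     * Decodes a JSON string literal (quotes plus backslash escapes) into
     * plain text. Returns null for the JSON literal null and returns the
     * input unchanged if it is not a quoted string.
     */
    static String decode(String json) {
        if (json == null || json.equals("null")) {
            return null;
        }
        int last = json.length() - 1;
        if (last < 1 || json.charAt(0) != '"' || json.charAt(last) != '"') {
            return json;
        }
        StringBuilder out = new StringBuilder(json.length());
        for (int i = 1; i < last; i++) {
            char c = json.charAt(i);
            if (c != '\\') {
                out.append(c);
                continue;
            }
            char e = json.charAt(++i);
            switch (e) {
                case 'n': out.append('\n'); break;
                case 't': out.append('\t'); break;
                case 'r': out.append('\r'); break;
                case 'b': out.append('\b'); break;
                case 'f': out.append('\f'); break;
                case 'u':
                    // Four hex digits follow the 'u'.
                    out.append((char) Integer.parseInt(json.substring(i + 1, i + 5), 16));
                    i += 4;
                    break;
                default:
                    // Escaped quote, backslash or slash: keep the character itself.
                    out.append(e);
            }
        }
        return out.toString();
    }
}

// Android wiring (not compiled here; handleHtml is a hypothetical method):
//   webView.getSettings().setJavaScriptEnabled(true);
//   webView.setWebViewClient(new WebViewClient() {
//       @Override public void onPageFinished(WebView view, String url) {
//           view.evaluateJavascript("document.documentElement.outerHTML",
//               value -> handleHtml(JsResultDecoder.decode(value)));
//       }
//   });
```

Keep in mind that onPageFinished can fire before the page's own scripts have finished loading their data, so you may have to wait or retry before the table shows up in the result.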
Short Answer: You can't.
Reason: HTML is rendered on the client side, e.g. by browsers such as Chrome, Firefox or Internet Explorer. Since you don't have an interpreter for the markup language, you can't fetch only the content of particular tags; even browsers download the whole content, which is how HTTP works.
Workaround: Since you mentioned that the branches are not in the page source, I assume they are loaded on the client side via some JavaScript. What you can do is check which requests the client executes and perform them from your code, since your client is the app.
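As a rough sketch of that workaround: once you have found the request the page's script makes (for example in the browser's network tab), you can issue the same request yourself with the classes you already use. The URL in the comment is a hypothetical placeholder, not the real endpoint of that page, and the UTF-8 charset and User-Agent header are assumptions too.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

final class DataEndpointFetcher {

    /** Reads everything from the reader into one string. */
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    /** Sends a GET request and returns the response body as text. */
    static String fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        // Some servers answer differently without a browser-like User-Agent.
        conn.setRequestProperty("User-Agent", "Mozilla/5.0");
        try (Reader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            return readAll(in);
        } finally {
            conn.disconnect();
        }
    }
}

// Usage, off the main thread on Android (placeholder URL, replace it with
// the request you actually observe):
//   String body = DataEndpointFetcher.fetch("http://example.com/data/request/seen/in/network/tab");
```

What comes back may be JSON, XML or an HTML fragment, so the parsing step depends on what that request actually returns.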
Also see: Jsoup
You cannot extract only the information you want without downloading the source HTML. After you have downloaded the source, you can use jsoup to pick out only the information you want.
Add this to your app-level build.gradle file:
implementation 'org.jsoup:jsoup:1.9.2'
Then you can download and parse the source code.
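// Needs java.io.InputStream, java.net.URL, org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.select.Elements.
String url = "http://bet.hkjc.com/marksix/userinfo.aspx?file=lucky_ocbs.asp&lang=en";
Document doc;
try (InputStream input = new URL(url).openStream()) {  // run off the main thread on Android
    // The charset is hard-coded here; pass null to let jsoup detect it.
    doc = Jsoup.parse(input, "ISO-8859-9", url);
}
// Example selectors only: adjust them to the markup of the page you parse.
Elements sectionElements = doc.select("div#general-info-panel");
Elements imageElements = sectionElements.select("img[src]");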
You need to adapt the code block above to your page's HTML source. You can find plenty of examples of how to use jsoup.
PhantomJS (http://phantomjs.org/) can be used to extract a website's content after JavaScript execution. I'm not sure whether there is an Android build.
Related
My goal is to interact with a website (not mine), getting and posting data from it to my Android app coded using Kotlin. The interaction part is to be done in the background, as the result is to be shown in a RecyclerView in my app.
The website in question uses Knockout.js. Its responsiveness and dynamically changing data make it impossible to use libraries such as Jsoup for my goal at hand.
I am an aspiring App developer (n00b), and the question I have for the more senior devs here:
Is my project impossible? I have read it is "complex" to interact with a website that is dynamic, and I have also heard it is impossible. Is it? If not, could you guide me to the libraries I should be using? It is ok if these are in Java, I could probably look at adapting these to Kotlin.
If the site you need to extract data from produces a predictable result when you make a request to a URL, then it would be easy to extract the data you need using a library like Jsoup, which you've mentioned. Looking at the Jsoup docs, that would be something like:
Document doc = Jsoup.connect("https://en.wikipedia.org/").get();
log(doc.title());
Elements newsHeadlines = doc.select("#mp-itn b a");
for (Element headline : newsHeadlines) {
    log("%s\n\t%s",
        headline.attr("title"), headline.absUrl("href"));
}
Here doc.select takes a CSS selector (in this case an element id followed by descendant tags) that identifies the elements whose contents you're looking to extract.
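Whether the site uses Knockout or another JS library matters only if the content you need is filled in on the client side. All Jsoup does is parse the string contents of the response, basically what you see when you view source in your browser; it does not execute JavaScript. If the data is already in that HTML, Jsoup will find it. If Knockout binds the data in the browser after the page loads, it won't be in what Jsoup parses, and you would have to render the page first (for example in a WebView) or request the data from the same endpoint the script uses.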
But doing all of this is rather irregular, as @Gushan indicates. Unless you're doing some sort of scraping, which would be weird for an Android app, a site that wants to give you data will normally provide an API (usually some sort of REST API) that documents and simplifies how to go about getting that data. But I guess things aren't always like that. :)
I have an app that has a web-view which has a basic web-form that has a few fields and a submit button. I would like to figure out in my app if the form has any input in any of the fields. I cannot change the form from the server side, and I can't be certain much about the fields (ids / names in the html).
On iOS we accomplish this by pulling all the HTML out when the form loads and comparing it to the HTML at any given point; if they don't match, the user must have entered something into a field. I believe we were able to get the HTML by injecting and running some JavaScript in the web view. I'm not sure exactly how to approach the problem on Android, or whether Android has any better tools to tell if a form has been edited.
Anybody have any ideas / pseudo-code how I can tell if a form has had input in any of the fields in a webview in android?
Unfortunately, there are no special form-related tools in Android WebView either. You can use the same approach as you have described for iOS.
A couple of links to get you started:
Read HTML content of webview widgets
Android Web-View : Inject local Javascript file to Remote Webpage
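Here is a minimal sketch of that approach on Android, under a few assumptions. Text typed into a field is generally not reflected in the serialized HTML, so instead of comparing outerHTML this variant compares a snapshot of the field values. FormChangeDetector and its method names are made up for illustration; the only Android call relied on is WebView.evaluateJavascript (API level 19+).

```java
final class FormChangeDetector {

    /**
     * JavaScript that collects the current value (or checked state) of every
     * input, textarea and select on the page and returns them as a JSON array.
     * It needs no knowledge of the fields' ids or names.
     */
    static final String SNAPSHOT_JS =
          "(function(){"
        + "var f=document.querySelectorAll('input,textarea,select');"
        + "var v=[];"
        + "for(var i=0;i<f.length;i++){"
        + "var e=f[i];"
        + "v.push((e.type==='checkbox'||e.type==='radio')?e.checked:e.value);"
        + "}"
        + "return JSON.stringify(v);"
        + "})()";

    private String baseline;

    /** Stores the first snapshot, taken right after the form has loaded. */
    void setBaseline(String snapshot) {
        baseline = snapshot;
    }

    /** True if a baseline exists and the new snapshot differs from it. */
    boolean hasChanged(String snapshot) {
        return baseline != null && !baseline.equals(snapshot);
    }
}

// Android wiring (not compiled here; onFormEdited is a hypothetical method):
//   In WebViewClient.onPageFinished:
//       view.evaluateJavascript(FormChangeDetector.SNAPSHOT_JS, detector::setBaseline);
//   Later, when you need to know:
//       webView.evaluateJavascript(FormChangeDetector.SNAPSHOT_JS,
//           s -> { if (detector.hasChanged(s)) onFormEdited(); });
```

Fields that the page pre-fills by script after loading would count as a change here, so take the baseline only once the form has settled.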
I know how to get data from a URL, but my point is this: when a user pastes a URL into my EditText, I want to expand the URL to get the description and main image from it. When we paste a URL into a Facebook or Google+ text field, it reads and expands the URL, and I want exactly that. In web development we can get that data from the HTML <meta> tags, but on Android how can I get just the <meta> data rather than the whole HTML of the URL?
Check out this question on SO: How to extract meta tags from website on android?
You can pull the meta tags from websites using that library.
Edit:
Due to the lack of documentation on that other library, I did some research and found something better. There is a library called jsoup, which comes as a .jar file and can be imported into the libs folder of your Android Studio project.
jsoup lets you pull tags from websites using a method known as scraping. Disclaimer: make sure you have permission to scrape a website, depending on what you're doing with the results. Here's a great tutorial on how to use jsoup to pull tags into Android. It has a really well laid out example.
Hopefully this works a little better.
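If you just want to see the idea without adding a dependency, below is a rough sketch that pulls name/property and content pairs out of <meta> tags. Note that it swaps in plain regular expressions instead of jsoup, which is fragile on messy HTML; a real parser such as jsoup is the more robust choice. MetaTagSketch is a made-up name, and the HTML still has to be downloaded first.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaTagSketch {

    // Matches one whole <meta ...> tag.
    private static final Pattern META =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Matches one quoted attribute: group 1 is the name, group 3 or 4 the value.
    private static final Pattern ATTR =
            Pattern.compile("([\\w:-]+)\\s*=\\s*(\"([^\"]*)\"|'([^']*)')");

    /** Returns a map from meta name/property (e.g. "og:image") to its content. */
    static Map<String, String> extract(String html) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher tag = META.matcher(html);
        while (tag.find()) {
            String key = null;
            String content = null;
            Matcher attr = ATTR.matcher(tag.group());
            while (attr.find()) {
                String name = attr.group(1).toLowerCase(Locale.ROOT);
                String value = attr.group(3) != null ? attr.group(3) : attr.group(4);
                if (name.equals("name") || name.equals("property")) {
                    key = value;
                } else if (name.equals("content")) {
                    content = value;
                }
            }
            if (key != null && content != null) {
                result.putIfAbsent(key, content); // keep the first occurrence
            }
        }
        return result;
    }
}
```

For a link preview you would typically look up keys such as "description", "og:description" and "og:image" in the returned map, if the page provides them.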
I pulled a website to a WebView via HTTP GET. The problem is that the website isn't formatted for mobile. I found that if I edit the HTML, I can comment out the scripting that makes the left pane on the site.
Method:
Download page to string, search string for and replace first substring <link with <!--, write to file, and load into the WebView.
That works great until it comes to a link. Clicking on it causes the WebView to attempt to load file:///index.php/Whatever_the_page_was.
What I want to do is capture that link request and change the file:/// part to www.wurmpedia.com, and then run it through my parser to remove the script like the first, and repeat the process on any other link click that follows.
I could not find any other way to pull this off and this is what I made up. Any help would be appreciated, either through URL modification or with a more efficient method.
How about intercepting the link request using
WebViewClient.shouldInterceptRequest
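A small sketch of the URL-fixing part, which you could call from whichever WebViewClient callback you end up using (shouldInterceptRequest as suggested above, or shouldOverrideUrlLoading for link clicks). LinkRewriter is a made-up helper, and the http:// scheme is an assumption, since the question only gives the host name.

```java
final class LinkRewriter {

    // Assumption: the site is reachable over plain http at this host.
    static final String SITE_BASE = "http://www.wurmpedia.com";

    private static final String FILE_PREFIX = "file://";

    /**
     * Turns file:///index.php/Page into SITE_BASE + /index.php/Page.
     * URLs that do not start with file:// are returned unchanged.
     */
    static String rewrite(String url) {
        if (url != null && url.startsWith(FILE_PREFIX)) {
            return SITE_BASE + url.substring(FILE_PREFIX.length());
        }
        return url;
    }
}

// Android wiring (not compiled here; downloadStripAndLoad is hypothetical and
// stands for your existing download, strip-script and load steps):
//   @Override public boolean shouldOverrideUrlLoading(WebView view, String url) {
//       downloadStripAndLoad(LinkRewriter.rewrite(url));
//       return true; // we handle the navigation ourselves
//   }
```

It may also be worth trying WebView.loadDataWithBaseURL with the site's address as the base URL when you load the edited HTML, so that relative links resolve against the site rather than file:///; you would still need the callback to run each new page through your parser.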
I have read the example for RSS parsing from the IBM site (http://www.ibm.com/developerworks/opensource/library/x-android/).
In this example, the RSS items are shown in a ListView, and if you press one announcement you can see it in the device's web browser. How could I show them inside the app, without using the device browser?
Thanks a lot
Create a layout with a WebView then load the URL from each "announcement" using WebView.loadUrl.
I'm a little confused but you seem to have answered your own question.
You say you don't want to use the web browser on the device but the example in your question doesn't use the browser. It does exactly what you're asking for.
The idea is that you download the html from the website and then use the parser to break it up into separate "announcements" and store them in list view items in your program.
I have done a bit of this type of thing myself in android. I used jsoup java library, which makes breaking the html into the bits you want to display really easy.
If you want some more help, I can give you an example: an app I made that pulls movie times from google.com/movies. Here are links to the classes where I did the HTML download and parse:
ScreenScraper.java
HtmlParser.java
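For the RSS case specifically, the download-then-parse idea can be sketched with the XML parser that ships with Java and Android. This is not the code from the classes linked above, just an illustration; RssSketch is a made-up name, and it assumes a plain RSS 2.0 feed with <item>, <title> and <link> elements.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class RssSketch {

    /** Returns one {title, link} pair per <item> element in the feed. */
    static List<String[]> parseItems(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList items = doc.getElementsByTagName("item");
        List<String[]> result = new ArrayList<>();
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            result.add(new String[] { text(item, "title"), text(item, "link") });
        }
        return result;
    }

    /** Text of the first child element with the given tag, or "" if absent. */
    private static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }
}
```

Each title can go into a list item, and when the user taps one you pass the matching link to WebView.loadUrl in your own layout instead of starting the browser.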