I know how to fetch a website's source code with URLConnection and BufferedReader, but sometimes a website loads its data from elsewhere and only then displays it on the page.
For example, I am now working on this page:
http://bet.hkjc.com/marksix/userinfo.aspx?file=lucky_ocbs.asp&lang=en
and the names and other details of the 10 branches in the table on the page are not in the page's source code.
Question:
Instead of extracting data from the source code, is there any way to extract the text from what is finally displayed on the page? If so, how can it be done?
Thanks a lot.
Yes, there is a way to extract the information from the website even if it performs client-side operations such as loading data from an external source before displaying it. It is a tricky solution, though. If you have the opportunity to reach an agreement with the website's owner and ask them to provide an API for your application, I would choose that option.
OK, for your question, you can try using Android's WebView to render the website first. Then get the HTML content using one of the methods described here. The trickiest part is making this user-friendly: you have to cover the WebView with a progress bar while your app waits for the onPageFinished callback from the WebView. I'm not sure the WebView behaves properly in that case, but it's worth a try.
Short Answer: You can't.
Reason: HTML is rendered on the client side, e.g. by browsers such as Chrome, Firefox, or Internet Explorer. Since you don't have an interpreter for the markup language, you cannot retrieve only the content of a particular tag. Even browsers download the whole document; this is how HTTP works.
Workaround: Since you mentioned that some branches are not in the page source, I assume they are loaded on the client side by JavaScript. What you can do is check what the client is executing and perform the same requests in code, since your app is the client.
Also see: Jsoup
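The workaround above (repeating the request that the page's JavaScript makes) might look roughly like the sketch below. It is only an illustration under assumptions: the class name DataEndpointFetcher is made up, the data URL has to be found first (for example in the network tab of a desktop browser's developer tools), and the response is assumed to be UTF-8 text.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the name and structure are illustrative only.
class DataEndpointFetcher {

    // Reads a whole stream into a String, decoding it as UTF-8 (an assumption).
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    // Requests the resource that the page's script would otherwise load.
    // dataUrl is whatever URL you observed the page requesting.
    static String fetch(String dataUrl) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(dataUrl).openConnection();
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(10_000);
        try (InputStream in = conn.getInputStream()) {
            return readAll(in);
        } finally {
            conn.disconnect();
        }
    }
}
```

On Android, fetch would have to be called off the main thread, and the returned text still needs parsing (for instance with jsoup if it is HTML).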
You cannot extract only the information you want without downloading the source HTML. After you have downloaded the source, you can use jsoup to select only the information you want.
Add this to your app-level build.gradle file:
compile 'org.jsoup:jsoup:1.9.2'
Then you can download and parse the source code:
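String url = "http://bet.hkjc.com/marksix/userinfo.aspx?file=lucky_ocbs.asp&lang=en";
// Download the page source; try-with-resources closes the stream afterwards.
try (InputStream input = new URL(url).openStream()) {
    // A null charset lets jsoup detect it from the page, falling back to UTF-8.
    Document doc = Jsoup.parse(input, null, url);
    // Example selectors only; replace them with ones that match the target page.
    Elements sectionElements = doc.select("div#general-info-panel");
    Elements imageElements = sectionElements.select("img[src]");
}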
You need to adapt the code block above to your page's HTML source. You can find examples of how to use jsoup.
PhantomJS (http://phantomjs.org/) can be used to extract a website's content after JavaScript execution. I'm not sure whether it has an Android build.
I've created a simple Android browser. I use an EditText for the URL and a WebView for loading web pages. Although the browser works fine when I enter the complete URL, I need functionality that offers predictions/auto-completion for partially typed URLs. Please let me know whether the following options are valid:
Using AutoCompleteTextView instead of EditText with a database of all the available websites on the internet. (More than 1 billion websites! Where can I get such a dynamically updated database?)
Saving URLs that the user visits frequently and using them as predictions in an AutoCompleteTextView. (How can I add these URLs to a dynamic AutoCompleteTextView/database?)
Currently I am using Google search as a workaround. Please share your views on how this can be achieved.
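For the second option (remembering the URLs the user has visited), the suggestion logic could be sketched in plain Java as below. This is only an illustration: the class UrlHistory and its methods are hypothetical names, the history lives in memory only, and matching is a simple case-insensitive substring test. In an app the suggestions would presumably be handed to the AutoCompleteTextView through an adapter, and the counts persisted in a database or preferences.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory history of visited URLs with visit counts.
class UrlHistory {
    private final Map<String, Integer> visitCounts = new HashMap<>();

    // Call this each time the browser finishes loading a URL.
    void recordVisit(String url) {
        visitCounts.merge(url, 1, Integer::sum);
    }

    // Returns up to 'limit' stored URLs containing the typed text,
    // most frequently visited first (ties broken alphabetically).
    List<String> suggest(String typed, int limit) {
        String needle = typed.toLowerCase();
        List<String> matches = new ArrayList<>();
        for (String url : visitCounts.keySet()) {
            if (url.toLowerCase().contains(needle)) {
                matches.add(url);
            }
        }
        matches.sort((a, b) -> {
            int byCount = Integer.compare(visitCounts.get(b), visitCounts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(matches.subList(0, Math.min(limit, matches.size())));
    }
}
```

A substring match is used rather than a prefix match so that typing "exam" also finds "https://example.org"; whether that is the right behaviour is a design choice.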
I am just starting out with Android, and I would like to write an app that connects to a certain website and then saves the whole HTML of that site to a text file. I need this for later purposes, such as extracting some information from the HTML. Do I need to use WebView for this?
Details: The website contains job offers. I need to connect to the fields for the description and location of a job so that my application can fill in those fields. The next step would be listing all the results.
Please give me some advice :)
I am writing a timetable app for my school, and I need to send a notification whenever the timetable changes. Such a change would show up as a change in the HTML code. I can't use the If-Modified-Since option because the web page is updated automatically, so I have to detect a change in the HTML code itself.
I hope someone can help me out here.
There are two ways to do this well, and neither involves looking at the HTML.
1) Write a web service to query for the raw information. This can include a timestamp of the last update, which you can simply compare.
2) Do the entire feature via push messaging rather than pull.
The only reason to even consider checking the HTML is if you're doing this without the school's help and screen scraping the information. In that case you may as well do a String.equals between the last HTML you got and the current one; there's no better way in that case.
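The comparison described above could be sketched like this. Note that the sketch swaps in a small variation: instead of keeping the whole previous page for String.equals, it keeps only a SHA-256 digest of it, which is easier to persist between runs. The class name TimetableChangeDetector is hypothetical, and downloading the page and sending the notification are left out.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper that remembers a digest of the last page seen.
class TimetableChangeDetector {
    private String lastDigest; // null until the first page has been seen

    // Returns the SHA-256 digest of the text as lowercase hex.
    static String sha256Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // Returns true if the HTML differs from the previously seen HTML.
    // The very first call returns false because there is nothing to compare with.
    boolean hasChanged(String html) {
        String digest = sha256Hex(html);
        boolean changed = lastDigest != null && !lastDigest.equals(digest);
        lastDigest = digest;
        return changed;
    }
}
```

Any difference in the page, including a changed timestamp or advertisement, counts as a change here, so in practice it may be worth comparing only the part of the HTML that holds the timetable.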
I have a WebView with a web page loaded in it. Now I want to retrieve the complete HTML contents of what is inside the WebView.
I use loadUrl("javascript:...") and WebView's JavaScript interface feature to retrieve this HTML, using something like this:
document.getElementsByTagName('html')[0].innerHTML / outerHTML
document.documentElement.outerHTML
...
In each case I receive only partial HTML contents: exactly the first 10000 characters! So my question is: how do I get the complete HTML content? Is it device-specific, and are there perhaps workarounds?
By the way, the web pages are created dynamically with JavaScript, so I can't simply download the file from the server.
Also, I tried printing the HTML contents in JavaScript with console.log and saw exactly the same behavior.
Thanks in advance!
My mistake: it was not related to JavaScript, nor did it have anything to do with the specific device I tested on.
So, in short, any of those JS properties works correctly.