Selendroid as a web scraper - android

I intend to create an Android application that performs a headless login to a website and then scrapes some content from the subsequent page while maintaining the logged-in session.
I first used HtmlUnit in a plain Java project and it worked fine, but I later found that HtmlUnit is not compatible with Android.
Then I tried the Jsoup library, sending an HTTP POST request to the login form. But the resulting page does not load completely, since Jsoup does not execute JavaScript.
It was then suggested that I look at Selendroid, which is actually an Android test automation framework. What I really need is an HTML parser that supports JavaScript and runs on Android. I find Selendroid quite difficult to understand; I can't even figure out which of these dependencies to use:
selendroid-client
selendroid-standalone
selendroid-server
With Selenium WebDriver, the code would be as simple as the following. But can somebody show me a similar code example for Selendroid as well?
WebDriver driver = new FirefoxDriver();
driver.get("https://mail.google.com/");
driver.findElement(By.id("email")).sendKeys(myEmail);
driver.findElement(By.id("pass")).sendKeys(pass);
// Click on 'Sign In' button
driver.findElement(By.id("signIn")).click();
And also,
What dependencies should I add to my build.gradle file?
Which Selendroid libraries should I import?

Unfortunately I didn't get Selendroid to work. But I found a workaround to scrape dynamic content using just Android's built-in WebView with JavaScript enabled.
// context: any valid Context, e.g. the hosting Activity
mWebView = new WebView(context);
mWebView.getSettings().setJavaScriptEnabled(true);
mWebView.addJavascriptInterface(new HtmlHandler(), "HtmlHandler");
mWebView.setWebViewClient(new WebViewClient() {
    @Override
    public void onPageFinished(WebView view, String url) {
        super.onPageFinished(view, url);
        // Compare the string contents, not the references
        if (url.equals(urlToLoad)) {
            // Pass html source to the HtmlHandler
            view.loadUrl("javascript:HtmlHandler.handleHtml(document.documentElement.outerHTML);");
        }
    }
});
The JS property document.documentElement.outerHTML holds the full HTML of the loaded page. The retrieved HTML string is then passed to the handleHtml method of the HtmlHandler class.
class HtmlHandler {
    @JavascriptInterface
    @SuppressWarnings("unused")
    public void handleHtml(String html) {
        // scrape the content here
    }
}
You may use a library like Jsoup to scrape the necessary content from the html String.

I have never used Selendroid, so I'm not really sure about this, but searching the net I found this example and, going by it, I suppose your code translated from Selenium to Selendroid would be:
Translated code (in my opinion)
public class MobileWebTest {
    private SelendroidLauncher selendroidServer = null;
    private WebDriver driver = null;

    @Test
    public void doTest() {
        driver.get("https://mail.google.com/");
        // sendKeys() and click() return void, so there is nothing to assign
        driver.findElement(By.id("email")).sendKeys(myEmail);
        driver.findElement(By.id("pass")).sendKeys(pass);
        driver.findElement(By.id("signIn")).click();
        driver.quit();
    }

    @Before
    public void startSelendroidServer() throws Exception {
        if (selendroidServer != null) {
            selendroidServer.stopSelendroid();
        }
        SelendroidConfiguration config = new SelendroidConfiguration();
        selendroidServer = new SelendroidLauncher(config);
        selendroidServer.launchSelendroid();
        DesiredCapabilities caps = SelendroidCapabilities.android();
        driver = new SelendroidDriver(caps);
    }

    @After
    public void stopSelendroidServer() {
        if (driver != null) {
            driver.quit();
        }
        if (selendroidServer != null) {
            selendroidServer.stopSelendroid();
        }
    }
}
What you have to add to your project
It seems that you have to add the Selendroid standalone jar file to your project. If you are unsure how to add an external jar to an Android project, see this question: How can I use external JARs in an Android project?
Here you can download the jar file: jar file
It also seems that adding the standalone jar alone is not enough. You should also add the selendroid-client jar file that matches your version of the standalone jar.
You can download it from here: client jar file
I hope this is helpful!

I would suggest you use WebdriverIO, since you want to work with JavaScript.
It runs on Node.js, so it will be easy to require other modules to scrape the HTML.
Appium is also an alternative, but it's more focused on front-end testing.

Related

View index.html found in the assets folder using NanoHttpd server embedded within my app

I have an app (Let's call it the Main App) which has an index.html page in the assets folder. The index.html is a simple HTML file with some JavaScript. It doesn't need PHP or MySql.
What I'm trying to do is embed the NanoHttpd server within my Main app, and automatically start the Nano server when the app starts or resumes, and view my index.html file within my app. While I know I can use the
webView.loadUrl("file:///android_asset/index.html");
to access the index.html file, it is impossible to do that for this scenario. Hence the need to use a webserver.
Right now I have a different, dedicated app acting as a web server, which runs at http://localhost:8080. Once it is running, the Main App works as soon as I open it. As you can see, in order to view the HTML file you need to launch the web server app, start the server, then go back to the Main App and start that. I want a solution where the NanoHttpd server starts automatically when I launch my Main App and shows the index.html contents in the WebView. Here is my code, which works perfectly with the Main App and the separate web server:
WebView wv;
wv = (WebView) findViewById(R.id.webView1);
WebSettings webSettings = wv.getSettings();
webSettings.setJavaScriptEnabled(true);
wv.loadUrl("http://localhost:8080/index.html");
While I couldn't find exact documentation for this, I tried two different approaches from the following links:
Using NanoHTTPD in Android
http://programminglife.io/android-http-server-with-nanohttpd/
Neither worked. One just shows a white page, while the other just gives me a response saying that the server is running.
So how can I automatically start the NanoHttpd server when my app starts, and automatically load an HTML file from my assets folder into a WebView?
If that's too much to ask of NanoHttpd, is there another way to embed a web server in an app and load the index.html?
Given that your HTML files are in your assets folder, following this hierarchy:
+ src
++ main
+++ assets
+++ java
+++ res
You can use the following method to open a local file without security restriction:
webView.loadUrl("file:///android_asset/index.html");
You should also enable JavaScript (and DOM storage) before loading the content:
webView.getSettings().setDomStorageEnabled(true);
webView.getSettings().setJavaScriptEnabled(true);
Additionally, you can adjust other settings to fit your needs, for example:
webView.getSettings().setBuiltInZoomControls(true);
webView.getSettings().setSupportZoom(true);
webView.getSettings().setDefaultTextEncodingName("utf-8"); // support international chars
webView.getSettings().setUserAgentString("myVeryOwnUserAgent"); // personalize UA
And so on.
Assuming there is really only a single HTML file, the following works in plain Java. Try running it on your computer:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import fi.iki.elonen.NanoHTTPD;

public class Server extends NanoHTTPD {
    // The page is read once, when the server object is created
    final private String indexString = readFile("in/index.html");

    public static void main(String[] args) throws IOException {
        Server server = new Server();
        server.start();
        // Keep the example process alive
        while (true) {}
    }

    public Server() throws IOException {
        super(8080);
    }

    @Override
    public Response serve(IHTTPSession session) {
        // Every request gets the same page
        return newFixedLengthResponse(indexString);
    }

    private static String readFile(String path) throws IOException {
        BufferedReader br = new BufferedReader(new FileReader(path));
        try {
            StringBuilder sb = new StringBuilder();
            String line = br.readLine();
            while (line != null) {
                sb.append(line);
                sb.append("\n");
                line = br.readLine();
            }
            return sb.toString();
        } finally {
            br.close();
        }
    }
}
The main method is an example usage. To make it work on your application, start the server as exemplified, then use webView.loadUrl("http://localhost:8080"); instead of the infinite loop (which is there to ensure the example application doesn't quit early).
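One caveat when moving the example above into an Android app: files under assets are not reachable through a filesystem path, so a FileReader on "in/index.html" would not find them there. AssetManager.open returns an InputStream instead. A minimal sketch of a stream-based replacement for readFile follows. The class name StreamText and the method readAll are hypothetical, and UTF-8 encoding is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of NanoHTTPD or the Android SDK.
final class StreamText {
    private StreamText() {}

    // Reads the whole stream as UTF-8 text (an assumption) and closes it afterwards.
    static String readAll(InputStream in) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
            return sb.toString();
        }
    }
}
```

Inside an Activity, the page could then presumably be loaded with StreamText.readAll(getAssets().open("index.html")) and handed to the server before it is started.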

Xamarin Android CookieManager doesn't store all cookies

I am using an Android WebView in my Xamarin project to perform third-party authentication. Once the login is successful I need to extract the authentication cookies. I store these cookies in persistent storage and then pass them along with subsequent requests.
For example:
Android App >(opens) webview > Loads (idp provider) url > User provides credentials and saml request is sent to my backend server > backend server validates saml and returns authentication cookies.
It returns two cookies.
Everything works up to this point. In the OnPageFinished method of the WebView's WebViewClient I try to extract the cookies as follows:
public override void OnPageFinished(WebView view, string url)
{
base.OnPageFinished(view, url);
var handler = OnPageCompleted;
var uri = new Uri(url);
AllowCookies(view);
var cookies = CookieManager.Instance.GetCookie(url);
var onPageCompletedEventArgs = new OnPageCompletedEventArgs { Cookies = cookies, Url = uri.AbsolutePath, RelativeUrl = uri.PathAndQuery, Host = uri.Host };
handler?.Invoke(this, onPageCompletedEventArgs);
}
private void AllowCookies(WebView view)
{
CookieManager.Instance.Flush();
CookieManager.AllowFileSchemeCookies();
CookieManager.SetAcceptFileSchemeCookies(true);
CookieManager.Instance.AcceptCookie();
CookieManager.Instance.AcceptThirdPartyCookies(view);
CookieManager.Instance.SetAcceptCookie(true);
CookieManager.Instance.SetAcceptThirdPartyCookies(view, true);
}
The problem is that I am able to get just one cookie (wc_cookie_ps_ck); I am unable to see the other authentication cookie (.AspNetCore.Cookies).
Here's how the cookies appear in the browser.
Please note that in Postman and in the Chrome browser both cookies appear.
In the Android WebView, however, the cookie named ".AspNetCore.Cookies" does not appear at all.
According to the Java documentation, "When retrieving cookies from the cookie store, CookieManager also enforces the path-match rule from section 3.3.4 of RFC 2965 . So, a cookie must also have its “path” attribute set so that the path-match rule can be applied before the cookie is retrieved from the cookie store."
Since my two cookies have different paths, is that the reason the one with its path set to "/project" is not appearing?
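To make the path question concrete, here is a minimal sketch of a cookie path-match check. It follows the wording of RFC 6265 section 5.1.4, the successor of the RFC 2965 rule quoted above, and is not the code that any CookieManager actually runs. The class and method names are hypothetical, and the request paths are assumptions chosen for illustration.

```java
// Hypothetical helper illustrating the cookie path-match rule (RFC 6265, section 5.1.4).
final class CookiePaths {
    private CookiePaths() {}

    // Returns true when a cookie stored with cookiePath may be sent for requestPath.
    static boolean pathMatches(String requestPath, String cookiePath) {
        if (requestPath.equals(cookiePath)) {
            return true;
        }
        if (!requestPath.startsWith(cookiePath)) {
            return false;
        }
        // A prefix match only counts on a path-segment boundary.
        return cookiePath.endsWith("/")
                || requestPath.charAt(cookiePath.length()) == '/';
    }
}
```

Under this rule a cookie with path "/project" matches a request for "/project/login" but not one for "/" or "/projects", so whether it is returned depends on the URL passed to GetCookie.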
After days and days of searching, I have finally found an answer.
I remote-debugged the WebView with desktop Chrome and found that all the cookies I needed were present in the WebView.
However, the method
var cookies = CookieManager.Instance.GetCookie(url);
doesn't return the cookie that has the SameSite attribute set.
This looks like a bug in Xamarin.Android. I have raised an issue on the Xamarin.Android GitHub repository, including the steps to reproduce.
For me, the workaround was to turn off the SameSite attribute for that cookie in my ASP.NET Core back-end project, as follows.
In order to configure the application cookie when using Identity, you can use the ConfigureApplicationCookie method inside your Startup’s ConfigureServices:
// add identity
services.AddIdentity<ApplicationUser, IdentityRole>();
// configure the application cookie
services.ConfigureApplicationCookie(options =>
{
options.Cookie.SameSite = SameSiteMode.None;
});
Link for the above solution mentioned. Here.

Unable to find exact Class name for finding Comments of a URL using jsoup

I am working on Android and using Jsoup for crawling some data from the internet. I am unable to find the exact class name under which the comments live for the URL in the code below. I tried disqus_thread, dsq-content, ul-dsq-comments and dsq-comment-body, taken from the page source of the URL, but none of them returned the comments.
public static void main(String[] args) {
Document d;
Elements lin = null;
String url = "http://blogs.tribune.com.pk/story/39090/i-hate-materialistic-people-beta-but-i-love-my-designer-clothes/";
try {
d = Jsoup.connect(url).timeout(20*1000).userAgent("Chrome").get();
lin = d.getElementsByClass("dsq-comment-body");
System.out.println(lin);
} catch (IOException e) {
e.printStackTrace();
}
int i=0;
for(Element l :lin){
System.out.println(""+i+ " : " +l.text());
i++;
}
}
That's because the HTML that makes up the comments is generated dynamically after the page has been loaded, using Javascript. When the page is loaded the comment HTML doesn't exist, so Jsoup cannot retrieve it.
To get hold of the comments you have 3 options:
1) Use a web crawler that can execute JavaScript. Selenium WebDriver (http://www.seleniumhq.org/projects/webdriver/) and PhantomJS (http://phantomjs.org/) are popular options here. The former works by hooking into a browser implementation (e.g. Mozilla Firefox) and opening the browser programmatically. The latter does not open a browser window; it executes the JavaScript headlessly using WebKit instead.
2) Intercept the network traffic when opening the site (here you can probably use your browser's built-in network tab) and find the request that fetches the comments. Make this request yourself and extract the relevant data to your application. Bear in mind that this will not work if the server serving the comments requires some kind of authentication.
3) If the comments are served by a specialized provider with an openly accessible API, then it might be possible to extract them through this API. The site you linked to uses Disqus to handle the comment section so it might be possible to hook into their API and fetch them this way.

FTP via PhoneGap

I would like to develop an FTP file browser using PhoneGap.
Essentially, I would like the user to have a list of sites they can connect to, and to get the whole root directory for viewing and for changing file names.
I can do this with C# and .NET, but I have no knowledge of how to achieve it with PhoneGap.
Are there specific libraries I could use?
Will I have to develop everything from scratch?
Is it possible to mix native code with PhoneGap?
What kind of security should I be aiming for here?
If you could answer one or all of these questions, that would be greatly appreciated!
A very easy way to achieve what you want on Android is to fall back to native Java code and use the Apache FTPClient class.
There is a very good, easy-to-read plugin here. You can get it and take a look at how it works. Basically it uses the FTPClient class and has two built-in methods, one to upload a file to the server and one to download a file from it.
It is just as easy to extend the execute method to perform other actions, such as rename and delete:
Java code:
public boolean execute (...) {
    ...
    if (action.equals("get")) {
        get(filename, url);
    }
    else if (action.equals("put")) {
        put(filename, url);
    }
    else if (action.equals("delete")) {
        delete(filename, url);
    }
    else if (action.equals("rename")) {
        rename(filename, url);
    }
    ...
}

private void delete(String filename, URL url) throws IOException {
    FTPClient f = setup(url);
    f.deleteFile(extractFileName(url));
    teardown(f);
}

private void rename(String newFilename, URL url) throws IOException {
    FTPClient f = setup(url);
    // No stream is opened for a rename, so there is nothing to flush or close
    f.rename(extractFileName(url), newFilename);
    teardown(f);
}
Then add these methods on the JavaScript layer as well, passing the action names that execute checks for:
FtpClient.prototype.delete = function(url, successCallback, errorCallback) {
    return exec(successCallback, errorCallback, "FtpClient", "delete", [" ", url]);
};
FtpClient.prototype.rename = function(newFilename, url, successCallback, errorCallback) {
    return exec(successCallback, errorCallback, "FtpClient", "rename", [newFilename, url]);
};
If you need instructions on how to use a PhoneGap plugin, there is a good guide here. Basically you need to do the following:
Write your Java code
Write your JavaScript code and call the native layer using the exec method
Add your plugin to res/xml/config
For the plugin I posted above specifically, the GitHub readme has instructions on how to install it.

How to determine URL has valid domain or not or it's a valid url?

In Android's native messaging app, whenever we send someone a message whose text contains a URL, Android recognizes it and underlines it as a link. It does this for many domains, such as .us, .uk, .dk, .ch and all other valid ones.
Even if we send jhjh.us without 'www' or 'http', it recognizes it as a link,
and if the domain is wrong it doesn't do anything.
I want the same behaviour. I tried using the pattern
"(((https?|ftp|file)://)|(www\\.))" + "[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]"
It does a good job, but it did not help with the domain. I also tried URLUtil.isValidUrl(), but to no avail.
Can anyone give me some idea regarding this?
You can try this
public boolean isURL(String url)
{
try {
new URL(url);
return true;
} catch (MalformedURLException e) {
return false;
}
}
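Note that new URL(...) throws MalformedURLException for input without a scheme, so the check above rejects jhjh.us. A minimal sketch of a more lenient check follows, assuming that a missing scheme should default to http and that a host of the form name.tld with an alphabetic TLD is good enough. The class and method names are hypothetical, and the check is purely syntactic: it does not consult the real list of top-level domains.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper: a rough syntactic check, not a lookup against the real list of TLDs.
final class LinkCheck {
    private LinkCheck() {}

    static boolean looksLikeWebLink(String text) {
        String candidate = text.trim();
        // Assumption: treat input without a scheme, e.g. "jhjh.us", as http.
        if (!candidate.matches("(?i)^(https?|ftp)://.*")) {
            candidate = "http://" + candidate;
        }
        try {
            String host = new URI(candidate).getHost();
            // Require at least "name.tld" with an alphabetic TLD of two or more letters.
            return host != null
                    && host.matches("(?i)([a-z0-9]([a-z0-9-]*[a-z0-9])?\\.)+[a-z]{2,}");
        } catch (URISyntaxException e) {
            // Spaces and other illegal characters end up here.
            return false;
        }
    }
}
```

With this sketch, looksLikeWebLink("jhjh.us") is accepted while "hello" and "http://Test Link!" are rejected. Rejecting made-up TLDs would additionally need a list of valid ones. On Android, android.util.Patterns.WEB_URL may be worth a look for the same purpose.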
You can use UrlValidator to validate the URL.
Assuming you are using the UrlValidator class from Apache Commons Validator:
UrlValidator urlValidator = new UrlValidator();
urlValidator.isValid("http://Test Link!");
There are several properties that you can set to control how this class behaves. By default, http, https, and ftp are accepted.
