I am writing a simple program to capture image resources from a web page. The image items in the HTML look like:
CASE1:<img src="http://www.aaa.com/bbb.jpg" alt="title bbb" width="350" height="385"/>
or
CASE2:<img alt="title ccc" src="http://www.ddd.com/bbb.jpg" width="123" height="456"/>
I know how to handle either case separately; take the first one, for example:
String CAPTURE = "<img(?:.*)src=\"http://(.*)\\.jpg\"(?:.*)alt=\"(.*?)\"(?:.*)/>";
DefaultHttpClient client = new DefaultHttpClient();
BasicHttpContext context = new BasicHttpContext();
Scanner scanner = new Scanner(client
.execute(new HttpGet(uri), context)
.getEntity().getContent());
Pattern pattern = Pattern.compile(CAPTURE);
while (scanner.findWithinHorizon(pattern, 0) != null) {
MatchResult r = scanner.match();
String imageUrl = "http://" +r.group(1)+".jpg";
String imageTitle = r.group(2);
//Do something with the image
}
The question is: how do I write a correct pattern that gets all the image items from page source containing both CASE1 and CASE2? I only want to scan the page once.
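One idea I'm considering is a pattern built from lookaheads so the attribute order doesn't matter; a rough sketch (assuming each tag sits on one line, uses double quotes, and ends with "/>"):
String CAPTURE = "<img(?=[^>]*src=\"http://([^\"]+)\\.jpg\")(?=[^>]*alt=\"([^\"]*)\")[^>]*/>";
// group(1) is the URL body and group(2) the alt text, exactly as in the loop above
But regex over HTML is brittle, so maybe there is a better way.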
Use jsoup
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
...
Document doc;
String userAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0";
try {
// need http protocol
doc = Jsoup.connect("http://domain.tld/images.html").userAgent(userAgent).get();
// get all images
Elements images = doc.select("img");
for (Element image: images) {
// get the values from img attribute (src & alt)
System.out.println("\nImage: " + image.attr("src"));
System.out.println("Alt : " + image.attr("alt"));
}
} catch (IOException e) {
e.printStackTrace();
}
Jsoup is an HTML parser; its “jquery-like” and “regex” selector syntax is very easy to use and flexible enough to get whatever you want.
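As a small illustration of that flexibility, the selector can also express the .jpg/alt filter from the original regex (a sketch; the .jpg restriction is just carried over from the question):
// only .jpg images that carry an alt attribute
Elements jpgImages = doc.select("img[src$=.jpg][alt]");
for (Element img : jpgImages) {
    String imageUrl = img.absUrl("src"); // absUrl() resolves relative URLs against the page URL
    String imageTitle = img.attr("alt");
}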
Related
I am trying to load an SVF file in the Autodesk Forge viewer locally in Xamarin.Android. I copied the content to my project's Assets/html folder. My code to load the content looks like this.
In MyWebViewClient.cs
public WebResourceResponse ShouldInterceptRequest(WebView webView, IWebResourceRequest request)
{
try
{
Android.Net.Uri url = request.Url;
//Uri uri = url;
String path = url.Path;
if (path.StartsWith("/android_asset/"))
{
try
{
AssetManager assetManager = this.context.Assets;
// .gz assets are presumably stored renamed as .gz.mp3 so the build does not compress them
String relPath = path.Replace("/android_asset/", "").Replace("gz", "gz.mp3");
//InputStream stream = assetManager.Open(relPath);
return new WebResourceResponse(null, null, assetManager.Open(relPath));
}
catch (IOException ex)
{
String str = ex.Message;
}
}
}
catch (Exception ex) { }
return null;
}
Then in my Activity.cs
SetContentView(Resource.Layout.webview);
var wbMain = FindViewById<WebView>(Resource.Id.webView1);
wbMain.Settings.DomStorageEnabled = true;
wbMain.Settings.JavaScriptEnabled = true;
wbMain.Settings.AllowFileAccessFromFileURLs = true;
wbMain.Settings.AllowUniversalAccessFromFileURLs = true;
var customWebViewClient = new MyWebViewClient(BaseContext);
customWebViewClient.OnPageLoaded += MyWebViewClient_OnPageLoaded;
wbMain.SetWebViewClient(customWebViewClient);
wbMain.LoadUrl("file:///android_asset/html/index.html");
This only loads the side views, not the main viewer.
What's the reason for this, and how can I resolve it?
Please note that the sample is a bit outdated and there have been some changes in our legal terms since then. Currently, the legal T&C state that all viewer assets (JS, CSS, icons, images, etc.) must be coming from the Autodesk domain.
If you need to be able to run your viewer-based app in "temporarily offline" scenarios (for example, on a construction site), I'd suggest that you look at the following blog post: https://forge.autodesk.com/blog/disconnected-workflows. This approach (using Service Workers and Cache API) is consistent with the legal requirements.
I am trying to get images from Google:
String url = "https://www.google.com/search?site=imghp&tbm=isch&source=hp&q=audi&gws_rd=cr";
org.jsoup.nodes.Document doc = Jsoup.connect(url).get();
Elements elements = doc.select("div.isv-r.PNCib.MSM1fd.BUooTd");
The image data is encoded in base64, so in order to get the actual image URL I first get the data id, which is set as an attribute. This works:
for (Element element : elements) {
    String id = element.attr("data-id");
}
Then I need to make a new connection with url + "#imgrc=" + id:
org.jsoup.nodes.Document imgdoc = Jsoup.connect(url+"#"+id).get();
Now, when I inspect in the browser, my required data is present inside <div jsname="CGzTgf">, so I do the same in Jsoup:
Elements images = imgdoc.select("div[jsname='CGzTgf']");
//futher steps
But images always returns empty, and I am unable to find the error. I do this inside a new thread in Android; any help will be appreciated.
It turns out that, the way you're doing it, you'll be looking in the wrong place entirely. The URLs are contained within a JavaScript <script> tag included in the response.
I've extracted and filtered for the relevant <script> tags (those containing the nonce attribute).
I then filter those tags for the one containing a specific function name AND a generic search string I'm expecting to find (something that won't be in the other <script> tags).
Next, the value obtained needs to be stripped down to the JSON object, which contains about a hundred thousand arrays. I've then navigated this (manually) to pull out a subset of nodes containing the relevant URL entries, and filtered again into a List<String> of the full URLs.
Finally, I've reused some code from an earlier solution here: https://stackoverflow.com/a/63135249/7619034 with something similar to download the images.
You'll also get some console output detailing which URL ended up in which file id. Files are labeled image_[x].jpg regardless of actual format (so you may need to rework it a little; hint: take the file extension from the URL if provided).
import com.jayway.jsonpath.JsonPath;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
public class GoogleImageDownloader {
private static int TIMEOUT = 30000;
private static final int BUFFER_SIZE = 4096;
public static final String RELEVANT_JSON_START = "AF_initDataCallback(";
public static final String PARTIAL_GENERIC_SEARCH_QUERY = "/search?q";
public static void main(String[] args) throws IOException {
String url = "https://www.google.com/search?site=imghp&tbm=isch&source=hp&q=audi&gws_rd=cr";
Document doc = Jsoup.connect(url).get();
// Response with relevant data is in a <script> tag
Elements elements = doc.select("script[nonce]");
String jsonDataElement = getRelevantScriptTagContainingUrlDataAsJson(elements);
String jsonData = getJsonData(jsonDataElement);
List<String> imageUrls = getImageUrls(jsonData);
int fileId = 1;
for (String urlEntry : imageUrls) {
try {
writeToFile(fileId, makeImageRequest(urlEntry));
System.out.println(urlEntry + " : " + fileId);
fileId++;
} catch (IOException e) {
e.printStackTrace();
}
}
}
private static String getRelevantScriptTagContainingUrlDataAsJson(Elements elements) {
String jsonDataElement = "";
int count = 0;
for (Element element : elements) {
String jsonData = element.data();
if (jsonData.startsWith(RELEVANT_JSON_START) && jsonData.contains(PARTIAL_GENERIC_SEARCH_QUERY)) {
jsonDataElement = jsonData;
// If there are two items in the list, take the 2nd rather than the first.
if (count == 1) {
break;
}
count++;
}
}
return jsonDataElement;
}
private static String getJsonData(String jsonDataElement) {
String jsonData = jsonDataElement.substring(RELEVANT_JSON_START.length(), jsonDataElement.length() - 2);
return jsonData;
}
private static List<String> getImageUrls(String jsonData) {
// Reason for doing this in two steps is debugging is much faster on the smaller subset of json data
String urlArraysList = JsonPath.read(jsonData, "$.data[31][*][12][2][*]").toString();
List<String> imageUrls = JsonPath.read(urlArraysList, "$.[*][*][3][0]");
return imageUrls;
}
private static void writeToFile(int i, HttpURLConnection response) throws IOException {
// opens input stream from the HTTP connection
InputStream inputStream = response.getInputStream();
// opens an output stream to save into file
FileOutputStream outputStream = new FileOutputStream("image_" + i + ".jpg");
int bytesRead = -1;
byte[] buffer = new byte[BUFFER_SIZE];
while ((bytesRead = inputStream.read(buffer)) != -1) {
outputStream.write(buffer, 0, bytesRead);
}
outputStream.close();
inputStream.close();
System.out.println("File downloaded");
}
// Could use JSoup here but I'm re-using this from an earlier answer
private static HttpURLConnection makeImageRequest(String imageUrlString) throws IOException {
URL imageUrl = new URL(imageUrlString);
HttpURLConnection response = (HttpURLConnection) imageUrl.openConnection();
response.setRequestMethod("GET");
response.setConnectTimeout(TIMEOUT);
response.setReadTimeout(TIMEOUT);
response.connect();
return response;
}
}
Partial Result I tested with:
I've used JsonPath for filtering the relevant nodes, which is good when you only care about a small portion of the JSON and don't want to deserialise the whole object. It follows a navigation style similar to DOM/XPath/jQuery navigation.
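A micro-example of that navigation style, with hypothetical data:
// given {"a":[{"b":1},{"b":2}]}, the path "$.a[*].b" yields [1, 2]
List<Integer> values = JsonPath.read("{\"a\":[{\"b\":1},{\"b\":2}]}", "$.a[*].b");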
Apart from this one library and Jsoup, the libraries used are very bog standard.
Good Luck!
I'm using the android sdk generated by AWS API Gateway to get a pre-signed URL for objects in s3 (lambda behind API gateway).
My s3 bucket looks like this:
* module_a
|\
| * file_a
| * subdir_a
| \
| * file_sa
* module_b
This works perfectly for file_a, but for file_sa it doesn't. At least not when I use the Android SDK; there I get a URL where the slash is replaced with %25252F.
However, when I test the API in the console, I get the correct URL.
Is there anything I can do with the SDK to fix this?
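For what it's worth, %25252F is exactly what you get when an already percent-encoded value is encoded again at each hop: / becomes %2F, and then the % itself becomes %25 on every further pass. A tiny standalone sketch (plain Java, hypothetical values) of the stacking:
import java.net.URLEncoder;

public class EncodingDemo {
    public static void main(String[] args) throws Exception {
        String path = "css/style.css";
        String once = URLEncoder.encode(path, "UTF-8");    // css%2Fstyle.css
        String twice = URLEncoder.encode(once, "UTF-8");   // css%252Fstyle.css
        String thrice = URLEncoder.encode(twice, "UTF-8"); // css%25252Fstyle.css
        System.out.println(once + "\n" + twice + "\n" + thrice);
    }
}
So three encoding passes appear to happen somewhere between the client, API Gateway, and the Lambda.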
Update
Here's the chain of code snippets involved in this problem.
Android code to download file (exception happens in last line)
fileName = "css/style.css"; // file in s3
moduleName = "main"; // folder in s3
[...]
ApiClientFactory factory = new ApiClientFactory().credentialsProvider(
aws.credentialsProvider);
apiClient = factory.build(myAPIClient.class);
Url url_path = apiClient.modulesModuleFileGet(fileName.replace("/", "%2F"), moduleName);
URL url = new URL(url_path.getUrl());
URLConnection connection = url.openConnection();
connection.connect();
InputStream in = new BufferedInputStream(connection.getInputStream());
API Gateway
The api endpoint used above is configured with two path parameters (module name and file name). The body mapping template for the call to lambda looks like this:
#set($inputRoot = $input.path('$'))
{
"module" : "$input.params('module')",
"file": "$input.params('file')"
}
Lambda
from __future__ import print_function
import json
import urllib
import boto3
s3 = boto3.client('s3')
def lambda_handler(event, context):
key = event['module'] + "/" + event['file'].replace("%2F", "/")
url = s3.generate_presigned_url(
"get_object",
Params={'Bucket':"mybucket",
'Key': key},
ExpiresIn=60)
return {"url": url}
I've got it to work after following the comments. However, I still somehow get doubly-encoded slashes. Here's the working code.
Android
public Url modulesModuleFileGet(String fileName, String moduleName) {
try {
String fileNameEnc = URLEncoder.encode(fileName, "UTF-8");
Url ret = getApiClient().modulesModuleFileGet(fileNameEnc, moduleName);
return ret;
} catch (UnsupportedEncodingException e){
Log.e(TAG, "> modulesModuleFileGet(", e);
return null;
}
}
Lambda
def lambda_handler(event, context):
key = event['module'] + "/" + urllib.unquote_plus(urllib.unquote_plus(event['file']))
url = s3.generate_presigned_url(
"get_object",
Params={'Bucket':"my",
'Key': key},
ExpiresIn=60)
return {"url": url}
I'd still welcome further suggestions on how to improve this, but for now it is working. Thanks for the comments pointing me in the right direction.
Usually, after using Google to search for a city, there is part of a Wikipedia page on the right with an image and a map. Can anyone tell me how I can access this image? I'd like to know how to download it.
Actually, the main image (the one that goes with the map image on the right) is very rarely from Wikipedia, so you can't use the Wikipedia API to get it. If you want to access the actual main image, you can use this:
private static void GetGoogleImage(string word)
{
// make an HTTP Get request
var request = (HttpWebRequest)WebRequest.Create("https://www.google.com.pg/search?q=" + word);
request.UserAgent = "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36";
using (var webResponse = (HttpWebResponse)request.GetResponse())
{
using (var reader = new StreamReader(webResponse.GetResponseStream()))
{
// get all images with base64 string
var matches = Regex.Matches(reader.ReadToEnd(), @"'data:image/jpeg;base64,([^,']*)'");
if (matches.Count > 0)
{
// get the image with the max height
var bytes = matches.Cast<Match>()
.Select(x => Convert.FromBase64String(x.Groups[1].Value.Replace("\\75", "=").Replace("\\075", "=")))
.OrderBy(x => Image.FromStream(new MemoryStream(x, false)).Height).Last();
// save the image as 'image.jpg'
using (var imageFile = new FileStream("image.jpg", FileMode.Create))
{
imageFile.Write(bytes, 0, bytes.Length);
imageFile.Flush();
}
}
}
}
}
This works for me and always returns the actual main image (if one exists). For example, GetGoogleImage("New York") gives me data:image/jpeg;base64,/9j/4AAQSkZJRg....
I use the fact that, of all the base64 images in the response, the main one has the max height, so I only need to order them by height and select the last one. If required, you can also check here for a minimum image height. Replacing \075 with = is needed for base64 padding.
If you want the Wikipedia article's main image, you have to use the Wikipedia API.
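For the article image, the MediaWiki PageImages endpoint can return a thumbnail URL directly. A sketch in Java (endpoint and parameters taken from the public MediaWiki documentation; Jsoup is used here only as a convenient HTTP client):
String api = "https://en.wikipedia.org/w/api.php"
        + "?action=query&format=json&prop=pageimages"
        + "&pithumbsize=500&titles=" + URLEncoder.encode("New York", "UTF-8");
String json = Jsoup.connect(api).ignoreContentType(true).execute().body();
// the thumbnail URL sits at query.pages.<pageId>.thumbnail.source in the returned JSON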
Update:
You can use jsoup (Java HTML Parser, org.jsoup:jsoup:1.8.3), which returns the list of images inside the page.
// getHtmlContent(url) is a helper (not shown) that fetches the page HTML as a String
String stringResponse = getHtmlContent(url);
Document doc = Jsoup.parse(stringResponse);
Element content = doc.getElementById("content");
// get all elements with the img tag
Elements img = content.getElementsByTag("img");
for (Element el : img) {
//for each element get the src image url
String src = el.attr("src");
Log.d(TAG, "src attribute is : " + src);
String alt = el.attr("alt");
//do some stuff
}
Update:
Wikipedia provides an API to return HTML content.
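For example (parameters as documented for MediaWiki), the parse action returns the rendered article HTML wrapped in JSON, which could feed the snippet above:
String url = "https://en.wikipedia.org/w/api.php?action=parse&format=json&page=New_York";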
I want to save a whole web page, including its .css and .js files, on Android programmatically.
So far I have tried an HTTP GET, jsoup, and the WebView content, but with all of them I could not save the whole page with its CSS and JS. These methods just save the HTML part of the web page. When I save the whole page, I want to open it offline.
Thanks in advance
You have to take the HTML, parse it, get the URLs of the resources, and then make requests for those URLs too.
public class Stack {
private static final String USER_AGENT = "";
private static final String INITIAL_URL = "";
public static void main(String args[]) throws Exception {
Document doc = Jsoup
.connect(INITIAL_URL)
.userAgent(USER_AGENT)
.get();
Elements scripts = doc.getElementsByTag("script");
Elements css = doc.getElementsByTag("link");
for(Element s : scripts) {
String url = s.absUrl("src");
if(!url.isEmpty()) {
System.out.println(url);
Document docScript = Jsoup
.connect(url)
.userAgent(USER_AGENT)
.ignoreContentType(true)
.get();
System.out.println(docScript);
System.out.println("--------------------------------------------");
}
}
for(Element c : css) {
String url = c.absUrl("href");
String rel = c.attr("rel"); // Jsoup's attr() returns "" (never null) for missing attributes
if(!url.isEmpty() && rel.equals("stylesheet")) {
System.out.println(url);
Document docScript = Jsoup
.connect(url)
.userAgent(USER_AGENT)
.ignoreContentType(true)
.get();
System.out.println(docScript);
System.out.println("--------------------------------------------");
}
}
}
}
I have a similar problem...
Using this code we can get images, .css, and .js files. However, some HTML contents are still missing.
For instance, when we save a web page via Chrome, there are 2 options:
Complete HTML
HTML only
Beyond .css, .js, and .php, "Complete HTML" consists of more elements than "HTML only". The requirement is to download the HTML as completely as Chrome does with the first option.
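The snippet above only fetches scripts and stylesheets, so images are one of the pieces it misses. A minimal sketch of also grabbing them with Jsoup (the output file naming here is hypothetical):
// inside the same main(), after the css loop
Elements images = doc.getElementsByTag("img");
int i = 0;
for (Element img : images) {
    String imageUrl = img.absUrl("src");
    if (imageUrl.isEmpty()) continue;
    // execute() with ignoreContentType(true) lets Jsoup fetch binary data
    byte[] data = Jsoup.connect(imageUrl)
            .userAgent(USER_AGENT)
            .ignoreContentType(true)
            .execute()
            .bodyAsBytes();
    java.nio.file.Files.write(java.nio.file.Paths.get("image_" + (i++) + ".bin"), data);
}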