I'm downloading a website's source code using HttpClient, and then I want to extract some data using regular expressions. Unfortunately, the website is encoded in ISO-8859-1, which seems to be causing problems. Here's the sample code that downloads the page:
HttpGet query = new HttpGet(url);
HttpResponse queryResponse = httpClient.execute(query);
String queryText = EntityUtils.toString(queryResponse.getEntity()).replaceAll("\r", " ").replaceAll("\n", " ");
And then the expression:
Pattern pattern = Pattern.compile("<p class=\"qt\">(.*?)</p>");
Matcher matcher = pattern.matcher(queryText);
while (matcher.find()) // do something
The problem is that it misses some occurrences when the text contains special ISO-8859-1 characters. (.*?) doesn't seem to match them. What is the cause of this problem? How do I fix it?
Are you sure this has to do with "special ISO-8859-1 characters" and not newlines? The dot (.) does not match line terminators by default. You can use the DOTALL flag to make it match line terminators as well, e.g.:
Pattern pattern = Pattern.compile("<p class=\"qt\">(.*?)</p>", Pattern.DOTALL);
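To make the difference visible, here is a minimal sketch using only java.util.regex. The class name DotAllDemo and the helper extract are hypothetical and exist only for this illustration; the pattern is the one from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllDemo {
    // Collects the text of every <p class="qt"> element found in html.
    // Pass 0 for the default behaviour or Pattern.DOTALL to let '.' match line terminators.
    static List<String> extract(String html, int flags) {
        Pattern pattern = Pattern.compile("<p class=\"qt\">(.*?)</p>", flags);
        Matcher matcher = pattern.matcher(html);
        List<String> found = new ArrayList<>();
        while (matcher.find()) {
            found.add(matcher.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<p class=\"qt\">caf\u00e9</p><p class=\"qt\">two\nlines</p>";
        System.out.println(extract(html, 0).size());              // second paragraph is skipped
        System.out.println(extract(html, Pattern.DOTALL).size()); // both paragraphs are found
    }
}
```

Note that the Latin-1 character in the first paragraph matches either way; only the embedded newline changes the result.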
I have this string which I'm trying to format:
String url = "http://api/doSomething.json?params%5Bemail%5D=%s";
String.format(url, email);
The idea is that it ends up looking like this:
http://api/doSomething.json?params[email]=aValue;
I'm currently getting a MissingFormatArgumentException (Format specifier: 5D).
Has anyone had issues with this before?
String.format() treats %5D as a format specifier, and D is not a valid conversion. If it were meant as a placeholder, it would have to be %5d.
Reference: http://developer.android.com/reference/java/util/Formatter.html
Anyway, it seems you just want the square brackets.
Therefore, change this
String url = "http://api/doSomething.json?params%5Bemail%5D=%s"
to
String url = "http://api/doSomething.json?params[email]=%s"
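If the percent-encoded brackets need to stay in the template, another option is to escape each literal percent sign as %%, which java.util.Formatter emits as a single %. A minimal sketch follows; the class name FormatEscapeDemo and the method buildUrl are hypothetical.

```java
class FormatEscapeDemo {
    // "%%" produces a literal '%', so %5B and %5D survive formatting
    // and only the trailing %s is treated as a placeholder.
    static String buildUrl(String email) {
        return String.format("http://api/doSomething.json?params%%5Bemail%%5D=%s", email);
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("aValue"));
    }
}
```

The value passed for %s is inserted as-is here, so it still needs its own URL encoding if it can contain reserved characters.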
In the end I was able to resolve this using URLEncoder.
This post was particularly helpful: URL encoding in Android
String queryPart = String.format(PARAM_STRING,
email);
return baseUrl + URLEncoder.encode(queryPart, "utf-8");
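PARAM_STRING is not shown above, so the following is only a sketch under the assumption that the parameter name is params[email]. It encodes the name and the value separately so that the = between them stays literal. The class name QueryBuilder and the method emailQuery are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class QueryBuilder {
    // Encodes name and value independently and joins them with a literal '='.
    static String emailQuery(String email) {
        try {
            String name = URLEncoder.encode("params[email]", "UTF-8"); // assumed parameter name
            String value = URLEncoder.encode(email, "UTF-8");
            return name + "=" + value;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("http://api/doSomething.json?" + emailQuery("a.b@example.com"));
    }
}
```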
I have a URL like http://ashok-reddy:8080/hyd, which contains a hyphen. When making an HTTP POST request, I get an IllegalArgumentException saying "Host name may not be null". I have tried replacing the hyphen with its hexadecimal value and also tried converting the URL using URLEncoder/Uri.encode(), but nothing has worked so far.
mHttpPost = new HttpPost("ashok-reddy:8080/hyd");
mEnvelope = new SoapSerializationEnvelope(SoapEnvelope.VER11);
mEnvelope.encodingStyle = SoapSerializationEnvelope.ENC;
mStringEntity = new StringEntity(soapData, HTTP.UTF_8);
mStringEntity.setContentType(mContext.getString(R.string.text_xml_content));
mHttpPost.setEntity(mStringEntity);
mHttpResponse = mHttpClient.execute(mHttpPost);
Can anyone please help on this?
Thanks in advance.
Arindam
If you are using a method like URLEncoder, you should not pass the full URL, because it will escape even the :// separator. For example, it will encode :// into %3A%2F%2F.
Pass only the parameters that need their special characters escaped.
EDIT:
As far as I can see, you are using: mHttpPost = new HttpPost("ashok-reddy:8080/hyd");
instead of: mHttpPost = new HttpPost("http://ashok-reddy:8080/hyd");
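The effect of the missing scheme can be checked with plain java.net.URI, without any HTTP client involved. This is only a sketch; the class name HostCheck and the method hostOf are hypothetical. Without "http://", the text before the first colon is parsed as the scheme, and no host is found.

```java
import java.net.URI;
import java.net.URISyntaxException;

class HostCheck {
    // Returns the host component that java.net.URI extracts, or null if there is none.
    static String hostOf(String url) {
        try {
            return new URI(url).getHost();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hostOf("ashok-reddy:8080/hyd"));        // no scheme given
        System.out.println(hostOf("http://ashok-reddy:8080/hyd")); // scheme given
    }
}
```

The hyphen itself is a legal hostname character, so it does not need to be encoded.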
I am using this HTTP client: http://loopj.com/android-async-http/
I am fetching JSON with it.
I want to set the character encoding of this client. The JSONObject that the client returns contains Turkish characters such as şğöü, but they are corrupted and I can't display them.
How can I set the character encoding of this client?
Ideally, the server declares the encoding of the returned page.
If it does, the response will be decoded with the correct one.
But if it doesn't provide the encoding, Async-http seems to assume UTF-8, and looking at the code, it doesn't seem to support providing an alternative default.
Relevant code in AsyncHttpResponseHandler :
// Interface to AsyncHttpRequest
void sendResponseMessage(HttpResponse response) {
...
responseBody = EntityUtils.toString(entity, "UTF-8");
If you want to do that, you will need to use your own version of AsyncHttpResponseHandler or suggest a patch that allows a default encoding to be specified.
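The idea of honoring a server-declared charset with a configurable fallback can be sketched with plain JDK classes. This is not loopj code: CharsetPicker, fromContentType and decode are hypothetical names, and the Content-Type parsing is deliberately simple.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class CharsetPicker {
    // Picks the charset named in a Content-Type header value, or the fallback
    // when the header is missing, has no charset parameter, or names an unknown charset.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).replace("\"", "").trim();
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // illegal or unsupported charset name
                }
            }
        }
        return fallback;
    }

    // Decodes the raw response bytes with the selected charset.
    static String decode(byte[] body, String contentType, Charset fallback) {
        return new String(body, fromContentType(contentType, fallback));
    }

    public static void main(String[] args) {
        byte[] body = "\u015f\u011f\u00f6\u00fc".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(body, "application/json", StandardCharsets.UTF_8));
    }
}
```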
I resolved this problem by modifying the loopj source file "AsyncHttpResponseHandler.java":
void sendResponseMessage(HttpResponse response){
.........
//responseBody = EntityUtils.toString(entity, "UTF-8");
responseBody = EntityUtils.toString(entity, "ISO-8859-1");
}
In my case, decoding as ISO-8859-1 gave the correct characters.
I am having a curious problem that perhaps someone has insight into. I encode a query string into a URL on Android using the following code:
request = REQUEST_BASE + "?action=loadauthor&author=" + URLEncoder.encode(author, "UTF-8");
I then add a few other parameters to the string and create a URI like this:
uri = new URI(request);
At a certain point, I pull out the query string to make a checksum:
uri.getRawQuery().getBytes();
Then I send it on its way with:
HttpGet get = new HttpGet(uri);
On the Appengine server, I then retrieve the string and try to match the checksum:
String query = req.getQueryString();
Normally, this works fine. However, there are a few characters that seem to get unencoded on the way to the server. For example,
action=loadauthor&author=Charles+Alexander+%28Ohiyesa%29+Eastman&timestamp=1343261225838&user=1479845600
shows up in the server logs (and in the GAE app) as:
action=loadauthor&author=Charles+Alexander+(Ohiyesa)+Eastman&timestamp=1343261226837&user=1479845600
This only happens to a few characters (like parentheses). Other characters remain encoded all the way through. Does anyone have a thought about what I might be doing wrong? Any feedback is appreciated.
I never did find a solution for this problem. I worked around it by unencoding certain characters on the client before sending things to the server:
request = request.replace("%28", "(");
request = request.replace("%29", ")");
request = request.replace("%27", "'");
If anyone has a better solution, I am sure that I (and others) would be interested!
URLEncoder does not encode parentheses and certain other characters, as they are supposed to be "safe" for most servers. See URLEncoder. You will have to replace these yourself if necessary.
Example:
URI uri = new URI(request.replace("(","%28"));
If a lot of replacements are needed, you can try request.replaceAll(String regularExpression, String replacement). This, of course, requires knowledge of regular expressions.
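When several characters need the same treatment, one pass with a regular expression can replace the chain of replace() calls. The sketch below percent-encodes parentheses and apostrophes wherever they appear unencoded in a string. ParenEscaper and percentEncodeParens are hypothetical names, and the character set [()'] is an assumption taken from the characters mentioned in this thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParenEscaper {
    // Characters to percent-encode: '(' , ')' and the apostrophe.
    private static final Pattern TARGETS = Pattern.compile("[()']");

    static String percentEncodeParens(String s) {
        Matcher m = TARGETS.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // "%%" yields a literal '%', "%02X" the two-digit uppercase hex code.
            String hex = String.format("%%%02X", (int) m.group().charAt(0));
            m.appendReplacement(out, Matcher.quoteReplacement(hex));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(percentEncodeParens("author=Charles+Alexander+(Ohiyesa)+Eastman"));
    }
}
```

Whichever direction is chosen, the checksum only matches if the client and the server see the same form of the query string.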
When I display bullet points, copyright symbols, and trademark signs in a web browser, they look fine.
// bullets: http://losangeles.craigslist.org/wst/acc/2900906683.html
// bullets: http://losangeles.craigslist.org/lac/acc/2902059059.html
// bullets: http://indianapolis.craigslist.org/acc/2867115357.html
// bullets: http://indianapolis.craigslist.org/ofc/2885697780.html
// bullets: http://indianapolis.craigslist.org/ofc/2887554512.html
// copyright: http://chicago.craigslist.org/nwc/acc/2854640931.html
But I get "question marks inside triangles" when I use an Android WebView with:
web.loadDataWithBaseURL(null, myHtml, null, "UTF-8", null);
Should I be using a different encoding?
Should I be searching/replacing certain characters myself... 1-by-1?
Try setting the default text encoding through the WebView settings:
myWebView = (WebView)findViewById(R.id.mywebView);
WebSettings settings = myWebView.getSettings();
settings.setDefaultTextEncodingName("UTF-8");
I've run into this problem before. I would make sure that your myHtml String already has good encoding before you load it into your WebView. You can check that by logging it using Log.d(). If the encoding is wrong in that String, then it won't show properly in the WebView either. You'll see those weird characters in LogCat.
If that is the case, you'll want to make sure that when you're reading the data into your myHtml String, that you use something like an InputStreamReader and pass it "UTF-8" as the character encoding.
I would change the line of code that you're using from:
BufferedReader buffer = new BufferedReader(new InputStreamReader(content), 1000);
to:
BufferedReader buffer = new BufferedReader(new InputStreamReader(content, "UTF-8"), 1000);
This version of the constructor is documented as follows:
Constructs a new InputStreamReader on the InputStream in. The character converter that is used to decode bytes into characters is identified by name by enc. If the encoding cannot be found, an UnsupportedEncodingException error is thrown.
See the second constructor listed at http://developer.android.com/reference/java/io/InputStreamReader.html.
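For completeness, here is a minimal sketch of reading a whole stream into a String with an explicit UTF-8 decoder. It uses the Charset overload of the constructor, which avoids the checked UnsupportedEncodingException. Utf8Reader and readAll are hypothetical names, and content stands for whatever InputStream the response provides.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

class Utf8Reader {
    // Reads the stream to the end, decoding the bytes as UTF-8.
    static String readAll(InputStream content) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader buffer = new BufferedReader(
                new InputStreamReader(content, StandardCharsets.UTF_8), 1000)) {
            char[] chunk = new char[1024];
            int n;
            while ((n = buffer.read(chunk)) != -1) {
                sb.append(chunk, 0, n);
            }
        }
        return sb.toString();
    }
}
```

This only helps if the page really is served as UTF-8; if the server sends a different encoding, that charset has to be passed instead.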
EDIT: If that doesn't work, you could try using:
String s = EntityUtils.toString(entity, HTTP.UTF_8);
which is from Android Java UTF-8 HttpClient Problem