What I want to do...
I have a WebView in my Android app. I get a large chunk of HTML from the server as a string, and a search string from the user. I break the search string into words and build a regex from them. I want every part of the HTML text that matches the regex to be highlighted when I display it in the WebView.
What I tried...
Since it is HTML, I just want to wrap the words matched by the regex in a pair of tags with a yellow background.
A simple regex with replaceAll on the HTML content. This is wrong because it also replaces matches inside the '<' and '>' of the tags, which breaks the markup.
Matcher and Pattern together. It is difficult to skip what is inside the tags.
The jsoup parser, and it worked!
I traverse the HTML with the NodeTraversor class and use Matcher and Pattern to find the matched words and wrap them in tags.
But it is very slow. I want to use it on Android, and the jsoup library is about 284 kB. I removed some unneeded classes and it is now 201 kB, which is still too much for an Android device. Additionally, the HTML content can be really large. I looked into the jsoup source as well: it iterates over every single character when it parses. I do not know whether all parsers do the same, but it is definitely slow for large HTML documents.
Here is my code:
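import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.DataNode;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;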
public class Highlighter {
private String regex;
private String htmlContent;
Pattern pat;
Matcher mat;
public Highlighter(String searchString, String htmlString) {
regex = buildRegexFromQuery(searchString);
htmlContent = htmlString;
pat = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
}
public String getHighlightedHtml() {
Document doc = Jsoup.parse(htmlContent);
final List<TextNode> nodesToChange = new ArrayList<TextNode>();
NodeTraversor nd = new NodeTraversor(new NodeVisitor() {
@Override
public void tail(Node node, int depth) {
if (node instanceof TextNode) {
TextNode textNode = (TextNode) node;
String text = textNode.getWholeText();
mat = pat.matcher(text);
if(mat.find()) {
nodesToChange.add(textNode);
}
}
}
@Override
public void head(Node node, int depth) {
}
});
nd.traverse(doc.body());
for (TextNode textNode : nodesToChange) {
Node newNode = buildElementForText(textNode);
textNode.replaceWith(newNode);
}
return doc.toString();
}
private static String buildRegexFromQuery(String queryString) {
String regex = "";
String queryToConvert = queryString;
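/* Clean up query: turn punctuation into spaces, then collapse whitespace */
queryToConvert = queryToConvert.replaceAll("\\p{Punct}+", " ");
queryToConvert = queryToConvert.replaceAll("\\s+", " ").trim();
// Split the cleaned query, not the raw one
String[] regexArray = queryToConvert.split(" ");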
regex = "(";
for(int i = 0; i < regexArray.length - 1; i++) {
String item = regexArray[i];
regex += "(\\b)" + item + "(\\b)|";
}
regex += "(\\b)" + regexArray[regexArray.length - 1] + "[a-zA-Z0-9]*?(\\b))";
return regex;
}
private Node buildElementForText(TextNode textNode) {
String text = textNode.getWholeText().trim();
ArrayList<MatchedWord> matchedWordSet = new ArrayList<MatchedWord>();
mat = pat.matcher(text);
while(mat.find()) {
matchedWordSet.add(new MatchedWord(mat.start(), mat.end()));
}
StringBuffer newText = new StringBuffer(text);
for(int i = matchedWordSet.size() - 1; i >= 0; i-- ) {
String wordToReplace = newText.substring(matchedWordSet.get(i).start, matchedWordSet.get(i).end);
wordToReplace = "<b>" + wordToReplace+ "</b>";
newText = newText.replace(matchedWordSet.get(i).start, matchedWordSet.get(i).end, wordToReplace);
}
return new DataNode(newText.toString(), textNode.baseUri());
}
class MatchedWord {
public int start;
public int end;
public MatchedWord(int start, int end) {
this.start = start;
this.end = end;
}
}
}
Here is how I call it -
htmlString = getHtmlFromServer();
Highlighter hl = new Highlighter("Hello World!", htmlString);
String newHtmlString = hl.getHighlightedHtml();
I am sure what I'm doing is not the optimal way, but I can't think of anything else.
I want to
- reduce the time it takes to highlight the content
- reduce the size of the library
Any suggestions?
How about highlighting them using JavaScript?
Everybody loves JavaScript, and you can find examples like this blog.
JTidy and HtmlCleaner are also among the best Java HTML parsers.
See
Comparison between different Java HTML Parser
and
What are the pros and cons of the leading Java HTML parsers?
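If a full parser is too heavy, one lightweight option is to apply the search regex only to the text between tags. The following is a minimal sketch of that idea, not a replacement for a real parser. The class name TagAwareHighlighter and the yellow span markup are illustrative assumptions. It assumes '>' never appears inside attribute values, and it does not treat script or style contents, comments, or matches that span several tags specially.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: highlights regex matches in text segments only,
// copying every tag through unchanged.
class TagAwareHighlighter {
    // Simplified tag matcher: assumes '>' does not occur inside attribute values.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    // $0 is the whole match found by the search pattern.
    private static final String WRAP = "<span style=\"background:yellow\">$0</span>";

    static String highlight(String html, Pattern words) {
        StringBuilder out = new StringBuilder(html.length() + 64);
        Matcher tags = TAG.matcher(html);
        int pos = 0;
        while (tags.find()) {
            // Text between the previous tag and this one gets highlighted.
            out.append(mark(html.substring(pos, tags.start()), words));
            // The tag itself is copied verbatim.
            out.append(tags.group());
            pos = tags.end();
        }
        // Trailing text after the last tag.
        out.append(mark(html.substring(pos), words));
        return out.toString();
    }

    private static String mark(String text, Pattern words) {
        return text.isEmpty() ? text : words.matcher(text).replaceAll(WRAP);
    }
}
```

This makes a single pass over the string and builds no DOM, which may help with both running time and library size, at the cost of the simplifications listed above.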
I want to develop a pattern-searching algorithm for a music system application that searches for a given keyword and plays the music whose text file contains that keyword. There are many pattern-searching algorithms that can do this efficiently (e.g. KMP, or hashing, which may give false matches). My main problem is that the whole database is in a language other than English (Hindi, to be specific). The user enters the keyword in Hindi and I want to search a database that also contains Hindi text. How can I search this database efficiently?
I think we can't use the KMP algorithm for a non-English language, because the ASCII characters we use only cover English letters, digits and some symbols, not the letters of other languages. Please tell me how to proceed, or where my thinking is wrong.
The KMP algorithm does not depend on an alphabet; it only compares characters from the given pattern and text. Moreover, in languages like Java, strings are sequences of UTF-16 code units, so you can use any language you like and the algorithm will work properly. In other languages you may need to choose an encoding explicitly. Here is an example from Ideone that uses KMP with a non-ASCII character set.
KMP algorithm
/* package whatever; // don't place package name! */
import java.util.*;
import java.lang.*;
import java.io.*;
class Ideone {
int[] f;
public void dfa(String pattern) {
int m = pattern.length();
f = new int[m+1];
f[0] = 0;
f[1] = 0;
for(int i=2; i<=m; i++) {
int j = f[i-1];
for(;;) {
if(pattern.charAt(j) == pattern.charAt(i-1)) {
f[i] = j +1;
break;
}
if(j==0) {
f[i] = 0;
break;
}
j = f[j];
}
}
}
public int match(String text, String pattern) {
dfa(pattern);
int n = text.length();
int m = pattern.length();
int i = 0;
int j = 0;
for(;;) {
if(i == n) break;
if(text.charAt(i) == pattern.charAt(j)) {
j++;
i++;
if(j == m) return i; // index just past the end of the first match
}
else if(j > 0) j =f[j];
else i++;
}
return -1;
}
public static void main(String[] args) {
Ideone kmp = new Ideone();
String text = "AĄĘĆABA";
String pattern = "ĄĘĆ";
System.out.println(kmp.match(text, pattern));
}
}
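For the Hindi case specifically, a small sketch may make the point concrete. It uses the built-in String.indexOf rather than KMP, and it normalizes both strings to NFC first, on the assumption that the database and the user's keyboard might encode the same visible text with different code point sequences. The class name HindiSearch and the sample sentence are made up for illustration.

```java
import java.text.Normalizer;

// Hypothetical example: substring search over Devanagari text.
class HindiSearch {
    // Returns the index of keyword within the NFC-normalized text, or -1 if absent.
    static int find(String text, String keyword) {
        // Normalize both sides so equivalent sequences compare equal.
        String t = Normalizer.normalize(text, Normalizer.Form.NFC);
        String k = Normalizer.normalize(keyword, Normalizer.Form.NFC);
        return t.indexOf(k);
    }

    public static void main(String[] args) {
        String text = "मुझे संगीत पसंद है";
        System.out.println(find(text, "संगीत")); // prints 5
    }
}
```

The index counts UTF-16 code units, not visible glyphs, which is why the word after the first four code units and a space starts at 5.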
In my phonebook on my mobile I have all sorts of contacts like :
+(353) 085 123 45 67
00661234567
0871234567
(045)123456
I'm converting them all into E.164 format, which I've largely completed, but the question I need resolved is this:
How can I strip all characters (including spaces) except digits from my string, while keeping the first character if it is '+'?
String phoneNumberOfContact;
So for example the cases above would look like :
+3530851234567
00661234567
0871234567
045123456
Update
To handle + only in the first position, you could do:
boolean startsWithPlus = input.startsWith("+");
String sanitized = input.replaceAll("[^0-9]", "");
if (startsWithPlus) {
sanitized = "+" + sanitized;
}
So basically I'm checking to see if it starts with plus, then stripping out everything but digits, and then re-adding the plus if it was there.
Original
Assuming you only want to keep + or digits, a simple regex will work, and String provides the replaceAll() method to make it even easier.
String sanitized = input.replaceAll("[^+0-9]", "");
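The two steps above can probably also be folded into one regex by using a negative lookahead, so that a '+' is kept only when it is the very first character. This is a sketch under that reading of the requirement; the class and method names are illustrative.

```java
// Hypothetical helper: keeps digits, plus a '+' only at position 0.
class PhoneSanitizer {
    static String sanitize(String input) {
        // (?!^\+) fails only at the start of the string when a '+' follows,
        // so that one '+' survives; every other non-digit (\D) is removed.
        return input.replaceAll("(?!^\\+)\\D", "");
    }

    public static void main(String[] args) {
        System.out.println(sanitize("+(353) 085 123 45 67")); // prints +3530851234567
        System.out.println(sanitize("(045)123456"));          // prints 045123456
    }
}
```

Unlike a charAt(0) check, this also works on an empty string, for which it returns an empty string.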
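This method would do the trick:
public String cleanPhoneDigits(String phonenum) {
StringBuilder builder = new StringBuilder();
// Keep a leading '+', if present
if (phonenum.startsWith("+")) {
builder.append('+');
}
// Start at 0 so a leading digit is not skipped; '+' is not a digit, so it is not appended twice
for (int i = 0; i < phonenum.length(); i++) {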
char c = phonenum.charAt(i);
if (Character.isDigit(c)) {
builder.append(c);
}
}
return builder.toString();
}
I need to parse a string like this:
Small Intestine (T N M - Stage 0)
What I want to keep is everything before the first bracket, but I don't know whether there will be one, two or more words.
How can I do this in Java? What is the correct regexp to use?
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegExpTest
{
public static void main( String args[] ){
// String to be scanned to find the pattern.
String line = "Small Intestine (T N M - Stage 0)";
String pattern = "^.+?(?= *\\()";
// Create a Pattern object
Pattern r = Pattern.compile(pattern);
// Now create matcher object.
Matcher m = r.matcher(line);
if (m.find( )) {
System.out.println("Found value: " + m.group(0));
} else {
System.out.println("NO MATCH");
}
}
}
You can use this regex:
^.+?(?= *\()
RegEx Demo
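If a regex is not required, the same result can likely be had with indexOf and substring. This is a sketch with an illustrative class and method name; it returns the whole trimmed line when there is no bracket, which is an assumption about the desired behavior.

```java
// Hypothetical helper: returns the text before the first '(' with surrounding spaces removed.
class BracketPrefix {
    static String beforeBracket(String line) {
        int idx = line.indexOf('(');
        // No bracket: keep the whole line (assumed behavior).
        String head = idx < 0 ? line : line.substring(0, idx);
        return head.trim();
    }

    public static void main(String[] args) {
        System.out.println(beforeBracket("Small Intestine (T N M - Stage 0)")); // prints Small Intestine
    }
}
```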
Is it possible to parse HTML code in a verbatim mode or something similar, so that any source code fragments that appear (enclosed between pre and code HTML tags) are displayed properly?
What I want to do is show source code in a user-friendly way (easy to distinguish from the rest of the text, indentation preserved, etc.), as Stack Overflow does :)
It seems that Html.fromHtml() supports only a reduced subset of HTML tags.
TextView will never support all the HTML formatting and styling you would want. Use a WebView instead.
TextView is native and more lightweight, but precisely because it is lightweight it will not understand some of the directives you describe.
In the end I pre-parsed the received HTML myself. Since Html.fromHtml does not support the pre and code tags, I replaced them with my own formatting and pre-processed the code inside those tags, replacing "\n" with <br/> and " " with &nbsp;.
Then I pass the result to Html.fromHtml, and the output is just fine:
import android.text.Html;
import android.text.Spanned;
public class HtmlParser {
public static Spanned parse(String text) {
if (text == null) return null;
text = parseSourceCode(text);
Spanned textSpanned = Html.fromHtml(text);
return textSpanned;
}
private static String parseSourceCode(String text) {
if (text.indexOf(ORIGINAL_PATTERN_BEGIN) < 0) return text;
StringBuilder result = new StringBuilder();
int begin;
int end;
int beginIndexToProcess = 0;
while (text.indexOf(ORIGINAL_PATTERN_BEGIN) >= 0) {
begin = text.indexOf(ORIGINAL_PATTERN_BEGIN);
end = text.indexOf(ORIGINAL_PATTERN_END);
String code = parseCodeSegment(text, begin, end);
result.append(text.substring(beginIndexToProcess, begin));
result.append(PARSED_PATTERN_BEGIN);
result.append(code);
result.append(PARSED_PATTERN_END);
//replace in the original text to find the next appearance
text = text.replaceFirst(ORIGINAL_PATTERN_BEGIN, PARSED_PATTERN_BEGIN);
text = text.replaceFirst(ORIGINAL_PATTERN_END, PARSED_PATTERN_END);
//update the string index to process
beginIndexToProcess = text.lastIndexOf(PARSED_PATTERN_END) + PARSED_PATTERN_END.length();
}
//add the rest of the string
result.append(text.substring(beginIndexToProcess, text.length()));
return result.toString();
}
private static String parseCodeSegment(String text, int begin, int end) {
String code = text.substring(begin + ORIGINAL_PATTERN_BEGIN.length(), end);
code = code.replace(" ", "&nbsp;");
code = code.replace("\n","<br/>");
return code;
}
private static final String ORIGINAL_PATTERN_BEGIN = "<pre><code>";
private static final String ORIGINAL_PATTERN_END = "</code></pre>";
private static final String PARSED_PATTERN_BEGIN = "<font color=\"#888888\"><tt>";
private static final String PARSED_PATTERN_END = "</tt></font>";
}
I want to split a string and extract a single word from it. The data in my database looks like this:
Mohandas Karamchand Gandhi (1869-1948), also known as Mahatma Gandhi, was born in Porbandar in the present day state of Gujarat in India on October 2, 1869.
He was raised in a very conservative family that had affiliations with the ruling family of Kathiawad. He was educated in law at University College, London.
src="/Leaders/gandhi.png"
From the above paragraph I want to get the image name "gandhi". I can get the index of "src=", but how do I then extract the image name, i.e. "gandhi"?
My code:
int index1;
public static String htmldata = "src=";
if(paragraph.contains("src="))
{
index1 = paragraph.indexOf(htmldata);
System.out.println("index1 val"+index1);
}
else
System.out.println("not found");
You can use the StringTokenizer class (from the java.util package):
StringTokenizer tokens = new StringTokenizer(CurrentString, ":");
String first = tokens.nextToken(); // this will contain the first word
String second = tokens.nextToken(); // this will contain the remaining words
// Here I assume the string always has the syntax "foo: bar",
// but you may want to check whether there are more tokens using the hasMoreTokens method
Try this code and check whether it works for you.
public String getString(String input)
{
Pattern pt = Pattern.compile("src=.*/(.*)\\..*");
Matcher mt = pt.matcher(input);
if(mt.find())
{
return mt.group(1);
}
return null;
}
Update:
Changed to handle multiple items:
public ArrayList<String> getString(String input)
{
ArrayList<String> ret = new ArrayList<String>();
Pattern pt = Pattern.compile("src=.*/(.*)\\..*");
Matcher mt = pt.matcher(input);
while(mt.find())
{
ret.add(mt.group(1));
}
return ret;
}
Now you'll get an ArrayList with all the names. If there are no names you'll get an empty ArrayList (size 0), so always check the size.
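One caveat with the pattern above: its greedy .* runs to the end of the line, so it returns at most one name per line, and only the last one if a line has several src attributes. A possible alternative, sketched here on the assumption that every src value is wrapped in double quotes, captures the quoted path first and then cuts out the file name with plain string operations. The class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects image names (no directory, no extension) from src="..." attributes.
class ImageNames {
    // Captures everything between the double quotes of a src attribute.
    private static final Pattern SRC = Pattern.compile("src=\"([^\"]*)\"");

    static List<String> extract(String input) {
        List<String> names = new ArrayList<>();
        Matcher m = SRC.matcher(input);
        while (m.find()) {
            String path = m.group(1);
            // Drop the directory part, if any.
            String file = path.substring(path.lastIndexOf('/') + 1);
            // Drop the extension, if any.
            int dot = file.lastIndexOf('.');
            names.add(dot < 0 ? file : file.substring(0, dot));
        }
        return names;
    }
}
```

For the paragraph in the question, extract would be expected to return a list containing only "gandhi".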