I am using BreakIterator to get each word from a sentence, but there is a problem with a sentence like "my mother-in-law is coming for a visit": I am not able to get mother-in-law as a single word.
BreakIterator iterator = BreakIterator.getWordInstance(Locale.ENGLISH);
// the iterator needs the text to scan, and start must begin at the first boundary
iterator.setText(sentence);
int start = iterator.first();
for (int end = iterator.next(); end != BreakIterator.DONE; start = end, end = iterator.next())
{
    String possibleWord = sentence.substring(start, end);
    if (Character.isLetterOrDigit(possibleWord.charAt(0)))
    {
        // grab the word
    }
}
From what I can see in your code, you are checking whether the first character of each token is a letter or a digit. BreakIterator.getWordInstance() always splits the text according to the word-boundary rules of the Locale, and as far as I know it is hard to get what you want with that class. My advice is to split on spaces instead:
String text = "my mother-in-law is coming for a visit";
String[] words = text.split(" ");
for (String word : words) {
    if (Character.isLetterOrDigit(word.charAt(0))) {
        // grab the word
    }
}
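If the sentence contains punctuation, splitting on spaces leaves it attached to the words ("visit." instead of "visit"). A minimal sketch of one alternative, assuming a word is a run of letters or digits optionally joined by single hyphens or apostrophes. The class name `HyphenatedWords` and the pattern are illustrative assumptions, not part of BreakIterator:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HyphenatedWords {
    // Assumed definition of a word: letters/digits, optionally joined by '-' or '\''.
    private static final Pattern WORD =
            Pattern.compile("[\\p{L}\\p{N}]+(?:[-'][\\p{L}\\p{N}]+)*");

    // Returns every match of WORD in the sentence, in order of appearance.
    static List<String> extract(String sentence) {
        List<String> words = new ArrayList<>();
        Matcher matcher = WORD.matcher(sentence);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(extract("my mother-in-law is coming for a visit."));
    }
}
```

Whether an apostrophe or a hyphen should join two parts into one word depends on the text being processed, so the character class may need adjusting.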
Related
From the following image, I want to extract the number below the text Arzt-Nr (654321161).
I've used an OCR reader, but it extracts the text in random order rather than in sequence, which makes it difficult to add logic to extract the number below "Arzt-Nr".
I've used the following code, but the text is not in sequence.
Is there any way to achieve this?
String text = "";
for (int i = 0; i < detectedItems.size(); i++) {
    TextBlock item = detectedItems.valueAt(i);
    String detectedText = item.getValue();
    List<Line> lines = (List<Line>) item.getComponents();
    for (Line line : lines) {
        List<Element> elements = (List<Element>) line.getComponents();
        for (Element element : elements) {
            String word = element.getValue();
            text = text + " " + word;
        }
        text += "\n";
    }
}
Try checking for a fixed-length word after the position of "Arzt-Nr", and also check the pattern of the word you find, for example whether it contains only digits.
Extract the TSV output of the image using Tesseract and find the nearest text below the location of the keyword. Also have a look at Tesseract's page segmentation modes.
Link to Generating tsv
Link to use page segmentation
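The "nearest text below the keyword" idea can be sketched independently of any OCR library, assuming each recognized word comes with a bounding box. The `Box` class and `findBelow` method are hypothetical names for illustration; in practice the coordinates would come from the OCR engine's output (for example the position columns of a TSV export):

```java
import java.util.Arrays;
import java.util.List;

final class OcrBelowKeyword {
    // Hypothetical container for one recognized word and its bounding box in pixels.
    static final class Box {
        final String text;
        final int left, top, right, bottom;

        Box(String text, int left, int top, int right, int bottom) {
            this.text = text;
            this.left = left;
            this.top = top;
            this.right = right;
            this.bottom = bottom;
        }
    }

    // Returns the text of the closest box that starts below the keyword's box
    // and overlaps it horizontally, or null if there is none.
    static String findBelow(List<Box> boxes, String keyword) {
        Box anchor = null;
        for (Box b : boxes) {
            if (b.text.equals(keyword)) {
                anchor = b;
                break;
            }
        }
        if (anchor == null) {
            return null;
        }
        Box best = null;
        for (Box b : boxes) {
            boolean below = b.top >= anchor.bottom;
            boolean overlaps = b.left < anchor.right && b.right > anchor.left;
            if (below && overlaps && (best == null || b.top < best.top)) {
                best = b;
            }
        }
        return best == null ? null : best.text;
    }

    public static void main(String[] args) {
        List<Box> boxes = Arrays.asList(
                new Box("Arzt-Nr", 100, 50, 180, 65),
                new Box("654321161", 100, 70, 190, 85));
        System.out.println(findBelow(boxes, "Arzt-Nr"));
    }
}
```

This works on positions rather than on reading order, so it does not matter in which sequence the OCR engine returns the words.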
I've seen many people do something similar to this in order to get the last word of a String:
String test = "This is a sentence";
String lastWord = test.substring(test.lastIndexOf(" ")+1);
I would like to do something similar, but get the last few words after the last int. It can't be hard-coded, as the number could be anything and the number of words after the last int is unbounded. I'm wondering whether there is a simple way to do this, as I want to avoid using Patterns and Matchers again, having already used them earlier in this method for a similar purpose.
Thanks in advance.
I would like to get the last few words after the last int.... as the number could be anything and the amount of words after the last int could also be unlimited.
Here's a possible suggestion, using String#split:
String str = "This is 1 and 2 and 3 some more words .... foo bar baz";
String[] parts = str.split("\\d+(?!.*\\d)\\s+");
And now parts[1] holds all words after the last number in the string.
some more words .... foo bar baz
What about this one:
String test = "a string with a large number 1312398741 and some words";
String[] parts = test.split(" ");
for (int i = 0; i < parts.length; i++)
{
    try
    {
        Integer.parseInt(parts[i]);
    }
    catch (NumberFormatException e)
    {
        // this part is not a number, so let's go on...
        continue;
    }
    // when parsing succeeds, a number was reached and continue has
    // not been called. Everything after 'i' is what you are looking for:
    // DO YOUR STUFF with parts[i+1] to parts[parts.length-1] here
}
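Since the question asks to avoid Pattern and Matcher, here is a minimal sketch that scans backward for the last digit and returns whatever follows it. The class and method names are illustrative, and it assumes any digit counts as part of "the last int", including a digit embedded in a word:

```java
final class AfterLastNumber {
    // Returns the text after the last digit in s, trimmed,
    // or an empty string if s contains no digit at all.
    static String wordsAfterLastNumber(String s) {
        for (int i = s.length() - 1; i >= 0; i--) {
            if (Character.isDigit(s.charAt(i))) {
                return s.substring(i + 1).trim();
            }
        }
        return "";
    }

    public static void main(String[] args) {
        String str = "This is 1 and 2 and 3 some more words .... foo bar baz";
        System.out.println(wordsAfterLastNumber(str));
    }
}
```

If the words are needed individually, the result can be split on whitespace afterwards.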
I want to highlight a particular word in a text view (more specifically, similar to a Twitter feed). The word may occur multiple times. Below is a sample sentence from Twitter.
" Mumbai Master Blaster! #Sachin. Greatest players of all times. The legend of cricket #sachin. "
Here I want to highlight the word "#Sachin" with a particular color. Note that we don't know how many times this word is repeated in the whole string. Could anyone help me solve this issue?
Use the following code:
public CharSequence linkifyHashtags(String text) {
    SpannableStringBuilder linkifiedText = new SpannableStringBuilder(text);
    // "#\\w+" matches the whole hashtag, not just its first character
    Pattern pattern = Pattern.compile("#\\w+");
    Matcher matcher = pattern.matcher(text);
    while (matcher.find()) {
        int start = matcher.start();
        int end = matcher.end();
        // use a new span per match and apply it to the match's own range
        ForegroundColorSpan span = new ForegroundColorSpan(Color.BLUE);
        linkifiedText.setSpan(span, start, end, Spanned.SPAN_EXCLUSIVE_EXCLUSIVE);
    }
    return linkifiedText;
}
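To highlight one specific word regardless of case, rather than every hashtag, the character ranges can be collected first and each one then passed to setSpan. A minimal plain-Java sketch, with the class and method names being illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.List;

final class WordRanges {
    // Returns {start, end} index pairs for every non-overlapping,
    // case-insensitive occurrence of word in text.
    static List<int[]> findRanges(String text, String word) {
        List<int[]> ranges = new ArrayList<>();
        int n = word.length();
        int i = 0;
        while (n > 0 && i + n <= text.length()) {
            if (text.regionMatches(true, i, word, 0, n)) {
                ranges.add(new int[] {i, i + n});
                i += n; // continue after this match
            } else {
                i++;
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        String tweet = " Mumbai Master Blaster! #Sachin. The legend of cricket #sachin. ";
        for (int[] r : findRanges(tweet, "#Sachin")) {
            System.out.println(r[0] + ".." + r[1] + " " + tweet.substring(r[0], r[1]));
        }
    }
}
```

Note that this matches substrings, so "#Sachin" would also be found inside a longer tag such as "#SachinFan"; a boundary check would be needed to exclude that.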
I have a List of Strings and I want to compare every word I write in an EditText with that list. If there is a match, I have to add a "-" character as a prefix to that word.
I am using a TextWatcher and this is my code so far:
@Override
public void afterTextChanged(Editable s) {
    String tmp = s.toString();
    words = tmp.split(" ");
    for (int i = 0; i < words.length; i++) {
        for (Iterator iterator = myList.iterator(); iterator.hasNext();) {
            String str = (String) iterator.next();
            if (str.equalsIgnoreCase(words[i])) {
                if (!words[i].contains("-")) {
                    tmp = tmp.replace(words[i], "-" + words[i]);
                }
                editMain.setText(tmp);
                editMain.setSelection(tmp.length());
            }
        }
    }
}
It works, but if I type the same word twice in my EditText, the first occurrence gets two "--".
For example:
hello this is -android (works ok)
hello this is --android -android (does not work ok)
And the desired result should be:
hello this is -android android (because the repeated word already exists)
Any help? thanks in advance
Your question is not very clear. Maybe you mean that once the word android has been found, later occurrences should not be prefixed with a -.
If that's the case, just remove the matching word from myList. Use a ListIterator for that.
Try keeping a counter. If the counter is greater than 1, don't write the -.
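The "only the first occurrence" idea from the answers above can be sketched as a pure function that works on the split words instead of calling String.replace, which rewrites every occurrence at once. The class and method names are illustrative assumptions:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class FirstOccurrencePrefixer {
    // Prefixes the first occurrence of each keyword with "-" and leaves
    // later occurrences bare. Calling it again on its own output changes nothing.
    static String prefixFirstOccurrences(String text, List<String> keywords) {
        Set<String> wanted = new HashSet<>();
        for (String keyword : keywords) {
            wanted.add(keyword.toLowerCase(Locale.ROOT));
        }
        Set<String> seen = new HashSet<>();
        String[] words = text.split(" ", -1); // -1 keeps trailing spaces
        for (int i = 0; i < words.length; i++) {
            String bare = words[i].startsWith("-") ? words[i].substring(1) : words[i];
            String key = bare.toLowerCase(Locale.ROOT);
            if (!wanted.contains(key)) {
                continue;
            }
            // Set.add returns true only the first time a key is inserted
            words[i] = seen.add(key) ? "-" + bare : bare;
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        System.out.println(prefixFirstOccurrences(
                "hello this is android android", List.of("android")));
    }
}
```

Inside afterTextChanged it is probably safest to call setText only when the result differs from the current text, since setting the text triggers the watcher again.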
Could anybody post some code showing how to read a file word by word? I only know how to read line by line from a file using BufferedReader. I'd appreciate it if the answer used BufferedReader.
I solved it with this code:
StringBuilder word = new StringBuilder();
int i = 0;
Scanner input = new Scanner(new InputStreamReader(a.getInputStream()));
while (input.hasNext()) {
    i++;
    if (i == prefNamePosition) {
        word.append(prefName);
        word.append(" ");
        input.next();
    }
    else {
        // next() consumes the token; hasNext() would append a boolean and never advance
        word.append(input.next());
        word.append(" ");
    }
}
There's no good way other than to read() and get a character at a time until you get a space or whatever criteria you want for determining what a "word" is.
If you're trying to replace the nth token with a special value, try this:
while (input.hasNext()) {
    String currentWord = input.next();
    if (++i == prefNamePosition) {
        currentWord = prefName;
    }
    word.append(currentWord);
    word.append(" ");
}
Another way is to use a tokenizer (e.g. StringTokenizer in Java) with the space character (' ') as the delimiter. Then just iterate through the tokens to read each word from your file.
You can read lines and then split them. There is no single clear definition of a word, but if you want tokens separated by blank spaces, this works.
You could also use regular expressions to do this.
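A minimal sketch of the read-lines-then-split approach using BufferedReader, as the question asked. It assumes a word is any run of non-whitespace characters; the class and method names are illustrative:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class WordReader {
    // Reads the source line by line and splits each line on whitespace.
    static List<String> readWords(Reader source) {
        List<String> words = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                for (String word : line.trim().split("\\s+")) {
                    if (!word.isEmpty()) { // a blank line yields one empty token
                        words.add(word);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        // A StringReader stands in for a FileReader to keep the example self-contained.
        System.out.println(readWords(new StringReader("hello  world\n\nfoo bar\n")));
    }
}
```

For a real file, a `java.io.FileReader` or `Files.newBufferedReader` could be passed in place of the StringReader.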