I've been trying to find a good way to keep only emojis and letters in a given text, but I haven't had any success with the articles I found.
I've tried to use regex, but it seems that I cannot make it work.
I've tried to use emoji4j, but this library seems to work with emojis in the form ":)", which doesn't help me, because my emojis are groups of Unicode characters.
The result I want is the following:
"This is. a text 👨‍👩‍👧‍👦,,1234" => "This is a text 👨‍👩‍👧‍👦"
"👨‍👩‍👧‍👦" => "👨‍👩‍👧‍👦"
"👨‍👩‍👧‍👦😀123abc👨‍👩‍👧‍👦" => "👨‍👩‍👧‍👦😀abc👨‍👩‍👧‍👦"
Here's the emoji regex : ?:[\u2700-\u27bf]|(?:[\ud83c\udde6-\ud83c\uddff]){2}|[\ud800\udc00-\uDBFF\uDFFF]|[\u2600-\u26FF])[\ufe0e\ufe0f]?(?:[\u0300-\u036f\ufe20-\ufe23\u20d0-\u20f0]|[\ud83c\udffb-\ud83c\udfff])?(?:\u200d(?:[^\ud800-\udfff]|(?:[\ud83c\udde6-\ud83c\uddff]){2}|[\ud800\udc00-\uDBFF\uDFFF]|[\u2600-\u26FF])[\ufe0e\ufe0f]?(?:[\u0300-\u036f\ufe20-\ufe23\u20d0-\u20f0]|[\ud83c\udffb-\ud83c\udfff])?)*|[\u0023-\u0039]\ufe0f?\u20e3|\u3299|\u3297|\u303d|\u3030|\u24c2|[\ud83c\udd70-\ud83c\udd71]|[\ud83c\udd7e-\ud83c\udd7f]|\ud83c\udd8e|[\ud83c\udd91-\ud83c\udd9a]|[\ud83c\udde6-\ud83c\uddff]|[\ud83c\ude01-\ud83c\ude02]|\ud83c\ude1a|\ud83c\ude2f|[\ud83c\ude32-\ud83c\ude3a]|[\ud83c\ude50-\ud83c\ude51]|\u203c|\u2049|[\u25aa-\u25ab]|\u25b6|\u25c0|[\u25fb-\u25fe]|\u00a9|\u00ae|\u2122|\u2139|\ud83c\udc04|[\u2600-\u26FF]|\u2b05|\u2b06|\u2b07|\u2b1b|\u2b1c|\u2b50|\u2b55|\u231a|\u231b|\u2328|\u23cf|[\u23e9-\u23f3]|[\u23f8-\u23fa]|\ud83c\udccf|\u2934|\u2935|[\u2190-\u21ff] .
If I try something like:
val regex = "the_whole_regex_above | [^a-zA-Z]".toRegex()
myText.replace(regex, "")
it won't replace anything; basically every character passes through.
Basically I want to achieve pretty much the same thing as in this question, but using Kotlin.
You want to remove all punctuation, symbols (other than those used to form emojis) and digits.
To do that, you may use
myText = myText.replace("""[\p{N}\p{P}\p{S}&&[^\p{So}]]+""".toRegex(), "")
See the online Kotlin demo.
Details
[ - start of a character class that matches:
\p{N} - any Unicode number (digits included)
\p{P} - any Unicode punctuation character
\p{S} - any Unicode symbol
&&[^\p{So}] - except characters in the Symbol, other (So) Unicode category, which is where most emoji base characters are classified
]+ - end of the character class, matched 1 or more times.
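The same character class can be sketched on the plain Java regex engine, which Kotlin on the JVM uses as well. The class and method names below are made up for illustration, and the family emoji is written with escapes so the snippet does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

class EmojiLetterFilter {
    // Numbers, punctuation and symbols, minus the "Symbol, other" (So) category.
    private static final Pattern UNWANTED =
            Pattern.compile("[\\p{N}\\p{P}\\p{S}&&[^\\p{So}]]+");

    // Hypothetical helper: strips everything the class above matches.
    static String keepLettersAndEmoji(String text) {
        return UNWANTED.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Family emoji: four So code points joined by zero-width joiners (U+200D).
        String family =
                "\uD83D\uDC68\u200D\uD83D\uDC69\u200D\uD83D\uDC67\u200D\uD83D\uDC66";
        System.out.println(keepLettersAndEmoji("This is. a text " + family + ",,1234"));
    }
}
```

The zero-width joiner is a format character (Cf), so the class never matches it and the emoji sequence stays intact. Whitespace is not matched either, which is why the spaces survive.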
Related
I need help with creating a regex that removes all special characters, including commas, but not periods. What I have tried to do is escape all the characters, symbols and punctuation I do not want. It is not working as intended.
replace("[-\\[\\]^/,'*:.!><~##\$%+=?|\"\\\\()]+".toRegex(), "")
I removed the period and tested that too. It did not work.
replace("[-\\[\\]^/,'*:!><~##\$%+=?|\"\\\\()]+".toRegex(), "")
For example, let's take the String "if {cat.is} in a hat, then I eat green eggs and ham!".
I want the result
if {cat.is} in a hat then I eat green eggs and ham (comma and exclamation symbol removed)
Note: I want to keep brackets, although braces are OK to omit.
Anyone have a solution for this?
You can use
"""[\p{P}\p{S}&&[^.]]+""".toRegex()
The [\p{P}\p{S}&&[^.]]+ pattern matches one or more (+) punctuation (\p{P}) or symbol (\p{S}) chars other than dots (&&[^.], a character class intersection with a negated class, which works as subtraction).
See a Kotlin demo:
println("a-b)h.".replace("""[\p{P}\p{S}&&[^.]]+""".toRegex(), ""))
// => abh.
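Applied to the sentence from the question, the pattern also strips the curly braces, since they are punctuation too. The sketch below shows that, plus a variant that excludes braces from the removal. The variant is an assumption about what the asker might prefer, and the class and method names are hypothetical.

```java
class PunctuationStripper {
    // Removes punctuation and symbols, but leaves dots alone.
    static String stripExceptDots(String s) {
        return s.replaceAll("[\\p{P}\\p{S}&&[^.]]+", "");
    }

    // Variant (assumption): also leave curly braces in place.
    static String stripExceptDotsAndBraces(String s) {
        return s.replaceAll("[\\p{P}\\p{S}&&[^.{}]]+", "");
    }

    public static void main(String[] args) {
        String input = "if {cat.is} in a hat, then I eat green eggs and ham!";
        System.out.println(stripExceptDots(input));
        System.out.println(stripExceptDotsAndBraces(input));
    }
}
```

Any other character that should survive can be added to the negated class in the same way.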
I have a sample message. I need to create a regular expression to validate it using an Android pattern.
Sample message:
ERR|any digit|any digit;
Validation checks:
1. Starting fixed characters: ERR
2. Separator character: |
3. Digits after the | character
4. Message termination: ;
I have tried it this way: ^{ERR}+{|}+\d+{|}+\d+{;}$
Am I right? Please help me solve my problem.
A corrected version of the regex you gave would be ^(ERR)+(\\|)+\\d+(\\|)+\\d+;$. Parentheses are used for grouping, not braces. Also, in regex, + means "one or more of the previous expression". So (ERR)+ means "one or more repetitions of the string 'ERR'", and strings like "ERRERR|123|456;" would match (the same goes for the pipe characters). I assume this is not what you are trying to do.
Having said that, try this: "^ERR\\|\\d+\\|\\d+;$"
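A minimal sketch of how the suggested pattern could be wired into a validation helper. The class and method names are hypothetical; only the regex comes from the answer above.

```java
import java.util.regex.Pattern;

class ErrMessageValidator {
    // ERR, a pipe, digits, a pipe, digits, then the terminating semicolon.
    private static final Pattern ERR_MESSAGE =
            Pattern.compile("^ERR\\|\\d+\\|\\d+;$");

    static boolean isValid(String message) {
        // matches() requires the whole input to fit the pattern.
        return ERR_MESSAGE.matcher(message).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("ERR|12|345;"));
        System.out.println(isValid("ERRERR|123|456;"));
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex on every call.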
I want to check whether a web page address appears in a text. For example, I have the page address www.taktik.com/trow and want to check whether the text contains it.
I use Matcher mW = Pattern.compile("[a-zA-Z0-9_.+-]+.[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+\\/[a-zA-Z0-9-.]+").matcher(question); but I don't get any results. How can I check if text xxx.xxxx.xxx/xxx is in my String?
How can I check if text xxx.xxxx.xxx/xxx is in my String?
Fixing your regex, the pattern may look like
[a-zA-Z0-9_.+-]+\\.[a-zA-Z0-9-]+\\.[a-zA-Z0-9.-]+/[a-zA-Z0-9.-]+
Mind that I escaped the first dot and placed the hyphen at the end of the last two character classes (in yours, you have 9-. that creates a range that matches more than you'd want).
I tried to shorten the pattern a bit, but it's difficult since \w also matches Unicode characters in Android. Here is a possible regex:
(?i)[A-Z0-9_+-]+(?:\\.[A-Z0-9-]+){2}/[A-Z0-9-]+
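The question doesn't show how the Matcher is used afterwards, so the following is only a sketch under the assumption that the goal is to find the address anywhere inside a longer text. In that case find() is the call to use, because matches() succeeds only when the entire input fits the pattern. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AddressFinder {
    // The first fixed pattern from the answer above.
    private static final Pattern ADDRESS = Pattern.compile(
            "[a-zA-Z0-9_.+-]+\\.[a-zA-Z0-9-]+\\.[a-zA-Z0-9.-]+/[a-zA-Z0-9.-]+");

    // Returns the first address-like substring, or null when there is none.
    static String findAddress(String text) {
        Matcher m = ADDRESS.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(findAddress("Please check www.taktik.com/trow today"));
    }
}
```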
I discovered today that Android can't display a small handful of Japanese characters that I'm using in my Japanese-English dictionary app.
The problem comes when I attempt to display the character via TextView.setText(). All of the characters below show up as blank when I attempt to display them in a TextView. It doesn't appear to be an issue with encoding, though - I'm storing the characters in a SQLite database and have verified that Android can understand the characters. Casting the characters to (int) retrieves proper Unicode decimal escapes for all but one of the characters:
String component = cursor.getString(cursor.getColumnIndex("component"));
Log.i("CursorAdapterGridComponents", "Character Code: " + (int) component.charAt(0) + "(" + component + ")");
I had to use Character.codePointAt() to get the decimal escape for the one problematic character:
int codePoint = Character.codePointAt(component, 0);
I don't think I'm doing anything wrong, and as Strings are UTF-16 encoded by default, there should be nothing preventing them from displaying the characters.
Below are all of the decimal escapes for the seven problematic characters:
⺅ Character Code: 11909(⺅)
⺌ Character Code: 11916(⺌)
⺾ Character Code: 11966(⺾)
⻏ Character Code: 11983(⻏)
⻖ Character Code: 11990(⻖)
⺹ Character Code: 11961(⺹)
𠆢 Character Code: 131490(𠆢)
Plugging the first six values into http://unicode-table.com/en/ revealed their corresponding Unicode numbers, so I have no doubt that they're valid Unicode characters.
The seventh character could only be retrieved from a table of UTF-16 characters: http://www.fileformat.info/info/unicode/char/201a2/browsertest.htm. I could not use its 5-character Unicode number in setText() (as in "\u201a2") because, as I discovered earlier today, Android has no support for Unicode strings past 0xFFFF. As a result, the string was evaluated as "\u201a" + "2". That still doesn't explain why the first six characters won't show up.
What are my options at this point? My first instinct is to just make graphics out of the problematic characters, but Android's highly variable DPI environment makes this a challenging proposition. Is using another font in my app an option? Aside from that, I really have no idea how to proceed.
Is using another font in my app an option?
Sure. Find a font that you are licensed to distribute with your app and has these characters. Package the font in your assets/ directory. Create a Typeface object for that font face. Apply that font to necessary widgets using setTypeface() on TextView.
Here is a sample application demonstrating applying a custom font to a TextView.
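On the side issue of the seventh character: a Java `\u` escape takes exactly four hex digits, which is why `"\u201a2"` is read as U+201A followed by the digit 2. A code point above 0xFFFF can still be put into a String, either as a surrogate pair or by converting the code point. The sketch below uses only standard `java.lang` calls; the class and method names are made up for illustration.

```java
class SupplementaryCodePoint {
    // Builds a String from any Unicode code point, including those above 0xFFFF.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = fromCodePoint(0x201A2);      // 131490 in decimal
        System.out.println(s.length());         // number of UTF-16 code units
        System.out.println(s.codePointAt(0));   // the full code point
        System.out.println((int) s.charAt(0));  // only the high surrogate
    }
}
```

This also explains why casting charAt(0) to int did not return the expected value for that character while Character.codePointAt() did. Whether the glyph is then displayed still depends on the font.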
Since AVD tools 16 I'm getting this warning:
Replace "..." with ellipsis character (…, &#8230;) ?
in my strings.xml
at this line
<string name="searching">Searching...</string>
How do I replace ...? Is it just literally &#8230;?
Could someone explain this encoding?
&#8230; is the numeric character reference for "…" (Unicode U+2026), so just replace it. It's better to have it as one character than three dots.
To make things short, just put &#8230; in place of ...
Link to XML character Entities List
Look at Unicode column of HTML for row named hellip
If you're using Eclipse then you can always do the following:
Right click on the warning
Select "Quick Fix" (shortcut is Ctrl + 1 by default)
Select "Replace with suggested characters"
This should replace your three dots with the proper Unicode character for ellipsis.
Just a note: The latest version of ADT (21.1) sometimes won't do the replace operation properly, but earlier versions had no problem doing this.
This is the character: …
The solution to your problem is:
Go to Window -> Preferences -> Android -> Lint Error Checking
And search for "ellipsis". Change the warning level to "Info" or "Ignore".
This answer is indirectly related to this question:
In my case textView1.setTextView("done…"); was showing a box or a Chinese character. Later, I checked fileformat.info for what the value represents and found that it is a Han character.
So, what to do? I searched for "fileformat.info ellipse character" and then everything became clear to me once I saw its values:
UTF-16 (hex) 0x2026 (2026)
UTF-16 (decimal) 8,230
So there are several notations available for the same character code (e.g. 10 in decimal is written as A in hexadecimal), and when you write a Unicode character it is very important to know how the receiving function decodes it. If it expects a decimal value you have to provide a decimal value; if it accepts hexadecimal you have to provide hexadecimal.
In my case, the setTextView() function accepted a decimal-encoded value but I was providing hexadecimal values, so I was getting the wrong character.
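The decimal and hexadecimal values above name the same code point. In XML, `&#8230;` is the decimal form of the reference and `&#x2026;` is the hexadecimal form. A small Java sketch of the relationship follows; the class, constant and method names are hypothetical.

```java
class EllipsisDemo {
    // The ellipsis written with its hexadecimal code, 2026.
    static final String ELLIPSIS = "\u2026";

    // Replaces three ASCII dots with the single ellipsis character.
    static String withEllipsis(String s) {
        return s.replace("...", ELLIPSIS);
    }

    public static void main(String[] args) {
        System.out.println(ELLIPSIS.codePointAt(0));     // decimal value
        System.out.println(Integer.toHexString(8230));   // hexadecimal value
        System.out.println(withEllipsis("Searching...").length());
    }
}
```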
The quick fix shortcut in Android Studio is Alt + Enter by default.
It seems to me best not to ignore it, as some have suggested. Use Android Studio to correct it (rather than typing in the character code), and the tool will replace the three dots with the single ellipsis Unicode character. That won't be confusing to translators etc.