Custom Dictionary for Tesseract - android

I am currently working on a project for Android using Tesseract OCR. I was hoping to fine-tune the results given to the user by adding a dictionary. According to the Tesseract OCR wiki, the best way to go about this would be to:
Replace tessdata/eng.user-words with your own word list, in the same
format - UTF8 text, one word per line.
However, there is no eng.user-words file in the tessdata folder. I assume that if I just make a text file with my dictionary in it, it will never be used...
Has anybody had a similar experience and knows what to do?

If you're using Tesseract 3 (which I assume you are), you'll have to rebuild your eng.traineddata file.
In my case I replaced the word-dawg file completely to try to get better results (i.e. the words I'm detecting are always the same).
You'll need the combine_tessdata and wordlist2dawg executables, which end up in the training directory when you compile Tesseract.
Unpack everything (I did this just to back up my eng.word-dawg; you'll also need the unicharset later):
./combine_tessdata -u eng.traineddata
Create a text file containing your word list (wordlistfile).
Create an eng.word-dawg from it:
./wordlist2dawg wordlistfile eng.word-dawg traineddat_backup/.unicharset
Replace the word-dawg file inside eng.traineddata:
./combine_tessdata -o eng.traineddata eng.word-dawg
That should be it.
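Since wordlist2dawg takes a plain word list as input, it may help to normalize that file before building the dawg. A minimal Python sketch, assuming the format quoted in the question (UTF-8 text, one word per line). The function name write_wordlist and the deduplication step are illustrative choices, not part of Tesseract:

```python
from pathlib import Path


def write_wordlist(words, path):
    """Write words as UTF-8 text, one word per line, skipping blanks and duplicates."""
    seen = set()
    cleaned = []
    for word in words:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            cleaned.append(word)
    # One word per line, with a trailing newline after the last word.
    Path(path).write_text("".join(w + "\n" for w in cleaned), encoding="utf-8")
    return cleaned
```

The resulting file would then play the role of wordlistfile in the wordlist2dawg step.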

Related

Export missing string in Android Studio

I have to maintain an app translated into more than 10 different languages. Whenever a new version is developed, new strings are added to the source values.xml. The translation editor gives me an overview of which strings are missing in other languages, but at the moment there seems to be no option to get a diff XML with just the new strings for each language. Since we use translation services, we have to pay per translated word. Therefore I always have to create the files with the missing translations manually, which is very time-consuming.
I can't imagine I'm the only one needing this particular feature. Is there a workaround / script / plugin that solves this problem?
Back in the steam age I faced a similar problem while trying to keep some 14 translations in sync, so I created a small PHP script to help me with this.
As I said, it's pretty dated (2010 :) yet it should work. I just made it available on GitHub: https://github.com/MarcinOrlowski/android-strings-check
Basically, what it does is diff two translation XMLs and generate a human-readable report:
./strings-check.php values/strings.xml values-pl/strings.xml
It will give you the output like this:
Missing in LANG (You need to add these)
File: values-pl/strings.xml
------------------------------------------------------
show_full_header_action
hide_full_header_action
recreating_account
Not present in BASE (remove it from your LANG file)
File: values/strings.xml
------------------------------------------------------------------
provider_note_yahoo
Summary
----------------
BASE file: 'values/strings.xml'
LANG file: 'values-pl/strings.xml'
3 missing strings
1 orphaned strings
OK, I guess I found the solution to my problem: a Python script called android-localization-helper:
https://github.com/jordanjoz1/android-localization-helper
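The underlying idea (compare the string names in the base file against those in a translation, and emit only the missing entries) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how either of the linked tools works: the function names are hypothetical, and only plain <string> elements are considered, not plurals or string arrays.

```python
import xml.etree.ElementTree as ET


def string_names(xml_text):
    """Return the name attributes of all <string> elements, in document order."""
    root = ET.fromstring(xml_text)
    return [el.get("name") for el in root.iter("string")]


def diff_strings(base_xml, lang_xml):
    """Return (missing, orphaned): names only in BASE, and names only in LANG."""
    base = string_names(base_xml)
    lang = string_names(lang_xml)
    base_set, lang_set = set(base), set(lang)
    missing = [name for name in base if name not in lang_set]
    orphaned = [name for name in lang if name not in base_set]
    return missing, orphaned


def missing_strings_xml(base_xml, lang_xml):
    """Build a <resources> document holding only the BASE strings absent from LANG."""
    lang_names = set(string_names(lang_xml))
    out = ET.Element("resources")
    for el in ET.fromstring(base_xml).iter("string"):
        if el.get("name") not in lang_names:
            out.append(el)
    return ET.tostring(out, encoding="unicode")
```

The output of missing_strings_xml is the kind of file that could be handed to a translation service, since it contains only the untranslated entries with their base-language text.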

What is the meaning of %©»ªµ in a PDF code header?

I am trying to create a PDF in my Android application using the Android PDF Writer. This is a very basic library that lets you create simple PDF files. It works quite well, but there is one thing I do not understand:
When I look at the generated PDF source code, I can see that the file starts with the following lines:
%PDF-1.4
%©»ªµ
1 0 obj
<<
/Type /Catalog
/Pages 2 0 R
>>
endobj
...
What does the second line mean? I searched a lot of different PDF syntax documentation but found no hint as to what that line could mean. In all the examples I found, the %PDF-VersionXY line is directly followed by the first object / the catalog.
I am not sure if this is valid PDF code at all, or if this is an error due to some charset/encoding problem with the library's source code.
Any idea what this could be about? What information could be included at this place, and is %©»ªµ valid PDF or some encoding error?
The PDF 1.4 reference (and also the current 1.7 reference) says in section 3.4.1:
Note: If a PDF file contains binary data, as most do (see Section 3.1, “Lexical Conventions”),
it is recommended that the header line be immediately followed by a
comment line containing at least four binary characters—that is, characters whose
codes are 128 or greater. This will ensure proper behavior of file transfer applications
that inspect data near the beginning of a file to determine whether to treat the file’s
contents as text or as binary.
So your generator seems to include this additional comment line by default, even if there is no binary data to follow. What's in there doesn't matter as long as each byte value is 128 or greater (that is, outside the ASCII range). In your case the bytes are hex A9 BB AA B5, so everything is fine and you don't have to worry about this line.
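To make the rule concrete, here is a small Python sketch that checks whether a file follows the recommendation quoted above. The function name has_binary_marker is made up for illustration, and the check covers only what the note describes: a header line followed by a comment line with at least four bytes of value 128 or greater.

```python
def has_binary_marker(pdf_bytes):
    """Return True if line 2 is a comment with at least four bytes of value >= 128."""
    lines = pdf_bytes.splitlines()
    if len(lines) < 2 or not lines[0].startswith(b"%PDF-"):
        return False
    second = lines[1]
    if not second.startswith(b"%"):
        # In PDF, a comment line starts with a percent sign.
        return False
    high_bytes = [b for b in second[1:] if b >= 128]
    return len(high_bytes) >= 4
```

For the header from the question, has_binary_marker(b"%PDF-1.4\n%\xa9\xbb\xaa\xb5\n1 0 obj\n") returns True. A file without such a line is not invalid on that account, because the note is a recommendation rather than a requirement.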

Can I include a file of strings that would be autocompleting and crossplatform (iOS/Android)?

I'm not even sure what the vocabulary for this question is, but I'd like to have a file which is a list of strings which could be included as constants in Android and iOS.
I'm trying to find better vocab to describe this issue so comments are greatly appreciated too, thanks all.
Edit: For example, I'd like to have a file such as
color_names.txt
COLOR_NAME_BLUE "blue"
COLOR_NAME_RED "red"
COLOR_NAME_GREEN "green"
Which I can include in both an Android and an iOS project I have, in a way that in the code COLOR_NAME_BLUE is symbol checked, and if someone were to type COLOR_NAME_BLEU it would throw a compile error.
The actual file will be much larger, and it is something I want to be maintainable. I could put this in JSON, but then I'd have to do the checking at run time, which isn't terrible; I'm just trying to figure out if there is a better way.
We also have iOS and Android apps that should be sharing strings.
You should use a Python program (or some inferior scripting system) that takes your input file (checking it for errors) and outputs a Localizable.strings file for iOS and a strings.xml file for Android.
So long as you have a good handle on your directory structure, you should be able to place both the Localizable.strings file and strings.xml file right where they need to be for your build.
For example, for a label pair like this:
PRIMARY_AGE_10 "Primary Age 10"
The label/string matchup is pretty obvious for the Android strings.xml:
<string name="PRIMARY_AGE_10">Primary Age 10</string>
The iOS Localizable.strings format is like this:
"PRIMARY_AGE_10" = "Primary Age 10";
Then, when I want to use the label "Primary Age 10", instead of hard-coding an NSString literal such as @"Primary Age 10", I just make a call like this:
NSLocalizedString(@"PRIMARY_AGE_10", nil)
One other big advantage is if you need to localize, you can generate multiple Localizable.strings files and strings.xml files.
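A generator along these lines might look like the following Python sketch. It assumes the LABEL "value" input format shown in the question; the function names are hypothetical, and Android-specific escaping (for example of apostrophes) is left out to keep the example short.

```python
import re
from xml.sax.saxutils import escape

# One label per line: an identifier, whitespace, then a double-quoted value.
LINE_RE = re.compile(r'^([A-Za-z_][A-Za-z0-9_]*)\s+"(.*)"$')


def parse_labels(text):
    """Parse LABEL "value" lines; raise ValueError on malformed or duplicate labels."""
    labels = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"line {lineno}: cannot parse {line!r}")
        key, value = match.groups()
        if key in labels:
            raise ValueError(f"line {lineno}: duplicate label {key}")
        labels[key] = value
    return labels


def to_android(labels):
    """Render the labels as the contents of an Android strings.xml file."""
    rows = [f'    <string name="{k}">{escape(v)}</string>' for k, v in labels.items()]
    return "<resources>\n" + "\n".join(rows) + "\n</resources>\n"


def to_ios(labels):
    """Render the labels as the contents of an iOS Localizable.strings file."""
    def quote(s):
        return s.replace("\\", "\\\\").replace('"', '\\"')
    return "".join(f'"{k}" = "{quote(v)}";\n' for k, v in labels.items())
```

Running both renderers from one input file keeps the two platforms in step, and a build script could write the results into the respective project directories.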

Android Strings

I wrote a big app with thousands of strings in the code... a very bad idea, because now I want to translate each string... big problem.
Copying all the strings to strings.xml takes a long time.
Eclipse has an option to take all selected strings and put them into messages.properties.
Does this work similarly to strings.xml? If so, why does everybody use strings.xml?
Or should I use Eclipse to separate out each string and then copy them to strings.xml?
Everybody uses strings.xml because it is the normal way to do it on Android. You don't have to manage loading the strings yourself or call any locale function in your code.
You can see the documentation here : http://developer.android.com/guide/topics/resources/index.html
BTW, you can easily transform your Eclipse-generated file into a strings.xml file after the extraction.
In Eclipse you can use the shortcut keys Alt + Shift A, S to extract an inline string into the strings.xml file via a popup dialog, which might be a bit easier than doing it by hand. And as the others say, yes, you should ALWAYS use the strings.xml file so that you only have to look in one place when you want to change a string, instead of having to search through all your code.

Keep Strings into .txt file or put them into a Database?

Hey, I have a lot of strings that I use in my app. The .txt file that holds them has ~14000 lines, and every 3-10 lines are grouped into a section like <String="Chapter I"> ... </String>.
In terms of performance/speed, should I put the sections into a database, or read the .txt file line by line and check whether the section number is the current one? Will this affect speed/performance?
I could also split the file into separate .txt files of ~2000 lines each, so there would be fewer lines to go through. Is this a bad way of storing data? Thanks
I think SQLite would do the trick. It will probably be way faster than parsing a text file, plus you won't have the headache of maintaining your own ad hoc text database, or of building a parser in the first place. Basically, use it; it's way easier.
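To illustrate the idea, here is a minimal sketch using Python's built-in sqlite3 module; on Android the same schema would be used through the platform's own SQLite API instead. The table layout (one row per section, keyed by its title) and the function names are assumptions made for this example.

```python
import sqlite3


def build_db(sections):
    """Create an in-memory database with one row per (title, body) section."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sections (title TEXT PRIMARY KEY, body TEXT NOT NULL)")
    conn.executemany("INSERT INTO sections (title, body) VALUES (?, ?)", sections)
    conn.commit()
    return conn


def load_section(conn, title):
    """Fetch one section by title through the primary-key index; None if absent."""
    row = conn.execute("SELECT body FROM sections WHERE title = ?", (title,)).fetchone()
    return None if row is None else row[0]
```

Looking up a section then becomes a single indexed query rather than a scan through thousands of lines.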
The standard way to deal with Strings in Android is to put them into res/values/strings.xml (I'm pretty sure you can have multiple String files in that directory if you like). If you are developing in Eclipse it will automatically populate the R class (the resource class) with constants that you can use to reference these Strings in your code:
R.string.mystring
Or in XML layouts:
@string/mystring
Or if you're doing something more custom you can use:
String string = getString(R.string.hello);
I would definitely choose this over a .txt file. It's much easier. All the work is done for you! Have a read of this Android article about it.
This is what a database is for. Use it.
