I am working on an Android OCR project that recognizes numbers in images. My images contain only numbers. I tried the 'eng.traineddata' file, but the accuracy is too low, below 40%. Does anyone know of a 'traineddata' file for digits? I tried Tesseract training, but I am not familiar with it. Please help me find a 'traineddata' file for recognizing numbers.
If you want to recognize only numbers, then set the Tesseract whitelist value to "0123456789".
I am developing an Android OCR app.
I built the OCR with tess-two on Android.
I downloaded the 'traineddata' file, and English output works. But I want to output numbers.
I found the command 'tesseract image.tif outputbase nobatch digits' online, along with advice to insert the generated files.
But I did not understand what that means.
Please tell me the easiest way.
You will need to set the tessedit_char_whitelist variable, as follows:
baseApi.setVariable("tessedit_char_whitelist", "0123456789");
See "Android OCR detecting digits only using popular Tesseract fork tess-two" or "Extracting numbers from Bitmap in Android using tess-two library".
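Even with a whitelist set, it can be worth checking the returned text before using it. The following is a minimal sketch of such a post-filter in plain Java; it is not part of tess-two, and the class and method names are hypothetical. On Android, the input string would be whatever baseApi.getUTF8Text() returns.

```java
// Hypothetical helper, not part of tess-two: post-filters recognized text.
final class DigitFilter {
    private DigitFilter() {}

    /** Returns only the ASCII digits 0-9 found in the text, in their original order. */
    static String keepDigits(String ocrText) {
        StringBuilder digits = new StringBuilder();
        for (int i = 0; i < ocrText.length(); i++) {
            char c = ocrText.charAt(i);
            if (c >= '0' && c <= '9') {
                digits.append(c);
            }
        }
        return digits.toString();
    }

    public static void main(String[] args) {
        // Spaces, newlines and stray letters are dropped.
        System.out.println(keepDigits("12 34\nO5"));
    }
}
```

This does not replace the whitelist, which steers recognition itself; it only removes whitespace and any stray characters from the result.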
I've built an application that uses Tesseract (V3.03 rc1) to identify some specific text strings. These are, unfortunately, printed in a custom font that requires me to build my own traineddata file. I've built the application on both iOS (using https://github.com/gali8/Tesseract-OCR-iOS for inspiration) and Android (using https://github.com/rmtheis/tess-two/ for inspiration as well).
The workflow for both platforms is as follows:
I select a bounding box on the preview screen for where I can crop out the relevant text, and crop the image accordingly.
I use OpenCV to get a binary image (using OpenCV's adaptive threshold function with the same parameters for both platforms)
I pass this binary image to Tesseract. Both platforms (Android and iOS) use the same traineddata file.
And yet, iOS recognizes the text strings perfectly, while Android keeps misidentifying certain characters (6s for Ss, As for Hs).
On both platforms, I use the same white list string, I disable load_type_dawg and load_system_dawg, and also choose to save the blob choices.
Has anyone encountered this kind of situation before? Am I missing a setting on Android that's automatically handled in iOS? Is there something particular about Android that hasn't crossed my mind?
Any thoughts or advice would be greatly appreciated!
So, after a lot of work, I found out what was wrong with my Android application (thankfully, it wasn't an issue with Tesseract at all). As I'm more familiar with iOS apps than Android, I wasn't sure how I could load the traineddata file onto the application without requiring the user to have the file loaded on their external storage device. I found inspiration in this project (http://www.codeproject.com/Tips/840623/Android-Character-Recognition), as they autoload the trained data file.
However, I misunderstood how it worked. I originally thought that the TessDataManager did a file lookup on the project's local tesseract/tessdata folder in order to get the trained data file (as I do this also on iOS). However, that's not what it does. It, rather, checks the internal file structure (data/data/projectname/files/tesseract/tessdata/traineddatafilegoeshere) to see if the file exists and if it doesn't, it copies over the trained data file it keeps in the Resources/Raw directory. In my case, it defaulted to the eng file, so it never read my custom font file.
Hopefully this helps someone else having similar issues. Thanks to Robin and RmTheis for all of your help!
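The copy-if-missing behaviour described above can be sketched as follows. This is an illustration in plain java.nio so that it runs off-device, not the code of the linked project; the class name, method name, and parameters are hypothetical. On Android, the data root would be the app's internal files directory and the stream would come from the app's bundled resources.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper illustrating "copy the bundled traineddata if it is missing".
final class TrainedDataInstaller {
    private TrainedDataInstaller() {}

    /**
     * Ensures dataRoot/tessdata/LANG.traineddata exists, copying it from the
     * bundled stream only when the file is not already there.
     * Returns the path of the trained data file.
     */
    static Path ensureTrainedData(Path dataRoot, String lang, InputStream bundled)
            throws IOException {
        Path tessdata = dataRoot.resolve("tessdata");
        Files.createDirectories(tessdata);
        // The language name decides which file is looked up, so it must match
        // the custom traineddata file rather than a default such as "eng".
        Path target = tessdata.resolve(lang + ".traineddata");
        if (!Files.exists(target)) {
            Files.copy(bundled, target);
        }
        return target;
    }
}
```

The point of the sketch is the lookup key: if the language name passed in is the default one, the custom file is never copied or read, which matches the symptom described above.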
I am using tess-two (Tesseract Android Tools) for my project. From my research, I found a way to limit the types of characters, but not the range of characters.
The Tess-Two library I am using doesn't have a tessdata/config file, so how can I limit the possible characters tesseract recognises?
How can I limit Tesseract to recognise a range of digits (20 to 30)?
If your digits are in an image and the image is clear, you can use the following command, which writes the result to outputbase.txt:
"tesseract imageName outputbase digits"
If the image is not clear, you will need to preprocess it first, or the result will not be accurate.
Hope this will help you.
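The digits configuration restricts which characters are recognized, not which values they form, so a range such as 20 to 30 has to be checked after recognition. A minimal sketch of that check in plain Java follows; the class and method names are hypothetical and not part of tess-two.

```java
import java.util.OptionalInt;

// Hypothetical helper: accepts recognized text only if it is a number in [min, max].
final class NumberRangeFilter {
    private NumberRangeFilter() {}

    static OptionalInt parseInRange(String ocrText, int min, int max) {
        String trimmed = ocrText.trim();
        // At most 9 digits, so Integer.parseInt cannot overflow.
        if (trimmed.isEmpty() || trimmed.length() > 9) {
            return OptionalInt.empty();
        }
        for (int i = 0; i < trimmed.length(); i++) {
            char c = trimmed.charAt(i);
            if (c < '0' || c > '9') {
                return OptionalInt.empty();
            }
        }
        int value = Integer.parseInt(trimmed);
        return (value >= min && value <= max) ? OptionalInt.of(value) : OptionalInt.empty();
    }

    public static void main(String[] args) {
        System.out.println(parseInRange("25\n", 20, 30).isPresent());
        System.out.println(parseInRange("31", 20, 30).isPresent());
    }
}
```

An empty result can then be treated as a failed read, for example by asking the user to retake the picture.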
We are using Tesseract.NET (and the Android version too) to recognize and extract document data. It worked really well with Arial and Cambria fonts, but now we have to recognize documents like this:
Tesseract cannot recognize it. Absolutely nothing (except the big sized serial number on the right upper corner).
We tried to train it, but - maybe it's our fault - it's still unstable.
What can we do?
(By the way, the font is used by national offices; we cannot get it as TrueType or any other font format.)
In its current form, it is very hard for an OCR tool to recognize any letters.
Serif fonts are hard to OCR.
Letters are very close together. Some are joined.
A dictionary is not of any help.
You might be able to improve the result with the following:
As this looks like a vehicle registration certificate, you should be able to predict the positions of the text strings of interest and then OCR them separately, using the -psm 7 or 8 option (assume a single line or a single word; the flag is --psm in Tesseract 4 and later).
As some strings seem to be numbers only, you can help Tesseract by using the digits argument.
For the alphanumeric strings, it might help to reduce the dictionary pruning (or to remove the dawg files completely).
If strings like 'ETZ' or 'MZ' are abbreviations, you could also build a dictionary with those.
Reducing the yellow and green color is also an (easy) option you could test.
Use the barcode instead of trying to OCR the string.
For tesseract questions it always helps if you specify the version used and, if you do image preprocessing, provide a sample image of the processed input.
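Two of the suggestions above, cropping each field separately and reducing the yellow and green background, can be sketched together. This is a rough illustration only, under several assumptions: it uses desktop java.awt so that it runs off-device (on Android you would work with Bitmap instead), the field rectangle is supplied by the caller, the class and method names are hypothetical, and taking max(R, G, B) per pixel is just one possible way to turn saturated yellow and green into near-white while leaving dark text dark.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;
import java.awt.image.WritableRaster;

// Hypothetical helper: crops one field and flattens colored backgrounds.
final class FieldPreprocessor {
    private FieldPreprocessor() {}

    /**
     * Returns a grayscale copy of the given field where each pixel is
     * max(R, G, B) of the source pixel. The field must lie inside the page;
     * otherwise getRGB throws an exception.
     */
    static BufferedImage cropAndFlatten(BufferedImage page, Rectangle field) {
        BufferedImage out =
                new BufferedImage(field.width, field.height, BufferedImage.TYPE_BYTE_GRAY);
        WritableRaster raster = out.getRaster();
        for (int y = 0; y < field.height; y++) {
            for (int x = 0; x < field.width; x++) {
                int rgb = page.getRGB(field.x + x, field.y + y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Yellow (255,255,0) and green (0,255,0) both map to 255;
                // black text (0,0,0) stays 0.
                raster.setSample(x, y, 0, Math.max(r, Math.max(g, b)));
            }
        }
        return out;
    }
}
```

Each cropped field would then be passed to Tesseract on its own, with the single-line or single-word page segmentation mode mentioned above. Whether this color handling helps depends on the actual document, so it should be tested against real scans.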
I have a dictionary of words in a text file, separated by newlines. And I want to recognize the handwriting using Tesseract, and output the nearest matching line in the text file.
This is the first time I'll be using Tesseract. It's already in my project workspace; I just need the training data.
Is it possible to train Tesseract to do this?
It's possible to train tesseract to recognize handwriting. Here are the instructions: https://tesseract-ocr.github.io/tessdoc/Training-Tesseract
But don't expect very good results. Academics have typically reported accuracy topping out at about 90%, for both words and numbers. So if your use case can tolerate an error rate of at least 1 in 10, this might work for you.
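For the second half of the question, outputting the nearest matching line from the word list, one common approach is to compare the recognized text against every dictionary line by edit distance. The sketch below uses Levenshtein distance in plain Java; this is an assumption about what "nearest" should mean, and the class and method names are hypothetical. Tesseract itself is not involved in this step.

```java
import java.util.List;

// Hypothetical helper: picks the dictionary line closest to the OCR output.
final class NearestWord {
    private NearestWord() {}

    /** Levenshtein distance: minimum number of single-character edits from a to b. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the closest dictionary entry, or null if the dictionary is empty. */
    static String nearest(String ocrText, List<String> dictionary) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String word : dictionary) {
            int d = levenshtein(ocrText, word);
            if (d < bestDistance) {
                bestDistance = d;
                best = word;
            }
        }
        return best;
    }
}
```

The dictionary list would be the lines of the text file. Ties go to the earlier line, and a maximum acceptable distance could be added to reject reads that match nothing well.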