I am using tess-two (Tesseract Android Tools) for my project. From the research I've done, I found a way to limit the types of characters that are recognised, but not the range of values.
The Tess-Two library I am using doesn't have a tessdata/config file, so how can I limit the possible characters tesseract recognises?
How can I limit Tesseract to recognise a range of digits (20 to 30)?
If your digits are in an image and the image is clear, you can use the following command:
"tesseract imageName outputbase digits"
Tesseract writes the result to outputbase.txt. If the image is not clear, you will need to preprocess it first, or the result will not be accurate.
Hope this will help you.
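Note that the digits mode only restricts which characters are recognised, not their numeric value, so a range such as 20 to 30 would still have to be checked after recognition. A minimal sketch in plain Java, assuming the OCR result is already available as a String; `DigitRangeFilter` and `numbersInRange` are hypothetical names, not part of tess-two or Tesseract:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of tess-two or Tesseract.
final class DigitRangeFilter {
    // One or more consecutive ASCII digits.
    private static final Pattern DIGIT_RUN = Pattern.compile("\\d+");

    // Returns every run of digits in ocrText whose value lies in [min, max],
    // in the order in which the runs appear.
    static List<Integer> numbersInRange(String ocrText, int min, int max) {
        List<Integer> hits = new ArrayList<>();
        Matcher m = DIGIT_RUN.matcher(ocrText);
        while (m.find()) {
            String run = m.group();
            if (run.length() > 9) {
                continue; // too long to parse safely as an int
            }
            int value = Integer.parseInt(run);
            if (value >= min && value <= max) {
                hits.add(value);
            }
        }
        return hits;
    }
}
```

Calling `DigitRangeFilter.numbersInRange(recognizedText, 20, 30)` would keep only the values between 20 and 30 and discard everything else the engine returned.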
I am developing an Android OCR app.
I built the OCR with tess-two on Android.
I have downloaded the 'traineddata' file, and English output works. But I want to output numbers only.
On the Internet I found the command 'tesseract image.tif outputbase nobatch digits', along with advice to insert the generated files.
But I did not understand what that means.
Please tell me the easiest way.
You will need to set the tessedit_char_whitelist variable, as follows (in tess-two's Java API the method is setVariable, with a lowercase s):
baseApi.setVariable("tessedit_char_whitelist", "0123456789");
See "Android OCR detecting digits only using popular tessercat fork tess-two" or "extracting numbers from Bitmap in android using tess-two library".
We are using Tesseract.NET (and the Android version too) to recognize and extract document data. It worked really well with Arial and Cambria fonts, but now we have to recognize documents like this:
Tesseract cannot recognize it. Absolutely nothing (except the large serial number in the upper right corner).
We tried to train it, but - maybe it's our fault - it's still unstable.
What can we do?
(By the way, the font is used by national offices; we cannot get it as TrueType or in any other font format.)
In its current form the document is very hard for any OCR tool to read:
Serif fonts are hard to OCR.
Letters are very close together. Some are joined.
A dictionary is not of any help.
You might be able to improve the result with the following:
As this looks like a vehicle registration certificate, you should be able to predict the positions of the text strings of interest and then OCR them separately.
For those crops, use the -psm 7 or -psm 8 option (--psm in Tesseract 4 and later) so that Tesseract assumes a single line or a single word.
As some strings seem to be numbers only you can help tesseract by using the digits argument.
For the alphanumeric strings it might help to reduce the dictionary pruning (or completely remove the dawg files.)
If strings like 'ETZ' or 'MZ' are abbreviations, you could also build a dictionary with those.
Reducing the yellow and green color is also an (easy) option you could test.
Use the barcode instead of trying to ocr the string.
For tesseract questions it always helps if you specify the version used and, if you do image preprocessing, provide a sample image of the processed input.
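One possible reading of "reducing the yellow and green color" is to keep only the green channel before recognition. This is a sketch under the assumption that the yellow and green areas are light background tints and the text is dark ink; `ColorSuppressor` is a hypothetical helper, and the input is assumed to be an array of ARGB pixels such as the one Android's Bitmap.getPixels fills:

```java
// Hypothetical helper for illustration; not part of Tesseract or tess-two.
final class ColorSuppressor {
    // Maps each ARGB pixel to an opaque gray pixel whose level equals the
    // pixel's green channel. Yellow and green areas have a high green value
    // and therefore become light, while dark ink stays dark.
    static int[] greenChannelToGray(int[] argbPixels) {
        int[] out = new int[argbPixels.length];
        for (int i = 0; i < argbPixels.length; i++) {
            int g = (argbPixels[i] >> 8) & 0xFF;            // extract green
            out[i] = 0xFF000000 | (g << 16) | (g << 8) | g; // opaque gray
        }
        return out;
    }
}
```

Whether this helps depends on the actual colors in the scan, so it is worth comparing the OCR output with and without the step.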
I am doing a project on OCR in Android to recognize numbers from images. My images contain only numbers. I tried the 'eng.traineddata' file, but the accuracy is too low, below about 40%. Does anyone know of a 'traineddata' file for digits? I tried Tesseract training, but I am not familiar with it. Please help me find a 'traineddata' file for recognizing numbers.
If you want to recognize only numbers, then set the Tesseract whitelist value to "0123456789".
I'm building an OCR app. I have already binarized the image, but I need to know how to match a font against images. I have come across Tesseract, but it is a ready-made tool. What I actually need to know is the algorithm behind matching image text against a font in .ttf format. If Tesseract is the only choice for Android, would you please describe some steps for integrating it on Windows 7, as I'm not clear from Gautam's Blog. If there is any other built-in method for Android that matches an image pattern against a .ttf file, please suggest it. Thanks in advance.
You'll have to train the Tesseract engine on your font. There is an exhaustive tutorial on this topic on the project's website. You don't have to train Tesseract on an Android device, but you will have to deploy the training results to it.
I have a dictionary of words in a text file, separated by newlines. And I want to recognize the handwriting using Tesseract, and output the nearest matching line in the text file.
This is the first time I'll be using Tesseract, and it's already in my project workspace, I just need the training data.
Is it possible to train Tesseract to do this?
It's possible to train tesseract to recognize handwriting. Here are the instructions: https://tesseract-ocr.github.io/tessdoc/Training-Tesseract
But don't expect very good results. Academics have typically gotten accuracy results topping out at about 90%. Here are a couple of references for words and numbers. So if your use case can tolerate an error rate of at least 1 in 10, this might work for you.
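For the "nearest matching line" part of the question, one option is to post-process the recognised text instead of relying on training alone. A minimal sketch, assuming the dictionary has already been read into a list of strings and using Levenshtein edit distance as the similarity measure (a choice made here, not something the question specifies); `NearestWord` is a hypothetical helper:

```java
import java.util.List;

// Hypothetical helper for illustration; not part of Tesseract.
final class NearestWord {
    // Levenshtein distance: the minimum number of single-character
    // insertions, deletions and substitutions that turn a into b.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the dictionary entry closest to ocrText, or null if the
    // dictionary is empty. Ties go to the earlier entry.
    static String nearest(String ocrText, List<String> dictionary) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String word : dictionary) {
            int d = editDistance(ocrText, word);
            if (d < bestDistance) {
                bestDistance = d;
                best = word;
            }
        }
        return best;
    }
}
```

With this approach a misread such as "he1lo" would still map to "hello" if that is the closest dictionary line, which can absorb some of the recognition errors.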