I have a dictionary of words in a text file, one word per line. I want to recognize handwriting using Tesseract and output the nearest matching line in the text file.
This is the first time I'll be using Tesseract. It's already in my project workspace; I just need the training data.
Is it possible to train Tesseract to do this?
It's possible to train Tesseract to recognize handwriting. Here are the instructions: https://tesseract-ocr.github.io/tessdoc/Training-Tesseract
But don't expect very good results. Academics have typically gotten accuracy topping out at about 90%. Here are a couple of references for words and numbers. So if your use case can tolerate roughly one error in ten, this might work for you.
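For the "nearest matching line" part, you can fuzzy-match whatever Tesseract returns against your word list after recognition; Tesseract itself won't do that for you. Below is a minimal sketch of that step. The file name words.txt and the plain Levenshtein edit distance are just assumptions for illustration.

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;

    public class NearestWord {

        // Classic dynamic-programming Levenshtein edit distance.
        static int distance(String a, String b) {
            int[] prev = new int[b.length() + 1];
            int[] curr = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) prev[j] = j;
            for (int i = 1; i <= a.length(); i++) {
                curr[0] = i;
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
                }
                int[] tmp = prev; prev = curr; curr = tmp;
            }
            return prev[b.length()];
        }

        // Returns the dictionary line with the smallest edit distance to the OCR output.
        static String nearest(String ocrOutput, List<String> dictionary) {
            String best = null;
            int bestDist = Integer.MAX_VALUE;
            for (String word : dictionary) {
                int d = distance(ocrOutput.toLowerCase(), word.toLowerCase());
                if (d < bestDist) { bestDist = d; best = word; }
            }
            return best;
        }

        public static void main(String[] args) throws Exception {
            // "words.txt" is the newline-separated dictionary from the question.
            List<String> dictionary = Files.readAllLines(Paths.get("words.txt"));
            String ocrOutput = "exarnple";   // whatever Tesseract recognized
            System.out.println(nearest(ocrOutput, dictionary));
        }
    }

A linear scan like this is cheap compared to the OCR call itself; if it ever becomes the bottleneck, an index such as a BK-tree over the word list is the usual next step.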
I'm a word puzzle junky in my spare time, so I've spent a LOT of other spare time working on a helper program that allows wildcards in search patterns. It works great. On my Dell Laptop (i5, 8GB RAM) the search of a 140,000-word "dictionary" for wildcard matches for words has an almost imperceptible and definitely acceptable delay that occurs only if tens of thousands of words are returned. Java rules. So does its implementation of regex and match().
I was hoping to port it to Android. I worked all day getting a more-or-less equivalent app to compile. No chance with the given code architecture.
The problem is that leading wildcard characters can (must) be allowed. E.g., ???ENE returns 15 matches--from achENE to xylENE--and *RAT returns 22 matches--from aristocRAT through zikuRAT--i.e., all 140,000 words must (?) be searched, which is going to take aaaaaaaaawhiiiiiiiiile on most (all?) Android devices. (Each took less than a second on my laptop.) (It takes my PC 3 seconds to return all 140,000 words and a little longer to eyeball them all.)
Since some word puzzles allow variable numbers of letters in words, disallowing leading wildcards cuts the heart out of the app for such puzzles. But if the search pattern had to start with a letter it would be easy enough to then do a binary search (or something quicker). (And it still might be unacceptably slow.)
Anyway, I was wondering if anybody might know some algorithm or can think of some approach that might be applied to speed up searches with leading wildcard characters.
I believe that the optimized version of what you are trying to do is widely known as the Unix/Linux utility "grep", which, if I remember correctly, uses the Boyer-Moore search algorithm.
Under the covers, Java's Pattern class uses a Boyer-Moore-style optimization for literal substrings. And it supports regex, so if you can write something to turn your wildcard search patterns into regular expressions, you can use Pattern.
There's an interesting Java implementation of grep at http://www.java2s.com/Code/Java/Regular-Expressions/AnotherGrep.htm
It uses memory-mapped files. I'm guessing that you won't be able to fit your entire word list into memory, but you could split it up into a bunch of smaller files - the implementation above memory-maps one file at a time. You'd have to do some testing to find the optimal size of a file.
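Here's a minimal sketch of that wildcard-to-regex conversion plus a Pattern-based scan of the word list; the mapping of ? to exactly one letter and * to any run of letters is my assumption about the puzzle syntax.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Pattern;

    public class WildcardSearch {

        // Turn a puzzle pattern like "???ENE" or "*RAT" into a regex.
        // '?' -> exactly one letter, '*' -> zero or more letters; everything else is quoted.
        static Pattern toRegex(String wildcardPattern) {
            StringBuilder sb = new StringBuilder();
            for (char c : wildcardPattern.toCharArray()) {
                if (c == '?')      sb.append("[a-z]");
                else if (c == '*') sb.append("[a-z]*");
                else               sb.append(Pattern.quote(String.valueOf(Character.toLowerCase(c))));
            }
            return Pattern.compile(sb.toString());
        }

        // Linear scan of the word list; matches() anchors the pattern to the whole word.
        static List<String> search(String wildcardPattern, List<String> words) {
            Pattern p = toRegex(wildcardPattern);
            List<String> hits = new ArrayList<>();
            for (String w : words) {
                if (p.matcher(w.toLowerCase()).matches()) hits.add(w);
            }
            return hits;
        }
    }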
I just Googled and found that keeping a second copy of the list with each word reversed and then alphabetized might be a way to turn a leading wildcard into a trailing one, opening the door to a binary search on the start of the pattern. Interesting. But *a???ene* is also a legal search pattern in the program. What then? (Yeah. How often would you need such a search?)
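A minimal sketch of that reversed-list idea, assuming the simple case of a single leading wildcard followed by letters (a pattern like *a???ene would still need the linear scan):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class ReversedIndex {

        private final List<String> reversedSorted = new ArrayList<>();

        // Build a second list holding every word reversed, then sort it once.
        public ReversedIndex(List<String> words) {
            for (String w : words) {
                reversedSorted.add(new StringBuilder(w.toLowerCase()).reverse().toString());
            }
            Collections.sort(reversedSorted);
        }

        // Find all words matching "*SUFFIX": reverse the suffix and do a prefix
        // range scan on the sorted reversed list, located with a binary search.
        public List<String> endingWith(String suffix) {
            String prefix = new StringBuilder(suffix.toLowerCase()).reverse().toString();
            int lo = Collections.binarySearch(reversedSorted, prefix);
            if (lo < 0) lo = -lo - 1;   // insertion point if no exact match
            List<String> hits = new ArrayList<>();
            for (int i = lo; i < reversedSorted.size() && reversedSorted.get(i).startsWith(prefix); i++) {
                hits.add(new StringBuilder(reversedSorted.get(i)).reverse().toString());
            }
            return hits;
        }
    }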
I just found this about Apache Lucene:
Leading wildcards (e.g. *ook) are not supported by the QueryParser by default. As of Lucene 2.1, they can be enabled by calling QueryParser.setAllowLeadingWildcard( true ). Note that this can be an expensive operation: it requires scanning the list of tokens in the index in its entirety to look for those that match the pattern.
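For completeness, enabling it looks roughly like this; the class names below are from recent Lucene releases (older versions keep QueryParser in a different package), and "word" is just a placeholder field name:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.Query;

    public class LeadingWildcardDemo {
        public static void main(String[] args) throws Exception {
            QueryParser parser = new QueryParser("word", new StandardAnalyzer());
            parser.setAllowLeadingWildcard(true);   // expensive: scans the whole term list
            Query q = parser.parse("*rat");
            System.out.println(q);
        }
    }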
I am using the tess-two Tesseract Android Tools for my project. From the research I've done, I found (from here) a way to limit the types of characters, but not the range of characters.
The tess-two library I am using doesn't have a tessdata/config file, so how can I limit the possible characters Tesseract recognises?
How can I limit Tesseract to recognise a range of digits (20 to 30)?
If your digits are in an image and the image is clear, you can use the following command:
"tesseract imageName outputBase digits"
The recognized text ends up in outputBase.txt. But if the image is not clear you will need to preprocess it, or you will not get an exact result.
Hope this will help you.
We are using Tesseract.NET (and the Android version too) to recognize and extract document data. It worked really well with Arial and Cambria fonts, but now we have to recognize documents like this:
Tesseract cannot recognize it. Absolutely nothing (except the large serial number in the upper right corner).
We tried to train it, but - maybe it's our fault - it's still unstable.
What can we do?
(Btw, the font is used by national offices; we cannot get it as TrueType or in any other font format.)
In its current form it is very hard for an OCR tool to recognize any letters.
Serif fonts are hard to OCR.
Letters are very close together. Some are joined.
A dictionary is not of any help.
You might be able to improve the result with the following:
As this looks like a vehicle registration certificate, you should be able to predict the positions of the text strings of interest and then OCR them separately,
thereby using the -psm 7 or 8 option (--psm in newer Tesseract versions) to assume a single text line or a single word; a sketch using the Android tess-two API follows these suggestions.
As some strings seem to be numbers only, you can help Tesseract by using the digits config.
For the alphanumeric strings it might help to reduce the dictionary pruning (or completely remove the DAWG files).
If strings like 'ETZ' or 'MZ' are abbreviations, you could also build a dictionary with those.
Reducing the yellow and green color is also an (easy) option you could test.
Use the barcode instead of trying to OCR the string.
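Since you mention using the Android version too, here is a rough sketch of the single-line plus digits idea with tess-two; the data path is a placeholder and I'm assuming you crop each field of interest out of the page yourself:

    import android.graphics.Bitmap;
    import com.googlecode.tesseract.android.TessBaseAPI;

    public class FieldOcr {

        // Reads one pre-cropped field (e.g. a serial number) as a single line of digits.
        // "/sdcard/tesseract/" is a placeholder; point it at the directory that
        // contains your tessdata folder.
        public static String readNumericField(Bitmap croppedField) {
            TessBaseAPI baseApi = new TessBaseAPI();
            baseApi.init("/sdcard/tesseract/", "eng");

            // Tell Tesseract it is looking at a single text line.
            baseApi.setPageSegMode(TessBaseAPI.PageSegMode.PSM_SINGLE_LINE);

            // Restrict the character set for number-only fields (the "digits" idea).
            baseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, "0123456789");

            baseApi.setImage(croppedField);
            String text = baseApi.getUTF8Text();
            baseApi.end();
            return text;
        }
    }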
For Tesseract questions it always helps if you specify the version used and, if you do image preprocessing, provide a sample image of the processed input.
I am wondering if there's a way to tell whether a given text is human readable. By human readable, I mean: it has some meaning and a format like an article written by somebody, or at least was generated by a software translator with the intent of being read by a human.
Here's the background story: recently I have been making an app that allows users to upload short texts to a database. At an early stage of deployment I noticed that some users kept uploading corrupted text due to a problem with encoding. That problem has since been fixed, but it left me wondering whether there's a way to pick out non-human-readable text before serving the text back to users.
Any advice will be appreciated. The scope might be too large to include other languages, so at the moment let's limit the discussion to English only.
You can try a language identification tool, or something similar.
Basically you have to count the characters, or groups of characters (character n-grams), and compare the distribution of the letters in the submitted text with the distribution of the letters in a collection of texts written in good English. (Make sure that such a collection of texts is representative of the expected input.)
In the same spirit as the n-gram approach, you might want to try a dictionary-based approach and check for the presence of 'stop words' (e.g. 'the', 'a', 'an', 'of') in the input text.
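A rough sketch of that stop-word heuristic; the word set and the 5% threshold are arbitrary choices of mine and would need tuning against real data:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class ReadabilityCheck {

        private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
                "the", "a", "an", "of", "and", "to", "in", "is", "it", "that"));

        // Very rough heuristic: real English text almost always contains some
        // common function words; corrupted or random text almost never does.
        static boolean looksLikeEnglish(String text) {
            String[] tokens = text.toLowerCase().split("[^a-z]+");
            int stopWordCount = 0;
            for (String t : tokens) {
                if (STOP_WORDS.contains(t)) stopWordCount++;
            }
            // Arbitrary threshold: at least ~5% of the tokens are stop words.
            return tokens.length > 0 && (double) stopWordCount / tokens.length >= 0.05;
        }
    }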
Most NLP libraries will do the job (spaCy is a very common one). You can also go for language detection: langdetect will support you on this
(https://pypi.org/project/langdetect/), as many others will. If you need to be less language-specific (more math than linguistics), you could look into phonotactics (e.g. BLICK for Python: https://github.com/mmcauliffe/python-BLICK), which looks at how the characters in a string are ordered.
Do a hexdump and make sure each character is less than or equal to 0x7f.
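In code that check is a single loop; a minimal Java version (note it only catches corruption that produces non-ASCII characters):

    public class AsciiCheck {
        // Returns true if every character in the string is plain 7-bit ASCII (<= 0x7f).
        static boolean isAscii(String text) {
            for (int i = 0; i < text.length(); i++) {
                if (text.charAt(i) > 0x7f) return false;
            }
            return true;
        }
    }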
I am doing a project on OCR in Android to recognize numbers from an image. My image contains only numbers. I tried the 'eng.traineddata' file, but the accuracy of the result is too low, below about 40%. Does anyone know of a 'traineddata' file for digits? I tried the Tesseract training, but I am not familiar with training. Please help me find a 'traineddata' file for recognizing numbers.
If you want to recognize only numbers, then set the Tesseract whitelist (the tessedit_char_whitelist variable) to "0123456789".
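With tess-two on Android that is a single setVariable call; a minimal sketch (the data path argument is whatever directory contains your tessdata folder):

    import com.googlecode.tesseract.android.TessBaseAPI;

    public class DigitsOnly {
        // Configure a tess-two instance so recognition only ever returns digits.
        public static TessBaseAPI create(String tessDataParentDir) {
            TessBaseAPI baseApi = new TessBaseAPI();
            baseApi.init(tessDataParentDir, "eng");   // expects tessDataParentDir/tessdata/eng.traineddata
            baseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, "0123456789");
            return baseApi;
        }
    }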