Improve accuracy of Tesseract

Improve accuracy of Tesseract - android

I am using tess-two (the Android port of the Tesseract OCR engine) to OCR bills and receipts. I am using OpenCV 3.0 to preprocess the image, and I have successfully converted a bill into a binary image using adaptive thresholding.
Original image:
Binary threshold of original:
Tesseract output.
PH:26051246/26145398
TIN=276302417620
CQSH/BILL
NO 009762 0 m 0 09-07-2015
DESCRIPTION on mm mu
_.__________
m 3.990 75.18 zoo-00
_____\
CASH 300-00
m: YOU----UISIT 9691!!
w TIN:276302417620
-\ c1 09:27:58 we no. 0
At present I have trained 3 dot-matrix fonts and two merchant-copy fonts. I have turned off the dictionary and added user-words for all five fonts; these are terms commonly used in bills and receipts. Strangely, these changes don't seem to make any difference in the output.
I resized the image so that the font size is at least 12 pt. How do I improve the accuracy further? Can anyone suggest a font, or should I retrain the fonts?

Related

Print strings on Apex3 using printer command language

How to print strings on Apex3 using printer command language or SDK?
I created a library for Android applications that prints Bitmap images on mobile thermal printers
(Apex3, 3nStar, PR3, Bixolon, Sewoo LK P30) using the appropriate SDKs.
It works fine but is pretty slow: every 30 cm ticket takes 20-40 seconds to print, depending on the type of printer.
For 3nStar and Apex3 I started working on improvements to speed up printing.
For 3nStar I developed a class that prints a header logo and formatted text with Spanish characters
(different alignments and different fonts for different parts of a ticket),
so the ticket looks very similar to the Bitmap image but the printing time is only 6 seconds.
Now I have to do the same with Apex3, and here I got stuck.
Here is how I do it on 3nStar for strings:
I write bytes to the outputStream; these bytes are commands that tell the printer what to do.
outputStream.write(some_bytes)
The first command is always
{0x1b, 0x74, 40} //Esc t ( -- [ISO8859-15 (Latin9)]
to print Spanish characters.
Then, in a loop, for n strings:
I choose a font
{0x1B, 0x21, 0x00}//Esc ! 0 -- 0 is normal, 8 is bold etc.
By changing the last byte I select different fonts: normal, bold, double height, double width, and combinations of these.
Next I choose an alignment
{0x1b, 0x61, 48} //Esc a 48 for left, 49 for center, 50 for right
Then I convert the string to bytes using ISO_8859_1 (to print Spanish characters) and write them to the outputStream as well.
outputStream.write(messageString.getBytes(StandardCharsets.ISO_8859_1))
The last byte to send is
{0x0a} // Move on next line
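The per-line sequence described above (font, alignment, encoded text, line feed) could be assembled roughly as follows. This is only a sketch: the class and method names are hypothetical, and the command bytes are simply the ones quoted in this question for the 3nStar, not values verified against any printer manual.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; names are illustrative, bytes are those quoted in the question.
final class EscPosLineSketch {
    // Esc t 40: code page selection, sent once before any text.
    static final byte[] SELECT_CODE_PAGE = {0x1b, 0x74, 40};

    // Esc ! n: 0 is normal, 8 is bold, etc.
    static byte[] fontCommand(int mode) {
        return new byte[] {0x1b, 0x21, (byte) mode};
    }

    // Esc a n: 48 left, 49 center, 50 right.
    static byte[] alignCommand(int alignment) {
        return new byte[] {0x1b, 0x61, (byte) alignment};
    }

    // Builds the bytes for one printed line: font, alignment, text, line feed.
    static byte[] encodeLine(String text, int fontMode, int alignment) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        put(out, fontCommand(fontMode));
        put(out, alignCommand(alignment));
        put(out, text.getBytes(StandardCharsets.ISO_8859_1));
        out.write(0x0a); // move to the next line
        return out.toByteArray();
    }

    private static void put(ByteArrayOutputStream out, byte[] bytes) {
        out.write(bytes, 0, bytes.length);
    }
}
```

The resulting array can be passed to outputStream.write(...) after SELECT_CODE_PAGE has been sent once.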
The above approach doesn't work with Apex3. I also failed using
http://www.old.adtech.pl/upload/Extech_Printer_Command_Language%20Rev_H_05062009.pdf
even though page 1 of that manual says it applies to the Apex3.
I think I am missing something. I have started looking at how to do it with an SDK feature of Android_SDK_ESC_V1.01.17.01PRO.jar,
but I would prefer to do it by writing bytes directly.

I found the answer in this manual.
In short, the differences from the approach I described for 3nStar are:
1) Before printing, set a character set
ESC F 1 //Esc t - for 3nStar
2) Set a font for the text, for example
ESC K 1 3 CR //ESC F 1 - for 3nStar
3) Send a line of text with alignment
For 3nStar I can send an alignment command before sending the text, like
ESC a 49 //Centering
But for Apex3 I have to know the line length, which depends on the type of font, and the length of the string being printed.
Then I compute
freeSpace = lineLength - printingString.length
and put the spaces at the beginning of the line (right alignment),
at the end (left alignment), or divide them between both sides (centering).
So for both types of printers I use the same logic, which differs in only 3 places.
This is a simplified explanation, as the real code includes several classes with hundreds of lines of code.
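The space-padding alignment described for Apex3 might look like the sketch below. The class, enum, and method names are hypothetical; lineLength is assumed to be the number of characters that fit on one line for the currently selected font, which has to be looked up per font.

```java
// Hypothetical helper illustrating alignment by padding with spaces.
final class LinePaddingSketch {
    enum Align { LEFT, CENTER, RIGHT }

    // Pads text with spaces so it fills lineLength characters.
    // Text longer than the line is returned unchanged.
    static String pad(String text, int lineLength, Align align) {
        int freeSpace = Math.max(0, lineLength - text.length());
        int before;
        switch (align) {
            case RIGHT:
                before = freeSpace;       // all spaces in front
                break;
            case CENTER:
                before = freeSpace / 2;   // split; any odd space goes at the end
                break;
            default:
                before = 0;               // LEFT: all spaces at the end
        }
        int after = freeSpace - before;
        StringBuilder sb = new StringBuilder(text.length() + freeSpace);
        for (int i = 0; i < before; i++) sb.append(' ');
        sb.append(text);
        for (int i = 0; i < after; i++) sb.append(' ');
        return sb.toString();
    }
}
```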

Parse only a specific part of image with Tesseract

I am trying to use Tesseract OCR on Android to read the state of a gas meter when you take a picture of it:
This is the output when I parse this image:
vb"
22% BK-G4T ||||||||I||||I|||ii\|||\
’ 64 2007
22?: 06.0"! 'm'lm Mm. 23212274 ,
v 2,0 dm’ 1
pmn 0_5 bar tm ~25°C v‘40"(1 I
1amp é 0_o1m’ sb15°cl :Sp 20°c l
'I ELSTEQ~I¢¢>>InstrogwnSs HB Z _ 18 _ 1013 . ‘
a, 069373593435- 3 I
i'23212214 Y _ w w V'
g
The idea is to extract the first 5 digits of the gas meter reading (06937 in this image).
My question is, is there a way to train Tesseract to only parse this part of the image? Absolute coordinates are not an option since every picture would be different. I am guessing the best logic would be something like: parse only white numbers on black background.

By changing the page segmentation mode (psm), tesseract 4.00.00 alpha is able to read the meter line correctly as 06937598-m3, alongside other characters from the rest of the image.
The command used is:
tesseract meter.jpg output --psm 11 -l eng
--psm 11 means to recognize "Sparse text. Find as much text as possible in no particular order".
Here is the output file, shown with all the ASCII control characters.
If --psm 11 works on other meter images, then you just need to search for -m3 at the end of a line to extract the whole meter line. With that, you can get the first 5 digits right away.
Hope this helps.
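The "search for -m3 at the end of the line" step could be sketched with a regular expression as below. The class name and the exact pattern are assumptions for illustration; real OCR output may need a more tolerant pattern.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds a line ending in "-m3" and returns its first five digits.
final class MeterReadingSketch {
    // Five captured digits, any further digits, then "-m3" at the end of a line.
    private static final Pattern METER_LINE =
            Pattern.compile("(\\d{5})\\d*\\s*-\\s*m3\\s*$", Pattern.MULTILINE);

    static Optional<String> firstFiveDigits(String ocrText) {
        Matcher m = METER_LINE.matcher(ocrText);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

For the reading quoted above, firstFiveDigits("06937598-m3") yields 06937.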

How are the emoji images encoded in AndroidEmoji-htc.ttf file?

What file type is being used to embed the images in AndroidEmoji-htc.ttf? direct download: AndroidEmoji-htc.ttf
Images can be extracted from AppleColorEmoji.ttf easily because the PNG headers can be found using a hex editor. This ruby script can extract them. The algorithm is described here.
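For reference, looking for PNG headers in a font file amounts to scanning for the fixed 8-byte PNG signature. The sketch below is not the linked ruby script, just a minimal illustration with hypothetical names; it only reports where embedded PNG streams appear to start.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reports every offset at which the PNG signature occurs.
final class PngScanSketch {
    // Standard PNG file signature: 89 50 4E 47 0D 0A 1A 0A.
    private static final byte[] PNG_SIGNATURE =
            {(byte) 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A};

    static List<Integer> findPngOffsets(byte[] data) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i + PNG_SIGNATURE.length <= data.length; i++) {
            boolean match = true;
            for (int j = 0; j < PNG_SIGNATURE.length; j++) {
                if (data[i + j] != PNG_SIGNATURE[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                offsets.add(i);
            }
        }
        return offsets;
    }
}
```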
Sample of file in hex editor:
0706 2627 2626 2726 2627 2626 2727 2727 ..&'&&'&&'&&''''
2726 2627 2626 2726 2627 2636 3736 3637 '&&'&&'&&'&67667
3636 3736 3637 3737 3737 3636 3736 3637 6676677777667667
3636 0131 3636 3736 3635 3527 2626 3132 66.16676655'&&12
2627 2626 2726 2627 2721 2115 1533 3232 &'&&'&&''!!..322
1716 1617 1634 1110 0607 0606 0706 0623 .....4.........#
2315 1533 3335 3523 2226 2726 2627 3434 #..3355#"&'&&'44
3535 3332 3233 1616 1716 1617 1616 1716 553223..........
1617 1616 1716 1617 1616 1716 3237 3236 ............2726
3737 3736 3637 3636 3737 2323 0706 0607 7776676677##....
0606 0706 0627 2226 2726 2627 2626 2726 .....'"&'&&'&&'&
2627 2626 3130 2627 2626 2726 2627 2626 &'&&10&'&&'&&'&&
2727 3535 3736 3637 3636 0131 3232 3332 ''55766766.12232
1617 1616 1716 3017 1616 1516 0607 0606 ......0.........
0706 0607 0606 2323 3534 3437 3636 3736 ......##54476676
3608 bf05 020b 1c2a 0d09 0b0d 66a4 b72c 6......*....f..,
0202 0233 8a8d 9c9c 8d88 3302 0202 022a ...3......3....*
9a8d 631a 3c05 090b 3e1a 1a3e 0b09 0b0d ..c.<...>..>....
65a5 b918 1616 1d67 6572 4028 2a0f 251f e......ger@(*.%.
7365 691b 1814 1a98 8d65 183b 0409 0d2c sei......e.;...,
1a0b 0702 fb8b 1e1d 423e 3f44 2a0d 3a3f ........B>?D*.:?
034f 4f42 4435 3203 1e1b 040d 140d 0704 .OOBD52.........
0303 040d 1b0d 041b 1e03 3235 4442 4f4f ..........25DBOO
033f 3a0d 2a55 5265 1f21 3d40 4044 2a0d .?:.*URe.!=@@D*.
3940 024d 4f44 4235 3502 1d1c 050d 1a0d 9@.MODB55.......
0502 0207 040d 140d 051c 1d02 3535 4244 ............55BD
4f4d 0240 390d 2a56 5301 c806 0d0b 2611 OM.@9.*VS.....&.
0b06 0503 120b 122e 1835 0837 33fe d9fe .........5.73...
dc2e 2c02 0b12 0407 0b02 0306 071a 1006 ..,.............
2a23 f8fb 2a30 070f 1d0d 022a 1818 0914 *#..*0.....*....
Update 6/18/2014:
At @naXa's suggestion, opening the file in FontForge version 20120731-ML (currently the newest version) gave this error:
The following table(s) in the font have been ignored by FontForge
Ignoring 'dcmj'
In GID1 the advance width (2252) is greater than the stated maximum (2048)
Subsequent errors will not be reported.
Bad lookup table: format=6, first=65535 total glyphs in font=894
This is somewhat expected, because emojis in TTFs are to this day encoded in proprietary ways. The fact that I even see black and white emoji images in FontForge is a huge success, because it means the TTF is standard for the most part. I don't think TTFs are supposed to store color information.
The key is probably accessing the data in the dcmj table or wherever it points to. Researching FontForge, I found that BMP is a common image format for TTFs, so I'm going to try to modify the ruby script using those assumptions and report back!
Update: 6/18/14
I found what appear to be BMP headers (source1, source2), starting with 424D, using a hex editor, but the header doesn't seem valid.
Next I would try:
Parsing the TTF and looking at the data in each "glyph" to see if I can find more patterns. I imagine the TTF will specify the start and end of the image data.
Looking into the HTC Android APK to see how they are pulling and displaying emoji from the TTF.
I've run out of time on this for now. If anyone has any other suggestions, I'm very interested.
UPDATE 6/20/2014
Double-clicking on the glyph, per @naXa's suggestion, and exporting in any of the formats gives me non-color icons of any size, but still does not reveal the color bitmap emojis I was looking for.
I finally went down to the store to look at an HTC phone and saw, to my surprise, that they are using Apple's emoji font, as seen in the messaging app:
I am almost certain these are stored in the HTC font provided above, but this conclusion has left extracting these images far less desirable.
However, it would still be cool to know, as a proof of concept, how to extract the color emojis. :)
EDIT: As Jasper pointed out, HTC does in fact have a custom emoji set, as linked in his answer. The picture above was from a phone that had not been updated. I still need to figure out how to extract these emojis!

Unfortunately, I don't have an account, so let me start by apologizing for posting this as an answer instead of as a comment.
The picture you posted showing the Apple Color Emojis appears to come from a phone running an older version of Sense/Android, while the file you are referencing almost definitely comes from Sense 5-6/Android 4.3-4.4. If you look at the grayscale emojis you were able to extract from the file, you'll notice that they don't actually match up with the picture you provided. They do, however, match up with this: http://assets.hardwarezone.com/img/2013/10/HTC_One_Max_Emoticons_Keyboard_jpg.jpg
This leads me to conclude that it is entirely possible that there are no conventional bitmaps stored in the TTF, and that instead there is some proprietary format that they use to assign colors to different parts of each emoji.
EDIT: I tried directly copying the file over to my phone to see what would happen (I tried both replacing NotoColorFont.ttf and just copying the file and referencing it in fallback_fonts.xml; there doesn't seem to be any difference). Screenshot here: http://imgur.com/OGyq6T2
As you can see, they show up without color, yet we already know that both the default Android emoji and Apple Color Emoji show up fine on Android devices, meaning that HTC doesn't follow whatever standard Android and iOS use.
Tested on a Galaxy SII (i9100) running CyanogenMod 11 Milestone 8.

The emoji images in the AndroidEmoji-htc.ttf file are probably (since I don't have the font to test) stored in the same format as the standard Android emoji in Google's CBLC+CBDT OpenType tables.
You can disassemble/reassemble the font using ttx from FontTools(pypi, github) to confirm.
The direct answer to your question "What is the format?" is two options:
Uncompressed Color Bitmaps
A value of ‘32’ in the bitDepth field of the bitmapSizeTable struct defined in the CBLC table identifies color bitmaps with 8-bit blue/green/red/alpha channels per pixel, encoded in that order for each pixel (referred to as BGRA from hereon). The color channels represent pre-multiplied color and encode colors in the sRGB colorspace. For example, the color “full-green with half translucency” is encoded as \x00\x80\x00\x80, and not \x00\xFF\x00\x80.
All imageFormat values defined in the EBDT / EBLC tables are valid for use with the CBDT / CBLC tables.
Compressed Color Bitmaps
Images for each individual glyph are stored as straight PNG data. Only the following chunks are allowed in such PNG data: IHDR, PLTE, tRNS, sRGB, IDAT, and IEND. If other chunks are present, the behavior is undefined. The image data shall be in the sRGB colorspace, regardless of color information that may be present in other chunks in the PNG data. The individual images must have the same size as expected by the table in the bitmap metrics.
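To make the pre-multiplied BGRA example from the uncompressed format concrete, here is a small sketch that converts a straight-alpha RGBA color into that byte order. The names are hypothetical and the round-to-nearest division is an assumption; the quoted text does not prescribe a rounding rule.

```java
// Hypothetical helper: straight-alpha RGBA (0..255 each) to pre-multiplied BGRA bytes.
final class PremultipliedBgraSketch {
    static byte[] toPremultipliedBgra(int r, int g, int b, int a) {
        return new byte[] {
            (byte) premultiply(b, a),
            (byte) premultiply(g, a),
            (byte) premultiply(r, a),
            (byte) a
        };
    }

    // Scales a channel by alpha/255, rounding to the nearest integer (assumed rounding).
    static int premultiply(int channel, int alpha) {
        return (channel * alpha + 127) / 255;
    }
}
```

With full green (255) and alpha 0x80, the green channel becomes 0x80, matching the \x00\x80\x00\x80 example above.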

As far as I know, emojis are stored in two different places: in the .ttf, for display in text-only fields (for example, quick previews of a message), and as images. Maybe you should dig in that direction?

I haven't looked at Android emoji, but I managed to extract the iOS emoji
by adapting a few tools, as nothing on the net seems to do it 100%.
A hex editor is key; it's all I used...
iOS 5.0 used uint8 RGBA data stored as tuples.
iOS 5.1 changed to PNGs, which are written contiguously.
iOS 6 combined the iOS 5.0 and 5.1 formats: sets 1 and 2 were uint8 data, and set 3 (iPad @ 96x96px) was in the optimised PNG format that Apple adopted, e.g. switching RGBA to BGRA (byte blitting, apparently).
iOS 7 stayed the same, as did iOS 8 to 8.2.
Hope that helps...

OpenCV: Issue in running same-code on Android vs OSx

I've written a simple template-matching program using OpenCV, which produces surprisingly different results on Android and OSx.
First, here is what I'm doing:
IplImage *image = cvLoadImage("test3a.png", -1);
Mat templateMat(image);
// detecting keypoints
OrbFeatureDetector detector(500);
std::vector<KeyPoint> templateKeypoints;
detector.detect(templateMat, templateKeypoints);
// computing descriptors
Mat templateDescriptors;
OrbDescriptorExtractor extractor;
extractor.compute(templateMat, templateKeypoints, templateDescriptors);
// matches
BFMatcher matcher(cv::NORM_HAMMING2);
std::vector<std::vector<DMatch> > matches;
matcher.knnMatch(templateDescriptors, templateDescriptors, matches, 2);
Now, here is what I'm getting:
Running the same snippet on a Nexus i9250 running Android 4.2.2 and on OSx 10.7 (Lion) gives these results:
Mat Objects: Same on both OSes
Keypoints: [On Android][2], [On OSx][3], [DIFFERENCE][4]
Descriptors: [On Android][5], [On OSx][6], [DIFFERENCE][7]
Matches: [On Android][8], [On OSx][9], [DIFFERENCE][10]
NOTE:
There is no difference if I sort these files, so what I don't understand is why the results come out in a different order.
Getting them in the same order is a requirement, as I need that for further computations.
Also, running the same code snippet on the same platform always produces results in the same order.
Stack Exchange prevents my account from posting more than 2 links in a post, so please check the comments for the links.
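Since the sorted outputs are identical on both platforms, one possible workaround (a suggestion only, not something confirmed in this thread) is to impose a canonical order on the keypoints before computing descriptors, so that descriptor rows and matches follow the same order everywhere. The sketch below uses a hypothetical Kp class standing in for OpenCV's KeyPoint and sorts by y, then x, then response.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a detected keypoint; not an OpenCV class.
final class KeypointOrderSketch {
    static final class Kp {
        final float x;
        final float y;
        final float response;

        Kp(float x, float y, float response) {
            this.x = x;
            this.y = y;
            this.response = response;
        }
    }

    // Returns a copy sorted by y, then x, then response, independent of input order.
    static List<Kp> canonicalOrder(List<Kp> keypoints) {
        List<Kp> sorted = new ArrayList<>(keypoints);
        sorted.sort(Comparator.<Kp>comparingDouble(k -> k.y)
                .thenComparingDouble(k -> k.x)
                .thenComparingDouble(k -> k.response));
        return sorted;
    }
}
```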

How to get Values from HTML String in Android

I have a problem that I want to get different values from different tags from HTML String as, <p><div class=\"image_wrapper\" style=\"width:320px;\"><img name=\"tccimg_100322484_s\" **title=\"2011 Chevrolet Corvette 2-door Coupe Z06 w/2LZ Angular Front Exterior View\"** src=\"http://images.thecarconnection.com/sml/2011-chevrolet-corvette-2-door-coupe-z06-w-2lz-angular-front-exterior-view_100322484_s.jpg\" alt=\"2011 Chevrolet Corvette 2-door Coupe Z06 w/2LZ Angular Front Exterior View\" width=\"320\" height=\"240\" /><p>2011 Chevrolet Corvette 2-door Coupe Z06 w/2LZ Angular Front Exterior View</p><a name=\"tccwrp_100322484\" class=\"enlarge\" href=\"/image/100322484_2011-chevrolet-corvette-2-door-coupe-z06-w-2lz-angular-front-exterior-view\" target=\"_blank\">Enlarge Photo</a></div></p>\n<p>The Chevrolet Corvette is an American icon: a rear-wheel drive, two-seat sports car that started its legendary run in 1953 and has seen 57 years of continuous production in Flint, Michigan, St. Louis, Missouri and most recently in Bowling Green, Kentucky. Over the years it has constantly evolved to lead performance and value, with occasional lows and numerous highs along the way. Though it has little domestic competition, cars as disparate as the Dodge Viper, Porsche Boxster and 911, and the Nissan GT-R and 370Z can be considered rivals in terms of performance and/or price. The Chevrolet Corvette is priced from $48,000 to $56,000 for the standard Coupe and Convertible, from $58,000 to $68,000 for the Grand Sport, from $75,000 to $82,000 for the Z06, and from $106,800 for the ZR1.</p>\n<p>Over the past 57 years of production, there have been six generations of Corvette. The first 1953 models featured solid rear axles and inline six-cylinder engines, though in 1955, the V-8 became standard. When the second generation \"Sting Ray\" debuted in 1963, independent rear suspension was added and output was increased to 360 horsepower. 
A big-block 6.5-liter model was added in 1965, before the famous 427 cubic inch (7.0-liter) engine joined in 1966. The third-gen car began its run in 1968, running for 13 years until 1982--the longest run of the various Corvette generations. The new, fender-flared body style was the primary new addition to the line, along with a three-year run for the ZR-1 performance edition, though emissions and fuel regulations conspired to restrict power output and potential of Corvettes throughout the 1970s. The fourth-generation Corvette hit the street in 1983 as a 1984 year model, bringing with it a complete redesign of the car aside from the engine, with a sleek, modern design and digital instruments, and the second ZR-1 performance version. The fifth-gen car, introduced in 1997, saw another major upgrade, with improved build quality, more performance, and better handling the result. The Z06 model was introduced in 2001, and engines continued to be upgraded, producing 405 horsepower in the Z06.</p>\n<p>The sixth and current Corvette generation debuted in 2005, and brought with it all new bodywork and improved suspension. Power climbed to 400 horsepower for the base Corvette initially, now up to 430 horsepower from its 6.2-liter V-8 LS3 engine, and 505 horsepower for the current 7.0-liter Z06. The ZR1 was added back to the lineup in late 2007 as a 2008 year model, producing 638 horsepower from a supercharged 6.2-liter V-8 engine. Currently available in Coupe, Convertible, a Grand Sport version with upgraded brakes and special bodywork, the high-performance Z06, and the supercar-rivaling ZR1.</p>\n<p>The Coupe and Convertible are the standard Corvettes, with 430 horsepower output and all the conveniences of a modern car, including available Bluetooth on some models, a choice of six-speed manual or automatic transmission, and available leather interior. 
The Grand Sport is also available as both a coupe and convertible, though the coupe gets a few performance upgrades over the soft top, including a dry-sump oil system when equipped with the six-speed manual transmission, plus the upgraded brakes and flared fenders that both variants get. The Corvette Z06 ups the performance ante with extensive use of carbon fiber body panels and components, an aluminum frame, and a 505-horsepower engine. The ZR1 is king of the hill, its massive power output combined with Brembo ceramic carbon brakes, visible carbon fiber weave components, and a 205-mph top speed. Despite their huge power and impressive performance figures, the brawny engines in the Corvette enable it to achieve up to 26 mpg on the highway.</p>\n<p>No major changes were made for the 2012 model year, though the range did get interior updates, new technology packages, and a range of new exterior colors. High-performance Z06 and ZR1 models also got new performance packages.</p>\n<p>For 2013, a new 427 Convertible Collector Edition has been added, pairing the Z06's LS7 V-8 engine with a Corvette Convertible chassis and unique 60th Anniversary touches. A 60th Anniversary Package will also be available on all 2013 model Corvettes, adding a special touch to celebrate six decades of the Corvette. The rest of the line carries forward largely unchanged from last year.</p>\n<p>The next major generational upgrade is expected to come in late 2013, with the seventh-generation car drawing on GM's global resources for its new design--the first time the Corvette team has looked outside the U.S. for the iconic 'Vette.</p>\n"
I want to get the image title value at the start of this HTML string, but I am unable to get it. I am using Jsoup to parse the HTML string as follows:
Code:
Document doc = Jsoup.parse(html);
Elements element = doc.getAllElements();
for(Element e: element)
{
Elements str = e.getElementsByTag("img");
for(Element el: str)
{
String title = el.getElementsByAttribute("title").text();
System.out.println("The Title:"+title);
}
}
Please suggest a solution for this.
Thanks in advance.

Replace the following line:
String title = el.getElementsByAttribute("title").text();
with
String title = el.attr("title");
Explanation: the call getElementsByAttribute("title") returns a list of Elements (see Jsoup Documentation - Element), when really you just want to read the attribute of a specific element. See also Jsoup Documentation - Node.
