Last week I chose my major project. It is a vision-based system to monitor cyclists in time trial events as they pass certain points on the course. It should detect the bright yellow race number on a cyclist's back, extract the number from it, and record the time.
I did some research and decided to use Tesseract Android Tools by Robert Theis, called Tess Two. To speed up text recognition, I want to use the fact that the number is meant to be extracted from a bright (yellow) rectangle on the cyclist's back, and to run the actual OCR only on that region. I have not found any code or ideas on how to detect geometric figures of a specific color. Thank you for any help. And sorry if I made any mistakes; I am pretty new on this website.
Where are the images coming from? I ask because I was asked to provide some technical help for the design of a similar application (we were working with footballers' shirts) and I can tell you that you'll have a few problems:
Use a high quality video feed rather than rely on a couple of digital camera images.
The number will almost certainly be 'curved' or distorted because of the rider's movement. Being able to use a series of images will sometimes allow you to work out what the number really is from a series of 'false reads'.
Train for the font you're using, but also apply as much logic as you can (if the numbers are always two digits and never start with '9', use this information to help you get the right number).
If you have the luxury of being able to position the camera (we didn't!), I would have thought your ideal spot would be above the rider and looking slightly forward so you can capture their back with the minimum of distortions.
We found that merging several still-frames from the video into one image gave us the best overall image of the number - however, the technology that was used for this was developed by a third-party and they do not want to release it, I'm afraid :(
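To illustrate the idea of combining a series of 'false reads' with known rules about the numbers, here is a minimal Python sketch. It is not the third-party merging technology mentioned above; it simply takes one OCR result per frame, discards reads that break the rules, and picks the most common survivor. The function names, the `min_votes` parameter, and the sample reads are all made up for illustration, and the rule set (two digits, never starting with '9') is just the example given above.

```python
from collections import Counter


def is_plausible(read: str) -> bool:
    """Hypothetical rule set: exactly two digits, never starting with '9'."""
    return len(read) == 2 and read.isdigit() and read[0] != "9"


def vote_on_reads(reads, min_votes=2):
    """Return the most common plausible read, or None if support is too weak."""
    counts = Counter(r for r in reads if is_plausible(r))
    if not counts:
        return None
    best, votes = counts.most_common(1)[0]
    return best if votes >= min_votes else None


# One OCR result per video frame (illustrative values, not real data).
frame_reads = ["47", "4", "97", "47", "41", "47", "A7"]
print(vote_on_reads(frame_reads))  # prints 47
```

Requiring a minimum number of agreeing frames means a single lucky misread is not reported as a result.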
Good luck!
Context
I'm building an app which performs real-time object detection through the device's camera module. The render looks like the image below.
Let's say I try to recognize an apple. Most of the time the app will recognize an apple. However, sometimes the app will recognize the wrong fruit (let's say a lemon) on a few camera frames.
Goal
As the recognition of a fruit triggers an action in my code, my goal is to programmatically prevent a brief wrong recognition from triggering an action, and to take only the majority result into account.
What I've tried
I tried it this way: if the same fruit is recognized several frames in a row, I assume the result is the right one. But as my device processes image recognition several times per second, even a wrong guess can be recognized several times in a row, which leads to the wrong action.
Question
Are there any known techniques for avoiding this behavior?
I feel like you've already answered your own question. In general, interpreting a model's inference is its own tuning step. You know, for example, that in logistic regression tasks the threshold does NOT have to be 0.5. In fact, it's quite common to vary the threshold to see what the recall and precision are at various values, and then pick a threshold that works for your business/product problem. (Fraud detection might favor high recall if you never want to miss any fraud... or high precision if you don't want to annoy users with lots of false positives.)
In video, this broad concept is extended to multiple frames, as you know. You now have to tune the hyperparameters "how many frames in total?" and "how many frames voting [apple]?"
If you are analyzing fruit going down a conveyor belt one by one, and you know each piece of fruit will be in frame for X seconds and you are shooting at 60 fps, maybe you want 60 * X frames. And maybe you want 90% of the frames to agree.
You'll want to visualize how often your detector "flips" detections so you can make a business/product judgement call on what your threshold ought to be.
This answer hasn't been very helpful in giving you a bright line rule here, but I hope it's helpful in suggesting that there is in fact NO bright line rule. You have to understand the problem to set the key hyperparameters:
For each frame, is top-1 acc sufficient, or do I need [.75] or higher confidence?
How many frames get to vote? Say [100].
How many correlated votes are necessary to trigger an actual signal? Maybe it's [85].
The above algorithm assumes you take a hardmax after step 1. Another option would be to average the scores over all 100 frames and apply a threshold to the average. That's a kind of soft-label version of the above algorithm.
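The three hyperparameters above can be sketched in Python as a sliding-window vote, followed by the soft-label variant. This is only a minimal illustration under my own assumptions: the class and function names are made up, each frame is assumed to arrive as a top-1 label plus a confidence (or, for the soft version, a dict of label scores), and the defaults just mirror the bracketed example values.

```python
from collections import Counter, deque


class FrameVoter:
    """Hard-vote version: each confident frame casts one vote for its top-1 label."""

    def __init__(self, window=100, min_votes=85, min_confidence=0.75):
        self.window = deque(maxlen=window)    # how many frames get to vote
        self.min_votes = min_votes            # agreeing votes needed to signal
        self.min_confidence = min_confidence  # per-frame confidence cut-off

    def update(self, label, confidence):
        # A low-confidence frame still occupies a slot but casts no vote.
        self.window.append(label if confidence >= self.min_confidence else None)
        counts = Counter(v for v in self.window if v is not None)
        if not counts:
            return None
        best, votes = counts.most_common(1)[0]
        return best if votes >= self.min_votes else None


def soft_vote(frame_scores, threshold=0.8):
    """Soft version: average the per-label scores, then threshold the average."""
    totals = Counter()
    for scores in frame_scores:  # each item maps label -> score for one frame
        totals.update(scores)    # Counter.update adds the values per label
    if not frame_scores or not totals:
        return None
    best, total = max(totals.items(), key=lambda kv: kv[1])
    return best if total / len(frame_scores) >= threshold else None


voter = FrameVoter(window=10, min_votes=8)
for _ in range(7):
    voter.update("apple", 0.9)
voter.update("lemon", 0.9)          # a brief wrong recognition
print(voter.update("apple", 0.9))   # prints apple (8 of the last 9 frames agree)
```

With a small window the signal reacts faster but flips more easily, so the window size and vote count are the values to tune against recordings of real detector output.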
I am developing an app in which I need to get face landmark points from a live camera feed, like a mirror cam or makeup cam. I want it to be available for iOS too. Please guide me towards a robust solution.
I have used Dlib and Luxand.
DLIB: https://github.com/tzutalin/dlib-android-app
Luxand: http://www.luxand.com/facesdk/download/
Dlib is slow and has a lag of approximately 2 seconds (please look at the demo video on the git page), and Luxand is OK but it's paid. My priority is to use an open source solution.
I have also used Google Vision, but it does not offer many face landmark points.
So please suggest a way to make dlib work fast, or any other option, keeping cross-platform support as the priority.
Thanks in advance.
You can make Dlib detect face landmarks in real-time on Android (20-30 fps) if you take a few shortcuts. It's an awesome library.
Initialization
First, follow all the recommendations in Evgeniy's answer; in particular, make sure that you initialize the frontal_face_detector and shape_predictor objects only once instead of every frame. The frontal_face_detector will initialize faster if you deserialize it from a file instead of using the get_serialized_frontal_faces() function.
The shape_predictor needs to be initialized from a 100 MB file, which takes several seconds. The serialize and deserialize functions are written to be cross-platform and perform validation on the data, which is robust but makes them quite slow. If you are prepared to make assumptions about endianness, you can write your own deserialization function that will be much faster. The file is mostly made up of matrices of 136 floating-point values (about 120,000 of them, meaning 16,320,000 floats in total). If you quantize these floats down to 8 or 16 bits you can make big space savings (e.g. you can store the min value and (max-min)/255 as floats for each matrix and quantize each matrix separately). This reduces the file size to about 18 MB, and it loads in a few hundred milliseconds instead of several seconds. The decrease in quality from using quantized values seems negligible to me, but YMMV.
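As a rough illustration of the per-matrix 8-bit quantization described above, here is a minimal Python sketch. It is not dlib's file format or API: the byte layout (two little-endian float32 values followed by one byte per entry) and the function names are assumptions made up for this example, and a real implementation would live in the C++ loading code.

```python
import struct


def quantize(values):
    """Pack one matrix as: min (float32), step (float32), then one byte per value."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / 255 or 1.0  # fall back to 1.0 for a constant matrix
    codes = bytes(min(255, round((v - lo) / step)) for v in values)
    return struct.pack("<ff", lo, step) + codes


def dequantize(blob):
    """Inverse of quantize(): rebuild approximate floats from the packed bytes."""
    lo, step = struct.unpack_from("<ff", blob)
    return [lo + code * step for code in blob[8:]]


matrix = [i / 10 - 5 for i in range(136)]  # stand-in for one 136-value matrix
blob = quantize(matrix)
restored = dequantize(blob)
print(len(blob))  # prints 144, versus 544 bytes as raw float32
```

Each restored value is off by at most about half a quantization step, which is the quality loss to check against your own accuracy requirements.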
Face Detection
You can scale the camera frames down to something small like 240x160 (or whatever, keeping the aspect ratio correct) for faster face detection. It means you can't detect smaller faces, but that might not be a problem depending on your app. Another, more complex approach is to adaptively crop and resize the region you use for face detection: initially check for all faces in a higher-res image (e.g. 480x320), and then crop to the area +/- one face width around the previous location, scaling down if need be. If you fail to detect a face in one frame, revert to searching the entire image on the next one.
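The adaptive crop can be sketched in Python as below. This is only an illustration of the bookkeeping, with a made-up function name and an assumed (x, y, width, height) rectangle convention; the actual cropping and detection would be done by your image library and dlib.

```python
def search_region(prev_face, frame_w, frame_h):
    """Return (x0, y0, x1, y1): the area to search for a face in the next frame.

    prev_face is (x, y, w, h) from the previous frame, or None if detection failed.
    """
    if prev_face is None:
        # No face last time: fall back to searching the whole frame.
        return (0, 0, frame_w, frame_h)
    x, y, w, h = prev_face
    margin = w  # +/- one face width around the previous location
    x0 = max(0, x - margin)
    y0 = max(0, y - margin)
    x1 = min(frame_w, x + w + margin)
    y1 = min(frame_h, y + h + margin)
    return (x0, y0, x1, y1)


print(search_region((200, 100, 80, 80), 480, 320))  # prints (120, 20, 360, 260)
print(search_region(None, 480, 320))                # prints (0, 0, 480, 320)
```

Remember to add the crop offset (x0, y0) back to any rectangle detected inside the cropped region, so that it is expressed in full-frame coordinates again.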
Face Tracking
For faster face tracking, you can run face detections continuously in one thread and then, in another thread, track the detected face(s) and perform face feature detections using the tracked rectangles. In my testing I found that face detection took between 100 and 400ms depending on what phone I used (at about 240x160), and I could do 7 or 8 face feature detections on the intermediate frames in that time. This can get a bit tricky if the face is moving a lot, because when you get a new face detection (which will be from 400ms ago), you have to decide whether to keep tracking from the newly detected location or from the tracked location of the previous detection. Dlib includes a correlation_tracker; unfortunately, I wasn't able to get it to run faster than about 250ms per frame, and scaling down the resolution (even drastically) didn't make much of a difference. Tinkering with internal parameters increased speed but gave poor tracking. I ended up using a CAMShift tracker based on the chroma UV planes of the preview frames, generating the color histogram from the detected face rectangles. There is an implementation of CAMShift in OpenCV, but it's also pretty simple to roll your own.
Hope this helps. It's mostly a matter of picking the low-hanging fruit for optimization first and then continuing until you're happy it's fast enough. On a Galaxy Note 5, Dlib does face+feature detections in about 100ms, which might be good enough for your purposes even without all this extra complication.
Dlib is fast enough for most cases. Most of the processing time is spent detecting the face region in the image, and it is slow because modern smartphones produce high-resolution images (10MP+).
Yes, face detection can take 2+ seconds on a 3-5MP image, but that is because it tries to find very small faces, down to 80x80 pixels. I am quite sure that you don't need such small faces in high-resolution images, so the main optimization here is to reduce the size of the image before finding faces.
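One way to choose how much to shrink the image is to work backwards from the smallest face you actually care about. The Python sketch below assumes the 80x80 pixel minimum mentioned above and uses made-up function names; the 640-pixel face width in the example is an arbitrary illustration, not a recommendation.

```python
def detection_scale(smallest_face_px, detector_min_face_px=80):
    """Scale factor that shrinks the smallest face of interest to the detector
    minimum. Never returns more than 1.0, so the image is never upscaled."""
    return min(1.0, detector_min_face_px / smallest_face_px)


def scaled_size(width, height, scale):
    """Image size after applying the scale factor, keeping the aspect ratio."""
    return (max(1, round(width * scale)), max(1, round(height * scale)))


# A 5MP frame where faces closer than arm's length are at least 640 px wide.
scale = detection_scale(640)
print(scale)                           # prints 0.125
print(scaled_size(2592, 1944, scale))  # prints (324, 243)
```

In this example the detector has 64 times fewer pixels to scan. Any rectangle it finds has to be divided by the scale factor to map it back onto the original image.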
After the face region is found, the next step, face landmark detection, is extremely fast: it takes < 3 ms for one face, and this time does not depend on resolution.
The dlib-android port does not currently use dlib's detector the right way. Here is a list of recommendations on how to make the dlib-android port work much faster:
https://github.com/tzutalin/dlib-android/issues/15
It's very simple and you can implement it yourself. I would expect a performance gain of about 2x-20x.
Apart from OpenCV and Google Vision, there are widely available web services like Microsoft Cognitive Services. The advantage is that it would be completely platform-independent, which you've listed as a major design goal. I haven't personally used them in an implementation yet, but based on playing with their demos for a while they seem quite powerful; they're pretty accurate and can offer quite a few details depending on what you want to know. (There are similar solutions available from other vendors as well, by the way.)
The two major potential downsides to something like that are the potential for added network traffic and API pricing (depending on how heavily you'll be using them).
Pricing-wise, Microsoft currently offers up to 5,000 transactions a month for free with added transactions beyond that being some fraction of a penny (depending on traffic, you can actually get a discount for high volume), but if you're doing, for example, millions of transactions per month the fees can start adding up surprisingly quickly. This is actually a fairly typical pricing model; before you select a vendor or implement this kind of a solution make sure you understand how they're going to charge you and how much you're likely to end up paying and how much you could be paying if you scale your user base. Depending on your traffic and business model it could be either very reasonable or cost-prohibitive.
The added network traffic may or may not be a problem depending on how your app is written and how much data you're sending. If you can do the processing asynchronously and be guaranteed reasonably fast Wi-Fi access that obviously wouldn't be a problem but unfortunately you may or may not have that luxury.
I am currently working with the Google Vision API and it seems to be able to detect landmarks out of the box. Check out the FaceTracker here:
google face tracker
This solution should detect the face, happiness, and the left and right eye as is. For other landmarks, you can call getLandmarks on a Face and it should return everything you need (though I have not tried it) according to their documentation: Face reference
I want to programmatically read numbers on a page using a mobile's camera instead of from an image, just like barcode scanning.
I know that we can read or scan a barcode, but is there any way to read numbers using the same strategy? I also know that we can read text or numbers from an image using OCR, but I don't want to take a photo/image and then process it; I only want to scan and get the result.
You mean to say that you don't want to click a picture and process it, instead you want to scan text by just hovering the camera, am I right?
It could be accomplished using a technology called Optical Character Recognition. (You mentioned something about OSR; I think this is what you meant.) What it does is find patterns in images in order to detect text in printed documents.
As far as I know, existing tools process still images, so you will have to work around that to make them scan moving images.
Character recognition demands a significant amount of resources, so instead of processing moving pictures I would recommend that you write a program that takes images less frequently from the hovering camera and processes them. Once text, or numbers in your case, are detected, you could use a less expensive pattern matching algorithm to track the motion of the numbers.
To date, the most powerful and popular software is Tesseract-OCR. You will find it on GitHub. You can use it to develop your mobile application.
In an Android app I'm making, I would like to detect when a user holding the phone in their hand makes a gesture like throwing a frisbee. I have seen a couple of apps implementing this, but I can't find any example code or tutorial on the web.
It would be great to get some thoughts on how this could be done, and of course it would be even better with some example code or a link to a tutorial.
The accelerometer provides you with a stream of 3D vectors. When the phone is held still in the hand, the vector's direction is opposite to Earth's gravitational pull and its magnitude is the same (this way you can determine the phone's orientation).
If the user lets it fall, the vector's magnitude will go to 0 (the same effect as weightlessness on a space station).
If the user makes some gesture without throwing it, the direction will shift, and the amplitude will rise, then fall, and then rise again (when the user stops the movement). To find out what this looks like, you can do some research by recording accelerometer data while performing the desired gestures.
Keep in mind that the accelerometer is pretty noisy, so you will have to do some averaging over nearby values to get meaningful results.
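The points above (magnitude equal to gravity when held still, close to zero in free fall, and averaging to cope with noise) can be sketched in Python as below. This is a minimal illustration and not Android code: the names, the window size, and the tolerance of 1.5 m/s^2 are made-up values that you would have to tune against your own recorded data.

```python
import math
from collections import deque

GRAVITY = 9.81  # m/s^2, the magnitude expected when the phone is held still


def magnitude(v):
    """Length of one 3D accelerometer sample (x, y, z)."""
    return math.sqrt(v[0] ** 2 + v[1] ** 2 + v[2] ** 2)


class MovingAverage:
    """Averages the last `size` values to smooth out sensor noise."""

    def __init__(self, size=5):
        self.samples = deque(maxlen=size)

    def add(self, value):
        self.samples.append(value)
        return sum(self.samples) / len(self.samples)


def classify(smoothed_magnitude, tolerance=1.5):
    """Rough state from a smoothed magnitude; the tolerance is a guess to tune."""
    if smoothed_magnitude < tolerance:
        return "free fall"
    if abs(smoothed_magnitude - GRAVITY) <= tolerance:
        return "at rest"
    return "moving"


smoother = MovingAverage(size=3)
for sample in [(0.1, 0.2, 9.7), (0.0, 0.3, 9.9), (0.2, 0.1, 9.8)]:
    state = classify(smoother.add(magnitude(sample)))
print(state)  # prints at rest
```

A throwing gesture would then show up as a sequence of these states over time (at rest, moving, possibly free fall), which is the pattern to look for in recorded data.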
I think that one workable approach to matching gestures would be invariant moments (like the Hu moments used in image recognition). The accelerometer vector over time defines a 4-dimensional space, and you will need a set of scaling- and rotation-invariant moments. Designing such a set is not easy, but computing it is not complicated.
Once you have your moments, you can use standard techniques for matching vectors to clusters (see the "moments" and "cluster" modules from our javaocr project: http://javaocr.svn.sourceforge.net/viewvc/javaocr/trunk/plugins/ ).
PS: you may get away with just speed over time, which produces a 2-dimensional space and can be analysed with javaocr on the spot.
Not exactly what you are looking for:
Store orientation to an array - and compare
Tracking orientation works well. Perhaps you can do something similar with the accelerometer data (without any integration).
A similar question is Drawing in air with Android phone.
I am curious what other answers you will get.
I am wondering how to do as I said in the title:
I want to count some objects by reading an image from the camera of a portable device (such as an iPhone or an Android phone).
I need only two specific functions.
Recognize and count the amount of objects
Recognize the color of the object (so I can count how many of each color I have).
A very simple example.
I have a stack of LEGO pieces, all of the same dimensions. I know they will always be aligned horizontally, but sometimes they are not vertically aligned. I need to count how many of each colour I have.
I know that the pieces all have the same dimensions; only the colour changes.
I think I have only 10 colours available.
I can process the image (blur it and so on), but I don't know how to work out how many pieces I have.
Can you give me some ideas on how to do this (and which libraries to use for both iOS and Android, Android first), or point me to some publications (free PDFs or published books, even if they're not free) that teach how to read data from images?
The program should work like this:
I start the program, and when it recognizes (using the integrated camera) that it is looking at some specific objects, it takes a picture and processes it, telling me how many pieces of each colour I have.
Thanks in advance, ANY kind of help will be
I'll admit it has been 10 years since I last dabbled with computer vision, but back then I used the OpenCV libraries, and these still seem to be going strong and support Android:
http://opencv.willowgarage.com/wiki/Android
and iOS:
http://www.eosgarden.com/en/opensource/opencv-ios/overview/