I am using OpenCV's GrabCut algorithm for background subtraction on an image in Android. The algorithm runs fine, but the result it gives is not accurate.
E.g. My input image is:
The output image looks like:
So how can we increase the accuracy of the GrabCut algorithm?
P.S.: Apologies for not uploading example images; my reputation is too low :(
I have been battling with the same problem for quite some time now. I have a few tips and tricks for this
1> Improve your seeds. GrabCut is basically a black box to which you give seeds and from which you expect a segmented image, so the seeds are all you can control and it becomes imperative to select good ones. There are a number of things you can do in this regard if you have some expectation about the image you want to segment. For example:
a> Will your image have humans? Use a face detector to find the face and mark those pixels as Probable/definite foreground, as you deem fit. You could also use skin colour models within some region of interest to further refine your seeds
b> If you have some data on what kind of foreground you expect after segmentation, you can train colour models and use them as well to mark even more pixels
The list goes on. You need to come up with creative ways to add more accurate seeds.
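To make the seeding idea concrete, here is a minimal sketch in plain Python (no OpenCV needed to run it). It assumes a hypothetical face detector has already returned one rectangle; the function name `build_seed_mask` and the choice to mark the image border as definite background are my own assumptions, not part of GrabCut itself. The four label values are the ones OpenCV uses for `cv2.GC_BGD`, `cv2.GC_FGD`, `cv2.GC_PR_BGD` and `cv2.GC_PR_FGD`.

```python
# GrabCut mask labels; these integer values match OpenCV's
# cv2.GC_BGD, cv2.GC_FGD, cv2.GC_PR_BGD and cv2.GC_PR_FGD.
GC_BGD, GC_FGD, GC_PR_BGD, GC_PR_FGD = 0, 1, 2, 3


def build_seed_mask(width, height, face_rect, border=1):
    """Build a GrabCut seed mask from one detected face rectangle.

    face_rect is (x, y, w, h) from a hypothetical face detector.
    """
    x, y, w, h = face_rect
    # Start by calling every pixel "probably background".
    mask = [[GC_PR_BGD] * width for _ in range(height)]
    for row in range(height):
        for col in range(width):
            on_border = (row < border or row >= height - border or
                         col < border or col >= width - border)
            in_face = x <= col < x + w and y <= row < y + h
            in_core = (x + w // 4 <= col < x + w - w // 4 and
                       y + h // 4 <= row < y + h - h // 4)
            if in_core:
                mask[row][col] = GC_FGD      # centre of the face: definite foreground
            elif in_face:
                mask[row][col] = GC_PR_FGD   # rest of the face box: probable foreground
            elif on_border:
                mask[row][col] = GC_BGD      # image border: definite background (assumption)
    return mask


mask = build_seed_mask(8, 8, face_rect=(2, 2, 4, 4))
```

In a real pipeline a mask like this would be converted to an 8-bit single-channel image and handed to GrabCut in mask-initialisation mode (`cv2.GC_INIT_WITH_MASK` in OpenCV) instead of the plain rectangle mode.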
2> Post Processing: Try simple post processing techniques like the Opening and Closing operations to smoothen your fgmask. They will help you get rid of a lot of noise in the final output.
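To show what opening and closing actually do to a mask, here is a small self-contained sketch in plain Python with a 3x3 square structuring element. It treats pixels outside the image as background, which is a simplifying assumption of mine. In practice you would use your library's morphology routines (OpenCV has `morphologyEx` with `MORPH_OPEN` and `MORPH_CLOSE`) rather than this loop.

```python
def _morph(mask, combine):
    """Apply a 3x3 min (erode) or max (dilate) filter to a 0/1 mask.

    Pixels outside the image are treated as background (0).
    """
    h, w = len(mask), len(mask[0])

    def at(r, c):
        return mask[r][c] if 0 <= r < h and 0 <= c < w else 0

    return [[combine(at(r + dr, c + dc)
                     for dr in (-1, 0, 1) for dc in (-1, 0, 1))
             for c in range(w)] for r in range(h)]


def erode(mask):
    return _morph(mask, min)


def dilate(mask):
    return _morph(mask, max)


def opening(mask):
    return dilate(erode(mask))   # removes small foreground specks


def closing(mask):
    return erode(dilate(mask))   # fills small holes in the foreground


# A 4x4 foreground block with a one-pixel hole, plus an isolated speck.
noisy = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 1, 0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]
cleaned = opening(closing(noisy))  # hole filled, speck removed
```

Closing first fills the hole; opening then removes the speck. With a hole this close to the block's edge, opening first would erode the whole block away, so the order matters for small objects.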
In general graphcut (and hence grabcut) tends to snap quickly to edges and hence if you have strong edges close to your foreground boundary, you can expect inaccuracies in the result.
Context
I'm building an app which performs real-time object detection through the camera module of the device. The rendered output looks like the image below.
Let's say I try to recognize an apple, most of the time the app will recognize an apple. However, sometimes, the app will recognize the wrong fruit (let's say a lemon) on a few camera frames.
Goal
As the recognition of a fruit triggers an action in my code, my goal is to programmatically prevent a brief wrong recognition from triggering an action, and to take only the majority result into account.
What I've tried
I tried this: if the same fruit is recognized several frames in a row, I assume the result is the right one. But as my device processes image recognition several times per second, even a wrong guess can be recognized several times in a row, and lead to the wrong action.
Question
Are there any known techniques for avoiding this behavior?
I feel like you've already answered your own question. In general, the interpretation of a model's inference is its own tuning step. You know, for example, that in logistic regression tasks the threshold does NOT have to be 0.5. In fact, it's quite common to vary the threshold to see what the recall and precision are at each value, and then pick a threshold that works for your business/product problem. (Fraud detection might favor high recall if you never want to miss any fraud... or high precision if you don't want to annoy users with lots of false positives.)
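Here is a small sketch of that threshold sweep in plain Python. The scores and labels are made-up toy data, and `precision_recall` is just a helper name I picked; it only illustrates how precision and recall trade off as the threshold moves.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum((not p) and y for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Toy data (invented for illustration): model scores and true labels.
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [True, True, False, True, True, False, False, False]

for t in (0.3, 0.5, 0.7):
    p, r = precision_recall(scores, labels, t)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
# threshold=0.3  precision=0.67  recall=1.00
# threshold=0.5  precision=0.75  recall=0.75
# threshold=0.7  precision=1.00  recall=0.50
```

Raising the threshold buys precision at the cost of recall, which is exactly the knob you end up tuning per product.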
In video, this broad concept is extended to multiple frames, as you know. You now have to tune two hyperparameters: "how many frames in total?" and "how many frames voting [apple]?"
If you are analyzing fruit going down a conveyer belt one by one, and you know each piece of fruit will be in frame for X seconds and you are shooting at 60 fps, maybe you want 60 * X frames. And maybe you want 90% of the frames to agree.
You'll want to visualize how often your detector "flips" detections so you can make a business/product judgement call on what your threshold ought to be.
This answer hasn't been very helpful in giving you a bright line rule here, but I hope it's helpful in suggesting that there is in fact NO bright line rule. You have to understand the problem to set the key hyperparameters:
1. For each frame, is top-1 accuracy sufficient, or do I need [.75] or higher confidence?
2. How many frames get to vote? Say [100].
3. How many agreeing votes are necessary to trigger an actual signal? Maybe it's [85].
The above algorithm assumes you take a hardmax after step 1. Another option would be to average the class scores over all 100 frames and pick a threshold on the average. That's a kind of soft-label version of the above algorithm.
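The hard-vote version above could be sketched like this in plain Python. `FrameVoter` and its parameter names are hypothetical, and the defaults are simply the bracketed example values from the list, not recommendations.

```python
from collections import Counter, deque


class FrameVoter:
    """Sliding-window majority vote over per-frame detections."""

    def __init__(self, window=100, min_votes=85, min_confidence=0.75):
        self.min_votes = min_votes
        self.min_confidence = min_confidence
        self.votes = deque(maxlen=window)  # oldest vote drops out automatically

    def update(self, label, confidence):
        """Feed one frame's top-1 result; return a label only when it is trusted."""
        # Step 1: low-confidence frames abstain (stored as None).
        self.votes.append(label if confidence >= self.min_confidence else None)
        # Step 2: wait until the window is full.
        if len(self.votes) < self.votes.maxlen:
            return None
        # Step 3: require enough agreeing votes before signalling.
        counts = Counter(v for v in self.votes if v is not None)
        if not counts:
            return None
        winner, n = counts.most_common(1)[0]
        return winner if n >= self.min_votes else None


voter = FrameVoter(window=10, min_votes=8)
frames = [("apple", 0.9)] * 4 + [("lemon", 0.8)] * 2 + [("apple", 0.9)] * 6
results = [voter.update(label, conf) for label, conf in frames]
# The two "lemon" frames never trigger anything; "apple" is reported
# once the window is full and at least 8 of the last 10 frames agree.
```

You would trigger your action only when `update` returns a non-None label, and probably only when that label changes.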
I am developing an app in which I need to get face landmark points from a live camera feed, as in a mirror cam or makeup cam. I want it to be available for iOS too. Please guide me towards a robust solution.
I have used Dlib and Luxand.
DLIB: https://github.com/tzutalin/dlib-android-app
Luxand: http://www.luxand.com/facesdk/download/
Dlib is slow and has a lag of approximately 2 seconds (please look at the demo video on the git page), and Luxand is OK but it's paid. My priority is to use an open source solution.
I have also used Google Vision, but it does not offer many face landmark points.
So please suggest a way to make Dlib run fast, or any other option, with cross-platform support as the priority.
Thanks in advance.
You can make Dlib detect face landmarks in real-time on Android (20-30 fps) if you take a few shortcuts. It's an awesome library.
Initialization
Firstly you should follow all the recommendations in Evgeniy's answer; in particular, make sure that you only initialize the frontal_face_detector and shape_predictor objects once instead of every frame. The frontal_face_detector will initialize faster if you deserialize it from a file instead of using the get_serialized_frontal_faces() function. The shape_predictor needs to be initialized from a 100 MB file, and takes several seconds. The serialize and deserialize functions are written to be cross-platform and perform validation on the data, which is robust but makes them quite slow. If you are prepared to make assumptions about endianness you can write your own deserialization function that will be much faster.
The file is mostly made up of matrices of 136 floating point values (about 120,000 of them, meaning 16,320,000 floats in total). If you quantize these floats down to 8 or 16 bits you can make big space savings (e.g. you can store the min value and (max-min)/255 as floats for each matrix and quantize each matrix separately). This reduces the file size to about 18 MB, and it loads in a few hundred milliseconds instead of several seconds. The decrease in quality from using quantized values seems negligible to me, but YMMV.
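As a rough illustration of the per-matrix quantization described above (not Dlib's file format, and not the code I actually used), here is a plain Python sketch. The function names are made up, and storing the two header values as 4-byte floats is an assumption.

```python
def quantize(values):
    """Quantize one matrix of floats to 8-bit codes plus a (min, step) header."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / 255 or 1.0  # fall back to 1.0 for a constant matrix
    codes = bytes(round((v - lo) / step) for v in values)
    return lo, step, codes


def dequantize(lo, step, codes):
    """Recover approximate floats from the 8-bit codes."""
    return [lo + step * q for q in codes]


values = [-0.5, -0.25, 0.0, 0.1, 0.75, 1.0]
lo, step, codes = quantize(values)
restored = dequantize(lo, step, codes)
max_error = max(abs(a - b) for a, b in zip(values, restored))  # at most step / 2

# Back-of-the-envelope size for the figures quoted above, assuming the
# two header values per matrix are stored as 4-byte floats.
n_matrices, floats_per_matrix = 120_000, 136
quantized_bytes = n_matrices * (floats_per_matrix + 2 * 4)  # 17,280,000 bytes
```

That estimate of roughly 17 MB for the matrices is consistent with the ~18 MB file mentioned above once the rest of the model data is included.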
Face Detection
You can scale the camera frames down to something small like 240x160 (or whatever, keeping aspect ratio correct) for faster face detection. It means you can't detect smaller faces but it might not be a problem depending on your app. Another more complex approach is to adaptively crop and resize the region you use for face detections: initially check for all faces in a higher res image (e.g. 480x320) and then crop the area +/- one face width around the previous location, scaling down if need be. If you fail to detect a face one frame then revert to detecting the entire region the next one.
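The adaptive crop can be as simple as the following sketch (plain Python; `search_region` is a hypothetical helper, and rectangles are `(x, y, w, h)` tuples in pixels, which is my own convention here). It expands the previous face box by one face width on every side, clamps it to the frame, and falls back to the full frame when the last detection failed.

```python
def search_region(prev_face, frame_w, frame_h):
    """Return the (x, y, w, h) region to run face detection on next.

    prev_face is the last detected face as (x, y, w, h), or None if the
    previous frame had no detection.
    """
    if prev_face is None:
        return (0, 0, frame_w, frame_h)  # revert to the entire frame
    x, y, w, h = prev_face
    left = max(0, x - w)                 # one face width of margin on each side
    top = max(0, y - w)
    right = min(frame_w, x + w + w)
    bottom = min(frame_h, y + h + w)
    return (left, top, right - left, bottom - top)


roi = search_region((200, 100, 60, 60), 480, 320)  # (140, 40, 180, 180)
```

Remember to add the region's offset back to any detection you get inside the crop (and undo any scaling) before using the coordinates on the full frame.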
Face Tracking
For faster face tracking, you can run face detections continuously in one thread, and then in another thread track the detected face(s) and perform face feature detections using the tracked rectangles. In my testing I found that face detection took between 100 and 400 ms depending on which phone I used (at about 240x160), and I could do 7 or 8 face feature detections on the intermediate frames in that time. This can get a bit tricky if the face is moving a lot, because when you get a new face detection (which will be from up to 400 ms ago), you have to decide whether to keep tracking from the newly detected location or from the tracked location of the previous detection. Dlib includes a correlation_tracker, but unfortunately I wasn't able to get it to run faster than about 250 ms per frame, and scaling down the resolution (even drastically) didn't make much of a difference. Tinkering with internal parameters increased speed but gave poor tracking. I ended up using a CAMShift tracker based on the chroma UV planes of the preview frames, generating the color histogram from the detected face rectangles. There is an implementation of CAMShift in OpenCV, but it's also pretty simple to roll your own.
Hope this helps, it's mostly a matter of picking the low hanging fruit for optimization first and just keep going until you're happy it's fast enough. On a Galaxy Note 5 Dlib does face+feature detections at about 100ms, which might be good enough for your purposes even without all this extra complication.
Dlib is fast enough for most cases. Most of the processing time is spent detecting the face region in the image, and that is slow because modern smartphones produce high-resolution images (10 MP+).
Yes, face detection can take 2+ seconds on a 3-5 MP image, but that is because it tries to find very small faces, down to 80x80 pixels. I am quite sure that you don't need such small faces in high-resolution images, so the main optimization here is to reduce the size of the image before finding faces.
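A quick way to pick the reduction factor, as a sketch: decide the smallest face you actually care about in the full-resolution image and shrink the image until that face is about 80 pixels wide. The helper names and the 5 MP example size below are my own assumptions.

```python
def detection_scale(min_face_px, detector_min_face_px=80):
    """Scale factor that maps the smallest face of interest to ~80 px."""
    return min(1.0, detector_min_face_px / min_face_px)  # never upscale


def scaled_size(width, height, scale):
    """Image size to run the face detector on."""
    return round(width * scale), round(height * scale)


# Example: a 2592x1944 (about 5 MP) frame where faces of interest are
# at least 400 px wide.
scale = detection_scale(400)                 # 0.2
small = scaled_size(2592, 1944, scale)       # (518, 389), ~25x fewer pixels
```

The detected rectangle then has to be divided by `scale` to get back to full-resolution coordinates before running landmark detection.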
After the face region is found, the next step, face landmark detection, is extremely fast: it takes < 3 ms for one face, and this time does not depend on resolution.
The dlib-android port does not currently use dlib's detector the right way. Here is a list of recommendations on how to make the dlib-android port work much faster:
https://github.com/tzutalin/dlib-android/issues/15
It's very simple and you can implement it yourself. I expect a performance gain of about 2x-20x.
Apart from OpenCV and Google Vision, there are widely-available web services like Microsoft Cognitive Services. The advantage is that it would be completely platform-independent, which you've listed as a major design goal. I haven't personally used them in an implementation yet but based on playing with their demos for awhile they seem quite powerful; they're pretty accurate and can offer quite a few details depending on what you want to know. (There are similar solutions available from other vendors as well by the way).
The two major potential downsides to something like that are the potential for added network traffic and API pricing (depending on how heavily you'll be using them).
Pricing-wise, Microsoft currently offers up to 5,000 transactions a month for free with added transactions beyond that being some fraction of a penny (depending on traffic, you can actually get a discount for high volume), but if you're doing, for example, millions of transactions per month the fees can start adding up surprisingly quickly. This is actually a fairly typical pricing model; before you select a vendor or implement this kind of a solution make sure you understand how they're going to charge you and how much you're likely to end up paying and how much you could be paying if you scale your user base. Depending on your traffic and business model it could be either very reasonable or cost-prohibitive.
The added network traffic may or may not be a problem depending on how your app is written and how much data you're sending. If you can do the processing asynchronously and be guaranteed reasonably fast Wi-Fi access that obviously wouldn't be a problem but unfortunately you may or may not have that luxury.
I am currently working with the Google Vision API and it seems to be able to detect landmarks out of the box. Check out the FaceTracker here:
google face tracker
This solution should detect the face, happiness, and the left and right eye as is. For other landmarks, you can call getLandmarks on a Face and it should return everything you need (though I have not tried it), according to their documentation: Face reference
I'm trying to capture images with 30 seconds exposure times in my app (I know it's possible since the stock camera allows it).
But SENSOR_INFO_EXPOSURE_TIME_RANGE (which is supposed to be in nanoseconds) gives me the range:
13272 - 869661901
In seconds that would be just
0.000013272 - 0.869661901
Which obviously is less than a second.
How can I use longer exposure times?
Thanks in advance!.
The answer to your question:
You can't. You checked exactly the right information and interpreted it correctly. Any value you set for the exposure time longer than that will be clipped to that max amount.
The answer you want:
You can still get what you want, though, by faking it. You want 30 continuous seconds' worth of photons falling on the sensor, which you can't get. But you can get something (virtually) indistinguishable from it by accumulating 30 seconds' worth of photons with tiny missing intervals interspersed.
At a high level, what you need to do is create a List of CaptureRequests and pass it to CameraCaptureSession.captureBurst(...). This will take the shots with as minimal an interstitial time as possible. When each frame of image data is available, pass it to some new buffer somewhere and accumulate the information (simple point-wise addition). This is probably most properly done with an Allocation as the output Surface and some RenderScript.
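To make the arithmetic and the accumulation step concrete, here is a tiny sketch in plain Python. It is not Android code: the lists stand in for frame buffers, and on a device you would do the addition in something like RenderScript as described above. The helper names are made up.

```python
import math


def frames_needed(total_seconds, max_exposure_ns):
    """Number of max-length exposures needed to cover total_seconds."""
    return math.ceil(total_seconds * 1_000_000_000 / max_exposure_ns)


def accumulate(frames):
    """Point-wise sum of equally sized frames (lists of pixel values)."""
    total = [0] * len(frames[0])
    for frame in frames:
        for i, value in enumerate(frame):
            total[i] += value
    return total


# With the upper bound reported in the question (869661901 ns):
n = frames_needed(30, 869_661_901)                          # 35 exposures
summed = accumulate([[1, 2, 3], [4, 5, 6], [10, 0, 1]])     # [15, 7, 10]
```

Note that the accumulator needs a wider data type than a single frame (35 frames of 10-bit data no longer fit in 10 bits), so plan the buffer format accordingly.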
Notes on data format:
The right way to do this is to use the RAW_SENSOR output format if you can. That way the accumulated output really is directly proportional to the light that was incident to the sensor over the whole 30s.
If you can't use that, for some reason, I would recommend using YUV_420_888 output, and make sure you set the tone map curve to be linear (unfortunately you have to do this manually by creating a curve with two points). Otherwise the non-linearity introduced will ruin our scheme. (Although I'm not sure simple addition is exactly right in a linear YUV space, but it's a first approach at least.) Whether you use this approach or RAW_SENSOR, you'll probably want to apply your own gamma curve/tone map after accumulation to make it "look right."
For the love of Pete, don't use JPEG output, for many reasons, not the least of which is that this will most likely add a LOT of interstitial time between exposures, thereby ruining our approximation of 30s of continuous exposure.
Note on exposure equivalence:
This will produce almost exactly the exposure you want, but not quite. It differs in two ways.
There will be small missing periods of photon information in the middle of this chunk of exposure time. But on the time scale you are talking about (30s), missing a few milliseconds of light here and there is trivial.
The image will be slightly noisier than if you had taken a true single exposure of 30s. This is because each time you read out the pixel values from the actual sensor, a little electronic noise gets added to the information. So in the end you'll have 35 readouts' worth of this additive noise (from the 35 exposures for your specific problem) instead of one. If that noise is independent from frame to frame, its variance grows 35-fold, so its standard deviation grows by roughly √35 ≈ 6 times. There's no way around this, sorry, but it might not even be noticeable: this noise is usually fairly small relative to the meaningful photographic signal. It depends on the camera sensor quality (and ISO, but I imagine for this application you need that to be high).
(Bonus!) This exposure will actually be superior in one way: Areas that might have been saturated (pure white) in a 30s exposure will still retain definition in these far shorter exposures, so you're basically guaranteed not to lose your high end details. :-)
You can't always trust SENSOR_INFO_EXPOSURE_TIME_RANGE as of May 2017. Try manually increasing the time and see what happens. I know my Pixel will actually take a 1.9 sec shot but SENSOR_INFO_EXPOSURE_TIME_RANGE has a value in the sub second range.
Last week I chose my major project. It is a vision-based system to monitor cyclists in time trial events as they pass certain points on the course. It should detect the bright yellow race number on a cyclist's back, extract the number from it, and also record the time.
I did some research and decided to use the Tesseract Android Tools by Robert Theis, called Tess Two. To speed up text recognition, I want to exploit the fact that the number is meant to be extracted from a bright (yellow) rectangle on the cyclist's back, and run the actual OCR only on that region. I have not found any code or ideas on how to detect geometric figures of a specific color. Thank you for any help. And sorry if I made any mistakes; I am pretty new to this website.
Where are the images coming from? I ask because I was asked to provide some technical help for the design of a similar application (we were working with footballers' shirts), and I can tell you that you'll have a few problems. Some advice:
Use a high quality video feed rather than rely on a couple of digital camera images.
The number will almost certainly be 'curved' or distorted because of the movement of the rider. Being able to use a series of images will sometimes allow you to work out what the number really is from a series of 'false reads'.
Train for the font you're using, but also apply as much logic as you can (if the numbers are always two digits and never start with '9', use this information to help you get the right number).
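The last two points combine naturally: filter each frame's OCR result through whatever rules you know, then vote across frames. A minimal sketch in plain Python follows; the rule set is just the example from the list above, the reads are invented, and the function names are hypothetical. This is not the third-party technology mentioned below.

```python
from collections import Counter


def is_plausible(read):
    """Example rules from above: exactly two digits, never starting with '9'."""
    return len(read) == 2 and read.isdigit() and not read.startswith("9")


def best_number(reads):
    """Most common plausible read across a series of frames, or None."""
    votes = Counter(r for r in reads if is_plausible(r))
    return votes.most_common(1)[0][0] if votes else None


# Invented OCR output for eight consecutive frames of the same rider.
reads = ["47", "41", "47", "97", "4", "47", "417", "41"]
number = best_number(reads)  # "47"
```

If you know the list of race numbers actually issued, checking against that list is an even stronger filter than the digit rules.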
If you have the luxury of being able to position the camera (we didn't!), I would have thought your ideal spot would be above the rider and looking slightly forward so you can capture their back with the minimum of distortions.
We found that merging several still-frames from the video into one image gave us the best overall image of the number - however, the technology that was used for this was developed by a third-party and they do not want to release it, I'm afraid :(
Good luck!
I have an application where I want to track 2 objects at a time that are rather small in the picture.
This application should be running on Android and iPhone, so the algorithm should be efficient.
For my customer it is perfectly fine if we deliver some patterns along with the software that are attached to the objects to be tracked to have a well-recognizable target.
This means that I can make up a pattern on my own.
As I am not that much into image processing yet, I don't know which objects are easiest to recognize in a picture even if they are rather small.
Color is also possible, although processing several planes separately is not desired because of the generated overhead.
Thank you for any advice!!
Best,
guitarflow
If I get this straight, your object should:
Be printable on an A4
Be recognizable up to 4 meters
Rotational invariance is not so important (I'm making the assumption that the user will hold the phone +/- upright)
I recommend printing a large checkerboard and using a combination of color-matching and corner detection. Try different combinations to see what's faster and more robust at different distances.
Color: if you only want to work on one channel, you can print in red/green/blue*, and then work only on that respective channel. This will already filter a lot and increase contrast "for free".
Otherwise, a histogram backprojection is in my experience quite fast. See here.
Also, let's say you have only 4 squares with RGB+black (see image). It would be easy to get all red contours, then check whether each has the correct neighbouring colors: a patch of blue to its right and a patch of green below it, both of roughly the same area. This alone might be robust enough, and is equivalent to working on 1 channel, since for each step you're only accessing one specific channel (search for contours in red, check right in blue, check below in green).
If you're getting a lot of false-positives, you can then use corners to filter your hits. In the example image, you have 9 corners already, in fact even more if you separate channels, and if it isn't enough you can make a true checkerboard with several squares in order to have more corners. It will probably be sufficient to check how many corners are detected in the ROI in order to reject false-positives, otherwise you can also check that the spacing between detected corners in x and y direction is uniform (i.e. form a grid).
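The uniform-spacing check could look something like this sketch (plain Python, assuming some corner detector has already given you `(x, y)` points inside the ROI and that the pattern is roughly upright; the function names, `merge_px` and `tolerance` values are arbitrary assumptions to tune).

```python
def cluster(values, merge_px):
    """Merge nearly equal coordinates and return one mean value per cluster."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= merge_px:
            groups[-1].append(v)
        else:
            groups.append([v])
    return [sum(g) / len(g) for g in groups]


def is_uniform(coords, tolerance=0.2):
    """True if the sorted coordinates are evenly spaced (needs >= 3 of them)."""
    gaps = [b - a for a, b in zip(coords, coords[1:])]
    if len(gaps) < 2:
        return False
    mean = sum(gaps) / len(gaps)
    return all(abs(g - mean) <= tolerance * mean for g in gaps)


def looks_like_grid(corners, merge_px=3, tolerance=0.2):
    """Check that corner x and y positions each fall on evenly spaced lines."""
    xs = cluster([x for x, _ in corners], merge_px)
    ys = cluster([y for _, y in corners], merge_px)
    return is_uniform(xs, tolerance) and is_uniform(ys, tolerance)


# Nine slightly jittered corners of a 2x2 pattern versus four random corners.
pattern = [(10, 10), (50, 11), (91, 10), (11, 50), (50, 50),
           (90, 51), (10, 90), (51, 90), (90, 89)]
clutter = [(10, 10), (20, 70), (85, 30), (90, 95)]
print(looks_like_grid(pattern), looks_like_grid(clutter))  # True False
```

This only works while the pattern is close to axis-aligned; for strong rotation or perspective you would need something more involved.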
Corners: Detecting corners has been greatly explored and there are several methods here. I don't know how efficient each one is, but they are fast enough, and after you've reduced the ROIs based on color, this should not be an issue.
Perhaps the simplest is to simply erode/dilate with a cross to find corners. See here .
You'll want to first threshold the image to create a binary map, probably based on color as mentioned above.
Other corner detectors such as Harris detector are well documented.
Oh and I don't recommend using Haar-classifiers. Seems unnecessarily complicated and not so fast (though very robust for complex objects: i.e. if you can't use your own pattern), not to mention the huge amount of work for training.
Haar training is your friend, mate.
This tutorial should get you started: http://note.sonots.com/SciSoftware/haartraining.html
Basically you train something called a classifier based on sample images (2000 or so of the object you want to track). OpenCV already has the tools required to build these classifiers and functions in the library to detect objects.