I have an object that is displayed on my Android phone in AR using Kudan markerless tracking, but when I rotate my phone so the object is off screen and then rotate back again, the object is either not there anymore or has scaled to an undesirable level. Quite simply, once placed, I want the object to continue to exist as if it's really in the world. Like AR is supposed to do, right?
I have just started with Kudan and I'm running the Markerless Unity tutorial, but it doesn't detail any further settings that would make the object seem as if it's in real-world space. Currently it only seems vaguely real if you don't move the camera that much. Even then the object is quite jittery. Any tips? Thanks
After some experimentation there seem to be a number of other issues with Markerless Kudan:
1/ very erratic frame rates going from 0 - 60 fps with just one object, even after I have halved the screen resolution. There seems to be no reason the fps drops or increases.
2/ Occasionally very long freezes of 15 seconds or more.
3/ Markerless objects always seem to come closer to the camera steadily as if gravitating. They will eventually end up inside/on top of the camera.
4/ They never look at all like they are in the real world. Always shaking and moving around.
5/ If I keep the camera very still and wave my hand around in front of it, this actually pushes the object around the screen. Why on earth would I want that to happen as default behavior? Is that a bug? Surely it should only move when the camera moves/rotates? Can someone explain why this is happening?
Am I doing something wrong or is this technology still way off from being usable?
1) Frame rate is dependent on a lot of things.
If you have an old or cheap phone, it could be that your processor simply isn't up to the task.
If you're in an environment that is poorly lit or otherwise difficult to track, then the tracker has to do more work and subsequently there is more load on the processor.
Random frame rate drops in Unity are a problem in many games / apps, just because of the nature of Unity.
2) "Freezes" are essentially just the frame rate dropping to 0. See 1).
3) That simply doesn't happen in any of Kudan's demos, so something else must be going on here, possibly the reasons mentioned in 1).
4) Don't really know what you mean by "shaking", but it isn't something I've seen all that much.
5) Markerless tracking works by tracking the camera image. If you wave your hand in front of the camera, your hand becomes part of the tracked image. If you then move your hand away, the tracker attempts to adjust for the change in its "environment" and moves the object in relation to your hand.
Context
I'm building an app which performs real-time object detection through the camera module of the device. The render is like the image below.
Let's say I try to recognize an apple, most of the time the app will recognize an apple. However, sometimes, the app will recognize the wrong fruit (let's say a lemon) on a few camera frames.
Goal
As the recognition of a fruit triggers an action in my code, my goal is to programmatically prevent a brief wrong recognition to trigger an action, and only take into account the majority result.
What I've tried
I tried this approach: if the same fruit is recognized several frames in a row, I assume the result is the right one. But as my device processes image recognition several times per second, even a wrong guess can be recognized several times in a row, which leads to the wrong action.
Question
Are there any known techniques for avoiding this behavior?
I feel like you've already answered your own question. In general the interpretation of a model's inference is its own tuning step. You know for example in logistic regression tasks that the threshold does NOT have to be 0.5. In fact, it's quite common to flex the threshold to see what the recall and precision are at various thresholds, and you can pick a threshold that works given your business/product problem. (Fraud detection might favor high recall if you never want to miss any fraud... or high precision if you don't want to annoy users with lots of false positives).
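To make the threshold point concrete, here is a minimal sketch that computes precision and recall at a few thresholds. The scores and labels are made up purely for illustration, and `precision_recall` is a hypothetical helper, not part of any library.

```python
def precision_recall(scores, labels, threshold):
    """Return (precision, recall) when scores >= threshold count as positive."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    # Convention chosen for this sketch: an empty denominator yields 1.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


# Made-up classifier scores and ground-truth labels (1 = positive).
scores = [0.95, 0.80, 0.70, 0.55, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 0, 1, 0]

for t in (0.3, 0.5, 0.75):
    p, r = precision_recall(scores, labels, t)
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, raising the threshold trades recall for precision, which is exactly the knob you get to turn.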
In video this broad concept is extended to multiple frames, as you know. You now have to tune the hyperparameters "how many frames total?" and "how many frames voting [apple]?"
If you are analyzing fruit going down a conveyer belt one by one, and you know each piece of fruit will be in frame for X seconds and you are shooting at 60 fps, maybe you want 60 * X frames. And maybe you want 90% of the frames to agree.
You'll want to visualize how often your detector "flips" detections so you can make a business/product judgement call on what your threshold ought to be.
This answer hasn't been very helpful in giving you a bright line rule here, but I hope it's helpful in suggesting that there is in fact NO bright line rule. You have to understand the problem to set the key hyperparameters:
1. For each frame, is top-1 acc sufficient, or do I need [.75] or higher confidence?
2. How many frames get to vote? Say [100].
3. How many correlated votes are necessary to trigger an actual signal? Maybe it's [85].
The above algo assumes you take a hardmax after step 1. Another option would be to just average the class probabilities over all 100 frames and pick a threshold. That's kind of a soft-label version of the above algo.
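The hard-vote and soft-vote variants above could be sketched like this. It is only a rough illustration: `FrameVoter` and `soft_vote` are hypothetical names, and the defaults (100 frames, 85 votes, 0.75 confidence) are just the placeholder values from the list, not recommendations.

```python
from collections import Counter, deque


class FrameVoter:
    """Hard-vote smoothing over the most recent `window` frames."""

    def __init__(self, window=100, min_votes=85, min_confidence=0.75):
        self.min_votes = min_votes
        self.min_confidence = min_confidence
        self.votes = deque(maxlen=window)  # oldest frame drops out automatically

    def update(self, label, confidence):
        """Record one frame's top-1 result; return the agreed label or None."""
        # A low-confidence frame still occupies a slot but votes for nothing.
        self.votes.append(label if confidence >= self.min_confidence else None)
        counts = Counter(v for v in self.votes if v is not None)
        if not counts:
            return None
        winner, n = counts.most_common(1)[0]
        return winner if n >= self.min_votes else None


def soft_vote(prob_frames, threshold):
    """Soft variant: average per-class probabilities, then apply a threshold."""
    if not prob_frames:
        return None
    totals = {}
    for probs in prob_frames:
        for label, p in probs.items():
            totals[label] = totals.get(label, 0.0) + p
    label, total = max(totals.items(), key=lambda kv: kv[1])
    return label if total / len(prob_frames) >= threshold else None
```

You would call `update` once per frame and only trigger your action when it returns a label. A handful of stray "lemon" frames then cannot flip the result unless they take over most of the window.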
I am developing app in which I need to get face landmarks points on a cam like mirror cam or makeup cam. I want it to be available for iOS too. Please guide me for a robust solution.
I have used Dlib and Luxand.
DLIB: https://github.com/tzutalin/dlib-android-app
Luxand: http://www.luxand.com/facesdk/download/
Dlib is slow and having a lag of 2 sec approximately (Please look at the demo video on the git page) and luxand is ok but it's paid. My priority is to use an open source solution.
I have also used Google Vision, but it does not offer many face landmark points.
So please give me a solution to make dlib work fast, or any other option, keeping cross-platform support a priority.
Thanks in advance.
You can make Dlib detect face landmarks in real-time on Android (20-30 fps) if you take a few shortcuts. It's an awesome library.
Initialization
Firstly, you should follow all the recommendations in Evgeniy's answer. In particular, make sure that you initialize the frontal_face_detector and shape_predictor objects only once instead of every frame. The frontal_face_detector will initialize faster if you deserialize it from a file instead of using the get_serialized_frontal_faces() function.
The shape_predictor needs to be initialized from a 100Mb file, which takes several seconds. The serialize and deserialize functions are written to be cross-platform and perform validation on the data, which is robust but makes them quite slow. If you are prepared to make assumptions about endianness, you can write your own deserialization function that will be much faster.
The file is mostly made up of matrices of 136 floating point values (about 120000 of them, meaning 16320000 floats in total). If you quantize these floats down to 8 or 16 bits you can make big space savings (e.g. you can store the min value and (max-min)/255 as floats for each matrix and quantize each matrix separately). With 8-bit values this reduces the file size to about 18Mb, and it loads in a few hundred milliseconds instead of several seconds. The decrease in quality from using quantized values seems negligible to me, but YMMV.
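The per-matrix 8-bit quantization described above might look roughly like this. It is a sketch in plain Python on a list of floats, not dlib's actual file format or API, and `quantize` / `dequantize` are hypothetical helper names.

```python
def quantize(values):
    """Quantize one matrix (a flat list of floats) to 8-bit codes.

    Returns (lo, step, codes), where lo is the minimum value and
    step is (max - min) / 255, both stored once per matrix.
    """
    lo, hi = min(values), max(values)
    step = (hi - lo) / 255 or 1.0  # constant matrix: any non-zero step works
    codes = bytes(min(255, round((v - lo) / step)) for v in values)
    return lo, step, codes


def dequantize(lo, step, codes):
    """Rebuild approximate floats from the 8-bit codes."""
    return [lo + c * step for c in codes]


# Made-up values standing in for one matrix of model parameters.
values = [-1.0, -0.5, 0.0, 0.25, 1.0]
lo, step, codes = quantize(values)
restored = dequantize(lo, step, codes)
worst = max(abs(a - b) for a, b in zip(values, restored))
print(f"bytes used: {len(codes)}, worst rounding error: {worst:.5f}")
```

The rounding error per value is at most half a step, which is why quantizing each matrix separately (with its own min and step) keeps the loss small.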
Face Detection
You can scale the camera frames down to something small like 240x160 (or whatever, keeping aspect ratio correct) for faster face detection. It means you can't detect smaller faces but it might not be a problem depending on your app. Another more complex approach is to adaptively crop and resize the region you use for face detections: initially check for all faces in a higher res image (e.g. 480x320) and then crop the area +/- one face width around the previous location, scaling down if need be. If you fail to detect a face one frame then revert to detecting the entire region the next one.
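Here is one way the adaptive crop could be computed, as a rough sketch. It assumes face rectangles are `(x, y, w, h)` tuples in frame coordinates; `search_region` is a hypothetical helper, and the actual detector call is left out.

```python
def search_region(prev_face, frame_w, frame_h):
    """Return the (x, y, w, h) region to run face detection on.

    prev_face is the (x, y, w, h) rectangle from the previous frame,
    or None if the last detection failed (then search the whole frame).
    """
    if prev_face is None:
        return 0, 0, frame_w, frame_h
    x, y, w, h = prev_face
    # Pad by one face width on every side, clamped to the frame bounds.
    left = max(0, x - w)
    top = max(0, y - w)
    right = min(frame_w, x + w + w)
    bottom = min(frame_h, y + h + w)
    return left, top, right - left, bottom - top


print(search_region((200, 100, 60, 60), 480, 320))
print(search_region(None, 480, 320))
```

Remember to add the crop's offset back to whatever rectangle the detector returns, since it will be relative to the cropped region.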
Face Tracking
For faster face tracking, you can run face detections continuously in one thread, and then in another thread, track the detected face(s) and perform face feature detections using the tracked rectangles. In my testing I found that face detection took between 100 and 400ms depending on what phone I used (at about 240x160), and I could do 7 or 8 face feature detections on the intermediate frames in that time. This can get a bit tricky if the face is moving a lot, because when you get a new face detection (which will be from up to 400ms ago), you have to decide whether to keep tracking from the newly detected location or from the tracked location of the previous detection. Dlib includes a correlation_tracker; unfortunately, I wasn't able to get this to run faster than about 250ms per frame, and scaling down the resolution (even drastically) didn't make much of a difference. Tinkering with internal parameters produced increased speed but poor tracking. I ended up using a CAMShift tracker based on the chroma UV planes of the preview frames, generating the color histogram based on the detected face rectangles. There is an implementation of CAMShift in OpenCV, but it's also pretty simple to roll your own.
Hope this helps, it's mostly a matter of picking the low hanging fruit for optimization first and just keep going until you're happy it's fast enough. On a Galaxy Note 5 Dlib does face+feature detections at about 100ms, which might be good enough for your purposes even without all this extra complication.
Dlib is fast enough for most cases. Most of the processing time is spent detecting the face region in the image, and that is slow because modern smartphones produce high-resolution images (10MP+).
Yes, face detection can take 2+ seconds on a 3-5MP image, but that is because it tries to find very small faces, down to 80x80 pixels. I am quite sure that you don't need such small faces on high-resolution images, so the main optimization here is to reduce the size of the image before finding faces.
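As a rough sketch of how you might pick the reduction: choose the smallest face you actually care about in the full-resolution image, and shrink the image until that face is about 80 pixels wide. The helper names here are hypothetical, and the resize itself is left to whatever image library you use.

```python
def downscale_factor(smallest_face_px, detector_min_face_px=80):
    """Scale (<= 1.0) that maps the smallest face of interest to the
    detector's minimum face size (80 px, per the figure quoted above)."""
    if smallest_face_px <= detector_min_face_px:
        return 1.0  # already at or below the minimum; don't shrink
    return detector_min_face_px / smallest_face_px


def scaled_size(width, height, scale):
    """New image size after applying the scale, at least 1x1."""
    return max(1, round(width * scale)), max(1, round(height * scale))


# Example: in a 2560x1920 frame, faces smaller than 400 px are not interesting.
scale = downscale_factor(400)
print(scale, scaled_size(2560, 1920, scale))
```

The pixel count falls with the square of the scale, so even a modest reduction removes most of the work the detector has to do.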
After the face region is found, the next step - face landmarks detection is extremely fast and takes < 3 ms for one face, this time does not depend on resolution.
dlib-android port is not using dlib's detector the right way for now. Here is a list of recommendations how to make dlib-android port work much faster:
https://github.com/tzutalin/dlib-android/issues/15
It's very simple and you can implement it yourself. I would expect a performance gain of about 2x-20x.
Apart from OpenCV and Google Vision, there are widely available web services like Microsoft Cognitive Services. The advantage is that it would be completely platform-independent, which you've listed as a major design goal. I haven't personally used them in an implementation yet, but based on playing with their demos for a while they seem quite powerful; they're pretty accurate and can offer quite a few details depending on what you want to know. (There are similar solutions available from other vendors as well, by the way.)
The two major potential downsides to something like that are the potential for added network traffic and API pricing (depending on how heavily you'll be using them).
Pricing-wise, Microsoft currently offers up to 5,000 transactions a month for free with added transactions beyond that being some fraction of a penny (depending on traffic, you can actually get a discount for high volume), but if you're doing, for example, millions of transactions per month the fees can start adding up surprisingly quickly. This is actually a fairly typical pricing model; before you select a vendor or implement this kind of a solution make sure you understand how they're going to charge you and how much you're likely to end up paying and how much you could be paying if you scale your user base. Depending on your traffic and business model it could be either very reasonable or cost-prohibitive.
The added network traffic may or may not be a problem depending on how your app is written and how much data you're sending. If you can do the processing asynchronously and be guaranteed reasonably fast Wi-Fi access that obviously wouldn't be a problem but unfortunately you may or may not have that luxury.
I am currently working with the Google Vision API and it seems to be able to detect landmarks out of the box. Check out the FaceTracker here:
google face tracker
This solution should detect the face, happiness, and left and right eye as is. For other landmarks, you can call getLandmarks on a Face and it should return everything you need (though I have not tried it) according to their documentation: Face reference
I am trying to create an application that will track movement of the device in 2D space. After doing research online, all I could find is that one way to do it is to integrate linear acceleration twice, but the error is horrible.
Are there any solutions to this problem? I would like to be able to move my phone up, which would cause a vertical line to be drawn on the screen, to scale of how far the phone was moved. Then if I move the phone to the left, horizontal line would be drawn - effectively allowing me to draw on the screen using movements of the phone.
Can this be done at all? If so, what direction should I take in the development? I don't know where to start...
EDIT: More about the project:
I am trying to make an exercise app that will track the movement of the leg/arm: for example, when you are doing stomach crunches and the phone is attached with an armstrap to your ankle.
The app would track repeated movements of the leg.
Unfortunately the accelerometers in these phones are nowhere near what you need to implement an inertial measurement unit. The big problem is that you are integrating twice, and an integration always comes with a constant: integral(x, dx) = x^2/2 + c. This constant is what makes this difficult. To make things worse you get it twice, once when integrating to get velocity and once to get position.
One method of fixing this that I have seen in commercial inertial measurement units is called a zero velocity null. This is where you use some other source of data to tell it when you have stopped the motion of the device, so you can zero out the velocity. For example, I saw a project put an inertial measurement unit on a shoe, and it would zero the velocity whenever it detected the shoe being put on the ground, which vastly improved the accuracy. It's possible that you could use a camera or something to determine this; however, I have not seen it done. If you would like to start messing with this then you are an awesome person and I would love to hear how it turns out.
Edit: I should clarify that the constant I mention above is where the error accumulates. If you can zero velocity null it then you periodically drop the accumulated error from your stored current velocity. The error in position will still accumulate, however this would make it not drift when they are holding it relatively still which may make it passable for drawing.
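To see how much the zeroing helps, here is a small simulation. It is only a sketch under made-up assumptions: the device never moves, the sensor reports a constant 0.05 m/s^2 bias, and some other signal (the hypothetical `stationary` flags) tells us the device is at rest once per second.

```python
def integrate(accels, dt, stationary=None):
    """Integrate acceleration twice with simple Euler steps.

    If stationary[i] is True, the velocity is reset to zero at sample i
    (the zero velocity null described above). Returns the position trace.
    """
    velocity = 0.0
    position = 0.0
    positions = []
    for i, a in enumerate(accels):
        velocity += a * dt
        if stationary is not None and stationary[i]:
            velocity = 0.0  # drop the accumulated velocity error
        position += velocity * dt
        positions.append(position)
    return positions


dt = 0.01                    # 100 Hz sampling (assumed)
n = 1000                     # 10 seconds of data
biased = [0.05] * n          # true acceleration is 0; this is pure sensor bias
rest = [i % 100 == 99 for i in range(n)]  # "known to be at rest" once per second

drift_plain = integrate(biased, dt)[-1]
drift_zeroed = integrate(biased, dt, rest)[-1]
print(f"drift without zeroing: {drift_plain:.4f} m")
print(f"drift with zeroing:    {drift_zeroed:.4f} m")
```

Without zeroing, the position error grows with the square of time; with periodic zeroing it still grows, but only linearly, which matches the caveat in the edit above.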
I know no other way other than integrating the acceleration twice.
Moreover I think that it's not possible if you don't have knowledge about other sensors that might be in your device (for example on one of my devices I have 7 (seven) sensors related to various physical signals the device might be receiving).
Other than that, remember that the sensor data is noisy and almost always must be pre-filtered. For example, you can average the last 10 samples (a moving average). That should lower your error by providing smoother input data to the integrating function.
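A pre-filter along those lines could be as small as this. The sketch uses a plain arithmetic moving average, which is an assumption on my side: accelerometer readings are signed, so it is the simplest mean that is always defined. `MovingAverage` is a hypothetical helper, used once per axis.

```python
from collections import deque


class MovingAverage:
    """Running mean over the most recent `size` samples of one axis."""

    def __init__(self, size=10):
        self.samples = deque(maxlen=size)  # oldest sample is dropped automatically

    def add(self, value):
        """Add a raw sample and return the smoothed value."""
        self.samples.append(value)
        return sum(self.samples) / len(self.samples)


smooth_x = MovingAverage(size=3)
for raw in (3.0, 6.0, 9.0, 12.0):
    print(smooth_x.add(raw))
```

A longer window smooths more but also adds lag, so the window size is something to tune against your sampling rate.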
In an Android app I'm making, I would like to detect when a user who is holding a phone in his hand makes a gesture like the one he would make when throwing a frisbee. I have seen a couple of apps implementing this, but I can't find any example code or tutorial on the web.
It would be great to get some thoughts on how this could be done, and of course it would be even better with some example code or a link to a tutorial.
The accelerometer provides you with a stream of 3D vectors. When your phone is held still in the hand, the vector's direction is opposite to the pull of earth's gravity and its magnitude is the same. (This way you can determine phone orientation.)
If the user lets it fall, the vector's magnitude will go to 0 (the same effect as weightlessness on a space station).
If the user makes some gesture without throwing it, the direction will shift, and the amplitude will rise, then fall, and then rise again (when the user stops the movement). To determine what it looks like, you can do some research by recording accelerometer data while performing the desired gestures.
Keep in mind that the accelerometer is pretty noisy - you will have to do some averaging over nearby values to get meaningful results.
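As a starting point for that kind of research, here is a rough sketch that labels single samples by the magnitude of the vector. The thresholds are hypothetical placeholders that you would tune from recorded data, and a real gesture detector would look at a (smoothed) sequence of samples rather than one at a time.

```python
import math


def magnitude(sample):
    """Length of an (x, y, z) accelerometer vector in m/s^2."""
    x, y, z = sample
    return math.sqrt(x * x + y * y + z * z)


def classify(sample, free_fall_below=2.0, motion_above=15.0):
    """Label one sample as 'free_fall', 'motion' or 'rest'.

    Near 0 means the phone is falling; near 9.81 means it is held still;
    well above that means the hand is accelerating it.
    """
    m = magnitude(sample)
    if m < free_fall_below:
        return "free_fall"
    if m > motion_above:
        return "motion"
    return "rest"


print(classify((0.0, 0.0, 9.81)))    # phone lying still
print(classify((0.1, 0.2, 0.3)))     # phone in free fall
print(classify((12.0, 10.0, 9.81)))  # phone being swung
```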
I think that one workable approach to matching gestures would be invariant moments (like the Hu moments used in image recognition) - the accelerometer vector over time defines a 4-dimensional space, and you will need a set of scaling / rotation invariant moments. Designing such a set is not easy, but computing it is not complicated.
After you have your moments, you may use standard techniques for matching vectors to clusters. (See the "moments" and "cluster" modules from our javaocr project: http://javaocr.svn.sourceforge.net/viewvc/javaocr/trunk/plugins/ )
PS: you may get away with just speed over time, which produces a 2-dimensional space and can be analysed with javaocr on the spot.
Not exactly what you are looking for:
Store orientation to an array - and compare
Tracking orientation works well. Perhaps you can do something similar with the accelerometer data (without any integration).
A similar question is Drawing in air with Android phone.
I am curious what other answers you will get.
It's my first question, I searched on google as always, round two was searching directly on SO, but still I couldn't get the exact answer.
I'm going to write a 3D graphics engine for games I want to make in the future for the Android platform. I played Doom recently on my mobile during a break at high school, and when I lifted my head after being killed by a Baron of Hell I saw more than twenty people struggling to see me playing, followed by a loud "AAAAAAAW!" when they saw me dead. My jaw dropped to the floor. None of them suspected that it was a game from 1993.
But let's get back to the point. If you wanted to write "find it by yourself" you can stop typing now. I can't check this on my own, for several reasons. First, I own only one mid-cost device (HTC Wildfire) I can test my engine on. The second and more important reason is time. I don't have time to write a whole OpenGL-ES 2.0 graphics engine only to realize that my Wildfire can't even run it, or that it gets around 1FPS when walking.
For raycasting to be more realistic it needs few calculations, which are not important for me in OpenGL because I just set vertices & indices and it goes. I love the way that Doom levels are designed but I wanted to add ability to move your head up and down (rotation on X-axis) to look around and shoot more precisely, and the complexity of calculations in raycasting are growing (not for me, for CPU). I know Wildfire doesn't have any GPU and anything based on OGL (even 2D) is lagging as hell even if I overclock my CPU to #748 with PERFORMANCE governor (it turns off on higher values, and I got to move it from my Casemate to stop it from overheating). But Doom is running perfectly without any lag even if I underclock it to #448. Is this only because of lo-res textures and non-complex levels or because of raycasting?
And please don't flame me for going back to such an old thing as a raycasting engine. Current mobile devices, or smartphones - call them what you want - that don't cost $1k are at the stage that computers were at 18 years ago (when Doom was released). There will always be a group of such devices. And even if I don't make my billion on this, I will have something to play while travelling.
I've got some pretty nice ideas for games like this, and I want to hit the low cost devices as they don't have anything more spectacular than simple logic games. As this is my first question please correct anything I wrote wrong, poorly or just bad - I'll rewrite it.