3D object recognition for AR Android app

I'm trying to develop an AR Android application.
It should detect and recognize the object captured by the camera. I'm using OpenCV for this purpose, but I'm not very familiar with object recognition on mobile devices in the AR field.
I have two questions:
1- Which algorithm is best (in terms of precision and speed): SIFT, SURF, FAST, ORB, or something else?
2- I wonder if the process of detecting and tracking would be something like this:
take a camera frame, detect its keypoints, compute its descriptors, then match them against each image (Mat of descriptors) in the database to find which one it belongs to.
I feel these steps will be computationally heavy, especially if they're repeated for each frame to keep tracking the object.
Please provide me with some details about the algorithm and the steps that best fit my goal.
Thanks in advance

FAST is only a detector, whereas SIFT, SURF, ORB, and BRISK are both detectors and descriptors.
Your question is a very general one.
SIFT descriptor is a classic approach, also the “original”
inspiration for most of the descriptors proposed later. The drawback
is that it is mathematically complicated and computationally heavy.
The SURF detector is recognized as a more efficient substitute for
SIFT. It has a Hessian-based detector and a distribution-based
descriptor generator.
SIFT and SURF are based on the histograms of gradients. That is, the
gradients of each pixel in the patch need to be computed. These
computations cost time. Even though SURF speeds up the computation
using integral images, this still isn’t fast enough for some
applications.
SIFT and SURF are the most accurate, but they are patented and cannot be used commercially without a license.
FAST is a standalone feature detector and it is not a descriptor
generator. It is designed to be very efficient and suitable for
real-time applications of any complexity.
The BRIEF descriptor is a lightweight, easy-to-implement descriptor
based on binary strings. It is targeted at low-power
devices, and trades some of its robustness and accuracy for
efficiency.
Binary descriptors are an attractive solution to many modern applications, especially for mobile platforms where both compute and memory resources are limited.
In my view, I would prefer ORB, as it is a binary descriptor. It has lower computational and memory requirements compared to BRISK.
You should do some research on all these available descriptors before settling on one.
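To illustrate the pipeline from the second question, here is a minimal sketch using OpenCV's Java bindings: ORB with a Hamming-distance brute-force matcher. It assumes you have precomputed a Mat of descriptors per database image; the distance threshold of 50 is a tunable guess, not a recommended value.

    import org.opencv.core.DMatch;
    import org.opencv.core.Mat;
    import org.opencv.core.MatOfDMatch;
    import org.opencv.core.MatOfKeyPoint;
    import org.opencv.features2d.DescriptorMatcher;
    import org.opencv.features2d.ORB;

    public class OrbMatcher {
        private final ORB orb = ORB.create();  // defaults to 500 keypoints per image
        private final DescriptorMatcher matcher =
                DescriptorMatcher.create(DescriptorMatcher.BRUTEFORCE_HAMMING);

        // Counts "good" matches between the current camera frame and one
        // database image's precomputed descriptors; the image with the most wins.
        public int countGoodMatches(Mat grayFrame, Mat referenceDescriptors) {
            MatOfKeyPoint keypoints = new MatOfKeyPoint();
            Mat descriptors = new Mat();
            orb.detectAndCompute(grayFrame, new Mat(), keypoints, descriptors);
            if (descriptors.empty()) return 0;

            MatOfDMatch matches = new MatOfDMatch();
            matcher.match(descriptors, referenceDescriptors, matches);

            // Keep only matches with a small Hamming distance; 50 is a guess to tune.
            int good = 0;
            for (DMatch m : matches.toArray()) {
                if (m.distance < 50) good++;
            }
            return good;
        }
    }

To avoid paying this cost on every frame, a common pattern is to match once to identify the object and then hand off to a cheaper tracker (e.g. optical flow) for the following frames, re-matching only when tracking is lost.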

I know it is an old question but I feel it will be able to help others.
There is a good tutorial that uses Android, OpenCV, and OpenGL ES 3.0 to build a small AR app in Android Studio using the NDK.
It has good explanations and a GitHub repo with the code.
http://www.anandmuralidhar.com/blog/android/simple-ar/
It uses ORB features to detect and match a marker, and then spawns a 3D object on the scene.
About your second point, the tutorial can give you an idea of how the process can work.

Related

How to record video with AR effects and save it?

I am trying to create an application like Snapchat that applies face filters while recording the video and saves it with the filter on.
I know there are packages like ARCore and flutter_camera_ml_vision, but these are not helping me.
I want to apply face filters to the face at the time of video recording, and also save the video with the filter on the face.
Not The Easiest Question To Answer, But...
I'll give it a go, let's see how things turn out.
First of all, you should fill in some more details about the statements given in the question, especially what you're trying to say here:
I know there are packages like ARCore and flutter_camera_ml_vision, but these are not helping me.
How did you approach the problem and what makes you say that it didn't help you?
In the Beginning...
First of all, let's get some needed basics out of the way to better understand your current situation and level in the prerequisite areas of knowledge:
Do you have any experience using Computer Vision & Machine Learning frameworks in other languages / in other apps?
Do you have the required math skills needed to use this technology?
As you're using Flutter, my guess is that cross-platform compatibility is a high priority. Have you done much Flutter programming before, and what devices are your main targets?
So, What is required for creating a Snapchat-like filter for use in live video recording?
Well, quite a lot of work happens behind the scenes when you apply a filter to live video using any app that implements this in a decent way.
Snapchat uses in-house software that they've built up over years, combining their own efforts with technology from multiple multi-million dollar acquisitions of established companies specializing in Computer Vision and AR. It has steadily grown to be quite impressive, particularly over the last 5-6 years.
This isn't something you can throw together yourself in an all-nighter and expect good results. There are tools available to ease the general learning curve, but these tools also require a firm understanding of the underlying concepts and technologies, and quite a lot of math.
The Technical Detour
OK, I know I may have gone a bit overboard here, but these are fundamental building blocks. Not many people are aware of the actual amount of computation needed for seemingly "basic" functionality, so TL;DR or not, this is fundamental stuff.
To create a good filter for live capture using a camera on something like an iPhone or Android device, you could, and most probably would, use AR as you mentioned you wanted to in the end. But realize that AR is a sub-set of the broad field of Computer Vision (CV), which uses various algorithms from Artificial Intelligence (AI) and Machine Learning (ML) for the main tasks of:
Facial Recognition
Given frames of video content from the live camera, define the area containing a human face (some detectors also work with animals, but let's keep it as simple as possible) and output a rectangle (x, y, width, height) suitable for use as a starting point.
The analysis phase alone requires a rather complex combination of algorithms / techniques from different parts of the AI universe, and since this is video rather than a single static image, the result must be continuously updated as the person / camera moves, in close to real-time, in the millisecond range.
I believe different implementations combining HOG (Histogram of Oriented Gradients) from Computer Vision and SVMs (Support Vector Machines / Networks) from Machine Learning are still pretty common.
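To make the HOG + SVM combination concrete, OpenCV ships a HOGDescriptor with a pretrained linear SVM. The bundled model detects pedestrians rather than faces, but the pipeline has the same shape; this is purely an illustrative sketch:

    import org.opencv.core.Mat;
    import org.opencv.core.MatOfDouble;
    import org.opencv.core.MatOfRect;
    import org.opencv.objdetect.HOGDescriptor;

    public class HogExample {
        // Runs OpenCV's pretrained HOG + linear-SVM detector on a grayscale frame.
        // The bundled model detects pedestrians; a face detector would use the
        // same mechanism with an SVM trained on face patches instead.
        public static MatOfRect detect(Mat grayFrame) {
            HOGDescriptor hog = new HOGDescriptor();
            hog.setSVMDetector(HOGDescriptor.getDefaultPeopleDetector());

            MatOfRect locations = new MatOfRect();
            MatOfDouble weights = new MatOfDouble();  // confidence per detection
            hog.detectMultiScale(grayFrame, locations, weights);
            return locations;
        }
    }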
Detection of Facial Landmarks
This is what defines how well a certain effect / filter will adapt to different types of facial features and detect accessories like glasses, hats, etc. It is also called "facial keypoint detection", "facial feature detection", and other variants in the literature on the subject.
Head Pose Estimation
Once you know a few landmark points, you can also estimate the pose of the head.
This is an important part of effects like "face swap", to correctly re-align one face with another in an acceptable manner. A toolkit like OpenFace (which uses Python, OpenCV, OpenBLAS, Dlib, and more) contains a lot of useful functionality, capable of facial landmark detection, head pose estimation, facial action unit recognition, and eye-gaze estimation, delivering pretty decent results.
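As a taste of how this works once you have landmarks, here is a hedged sketch using OpenCV's solvePnP in Java. The 3D model coordinates are rough, commonly used illustrative values for a generic head, not a calibrated model, and the camera intrinsics are approximated:

    import org.opencv.calib3d.Calib3d;
    import org.opencv.core.CvType;
    import org.opencv.core.Mat;
    import org.opencv.core.MatOfDouble;
    import org.opencv.core.MatOfPoint2f;
    import org.opencv.core.MatOfPoint3f;
    import org.opencv.core.Point;
    import org.opencv.core.Point3;

    public class HeadPose {
        // Rough 3D positions of six landmarks on a generic head (illustrative
        // values only): nose tip, chin, left/right eye corner,
        // left/right mouth corner.
        private static final MatOfPoint3f MODEL = new MatOfPoint3f(
                new Point3(0.0, 0.0, 0.0),
                new Point3(0.0, -330.0, -65.0),
                new Point3(-225.0, 170.0, -135.0),
                new Point3(225.0, 170.0, -135.0),
                new Point3(-150.0, -150.0, -125.0),
                new Point3(150.0, -150.0, -125.0));

        // landmarks2d must hold the same six landmarks detected in the image.
        public static void estimate(Point[] landmarks2d, double focalLength,
                                    Point center, Mat rvec, Mat tvec) {
            MatOfPoint2f imagePoints = new MatOfPoint2f(landmarks2d);

            // Approximate intrinsics: focal length ~ image width, optical
            // center at the image center, no lens distortion.
            Mat cameraMatrix = Mat.zeros(3, 3, CvType.CV_64F);
            cameraMatrix.put(0, 0,
                    focalLength, 0, center.x,
                    0, focalLength, center.y,
                    0, 0, 1);

            // rvec/tvec receive the head's rotation and translation.
            Calib3d.solvePnP(MODEL, imagePoints, cameraMatrix,
                    new MatOfDouble(), rvec, tvec);
        }
    }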
The Compositing of Effects into the Video Frames
After the work above is done, the rest involves applying the target filter, dog ears, rabbit teeth, whatever, to the video frames using compositing techniques.
As this answer is starting to look more like an article, I'll leave it to you to go figure out if you want to know more of the details in this part of the process.
Hey, Dude. I asked for AR in Flutter, remember?
Yep.
I know, I can get a bit carried away.
Well, my point is that it takes a lot more than one would usually imagine to create something like you're asking for.
BUT.
My best advice, if Flutter is your tool of choice, would be to learn how to use the cloud-based ML services from Google's Firebase suite of tools: Firebase Machine Learning and Google's ML Kit.
Add to this some AR-specific plugins, like the ARCore Plugin, and I'm sure you'll be able to get the pieces together if you have the right background and attitude, plus a good aptitude for learning.
Hope this wasn't digressing too far from your core question, but there are no shortcuts that I know of that cut more corners than what I've already mentioned.
You could absolutely use the flutter_camera_ml_vision plugin and its face detection, which will give you positions for the landmarks of a face, such as the nose, eyes, etc. Then simply stack the CameraPreview with a CustomPaint widget (using its foregroundPainter) in which you draw your filters, using the different landmarks as coordinates for e.g. glasses, beards, or whatever you want, at the correct position of the face in the camera preview.
Google ML Kit also has face detection that produces landmarks, and you could write your own Flutter plugin for that.
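As a rough sketch of what the Android side of such a plugin could look like with ML Kit's face detection API (the Bitmap source, rotation handling, and what you do with each landmark are placeholders):

    import android.graphics.Bitmap;
    import com.google.mlkit.vision.common.InputImage;
    import com.google.mlkit.vision.face.Face;
    import com.google.mlkit.vision.face.FaceDetection;
    import com.google.mlkit.vision.face.FaceDetector;
    import com.google.mlkit.vision.face.FaceDetectorOptions;
    import com.google.mlkit.vision.face.FaceLandmark;

    public class LandmarkDetector {
        private final FaceDetector detector = FaceDetection.getClient(
                new FaceDetectorOptions.Builder()
                        .setLandmarkMode(FaceDetectorOptions.LANDMARK_MODE_ALL)
                        .setPerformanceMode(FaceDetectorOptions.PERFORMANCE_MODE_FAST)
                        .build());

        // Detects faces in a frame and reads the nose position of each.
        public void process(Bitmap frame, int rotationDegrees) {
            InputImage image = InputImage.fromBitmap(frame, rotationDegrees);
            detector.process(image)
                    .addOnSuccessListener(faces -> {
                        for (Face face : faces) {
                            FaceLandmark nose = face.getLandmark(FaceLandmark.NOSE_BASE);
                            if (nose != null) {
                                // nose.getPosition() gives the coordinate to draw at
                            }
                        }
                    });
        }
    }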
You can capture frames from the live camera preview, reformat them, and then send them as a byte buffer to ML Kit or ML Vision. I am currently writing a Flutter plugin for ML Kit pose detection with live capture, so if you have any specific questions about that, let me know.
You will then have to merge the two surfaces and save to a file in an appropriate format. This is unknown territory for me, so I cannot provide any details about this part.

Detecting face landmark points in Android

I am developing an app in which I need to get face landmark points from a live camera preview, like a mirror cam or makeup cam. I want it to be available for iOS too. Please guide me to a robust solution.
I have used Dlib and Luxand.
DLIB: https://github.com/tzutalin/dlib-android-app
Luxand: http://www.luxand.com/facesdk/download/
Dlib is slow, with a lag of approximately 2 seconds (please look at the demo video on the GitHub page), and Luxand is OK, but it's paid. My priority is to use an open-source solution.
I have also used Google Vision, but it does not offer many face landmark points.
So please give me a solution to make Dlib work fast, or any other option, keeping cross-platform compatibility a priority.
Thanks in advance.
You can make Dlib detect face landmarks in real-time on Android (20-30 fps) if you take a few shortcuts. It's an awesome library.
Initialization
Firstly, you should follow all the recommendations in Evgeniy's answer, especially making sure that you only initialize the frontal_face_detector and shape_predictor objects once instead of every frame. The frontal_face_detector will initialize faster if you deserialize it from a file instead of using the get_serialized_frontal_faces() function.
The shape_predictor needs to be initialized from a 100Mb file, which takes several seconds. The serialize and deserialize functions are written to be cross-platform and perform validation on the data, which is robust but makes them quite slow. If you are prepared to make assumptions about endianness, you can write your own deserialization function that will be much faster.
The file is mostly made up of matrices of 136 floating point values (about 120,000 of them, meaning 16,320,000 floats in total). If you quantize these floats down to 8 or 16 bits you can make big space savings (e.g. you can store the min value and (max - min) / 255 as floats for each matrix and quantize each matrix separately). This reduces the file size to about 18Mb, and it loads in a few hundred milliseconds instead of several seconds. The decrease in quality from using quantized values seems negligible to me, but YMMV.
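For illustration, here is a minimal sketch of the per-matrix 8-bit quantization described above; the (min, scale) header layout is my own assumption for the example, not dlib's file format:

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    public class MatrixQuantizer {
        // Quantize one matrix of floats to 8-bit values plus a (min, scale)
        // pair, as described above: store min and (max - min) / 255 per matrix.
        public static ByteBuffer quantize(float[] matrix) {
            float min = Float.POSITIVE_INFINITY, max = Float.NEGATIVE_INFINITY;
            for (float v : matrix) {
                if (v < min) min = v;
                if (v > max) max = v;
            }
            float scale = (max - min) / 255f;

            ByteBuffer out = ByteBuffer.allocate(8 + matrix.length)
                                       .order(ByteOrder.nativeOrder());
            out.putFloat(min).putFloat(scale);
            for (float v : matrix) {
                int q = scale == 0f ? 0 : Math.round((v - min) / scale);
                out.put((byte) q);
            }
            out.flip();
            return out;
        }

        // Reverse the mapping at load time: value = min + q * scale.
        public static float[] dequantize(ByteBuffer in, int count) {
            float min = in.getFloat(), scale = in.getFloat();
            float[] matrix = new float[count];
            for (int i = 0; i < count; i++) {
                matrix[i] = min + (in.get() & 0xFF) * scale;
            }
            return matrix;
        }
    }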
Face Detection
You can scale the camera frames down to something small, like 240x160 (or whatever, keeping the aspect ratio correct), for faster face detection. It means you can't detect smaller faces, but that might not be a problem depending on your app. Another, more complex approach is to adaptively crop and resize the region you use for face detection: initially check for all faces in a higher-res image (e.g. 480x320), and then crop the area +/- one face width around the previous location, scaling down if need be. If you fail to detect a face in one frame, revert to detecting over the entire region in the next.
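A minimal sketch of the downscale-then-detect idea with OpenCV's Java bindings; the FaceDetector interface is a hypothetical stand-in for however you invoke dlib (e.g. through JNI):

    import org.opencv.core.Mat;
    import org.opencv.core.Rect;
    import org.opencv.core.Size;
    import org.opencv.imgproc.Imgproc;

    public class ScaledDetection {
        // Detect on a small image, then map the rectangle back to full resolution.
        public static Rect detectScaled(Mat fullFrame, FaceDetector detector) {
            double scale = 240.0 / fullFrame.cols();  // target ~240px wide, keep aspect
            Mat small = new Mat();
            Imgproc.resize(fullFrame, small,
                    new Size(fullFrame.cols() * scale, fullFrame.rows() * scale));

            Rect r = detector.detect(small);  // placeholder for your detector
            if (r == null) return null;

            // Scale the detection back up to full-frame coordinates.
            return new Rect((int) (r.x / scale), (int) (r.y / scale),
                            (int) (r.width / scale), (int) (r.height / scale));
        }

        // Hypothetical detector interface wrapping e.g. a JNI call into dlib.
        public interface FaceDetector { Rect detect(Mat image); }
    }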
Face Tracking
For faster face tracking, you can run face detections continuously in one thread and then, in another thread, track the detected face(s) and perform the face feature detections using the tracked rectangles. In my testing, face detection took between 100 and 400ms depending on the phone (at about 240x160), and I could do 7 or 8 face feature detections on the intermediate frames in that time. This can get a bit tricky if the face is moving a lot, because when you get a new face detection (which will be from 400ms ago), you have to decide whether to keep tracking from the newly detected location or from the tracked location of the previous detection.
Dlib includes a correlation_tracker, but unfortunately I wasn't able to get it to run faster than about 250ms per frame, and scaling down the resolution (even drastically) didn't make much of a difference. Tinkering with its internal parameters increased speed but gave poor tracking. I ended up using a CAMShift tracker based on the chroma UV planes of the preview frames, generating the color histogram from the detected face rectangles. There is an implementation of CAMShift in OpenCV, but it's also pretty simple to roll your own.
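For reference, here is a sketch of CAMShift tracking using OpenCV's Java API. It uses the classic hue-histogram variant rather than the UV-plane approach described above, but the structure (histogram from the detected rectangle, back-projection, CamShift per frame) is the same:

    import java.util.Arrays;
    import org.opencv.core.Core;
    import org.opencv.core.Mat;
    import org.opencv.core.MatOfFloat;
    import org.opencv.core.MatOfInt;
    import org.opencv.core.Rect;
    import org.opencv.core.RotatedRect;
    import org.opencv.core.TermCriteria;
    import org.opencv.imgproc.Imgproc;
    import org.opencv.video.Video;

    public class CamShiftTracker {
        private final Mat hist = new Mat();
        private Rect window;

        // Build a hue histogram from the detected face rectangle.
        public void init(Mat bgrFrame, Rect faceRect) {
            window = faceRect.clone();
            Mat hsv = new Mat();
            Imgproc.cvtColor(bgrFrame, hsv, Imgproc.COLOR_BGR2HSV);
            Mat roi = new Mat(hsv, faceRect);
            Imgproc.calcHist(Arrays.asList(roi), new MatOfInt(0), new Mat(),
                    hist, new MatOfInt(32), new MatOfFloat(0, 180));
            Core.normalize(hist, hist, 0, 255, Core.NORM_MINMAX);
        }

        // Back-project the histogram and let CamShift shift the window.
        public RotatedRect track(Mat bgrFrame) {
            Mat hsv = new Mat(), backProj = new Mat();
            Imgproc.cvtColor(bgrFrame, hsv, Imgproc.COLOR_BGR2HSV);
            Imgproc.calcBackProject(Arrays.asList(hsv), new MatOfInt(0),
                    hist, backProj, new MatOfFloat(0, 180), 1.0);
            TermCriteria criteria =
                    new TermCriteria(TermCriteria.EPS | TermCriteria.COUNT, 10, 1);
            return Video.CamShift(backProj, window, criteria);  // updates window in place
        }
    }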
Hope this helps; it's mostly a matter of picking the low-hanging fruit for optimization first and just keeping going until you're happy it's fast enough. On a Galaxy Note 5, Dlib does face + feature detection in about 100ms, which might be good enough for your purposes even without all this extra complication.
Dlib is fast enough for most cases. Most of the processing time is spent detecting the face region in the image, and that is slow because modern smartphones produce high-resolution images (10MP+).
Yes, face detection can take 2+ seconds on a 3-5MP image, but it tries to find very small faces, down to 80x80 pixels in size. I am quite sure you don't need such small faces in high-resolution images, so the main optimization here is to reduce the size of the image before finding faces.
After the face region is found, the next step, face landmark detection, is extremely fast and takes < 3 ms per face; this time does not depend on resolution.
The dlib-android port is currently not using dlib's detector the right way. Here is a list of recommendations for making the dlib-android port work much faster:
https://github.com/tzutalin/dlib-android/issues/15
It's very simple and you can implement it yourself. I would expect a performance gain of about 2x-20x.
Apart from OpenCV and Google Vision, there are widely available web services like Microsoft Cognitive Services. The advantage is that they are completely platform-independent, which you've listed as a major design goal. I haven't personally used them in an implementation yet, but based on playing with their demos for a while, they seem quite powerful; they're pretty accurate and can offer quite a few details depending on what you want to know. (There are similar solutions available from other vendors as well, by the way.)
The two major potential downsides to something like that are the potential for added network traffic and API pricing (depending on how heavily you'll be using them).
Pricing-wise, Microsoft currently offers up to 5,000 transactions a month for free, with added transactions beyond that costing some fraction of a penny (depending on traffic, you can actually get a discount for high volume), but if you're doing, for example, millions of transactions per month, the fees can start adding up surprisingly quickly. This is actually a fairly typical pricing model; before you select a vendor or implement this kind of solution, make sure you understand how they're going to charge you, how much you're likely to end up paying, and how much you could be paying if you scale your user base. Depending on your traffic and business model, it could be either very reasonable or cost-prohibitive.
The added network traffic may or may not be a problem, depending on how your app is written and how much data you're sending. If you can do the processing asynchronously and be guaranteed reasonably fast Wi-Fi access, that obviously wouldn't be a problem, but unfortunately you may or may not have that luxury.
I am currently working with the Google Vision API, and it seems to be able to detect landmarks out of the box. Check out the FaceTracker here:
google face tracker
This solution should detect the face, happiness, and the left and right eye as is. For other landmarks, you can call getLandmarks on a Face, and according to their documentation (Face reference) it should return everything you need, though I have not tried it.

Object detection in android using ORB

I'm trying to detect certain objects with the camera of an Android device. I tried the OpenCV people detection sample using the HOG descriptor, but that seems to be pretty slow. Then I tried Haar cascades, which gave a better average frame rate of 10 fps and better accuracy as well.
After reading a bit more, another viable option seems to be the ORB feature detector. From what I understand, I need to save the feature vectors of the images against which I want to compare the current frame. So:
What would be the best way to store these vectors on an Android device?
How big a database of images would I need for decent accuracy (suppose I use it for people detection), and what would be the overhead of comparing against a large database?
Also, what limitations does ORB have regarding color differences of objects and distance from the camera?

To use the JNI, or not to use the JNI (Android performance)

I just added some computationally expensive code to an Android game I am developing. The code in question is a collection of collision detection routines that get called very often (every iteration of the game-loop) and are doing a large amount of computation. I feel my collision detection implementation is fairly well developed, and as reasonably fast as I can make it in Java.
I've been using Traceview to profile the code, and this new piece of collision detection code has somewhat unsurprisingly doubled the duration of my game logic. That's obviously a concern since for certain devices, this performance hit could take my game from a playable to an unplayable state.
I have been considering different ways to optimize this code, and I am wondering: if I move the code into C++ and access it with the JNI, will I get some noticeable performance savings?
The above question is my main concern and my reason for asking. I've determined that the two following reasons would be other positive results from using the JNI. However, they are not enough on their own to persuade me to port my code to C++.
This would make the code cleaner. Since most of the collision detection is some sort of vector math, it is much cleaner to be able to use overloaded operators rather than using some more verbose vector classes in Java.
Memory management would be simpler. Simpler, you say? Well, this is a game, so the garbage collector running is not welcome: the GC can end up ruining the performance of your game if it constantly has to interrupt to clean up. In C++ I don't have to worry about the garbage collector, so I can avoid all the ugly things I do in Java with temporary static variables and just rely on the good old stack memory of C++.
Long-winded as this question may be, I think I covered all my points. Given this information, would it be worth porting my code from Java to C++ and accessing it with the JNI (for reasons of improving performance)? Also, is there a way to measure or estimate a potential performance gain?
EDIT:
So I did it. The results? Well, from TraceView's perspective, it was a 6x increase in the speed of my collision detection routine.
It wasn't easy getting there, though. Besides having to do the JNI dance, I also had to make some optimizations that I did not expect. Mainly, using a directly allocated float buffer to pass data from Java to native code. My initial attempt just used a float array to hold the data, because the conversion from Java to C++ was more natural, but that was really, really slow. The direct buffer completely side-stepped the performance issues with array copying between Java and native code, and left me with the 6x bump.
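A minimal sketch of the direct-buffer approach described above; the library name, native method, and data layout are illustrative, not the actual game's code:

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;
    import java.nio.FloatBuffer;

    public class CollisionBridge {
        static { System.loadLibrary("collision"); }  // hypothetical native library

        // Direct buffers live outside the Java heap, so the native side can
        // read them via env->GetDirectBufferAddress without any copying.
        private final FloatBuffer positions = ByteBuffer
                .allocateDirect(1024 * 4)              // 1024 floats, 4 bytes each
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();

        // Implemented in C++; obtains the raw pointer from the buffer and
        // runs the collision routines on it.
        private native int detectCollisions(FloatBuffer buffer, int count);

        public int update(float[] entityPositions) {
            positions.clear();
            positions.put(entityPositions);  // one bulk copy into the direct buffer
            return detectCollisions(positions, entityPositions.length);
        }
    }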
Also, instead of rolling my own vector class, I just used the Eigen math library. I'm not sure how much of an effect this had on performance, but at the least, it saved me the time of developing my own (less efficient) vector class.
Another lesson learned is that excessive logging is bad for performance (in case that isn't obvious).
Not really a direct answer to your question, but the following links might be of use to you:
Android Developers, JNI Tips.
Android Developers, Designing for Performance
In the second link the following is written:
Native code isn't necessarily more efficient than Java. For one thing,
there's a cost associated with the Java-native transition, and the JIT
can't optimize across these boundaries. If you're allocating native
resources (memory on the native heap, file descriptors, or whatever),
it can be significantly more difficult to arrange timely collection of
these resources. You also need to compile your code for each
architecture you wish to run on (rather than rely on it having a JIT).
You may even have to compile multiple versions for what you consider
the same architecture: native code compiled for the ARM processor in
the G1 can't take full advantage of the ARM in the Nexus One, and code
compiled for the ARM in the Nexus One won't run on the ARM in the G1.
Native code is primarily useful when you have an existing native
codebase that you want to port to Android, not for "speeding up" parts
of a Java app.
If you are still at a fairly early stage of game development, you can consider using a game engine that provides a good collision detection mechanism, like libGDX, which does a fairly good job with Box2D collision detection.
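For a sense of what that looks like, here is a hedged minimal sketch of Box2D collision handling in libGDX; the world configuration and listener contents are placeholders:

    import com.badlogic.gdx.math.Vector2;
    import com.badlogic.gdx.physics.box2d.Contact;
    import com.badlogic.gdx.physics.box2d.ContactImpulse;
    import com.badlogic.gdx.physics.box2d.ContactListener;
    import com.badlogic.gdx.physics.box2d.Manifold;
    import com.badlogic.gdx.physics.box2d.World;

    public class PhysicsWorld {
        private final World world = new World(new Vector2(0, -9.8f), true);

        public PhysicsWorld() {
            // Box2D reports collisions through a contact listener,
            // so you never write the broad-phase checks yourself.
            world.setContactListener(new ContactListener() {
                @Override public void beginContact(Contact contact) {
                    // react to the collision, e.g. damage, sound, score
                }
                @Override public void endContact(Contact contact) {}
                @Override public void preSolve(Contact contact, Manifold oldManifold) {}
                @Override public void postSolve(Contact contact, ContactImpulse impulse) {}
            });
        }

        public void step(float deltaSeconds) {
            // 6 velocity and 2 position iterations are the common defaults.
            world.step(deltaSeconds, 6, 2);
        }
    }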

What android devices don't support draw_texture?

I am developing an Android app using OpenGL ES for drawing, and I use the draw_texture extension, as it's the fastest.
I've read that you have to query the extension string to check whether the drawing method is supported on the phone, and degrade gracefully if not. My main concern is: how common is it really to have a device which doesn't support this?
I mean, drawing textured quads (the only method that is standard in OpenGL ES) is so slow that the game would hardly be enjoyable on those devices.
I'm just curious if it's worth the time to support these devices.
I don't know of an example of an Android device lacking the draw_texture extension, but it is quite possible that such devices exist in small numbers. It's definitely not worth dedicating effort to supporting them, but on the other hand, it is nearly trivial to switch between drawTex and quads, especially if your code supports rotated sprites.
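A minimal sketch of the query-and-fall-back pattern using Android's GL10 / GL11Ext bindings (the quad fallback is stubbed out):

    import javax.microedition.khronos.opengles.GL10;
    import javax.microedition.khronos.opengles.GL11;
    import javax.microedition.khronos.opengles.GL11Ext;

    public class SpriteDrawer {
        private boolean hasDrawTex;

        public void checkExtensions(GL10 gl) {
            String extensions = gl.glGetString(GL10.GL_EXTENSIONS);
            hasDrawTex = extensions != null
                    && extensions.contains("GL_OES_draw_texture");
        }

        public void draw(GL10 gl, int texW, int texH, float x, float y) {
            if (hasDrawTex) {
                // Set the crop rectangle, then blit directly to the screen.
                int[] crop = {0, texH, texW, -texH};  // y-flipped crop
                ((GL11) gl).glTexParameteriv(GL10.GL_TEXTURE_2D,
                        GL11Ext.GL_TEXTURE_CROP_RECT_OES, crop, 0);
                ((GL11Ext) gl).glDrawTexfOES(x, y, 0, texW, texH);
            } else {
                // Fall back to a textured quad (vertex array + glDrawArrays).
                drawQuad(gl, texW, texH, x, y);
            }
        }

        private void drawQuad(GL10 gl, int w, int h, float x, float y) {
            // Standard textured-quad path, omitted for brevity.
        }
    }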
