How can I do Holistic tracking on MLKit

I was wondering if there are any plans to have the Holistic Detection (face + pose + hand tracking) implemented in MLKit, or if there is an easy and efficient way to add the face and the hand detection to the pose detection results.

We are looking at this indeed. However, since it is a very compute intensive set of tasks, we want to make sure it can run on a wide range of devices, rather than just flagship phones. If you are looking for this functionality today, I would go with the MediaPipe solution like you reference as well.

Related

How to record video with AR effects and save it?

I am trying to create an application like Snapchat that applies face filters while recording the video and saves it with the filter on.
I know there are packages like ARCore and flutter_camera_ml_vision but these are not helping me.
I want to provide face filters and apply them at the time of video recording on the face, and also save the video with the filter on the face.

Not The Easiest Question To Answer, But...
I'll give it a go, let's see how things turn out.
First of all, you should fill in some more details about the statements given in the question, especially what you're trying to say here:
I know there are packages like ARCore and flutter_camera_ml_vision but these are not helping me.
How did you approach the problem and what makes you say that it didn't help you?
In the Beginning...
First of all, let's get some needed basics out of the way to better understand your current situation and level in the prerequisite areas of knowledge:
Do you have any experience using Computer Vision & Machine Learning frameworks in other languages / in other apps?
Do you have the required math skills needed to use this technology?
As you're using Flutter, my guess is that cross-platform compatibility is high priority, have you done much Flutter programming before and what devices are your main targets?
So, What is required for creating a Snapchat-like filter for use in live video recording?
Well, quite a lot of work happens behind the scenes when you apply a filter to live video using any app that implements this in a decent way.
Snapchat uses in-house software built up over years, combining its own efforts with technology from several multi-million dollar acquisitions, often of established companies that specialized in Computer Vision and AR. That software has grown steadily and has become quite impressive over the last 5-6 years in particular.
This isn't something you can throw together by yourself in an all-nighter and expect good results. There are tools available that ease the general learning curve, but they also require a firm understanding of the underlying concepts and technologies, and quite a lot of math.
The Technical Detour
OK, I know I may have gone a bit overboard here, but these are fundamental building blocks, and not many people are aware of the amount of computation needed for seemingly "basic" functionality. So please, TL;DR or not, this is fundamental stuff.
To create a good filter for live capture using a camera on something like an iPhone or Android device, you could, and most probably would, use AR as you mentioned you wanted to use in the end, but realize that this is a sub-set of the broad field of Computer Vision (CV) that uses various algorithms from Artificial Intelligence (AI) and Machine Learning (ML) for the main tasks of:
Face Detection
Given frames of video from the live camera, find the area containing a human face (some implementations also work with animals, but let's keep it as simple as possible) and output a bounding rectangle (x, y, width, height) suitable for use as a starting point.
The analysis phase alone will require a rather complex combination of algorithms / techniques from different parts of the AI universe, and this being video, not a single static image file, this must be continuously updated as the person / camera moves, so it must be done in close to real-time, in the millisecond range.
I believe different implementations combining HOG (Histogram of Oriented Gradients) from Computer Vision and SVMs (Support Vector Machines / Networks) from Machine Learning are still pretty common.
Detection of Facial Landmarks
This is what will define how well a certain effect / filter will adapt to different types of facial features and detect accessories like glasses, hats etc. Also called "facial keypoint detection", "facial feature detection" and other variants in different literature on the subject.
Head Pose Estimation
Once you know a few landmark points, you can also estimate the pose of the head.
This is an important part of effects like "face swap" to correctly re-align one face with another in an acceptable manner. A toolkit like OpenFace (Uses Python, OpenCV, OpenBLAS, Dlib ++) contains a lot of useful functionality, capable of facial landmark detection, head pose estimation, facial action unit recognition, and eye-gaze estimation, delivering pretty decent results.
The Compositing of Effects into the Video Frames
After the work above is done, the rest involves applying the target filter (dog ears, rabbit teeth, whatever) to the video frames using compositing techniques.
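To make the compositing step a little more concrete, here is a minimal sketch of the standard "over" alpha blend in plain Python. It assumes frames are lists of rows of (R, G, B) tuples and the filter image is a list of rows of (R, G, B, A) tuples; the names `blend_pixel` and `composite` are made up for illustration, and a real app would do this on the GPU rather than pixel by pixel.

```python
def blend_pixel(frame_rgb, overlay_rgba):
    """Composite one RGBA overlay pixel over one opaque RGB frame pixel."""
    r, g, b, a = overlay_rgba
    alpha = a / 255.0
    return tuple(
        round(alpha * o + (1.0 - alpha) * f)
        for o, f in zip((r, g, b), frame_rgb)
    )


def composite(frame, overlay, top, left):
    """Paste `overlay` onto `frame` in place with its top-left corner at
    (left, top). Overlay pixels that fall outside the frame are skipped."""
    for oy, row in enumerate(overlay):
        fy = top + oy
        if not 0 <= fy < len(frame):
            continue
        for ox, pixel in enumerate(row):
            fx = left + ox
            if 0 <= fx < len(frame[fy]):
                frame[fy][fx] = blend_pixel(frame[fy][fx], pixel)
    return frame
```

A fully transparent overlay pixel (A = 0) leaves the frame untouched and a fully opaque one (A = 255) replaces it, which is what lets a filter image with soft edges sit on top of the video.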
As this answer is starting to look more like an article, I'll leave it to you to go figure out if you want to know more of the details in this part of the process.
Hey, Dude. I asked for AR in Flutter, remember?
Yep.
I know, I can get a bit carried away.
Well, my point is that it takes a lot more than one would usually imagine to create something like what you ask for.
BUT.
My best advice if Flutter is your tool of choice would be to learn how to use the Cloud-Based ML services from Google's Firebase suite of tools, Firebase Machine Learning and Google's MLKit.
Add to this some AR-specific plugins, like the ARCore Plugin, and I'm sure you'll be able to get the pieces together if you have the right background and attitude, plus a good amount of aptitude for learning.
Hope this wasn't digressing too far from your core question, but there are no shortcuts that I know of that cut more corners than what I've already mentioned.

You could absolutely use the flutter_camera_ml_vision plugin and its face detection, which will give you positions for the landmarks of a face, such as the nose, eyes, etc. Then simply stack the CameraPreview with a CustomPaint widget whose foregroundPainter draws your filters, using the landmarks as coordinates so that glasses, beards or whatever you want land at the correct position on the face in the camera preview.
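The landmark-to-filter math is language-neutral, so here is a rough sketch of it in Python rather than Dart. It assumes you already have the two eye landmarks in preview coordinates (as they appear in the image, left to right); `glasses_transform` and `width_factor` are hypothetical names, and the factor of 2.0 is just a guess you would tune per asset. The result is what a painter would need to translate, scale and rotate a glasses image.

```python
import math


def glasses_transform(left_eye, right_eye, width_factor=2.0):
    """Return where and how to draw a glasses overlay.

    left_eye / right_eye: (x, y) landmark positions in preview pixels.
    width_factor: assumed ratio of overlay width to eye distance.
    """
    lx, ly = left_eye
    rx, ry = right_eye
    dx, dy = rx - lx, ry - ly
    eye_distance = math.hypot(dx, dy)
    return {
        # Draw the overlay centred between the eyes.
        "center": ((lx + rx) / 2.0, (ly + ry) / 2.0),
        # Scale the overlay with the apparent size of the face.
        "width": eye_distance * width_factor,
        # In-plane head roll, so the glasses tilt with the face.
        "angle_deg": math.degrees(math.atan2(dy, dx)),
    }
```

For example, eyes at (100, 200) and (200, 200) give a centre of (150, 200), a width of 200 and no rotation. This only handles in-plane roll; turning the head sideways needs a proper head pose estimate.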
Google ML Kit also has face detection that produces landmarks, and you could write your own Flutter plugin for that.
You can capture frames from the live camera preview, reformat them, and then pass them as a byte buffer to ML Kit or ML Vision. I am currently writing a Flutter plugin for ML Kit pose detection with live capture, so if you have any specific questions about that, let me know.
You will then have to merge the two surfaces and save the result to a file in an appropriate format. This is unknown territory for me, so I cannot provide any details about this part.

Check if user is looking at screen from code

In newer Android devices there's the possibility to unlock the phone with your face. It will also be possible with the iPhone X.
Is there a way of using these sensors/camera to check if the user is looking at the screen?
Edit:
I found that there's also a Vision Framework from Google: Vision Framework

Yes and no.
The built-in Face ID feature on iPhone X can unlock the device and authorize other built-in features (Apple Pay, iTunes/App Store payment, etc). You can also use it as a method of authorization in your app — the same LocalAuthentication framework calls that you use to support Touch ID on other devices automatically use Face ID instead on iPhone X.
Face ID, by default, requires the user to be looking at the screen. Thus, if your use case for attention detection has to do with authorization or unlocking, you can use LocalAuthentication to do it. (However, the user can disable attention detection in Accessibility settings, reducing the security but increasing the usability of Face ID. Third-party apps can't control or even read this setting.)
If you're talking about more directly doing attention detection or gaze tracking... Apple doesn't provide any API that exposes the inner workings of Face ID, or at least the gaze tracking part. Here's what they do have:
ARKit offers ARFaceTrackingConfiguration (see also sample code), which provides a detailed 3D model of the face in real time (supposedly using some of the same Neural Engine stuff as Face ID for detail and performance).
But as far as ARKit is concerned, eyes are just two holes in the face — there's no gaze tracking.
Apple's Vision framework offers face detection and face landmark recognition (that is, it locates eyes, nose, mouth, etc). Vision does identify the eye outline and the pupil, which you could theoretically use as a basis for gaze tracking.
However, since Vision offers such data only in 2D and doesn't get a 3D pose for the face, you're still left with a hefty computer vision problem if you want to build gaze tracking yourself. Vision processes 2D images, which means that it doesn't require iPhone X (but also means that it doesn't benefit from the TrueDepth camera on iPhone X either).
AVCapture offers access to the TrueDepth camera, so you can get the same color + depth imagery that Face ID and ARKit use to do their magic. (You just don't get said magic for yourself.)
None of this is to say that gaze tracking isn't possible on iOS in general or iPhone X specifically — all the building blocks are there, so given enough R&D effort you can implement it yourself. But Apple doesn't provide any developer access to the built-in gaze tracking mechanism.

Yes, in iOS 11 developers can use this feature in their third-party applications too, through the latest iOS Vision framework.

The whole idea behind this feature is to use the front camera with facial recognition.
But you have to optimise when you capture images for processing.
Tips
Capture when the application becomes active or comes to the foreground.
Capture when the user interacts with any UI control or widget (buttons, tables, touch events, etc.).
Make sure to stop or pause processing when the application is not active.
You can also use the gyroscope and other sensors to determine the device's physical state.

If you are open to bulking up your app with an ML model, Google's MediaPipe is another option. You can even track the user's iris this way:
https://google.github.io/mediapipe/solutions/iris
Obviously this is overkill for simple eye detection, but you should be able to do much more with these models and this framework.

Detecting face landmarks points in android

I am developing an app in which I need to get face landmark points from the camera, like a mirror cam or makeup cam. I want it to be available for iOS too. Please guide me to a robust solution.
I have used Dlib and Luxand.
DLIB: https://github.com/tzutalin/dlib-android-app
Luxand: http://www.luxand.com/facesdk/download/
Dlib is slow, with a lag of approximately 2 seconds (please look at the demo video on the git page), and Luxand is OK but it's paid. My priority is to use an open source solution.
I have also used Google Vision, but it does not offer many face landmark points.
So please give me a way to make Dlib work fast, or any other option, keeping cross-platform support a priority.
Thanks in advance.

You can make Dlib detect face landmarks in real-time on Android (20-30 fps) if you take a few shortcuts. It's an awesome library.
Initialization
Firstly you should follow all the recommendations in Evgeniy's answer; in particular, make sure that you only initialize the frontal_face_detector and shape_predictor objects once instead of every frame. The frontal_face_detector will initialize faster if you deserialize it from a file instead of using the get_serialized_frontal_faces() function.
The shape_predictor needs to be initialized from a 100 MB file, which takes several seconds. The serialize and deserialize functions are written to be cross-platform and perform validation on the data, which is robust but makes them quite slow. If you are prepared to make assumptions about endianness you can write your own deserialization function that will be much faster.
The file is mostly made up of matrices of 136 floating point values (about 120000 of them, meaning 16320000 floats in total). If you quantize these floats down to 8 or 16 bits you can make big space savings (e.g. you can store the min value and (max-min)/255 as floats for each matrix and quantize each separately). This reduces the file size to about 18 MB, and it loads in a few hundred milliseconds instead of several seconds. The decrease in quality from using quantized values seems negligible to me, but YMMV.
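The per-matrix 8-bit quantization described above can be sketched roughly as follows. This is plain Python on lists of floats, not dlib's own file format or API; the function names are made up, and the size estimate assumes 32-bit floats for both the original values and the stored min/step pair.

```python
def quantize_matrix(values):
    """Quantize one matrix (flat list of floats) to 8 bits per value.

    Returns (lo, step, codes), where lo is the min value, step is
    (max - min) / 255 and codes holds one byte per value.
    """
    lo = min(values)
    hi = max(values)
    step = (hi - lo) / 255.0
    if step == 0.0:
        # Constant matrix: every code is 0 and dequantizes back to lo.
        codes = bytes(len(values))
    else:
        codes = bytes(
            min(255, max(0, round((v - lo) / step))) for v in values
        )
    return lo, step, codes


def dequantize_matrix(lo, step, codes):
    """Rebuild approximate floats from the stored min, step and codes."""
    return [lo + step * c for c in codes]


def quantized_size(num_matrices, values_per_matrix):
    """Bytes needed: one byte per value plus two 4-byte floats per matrix."""
    return num_matrices * (values_per_matrix + 8)
```

With the figures from the answer, 120000 matrices of 136 values come to 120000 * 136 * 4 = 65,280,000 bytes as 32-bit floats and 120000 * (136 + 8) = 17,280,000 bytes quantized, which is in line with the roughly 18 MB quoted. The worst-case error per value is half a step, i.e. (max - min) / 510 for that matrix.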
Face Detection
You can scale the camera frames down to something small like 240x160 (or whatever, keeping aspect ratio correct) for faster face detection. It means you can't detect smaller faces, but that might not be a problem depending on your app. Another, more complex approach is to adaptively crop and resize the region you use for face detection: initially check for all faces in a higher-res image (e.g. 480x320), and then crop the area +/- one face width around the previous location, scaling down if need be. If you fail to detect a face in one frame, revert to searching the entire frame on the next one.
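A rough sketch of the adaptive crop logic, assuming rectangles are (x, y, width, height) in frame pixels. The function names and the `margin_faces` parameter are invented for illustration; the actual cropping and resizing of pixel data is left to whatever image library you use.

```python
def search_region(prev_face, frame_size, margin_faces=1.0):
    """Region to search next: the previous face rectangle grown by
    `margin_faces` face widths on every side, clamped to the frame."""
    x, y, w, h = prev_face
    frame_w, frame_h = frame_size
    margin = w * margin_faces
    left = max(0, int(x - margin))
    top = max(0, int(y - margin))
    right = min(frame_w, int(x + w + margin))
    bottom = min(frame_h, int(y + h + margin))
    return left, top, right - left, bottom - top


def next_region(prev_face, frame_size):
    """Fall back to the whole frame when the last detection failed."""
    if prev_face is None:
        return 0, 0, frame_size[0], frame_size[1]
    return search_region(prev_face, frame_size)


def downscale_factor(region_size, max_size):
    """Scale factor (at most 1.0) that fits the region inside max_size
    while keeping its aspect ratio."""
    region_w, region_h = region_size
    max_w, max_h = max_size
    return min(1.0, max_w / region_w, max_h / region_h)
```

For an 80x80 face at (200, 100) in a 480x320 frame, `search_region` gives (120, 20, 240, 240). Remember to add the crop offset back (and divide by the scale factor) when mapping a detection in the cropped image to frame coordinates.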
Face Tracking
For faster face tracking, you can run face detections continuously in one thread, and then in another thread track the detected face(s) and perform face feature detections using the tracked rectangles. In my testing I found that face detection took between 100 and 400ms depending on what phone I used (at about 240x160), and I could do 7 or 8 face feature detections on the intermediate frames in that time. This can get a bit tricky if the face is moving a lot, because when you get a new face detection (which will be from 400ms ago), you have to decide whether to keep tracking from the newly detected location or from the tracked location of the previous detection.
Dlib includes a correlation_tracker; unfortunately, I wasn't able to get it to run faster than about 250ms per frame, and scaling down the resolution (even drastically) didn't make much of a difference. Tinkering with internal parameters increased the speed but gave poor tracking. I ended up using a CAMShift tracker based on the chroma UV planes of the preview frames, generating the color histogram from the detected face rectangles. There is an implementation of CAMShift in OpenCV, but it's also pretty simple to roll your own.
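To give an idea of what "rolling your own" could look like, here is a minimal sketch of the fixed-window mean shift step that CAMShift is built on, driven by a chroma histogram. It is not the tracker from this answer: it assumes the U and V planes are plain 2D lists of 0-255 values at the same resolution, the function names and the 16-bin choice are made up, and the window size and orientation adaptation that makes it CAMShift is left out.

```python
import math


def chroma_histogram(u_plane, v_plane, rect, bins=16):
    """Count (U, V) bin occurrences inside the detected face rectangle."""
    x, y, w, h = rect
    hist = {}
    for yy in range(y, y + h):
        for xx in range(x, x + w):
            key = (u_plane[yy][xx] * bins // 256, v_plane[yy][xx] * bins // 256)
            hist[key] = hist.get(key, 0) + 1
    return hist


def back_project(u_plane, v_plane, hist, bins=16):
    """Per-pixel weight: how common that pixel's chroma was in the face."""
    return [
        [hist.get((u * bins // 256, v * bins // 256), 0) for u, v in zip(u_row, v_row)]
        for u_row, v_row in zip(u_plane, v_plane)
    ]


def mean_shift(weights, window, max_iter=10):
    """Move a fixed-size window towards the weighted centroid beneath it."""
    rows, cols = len(weights), len(weights[0])
    x, y, w, h = window
    for _ in range(max_iter):
        total = sum_x = sum_y = 0.0
        for yy in range(max(0, y), min(rows, y + h)):
            for xx in range(max(0, x), min(cols, x + w)):
                wt = weights[yy][xx]
                total += wt
                sum_x += wt * xx
                sum_y += wt * yy
        if total == 0:
            break  # nothing face-coloured under the window: target lost
        # Re-centre the window on the centroid (round half up).
        new_x = math.floor(sum_x / total - (w - 1) / 2.0 + 0.5)
        new_y = math.floor(sum_y / total - (h - 1) / 2.0 + 0.5)
        if (new_x, new_y) == (x, y):
            break  # converged
        x, y = new_x, new_y
    return x, y, w, h
```

The idea would be to build the histogram once per fresh detection, then on each intermediate frame back-project and run `mean_shift` starting from the previous window. Because the chroma planes of a typical YUV preview are subsampled, this touches far fewer pixels than the full-resolution luma plane.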
Hope this helps. It's mostly a matter of picking the low-hanging fruit for optimization first and then continuing until you're happy it's fast enough. On a Galaxy Note 5 Dlib does face+feature detections at about 100ms, which might be good enough for your purposes even without all this extra complication.

Dlib is fast enough for most cases. Most of the processing time goes into detecting the face region in the image, and it is slow because modern smartphones produce high-resolution images (10MP+).
Yes, face detection can take 2+ seconds on a 3-5MP image, but that is because it tries to find very small faces, down to 80x80 pixels. I am quite sure that you don't need such small faces in high-resolution images, so the main optimization here is to reduce the size of the image before finding faces.
After the face region is found, the next step, face landmark detection, is extremely fast and takes < 3 ms for one face; this time does not depend on resolution.
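As a back-of-the-envelope sketch of how far you can shrink the image, assuming the detector's minimum face size is the 80 pixels mentioned above and that you know the smallest face (in original pixels) your app cares about. The function names are hypothetical and the resize itself is left to your image library.

```python
MIN_DETECTABLE_FACE = 80  # smallest face size the detector looks for, in pixels


def detection_scale(smallest_face_px, detector_min_face=MIN_DETECTABLE_FACE):
    """Largest downscale that still leaves the smallest face of interest
    at or above the detector's minimum size. Never upscales."""
    return min(1.0, detector_min_face / smallest_face_px)


def scaled_size(image_size, scale):
    """Image size after applying the scale factor."""
    width, height = image_size
    return max(1, round(width * scale)), max(1, round(height * scale))


def to_original(rect, scale):
    """Map a rectangle found in the scaled image back to the original image."""
    return tuple(round(value / scale) for value in rect)
```

If the faces you care about are at least 320 pixels wide in a 2560x1920 frame, the scale is 0.25 and detection runs on 640x480, a sixteenth of the pixels. The detected rectangle is then mapped back with `to_original` before running landmark detection on the full-resolution image.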
The dlib-android port is not using dlib's detector the right way for now. Here is a list of recommendations on how to make the dlib-android port work much faster:
https://github.com/tzutalin/dlib-android/issues/15
It's very simple and you can implement it yourself. I expect a performance gain of about 2x-20x.

Apart from OpenCV and Google Vision, there are widely available web services like Microsoft Cognitive Services. The advantage is that it would be completely platform-independent, which you've listed as a major design goal. I haven't personally used them in an implementation yet, but based on playing with their demos for a while they seem quite powerful; they're pretty accurate and can offer quite a few details depending on what you want to know. (There are similar solutions available from other vendors as well, by the way.)
The two major potential downsides to something like that are the potential for added network traffic and API pricing (depending on how heavily you'll be using them).
Pricing-wise, Microsoft currently offers up to 5,000 transactions a month for free with added transactions beyond that being some fraction of a penny (depending on traffic, you can actually get a discount for high volume), but if you're doing, for example, millions of transactions per month the fees can start adding up surprisingly quickly. This is actually a fairly typical pricing model; before you select a vendor or implement this kind of a solution make sure you understand how they're going to charge you and how much you're likely to end up paying and how much you could be paying if you scale your user base. Depending on your traffic and business model it could be either very reasonable or cost-prohibitive.
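A quick way to sanity-check this kind of pricing is to put your expected traffic into a small calculation. The sketch below uses the 5,000 free transactions mentioned above, but the per-call price is a made-up placeholder, not Microsoft's actual rate, and real price lists usually have volume tiers that this ignores.

```python
def calls_per_month(users, calls_per_user_per_day, days=30):
    """Total API calls for a month, given average per-user usage."""
    return users * calls_per_user_per_day * days


def monthly_cost(transactions, free_quota=5000, price_per_call=0.001):
    """Estimated monthly bill. price_per_call is a hypothetical flat rate;
    substitute the vendor's real numbers."""
    billable = max(0, transactions - free_quota)
    return billable * price_per_call
```

For example, 1,000 users making 20 calls a day is 600,000 calls a month; at the placeholder rate that is about 595 per month in whatever currency the rate is quoted in, while 10 users at the same usage (6,000 calls) cost about 1.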
The added network traffic may or may not be a problem depending on how your app is written and how much data you're sending. If you can do the processing asynchronously and be guaranteed reasonably fast Wi-Fi access that obviously wouldn't be a problem but unfortunately you may or may not have that luxury.

I am currently working with the Google Vision API and it seems to be able to detect landmarks out of the box. Check out the FaceTracker here:
google face tracker
This solution should detect the face, happiness, and left and right eye as is. For other landmarks, you can call getLandmarks on a Face and it should return everything you need (though I have not tried it), according to their documentation: Face reference

AR complex CAD model tracking

I need some help with 3D-CAD model tracking.
I have to develop an Android application that tracks predefined CAD models such as parts of a car. For example, you stand in front of the car with the engine hood open and the application tells you what to do next (perhaps refill some fluid) with visual support like arrows or circles. The following YouTube video shows what I have in mind: https://www.youtube.com/watch?v=4LE_IocFnL0
I have tried this with the Metaio SDK, but when I convert the CAD models to edge and surface models with the MetaioCreator, no part of the model is recognizable. I think this is because my models are very detailed (~400,000 polygons each). For test purposes I also reduced the polygon count considerably (~7,000 polygons), but when I create the edge and surface models and load them in my test application, my test device (Samsung Galaxy Tab S) lags extremely and it is not possible to track the model.
So I would like to ask whether this is the right approach, because I don't think it is. Perhaps you could advise me on which tracking method I should use.
So far I have used the Metaio SDK hybrid 3D tracking, which is a mix of an edge-based and a feature-based tracking method. Is there another method that is better suited to my goal? I've read about OpenCV (which is available for Android too), but I don't know if it is a good fit for 3D CAD tracking. Does anyone have experience with this kind of augmented reality?
I have the following requirements:
- the framework / toolkit must run on Android
- the tracking should be independent of changing lighting conditions
- I have to track many different CAD models (the user selects the one which should be tracked)
- the user-selected CAD model can appear more than once in the current viewport, and every instance must be selectable for further rendering operations
- the performance must be good when running on a wearable device
In addition, when there is a group of switches to be tracked, is it possible to detect when the user has pressed the marked switch? When I know the exact relative positions of all my CAD models, is it possible to join them together? My intent is that a user tracks model A and, on selecting another trackable, the device knows its approximate position based on the position of model A and the relative offset to the new model.
Hope for responses,
lost1994
PS: If something is ambiguous or I didn't explain it clearly, please don't be afraid to ask.

I know my answer is maybe a bit late, but anyway:
In general, 3D edge-based tracking is a good choice here. You are using the hybrid version, which is good if your AR world won't change (meaning your car stays in a static position and won't be moved).
The reason for the lag is that you still have 7000 polygons. That's too much for mobile. Reduce it to 3000 or fewer (3000 is fine on an iPhone 6).
Note: Metaio has closed the doors (they have been bought by Apple).

Object detection in android using ORB

I'm trying to detect certain objects with the camera of an android device. I've tried the OpenCV people detection sample using HOG descriptor but that seems to be pretty slow. Then I tried using Haar Cascades which gave a better average frame rate of 10 fps and better accuracy as well.
After reading a bit more, another viable option seems to be the ORB feature detector. What I could understand is that I need to save the feature vectors of the images against which I want to compare the current frame. So,
What would be the best way to store these vectors on an android device
How big a database of images would I need for decent accuracy (suppose I use it for people detection) & what could be the overhead of comparing against a large database
Also, what limitations does ORB present regarding color differences of objects and distance from camera
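For context on the storage and comparison questions: ORB descriptors are binary strings (256 bits, i.e. 32 bytes each, by default) that are compared with Hamming distance. Here is a minimal sketch of storing and brute-force matching them in plain Python, assuming the descriptors have already been extracted elsewhere (e.g. by OpenCV). The function names and the `max_distance` threshold of 64 are arbitrary choices for illustration, not recommended values.

```python
DESCRIPTOR_BYTES = 32  # default ORB descriptor length: 256 bits


def pack(descriptors):
    """Concatenate descriptors into one blob suitable for a file or a
    database BLOB column."""
    return b"".join(descriptors)


def unpack(blob):
    """Split a blob back into fixed-size descriptors."""
    return [
        blob[i:i + DESCRIPTOR_BYTES]
        for i in range(0, len(blob), DESCRIPTOR_BYTES)
    ]


def hamming(a, b):
    """Number of differing bits between two equal-length descriptors."""
    return bin(int.from_bytes(a, "big") ^ int.from_bytes(b, "big")).count("1")


def match(query, train, max_distance=64):
    """Brute force: for each query descriptor find the nearest stored one.
    Returns (query_index, train_index, distance) for accepted matches."""
    matches = []
    if not train:
        return matches
    for qi, q in enumerate(query):
        ti, dist = min(
            ((i, hamming(q, t)) for i, t in enumerate(train)),
            key=lambda pair: pair[1],
        )
        if dist <= max_distance:
            matches.append((qi, ti, dist))
    return matches
```

At 32 bytes per descriptor, 500 keypoints per reference image take about 16 KB, so storage is rarely the bottleneck. The comparison cost is what grows: brute force is proportional to (descriptors per frame) x (descriptors in the database).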

Develop Reference

The Android operating system is a mobile operating system that was developed by Google (GOOGL) to be primarily used for touchscreen devices, cell phones, and tablets.