Does Tensorflow TFDetect demo track objects end-to-end? - android

This question is regarding the TFDetect demo, which is available as part of the Tensorflow Android Camera Demo. The description says:
Demonstrates a model based on Scalable Object Detection using Deep Neural Networks to localize and track people in the camera preview in real-time.
When I ran the demo, the app drew boxes around detected objects with a fractional number assigned to each object (I guess the confidence score). My question is: how is tracking being performed here? Is it multiple object tracking (described here), where an id is assigned to each track and the tracks are stored in memory, or is it just detection of objects across multiple frames to see how the object is moving?
Please correct me if I missed out on anything.

Two main things are going on here:
1: detection is being done in a background thread. This takes on the order of 100-1000ms depending on device, so it is not sufficient to maintain smooth tracking.
2: tracking is being done in the UI thread. This generally takes < 5ms per frame, and can be done on every frame once the position of objects is known. The tracker implements pyramidal Lucas-Kanade optical flow on the median movement of FAST features -- press the volume key and you'll see the individual features being tracked.
The tracker runs on every frame, storing optical flow keypoints at every timestamp. Thus, when a detection comes in, the tracker is able to figure out where it currently is by walking the position forward along the collected keypoint deltas. Some non-max suppression is also done by MultiBoxTracker.
Once an object is tracked by the tracker, no further input from the detector is required. The tracker will automatically drop the track when the normalized cross-correlation with the original detection drops below a certain threshold, or update the position/appearance when the detector finds a better match with significant overlap.
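Not the demo's actual classes, but a minimal Java sketch of the "walk the detection forward" idea described above (FrameDeltaBuffer and its fields are hypothetical names; the real MultiBoxTracker also tracks per-box appearance and correlation):

    import android.graphics.RectF;
    import java.util.ArrayDeque;
    import java.util.Deque;

    // Accumulates per-frame median optical-flow deltas so that a detection arriving
    // with an old timestamp can be "walked forward" to the present frame.
    class FrameDeltaBuffer {
        static class Delta {
            final long timestampNs;
            final float dx, dy;   // median keypoint motion for that frame
            Delta(long t, float dx, float dy) { this.timestampNs = t; this.dx = dx; this.dy = dy; }
        }

        private final Deque<Delta> deltas = new ArrayDeque<>();

        // Called on every camera frame from the tracking (UI) thread.
        void onFrame(long timestampNs, float medianDx, float medianDy) {
            deltas.addLast(new Delta(timestampNs, medianDx, medianDy));
            // Keep only a few seconds of history (detection latency is 100-1000 ms).
            while (!deltas.isEmpty() && timestampNs - deltas.peekFirst().timestampNs > 3_000_000_000L) {
                deltas.removeFirst();
            }
        }

        // Called when a (stale) detection arrives: shift its box by every delta
        // recorded since the frame the detector actually ran on.
        RectF walkForward(RectF detectedBox, long detectionTimestampNs) {
            RectF box = new RectF(detectedBox);
            for (Delta d : deltas) {
                if (d.timestampNs > detectionTimestampNs) {
                    box.offset(d.dx, d.dy);
                }
            }
            return box;
        }
    }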

Related

Run multiple image processors using Firebase MLKit

I'm trying to detect objects and text using Firebase MLKit on a live camera feed in Android. There are specific recognizers (FirebaseVisionTextRecognizer, FirebaseVisionObjectDetector) to process the image. If I use these recognizers one by one it works fine and I'm able to get the desired response.
However, I want to detect both the objects and the text simultaneously on the same camera feed, just like the Google Lens app. To achieve this, I first tried to run both recognizers together, but there is more latency (the time taken to process a specific frame) because both run sequentially, and hence only text detection was working but not object detection. That means there was no result from object detection.
Then I tried to run both recognizers in parallel; the latency decreased, but not enough for the detection API to return a response. When there is no text in the camera feed, object detection works well, but when there is text in the camera feed the latency increases and so there are no tracked objects.
Note: I checked the latency of the after-detection function call (the code which executes after detecting the object) and it doesn't take much time. The recognizers take more time to process the image in the case of parallel execution. I'm testing on a Samsung Galaxy S30s phone and I guess its processor is not that poor.
A few highlights from the code:
Using FirebaseVisionObjectDetectorOptions.STREAM_MODE, enableMultipleObjects=false and enableClassification=false for object detection
Using FirebaseVisionImageMetadata.IMAGE_FORMAT_NV21 format while building FirebaseVisionImageMetadata
As per best practices defined by Google, dropping the latest frames if the detection is in process
Using OnDeviceObjectDetector for the object detection
For text detection, I use OnDeviceTextRecognizer
I need help understanding how the Google Lens app runs multiple recognizers together when my application cannot. What can I do to enable multiple recognizers on the same camera frame?
For now, the way to run multiple detectors on the same image frame is to run them sequentially, because we internally run them in a single thread. We are actively adding support for running different detectors in parallel.
...because both run sequentially, and hence only text detection was working but not object detection.
The ObjectDetection feature with STREAM_MODE expects the latency between two image frames to be small, say < 300ms. If you run text recognition in between, the latency may be too long and the ObjectDetection feature cannot function properly. You may change STREAM_MODE to SINGLE_IMAGE_MODE to get results in your setting, but the latency would be higher.
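A minimal sketch of that sequential approach, assuming the legacy Firebase ML Kit Vision APIs the question mentions (FirebaseVisionObjectDetector, FirebaseVisionTextRecognizer); the SequentialRecognizers class and the frame-dropping flag are illustrative, not code from the answer:

    import com.google.firebase.ml.vision.FirebaseVision;
    import com.google.firebase.ml.vision.common.FirebaseVisionImage;
    import com.google.firebase.ml.vision.objects.FirebaseVisionObjectDetector;
    import com.google.firebase.ml.vision.objects.FirebaseVisionObjectDetectorOptions;
    import com.google.firebase.ml.vision.text.FirebaseVisionTextRecognizer;

    class SequentialRecognizers {
        private final FirebaseVisionObjectDetector objectDetector =
                FirebaseVision.getInstance().getOnDeviceObjectDetector(
                        new FirebaseVisionObjectDetectorOptions.Builder()
                                .setDetectorMode(FirebaseVisionObjectDetectorOptions.STREAM_MODE)
                                .build());
        private final FirebaseVisionTextRecognizer textRecognizer =
                FirebaseVision.getInstance().getOnDeviceTextRecognizer();

        private volatile boolean busy = false;

        // Drop frames while a previous frame is still being processed, then run the
        // two recognizers back to back on the frame that is accepted.
        void processFrame(FirebaseVisionImage image) {
            if (busy) return;       // drop this frame, as the best-practice docs suggest
            busy = true;
            objectDetector.processImage(image)
                    .continueWithTask(objectsTask -> {
                        // handle objectsTask.getResult() here (bounding boxes, tracking ids)
                        return textRecognizer.processImage(image);
                    })
                    .addOnCompleteListener(textTask -> {
                        // handle textTask.getResult() here (recognized text blocks)
                        busy = false;
                    });
        }
    }

Note that with both recognizers chained on one frame, the gap between accepted frames can easily exceed the ~300 ms that STREAM_MODE expects, which is exactly the trade-off the answer describes.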

Android MotionEvent: need some clarification on GetHistoricalX and GetHistoricalY

In my application, I need to draw strokes on the view with touches. I have to save the previous coordinates on touch down and on touch move (after the touch move is processed). When I looked at the API, there are GetHistoricalX and GetHistoricalY.
1) How do these historical data work? Will they ever be removed?
2) Will it start keeping all the historical data when the touch starts moving?
Since I'm using Xamarin.Forms, which also targets iOS: does iOS have the same thing as this?
Android:
On Android, GetHistoricalX|Y will contain the X/Y values that have not yet been reported since the last ACTION_MOVE event (i.e. these are batched into a single touch event for efficiency).
The coordinates are "historical" only insofar as they are older than the current coordinates in the batch; however, they are still distinct from any other coordinates reported in prior motion events. To process all coordinates in the batch in time order, first consume the historical coordinates then consume the current coordinates.
Note: Since there is no standard input sampling rate defined for /dev/input/event0, the rate is determined by the hardware developer and how their digitizer grid driver is written/configured. Android will then collect the samples available and offer them to the developer within the historical data along with the most current X/Y sample. If anyone knows how to get this frequency from the OS, I would love to know ;-)
You can use GetHistorySize to get the number of "points" available, process them first and then process the current X/Y, but remember these are only the locations since the last batched move event.
There is sample Java code under the Batching section at https://developer.android.com/reference/android/view/MotionEvent.html
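For reference, a minimal Java sketch of that batching pattern (the StrokeRecorder class and addStrokePoint helper are hypothetical stand-ins for your own stroke storage):

    import android.view.MotionEvent;

    class StrokeRecorder {
        // Hypothetical sink for stroke points; replace with your own storage.
        void addStrokePoint(int pointerId, float x, float y, long eventTimeMs) { /* ... */ }

        void onTouchMove(MotionEvent ev) {
            final int historySize = ev.getHistorySize();
            final int pointerCount = ev.getPointerCount();
            // Historical samples are older than the current sample and are in time order,
            // so consume them first ...
            for (int h = 0; h < historySize; h++) {
                for (int p = 0; p < pointerCount; p++) {
                    addStrokePoint(ev.getPointerId(p),
                            ev.getHistoricalX(p, h), ev.getHistoricalY(p, h),
                            ev.getHistoricalEventTime(h));
                }
            }
            // ... then consume the current (newest) coordinates.
            for (int p = 0; p < pointerCount; p++) {
                addStrokePoint(ev.getPointerId(p), ev.getX(p), ev.getY(p), ev.getEventTime());
            }
        }
    }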
iOS:
On iOS, the number of touch events reported is based on a 60 Hz sampling rate of the digitizer. Some iDevices have a faster frequency (newer iPads at 120 Hz and the iPad Pro at 240 Hz). These "extra" points are reported within the coalescedTouchesForTouch method (Xamarin.iOS = GetCoalescedTouches).
Note: iOS even has predictedTouchesForTouch (Xamarin.iOS = GetPredictedTouches) that might be available within the UIEvent. These can be used to "pre-draw" where the user might be moving to; Apple has dev code samples of this for the Apple Pencil, to prevent a visual "lag" from the tip of the pencil...
Net Result:
In the end, if you need to preserve a history of X/Y touch locations in order to replay them, you will need to store these yourself, as neither iOS nor Android is going to buffer them outside of a touch event.

Detecting face landmarks points in android

I am developing an app in which I need to get face landmark points on a live camera preview, like a mirror cam or makeup cam. I want it to be available for iOS too. Please guide me to a robust solution.
I have used Dlib and Luxand.
DLIB: https://github.com/tzutalin/dlib-android-app
Luxand: http://www.luxand.com/facesdk/download/
Dlib is slow, with a lag of approximately 2 seconds (please look at the demo video on the git page), and Luxand is OK but it's paid. My priority is to use an open-source solution.
I have also used Google Vision, but it does not offer many face landmark points.
So please give me a solution to make dlib work fast, or any other option, keeping cross-platform as a priority.
Thanks in advance.
You can make Dlib detect face landmarks in real-time on Android (20-30 fps) if you take a few shortcuts. It's an awesome library.
Initialization
Firstly you should follow all the recommendations in Evgeniy's answer, especially making sure that you only initialize the frontal_face_detector and shape_predictor objects once instead of every frame. The frontal_face_detector will initialize faster if you deserialize it from a file instead of using the get_serialized_frontal_faces() function.
The shape_predictor needs to be initialized from a ~100MB file and takes several seconds. The serialize and deserialize functions are written to be cross-platform and perform validation on the data, which is robust but makes them quite slow. If you are prepared to make assumptions about endianness, you can write your own deserialization function that will be much faster. The file is mostly made up of matrices of 136 floating point values (about 120,000 of them, meaning 16,320,000 floats in total). If you quantize these floats down to 8 or 16 bits you can make big space savings (e.g. you can store the min value and (max-min)/255 as floats for each matrix and quantize each matrix separately). This reduces the file size to about 18MB and it loads in a few hundred milliseconds instead of several seconds. The decrease in quality from using quantized values seems negligible to me, but YMMV.
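As an illustration of that quantization scheme (not dlib's code; a per-matrix min and scale are stored and each float is packed into one byte):

    // Quantize one matrix of floats to 8 bits: store min and scale, then one byte per value.
    class QuantizedMatrix {
        final float min;
        final float scale;      // (max - min) / 255
        final byte[] values;

        QuantizedMatrix(float[] data) {
            float lo = Float.POSITIVE_INFINITY, hi = Float.NEGATIVE_INFINITY;
            for (float v : data) {
                if (v < lo) lo = v;
                if (v > hi) hi = v;
            }
            min = lo;
            scale = (hi - lo) / 255f;
            values = new byte[data.length];
            for (int i = 0; i < data.length; i++) {
                values[i] = (byte) Math.round((data[i] - min) / (scale == 0f ? 1f : scale));
            }
        }

        // Dequantize back to floats when loading the model.
        float[] dequantize() {
            float[] out = new float[values.length];
            for (int i = 0; i < values.length; i++) {
                out[i] = min + (values[i] & 0xFF) * scale;
            }
            return out;
        }
    }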
Face Detection
You can scale the camera frames down to something small like 240x160 (or whatever, keeping aspect ratio correct) for faster face detection. It means you can't detect smaller faces but it might not be a problem depending on your app. Another more complex approach is to adaptively crop and resize the region you use for face detections: initially check for all faces in a higher res image (e.g. 480x320) and then crop the area +/- one face width around the previous location, scaling down if need be. If you fail to detect a face one frame then revert to detecting the entire region the next one.
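A small Java sketch of the downscaling step, assuming you have the frame as a Bitmap (the 240-pixel target is just the figure mentioned above):

    import android.graphics.Bitmap;

    final class DetectionScaling {
        // Scale a camera frame down to roughly targetWidth pixels wide, preserving
        // aspect ratio, before handing it to the face detector.
        static Bitmap downscaleForDetection(Bitmap frame, int targetWidth) {
            if (frame.getWidth() <= targetWidth) return frame;
            float ratio = (float) targetWidth / frame.getWidth();
            int targetHeight = Math.round(frame.getHeight() * ratio);
            return Bitmap.createScaledBitmap(frame, targetWidth, targetHeight, /* filter= */ true);
        }
    }
    // Usage: Bitmap small = DetectionScaling.downscaleForDetection(frame, 240);
    // Remember to scale any detected face rectangles back up by the inverse ratio.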
Face Tracking
For faster face tracking, you can run face detections continuously in one thread, and then in another thread track the detected face(s) and perform face feature detections using the tracked rectangles. In my testing I found that face detection took between 100 and 400 ms depending on what phone I used (at about 240x160), and I could do 7 or 8 face feature detections on the intermediate frames in that time. This can get a bit tricky if the face is moving a lot, because when you get a new face detection (which will be from 400 ms ago), you have to decide whether to keep tracking from the newly detected location or from the tracked location of the previous detection.
Dlib includes a correlation_tracker, but unfortunately I wasn't able to get it to run faster than about 250 ms per frame, and scaling down the resolution (even drastically) didn't make much of a difference. Tinkering with internal parameters increased speed but gave poor tracking. I ended up using a CAMShift tracker based on the chroma UV planes of the preview frames, generating the color histogram from the detected face rectangles. There is an implementation of CAMShift in OpenCV, but it's also pretty simple to roll your own.
Hope this helps; it's mostly a matter of picking the low-hanging fruit for optimization first and then keeping going until you're happy it's fast enough. On a Galaxy Note 5, Dlib does face+feature detection at about 100 ms, which might be good enough for your purposes even without all this extra complication.
Dlib is fast enough for most cases. Most of the processing time is taken detecting the face region in the image, and it is slow because modern smartphones produce high-resolution images (10MP+).
Yes, face detection can take 2+ seconds on a 3-5MP image, but it tries to find very small faces of 80x80 pixels. I am really sure that you don't need such small faces in high-resolution images, and the main optimization here is to reduce the size of the image before finding faces.
After the face region is found, the next step, face landmark detection, is extremely fast and takes < 3 ms for one face; this time does not depend on resolution.
The dlib-android port is not using dlib's detector the right way for now. Here is a list of recommendations on how to make the dlib-android port work much faster:
https://github.com/tzutalin/dlib-android/issues/15
It's very simple and you can implement it yourself. I am expecting a performance gain of about 2x-20x.
Apart from OpenCV and Google Vision, there are widely-available web services like Microsoft Cognitive Services. The advantage is that they are completely platform-independent, which you've listed as a major design goal. I haven't personally used them in an implementation yet, but based on playing with their demos for a while they seem quite powerful; they're pretty accurate and can offer quite a few details depending on what you want to know. (There are similar solutions available from other vendors as well, by the way.)
The two major potential downsides to something like that are the potential for added network traffic and API pricing (depending on how heavily you'll be using them).
Pricing-wise, Microsoft currently offers up to 5,000 transactions a month for free with added transactions beyond that being some fraction of a penny (depending on traffic, you can actually get a discount for high volume), but if you're doing, for example, millions of transactions per month the fees can start adding up surprisingly quickly. This is actually a fairly typical pricing model; before you select a vendor or implement this kind of a solution make sure you understand how they're going to charge you and how much you're likely to end up paying and how much you could be paying if you scale your user base. Depending on your traffic and business model it could be either very reasonable or cost-prohibitive.
The added network traffic may or may not be a problem depending on how your app is written and how much data you're sending. If you can do the processing asynchronously and be guaranteed reasonably fast Wi-Fi access that obviously wouldn't be a problem but unfortunately you may or may not have that luxury.
I am currently working with the Google Vision API and it seems to be able to detect landmarks out of the box. Check out the FaceTracker here:
google face tracker
This solution should detect the face, happiness, and left and right eye as is. For other landmarks, you can call getLandmarks on a Face and it should return everything you need (though I have not tried it) according to their documentation: Face reference
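A minimal sketch of requesting all landmarks with the Mobile Vision FaceDetector referred to above (setup only; the camera source and result handling are omitted):

    import android.content.Context;
    import android.graphics.PointF;
    import com.google.android.gms.vision.face.Face;
    import com.google.android.gms.vision.face.FaceDetector;
    import com.google.android.gms.vision.face.Landmark;

    final class LandmarkSetup {
        static FaceDetector buildDetector(Context context) {
            return new FaceDetector.Builder(context)
                    .setLandmarkType(FaceDetector.ALL_LANDMARKS)             // eyes, nose, mouth, cheeks, ears
                    .setClassificationType(FaceDetector.ALL_CLASSIFICATIONS) // smiling / eyes-open probabilities
                    .setMode(FaceDetector.FAST_MODE)
                    .setTrackingEnabled(true)
                    .build();
        }

        // For each detected Face, getLandmarks() returns the landmark points found.
        static void printLandmarks(Face face) {
            for (Landmark landmark : face.getLandmarks()) {
                PointF p = landmark.getPosition();
                android.util.Log.d("Landmarks", "type=" + landmark.getType() + " at " + p.x + "," + p.y);
            }
        }
    }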

Detect pattern of motion on an Android device

I want to detect a specific pattern of motion on an Android mobile phone, e.g. if I do five sit-stands.
[Note: I am currently detecting the motion, but the motion in all directions is the same.]
What I need is:
I need to differentiate the motion downward, upward, forward and backward.
I need to find the height of the mobile phone from ground level (and the height of the person holding it).
Is there any sample project which has pattern motion detection implemented?
This isn't impossible, since the accuracy of the accelerometers and gyroscopes in phones has improved a lot, but it may not be extremely accurate.
What your app will be doing is taking sensor data and doing a regression analysis.
1) You will need to build a model of data that you classify as five sit-stands. This could be done by asking the user to do five sit-stands, or by loading the app with a more fine-tuned model from data that you've collected beforehand. There may be tricks you could do, such as loading several models of people with different heights and asking the user to submit their own height in the app, to use the best model.
2) When run, your app will be trying to fit the data from the sensors (Android has great libraries for this) to the model that you've made. Hopefully, when the user performs five sit-stands, they will generate a set of motion data similar enough to your definition of five sit-stands that your algorithm accepts it as such.
A lot of the work here is assembling and classifying your model, and playing with it until you get an acceptable accuracy. Focus on what makes a sit-stand unique compared to other up-and-down motions. For instance, there might be a telltale sign of extending the legs in the data, followed by a different shape for straightening up fully. Or, if you expect the phone to be in the pocket, you may not have a lot of rotational motion, so you can reject test sets that registered lots of change from the gyroscope.
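Purely as an illustration of the kind of preprocessing this implies, here is a sketch that reduces a sliding window of accelerometer samples to a few features a trained model could consume (the class name and feature choice are hypothetical):

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Collects a sliding window of accelerometer samples and reduces it to a small
    // feature vector (mean, variance, peak) that a trained classifier could consume.
    class MotionFeatureWindow {
        private final Deque<Float> magnitudes = new ArrayDeque<>();
        private final int windowSize;

        MotionFeatureWindow(int windowSize) { this.windowSize = windowSize; }

        void addSample(float ax, float ay, float az) {
            magnitudes.addLast((float) Math.sqrt(ax * ax + ay * ay + az * az));
            if (magnitudes.size() > windowSize) magnitudes.removeFirst();
        }

        // Returns {mean, variance, max} for the current window.
        float[] features() {
            float mean = 0f, max = Float.NEGATIVE_INFINITY;
            for (float m : magnitudes) { mean += m; if (m > max) max = m; }
            mean /= Math.max(1, magnitudes.size());
            float var = 0f;
            for (float m : magnitudes) { var += (m - mean) * (m - mean); }
            var /= Math.max(1, magnitudes.size());
            return new float[] { mean, var, max };
        }
    }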
It is impossible. You can recognize downward and upward motion by comparing acceleration with the main gravity force, but how do you know whether your phone is in the back pocket when you rise, or just in your waving hand when you say hello? Was it 5 stand-ups or 5 hellos?
Forward and backward are even more unpredictable. What is forward for an upside-down phone? What is forward at all from the phone's point of view?
And ground level, as well as height, is completely beyond measurement. The phone will move and produce accelerations in exactly the same way for a dwarf or a giant; it depends more on the person's behavior or stillness than on height.
It's a topic of research and probably I'm way too late to post it here, but I'm foraging the literature anyway, so what?
All kinds of machine learning approaches have been applied to the issue; I'll mention some along the way. Andrew Ng's MOOC on machine learning gives you an entry point to the field and into Matlab/Octave that you can instantly put into practice; it demystifies the monsters too ("support vector machine").
I'd like to detect whether somebody is drunk from phone acceleration and maybe angle, so I'm flirting with neural networks for the issue (they're good for basically every issue, if you can afford the hardware), since I don't want to assume pre-defined patterns to look for.
Your task could be approached pattern-based, it seems: an approach applied to classifying golf play motions, dancing, everyday behavioural walking patterns, and twice to drunk-driving detection, where one paper addresses the issue of finding a baseline for what actually is longitudinal motion as opposed to every other direction. That could perhaps contribute to finding the baselines you need, like what ground level is.
It is a dense shrub of aspects and approaches; below are just some more.
Lim e.a. 2009: Real-time End Point Detection Specialized for Acceleration Signal
He & Yin 2009: Activity Recognition from Acceleration Data Based on Discrete Cosine Transform and SVM
Dhoble e.a. 2012: Online Spatio-Temporal Pattern Recognition with Evolving Spiking Neural Networks utilising Address Event Representation, Rank Order, and Temporal Spike Learning
Panagiotakis e.a.: Temporal segmentation and seamless stitching of motion patterns for synthesizing novel animations of periodic dances
This one uses visual data, but walks you through a Matlab implementation of a neural network classifier:
Symeonidis 2000: Hand Gesture Recognition Using Neural Networks
I do not necessarily agree with Alex's response. This is possible (although maybe not as accurate as you would like) using the accelerometer, device rotation, and a LOT of trial/error and data mining.
The way I see this working is by defining a specific way that the user holds the device (or the device is locked and positioned on the user's body). As they go through the motions, the orientation combined with acceleration and time will determine what sort of motion is being performed. You will need to use classes like OrientationEventListener, SensorEventListener, SensorManager, Sensor and various timers, e.g. Runnables or TimerTasks.
From there, you need to gather a lot of data. Observe, record and study what the numbers are for doing specific actions, and then come up with a range of values that define each movement and sub-movements. What I mean by sub-movements is, maybe a situp has five parts:
1) Rest position where phone orientation is x-value at time x
2) Situp started where phone orientation is range of y-values at time y (greater than x)
3) Situp is at final position where phone orientation is range of z-values at time z (greater than y)
4) Situp is in rebound (the user is falling back down to the floor) where phone orientation is range of y-values at time v (greater than z)
5) Situp is back at rest position where phone orientation is x-value at time n (greatest and final time)
Add acceleration to this as well, because there are certain circumstances where acceleration can be assumed. For example, my hypothesis is that people perform the actual situp (steps 1-3 in my above breakdown) at a faster acceleration than when they are falling back. In general, most people fall slower because they cannot see what's behind them. That can also be used as an additional condition to determine the direction of the user. This is probably not true for all cases, however, which is why your data mining is necessary. Because I can also hypothesize that if someone has done many situps, that final situp is very slow and then they just collapse back down to rest position due to exhaustion. In this case the acceleration will be opposite of my initial hypothesis.
Lastly, check out Motion Sensors: http://developer.android.com/guide/topics/sensors/sensors_motion.html
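A minimal sketch of the sensor plumbing this approach assumes, combining linear acceleration with the rotation vector so the sub-movement thresholds above can be defined independently of how the phone is held (the class name and classifier hooks are hypothetical):

    import android.content.Context;
    import android.hardware.Sensor;
    import android.hardware.SensorEvent;
    import android.hardware.SensorEventListener;
    import android.hardware.SensorManager;

    // Registers for linear acceleration (gravity already removed) and the rotation
    // vector, which together give the orientation + acceleration + time values the
    // sub-movement thresholds would be defined over.
    class SitupSensorSource implements SensorEventListener {
        private final SensorManager sensorManager;

        SitupSensorSource(Context context) {
            sensorManager = (SensorManager) context.getSystemService(Context.SENSOR_SERVICE);
        }

        void start() {
            sensorManager.registerListener(this,
                    sensorManager.getDefaultSensor(Sensor.TYPE_LINEAR_ACCELERATION),
                    SensorManager.SENSOR_DELAY_GAME);
            sensorManager.registerListener(this,
                    sensorManager.getDefaultSensor(Sensor.TYPE_ROTATION_VECTOR),
                    SensorManager.SENSOR_DELAY_GAME);
        }

        void stop() {
            sensorManager.unregisterListener(this);
        }

        @Override
        public void onSensorChanged(SensorEvent event) {
            if (event.sensor.getType() == Sensor.TYPE_LINEAR_ACCELERATION) {
                // event.values = acceleration minus gravity (m/s^2); event.timestamp in ns.
                // Feed (timestamp, values) into your movement/sub-movement classifier here.
            } else if (event.sensor.getType() == Sensor.TYPE_ROTATION_VECTOR) {
                float[] rotationMatrix = new float[9];
                SensorManager.getRotationMatrixFromVector(rotationMatrix, event.values);
                // Use the rotation matrix to express acceleration in world coordinates,
                // so "up/down" does not depend on the phone's orientation.
            }
        }

        @Override
        public void onAccuracyChanged(Sensor sensor, int accuracy) { }
    }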
All in all, it is really a numbers game combined with your own "guesstimation". But you might be surprised at how well it works. Perhaps (hopefully) good enough for your purposes.
Good luck!

Project Tango: Converting between coordinate systems and merging point clouds

I am trying to convert point clouds sampled and stored in XYZij data (which, according to the documentation, stores data in camera space) into a world coordinate system so that they can be merged. The frame pair I use for the Tango listener has COORDINATE_FRAME_START_OF_SERVICE as the base frame and COORDINATE_FRAME_DEVICE as the target frame.
This is the way I implement the transformation:
Retrieve the rotation quaternion from TangoPoseData.getRotationAsFloats() as q_r, and the point position from XYZij as p.
Apply the following rotation, where q_mult is a helper method computing the Hamilton product of two quaternions (I have verified this method against another math library):
p_transformed = q_mult(q_mult(q_r, p), q_r_conjugated);
Add the translation retrieved from TangoPoseData.getTranslationAsFloats() to p_transformed.
But eventually, points at p_transformed always seem to end up in a clutter of partly overlapping point clouds instead of an aligned, merged point cloud.
Am I missing anything here? Is there a conceptual mistake in the transformation?
Thanks in advance.
Ken & Vincenzo, thanks for the reply.
I somehow get better results by performing ICP registration using CloudCompare on individual point clouds after they are transformed into world coordinates using pose data alone. Below is a sample result from ~30 scans of a computer desk. Points on the farther end are still a bit off, but with carefully tuned parameters this might be improved. Also CloudCompare's command line interface makes it suitable for batch processing.
Besides the inevitable integration error that needs to be corrected, a mistake I made earlier was wrongly taking the camera space frame (the camera on the device), which is described here in the documentation, to be the same as the OpenGL camera frame, which is the same as the device frame as described here. But they are not.
Also, moving the camera slowly to get more overlap between two adjacent frames helps registration. And good visible lighting of the scene is important, since besides the motion sensors, Tango also relies on the fisheye camera on its back for motion tracking.
Hope the tips also work for more general cases other than mine.
There are two different "standard" forms of quaternion notation. One puts the scalar (rotation) component first, i.e. (w, x, y, z), and one puts it last, i.e. (x, y, z, w). The Tango API docs list TangoPoseData::orientation as x y z w, while the Wikipedia page on quaternions writes the scalar part first. You might want to check which notation is assumed in your product method.
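To make the notation point concrete, here is a sketch of rotating a point by a quaternion stored in Tango's x, y, z, w order (illustrative only, not the poster's q_mult code):

    // Rotate point p by unit quaternion q, where q is stored as {x, y, z, w} (Tango order).
    // Computes p' = q * p * q^-1 with p treated as a pure quaternion (w = 0),
    // using the optimized form p' = p + w*t + cross(q.xyz, t) with t = 2*cross(q.xyz, p).
    final class QuaternionRotation {
        static float[] rotate(float[] q, float[] p) {
            float qx = q[0], qy = q[1], qz = q[2], qw = q[3];  // note: w is LAST here
            float tx = 2f * (qy * p[2] - qz * p[1]);
            float ty = 2f * (qz * p[0] - qx * p[2]);
            float tz = 2f * (qx * p[1] - qy * p[0]);
            return new float[] {
                    p[0] + qw * tx + (qy * tz - qz * ty),
                    p[1] + qw * ty + (qz * tx - qx * tz),
                    p[2] + qw * tz + (qx * ty - qy * tx),
            };
        }
    }
    // If your math library expects (w, x, y, z), swap the components before calling it,
    // then add the translation from TangoPoseData.getTranslationAsFloats().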
Where is your pose data coming from? Are you getting the most recent pose after you are in the callback for the point cloud data or are you asking for the pose that corresponds to the timestamp in the XYZij struct? You should be asking for the pose at time "timestamp" from the XYZij struct.
I tried it, it does not work.
I tried to queue the pose and get the nearest one to the XYZij.
Look at the blue wall
The real wall
We at roomplan.de created an open-source sample showing how to use PCL in Project Tango apps. It records point clouds and transforms them into a common coordinate frame (the start-of-service frame). You can find the sample code here: https://github.com/roomplan/tango-examples-java/tree/master/PointCloudJava_with_PCL. The specific function is in jni/jni_part.cpp: Java_com_tangoproject_experiments_javapointcloud_PointCloudActivity_saveRotatedPointCloud.
If you want the sample to compile, you need to clone the complete folder and integrate PCL into your project. A description of how this can be done can be found on our website.
Sample pictures can be viewed in the demo app in the Play Store (can't post them here yet): https://play.google.com/store/apps/details?id=com.tangoproject.experiments.javapointcloud&hl=en
