I'm trying to detect objects and text using Firebase ML Kit on a live camera feed in Android. There are specific recognizers (FirebaseVisionTextRecognizer, FirebaseVisionObjectDetector) to process the image. If I use these recognizers one at a time it works fine, and I'm able to get the desired response.
However, I want to detect both objects and text simultaneously on the same camera feed, just like the Google Lens app. To achieve this, I first tried to run both recognizers together, but the latency (time taken to process a single frame) is higher because they run sequentially, and as a result only text detection works, not object detection. That is, there is no result from the object detection.
Then I tried to run both recognizers in parallel. The latency decreases, but not enough for the detection API to return a response. When there is no text in the camera feed, object detection works well, but when there is text in the camera feed the latency increases and no objects are tracked.
Note: I checked the latency of the post-detection function call (the code which executes after detecting the object) and it doesn't take much time. It is the recognizers that take more time to process the image in the case of parallel execution. I'm testing on a Samsung Galaxy S30s phone, and I guess its processor is not all that weak.
A few highlights from the code (a rough sketch of this setup follows the list):
Using FirebaseVisionObjectDetectorOptions.STREAM_MODE, enableMultipleObjects=false and enableClassification=false for object detection
Using FirebaseVisionImageMetadata.IMAGE_FORMAT_NV21 format while building FirebaseVisionImageMetadata
As per the best practices defined by Google, dropping the latest frames if detection is in progress
Using OnDeviceObjectDetector for the object detection
For text detection, I use OnDeviceTextRecognizer
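Here is a rough sketch of that setup, using the legacy ML Kit for Firebase API. The frame size, rotation, variable names and callback wiring are illustrative assumptions, not my exact code:

    // Sketch of the detector setup described above (legacy ML Kit for Firebase).
    FirebaseVisionObjectDetectorOptions objectOptions =
            new FirebaseVisionObjectDetectorOptions.Builder()
                    .setDetectorMode(FirebaseVisionObjectDetectorOptions.STREAM_MODE)
                    .build();  // multiple objects and classification left disabled

    FirebaseVisionObjectDetector objectDetector =
            FirebaseVision.getInstance().getOnDeviceObjectDetector(objectOptions);
    FirebaseVisionTextRecognizer textRecognizer =
            FirebaseVision.getInstance().getOnDeviceTextRecognizer();

    // Per camera frame (NV21 byte[] from the preview callback):
    FirebaseVisionImageMetadata metadata = new FirebaseVisionImageMetadata.Builder()
            .setFormat(FirebaseVisionImageMetadata.IMAGE_FORMAT_NV21)
            .setWidth(previewWidth)     // illustrative
            .setHeight(previewHeight)   // illustrative
            .setRotation(FirebaseVisionImageMetadata.ROTATION_90)
            .build();
    FirebaseVisionImage image = FirebaseVisionImage.fromByteArray(frameData, metadata);

    // Running both on the same frame; each processImage() call is asynchronous,
    // but ML Kit may still serialize the work internally (see the answer below).
    objectDetector.processImage(image)
            .addOnSuccessListener(objects -> { /* draw object boxes */ });
    textRecognizer.processImage(image)
            .addOnSuccessListener(text -> { /* draw text blocks */ });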
I need help understanding how the Google Lens app runs multiple recognizers together when my application cannot. What can I do to enable multiple recognizers on the same camera frame?
For now, the way to run multiple detectors on the same image frame is to run them sequentially, because we internally run them in a single thread. We are actively adding support for running different detectors in parallel.
...because they run sequentially, and as a result only text detection works, not object detection.
The ObjectDetection feature with STREAM_MODE expects the latency between two image frames to be small, say < 300 ms. If you run text recognition in between, the latency may be too long for the ObjectDetection feature to function properly. You may change STREAM_MODE to SINGLE_IMAGE_MODE to get results in your setting, but the latency will be higher.
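For reference, switching modes is a one-line change in the options builder (same legacy API as in the sketch above):

    FirebaseVisionObjectDetectorOptions options =
            new FirebaseVisionObjectDetectorOptions.Builder()
                    .setDetectorMode(FirebaseVisionObjectDetectorOptions.SINGLE_IMAGE_MODE)
                    .build();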
Related
This question is regarding the TFDetect demo, which is available as part of the TensorFlow Android Camera Demo. The description says:
Demonstrates a model based on Scalable Object Detection using Deep Neural Networks to localize and track people in the camera preview in real-time.
When I ran the demo, the app was drawing boxes around detected objects with a fractional number assigned to each object (I guess the confidence score). My question is: how is tracking being performed here? Is it multiple object tracking (described here), where an ID is assigned to each track and the tracks are stored in memory, or is it just detection of objects across multiple frames to see how the object is moving?
Please correct me if I missed out on anything.
Two main things are going on here:
1: detection is being done in a background thread. This takes on the order of 100-1000 ms depending on the device, so it is not fast enough on its own to maintain smooth tracking.
2: tracking is being done in the UI thread. This generally takes < 5 ms per frame, and can be done on every frame once the position of objects is known. The tracker implements pyramidal Lucas-Kanade optical flow on the median movement of FAST features -- press the volume key and you'll see the individual features being tracked.
The tracker runs on every frame, storing optical flow keypoints at every timestamp. Thus, when a detection comes in, the tracker is able to figure out where the object currently is by walking the position forward along the collected keypoint deltas. There is also some non-max suppression being done by the MultiBoxTracker.
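A simplified sketch of that "walk the position forward" idea; the class and method names here are illustrative, not the actual MultiBoxTracker code:

    // Illustrative only: accumulate per-frame median optical-flow deltas and
    // apply the ones newer than the detection's timestamp to its bounding box.
    import android.graphics.RectF;
    import java.util.ArrayDeque;
    import java.util.Deque;

    class DeltaHistory {
        private static class FrameDelta {
            final long timestampMs; final float dx, dy;
            FrameDelta(long t, float dx, float dy) { timestampMs = t; this.dx = dx; this.dy = dy; }
        }

        private final Deque<FrameDelta> deltas = new ArrayDeque<>();

        // Called every frame by the tracker with the median motion of the FAST features.
        void onFrame(long timestampMs, float medianDx, float medianDy) {
            deltas.addLast(new FrameDelta(timestampMs, medianDx, medianDy));
        }

        // Called when a (stale) detection arrives: move its box forward in time.
        RectF bringUpToDate(RectF detectedBox, long detectionTimestampMs) {
            RectF box = new RectF(detectedBox);
            for (FrameDelta d : deltas) {
                if (d.timestampMs > detectionTimestampMs) {
                    box.offset(d.dx, d.dy);
                }
            }
            return box;
        }
    }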
Once an object is tracked by the tracker, no further input from the detector is required. The tracker will automatically drop the track when the normalized cross-correlation with the original detection drops below a certain threshold, or update the position/appearance when the detector finds a better match with significant overlap.
I am developing an app in which I need to get face landmark points on a live camera preview, like a mirror cam or makeup cam. I want it to be available for iOS too. Please guide me toward a robust solution.
I have used Dlib and Luxand.
DLIB: https://github.com/tzutalin/dlib-android-app
Luxand: http://www.luxand.com/facesdk/download/
Dlib is slow, with a lag of approximately 2 seconds (please look at the demo video on the Git page), and Luxand is OK but it's paid. My priority is to use an open-source solution.
I have also used Google Vision, but it does not offer many face landmark points.
So please give me a solution to make Dlib work fast, or any other option, keeping cross-platform support a priority.
Thanks in advance.
You can make Dlib detect face landmarks in real-time on Android (20-30 fps) if you take a few shortcuts. It's an awesome library.
Initialization
Firstly you should follow all the recommendations in Evgeniy's answer, especially make sure that you only initialize the frontal_face_detector and shape_predictor objects once instead of every frame. The frontal_face_detector will initialize faster if you deserialize it from a file instead of using the get_serialized_frontal_faces() function. The shape_predictor needs to be initialized from a 100Mb file, and takes several seconds. The serialize and deserialize functions are written to be cross-platform and perform validation on the data, which is robust but makes it quite slow. If you are prepared to make assumptions about endianness you can write your own deserialization function that will be much faster. The file is mostly made up of matrices of 136 floating point values (about 120000 of them, meaning 16320000 floats in total). If you quantize these floats down to 8 or 16 bits you can make big space savings (e.g. you can store the min value and (max-min)/255 as floats for each matrix and quantize each separately). This reduces the file size down to about 18Mb and it loads in a few hundred milliseconds instead of several seconds. The decrease in quality from using quantized values seems negligible to me but YMMV.
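To make the quantization idea concrete, here is a minimal sketch of the per-matrix scheme described above. The real thing would live in your custom (de)serialization code; this only shows the arithmetic:

    // Quantize one matrix of floats to 8-bit values plus a per-matrix min/scale.
    static byte[] quantize(float[] values, float[] outMinAndScale) {
        float min = Float.MAX_VALUE, max = -Float.MAX_VALUE;
        for (float v : values) { min = Math.min(min, v); max = Math.max(max, v); }
        float scale = (max - min) / 255f;       // store min and scale as floats
        outMinAndScale[0] = min;
        outMinAndScale[1] = scale;
        byte[] q = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            q[i] = (byte) Math.round((values[i] - min) / (scale == 0f ? 1f : scale));
        }
        return q;
    }

    // Dequantize at load time: value = min + scale * unsigned(byte).
    static float dequantize(byte q, float min, float scale) {
        return min + scale * (q & 0xFF);
    }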
Face Detection
You can scale the camera frames down to something small like 240x160 (or whatever, keeping the aspect ratio correct) for faster face detection. It means you can't detect smaller faces, but that might not be a problem depending on your app. Another more complex approach is to adaptively crop and resize the region you use for face detection: initially check for all faces in a higher-res image (e.g. 480x320) and then crop the area +/- one face width around the previous location, scaling down if need be. If you fail to detect a face in one frame then revert to detecting the entire region in the next one.
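A sketch of the scale-down-then-scale-back bookkeeping. The detectFace() call here is a hypothetical stand-in for your dlib JNI wrapper, and 240x160 is just the example size from above:

    // Illustrative: run face detection on a downscaled copy, then map the
    // resulting rectangle back to full-resolution coordinates.
    import android.graphics.Bitmap;
    import android.graphics.Rect;

    Rect detectDownscaled(Bitmap fullRes) {
        final int targetWidth = 240;                        // example size
        float scale = (float) targetWidth / fullRes.getWidth();
        int targetHeight = Math.round(fullRes.getHeight() * scale);
        Bitmap small = Bitmap.createScaledBitmap(fullRes, targetWidth, targetHeight, true);

        Rect faceInSmall = detectFace(small);               // hypothetical dlib JNI call
        if (faceInSmall == null) return null;

        // Map back to full-resolution coordinates for landmark fitting.
        return new Rect(
                Math.round(faceInSmall.left / scale),
                Math.round(faceInSmall.top / scale),
                Math.round(faceInSmall.right / scale),
                Math.round(faceInSmall.bottom / scale));
    }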
Face Tracking
For faster face tracking, you can run face detections continuously in one thread, and then in another thread track the detected face(s) and perform face feature detections using the tracked rectangles. In my testing I found that face detection took between 100-400 ms depending on which phone I used (at about 240x160), and I could do 7 or 8 face feature detections on the intermediate frames in that time. This can get a bit tricky if the face is moving a lot, because when you get a new face detection (which will be from 400 ms ago), you have to decide whether to keep tracking from the newly detected location or from the tracked location of the previous detection. Dlib includes a correlation_tracker; however, unfortunately I wasn't able to get it to run faster than about 250 ms per frame, and scaling down the resolution (even drastically) didn't make much of a difference. Tinkering with internal parameters produced increased speed but poor tracking. I ended up using a CAMShift tracker based on the chroma UV planes of the preview frames, generating the color histogram based on the detected face rectangles. There is an implementation of CAMShift in OpenCV, but it's also pretty simple to roll your own.
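For reference, a minimal CAMShift loop with OpenCV's Java bindings looks roughly like this. I tracked on the NV21 chroma (UV) planes; for simplicity this sketch builds the histogram from the hue channel of an HSV frame instead, and the variable names and bin counts are illustrative:

    // Rough CAMShift sketch with OpenCV's Java bindings.
    import org.opencv.core.*;
    import org.opencv.imgproc.Imgproc;
    import org.opencv.video.Video;
    import java.util.Collections;
    import java.util.List;

    class FaceCamShift {
        private final Mat hist = new Mat();
        private Rect window;                 // updated in place by CamShift

        // Build the color histogram from the detected face rectangle.
        void init(Mat hsvFrame, Rect detectedFace) {
            window = detectedFace.clone();
            List<Mat> roi = Collections.singletonList(new Mat(hsvFrame, detectedFace));
            Imgproc.calcHist(roi, new MatOfInt(0), new Mat(), hist,
                    new MatOfInt(30), new MatOfFloat(0f, 180f));
            Core.normalize(hist, hist, 0, 255, Core.NORM_MINMAX);
        }

        // Call per preview frame; returns the tracked (rotated) face box.
        RotatedRect track(Mat hsvFrame) {
            Mat backProjection = new Mat();
            Imgproc.calcBackProject(Collections.singletonList(hsvFrame),
                    new MatOfInt(0), hist, backProjection, new MatOfFloat(0f, 180f), 1.0);
            return Video.CamShift(backProjection, window,
                    new TermCriteria(TermCriteria.EPS | TermCriteria.COUNT, 10, 1));
        }
    }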
Hope this helps. It's mostly a matter of picking the low-hanging fruit for optimization first and just keeping going until you're happy it's fast enough. On a Galaxy Note 5, Dlib does face+feature detection at about 100 ms, which might be good enough for your purposes even without all this extra complication.
Dlib is fast enough for most cases. Most of the processing time is spent detecting the face region in the image, and that is slow because modern smartphones produce high-resolution images (10 MP+).
Yes, face detection can take 2+ seconds on a 3-5 MP image, but that's because it tries to find very small faces, down to 80x80 pixels in size. I am quite sure that you don't need such small faces in high-resolution images, and the main optimization here is to reduce the size of the image before finding faces.
After the face region is found, the next step, face landmark detection, is extremely fast and takes < 3 ms per face; this time does not depend on resolution.
The dlib-android port is not using dlib's detector the right way for now. Here is a list of recommendations on how to make the dlib-android port work much faster:
https://github.com/tzutalin/dlib-android/issues/15
It's very simple and you can implement it yourself. I'm expecting a performance gain of about 2x-20x.
Apart from OpenCV and Google Vision, there are widely available web services like Microsoft Cognitive Services. The advantage is that they would be completely platform-independent, which you've listed as a major design goal. I haven't personally used them in an implementation yet, but based on playing with their demos for a while they seem quite powerful; they're pretty accurate and can offer quite a few details depending on what you want to know. (There are similar solutions available from other vendors as well, by the way.)
The two major potential downsides to something like that are the added network traffic and the API pricing (depending on how heavily you'll be using them).
Pricing-wise, Microsoft currently offers up to 5,000 transactions a month for free, with added transactions beyond that costing some fraction of a penny (depending on traffic, you can actually get a discount for high volume), but if you're doing, for example, millions of transactions per month, the fees can start adding up surprisingly quickly. This is actually a fairly typical pricing model; before you select a vendor or implement this kind of solution, make sure you understand how they're going to charge you, how much you're likely to end up paying, and how much you could be paying if you scale your user base. Depending on your traffic and business model it could be either very reasonable or cost-prohibitive.
The added network traffic may or may not be a problem depending on how your app is written and how much data you're sending. If you can do the processing asynchronously and be guaranteed reasonably fast Wi-Fi access that obviously wouldn't be a problem but unfortunately you may or may not have that luxury.
I am currently working with the Google Vision API and it seems to be able to detect landmarks out of the box. Check out the FaceTracker here:
google face tracker
This solution should detect the face, happiness, and the left and right eyes as is. For other landmarks, you can call getLandmarks() on a Face and it should return everything you need (though I have not tried it) according to their documentation: Face reference
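As a rough illustration of the getLandmarks() call (based on the Mobile Vision documentation; the still-image Frame setup here is just for brevity, the FaceTracker sample wires the detector to the camera instead):

    // Sketch: detect faces on a Bitmap and read the landmark positions.
    import android.graphics.Bitmap;
    import android.graphics.PointF;
    import android.util.SparseArray;
    import com.google.android.gms.vision.Frame;
    import com.google.android.gms.vision.face.Face;
    import com.google.android.gms.vision.face.FaceDetector;
    import com.google.android.gms.vision.face.Landmark;

    void detectLandmarks(android.content.Context context, Bitmap bitmap) {
        FaceDetector detector = new FaceDetector.Builder(context)
                .setLandmarkType(FaceDetector.ALL_LANDMARKS)   // ask for all landmarks
                .setTrackingEnabled(false)
                .build();

        SparseArray<Face> faces = detector.detect(new Frame.Builder().setBitmap(bitmap).build());
        for (int i = 0; i < faces.size(); i++) {
            for (Landmark landmark : faces.valueAt(i).getLandmarks()) {
                PointF p = landmark.getPosition();
                // landmark.getType() is e.g. Landmark.LEFT_EYE, Landmark.NOSE_BASE, ...
            }
        }
        detector.release();
    }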
I want to programmatically read numbers on a page using a mobile phone's camera instead of from an image, just like barcode scanning.
I know that we can read or scan a barcode, but is there any way to read numbers using the same strategy? Another thing: I also know that we can read text or numbers from an image using OCR, but I don't want to take a photo/image and then process it; I just want to scan and get the result.
You mean to say that you don't want to take a picture and process it; instead, you want to scan text by just hovering the camera over it, am I right?
It can be accomplished using a technology called Optical Character Recognition (OCR), which you already mentioned. What it does is find patterns in images to detect text in printed documents.
As far as I know, existing tools process still images, so you will have to work around that to make them scan moving images.
Character recognition demands a significant amount of resources, so instead of processing a moving picture I would recommend writing a program that takes images less frequently from the hovering camera and processes them. Once text, or numbers in your case, is detected, you could use a cheaper pattern-matching algorithm to track the motion of the numbers.
To date, the most powerful and popular OCR software is Tesseract-OCR. You will find it on GitHub, and you can use it to develop your mobile application.
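If you go the Tesseract route on Android, a common approach is a wrapper such as tess-two (my assumption as to which wrapper; Tesseract itself is a C++ library). A minimal sketch, assuming the "eng" trained data file has already been copied to dataPath/tessdata/:

    // Sketch using the tess-two wrapper around Tesseract.
    import android.graphics.Bitmap;
    import com.googlecode.tesseract.android.TessBaseAPI;

    String recognizeDigits(Bitmap frame, String dataPath) {
        TessBaseAPI tess = new TessBaseAPI();
        tess.init(dataPath, "eng");
        // Restrict recognition to digits, since only numbers are needed here.
        tess.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, "0123456789");
        tess.setImage(frame);
        String text = tess.getUTF8Text();
        tess.end();
        return text;
    }

You would feed this method frames taken at a low rate from the camera preview, as suggested above, rather than every frame.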
I'm trying to use Android and OpenGL 2.0 to create a sort-of desert racing game. At least that's the end goal. For the time being I'm really just working with generating an endless desert, through the use of a Perlin noise algorithm. However, I'm coming across a lot of problems with regard to concurrency and synchronization. The program consists of three threads: a "render" thread, a "geometry" thread which essentially sits in the background generating tiles of perlin noise (eventually sending them through to the render thread to process in its own time) and a "main" thread which updates the camera's position and updates the geometry thread if new perlin noise tiles need to be created.
Aforementioned perlin tiles are stored in VBOs and only rendered when they're within a certain distance of the camera. Buffer initialization always begins immediately.
This all works well, without any noticeable problems.
HOWEVER.
When the tiles are uploaded to the GPU through glBufferData() (after processing by the separate geometry thread), the render thread always appears to block. I presume this is because Android implicitly calls glFinish() before the screen buffer is rendered. Obviously, I'd like the data uploading to be performed in the background while everything else is being drawn - even taking place over multiple frames if necessary.
I've looked on Google and the only solution I could find is to use glMapBuffer()/glMapBufferRange(), but these two methods aren't supported in GLES 2.0. Neither are any of the synchronization objects (glFenceSync etc.), so...
....
any help?
P.S. I haven't provided any code because I didn't think it was necessary; the problem seems more theoretical to me. However, I can certainly produce some on request.
A screenshot of the game so far:
http://i.stack.imgur.com/Q6S0k.png
Android does not call glFinish() (glFinish() is actually a no-op on IMG's GPUs). The problem is that glBufferData() is not an asynchronous API. What you really want is PBOs, which are only available in OpenGL ES 3.0 and do offer the ability to perform asynchronous copies (including texture uploads).
Are you always using glBufferData()? You should use glBufferSubData() as much as possible to avoid reallocating your VBO every time.
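A sketch of the allocate-once, update-in-place pattern (buffer sizes and the usage hint are illustrative):

    // Allocate the VBO once at the maximum tile size, then stream new tile data
    // into it with glBufferSubData instead of reallocating via glBufferData.
    import android.opengl.GLES20;
    import java.nio.FloatBuffer;

    int createTileVbo(int maxBytes) {
        int[] vbo = new int[1];
        GLES20.glGenBuffers(1, vbo, 0);
        GLES20.glBindBuffer(GLES20.GL_ARRAY_BUFFER, vbo[0]);
        // Reserve storage once; no data uploaded yet.
        GLES20.glBufferData(GLES20.GL_ARRAY_BUFFER, maxBytes, null, GLES20.GL_DYNAMIC_DRAW);
        return vbo[0];
    }

    void updateTileVbo(int vboId, FloatBuffer vertices) {
        GLES20.glBindBuffer(GLES20.GL_ARRAY_BUFFER, vboId);
        GLES20.glBufferSubData(GLES20.GL_ARRAY_BUFFER, 0,
                vertices.remaining() * 4 /* bytes per float */, vertices);
    }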
For my next project, I started analyzing apps that measure your pulse via the camera (you press a finger against the camera and you get your pulse info).
I concluded that the apps receive data from the camera with the help of a light. How do they achieve this? Can you direct me to any area I should investigate?
If anyone is in the mood to help, could you explain how pulse-measuring apps work? I cannot find ANY documentation on this topic on the net.
Thanks in advance
To complement Robert's answer from a non-programming perspective (since you asked for it), pulse-measuring apps are based on pulse oximetry.
The idea is to measure the absorbance of red light, which varies as oxygenated blood passes through your fingertip. When that happens there will be a peak in absorbance; you only have to count the number of peaks registered in a given time frame and divide by that time frame to compute the cardiac frequency.
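In code terms, that counting step boils down to something like this naive sketch (real apps smooth the signal and reject outliers):

    // Naive sketch: count upward threshold crossings ("peaks") in a brightness
    // signal sampled over a known time window, then convert to beats per minute.
    double estimateBpm(float[] brightnessSamples, double windowSeconds) {
        float mean = 0f;
        for (float s : brightnessSamples) mean += s;
        mean /= brightnessSamples.length;

        int peaks = 0;
        boolean above = false;
        for (float s : brightnessSamples) {
            if (!above && s > mean) { peaks++; above = true; }   // rising edge
            else if (s <= mean) { above = false; }
        }
        return peaks * 60.0 / windowSeconds;
    }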
IMHO, doing this on a mobile device is not very reliable, since it really requires good lighting conditions and infra-red pulses, and there are several factors that make the task very difficult:
1) Some phones may not have the flash LED light right near the camera
2) Some phones may not have a flash light at all
3) You don't have access to infra-red data.
4) The phone has to be absolutely still, or the image will be constantly changing, making the brightness measurement unreliable.
AFAIK such apps are using the preview mode of the Camera.
Using the method setPreviewCallback(..) (and of course startPreview()) you can register your own listener that continuously receives calls from the camera containing the currently seen picture:
onPreviewFrame(byte[] data, Camera camera)
The image data is contained in the data byte array. The format of the data can be set via setPreviewFormat(). Using this data you can, for example, process the image and reduce it to its brightness at a certain point in the image. Over time, the image brightness should show pulses.
I don't think the necessary image-processing algorithms are available by default in the Android runtime, so you will have to develop your own algorithms or look for 3rd-party libraries that can be used on Android.
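As a rough illustration of the preview-callback approach above (old android.hardware.Camera API, assuming you already have an opened Camera instance, and using the average of the Y plane of the default NV21 preview format as a crude brightness measure):

    // Sketch: average the luminance (Y) plane of each NV21 preview frame and
    // feed it into whatever pulse-detection logic you build on top.
    import android.hardware.Camera;

    camera.setPreviewCallback(new Camera.PreviewCallback() {
        @Override
        public void onPreviewFrame(byte[] data, Camera camera) {
            Camera.Size size = camera.getParameters().getPreviewSize();
            int pixels = size.width * size.height;   // Y plane comes first in NV21
            long sum = 0;
            for (int i = 0; i < pixels; i++) {
                sum += data[i] & 0xFF;
            }
            float averageBrightness = (float) sum / pixels;
            // Collect averageBrightness over time; peaks correspond to pulses.
        }
    });
    camera.startPreview();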