I'm trying to detect items held in a hand using ML Kit image labeling through a camera. If, for example, I show it a soda can, it picks up the hand, face, background, and other things I'm not interested in, and then fails to find the object in the hand, even at a 0.25 minimum confidence using Cloud Vision.
Is there a way to limit what the vision API looks for, or another way to increase accuracy?
PS: I am also willing to switch APIs if there is something better for this task.
//This is mostly from a google tutorial
private fun runCloudImageLabeling(bitmap: Bitmap) {
//Create a FirebaseVisionImage
val image = FirebaseVisionImage.fromBitmap(bitmap)
val detector = FirebaseVision.getInstance().visionCloudLabelDetector
//Use the detector to detect the labels inside the image
detector.detectInImage(image)
.addOnSuccessListener {
// Task completed successfully
progressBar.visibility = View.GONE
itemAdapter.setList(it)
sheetBehavior.setState(BottomSheetBehavior.STATE_EXPANDED)
}
.addOnFailureListener {
// Task failed with an exception
progressBar.visibility = View.GONE
Toast.makeText(baseContext, "Sorry, something went wrong!", Toast.LENGTH_SHORT).show()
}
}
What I want is the ability to detect what's in the hand with high accuracy.
There is no setting that controls the accuracy in the built-in object detection model that Firebase ML Kit uses.
If you want more accurate detection, you have two options:
Call out to Cloud Vision, the server-side API that can detect many more object categories, and typically with much higher accuracy. This is a paid API, but it does come with a free quota. See the comparison page in the documentation for details. (A sketch of this option follows below.)
Train your own model that is better equipped for the image types you care about. You can then use this custom model in your app to get better accuracy.
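For the first option, here is a minimal sketch of calling the cloud image labeler with a confidence threshold. This is my illustration, not the answerer's code; it assumes the newer firebase-ml-vision labeler API and a bitmap like the one in the question:
// A hedged sketch: cloud labeler with a confidence threshold.
val options = FirebaseVisionCloudImageLabelerOptions.Builder()
    .setConfidenceThreshold(0.7f) // only keep labels the model is reasonably confident about
    .build()
val labeler = FirebaseVision.getInstance().getCloudImageLabeler(options)
labeler.processImage(FirebaseVisionImage.fromBitmap(bitmap))
    .addOnSuccessListener { labels ->
        labels.forEach { Log.d("CloudLabels", "${it.text}: ${it.confidence}") }
    }
    .addOnFailureListener { e -> Log.e("CloudLabels", "Labeling failed", e) }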
ML Kit provides the Object Detection & Tracking API that you can use to locate objects.
That API allows you to filter on the prominent object (the one close to the center of the viewfinder), which is the soda can in your example. The API returns the bounding box around the object, which you can use to crop the image and then feed the crop through the Image Labeling API. This lets you filter out the non-relevant background and other objects.
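A minimal sketch of that pipeline (my illustration, not code from the answer; the function name is mine, and it assumes the firebase-ml-vision object detection dependency is added and uses the on-device labeler):
private fun labelProminentObject(bitmap: Bitmap) {
    // Without enableMultipleObjects(), only the most prominent object is returned.
    val options = FirebaseVisionObjectDetectorOptions.Builder()
        .setDetectorMode(FirebaseVisionObjectDetectorOptions.SINGLE_IMAGE_MODE)
        .build()
    val objectDetector = FirebaseVision.getInstance().getOnDeviceObjectDetector(options)
    val image = FirebaseVisionImage.fromBitmap(bitmap)

    objectDetector.processImage(image)
        .addOnSuccessListener { objects ->
            val obj = objects.firstOrNull() ?: return@addOnSuccessListener
            // Clamp the bounding box to the bitmap bounds before cropping.
            val box = Rect(obj.boundingBox)
            if (!box.intersect(0, 0, bitmap.width, bitmap.height)) return@addOnSuccessListener
            // Crop to the detected object so the labeler ignores the hand, face and background.
            val cropped = Bitmap.createBitmap(bitmap, box.left, box.top, box.width(), box.height())
            FirebaseVision.getInstance().onDeviceImageLabeler
                .processImage(FirebaseVisionImage.fromBitmap(cropped))
                .addOnSuccessListener { labels ->
                    labels.forEach { Log.d("Labels", "${it.text}: ${it.confidence}") }
                }
        }
        .addOnFailureListener { e -> Log.e("Labels", "Object detection failed", e) }
}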
I have already made apps for Text Recognition and Barcode Scanning separately, but is it possible to use Text Recognition and Barcode Scanning at the same time on a live stream?
I got confused after reading this code:
mCameraSource.setMachineLearningFrameProcessor(barcodeScanningProcessor);
Does that indicate that there can be only one machine learning frame processor per CameraSource?
Yes, it does; the sample code you are using is tailored to using one frame processor at a time.
One way to achieve what you are looking for is to process each frame individually, which would allow you to pass it to multiple APIs.
CameraView is one package that allows frame processing. You may want to throttle and only take one out of every X frames, since processing a frame is computationally intensive.
cameraView.addFrameProcessor(new FrameProcessor() {
    @Override
    @WorkerThread
    public void process(Frame frame) {
        byte[] data = frame.getData();
        int rotation = frame.getRotation();
        long time = frame.getTime();
        Size size = frame.getSize();
        int format = frame.getFormat();
        // Process the frame here.
        // This is where you'd pass the image to the Text Recognition API
        // and then to the Barcode Scanning API.
    }
});
You'd process each frame as described in the documentation for text recognition, and similarly for barcode scanning.
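As a hedged sketch (in Kotlin, assuming CameraView delivers NV21 preview frames; the helper name is mine), you could build one FirebaseVisionImage from the frame bytes and hand it to both detectors:
fun processFrame(data: ByteArray, width: Int, height: Int, rotationDegrees: Int) {
    // Map the frame rotation (degrees) to the ML Kit rotation constant.
    val rotation = when (rotationDegrees) {
        90 -> FirebaseVisionImageMetadata.ROTATION_90
        180 -> FirebaseVisionImageMetadata.ROTATION_180
        270 -> FirebaseVisionImageMetadata.ROTATION_270
        else -> FirebaseVisionImageMetadata.ROTATION_0
    }
    val metadata = FirebaseVisionImageMetadata.Builder()
        .setFormat(FirebaseVisionImageMetadata.IMAGE_FORMAT_NV21)
        .setWidth(width)
        .setHeight(height)
        .setRotation(rotation)
        .build()
    val image = FirebaseVisionImage.fromByteArray(data, metadata)

    // The same image can be fed to both detectors.
    // (You may also want to throttle, e.g. only process every Nth frame.)
    FirebaseVision.getInstance().onDeviceTextRecognizer
        .processImage(image)
        .addOnSuccessListener { result -> Log.d("OCR", result.text) }

    FirebaseVision.getInstance().visionBarcodeDetector
        .detectInImage(image)
        .addOnSuccessListener { barcodes ->
            barcodes.forEach { Log.d("Barcode", it.rawValue ?: "") }
        }
}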
The ARCore sceneform sample project "hellosceneform" is cool and works really well.
Problem is the requirement to move the phone around in order to get a surface on which to place anchors. It's too slow.
My application does not require anything to show up on a vertical plane (a wall), only ever on the floor. Is there any way I can skip the "move the phone around" step, or at least speed it up?
I've tried:
session.getConfig().setPlaneFindingMode(Config.PlaneFindingMode.HORIZONTAL);
Thinking that if I remove the need to look for vertical planes, it would all work faster. Not quite fast enough, it seems.
Thanks!
Unfortunately the framework is limited by (read: enabled by) the computer vision models that it uses to detect planes. The plane discovery controller (i.e. the "move the phone around" step) is a nudge to the user to provide the models with the depth information through the camera that they need to detect those planes. Removing this step won't speed up the process, it'll just leave the user without any instructions.
Without improvements to the core plane detection models I wouldn't expect that there's a way to make this faster. The best that we can do is come up with UX nudges that encourage the user to move the phone laterally more efficiently.
To hide the animation that shows users how they should move their phone, use:
arFragment.planeDiscoveryController.hide()
arFragment.planeDiscoveryController.setInstructionView(null)
Speeding up plane detection in ARCore is quite easy. Here's a code snippet:
class MainActivity : AppCompatActivity() {
lateinit var arFrag: ArFragment
override fun onCreate(savedInstanceState: Bundle?) {
super.onCreate(savedInstanceState)
setContentView(R.layout.activity_ux)
arFrag = supportFragmentManager.findFragmentById(R.id.ux_fragment) as ArFragment
arFrag.planeDiscoveryController.hide()
arFrag.planeDiscoveryController.setInstructionView(null)
arFrag.arSceneView.planeRenderer.isEnabled = false
arFrag.arSceneView.scene.setOnUpdateListener(::onFrame)
}
// ........................................................
}
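The snippet wires the scene's update listener to an onFrame function that isn't shown. Here is a hypothetical sketch of what it could do: place an anchor as soon as ARCore reports a tracked horizontal plane (the member names and logic are my assumptions, not part of the original answer):
private var placed = false

private fun onFrame(frameTime: FrameTime) {
    if (placed) return
    val frame = arFrag.arSceneView.arFrame ?: return
    if (frame.camera.trackingState != TrackingState.TRACKING) return
    for (plane in frame.getUpdatedTrackables(Plane::class.java)) {
        if (plane.trackingState == TrackingState.TRACKING &&
            plane.type == Plane.Type.HORIZONTAL_UPWARD_FACING) {
            // Anchor to the centre of the first upward-facing plane ARCore finds.
            val anchorNode = AnchorNode(plane.createAnchor(plane.centerPose))
            anchorNode.setParent(arFrag.arSceneView.scene)
            // Attach your renderable to anchorNode here.
            placed = true
            break
        }
    }
}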
Hope this helps.
After some weeks of waiting I finally have my Project Tango. My idea is to create an app that generates a point cloud of my room and exports this to .xyz data. I'll then use the .xyz file to show the point cloud in a browser! I started off by compiling and adjusting the point cloud example that's on Google's github.
Right now I use onXyzIjAvailable(TangoXyzIjData tangoXyzIjData) to get a frame of x, y and z values, i.e. the points. I then save these frames in a PCLManager in the form of Vector3. After I'm done scanning my room, I simply write all the Vector3 from the PCLManager to a .xyz file using:
OutputStream os = new FileOutputStream(file);
size = pointCloud.size();
for (int i = 0; i < size; i++) {
String row = String.valueOf(pointCloud.get(i).x) + " "
+ String.valueOf(pointCloud.get(i).y) + " "
+ String.valueOf(pointCloud.get(i).z) + "\r\n";
os.write(row.getBytes());
}
os.close();
Everything works fine: no compilation errors or crashes. The only thing that seems to be going wrong is the rotation or translation of the points in the cloud. When I view the point cloud, everything is messed up; the area I scanned is not recognizable, though the number of points is the same as recorded.
Could this have something to do with the fact that I don't use PoseData together with the XyzIjData? I'm kind of new to this subject and have a hard time understanding what the PoseData actually does. Could someone explain it to me and help me fix my point cloud?
Yes, you have to use TangoPoseData.
I guess you are using TangoXyzIjData correctly; but the data you get this way is relative to where the device is and how the device is tilted when you take the shot.
Here's how I solved this:
I started from the java_point_to_point_example. In that example they get the coordinates of two different points in two different coordinate systems and then write those coordinates wrt the base coordinate frame pair.
First of all you have to set up your extrinsics, so you'll be able to perform all the transformations you'll need. To do that I call mExstrinsics = setupExtrinsics(mTango) at the end of my setTangoListener() function. Here's the code (which you can also find in the example I linked above).
private DeviceExtrinsics setupExtrinsics(Tango mTango) {
    // Camera-to-IMU transform
    TangoCoordinateFramePair framePair = new TangoCoordinateFramePair();
    framePair.baseFrame = TangoPoseData.COORDINATE_FRAME_IMU;
    framePair.targetFrame = TangoPoseData.COORDINATE_FRAME_CAMERA_COLOR;
    TangoPoseData imu_T_rgb = mTango.getPoseAtTime(0.0, framePair);
    // IMU-to-device transform
    framePair.targetFrame = TangoPoseData.COORDINATE_FRAME_DEVICE;
    TangoPoseData imu_T_device = mTango.getPoseAtTime(0.0, framePair);
    // IMU-to-depth-camera transform
    framePair.targetFrame = TangoPoseData.COORDINATE_FRAME_CAMERA_DEPTH;
    TangoPoseData imu_T_depth = mTango.getPoseAtTime(0.0, framePair);
    return new DeviceExtrinsics(imu_T_device, imu_T_rgb, imu_T_depth);
}
Then when you get the point cloud you have to "normalize" it. Using your extrinsics, this is pretty simple:
public ArrayList<Vector3> normalize(TangoXyzIjData cloud, TangoPoseData cameraPose, DeviceExtrinsics extrinsics) {
ArrayList<Vector3> normalizedCloud = new ArrayList<>();
TangoPoseData camera_T_imu = ScenePoseCalculator.matrixToTangoPose(extrinsics.getDeviceTDepthCamera());
while (cloud.xyz.hasRemaining()) {
Vector3 rotatedV = ScenePoseCalculator.getPointInEngineFrame(
new Vector3(cloud.xyz.get(),cloud.xyz.get(),cloud.xyz.get()),
camera_T_imu,
cameraPose
);
normalizedCloud.add(rotatedV);
}
return normalizedCloud;
}
This should be enough: now you have a point cloud wrt your base frame of reference.
If you superimpose two or more of these "normalized" clouds you can get a 3D representation of your room.
There is another way to do this with rotation matrix, explained here.
My solution is pretty slow (it takes the dev kit around 700 ms to normalize a cloud of ~3000 points), so it is not suitable for a real-time 3D reconstruction application.
At the moment I'm trying to use the Tango 3D Reconstruction Library in C using the NDK and JNI. The library is well documented, but it is very painful to set up your environment and start using JNI (I'm stuck at the moment, in fact).
Drifting
There still is a problem when I turn around with the device. It seems that the point cloud spreads out a lot.
I guess you are experiencing some drifting.
Drifting happens when you use Motion Tracking alone: it consists of a lot of very small errors in estimating your pose that together cause a big error in your pose relative to the world. For instance, if you take your Tango device and walk in a circle while tracking your TangoPoseData, and then plot your trajectory in a spreadsheet or whatever you want, you'll notice that the tablet never returns to its starting point, because it is drifting away.
The solution to that is using Area Learning.
If you have no clear idea about this topic, I suggest watching this talk from Google I/O 2016. It covers lots of points and gives you a nice introduction.
Using area learning is quite simple.
You just have to change your base frame of reference to TangoPoseData.COORDINATE_FRAME_AREA_DESCRIPTION. In this way you tell your Tango to estimate its pose not wrt where it was when you launched the app, but wrt some fixed point in the area.
Here's my code:
private static final ArrayList<TangoCoordinateFramePair> FRAME_PAIRS =
        new ArrayList<TangoCoordinateFramePair>();
static {
    FRAME_PAIRS.add(new TangoCoordinateFramePair(
            TangoPoseData.COORDINATE_FRAME_AREA_DESCRIPTION,
            TangoPoseData.COORDINATE_FRAME_DEVICE
    ));
}
Now you can use this FRAME_PAIRS as usual.
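For instance (a hypothetical snippet of mine, shown in Kotlin, assuming a connected Tango instance named tango), you can query the latest drift-corrected pose with that pair:
val pose: TangoPoseData = tango.getPoseAtTime(0.0, FRAME_PAIRS[0])
if (pose.statusCode == TangoPoseData.POSE_VALID) {
    // pose.translation / pose.rotation are now expressed wrt a fixed point in the area,
    // not wrt where the service started, so drift is corrected.
}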
Then you have to modify your TangoConfig to tell Tango to use Area Learning, using the key TangoConfig.KEY_BOOLEAN_DRIFT_CORRECTION. Remember that when using TangoConfig.KEY_BOOLEAN_DRIFT_CORRECTION you CAN'T use learning mode or load an ADF (area description file).
So you can't use:
TangoConfig.KEY_BOOLEAN_LEARNINGMODE
TangoConfig.KEY_STRING_AREADESCRIPTION
Here's how I initialize TangoConfig in my app:
TangoConfig config = tango.getConfig(TangoConfig.CONFIG_TYPE_DEFAULT);
// Turn the depth sensor on.
config.putBoolean(TangoConfig.KEY_BOOLEAN_DEPTH, true);
// Turn motion tracking on.
config.putBoolean(TangoConfig.KEY_BOOLEAN_MOTIONTRACKING, true);
// If Tango gets stuck, it tries to recover automatically.
config.putBoolean(TangoConfig.KEY_BOOLEAN_AUTORECOVERY, true);
// Tango tries to store and remember places and rooms;
// this is used to reduce drifting.
config.putBoolean(TangoConfig.KEY_BOOLEAN_DRIFT_CORRECTION, true);
// Turn the color camera on.
config.putBoolean(TangoConfig.KEY_BOOLEAN_COLORCAMERA, true);
Using this technique you'll get rid of that spreading.
PS
In the talk I linked above, at around 22:35, they show how to port your application to Area Learning. In their example they use TangoConfig.KEY_BOOLEAN_ENABLE_DRIFT_CORRECTION. That key no longer exists (at least in the Java API); use TangoConfig.KEY_BOOLEAN_DRIFT_CORRECTION instead.
I have some code that allows me to detect faces in a live camera preview and draw a few GIFs over their landmarks using the play-services-vision library provided by Google.
It works well enough when the face is static, but when the face moves at a moderate speed, the face detector takes longer than one camera frame to detect the landmarks at the face's new position. I know it might have something to do with the bitmap draw speed, but I have taken steps to minimize the lag there.
(Basically I get complaints that the GIFs' repositioning isn't 'smooth enough')
EDIT: I did try getting the coordinate detection code...
List<Landmark> landmarksList = face.getLandmarks();
for(int i = 0; i < landmarksList.size(); i++)
{
Landmark current = landmarksList.get(i);
//canvas.drawCircle(translateX(current.getPosition().x), translateY(current.getPosition().y), FACE_POSITION_RADIUS, mFacePositionPaint);
//canvas.drawCircle(current.getPosition().x, current.getPosition().y, FACE_POSITION_RADIUS, mFacePositionPaint);
if(current.getType() == Landmark.LEFT_EYE)
{
//Log.i("current_landmark", "l_eye");
leftEyeX = translateX(current.getPosition().x);
leftEyeY = translateY(current.getPosition().y);
}
if(current.getType() == Landmark.RIGHT_EYE)
{
//Log.i("current_landmark", "r_eye");
rightEyeX = translateX(current.getPosition().x);
rightEyeY = translateY(current.getPosition().y);
}
if(current.getType() == Landmark.NOSE_BASE)
{
//Log.i("current_landmark", "n_base");
noseBaseY = translateY(current.getPosition().y);
noseBaseX = translateX(current.getPosition().x);
}
if(current.getType() == Landmark.BOTTOM_MOUTH) {
botMouthY = translateY(current.getPosition().y);
botMouthX = translateX(current.getPosition().x);
//Log.i("current_landmark", "b_mouth "+translateX(current.getPosition().x)+" "+translateY(current.getPosition().y));
}
if(current.getType() == Landmark.LEFT_MOUTH) {
leftMouthY = translateY(current.getPosition().y);
leftMouthX = translateX(current.getPosition().x);
//Log.i("current_landmark", "l_mouth "+translateX(current.getPosition().x)+" "+translateY(current.getPosition().y));
}
if(current.getType() == Landmark.RIGHT_MOUTH) {
rightMouthY = translateY(current.getPosition().y);
rightMouthX = translateX(current.getPosition().x);
//Log.i("current_landmark", "l_mouth "+translateX(current.getPosition().x)+" "+translateY(current.getPosition().y));
}
}
eyeDistance = (float)Math.sqrt(Math.pow((double) Math.abs(rightEyeX - leftEyeX), 2) + Math.pow(Math.abs(rightEyeY - leftEyeY), 2));
eyeCenterX = (rightEyeX + leftEyeX) / 2;
eyeCenterY = (rightEyeY + leftEyeY) / 2;
noseToMouthDist = (float)Math.sqrt(Math.pow((double)Math.abs(leftMouthX - noseBaseX), 2) + Math.pow(Math.abs(leftMouthY - noseBaseY), 2));
...in a separate thread within the View draw method, but it just nets me a SIGSEGV error.
My questions:
Is syncing the face detector's processing speed with the camera preview framerate the right thing to do in this case, or should it be the other way around, or done in some other way?
As the Face Detector finds the faces in a camera preview frame, should I drop the frames that the preview feeds before the FD finishes? If so, how can I do it?
Should I just use setClassificationMode(NO_CLASSIFICATIONS) and setTrackingEnabled(false) in a camera preview just to make the detection faster?
Does the play-services-vision library use OpenCV, and which is actually better?
EDIT 2:
I read one research paper claiming that, with OpenCV, face detection and the other functions OpenCV provides are faster on Android. I was wondering whether I can leverage that to speed up the face detection.
There is no way you can guarantee that face detection will be fast enough to show no visible delay, even when the head motion is moderate. Even if you manage to optimize the hell out of it on your development device, you will surely find another model among the thousands out there that is too slow.
Your code should be resilient to such situations. You can predict the face position a second ahead, assuming that it moves smoothly. If the users decide to twitch their head or device, no algorithm can help.
If you use the deprecated Camera API, you should pre-allocate a buffer and use setPreviewCallbackWithBuffer(). This way you can guarantee that the frames arrive at your image processor one at a time. You should also not forget to open the Camera on a background thread, so that the [onPreviewFrame()](http://developer.android.com/reference/android/hardware/Camera.PreviewCallback.html#onPreviewFrame(byte[], android.hardware.Camera)) callback, where your heavy image processing takes place, does not block the UI thread.
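A minimal sketch of that buffered-callback approach (my illustration in Kotlin, using the deprecated android.hardware.Camera API; the thread name and single buffer are assumptions):
// Open the camera on a background HandlerThread so onPreviewFrame() doesn't run on the UI thread.
val cameraThread = HandlerThread("CameraThread").apply { start() }
Handler(cameraThread.looper).post {
    @Suppress("DEPRECATION")
    val camera = Camera.open()
    val size = camera.parameters.previewSize
    val bufferSize = size.width * size.height *
            ImageFormat.getBitsPerPixel(camera.parameters.previewFormat) / 8
    camera.addCallbackBuffer(ByteArray(bufferSize)) // pre-allocate one buffer

    camera.setPreviewCallbackWithBuffer { data, cam ->
        // Heavy image processing (face detection) goes here, off the UI thread.
        // Return the buffer when done so the next frame can be delivered.
        cam.addCallbackBuffer(data)
    }
    // camera.setPreviewTexture(...) and camera.startPreview() as usual.
}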
Yes, OpenCV face detection may be faster in some cases, but more importantly it is more robust than the Google face detector.
Yes, it's better to turn the classifier off if you don't care about smiles and open eyes. The performance gain may vary.
I believe that turning tracking off will only slow the Google face detector down, but you should make your own measurements, and choose the best strategy.
The most significant gain can be achieved by turning setProminentFaceOnly() on, but again I cannot predict the actual effect of this setting for your device.
There's always going to be some lag, since any face detector takes some amount of time to run. By the time you draw the result, you will usually be drawing it over a future frame in which the face may have moved a bit.
Here are some suggestions for minimizing lag (a combined code sketch follows this list):
The CameraSource implementation provided by Google's vision library automatically handles dropping preview frames when needed so that it can keep up the best that it can. See the open source version of this code if you'd like to incorporate a similar approach into your app: https://github.com/googlesamples/android-vision/blob/master/visionSamples/barcode-reader/app/src/main/java/com/google/android/gms/samples/vision/barcodereader/ui/camera/CameraSource.java#L1144
Using a lower camera preview resolution, such as 320x240, will make face detection faster.
If you're only tracking one face, using the setProminentFaceOnly() option will make face detection faster. Using this and LargestFaceFocusingProcessor as well will make this even faster.
To use LargestFaceFocusingProcessor, set it as the processor of the face detector. For example:
Tracker<Face> tracker = *your face tracker implementation*
detector.setProcessor(
new LargestFaceFocusingProcessor.Builder(detector, tracker).build());
Your tracker implementation will receive face updates for only the largest face that it initially finds. In addition, it will signal back to the detector that it only needs to track that face for as long as it is visible.
If you don't need to detect smaller faces, setting setMinFaceSize() larger will make face detection faster. It's faster to detect only larger faces, since it doesn't need to spend time looking for smaller ones.
You can turn off classification if you don't need the eyes-open or smile indications. However, this only gives you a small speed advantage.
Using the tracking option will make this faster as well, but at some expense in accuracy. This uses a predictive algorithm for some intermediate frames, to avoid the expense of running full face detection on every frame.
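A combined sketch of these suggestions (my illustration in Kotlin; the specific values are assumptions to tune for your app, not settings recommended by the answer):
val detector = FaceDetector.Builder(context)
    .setProminentFaceOnly(true)                              // only the dominant face
    .setMinFaceSize(0.35f)                                   // skip small faces
    .setClassificationType(FaceDetector.NO_CLASSIFICATIONS)  // no smile / eyes-open
    .setLandmarkType(FaceDetector.ALL_LANDMARKS)             // landmarks are still needed for the GIFs
    .setMode(FaceDetector.FAST_MODE)
    .build()

val tracker = object : Tracker<Face>() {
    override fun onUpdate(detections: Detector.Detections<Face>, face: Face) {
        // Reposition the GIF overlays from face.landmarks here.
    }
}
detector.setProcessor(LargestFaceFocusingProcessor.Builder(detector, tracker).build())

val cameraSource = CameraSource.Builder(context, detector)
    .setRequestedPreviewSize(320, 240)                       // lower resolution, faster detection
    .setRequestedFps(30f)
    .setFacing(CameraSource.CAMERA_FACING_FRONT)
    .build()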
Is there a known API or way to SCAN the text from a card without actually manually saving (and uploading) the picture? (iOS and Android)
Then I would need to know if that API can determine the marquee within the camera that should be scanned.
I want a behaviour similar to the one of QR scanners, or Augmented Reality apps. Where the user just directs the camera and the action occurs.
I have printed cards with a redeem code as text, and including a QR code would require changing the current card production.
The text is inside a white box, which may make it easier to recognise.
On iOS, you would use CIDetector with an AVCaptureSession. It can process capture session output buffers as they come in from the camera, without having to take and save a picture, and it provides text scanning.
For text detection, using CIDetector with CIDetectorTypeText will return areas that are likely to have text in them, but you would have to perform additional processing for Optical Character Recognition.
You could also use OpenCV if you want a solution that is not out of the box.
You can try this: https://github.com/gali8/Tesseract-OCR-iOS
Usage:
// Specify the image Tesseract should recognize
tesseract.image = [[UIImage imageNamed:@"image_sample.jpg"] g8_blackAndWhite];
// Optional: limit the area of the image Tesseract should recognize to a rectangle
tesseract.rect = CGRectMake(20, 20, 100, 100);
// Optional: limit the recognition time to a few seconds
tesseract.maximumRecognitionTime = 2.0;
// Start the recognition
[tesseract recognize];