I am very new to augmented reality, with some experience in OpenGL. I want to draw something such as a triangle with OpenGL ES (Android or iOS, the platform doesn't matter). The background needs to be captured by the phone's camera. Basically, the result should be like Pokémon Go: through the camera you get the real world as the background, and a Pokémon is inserted into that real world.
So, where should I begin?
One approach:
Capture the camera image and load it into a 2D texture map.
Render a quadrilateral with this 2D texture. The quadrilateral will need to be far enough from your virtual camera so that it forms a background (you will want to enable depth testing). The quadrilateral will have to be large enough to cover the background (which depends on both the distance from the camera and your perspective field of view).
Now render your scene (triangles).
Consider the view frustum in the xz-plane once you have things in view coordinates: N and F are the distances to the near and far clipping planes, θ is the vertical field of view, and a is the aspect ratio (w/h) of your image, which should match the aspect ratio of your viewport. If Q is the distance from the camera to the quad, the height of the quad should be H = 2·Q·tan(θ/2) and its width W = a·H.
Objects you wish to appear in the foreground should be at distances between N and Q from the camera.
I am assuming you know how to set up the view matrix (via the "look-at" transform) to position your camera, and how to set the projection matrix to specify the perspective projection. I am also assuming many other things (like how to load a texture map, draw a filled quad with texture coordinates, enable depth testing, etc.). If you need more details let me know.
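As a rough illustration of the quad sizing above, here is a minimal sketch (plain Android-style Java; android.opengl.Matrix is assumed, and all numeric values are placeholders):

```java
// Sketch only: size a background quad that exactly fills the view at distance Q.
float thetaDeg = 60f;                    // vertical field of view θ (degrees)
float a = 16f / 9f;                      // aspect ratio w/h of image and viewport
float N = 0.1f, F = 100f;                // near and far clip distances
float Q = 50f;                           // distance of the quad, with N < Q < F

float H = (float) (2.0 * Q * Math.tan(Math.toRadians(thetaDeg) / 2.0)); // quad height
float W = a * H;                                                        // quad width

float[] projection = new float[16];
android.opengl.Matrix.perspectiveM(projection, 0, thetaDeg, a, N, F);
// Place the textured quad with corners (±W/2, ±H/2, -Q) in view coordinates,
// then render the foreground triangles at distances between N and Q.
```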
If you wish to embed the object within the scene, that will require some computer vision techniques (e.g. ascertaining depth via stereo image pairs). This is a non-trivial problem.
Related
Problem:
I am reading "OpenGL ES 2 for Android:A Quick-Start Guide" and got to chapter 5 where they start taking the devices' aspect ratio into consideration and I can't seem to understand the link that they make between the orthographic projection and a devices aspect ratio. In the book they mention the use of an orthographic projection that will allow one to step out of the normalized coordinate space that OpenGL uses and into a virtual coordinate space to be able to account for the aspect ratio. Then they state once the aspect ratio has been considered you have to take the virtual coordinates back into the normalized system. I was bit confused as to their use of the words "virtual coordinate system" here as well.
What I comprehend:
I understand that if you made a circle and placed it in OpenGL's normalized coordinate system, the circle would get stretched or squished depending on the orientation of the device, because the device's aspect ratio differs from the 1:1 aspect ratio of OpenGL's normalized coordinate system. What I don't understand is how using an orthographic projection helps solve this issue. I think I understand what an orthographic projection is, but in case I don't, can someone define this in simple terms?
I think I understand what an Orthographic projection is but in case I don't can someone define this in simple terms?
I think this is part of the problem. You understand the term projection in the mathematical sense - an idempotent mapping, which typically reduces the dimensionality of the data. In a typical render pipeline, however, the "projection" matrix doesn't do any projection at all. Instead, the rendering API defines some conventions for a 3D view volume. In OpenGL, the viewing volume is defined as the cube -1 <= x,y,z <= 1 in normalized device coordinates. The sides of this cube form the six clip planes. Any geometry outside of them will be clipped or culled - so these planes simply represent the edges of the screen (or rather the viewport, but imagining the screen here is more intuitive).
The task of the projection matrix (in combination with the perspective divide by w) is just to transform from 3D eye space (some Cartesian coordinate system relative to the "camera", if one wants to think in those terms) to 3D normalized device space. There is no mathematical projection happening in the normal case. This also means that the projection matrix defines the position of the 6 clipping planes in eye space. You can basically take the well-defined corners of the view volume in normalized device space, apply the inverse of the projection matrix (and do another perspective divide), and get back the eight corners of the viewing volume in eye space.
As a result, the projection matrix defines which extent of the world is mapped to the screen, and the aspect ratio of the viewing frustum must equal the aspect ratio of the viewport used for rendering if objects are to appear undistorted.
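To make the corner-recovery point concrete, here is a minimal sketch (assuming android.opengl.Matrix; the projection values are placeholders) of transforming the NDC cube corners back to eye space:

```java
// Sketch only: invert the projection matrix and transform the NDC cube corners
// back to eye space, including the extra divide by w described above.
float[] proj = new float[16];
android.opengl.Matrix.perspectiveM(proj, 0, 45f, 16f / 9f, 1f, 100f); // example values
float[] invProj = new float[16];
android.opengl.Matrix.invertM(invProj, 0, proj, 0);

float[][] ndcCorners = {
    {-1, -1, -1, 1}, {1, -1, -1, 1}, {-1, 1, -1, 1}, {1, 1, -1, 1},  // near plane
    {-1, -1,  1, 1}, {1, -1,  1, 1}, {-1, 1,  1, 1}, {1, 1,  1, 1},  // far plane
};
float[] eye = new float[4];
for (float[] corner : ndcCorners) {
    android.opengl.Matrix.multiplyMV(eye, 0, invProj, 0, corner, 0);
    float x = eye[0] / eye[3], y = eye[1] / eye[3], z = eye[2] / eye[3];
    // (x, y, z) is one corner of the viewing volume in eye space
}
```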
For an orthographic "projection", all the projection matrix does is define some cuboid in eye space (usually an axis-aligned one, so it boils down to a scale and a translation per dimension). Typically, such an ortho transformation is defined by directly specifying the viewing volume in eye space, i.e. specifying the left, right, bottom, top, near and far values. The projection matrix then simply maps x_eye = left to -1 (the left clipping plane in NDC), x_eye = right to 1 (the right clipping plane in NDC), and so on.
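For the aspect-ratio question this is exactly what helps: choose the ortho cuboid so its x/y extent matches the viewport. A minimal sketch (assuming android.opengl.Matrix; the viewport size is a placeholder), roughly in the spirit of what the book describes:

```java
// Sketch only: an ortho "projection" whose left/right/bottom/top match the
// viewport aspect ratio, so a circle defined in these "virtual" coordinates
// is not squished when mapped to the [-1, 1] NDC cube.
int width = 1080, height = 1920;                 // viewport size (placeholder values)
float aspect = width > height
        ? (float) width / (float) height
        : (float) height / (float) width;

float[] ortho = new float[16];
if (width > height) {
    // landscape: x spans [-aspect, aspect], y spans [-1, 1]
    android.opengl.Matrix.orthoM(ortho, 0, -aspect, aspect, -1f, 1f, -1f, 1f);
} else {
    // portrait: x spans [-1, 1], y spans [-aspect, aspect]
    android.opengl.Matrix.orthoM(ortho, 0, -1f, 1f, -aspect, aspect, -1f, 1f);
}
```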
In the case of a perspective "projection", the viewing volume is a pyramid frustum in eye space. The math for that is a bit more complicated, as we have to play with the homogeneous w component, and I don't want to go into details here, but the key point is that there still is no projection. The perspective transformation maps the pyramid-shaped view frustum to a cube in NDC, and it transforms everything inside this volume with it - a cube inside the view frustum will actually be deformed into a somewhat "inverted" pyramid frustum, where the farther-away parts get smaller in NDC.
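For reference only, one common form of such a matrix (the classic OpenGL / gluPerspective convention, with vertical field of view $\theta$, aspect ratio $a$, and near/far distances $N$ and $F$) is

$$ P = \begin{pmatrix} \frac{f}{a} & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & \frac{N+F}{N-F} & \frac{2NF}{N-F} \\ 0 & 0 & -1 & 0 \end{pmatrix}, \qquad f = \cot\!\left(\frac{\theta}{2}\right). $$

The last row is the "playing with w": it copies $-z_{eye}$ into $w_{clip}$, so the subsequent divide by $w$ scales $x$ and $y$ by $1/(-z_{eye})$, which is exactly what makes farther-away geometry smaller in NDC.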
The only place where a real projection happens is during rasterization, when only the x and y coordinates are considered - and this is always an orthographic projection along z, and it is not done by any projection matrix.
I have imported a model (e.g. a teapot) using Rajawali into my scene.
What I would like is to label parts of the model (e.g. the lid, body, foot, handle and the spout) using plain Android views, but I have no idea how this could be achieved. Specifically, positioning the labels in the right place seems challenging. The idea is that when I transform my model's position in the scene, the tips of the labels are still correctly positioned.
The Rajawali tutorial shows how Android views can be placed on top of the scene here: https://github.com/Rajawali/Rajawali/wiki/Tutorial-08-Adding-User-Interface-Elements. I also understand how, using the transformation matrices, a 3D coordinate on the model can be transformed into a 2D coordinate on the screen, but I have no idea how to determine the exact 3D coordinates on the model itself. The model is exported to OBJ format using Blender, so I assume there is some clever way of determining the coordinates in Blender and exporting them to a separate file, or including them somehow in the OBJ file (but not rendering those points, only including them as metadata), but I have no idea how I could do that.
Any ideas are very appreciated! :)
I would use a screenquad, not a view. This is a general GL solution, and will also work with iOS.
You must determine the indices of the desired model vertices. Using the text rendering steps below, you can just fiddle with them until you hit the right ones.
1. Create a reasonable ARGB bitmap with the same aspect ratio as the screen.
2. Create the screenquad texture using this bitmap.
3. Create a canvas using this bitmap.
4. The rest happens in onDrawFrame(): clear the canvas using clear paint.
5. Use the MVP matrix to convert the desired model vertices to canvas coordinates (a conversion sketch follows below).
6. Draw your desired text at the canvas coordinates.
7. Update the texture.
Your text will render very precisely at the vertices you specified. The GL thread will double-buffer and loop you back to #4. Super smooth 3D text animation!
Use double-precision floating point math for the coordinate conversion; loss of precision results in wobbly text. You could even use the z value of the vertex to scale the text. Fancy!
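A minimal sketch of the conversion in step 5 (all names here - mvp, canvas, textPaint, the vertex coordinates - are placeholders for what you already have in onDrawFrame(); the MVP matrix is column-major, as OpenGL and Rajawali use):

```java
// Sketch only: project one model-space vertex with the MVP matrix and map the
// NDC result to canvas pixel coordinates, using doubles to avoid wobble.
void drawLabel(double[] mvp, double vx, double vy, double vz,
               android.graphics.Canvas canvas, android.graphics.Paint textPaint,
               String label) {
    double[] v = {vx, vy, vz, 1.0};
    double[] clip = new double[4];
    for (int row = 0; row < 4; row++) {                       // clip = MVP * v
        clip[row] = mvp[row]     * v[0] + mvp[4 + row]  * v[1]
                  + mvp[8 + row] * v[2] + mvp[12 + row] * v[3];
    }
    if (clip[3] <= 0) return;                                 // vertex behind the camera
    double ndcX = clip[0] / clip[3];                          // perspective divide
    double ndcY = clip[1] / clip[3];
    float x = (float) ((ndcX * 0.5 + 0.5) * canvas.getWidth());
    float y = (float) ((1.0 - (ndcY * 0.5 + 0.5)) * canvas.getHeight()); // canvas y grows downwards
    canvas.drawText(label, x, y, textPaint);
}
```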
The performance bottleneck is #7, since the entire bitmap must be copied to GL texture memory every frame. Try to keep the bitmap as small as possible while maintaining the aspect ratio. Maybe let the user toggle the labels.
Note that the copy to GL texture memory is redundant, since in OpenGL ES, GL memory is just regular memory. For compatibility reasons, a separate chunk of regular memory is reserved, which artificially enforces the copy.
In OpenCV I use the camera to capture a scene containing two squares, a and b, both at the same distance from the camera, whose known real sizes are, say, 10 cm and 30 cm respectively. I find the pixel width of each square, which let's say is 25 and 40 pixels respectively (to get the pixel width, OpenCV detects the squares as cv::Rect objects and I read their width field).
Now I remove square a from the scene and change the distance from the camera to square b. The program now gets the width of square b, which let's say is 80 pixels. Is there an equation, using the configuration of the camera (resolution, dpi?), that I can use to work out what the corresponding pixel width of square a would be if it were placed back in the scene at the same distance as square b?
The math you need for your problem can be found in chapter 9 of "Multiple View Geometry in Computer Vision", which happens to be freely available online: https://www.robots.ox.ac.uk/~vgg/hzbook/hzbook2/HZepipolar.pdf.
The short answer to your problem is:
No, not in this exact form. Given that you are working in a 3D world, you have one degree of freedom left. As a result, you need more information in order to eliminate this degree of freedom (e.g. by knowing the depth and/or the relation of the two squares with respect to each other, the movement of the camera, ...). This mainly depends on your specific situation. Anyhow, reading and understanding chapter 9 of the book should help you out here.
PS: to me it seems like your problem fits into the broader category of "baseline matching" problems. Reading around about this, in addition to epipolar geometry and the fundamental matrix, might help you out.
Since you write of "squares" with just a "width" in the image (as opposed to "trapezoids" with some wonky vertex coordinates) I assume that you are considering an ideal pinhole camera and ignoring any perspective distortion/foreshortening - i.e. there is no lens distortion and your planar objects are exactly parallel to the image/sensor plane.
Then it is a very simple 2D projective geometry problem, and no separate knowledge of the camera geometry is needed. Just write down the projection equations in the first situation: you have 4 unknowns (the camera focal length, the common depth of the squares, and the horizontal positions of their left sides, say) and 4 equations (the projections of the left and right sides of each square). Solve the system and keep the focal length and the relative distance between the squares. Do the same in the second image, but now with a known focal length, and compute the new depth and horizontal location of square b. Then add the previously computed relative distance to find where square a would be.
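As a quick check on the width alone, under the same ideal-pinhole, fronto-parallel assumption (and ignoring the horizontal positions handled above): the projected width of an object of real width $W$ at depth $Z$ is

$$ w_{px} = f\,\frac{W}{Z}, $$

so two coplanar objects keep the ratio of their real widths, $w_a / w_b = W_a / W_b$, independent of $f$ and $Z$. With the question's sizes, square a placed back at square b's new distance would measure roughly $80 \cdot \tfrac{10}{30} \approx 26.7$ pixels.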
In order to understand the transformations the camera performs to project the 3D world into the 2D image, you need to know its calibration parameters. These are divided into two sets:
Intrinsic parameters: fixed parameters that are specific to each camera. They are normally represented by a matrix called K.
Extrinsic parameters: these depend on the camera's position in the 3D world. They are normally represented by two matrices, R and T, where the first represents the rotation and the second the translation.
In order to calibrate a camera you need some pattern (basically a set of 3D points whose coordinates are known). There are several examples of this in the OpenCV library, which provides support for performing the camera calibration:
http://docs.opencv.org/doc/tutorials/calib3d/camera_calibration/camera_calibration.html
Once you have your camera calibrated you can transform from 3D to 2D easily with the following equation:
P_image = K · [R|T] · P_3D (up to a scale factor: divide by the third coordinate to get pixel coordinates)
So it does not only depend on the position of the camera; it depends on all the calibration parameters. The following presentation goes through the camera calibration details and the different steps and equations used during the 3D <-> image transformations.
https://www.cs.umd.edu/class/fall2013/cmsc426/lectures/camera-calibration.pdf
With this in mind you can project any 3D point into the image and get its coordinates there. The reverse transformation is not unique, since going back from 2D to 3D gives you a line (a ray through the camera centre) instead of a unique point.
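A minimal sketch of the forward projection described above (plain Java, no OpenCV calls; K, R and t are assumed to come from your calibration, and lens distortion is ignored):

```java
// Sketch only: p_image ~ K * (R * P_3D + t), then divide by the depth component.
static double[] projectPoint(double[][] K, double[][] R, double[] t, double[] P) {
    double[] pc = new double[3];                 // point in camera coordinates
    for (int i = 0; i < 3; i++) {
        pc[i] = R[i][0] * P[0] + R[i][1] * P[1] + R[i][2] * P[2] + t[i];
    }
    double[] p = new double[3];                  // homogeneous pixel coordinates
    for (int i = 0; i < 3; i++) {
        p[i] = K[i][0] * pc[0] + K[i][1] * pc[1] + K[i][2] * pc[2];
    }
    return new double[] { p[0] / p[2], p[1] / p[2] };   // (u, v) in pixels
}
```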
In my OpenGL application I have created a SkyBox which works great if I render it from the origin. I render it with the PV matrix (ProjectionMatrix * ViewMatrix [camera]) and, as I said, if I render it from the origin all works great. But if I move the camera to, let's say, (0, 6, -8) it does not work and the SkyBox is rendered as a normal cube.
I thought it would be enough to create a ModelMatrix for the SkyBox and set its position to the position of the camera, but this does not help. Of course I render the SkyBox with the new MVPMatrix now. Do you have any idea why this does not work and what I can do to get it working?
In general, translating the skybox to the camera position should work. Have you checked whether you are moving it to the correct position?
One of the most common mistakes is moving the object with the translation stored in the view matrix instead of the camera's actual position, which results in the object being moved in the wrong direction, ending up twice as far away. To explain this, let's have a look at the different coordinate systems:
Each model is defined in its own model space. Let's call this space M. Using the model matrix (m), we are able to transform from model space to world space (W).
M ---m---> W
Now we have a second object in our scene, the camera, with its view space V and the camera's model matrix c. Again, we can transform
V ---c---> W
But since we need everything in view space, rather than transforming the camera into world space, we have to invert this transformation, so that
W ---v---> V
In general this is given by v = c^-1, which is the view matrix one specifies in the application. Knowing this, it should be quite clear why you have to move your object to z = -8 when your view matrix contains a translation to z = 8 (since T(8) = T(-8)^-1). For more details have a look at this presentation (starting from slide 6).
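A minimal sketch of "translate the SkyBox to the camera" (assuming android.opengl.Matrix; the look-at target, field of view and clip distances are placeholders, and the camera position is the (0, 6, -8) from the question):

```java
// Sketch only: keep the SkyBox centred on the camera so only its rotation matters.
float cx = 0f, cy = 6f, cz = -8f;                              // camera position from the question

float[] viewMatrix = new float[16];
android.opengl.Matrix.setLookAtM(viewMatrix, 0, cx, cy, cz,    // eye
                                 0f, 0f, 0f,                   // look-at target (placeholder)
                                 0f, 1f, 0f);                  // up vector

float[] model = new float[16];
android.opengl.Matrix.setIdentityM(model, 0);
android.opengl.Matrix.translateM(model, 0, cx, cy, cz);        // move the SkyBox to the camera

float[] projectionMatrix = new float[16];
android.opengl.Matrix.perspectiveM(projectionMatrix, 0, 60f, 16f / 9f, 0.1f, 100f); // placeholder

float[] viewModel = new float[16];
float[] mvp = new float[16];
android.opengl.Matrix.multiplyMM(viewModel, 0, viewMatrix, 0, model, 0);         // V * M
android.opengl.Matrix.multiplyMM(mvp, 0, projectionMatrix, 0, viewModel, 0);     // P * V * M
// An equivalent trick: zero out the translation part of the view matrix for the
// skybox pass, so the box never moves relative to the camera.
```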
I know there are some posts about this topic, but I could not find my answer.
I want to calibrate my Android camera without a chessboard for 3D reconstruction, so I need its intrinsic and extrinsic parameters.
My first goal is to extract the real-world 3D coordinate system, to be able to put some 3D model on the screen.
My steps:
From a picture of a building I extract 4 points that represent the real 3D system
/!\ this step requires camera calibration /!\
Convert them to 3D points (solvePnP for example)
Then from my 3D axes I create an OpenGL projection and modelview matrix
My main problem is that I want to avoid a calibration step, so how can I calibrate without a chessboard? I have some data from Android, such as the focal length, and I can guess that the projection center is the center of my camera picture.
Any ideas or advice? Or another way to do it?
Here is the "nochess" calibration of qtcalib:
This scheme is recommended when you need to obtain a camera calibration from an image that doesn't contain a calibration chessboard. In this case, you can approximate the camera calibration if you know 4 points in the image forming a flat rectangle in the real world. It is important to note that the approximated calibration depends on the 4 selected points and on the values you set for the dimensions of the rectangle.
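Alternatively, following the idea in the question (known focal length, principal point at the image centre), you can build an approximate K directly and feed it to solvePnP with your 4 rectangle corners. A minimal sketch using OpenCV's Java bindings; all numbers are placeholders, and the focal length and sensor size would come from CameraCharacteristics (e.g. LENS_INFO_AVAILABLE_FOCAL_LENGTHS and SENSOR_INFO_PHYSICAL_SIZE) or the old Camera.Parameters.getFocalLength():

```java
// Sketch only: approximate intrinsic matrix from the device's reported focal
// length, with the principal point assumed at the image centre (as in the
// question). All numbers below are placeholders.
double focalMm       = 4.2;     // focal length in mm reported by the device
double sensorWidthMm = 5.6;     // physical sensor width in mm
int imageW = 1920, imageH = 1080;

double fPx = focalMm * imageW / sensorWidthMm;   // focal length in pixels
org.opencv.core.Mat K = org.opencv.core.Mat.eye(3, 3, org.opencv.core.CvType.CV_64F);
K.put(0, 0, fPx);
K.put(1, 1, fPx);
K.put(0, 2, imageW / 2.0);      // cx: assumed principal point
K.put(1, 2, imageH / 2.0);      // cy
// This K (with distortion assumed zero) can then be passed to
// Calib3d.solvePnP() together with the 4 rectangle corners to get an
// approximate rvec/tvec, as in step 2 of the question.
```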