Using OpenGL ES 2.0 on a Galaxy S4 phone, I have a 1024x1024 RGBA8888 render target where some textures are rendered each frame. I need to calculate how many red RGBA(1, 0, 0, 1) pixels were rendered to the render target (twice a second).
The main problem is that reading the texture back from the GPU is very expensive (~300-400 ms), and freezes are not acceptable in my application.
I know about the OES_shader_image_atomic extension for atomic counters (simply incrementing a value whenever the fragment shader runs), but it's available only in OpenGL ES 3.1 (and later), and I have to stick to ES 2.0.
Is there any common solution I missed?
What you can try is to "reduce" the texture in question to a significantly smaller one and read that one back to the CPU, which should be much cheaper. For example, you can split your texture into N-by-N squares (where N is preferably a power of two), then render a full-screen quad into a (1024/N) x (1024/N) texture with a fragment shader that counts the red pixels in the corresponding square:
uniform sampler2D texture;
const float N = 16.0; // block size; the output texture is (1024/N) x (1024/N)

void main(void) {
    // Top-left texel of the N x N block this fragment is responsible for.
    vec2 offset = floor(gl_FragCoord.xy) * N;
    float cnt = 0.0;
    for (float x = 0.0; x < N; x += 1.0) {
        for (float y = 0.0; y < N; y += 1.0) {
            // +0.5 samples the texel center; exact comparison works for
            // unfiltered 8-bit texels.
            if (texture2D(texture, (offset + vec2(x, y) + 0.5) / 1024.0) == vec4(1.0, 0.0, 0.0, 1.0)) {
                cnt += 1.0;
            }
        }
    }
    // Pack the count into two 8-bit channels: low byte in red, high byte in green.
    gl_FragColor = vec4(mod(cnt, 256.0) / 255.0, floor(cnt / 256.0) / 255.0, 0.0, 1.0);
}
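For completeness, driving this reduction pass might look roughly like the following Android Java sketch. The GLES20 calls are the real API; fboIds (two FBOs, each backed by one small reduction texture), frame (a per-frame counter), reductionProgram, renderTargetTexture, textureLocation, and drawFullScreenQuad are all assumed names, not part of the answer itself:

final int N = 16; // must match N in the shader above
// Render the reduction pass into the small FBO for this frame.
GLES20.glBindFramebuffer(GLES20.GL_FRAMEBUFFER, fboIds[frame % 2]);
GLES20.glViewport(0, 0, 1024 / N, 1024 / N);
GLES20.glUseProgram(reductionProgram); // program built from the shader above
GLES20.glActiveTexture(GLES20.GL_TEXTURE0);
GLES20.glBindTexture(GLES20.GL_TEXTURE_2D, renderTargetTexture); // the 1024x1024 target
GLES20.glUniform1i(textureLocation, 0);
drawFullScreenQuad(); // hypothetical helper that issues the full-screen quad draw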
Also remember that glReadPixels synchronously waits until the GPU is done with all previously issued draws to the texture. So it may be beneficial to have two such textures: each frame one is rendered to while the other (rendered the previous frame) is read back, and the next frame you swap them. That delays obtaining the desired data by a frame, but should eliminate most of the stalls.
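The read-back half of that two-texture scheme could then look something like this untested sketch (assuming android.opengl.GLES20 and java.nio.ByteBuffer/ByteOrder imports, the fboIds pair from above, and the two-byte count packing used by the shader):

long countRedPixels(int[] fboIds, int frame, int N) {
    int side = 1024 / N; // the reduction texture is side x side
    ByteBuffer pixels = ByteBuffer.allocateDirect(side * side * 4)
            .order(ByteOrder.nativeOrder());
    // Read the FBO that was rendered to on the previous frame, not the current one.
    GLES20.glBindFramebuffer(GLES20.GL_FRAMEBUFFER, fboIds[(frame + 1) % 2]);
    GLES20.glReadPixels(0, 0, side, side,
            GLES20.GL_RGBA, GLES20.GL_UNSIGNED_BYTE, pixels);
    long total = 0;
    for (int i = 0; i < side * side; i++) {
        int lo = pixels.get(4 * i) & 0xFF;     // red   = count % 256
        int hi = pixels.get(4 * i + 1) & 0xFF; // green = count / 256
        total += lo + (hi << 8);
    }
    return total;
}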
I'm currently facing a problem I simply don't understand.
I employ ARCore for an inside-out tracking task. Since I need to do some additional image processing, I use Unity's capability to load a native C++ plugin. At the very end of each frame I pass the image in YUV_420_888 format as a raw byte array to my native plugin.
A texture handle is created right at the beginning of the component's initialization:
private void CreateTextureAndPassToPlugin()
{
    Texture2D tex = new Texture2D(640, 480, TextureFormat.RGBA32, false);
    tex.filterMode = FilterMode.Point;
    tex.Apply();
    debug_screen_.GetComponent<Renderer>().material.mainTexture = tex;
    // Pass texture pointer to the plugin
    SetTextureFromUnity(tex.GetNativeTexturePtr(), tex.width, tex.height);
}
Since I only need the grayscale image, I basically ignore the UV part and use only the Y values, as shown in the following:
uchar *p_out;
int channels = 4;
for (int r = 0; r < image_matrix->rows; r++) {
    p_out = image_matrix->ptr<uchar>(r);
    for (int c = 0; c < image_matrix->cols * channels; c++) {
        unsigned int idx = r * y_row_stride + c;
        p_out[c] = static_cast<uchar>(image_data[idx]);
        p_out[c + 1] = static_cast<uchar>(image_data[idx]);
        p_out[c + 2] = static_cast<uchar>(image_data[idx]);
        p_out[c + 3] = static_cast<uchar>(255);
    }
}
Then, each frame, the image data is uploaded into a GL texture:
GLuint gltex = (GLuint)(size_t)(g_TextureHandle);
glBindTexture(GL_TEXTURE_2D, gltex);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 640, 480, GL_RGBA, GL_UNSIGNED_BYTE, current_image.data);
I know that I use way too much memory by creating and passing the texture as RGBA, but since GL_R8 is not supported by OpenGL ES 3 and GL_ALPHA always led to internal OpenGL errors, I just pass the greyscale value to each color component.
However, in the end the texture is rendered as can be seen in the following image:
At first I thought that the reason for this might lie in the other channels having the same values; however, setting all channels other than the first one to any value does not have any impact.
Am I missing something regarding OpenGL texture creation?
YUV_420_888 is a multiplane format, where the luminance plane contains only a single channel per pixel.
for (int c = 0; c < image_matrix->cols * channels; c++) {
unsigned int idx = r * y_row_stride + c;
Your loop bounds assume c counts in multiples of 4 channels, which is right for the output surface, but you then also use it when computing the input surface index. The input plane you are using contains only one channel per pixel, so idx is wrong. (For example, the second pixel of a row is written to p_out[4..7], but its luma byte sits at offset 1 within the input row, not offset 4.)
In general you are also overwriting the same memory multiple times: the loop increments c by one each iteration, but you then write to c, c+1, c+2, and c+3, so you overwrite three of the values you wrote last time.
Shorter answer - your OpenGL ES code is fine, but I think you're filling the texture with bad data.
Untested, but I think you need:
for (int c = 0; c < image_matrix->cols * channels; c += channels) {
unsigned int idx = (r * y_row_stride) + (c / channels);
I'm adding black (0) padding around the region of interest (center) of an NV21 frame obtained from Android CameraPreview callbacks, in a thread.
To avoid the overhead of converting to RGB/Bitmap and back, I'm trying to manipulate the NV21 byte array directly, but this involves nested loops, which also makes the preview/processing slow.
This is my run() method, which sends frames to the detector after calling the method blackNonROI:
public void run() {
    Frame outputFrame;
    ByteBuffer data;
    while (true) {
        synchronized (mLock) {
            while (mActive && (mPendingFrameData == null)) {
                try { mLock.wait(); } catch (InterruptedException e) { return; }
            }
            if (!mActive) { return; }
            // Region of Interest
            mPendingFrameData = blackNonROI(mPendingFrameData.array(), mPreviewSize.getWidth(), mPreviewSize.getHeight(), 300, 300);
            outputFrame = new Frame.Builder().setImageData(mPendingFrameData, mPreviewSize.getWidth(), mPreviewSize.getHeight(), ImageFormat.NV21).setId(mPendingFrameId).setTimestampMillis(mPendingTimeMillis).setRotation(mRotation).build();
            data = mPendingFrameData;
            mPendingFrameData = null;
        }
        try {
            mDetector.receiveFrame(outputFrame);
        } catch (Throwable t) {
        } finally {
            mCamera.addCallbackBuffer(data.array());
        }
    }
}
Following is the method blackNonROI:
private ByteBuffer blackNonROI(byte[] yuvData, int width, int height, int roiWidth, int roiHeight) {
    int hozMargin = (width - roiWidth) / 2;
    int verMargin = (height - roiHeight) / 2;
    // top/bottom of center
    for (int x = 0; x < width; x++) {
        for (int y = 0; y < verMargin; y++)
            yuvData[y * width + x] = 0;
        for (int y = height - verMargin; y < height; y++)
            yuvData[y * width + x] = 0;
    }
    // left/right of center
    for (int y = verMargin; y < height - verMargin; y++) {
        for (int x = 0; x < hozMargin; x++)
            yuvData[y * width + x] = 0;
        for (int x = width - hozMargin; x < width; x++)
            yuvData[y * width + x] = 0;
    }
    return ByteBuffer.wrap(yuvData);
}
Example output frame
Note that I'm not cropping the image, just padding black pixels around the specified center of the image to maintain coordinates for further activities. This works like it should, but it's not fast enough and causes lag in the preview and in frame processing.
1. Can I further improve the byte array update?
2. Are the time and place for calling blackNonROI fine?
3. Is there any other way/library to do this more efficiently?
4. My simple pixel iteration is so slow; how do YUV/Bitmap libraries do complex things so fast? Do they use the GPU?
Edit:
I've replaced both for loops with the following code, and it's pretty fast now (please refer to greeble31's answer for details):
int from, to;
// full top padding
from = 0;
to = verMargin * width;
Arrays.fill(yuvData, from, to, (byte) 0);
// full bottom padding
from = (height - verMargin) * width;
to = height * width;
Arrays.fill(yuvData, from, to, (byte) 0);
for (int y = verMargin; y < height - verMargin; y++) {
    // left-middle padding
    from = y * width;
    to = y * width + hozMargin;
    Arrays.fill(yuvData, from, to, (byte) 0);
    // right-middle padding
    from = y * width + width - hozMargin;
    to = y * width + width;
    Arrays.fill(yuvData, from, to, (byte) 0);
}
1. Yes. To understand why, let's take a look at the bytecode Android Studio produces for your "left/right of center" nested loop:
(Annotated excerpt from a release build of blackNonROI, AS 3.2.1):
:goto_27
sub-int v2, p2, p4 ;for(int y=verMargin; y<height-verMargin; y++)
if-ge v1, v2, :cond_45
const/4 v2, 0x0
:goto_2c
if-ge v2, p3, :cond_36 ;for (int x = 0; x < hozMargin; x++)
mul-int v3, v1, p1
add-int/2addr v3, v2
.line 759
aput-byte v0, p0, v3
add-int/lit8 v2, v2, 0x1
goto :goto_2c
:cond_36
sub-int v2, p1, p3
:goto_38
if-ge v2, p1, :cond_42 ;for (int x = width-hozMargin; x < width; x++)
mul-int v3, v1, p1
add-int/2addr v3, v2
.line 761
aput-byte v0, p0, v3
add-int/lit8 v2, v2, 0x1
goto :goto_38
:cond_42
add-int/lit8 v1, v1, 0x1
goto :goto_27
.line 764
:cond_45 ;all done with the for loops!
Without bothering to decipher this whole thing line by line, it is clear that each iteration of your small inner loops performs:
1 comparison
1 integer multiplication
1 addition
1 store
1 goto
That's a lot, when you consider that all you really need this inner loop to do is set a certain number of successive array elements to 0.
Moreover, some of these bytecodes require multiple machine instructions to implement, so I wouldn't be surprised if you're looking at over 20 cycles, just to do a single iteration of one of the inner loops. (I haven't tested what this code looks like once it's compiled by the Dalvik VM, but I sincerely doubt it is smart enough to optimize the multiplications out of these loops.)
POSSIBLE FIXES
You could improve performance by eliminating some redundant calculations. For example, each inner loop recalculates y * width on every iteration. Instead, you could compute that offset once, store it in a local variable (in the outer loop), and use it when calculating the indices.
When performance is absolutely critical, I will sometimes do this sort of buffer manipulation in native code. If you can be reasonably certain that mPendingFrameData is a DirectByteBuffer, this is an even more attractive option. The disadvantages are 1.) higher complexity, and 2.) less of a "safety net" if something goes wrong/crashes.
MOST APPROPRIATE FIX
In your case, the most appropriate solution is probably just to use Arrays.fill(), which is more likely to be implemented in an optimized way.
Note that the top and bottom blocks are big, contiguous chunks of memory, and can be handled by one Arrays.fill() each:
Arrays.fill(yuvData, 0, verMargin * width, (byte) 0);                               // top
Arrays.fill(yuvData, width * height - verMargin * width, width * height, (byte) 0); // bottom
And then the sides could be handled something like this:
for (int y = verMargin; y < height - verMargin; y++) {
    int offset = y * width;
    Arrays.fill(yuvData, offset, offset + hozMargin, (byte) 0);                 // left
    Arrays.fill(yuvData, offset + width - hozMargin, offset + width, (byte) 0); // right
}
There are more opportunities for optimization here, but we're already at the point of diminishing returns. For example, since the end of each row is adjacent to the start of the next one (in memory), you could actually combine two smaller fill() calls into a larger one that covers both the right side of row N and the left side of row N + 1, as sketched below. And so forth.
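For instance, the combined version might look like this (untested sketch, reusing the same variables as above):

// The right margin of row y and the left margin of row y + 1 are adjacent
// in memory, so a single fill covers both.
for (int y = verMargin; y < height - verMargin - 1; y++) {
    int offset = y * width;
    Arrays.fill(yuvData, offset + width - hozMargin, offset + width + hozMargin, (byte) 0);
}
// The very first left margin and the very last right margin still need their own fills.
Arrays.fill(yuvData, verMargin * width, verMargin * width + hozMargin, (byte) 0);
int lastRow = (height - verMargin - 1) * width;
Arrays.fill(yuvData, lastRow + width - hozMargin, lastRow + width, (byte) 0);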
2. Not sure. If your preview is displaying without any corruption/tearing, then it's probably a safe place to call the function from (from a thread-safety standpoint), and is therefore probably as good a place as any.
3 and 4. There could be libraries for doing this task; I don't know of any offhand for Java-based NV21 frames. You'd have to do some format conversions, and I don't think it'd be worth it. Using a GPU to do this work is excessive over-optimization, in my opinion, but it may be appropriate for some specialized applications. I'd consider going to JNI (native code) before I'd ever consider using the GPU.
I think your choice to do the manipulation directly on the NV21 data, instead of converting to a bitmap, is a good one (considering your needs and the fact that the task is simple enough not to need a graphics library).
Obviously, the most efficient way to pass an image for detection would be to pass the ROI rectangle to the detector; all our image processing functions accept a bounding box as a parameter.
If the black margin is used for display, consider using a black overlay mask in the preview layout instead of pixel manipulation.
If pixel manipulation is inevitable, check whether you can limit it to the Y plane (OK, you already do this!).
If your detector works on a downscaled image (as my face recognition engine does), it may be wise to apply the blackout to the resized frame.
At any rate, keep your loops clean and tidy, and remove all recurring calculations. Using Arrays.fill() may help significantly, but not dramatically.
The main texture of my surface shader is a Google Maps image tile, similar to this:
I want to replace pixels that are close to a specified color with pixels from a separate texture. What I have working now is the following:
Shader "MyShader"
{
Properties
{
_MainTex("Base (RGB) Trans (A)", 2D) = "white" {}
_GrassTexture("Grass Texture", 2D) = "white" {}
_RoadTexture("Road Texture", 2D) = "white" {}
_WaterTexture("Water Texture", 2D) = "white" {}
}
SubShader
{
Tags{ "Queue" = "Transparent-1" "IgnoreProjector" = "True" "ForceNoShadowCasting" = "True" "RenderType" = "Opaque" }
LOD 200
CGPROGRAM
#pragma surface surf Lambert alpha approxview halfasview noforwardadd nometa
uniform sampler2D _MainTex;
uniform sampler2D _GrassTexture;
uniform sampler2D _RoadTexture;
uniform sampler2D _WaterTexture;
struct Input
{
float2 uv_MainTex;
};
void surf(Input IN, inout SurfaceOutput o)
{
fixed4 ct = tex2D(_MainTex, IN.uv_MainTex);
// if the red (or blue) channel of the pixel is within a
// specific range, get either a 1 or a 0 (true/false).
int grassCond = int(ct.r >= 0.45) * int(0.46 >= ct.r);
int waterCond = int(ct.r >= 0.14) * int(0.15 >= ct.r);
int roadCond = int(ct.b >= 0.23) * int(0.24 >= ct.b);
// if none of the above conditions is a 1, then we want to keep our
// current pixel's color:
half defaultCond = 1 - grassCond - waterCond - roadCond;
// get the pixel from each texture, multiple by their check condition
// to get:
// fixed4(0,0,0,0) if this isn't the right texture for this pixel
// or fixed4(r,g,b,1) from the texture if it is the right pixel
fixed4 grass = grassCond * tex2D(_GrassTexture, IN.uv_MainTex);
fixed4 water = waterCond * tex2D(_WaterTexture, IN.uv_MainTex);
fixed4 road = roadCond * tex2D(_RoadTexture, IN.uv_MainTex);
fixed4 def = defaultCond * ct; // just used the MainTex pixel
// then use the found pixels as the Albedo
o.Albedo = (grass + road + water + def).rgb;
o.Alpha = 1;
}
ENDCG
}
Fallback "None"
}
This is the first shader I've ever written, and it probably isn't very performant. It seems counterintuitive to me to call tex2D on each texture for every pixel only to throw that data away, but I couldn't think of a better way to do this without if/else (which I read are bad for GPUs).
This is a Unity surface shader, not a raw fragment/vertex shader. I know there is a step that happens behind the scenes to generate the fragment/vertex shader for me (adding in the scene's lighting, fog, etc.). This shader is applied to 100 map tiles of 256x256 px each (2560x2560 pixels in total). The grass/road/water textures are all 256x256 pixels as well.
My question is: is there a better, more performant way of accomplishing what I'm doing here? The game runs on Android and iOS.
I'm not a specialist in shader performance, but assuming you have a relatively small number of source tiles to render in the same frame, it might make more sense to store the result of the pixel replacement and reuse it.
Since the resulting image will be the same size as your source tile, just render the source tile using your surface shader (without any lighting, though; you may want to consider a simple, flat pixel shader!) into a RenderTexture once, and then use that RenderTexture as the source for your world rendering. That way you do the expensive work only once per source tile, and it no longer even matters whether your shader is well optimized.
If all textures are static, you might even consider not doing this at runtime, but just translate them once in the Editor.
I've written some shader code for my Android application. It has some time-dependent animation which works totally fine in the WebGL version. The shader code is below, and the full version can be found here:
vec3 bip(vec2 uv, vec2 center)
{
    vec2 diff = center - uv;              // difference between center and start coordinate
    float r = length(diff);               // vector length
    float scale = mod(u_ElapsedTime, 2.); // wraps every 2 time units, triggering the effect
    float circle = smoothstep(scale, scale + cirleWidth, r)
                 * smoothstep(scale + cirleWidth, scale, r) * 4.;
    return vec3(circle);
}
The return value of the function is used in gl_FragColor as a base for the color.
u_ElapsedTime is sent to the shader via a uniform:
glUniform1f(uElapsedTime,elapsedTime);
Time data is sent to the shader from onDrawFrame():
public void onDrawFrame(GL10 gl) {
    glClear(GL_COLOR_BUFFER_BIT);
    elapsedTime = (SystemClock.currentThreadTimeMillis() - startTime) / 100f;
    //Log.d("KOS","time " + elapsedTime);
    scannerProgram.useProgram();                                  // initialize shader
    scannerProgram.setUniforms(resolution, elapsedTime, rotate);  // send uniforms to shader
    scannerSurface.bindData(scannerProgram);                      // get attribute location
    scannerSurface.draw();                                        // draw vertices with given attributes
}
So everything looks totally fine. Nevertheless, after some amount of time there are lags, and the frame rate is lower than at the beginning. In the end it can be as low as one or two frames per cycle of that function. At the same time, OpenGL itself doesn't seem to be lagging, because I can, for example, rotate the picture without seeing any lag.
What could be the reason for those lags?
Update:
Code of bindData():
public void bindData(ScannerShaderProgram scannerProgram) {
    // getting the location of each attribute for the shader program
    vertexArray.setVertexAttribPointer(
            0,
            scannerProgram.getPositionAttributeLocation(),
            POSITION_COMPONENT_COUNT,
            0
    );
}
Sounds to me like precision issues. As u_ElapsedTime grows, the (typically mediump) float in the fragment shader has fewer bits left for its fractional part, so the result of mod() becomes coarser and coarser over time. Try taking this line from your shader:
float scale = mod(u_ElapsedTime,2.);
and performing the wrap on the CPU instead, so the value passed to the shader stays small, e.g.:
elapsedTime = ((SystemClock.currentThreadTimeMillis()-startTime)%200)/100f;
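Applied to the onDrawFrame() above, that is a one-line change (sketch; the mod() in the shader then simply becomes a harmless no-op):

public void onDrawFrame(GL10 gl) {
    glClear(GL_COLOR_BUFFER_BIT);
    // Wrap on the CPU so the uniform stays in [0, 2) and never loses
    // fractional precision as the raw time value grows.
    elapsedTime = ((SystemClock.currentThreadTimeMillis() - startTime) % 200) / 100f;
    scannerProgram.useProgram();
    scannerProgram.setUniforms(resolution, elapsedTime, rotate);
    scannerSurface.bindData(scannerProgram);
    scannerSurface.draw();
}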
I am quite new to OpenGL ES 2.0 on Android. I am working on a project which draws a few plane indicators on screen (like an altimeter, a compass, etc.). After doing the tutorial from the official Google dev site (http://developer.android.com/training/graphics/opengl/index.html), I just continued along this path, drawing circles, triangles, squares, etc. (only 2D stuff). I can make the drawn objects move using rotation and translation matrices, but the only way I know how to do this (apart from how they did it in the tutorial) is like this, in the onDrawFrame() method of my renderer class:
// set values for all indicators
try {
    Thread.sleep(1);
    // for roll + pitch:
    if (roll < 90) {
        roll += 1.5f;
    } else roll = 0;
    if (pitch < 90) {
        pitch += 0.5f;
    } else pitch = 0;
    // for compass:
    if (compassDeg > 360) compassDeg = 0;
    else compassDeg += 1;
    // for altimeter:
    if (realAltitude >= 20000) realAltitude = 0;
    else realAltitude += 12;
    // for speedometer:
    if (realSpeed >= 161) realSpeed = 0;
    else realSpeed += 3;
} catch (InterruptedException e) {
    e.printStackTrace();
}
roll, pitch, compassDeg, speed, etc. are the parameters the indicators receive, and I designed the indicators to move accordingly (if compassDeg = 0, for example, the compass will point north, and so on). These parameters will eventually be received via Bluetooth, but for now I'm modifying them directly in the code because I don't have a Bluetooth implementation yet.
I am pretty sure this is not the best way to do it; sometimes the drawn objects stutter and seem to go back a few frames, then jump forward again, and I don't think pausing the drawing method with Thread.sleep() is a good idea in general.
I've seen that in the tutorial I mentioned in the beginning they use something like this:
// Use the following code to generate constant rotation.
// Leave this code out when using TouchEvents.
long time = SystemClock.uptimeMillis() % 4000L;
float contAngle = -0.090f * ((int) time);
Matrix.setRotateM(contRotationMatrix, 0, contAngle, 0, 0, -1.0f);
Matrix.multiplyMM(contMVPMatrix, 0, mMVPMatrix4, 0, contRotationMatrix, 0);
which is still kind of weird, I think; there has to be a more straightforward way to specify how to draw each frame, and to rotate and translate objects frame by frame.
So my question is: how do I make everything move frame by frame (or something like that), or at least how do I find out when one frame has been drawn?