GLSL programming architecture: which part is "really" parallel execution? - Android

I am trying to implement image processing algorithms like Gaussian filtering and bilateral filtering on the GPU using GLSL.
I am getting confused about which part is "really" executed in parallel. For example, I have a 1280*720 preview as a texture. I am not quite sure which part really runs 1280*720 times and which part does not.
What is the dispatching mechanism for GLSL code?
My Gaussian filtering code looks like this:
#extension GL_OES_EGL_image_external : require
precision mediump float;
varying vec2 vTextureCoord;
uniform samplerExternalOES sTexture;
uniform sampler2D sTextureMask;
void main() {
float r=texture2D(sTexture, vTextureCoord).r;
float g=texture2D(sTexture, vTextureCoord).g;
float b=texture2D(sTexture, vTextureCoord).b;
// a test sample
float test=1.0*0.5;
float width=1280.0;
float height=720.0;
vec4 sum;
//offsets of a 3*3 kernel
vec2 offset0=vec2(-1.0,-1.0); vec2 offset1=vec2(0.0,-1.0); vec2 offset2=vec2(1.0,-1.0);
vec2 offset3=vec2(-1.0,0.0); vec2 offset4=vec2(0.0,0.0); vec2 offset5=vec2(1.0,0.0);
vec2 offset6=vec2(-1.0,1.0); vec2 offset7=vec2(0.0,1.0); vec2 offset8=vec2(1.0,1.0);
//gaussian kernel with sigma==100.0;
float kernelValue0 = 0.999900; float kernelValue1 = 0.999950; float kernelValue2 = 0.999900;
float kernelValue3 = 0.999950; float kernelValue4 =1.000000; float kernelValue5 = 0.999950;
float kernelValue6 = 0.999900; float kernelValue7 = 0.999950; float kernelValue8 = 0.999900;
vec4 cTemp0;vec4 cTemp1;vec4 cTemp2;vec4 cTemp3;vec4 cTemp4;vec4 cTemp5;vec4 cTemp6;vec4 cTemp7;vec4 cTemp8;
//getting 3*3 pixel values around current pixel
vec2 src_coor_2;
src_coor_2=vec2(vTextureCoord[0]+offset0.x/width,vTextureCoord[1]+offset0.y/height);
cTemp0=texture2D(sTexture, src_coor_2);
src_coor_2=vec2(vTextureCoord[0]+offset1.x/width,vTextureCoord[1]+offset1.y/height);
cTemp1=texture2D(sTexture, src_coor_2);
src_coor_2=vec2(vTextureCoord[0]+offset2.x/width,vTextureCoord[1]+offset2.y/height);
cTemp2=texture2D(sTexture, src_coor_2);
src_coor_2=vec2(vTextureCoord[0]+offset3.x/width,vTextureCoord[1]+offset3.y/height);
cTemp3=texture2D(sTexture, src_coor_2);
src_coor_2=vec2(vTextureCoord[0]+offset4.x/width,vTextureCoord[1]+offset4.y/height);
cTemp4=texture2D(sTexture, src_coor_2);
src_coor_2=vec2(vTextureCoord[0]+offset5.x/width,vTextureCoord[1]+offset5.y/height);
cTemp5=texture2D(sTexture, src_coor_2);
src_coor_2=vec2(vTextureCoord[0]+offset6.x/width,vTextureCoord[1]+offset6.y/height);
cTemp6=texture2D(sTexture, src_coor_2);
src_coor_2=vec2(vTextureCoord[0]+offset7.x/width,vTextureCoord[1]+offset7.y/height);
cTemp7=texture2D(sTexture, src_coor_2);
src_coor_2=vec2(vTextureCoord[0]+offset8.x/width,vTextureCoord[1]+offset8.y/height);
cTemp8=texture2D(sTexture, src_coor_2);
//convolution
sum =kernelValue0*cTemp0+kernelValue1*cTemp1+kernelValue2*cTemp2+
kernelValue3*cTemp3+kernelValue4*cTemp4+kernelValue5*cTemp5+
kernelValue6*cTemp6+kernelValue7*cTemp7+kernelValue8*cTemp8;
float factor=kernelValue0+kernelValue1+kernelValue2+kernelValue3+kernelValue4+kernelValue5+kernelValue6+kernelValue7+kernelValue8;
gl_FragColor = sum/factor;
//gl_FragColor=texture2D(sTexture, vTextureCoord);
}
This code runs at a lower fps than the pure preview on my phone (Galaxy Nexus).
But if I change the last part of my code to output the original pixel value directly, like
//gl_FragColor = sum/factor;
gl_FragColor=texture2D(sTexture, vTextureCoord);
it runs fast, at the same fps as the pure preview.
The question is: for the things I wrote for testing at the beginning that are useless, like
float test=1.0*0.5;
how many times is it executed?
Other parts like:
sum =kernelValue0*cTemp0+kernelValue1*cTemp1+kernelValue2*cTemp2+
kernelValue3*cTemp3+kernelValue4*cTemp4+kernelValue5*cTemp5+
kernelValue6*cTemp6+kernelValue7*cTemp7+kernelValue8*cTemp8;
would they not run 1280*720 times anymore, just because I change
gl_FragColor = sum/factor;
to
gl_FragColor=texture2D(sTexture, vTextureCoord);?
What is the mechanism that decides what runs 1280*720 times and what is just dead code when parallelized over the pixels? Is it done automatically?
What is the architecture and dispatching model, and how does a GLSL program organize its data for the GPU?
I am also wondering what I should do for more complicated operations like bilateral filtering, with a kernel size of 9*9, which is 9 times the work per pixel of this 3*3 Gaussian kernel.

The entire fragment shader is executed as a whole for each and every fragment. A fragment approximates either an output pixel (if no antialiasing is done) or, with multisample antialiasing, the samples of the framebuffer. What a fragment exactly is, is not specified in detail by the OpenGL spec, other than that it is the output of the fragment stage, which is then turned into values on the framebuffer bitplanes.
The rasterizer produces a series of framebuffer addresses and values using a two-dimensional description of a point, line segment, or polygon. Each fragment so produced is fed to the next stage that performs operations on individual fragments before they finally alter the framebuffer. These operations include […]
[OpenGL-3.3 core spec, section 2.4]
would they not run 1280*720 times anymore, just because I change gl_FragColor = sum/factor; to gl_FragColor=texture2D(sTexture, vTextureCoord);?

The compiled shader still runs for every fragment, but GLSL compilers typically eliminate code whose result does not contribute to the output; when only the plain texture fetch is written to gl_FragColor, the whole convolution and its nine texture reads are optimized away, which is why the frame rate recovers.
Division is a costly and complex operation. Since the sum of the kernel is a constant and doesn't change per fragment, you shouldn't evaluate it in the shader. Evaluate it on the CPU and supply 1./factor as a uniform (which is a constant, equal for all fragments) and multiply that with sum, which is much faster than division.
Your Gaussian kernel is actually a 3×3 matrix, for which there is a dedicated type in GLSL. The calculations you perform can be rewritten in terms of dot products (the mathematically correct term would be scalar or inner product), for which GPUs have dedicated, accelerated instructions.
Also, you shouldn't split up the components of a texture fetch into individual floats.
All in all, you built quite a number of speed bumps into your code.
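Put together, those suggestions look roughly like the following sketch. Note that texelSize and invFactor are assumed uniforms supplied from the Java side (they are not in the original code), and the loop form is just one way to express the 3×3 weighting with a mat3:

#extension GL_OES_EGL_image_external : require
precision mediump float;
varying vec2 vTextureCoord;
uniform samplerExternalOES sTexture;
uniform vec2 texelSize;    // (1.0/width, 1.0/height), set from the CPU
uniform float invFactor;   // 1.0 / (sum of the kernel weights), computed once on the CPU

void main() {
    // The 3x3 Gaussian weights as a mat3 (with sigma == 100 they are all close to 1.0).
    mat3 kernel = mat3(0.9999,  0.99995, 0.9999,
                       0.99995, 1.0,     0.99995,
                       0.9999,  0.99995, 0.9999);
    vec4 sum = vec4(0.0);
    for (int y = -1; y <= 1; y++) {
        for (int x = -1; x <= 1; x++) {
            sum += kernel[x + 1][y + 1] *
                   texture2D(sTexture, vTextureCoord + vec2(float(x), float(y)) * texelSize);
        }
    }
    gl_FragColor = sum * invFactor;   // multiply by the reciprocal instead of dividing
}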

On a modern (Shader Model 3.0+) GPU, fragment shaders are scheduled to operate on 2x2 blocks of pixels (pixel quads) at a time. Fun fact, this was required in order to implement the derivative instruction in Shader Model 3.0 and it has remained part of GPU architecture design ever since. Pixel quads are the lowest-level of granularity you can ever get in fragment shader scheduling. In fact, if you were to discard in a fragment shader, unless all of the fragments in the pixel quad also discard, then every instance of the fragment shader in the block continues running and the result is thrown out at the end for the individual fragments that requested discard.
In addition to this, most GPUs have multiple stream processing units and will schedule pixel quads into larger workgroups (NV calls them warps, AMD calls them wavefronts). In a nutshell, everything happens in parallel; that is the entire premise of GPUs: they run a single instruction stream across many threads that each operate on their own data in parallel, and this is why they scale so well when cores are added, as opposed to CPUs.
Put simply, rather than dispatching individual instructions in your GLSL shader to run on separate functional units, what really happens is this. Your GLSL shader is run on multiple processing units simultaneously (conceptually, one thread per-fragment), and these threads all execute the same sequence of instructions in a paradigm known as SIMT (Single Instruction Multiple Thread).
Getting back to the basic scheduling unit (warp/wavefront), if one instance of your shader stalls fetching memory the rest of the instances in said scheduling unit also stall, because they all run the same instruction simultaneously. This is why dependent texture reads and large filter kernels are bad mojo; since the texture memory needed by a particular group of fragments may be indeterminate until run-time or spread too far, efficiently pre-fetching and caching texture data within a scheduling unit can become difficult if not impossible.
The biggest problem with accurately describing the level of parallelism is that GPU architectures keep changing (most of the discussion above relates to Shader Model 3.0+ GPUs). Not too long ago GPUs had vectorized ISAs, but now both AMD and NV have switched to superscalar because it actually improves instruction scheduling efficiency. Throw specialized embedded GPUs into the mix and you have a real nightmare on your hands; it is hard to say what shader model they really run (since derivatives are optional in OpenGL ES 2.0).
See this other question on Stack Overflow for a more concise statement of what I just wrote.
For some pretty diagrams, here is a somewhat out-of-date but still useful presentation from nVIDIA.

Related

Strange texture result when mixing two camera textures in GLSL

I'm making a simple image filter app on Android and I implemented a lowpass filter using the same method as GPUImage (https://github.com/BradLarson/GPUImage):
it buffers a mixture of the previous and current camera frames and renders that.
So I created a buffer FBO, render the current camera texture into it, and re-use it as a texture in the lowpass filter shader, mixed with the next camera texture.
I tested my code on several smartphones (Galaxy S10, Nexus 6P, etc.) and it worked well. However, on a Galaxy S8 (Mali-G71) the result is strange and I don't know what is wrong.
These are the wrong results:
Here is my code:
Fragment shader:
precision mediump float;
varying vec2 vTextureCoord;
uniform sampler2D sTexture;
uniform sampler2D sTexture1;
uniform float filterStrength;
void main() {
vec4 texColor0 = texture2D(sTexture, vTextureCoord);
vec4 texColor1 = texture2D(sTexture1, vTextureCoord);
gl_FragColor = mix(texColor0, texColor1, filterStrength);
}
What can cause these results?
Thanks in advance.
The artifacts look tile-aligned for Mali, so if I had to guess, you are reading the currently bound framebuffer color attachment as an input texture at the same time as writing into it.
This is "implementation defined" behavior in the specification, and concurrent reads and writes will definitely do bad things on a tile-based renderer like Mali.

Query the shader precision an OpenGL ES Android device supports

I am using OpenGL ES to run some shaders on Android.
On some older/cheaper devices highp precision is not supported, so the shader output is incorrect.
I need to know when the app starts if the device can support high precision. That way I can tell the user "forget it, your device does not support high precision floats" rather than have it output garbage for them.
I found this query code online, but it seems to be for WebGL only:
var highp = gl.getShaderPrecisionFormat(gl.FRAGMENT_SHADER, gl.HIGH_FLOAT);
var highpSupported = highp.precision != 0;
Does anyone have a way I can query an android device (KitKat or higher) to see what precision the GLES shaders will support?
This is the final code I now use, but the contents of range and precision are always -999 no matter where I run the code in my app: before, during, or after the GLSurfaceView has been created and GLES output has run.
IntBuffer range = IntBuffer.allocate(2);
IntBuffer precision = IntBuffer.allocate(1);
range.put(0,-999);
range.put(1,-999);
precision.put(0,-999);
android.opengl.GLES20.glGetShaderPrecisionFormat(android.opengl.GLES20.GL_FRAGMENT_SHADER, android.opengl.GLES20.GL_HIGH_FLOAT,range,precision);
String toastText="Range[0]="+String.valueOf(range.get(0))+" Range[1]="+String.valueOf(range.get(1))+" Precision[0]="+String.valueOf(precision.get(0));
Toast.makeText(getApplicationContext(),toastText, Toast.LENGTH_SHORT).show();
The above code always returns -999 for all three values, and the Khronos documentation states that if an error occurs the values will be unchanged. So it looks like there is an error, or I am not calling it at the right time.
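One detail that may explain the unchanged -999 values (an assumption, not confirmed in the thread): GLES20 calls only return real data on a thread that has a current GL context, and the GLSurfaceView renderer callbacks run on such a thread while ordinary activity code does not. A minimal sketch of running the query from onSurfaceCreated, using the int[] overload of the same GLES20 call:

import android.opengl.GLES20;
import android.opengl.GLSurfaceView;
import android.util.Log;
import javax.microedition.khronos.egl.EGLConfig;
import javax.microedition.khronos.opengles.GL10;

// Hypothetical renderer: set via glSurfaceView.setEGLContextClientVersion(2) and setRenderer(...).
class PrecisionProbeRenderer implements GLSurfaceView.Renderer {
    @Override public void onSurfaceCreated(GL10 gl, EGLConfig config) {
        int[] range = new int[2];
        int[] precision = new int[1];
        GLES20.glGetShaderPrecisionFormat(GLES20.GL_FRAGMENT_SHADER,
                GLES20.GL_HIGH_FLOAT, range, 0, precision, 0);
        // Mirrors the WebGL check above: highp is unsupported when precision == 0.
        boolean highpSupported = precision[0] != 0;
        Log.d("PrecisionProbe", "range=[" + range[0] + "," + range[1]
                + "] precision=" + precision[0] + " highp=" + highpSupported);
    }
    @Override public void onSurfaceChanged(GL10 gl, int width, int height) { }
    @Override public void onDrawFrame(GL10 gl) { }
}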

Noise screen on OpenGLES Android

I am trying to create a noise screen on an Android device. To get a random value I use this formula:
fract((sin(dot(gl_FragCoord.xy ,vec2(0.9898,78.233)))) * 4375.85453)
I tested it on http://glslsandbox.com/ and it works perfectly.
Result from http://glslsandbox.com/:
But on a phone and a tablet I get a different result.
On a Nexus 9 I receive this:
That one is not so bad. But on an LG-D415 I receive this:
Can somebody help me with it?
Shader:
#ifdef GL_ES
precision mediump float;
#endif
float rand(vec2 co){
return fract((sin(dot(co.xy ,vec2(0.9898,78.233)))) * 4375.85453);
}
void main() {
vec2 st = gl_FragCoord.xy;
float rnd = rand(st);
gl_FragColor = vec4(vec3(rnd),1.0);
}
It's almost certainly a precision issue. Your mediump-based code might be executed as standard single-precision float on some devices or as half precision on others (or anything in between).
Half precision simply isn't enough to do those calculations with any degree of accuracy.
You could change your calculations to use highp, but then you'll run into the next problem which is that many older devices don't support highp in fragment shaders.
I'd strongly recommend you create a noise texture, and just sample from it. It'll produce much more consistent results across devices, and I'd expect it to be much faster on the majority of mobile GPUs.
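A minimal sketch of the noise-texture approach the answer suggests (sNoise and uNoiseSize are hypothetical uniforms; the texture would be filled once with random bytes on the CPU, e.g. 256×256, with GL_REPEAT wrapping):

precision mediump float;
uniform sampler2D sNoise;    // pre-generated random texture, GL_REPEAT wrap mode
uniform vec2 uNoiseSize;     // dimensions of that texture, e.g. vec2(256.0, 256.0)

void main() {
    // One noise texel per screen pixel; the lookup avoids the precision-sensitive sin()/fract() math.
    float rnd = texture2D(sNoise, gl_FragCoord.xy / uNoiseSize).r;
    gl_FragColor = vec4(vec3(rnd), 1.0);
}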
This is an interesting article about floating point precision: http://www.youi.tv/mobile-gpu-floating-point-accuracy-variances/
This is the related app on the Google Play store.
This is a more in-depth discussion of GPU precision: https://community.arm.com/groups/arm-mali-graphics/blog/2013/05/29/benchmarking-floating-point-precision-in-mobile-gpus

GLSL IF speed vs multiply factor

I know this has been asked in general before, but the answer is always "it depends", so I'm asking a concrete question in the hope of getting a concrete answer.
I know the evil of IFs in GLSL: they can be really expensive, and on some hardware all branches are executed.
So, I have a fragment shader from an example (a dual paraboloid shadow map) which uses ifs to determine which map to use and to compute the depth. I know it's very easy to replace those ifs with a multiplier, but there is texture sampling inside the fragment shader. What would be faster: using an if, or using a multiplier to filter out the unused data?
These are the proposed codes:
IF version:
//Alpha is a variable computed on the fly, cannot be replaced
float depth = 0.0;
float mydepth = 0.0;
if (alpha >= 0.5)
{
    depth = texture2D(ShadowFrontS, P0.xy).x;
    mydepth = P0.z;
}
else
{
    depth = texture2D(ShadowBackS, P1.xy).x;
    mydepth = P1.z;
}
Filter version:
float mlt = ceil(alpha - 0.5);
float depth = 0.0;
float mydepth = 0.0;
depth = texture2D(ShadowFrontS, P0.xy).x * mlt;
mydepth = P0.z * mlt;
mlt = 1.0 - mlt;
depth = depth + (texture2D(ShadowBackS, P1.xy).x * mlt);
mydepth = mydepth + (P1.z * mlt);
P.S.: I'm targeting desktop and mobile devices, so performance on low-end hardware is a must.
Branching is not "evil" per se on massively SIMD architectures. If all the threads in a "bunch" (NVidia calls them warps) follow the same code path, i.e. all take the same branches, everything is fine.
Only if a branch is partly taken (within that bunch) and for the other part not, both branches must be executed and later on the calculations and data fetches discarded that are not relevant for the current thread.
Now in your case it requires some careful profiling to see which variant benefits your GPU more. But my gut instinct tells me it's actually the branching version. Why? Because the value by which you decide on a branch usually depends on the screen-space position, and often large contiguous areas of fragments share the same code path and branching; so performance penalties happen only for those "bunches" which cover a bordering region. These bunches are usually only a few pixels² in size (8×8 or 16×16).
The shader you have there is not GPU limited (i.e. limited by the computational capabilities of the GPU), but memory bandwidth limited, i.e. by the throughput that the GPU's memory link offers; that is because of the texture2D fetch operations. And in that case reducing the actual number of fetches and thereby the required memory bandwidth will probably benefit your program more than reducing the number of computations.
The branchless mix-multiplex variant of your shader will always fetch both textures; the branching one will do that only within the bordering regions. From that heuristic I'd guess that your branching variant is actually the better choice.
But to be sure you have to profile it.
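For reference, if the branchless route does get profiled, the hand-rolled ceil() mask can be expressed more directly with step() and mix(). A sketch using the question's own names (it still always fetches both textures):

// sel is 1.0 when alpha >= 0.5, otherwise 0.0
float sel = step(0.5, alpha);
float depth = mix(texture2D(ShadowBackS, P1.xy).x,
                  texture2D(ShadowFrontS, P0.xy).x, sel);
float mydepth = mix(P1.z, P0.z, sel);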

How to properly mix drawing calls and changes of a sampler value with a single shader program?

I'm trying to draw two objects using two different textures with one shader program in OpenGL ES 2.0 for Android. The first object should have texture0 and the second sould have texture1.
In fragment shader I have:
uniform sampler2D tex;
and in java code:
int tiu0 = 0;
int tiu1 = 1;
int texLoc = glGetUniformLocation(program, "tex");
glUseProgram(program);
// bind texture0 to texture image unit 0
glActiveTexture(GL_TEXTURE0 + tiu0);
glBindTexture(GL_TEXTURE_2D, texture0);
// bind texture1 to texture image unit 1
glActiveTexture(GL_TEXTURE0 + tiu1);
glBindTexture(GL_TEXTURE_2D, texture1);
glUniform1i(texLoc, tiu0);
// success: glGetError returns GL_NO_ERROR, glGetUniformiv returns 0 for texLoc
drawFirstObject(); // should have texture0
glUniform1i(texLoc, tiu1);
// success: glGetError returns GL_NO_ERROR, glGetUniformiv returns 1 for texLoc
drawSecondObject(); // should have texture1
Running on a Samsung Galaxy Ace with Android 2.3.3, both objects get texture0. Similar code runs correctly in OpenGL 2.0 on my desktop computer.
If I remove drawFirstObject, the second object will have texture1.
If I remove drawSecondObject, the first object will have texture0.
If somewhere between drawFirstObject and drawSecondObject I change the program for a while:
glUseProgram(0); // can be also any valid program other than the program from the next call
glUseProgram(program);
then both objects will have texture1.
Values of uniforms other than the sampler2D are always set correctly.
I know I can draw the two objects with different textures using only one texture image unit and binding the appropriate texture to that unit before drawing each object, but I also want to know what's going on here.
Is something wrong with my code? Is it possible in OpenGL ES 2.0 to draw the objects with different textures by only switching between texture image units, as I show in the code? If it's impossible, is that difference from OpenGL 2.0 (where it is possible) documented anywhere? I can't find it.
After hours of further research I've found that this problem is specific to the Adreno 200 GPU in my Samsung Galaxy Ace (GT-S5830). It seems that the Adreno 200 driver assigns the texture to the sampler on the first call to a drawing function, and after that it ignores any changes to the sampler value (glUniform1i(samplerLocation, textureImageUnit)) until one of two things happens:
glUseProgram is called with a different shader program,
a different texture is bound to any texture image unit used by the shader program.
There's a thread in the forums of the manufacturer of Adreno 200 GPU describing the very same problem.
So if you call drawing functions several times with the same shader program and with different textures bound beforehand, there are two workarounds for the described problem:
Call glUseProgram(0); glUseProgram(yourDrawingProgram); before every drawing function (see the sketch after this list).
Before every drawing call, bind a different texture to at least one texture image unit used by your shader program. This solution can be difficult to maintain, because if you bind the same texture that is already bound to that texture image unit, the problem remains. So in this case the easiest solution is to simply not change sampler values at all, and to bind the textures for all texture image units used by the shader program before every drawing call.
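Applied to the code in the question, the first workaround would look roughly like this (a sketch continuing the snippet above, not tested on the device):

glUniform1i(texLoc, tiu0);
drawFirstObject();    // gets texture0

// Rebind the program so the driver re-reads the sampler uniform (workaround 1).
glUseProgram(0);
glUseProgram(program);

glUniform1i(texLoc, tiu1);
drawSecondObject();   // now gets texture1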
