Is discard bad for program performance in OpenGL?

I was reading this article, and the author writes:
Here's how to write high-performance applications on every platform in two easy steps:
[...]
Follow best practices. In the case of Android and OpenGL, this includes things like "batch draw calls", "don't use discard in fragment shaders", and so on.
I had never heard that discard has a bad impact on performance, and I have been using it to avoid blending when detailed alpha isn't necessary.
Could someone please explain why and when using discard might be considered bad practice, and how discard + depth test compares with alpha + blend?
Edit: After having received an answer on this question I did some testing by rendering a background gradient with a textured quad on top of that.
Using GL_DEPTH_TEST and a fragment shader ending with the line "if ( gl_FragColor.a < 0.5 ){ discard; }" gave about 32 fps.
Removing the if/discard statement from the fragment shader increased the rendering speed to about 44 fps.
Using GL_BLEND with the blend function "(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA)" instead of GL_DEPTH_TEST also resulted in around 44 fps.
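For reference, the two approaches compared in the test can be modelled per colour channel on the CPU. This is only a rough sketch: the class and method names are made up for illustration, and on real hardware the cutout happens in the fragment shader while the blend happens in the blend unit.

```java
// Illustrative CPU model of one colour channel; not actual GL code.
final class AlphaModes {
    // Blending with (GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA):
    // result = src * alpha + dst * (1 - alpha)
    static float blend(float src, float dst, float alpha) {
        return src * alpha + dst * (1f - alpha);
    }

    // Cutout in the style of "if (a < 0.5) discard;":
    // the fragment is either dropped (dst survives) or written as-is.
    static float cutout(float src, float dst, float alpha) {
        return alpha < 0.5f ? dst : src;
    }
}
```

With a source value of 1, a destination value of 0 and alpha 0.25, blending yields 0.25 while the cutout leaves the destination untouched, which is why discard only suits cases where detailed alpha is not needed.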

It's hardware-dependent. For PowerVR hardware, and other GPUs that use tile-based rendering, using discard means that the TBR can no longer assume that every fragment drawn will become a pixel. This assumption is important because it allows the TBR to evaluate all the depths first, then only evaluate the fragment shaders for the top-most fragments. A sort of deferred rendering approach, except in hardware.
Note that you would get the same issue from turning on alpha test.

"discard" is bad for every mainstream graphics acceleration technique: IMR, TBR, TBDR. This is because the visibility of a fragment (and hence its depth) can only be determined after fragment processing, not during Early-Z or PowerVR's HSR (hidden surface removal). The further down the graphics pipeline something gets before it is removed, the more it tends to cost; in this case, more fragment processing plus disrupted depth processing for other polygons adds up to a bad effect on performance.
If you must use discard make sure that only the tris that need it are rendered with a shader containing it and, to minimise its effect on overall rendering performance, render your objects in the order: opaque, discard, blended.
Incidentally, only PowerVR hardware determines visibility in the deferred step (hence it's the only GPU termed as "TBDR"). Other solutions may be tile-based (TBR), but are still using Early Z techniques dependent on submission order like an IMR does.
TBRs and TBDRs do blending on-chip (faster, less power-hungry than going to main memory) so blending should be favoured for transparency. The usual procedure to render blended polygons correctly is to disable depth writes (but not tests) and render tris in back-to-front depth order (unless the blend operation is order-independent). Often approximate sorting is good enough. Geometry should be such that large areas of completely transparent fragments are avoided. More than one fragment still gets processed per pixel this way, but HW depth optimisation isn't interrupted like with discarded fragments.
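The submission order described above (opaque, then discard, then blended back to front) could be sketched on the CPU side roughly as follows. The `DrawItem` and `RenderOrder` names and the `viewDepth` field are hypothetical, and the sketch assumes a larger depth means farther from the camera.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical draw-list entry; names are illustrative only.
final class DrawItem {
    // Declared in the order the groups should be submitted.
    enum Kind { OPAQUE, DISCARD, BLENDED }

    final String name;
    final Kind kind;
    final float viewDepth; // assumed: distance from the camera, larger = farther

    DrawItem(String name, Kind kind, float viewDepth) {
        this.name = name;
        this.kind = kind;
        this.viewDepth = viewDepth;
    }
}

final class RenderOrder {
    // Returns a sorted copy: opaque first, then discard, then blended,
    // with only the blended items ordered back to front.
    static List<DrawItem> sort(List<DrawItem> items) {
        List<DrawItem> sorted = new ArrayList<>(items);
        sorted.sort(Comparator
                .comparing((DrawItem d) -> d.kind)
                .thenComparingDouble((DrawItem d) ->
                        d.kind == DrawItem.Kind.BLENDED ? -d.viewDepth : 0.0));
        return sorted;
    }
}
```

The sort is stable, so opaque and discard items keep their original relative order; an approximate depth for the blended items is often good enough, as noted above.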

Also, just having an "if" statement in your fragment shader can cause a big slowdown on some hardware. (Specifically, GPUs that are heavily pipelined, or that do single instruction/multiple data, will have big performance penalties from branch statements.) So your test results might be a combination of the "if" statement and the effects that others mentioned.
(For what it's worth, testing on my Galaxy Nexus showed a huge speedup when I switched to depth-sorting my semitransparent objects and rendering them back to front, instead of rendering in random order and discarding fragments in the shader.)

Object A is in front of Object B. Object A has a shader using 'discard'. As such, I can't do 'Early-Z' properly because I need to know which sections of Object B will be visible through Object A. This means that Object A has to pass all the way through the processing pipeline until almost the last moment (until fragment processing is performed) before I can determine if Object B is actually visible or not.
This is bad for HSR and 'Early-Z' as potentially occluded objects have to sit and wait for the depth information to be updated before they can be processed. As has been stated above, it's bad for everyone, or, to put it in a slightly friendlier way, "Friends don't let friends use Discard".

In your test, the if statement is evaluated at the per-pixel level:
if ( gl_FragColor.a < 0.5 ){ discard; }
It would be processed once for every pixel being rendered (pretty sure that's per pixel and not per texel).
If your if statement were testing a uniform or a constant, you'd most likely get a different result, since constants are only processed once at compile time and uniforms once per update.

Related

Reading a single integer back from GPU

Target: OpenGL ES 3.1
After each frame, I want to read back from my fragment shader the number of transparent fragments (i.e. fragments with 0 < alpha < 1) that the shader just wrote. Depending on this number, the CPU will then, in the next frame, use different rendering techniques (0: normal, fast rendering; >0: switch to an order-independent transparency method with an appropriate buffer size, which is much slower).
I guess I don't even need this data immediately; I can tolerate a delay of 1 frame (i.e. after sending frame N to GPU my CPU could read back the number from previous frame N-1 - if that would help speed-wise).
The question is: what is the best way to read this single integer back?
Should I bind a FBO with GL_READ_FRAMEBUFFER and then use glReadPixels to read the first pixel? or should I be using glMapBufferRange? Is it even a good idea? The point is, I need order-independent transparency, but a proper implementation of it (A-buffer with per-pixel linked lists) is a much slower rendering technique, so I want to detect if I need it at the moment (most of the time I won't be needing it) and only do it if there are really some transparent fragments on the screen.

Is there anything I can do about the overhead from running a shader multiple times

I'm trying to implement deferred rendering on an Android phone using OpenGL ES 3.0. I've gotten this to work OK, but only very slowly, which rather defeats the whole point. What really slows things down is the multiple calls to the shaders. Here, briefly, is what my code does:
1. Geometry Pass:
Render scene - output position, normal and colour to off-screen buffers.
2. For each light:
a) Stencil Pass:
Render a sphere at the current light position, sized according to the light's intensity. Mark these pixels as influenced by the current light. No actual output.
b) Light Pass:
Render a sphere again, this time using the data from the geometry pass to apply lighting equations to pixels marked in the previous step. Add this to the off-screen buffer.
3. Blit to screen.
It's this restarting of the shaders for each light that causes the bottleneck. For example, with 25 lights the above steps run at about 5 fps. If instead I do: Geometry Pass / Stencil Pass - draw 25 lights / Light Pass - draw 25 lights, it runs at around 30 fps. So, does anybody know how I can avoid having to re-initialize the shaders? Or, in fact, just explain what's taking up the time? Would it help, or even be possible (and I'm sorry if this sounds daft), to keep the shader 'open' and overwrite the previous data rather than doing whatever it is that takes so much time restarting the shader? Or should I give this up as a method for having multiple lights, on a mobile device anyway?

Well, I solved the problem of having to swap shaders for each light by using an integer texture as a stencil map, where a certain bit is set to represent each light. (So, limited to 32 lights.) This means step 2a (above) can be looped, then a single change of shader, and then step 2b looped. However, (ahahaha!) it turns out that this didn't really speed things up, as it's not, after all, swapping shaders that's the problem but changing the write destination, that is, multiple calls to glDrawBuffers. I had two such calls in the stencil creation loop: one to draw nowhere when drawing a sphere to calculate which pixels are influenced, and one to draw to the integer texture used as the stencil map. I finally realized that, as I use blending (each write with a colour where a single bit is on), it doesn't matter if I write at the pixel calculation stage, so long as it's with all zeros. Getting rid of the unnecessary calls to glDrawBuffers takes the FPS from single figures to the high twenties.
In summary, this method of deferred rendering is certainly faster than forward rendering but limited to 32 lights.
I'd like to say that my code was written just to see if this was a viable method, and many small optimizations could be made. Incidentally, as I was limited to 4 draw buffers, I had to scratch the position map and instead recover this from gl_FragCoord.xyz. I don't have proper benchmarking tools, so I'd be interested to hear from anyone who can tell me what difference this makes, speed-wise.
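The one-bit-per-light idea can be illustrated with a small CPU-side model of a single stencil-map texel. This is a sketch only: the `LightMask` class is hypothetical, and it uses a bitwise OR where the answer above accumulates the bits through blending.

```java
// CPU-side model of one texel of the integer stencil map; names are illustrative.
final class LightMask {
    static final int MAX_LIGHTS = 32; // one bit per light in a 32-bit integer

    // The single bit that represents the given light.
    static int bitFor(int lightIndex) {
        if (lightIndex < 0 || lightIndex >= MAX_LIGHTS) {
            throw new IllegalArgumentException("light index out of range: " + lightIndex);
        }
        return 1 << lightIndex;
    }

    // Marks the texel as influenced by the given light.
    static int mark(int mask, int lightIndex) {
        return mask | bitFor(lightIndex);
    }

    // True if the texel was marked for the given light.
    static boolean isLit(int mask, int lightIndex) {
        return (mask & bitFor(lightIndex)) != 0;
    }
}
```

Writing a value of all zeros leaves the mask unchanged, which is the property that made the extra glDrawBuffers call unnecessary in the answer above.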

Color picking implementation

I'm working on a very simple, yet customizable OpenGL ES 2 rendering engine (I know that stuff like "unity" and "unreal engine" exist, and that reinventing the wheel isn't probably the sanest thing to do, just take it as a given ;-) ).
Now I'm facing object picking: I don't want to do ray-casting and I would like to do color picking instead (free laughs here: I'm using MRTs on ES3 in the current implementation. It works, but only where it works, if you catch my drift...).
AFAIK, when color picking you can either have two buffers (one for selection and one for rendering) or write twice to the same one; each approach has its pros and cons.
Assuming that I have an unknown number of objects, is it better:
a) to create two buffers and draw each object to each buffer before going to the next one (thus minimizing the number of uniforms you have to load, but switching buffers twice for each object), or
b) to create a single buffer, draw all objects for selection, do the color picking, then draw everything for rendering (thus limiting the context switching, but increasing program switching and uniform loading)?
I think that the question can be summarized as: "is it more expensive to switch buffers repeatedly, or to switch programs and load uniforms"?
Oh... and feel free to tell me if the question doesn't make sense :)

On mobile, the number of draw calls you make is one of the most important performance factors to consider. The driver overhead for each draw call is generally huge (bigger on Android than on iOS in general), so drawing the objects twice will not be good for performance (it will use more CPU because of the cost of the driver calls).
If you have no more than 256 different objects, a simple solution without MRT would be to use an RGBA render target (instead of RGB) and store the object "ID" in the alpha channel (so as a grayscale value).
Otherwise, you should avoid switching buffers repeatedly at all costs, or you will get slow load/store operations (the GPU being forced to memcpy the current buffer to back it up when switching to a new one; imagine doing that hundreds of times per frame...).
So to answer your question: it costs more to switch buffers repeatedly than to switch programs and load uniforms.
PS: if you have more than 256 objects, you could try to render in "slices" of 256 objects: after each slice you read the buffer pixels back with glReadPixels and check for object selection in the alpha channel, then glClear only the alpha and continue to the next slice. But note that this may not be very efficient either, because as soon as you want to read the buffer pixels the CPU has to stall, waiting for the GPU to finish rendering, so you are breaking CPU/GPU parallelism.
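The ID-in-alpha scheme with slices might be modelled like this. Note one assumption that is not in the answer above: alpha 0 is reserved here to mean "no object", so each slice holds 255 objects rather than 256. The `PickId` class and its method names are hypothetical.

```java
// Illustrative mapping between object indices and (slice, alpha) pairs.
final class PickId {
    // Assumption: alpha 0 means "background", leaving 255 usable IDs per slice.
    static final int IDS_PER_SLICE = 255;

    // Which rendering slice the object belongs to.
    static int sliceOf(int objectIndex) {
        return objectIndex / IDS_PER_SLICE;
    }

    // The 8-bit alpha value (1..255) written for the object within its slice.
    static int alphaOf(int objectIndex) {
        return objectIndex % IDS_PER_SLICE + 1;
    }

    // Recovers the object index from a read-back alpha value, or -1 for background.
    static int decode(int slice, int alpha) {
        return alpha == 0 ? -1 : slice * IDS_PER_SLICE + alpha - 1;
    }
}
```

After reading back the pixel under the cursor for a given slice, `decode(slice, alpha)` gives the picked object index, or -1 if nothing in that slice covers the pixel.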

Single Bone (Matrix Palette) Animation vs. simple rotation/translation of parts

First of all, I'm using OpenGL ES 1.1 and not 2.0, mostly because I've spent quite a bit of time learning 1.0/1.1 and am pretty comfortable with it.
I took time to learn the use of matrix palettes for animation and after switching gears on a project I've come to a question.
Originally I was using 2- and 3-bone animation because I needed weights for certain vertex groups. Now, in the new project I'm working on, I will be animating more mechanical things, so I no longer need more than 1 bone or any weighting. I'd still like to use a matrix palette with verts weighted 100% to single bones, but I wonder if that will cause a performance hit. Instead, I could break a mesh into smaller pieces and do simple translation and rotation between element draw calls. I am concerned, of course, with performance.

TL;DR version: try both ways and see which one performs better.
Really, using palettes for 1-bone animation is something that you can do without too much hassle, and depending on the number of different bones, and the driver overhead on the devices you do it on, might perform better.
It's worth noting that weights can be ignored in a 1-bone model, and the resulting per-vertex code should typically be comparable to a single transform, modulo the indirection to the palette.
That, of course, hinges on the GL implementation to optimize the weighting away. On the other hand, the higher the number of bones, the more draw calls you would have to generate without palettes, and the more you tax the CPU/driver code.
So at a broad level, I'd say that the palette is somewhat more work per vertex, but significantly less per bone. Where the tipping point is depends on the platform, as both of those costs can vary significantly.

GLScissors: what is faster/better?

I'm having a small dilemma.
I'm working with Android (not that it is relevant) and I noticed one thing on some phones and not others: the scissor prevents the call glClear(GL_COLOR_BUFFER_BIT) from working correctly.
Therefore I'm wondering if it is better to do:
gl.glDisable(GL10.GL_SCISSOR_TEST);
gl.glClear(GL10.GL_COLOR_BUFFER_BIT);
gl.glEnable(GL10.GL_SCISSOR_TEST);
or
gl.glScissor(0, 0, 800, 480);
gl.glClear(GL10.GL_COLOR_BUFFER_BIT);
Basically, is it better to change the scissor test zone before the clear, or is it better to disable/enable the test?

This is the specified behavior of glScissor: glClear is treated in OpenGL as a drawing command, so the scissor test applies to it. The behavior you see is therefore perfectly normal (D3D works similarly); it is the other phones, where funnily enough it seems to work for you, that are buggy!
As for which solution to choose, I don't know; both are valid. It's a matter of taste, I would say. I prefer the first one because it's easier to figure out what happens.
Now, if on your OpenGL implementation the second solution turns out to be faster than the first one, I would pick the second one. Benchmark!
Under the glHood:
Let me share what I know about desktop GPUs and speculate a bit (don't take it too seriously!). A glClear command actually results in the drawing of a full-screen quad, since drawing triangles is the "fast path" on GPUs. This is probably more efficient than DMAs or fixed hardware, since it's parallel (each shader core clears its portion of the screen: color, z and stencil), and doing it this way avoids the cost of dedicated hardware.
About glScissor, I heard it's implemented via stencil buffering (the same mechanism as the usual OpenGL stencil buffer), so only fragments that fall into the scissor zone can take part in the depth test, fragment shading, blending, etc. (this is done for the same reason as for glClear: to avoid dedicated hardware). It could also be implemented as a fragment discard plus dynamic branching on modern GPUs.
Now you can see why it works that way. Only the fragments of the full-screen quad that lie within the scissor zone can "shade" the color buffer and clear it!
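The effect of the scissor test on a clear can be mimicked with a toy CPU framebuffer. This is purely a model of the observable behavior, not of how any driver implements it, and the `ScissorClearModel` class is made up for illustration.

```java
// Toy model: a framebuffer stored row by row in an int array.
final class ScissorClearModel {
    // Returns a cleared copy of the framebuffer. When the scissor test is
    // enabled, only pixels inside the box (sx, sy, sw, sh) are overwritten.
    static int[] clear(int[] pixels, int width, int clearColor,
                       boolean scissorEnabled, int sx, int sy, int sw, int sh) {
        int[] out = pixels.clone();
        int height = pixels.length / width;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                boolean inside = x >= sx && x < sx + sw && y >= sy && y < sy + sh;
                if (!scissorEnabled || inside) {
                    out[y * width + x] = clearColor;
                }
            }
        }
        return out;
    }
}
```

With the scissor enabled, pixels outside the box keep their old contents, which matches the behavior reported in the question; disabling the test or setting the box to the full surface clears everything.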
