First of all, I'm using OpenGL ES 1.1 and not 2.0, mostly because I've spent quite a bit of time learning 1.0/1.1 and am pretty comfortable with it.
I took time to learn the use of matrix palettes for animation and after switching gears on a project I've come to a question.
Originally I was using 2- and 3-bone animation because certain vertex groups needed weighting. Now, in the new project I'm working on, I'll be animating more mechanical things, so more than one bone per vertex (and weighting) is unnecessary. I'd still like to use a matrix palette with verts weighted 100% to single bones, but I wonder if that will cause a performance hit. Alternatively, I could break a mesh into smaller pieces and do simple translation and rotation between element draw calls. Either way, my concern is performance.
TL;DR version: try both ways and see which one performs better.
Really, using palettes for 1-bone animation is something you can do without too much hassle, and depending on the number of distinct bones and the driver overhead on the devices you target, it might perform better.
It's worth noting that weights can be ignored in a 1-bone model, and the resulting per-vertex code should typically be comparable to a single transform, modulo the indirection to the palette.
That, of course, hinges on the GL implementation optimizing the weighting away. On the other hand, the more bones you have, the more draw calls you would have to generate without palettes, and the more you tax the CPU/driver code.
So at a broad level, I'd say the palette is somewhat more work per vertex, but significantly less per bone. Where the tipping point lies depends on the platform, as both of those costs can vary significantly.
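For reference, a minimal sketch of the 1-bone palette path on Android's GL ES 1.1 bindings (OES_matrix_palette through GL11Ext) might look like the following. The gl object, boneCount, boneModelViewMatrices and the buffers are placeholder names, and the exact palette-loading sequence can vary by driver, so treat this as an outline rather than drop-in code:

// Sketch: 1-bone-per-vertex skinning with OES_matrix_palette on GL ES 1.1.
GL11Ext gl11Ext = (GL11Ext) gl;
gl.glEnable(GL11Ext.GL_MATRIX_PALETTE_OES);
gl.glEnableClientState(GL11Ext.GL_MATRIX_INDEX_ARRAY_OES);
gl.glEnableClientState(GL11Ext.GL_WEIGHT_ARRAY_OES);
gl.glEnableClientState(GL10.GL_VERTEX_ARRAY);

// Load one palette entry per bone: put view * boneTransform into the modelview,
// then copy it into the current palette slot.
for (int bone = 0; bone < boneCount; bone++) {
    gl.glMatrixMode(GL10.GL_MODELVIEW);
    gl.glLoadMatrixf(boneModelViewMatrices, bone * 16);
    gl.glMatrixMode(GL11Ext.GL_MATRIX_PALETTE_OES);
    gl11Ext.glCurrentPaletteMatrixOES(bone);
    gl11Ext.glLoadPaletteFromModelViewMatrixOES();
}
gl.glMatrixMode(GL10.GL_MODELVIEW);

// One matrix index per vertex and a constant weight of 1.0.
gl11Ext.glMatrixIndexPointerOES(1, GL10.GL_UNSIGNED_BYTE, 0, matrixIndexBuffer);
gl11Ext.glWeightPointerOES(1, GL10.GL_FLOAT, 0, weightBuffer);   // every weight is 1.0f

gl.glVertexPointer(3, GL10.GL_FLOAT, 0, vertexBuffer);
gl.glDrawElements(GL10.GL_TRIANGLES, indexCount, GL10.GL_UNSIGNED_SHORT, indexBuffer);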
I am developing a game on Android using OpenGL and am having a little performance problem.
Let's say, for example, I want to draw a background partially filled with grass "bushes". Bushes have different x, y, z positions and different sizes (each bush is a 2D sprite), and they potentially partially hide each other (I use a perspective camera). I have a big performance problem when those sprites are big (i.e. the quad size, not the texture size/resolution):
If I use a classical front-to-back draw (to avoid overdraw), I run into problems because of (I think) alpha testing. Even though the bushes have only opaque and fully transparent pixels (no partial transparency), and even with the proper alpha test comparison (GL_EQUAL, 1.0), performance is bad because a lot of pixels have to be alpha tested (if I understand right).
If I draw back to front with alpha testing disabled, I lose a lot of performance too (but this time because of overdraw), even when disabling depth buffer writes (not sure that does anything when the depth test is disabled, by the way).
I get good performance with front-to-back drawing and no alpha testing, but of course the sprite cutout is completely gone, which is really, really bad.
All the bushes use the same texture, and I use 16-bit colors, mipmapping, geometry batching, face culling, no shaders, etc.; everything I can think of to improve performance (which is not bad in other cases), except texture compression. I even cull sprites that are outside the screen so they are never "displayed". I have also tried some aggressive optimizations for testing purposes, such as making the textures fully opaque, lowering the texture resolution a lot, disabling blending, etc., but nothing made a big difference performance-wise except removing the alpha testing.
I'm wondering if I'm forgetting something here that could help performance. Back to front creates overdraw; front to back is slow because of alpha testing (and I don't want my bushes to be "square" images, so I can't disable alpha testing). If I create smaller sprites, performance is far better (even with a lot more sprites), but that's only a workaround.
To summarize: how can you display big overlapping quads that need cutout without losing performance?
PS: I am testing on a Nexus One.
PS2: Some optimization guides suggest not using quads but geometry that more closely "fits" the texture, but that seems like a really tedious process, and I don't think it would help me much.
Drawing front-to-back is normally a benefit because of early-z: the hardware can do the depth test right after rasterization, before doing the texture fetch or shading. With front-to-back sorting, most fragments fail the depth test, and you save a lot of texture bandwidth, shading throughput, and zbuffer-write bandwidth.
But alpha test breaks that. If a fragment passes the depth test, it might still be killed by alpha test, so zwrite can't happen until after texturing/shading. Most hardware that can do early-z still has to do the depth test at the same point in the pipeline as it does zwrite, so with alpha test you end up doing ztest + zwrite after texturing and shading. As a result, front-to-back sorting only saves you zwrite bandwidth, nothing else.
I think you have two options, if you really want large sprites that overlap significantly:
(a) Only use two or three distinct Z values for your sprites. Draw them back-to-front with blending (and alpha test, if it helps). With no overlap within a layer, you can pre-render each layer, either in the original assets or once at runtime, and then just shift it left and right.
(b) If your sprites have large opaque regions surrounded by a semi-transparent border, you can draw the opaque regions in a first pass with no alpha test, then draw borders as a separate pass. This will cut down on the number of alpha-tested fragments.
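If option (b) fits your content, a minimal sketch of the two passes on GL ES 1.1 could look like this. drawOpaqueInteriors and drawSpriteBorders are hypothetical helpers for the two sets of geometry, and splitting each sprite into an interior quad and a border strip has to happen in your assets or at load time:

// Pass 1: opaque interiors, front to back, no alpha test, full early-z benefit.
gl.glDisable(GL10.GL_ALPHA_TEST);
gl.glDisable(GL10.GL_BLEND);
gl.glEnable(GL10.GL_DEPTH_TEST);
gl.glDepthMask(true);
drawOpaqueInteriors(gl);

// Pass 2: only the borders, alpha tested (this matches the opaque/fully-transparent
// case from the question); far fewer fragments pay the alpha-test cost here.
gl.glEnable(GL10.GL_ALPHA_TEST);
gl.glAlphaFunc(GL10.GL_EQUAL, 1.0f);
drawSpriteBorders(gl);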
I'm using OpenGL ES 2.0 in a project I have been working on, with a couple of shader components that define what a texture, created from a Bitmap, should look like after modifications. In my project the SurfaceView will only ever contain a single image.
After trying several different approaches and looking through code over the past 24 hours, I'm just hoping for a quick response or two from the community. I'm not looking for full solutions; I'll do that research myself.
It sounds as though, since we are using shaders, in order to scale and move the texture based on touch events I will have to use the Matrix utilities and OpenGL translations or camera movements to get the same effect as what is currently done within an ImageView. Would that be the appropriate approach? Perhaps even modify the shader code so that I have some additional input variables?
I don't believe there is anything on the Android side that would get the same effect, such as modifying the SurfaceView's canvas or altering the dimensions of the UI in some other fashion. Is that right?
Thanks. Again, solutions for zooming and moving around aren't necessary, just trying to get a grasp on intermixing OpenGL and Android appropriately for the task.
Why does it seem that several things are easier in 1.0 than in 2.0? Ease of use should improve between releases.
Yes. You will need to use an ortho projection and adjust the extents to zoom. See this link here. To pan, you can simply use a glTranslatef.
If you would like to do this entirely in the pixel shader, you can use the texture matrix stack with glScalef and glTranslatef.
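Since the question is about ES 2.0, here is a hedged sketch of the same idea built with android.opengl.Matrix and handed to the shader as a uniform. viewWidth, viewHeight, zoom, panX, panY and uMvpMatrixLocation are hypothetical names from your renderer, and the world-units-equal-pixels assumption is just for illustration:

// Zoom by shrinking the ortho extents, pan with a translate.
float[] projection = new float[16];
float[] mvp = new float[16];

float halfWidth  = (viewWidth  / 2f) / zoom;   // zoom > 1 means zoomed in
float halfHeight = (viewHeight / 2f) / zoom;
Matrix.orthoM(projection, 0, -halfWidth, halfWidth, -halfHeight, halfHeight, -1f, 1f);

// Pan the "camera" by translating the world in the opposite direction.
Matrix.translateM(mvp, 0, projection, 0, -panX, -panY, 0f);

GLES20.glUniformMatrix4fv(uMvpMatrixLocation, 1, false, mvp, 0);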
I was reading this article, and the author writes:
Here's how to write high-performance applications on every platform in two easy steps:
[...]
Follow best practices. In the case of Android and OpenGL, this includes things like "batch draw calls", "don't use discard in fragment shaders", and so on.
I have never before heard that discard would have a bad impact on performance or such, and have been using it to avoid blending when a detailed alpha hasn't been necessary.
Could someone please explain why and when using discard might be considered bad practice, and how discard + depth test compares with alpha + blend?
Edit: After receiving an answer to this question, I did some testing by rendering a background gradient with a textured quad on top of it.
Using GL_DEPTH_TEST and a fragment shader ending with the line if (gl_FragColor.a < 0.5) { discard; } gave about 32 fps.
Removing the if/discard statement from the fragment shader increased the rendering speed to about 44 fps.
Using GL_BLEND with the blend function (GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA) instead of GL_DEPTH_TEST also resulted in around 44 fps.
It's hardware-dependent. For PowerVR hardware, and other GPUs that use tile-based rendering, using discard means that the TBR can no longer assume that every fragment drawn will become a pixel. This assumption is important because it allows the TBR to evaluate all the depths first, then only evaluate the fragment shaders for the top-most fragments. A sort of deferred rendering approach, except in hardware.
Note that you would get the same issue from turning on alpha test.
"discard" is bad for every mainstream graphics acceleration technique - IMR, TBR, TBDR. This is because visibility of a fragment(and hence depth) is only determinable after fragment processing and not during Early-Z or PowerVR's HSR (hidden surface removal) etc. The further down the graphics pipeline something gets before removal tends to indicate its effect on performance; in this case more processing of fragments + disruption of depth processing of other polygons = bad effect
If you must use discard, make sure that only the tris that need it are rendered with a shader containing it, and, to minimise its effect on overall rendering performance, render your objects in this order: opaque, discard, blended.
Incidentally, only PowerVR hardware determines visibility in the deferred step (hence it's the only GPU termed as "TBDR"). Other solutions may be tile-based (TBR), but are still using Early Z techniques dependent on submission order like an IMR does.
TBRs and TBDRs do blending on-chip (faster, less power-hungry than going to main memory) so blending should be favoured for transparency. The usual procedure to render blended polygons correctly is to disable depth writes (but not tests) and render tris in back-to-front depth order (unless the blend operation is order-independent). Often approximate sorting is good enough. Geometry should be such that large areas of completely transparent fragments are avoided. More than one fragment still gets processed per pixel this way, but HW depth optimisation isn't interrupted like with discarded fragments.
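As a concrete illustration of that blended pass, here is a sketch using GLES20; drawTransparentsBackToFront is a hypothetical method wrapping your sorted draw calls:

// Keep the depth test on, turn depth writes off, and submit the (roughly)
// back-to-front sorted transparent geometry after all opaque geometry.
GLES20.glEnable(GLES20.GL_DEPTH_TEST);
GLES20.glDepthMask(false);                       // test against opaque depth, but don't write
GLES20.glEnable(GLES20.GL_BLEND);
GLES20.glBlendFunc(GLES20.GL_SRC_ALPHA, GLES20.GL_ONE_MINUS_SRC_ALPHA);

drawTransparentsBackToFront();                   // hypothetical: your sorted draw calls

GLES20.glDepthMask(true);                        // restore state for the next frame's opaque pass
GLES20.glDisable(GLES20.GL_BLEND);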
Also, just having an "if" statement in your fragment shader can cause a big slowdown on some hardware. (Specifically, GPUs that are heavily pipelined, or that do single instruction/multiple data, will have big performance penalties from branch statements.) So your test results might be a combination of the "if" statement and the effects that others mentioned.
(For what it's worth, testing on my Galaxy Nexus showed a huge speedup when I switched to depth-sorting my semitransparent objects and rendering them back to front, instead of rendering in random order and discarding fragments in the shader.)
Object A is in front of Object B. Object A has a shader using 'discard'. As such, I can't do 'Early-Z' properly because I need to know which sections of Object B will be visible through Object A. This means that Object A has to pass all the way through the processing pipeline until almost the last moment (until fragment processing is performed) before I can determine if Object B is actually visible or not.
This is bad for HSR and Early-Z, as potentially occluded objects have to sit and wait for the depth information to be updated before they can be processed. As has been stated above, it's bad for everyone, or, to put it in a slightly friendlier way: "Friends don't let friends use discard".
In your test, your if statement costs you at the per-pixel level:
if (gl_FragColor.a < 0.5) { discard; }
would be processed once for every pixel being rendered (pretty sure that's per pixel and not per texel).
If your if statement were testing a uniform or constant, you'd most likely get a different result, because constants are only processed once at compile time and uniforms are processed once per update.
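To illustrate the difference, here are two hedged fragment-shader sketches (written as Java string constants, with hypothetical uniform names): the first branches on a per-fragment value, the second only on a uniform that is constant for the whole draw call, which drivers can typically resolve far more cheaply.

static final String PER_FRAGMENT_BRANCH_FS =
        "precision mediump float;\n"
        + "uniform sampler2D uTexture;\n"
        + "varying vec2 vTexCoord;\n"
        + "void main() {\n"
        + "    vec4 color = texture2D(uTexture, vTexCoord);\n"
        + "    if (color.a < 0.5) { discard; }\n"   // decided per fragment
        + "    gl_FragColor = color;\n"
        + "}\n";

static final String UNIFORM_BRANCH_FS =
        "precision mediump float;\n"
        + "uniform sampler2D uTexture;\n"
        + "uniform bool uUseTint;\n"                 // same value for every fragment in the draw
        + "uniform vec4 uTint;\n"
        + "varying vec2 vTexCoord;\n"
        + "void main() {\n"
        + "    vec4 color = texture2D(uTexture, vTexCoord);\n"
        + "    gl_FragColor = uUseTint ? color * uTint : color;\n"
        + "}\n";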
I'm having a small dilemma.
I'm working with Android (not that it is relevant) and I noticed one thing on some phones and not others: the scissor prevents the call glClear(GL_COLOR_BUFFER_BIT) from working correctly.
Therefore I'm wondering if it is better to do:
gl.glDisable(GL10.GL_SCISSOR_TEST);
gl.glClear(GL10.GL_COLOR_BUFFER_BIT);
gl.glEnable(GL10.GL_SCISSOR_TEST);
or
gl.glScissor(0, 0, 800, 480);
gl.glClear(GL10.GL_COLOR_BUFFER_BIT);
Basically, is it better to change the scissor test region before the clear, or is it better to disable and re-enable the scissor test?
This is exactly what the glScissor specification says should happen, since glClear is considered a drawing command in OpenGL. So the behavior you see is perfectly normal (D3D works similarly); it's the other phones, where it funnily seems to "work" for you, that are buggy!
As for which solution to choose, I don't know; both are valid. It's a matter of taste, I would say. I prefer the first one because it's easier to figure out what happens.
Now, if on your OpenGL implementation the second solution turns out to be faster than the first one, I would pick the second one. Benchmark!
Under the glHood:
Let me share what I know about desktop GPUs and speculate a bit (don't take it too seriously!). A glClear command actually results in drawing a full-screen quad, since drawing triangles is the "fast path" on GPUs. This is probably more efficient than DMAs or fixed hardware, since it's parallel (each shader core clears its portion of the screen: color, z and stencil) and it avoids the cost of dedicated hardware.
As for glScissor, I've heard it's implemented via stencil buffering (the same mechanism as the usual OpenGL stencil buffer), so only fragments that fall inside the scissor zone can participate in the depth test, fragment shading, blending, etc. (this is done for the same reason as with glClear: avoiding dedicated hardware). It could also be implemented as a fragment discard + dynamic branching on modern GPUs.
Now you can see why it works that way: only the fragments of the full-screen quad that lie within the scissor zone can "shade" the color buffer and clear it!
Which is the better way: glDrawArrays or glDrawElements? Is there any difference?
For both, you pass OpenGL some buffers containing vertex data.
glDrawArrays is basically "draw this contiguous range of vertices, using the data I gave you earlier".
Good:
You don't need to build an index buffer
Bad:
If you organise your data into GL_TRIANGLES, you will have duplicate vertex data for adjacent triangles. This is obviously wasteful.
If you use GL_TRIANGLE_STRIP and GL_TRIANGLE_FAN to try to avoid duplicating data, it isn't terribly effective, and you'd have to make a rendering call for each strip and fan. OpenGL calls are expensive and should be avoided where possible.
With glDrawElements, you pass in a buffer containing the indices of the vertices you want to draw.
Good:
No duplicate vertex data - you just index the same data for different triangles
You can just use GL_TRIANGLES and rely on the vertex cache to avoid processing the same data twice - no need to re-organise your geometry data or split rendering over multiple calls
Bad:
Memory overhead of index buffer
My recommendation is to use glDrawElements
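To make the difference concrete, here's a minimal GL ES 1.1 (GL10) sketch of one quad drawn both ways; sixVertexBuffer, fourVertexBuffer and quadIndexBuffer are hypothetical, pre-filled buffers:

gl.glEnableClientState(GL10.GL_VERTEX_ARRAY);

// glDrawArrays: 6 vertices, two of them duplicates of existing corners.
gl.glVertexPointer(3, GL10.GL_FLOAT, 0, sixVertexBuffer);   // 6 * 3 floats
gl.glDrawArrays(GL10.GL_TRIANGLES, 0, 6);

// glDrawElements: 4 unique vertices reused through a small index buffer (0,1,2, 2,1,3).
gl.glVertexPointer(3, GL10.GL_FLOAT, 0, fourVertexBuffer);  // 4 * 3 floats
gl.glDrawElements(GL10.GL_TRIANGLES, 6, GL10.GL_UNSIGNED_SHORT, quadIndexBuffer);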
The performance implications are probably similar on the iPhone; the OpenGL ES Programming Guide for iOS recommends using triangle strips and joining multiple strips through degenerate triangles.
The link has a nice illustration of the concept. This way you could reuse some vertices and still do all the drawing in one step.
For best performance, your models should be submitted as a single unindexed triangle strip using glDrawArrays with as few duplicated vertices as possible. If your models require many vertices to be duplicated (because many vertices are shared by triangles that do not appear sequentially in the triangle strip or because your application merged many smaller triangle strips), you may obtain better performance using a separate index buffer and calling glDrawElements instead. There is a trade off: an unindexed triangle strip must periodically duplicate entire vertices, while an indexed triangle list requires additional memory for the indices and adds overhead to look up vertices. For best results, test your models using both indexed and unindexed triangle strips, and use the one that performs the fastest.
Where possible, sort vertex and index data so that triangles that share common vertices are drawn reasonably close to each other in the triangle strip. Graphics hardware often caches recent vertex calculations, so locality of reference may allow the hardware to avoid calculating a vertex multiple times.
The downside is that you probably need a preprocessing step that sorts your mesh in order to obtain long enough strips.
I could not come up with a nice algorithm for this yet, so I cannot give any performance or space numbers compared to GL_TRIANGLES. Of course this is also highly dependent on the meshes you want to draw.
Actually, you can insert degenerate triangles to create one continuous strip, so you don't have to split your drawing into multiple calls when using glDrawArrays.
I have been using glDrawElements with GL_TRIANGLES, but I'm thinking about using glDrawArrays with GL_TRIANGLE_STRIP instead. That way there is no need to create an index vector.
Does anyone know more about the vertex cache mentioned above in one of the posts? I'm wondering about the performance of glDrawElements/GL_TRIANGLES vs glDrawArrays/GL_TRIANGLE_STRIP.
The accepted answer is slightly outdated. Following the doc link in Jorn Horstmann's answer, the OpenGL ES Programming Guide for iOS, Apple describes how to use the "degenerate triangles" trick with DrawElements, thereby gaining the best of both worlds.
The minor savings of a few indices from using DrawArrays isn't worth giving up the savings you get by combining all your data into a single GL call with DrawElements. (You could combine everything using DrawArrays, but then any "wasted elements" would be wasted vertices, which are much larger than indices and take more render time too.)
This also means you don't need to carefully consider each of your models, deciding whether it can be rendered as a minimal number of strips or is too complex for that. One uniform solution handles everything. (But do try to organize your data into strips where possible, to minimize the data sent and maximize the likelihood that the GPU re-uses recently cached vertex calculations.)
BEST: A single DrawElements call with GL_TRIANGLE_STRIP, containing all your data (that is changing in each frame).
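As a rough sketch of that approach on GL ES 1.1 (GL10), assuming the vertex pointer for vertices 0-7 is already set up, two separate quads can be bridged into one strip by repeating the last index of the first and the first index of the second, producing zero-area triangles that generate no fragments:

short[] indices = {
        0, 1, 2, 3,      // first quad as a strip
        3, 4,            // degenerate bridge: repeat 3, then 4
        4, 5, 6, 7       // second quad as a strip
};
ShortBuffer indexBuffer = ByteBuffer
        .allocateDirect(indices.length * 2)
        .order(ByteOrder.nativeOrder())
        .asShortBuffer()
        .put(indices);
indexBuffer.position(0);

// One call draws both quads; the four degenerate triangles in the bridge are rejected
// during rasterization because they have zero area.
gl.glDrawElements(GL10.GL_TRIANGLE_STRIP, indices.length, GL10.GL_UNSIGNED_SHORT, indexBuffer);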