Target: OpenGL ES 3.1
After each frame, I want to read back from my fragment shader the number of transparent fragments (i.e. fragments with 0 < alpha < 1) that the shader just wrote. Depending on this number, the CPU will then, in the next frame, use a different rendering technique (0: normal, fast rendering; >0: switch to an order-independent transparency method with an appropriately sized buffer, which is much slower).
I guess I don't even need this data immediately; I can tolerate a delay of 1 frame (i.e. after sending frame N to GPU my CPU could read back the number from previous frame N-1 - if that would help speed-wise).
The question is: what is the best way to read this single integer back?
Should I bind an FBO with GL_READ_FRAMEBUFFER and then use glReadPixels to read the first pixel? Or should I be using glMapBufferRange? Is it even a good idea? The point is, I need order-independent transparency, but a proper implementation of it (an A-buffer with per-pixel linked lists) is a much slower rendering technique, so I want to detect whether I actually need it at the moment (most of the time I won't) and only do it if there really are some transparent fragments on the screen.
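For what it's worth, here is a minimal sketch of one way this could be done on ES 3.1 with an atomic counter instead of reading a pixel back: the counter is incremented in the fragment shader and mapped on the CPU side. All names here (counterBuf, transparentCount) are illustrative, and in practice you would double-buffer the counter (or use a fence) so that mapping it doesn't stall on the frame you just submitted:

    /* Hedged sketch (ES 3.1): count transparent fragments with an atomic counter.
       The fragment shader would declare
           layout(binding = 0) uniform atomic_uint transparentCount;
       and call atomicCounterIncrement(transparentCount) whenever 0.0 < alpha < 1.0. */
    GLuint counterBuf;
    glGenBuffers(1, &counterBuf);
    glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, counterBuf);
    glBufferData(GL_ATOMIC_COUNTER_BUFFER, sizeof(GLuint), NULL, GL_DYNAMIC_READ);

    /* Each frame, before drawing: zero the counter and bind it to binding point 0. */
    GLuint zero = 0;
    glBindBufferBase(GL_ATOMIC_COUNTER_BUFFER, 0, counterBuf);
    glBufferSubData(GL_ATOMIC_COUNTER_BUFFER, 0, sizeof(GLuint), &zero);

    /* ... draw the scene ... */

    /* Preferably a frame later, map the buffer and read the count back. */
    glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, counterBuf);
    GLuint *p = (GLuint *)glMapBufferRange(GL_ATOMIC_COUNTER_BUFFER, 0,
                                           sizeof(GLuint), GL_MAP_READ_BIT);
    GLuint transparentFragments = p ? *p : 0;
    glUnmapBuffer(GL_ATOMIC_COUNTER_BUFFER);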
I'm trying to implement deferred rendering on an Android phone using OPENGL ES 3.0. I've gotten this to work ok but only very slowly, which rather defeats the whole point. What really slows things up is the multiple calls to the shaders. Here, briefly, is what my code does:
Geometry Pass:
Render scene - output position, normal and colour to off-screen buffers.
For each light:
a) Stencil Pass:
Render a sphere at the current light position, sized according to the light's intensity. Mark these pixels as influenced by the current light. No actual output.
b) Light Pass:
Render a sphere again, this time using the data from the geometry pass to apply the lighting equations to the pixels marked in the previous step. Add the result to an off-screen buffer.
Blit to screen
It's this restarting of the shaders for each light that causes the bottleneck. For example, with 25 lights the above steps run at about 5 fps. If instead I do: Geometry Pass / Stencil Pass - draw 25 lights / Light Pass - draw 25 lights, it runs at around 30 fps. So, does anybody know how I can avoid having to re-initialize the shaders? Or, in fact, just explain what's taking up the time? Would it help, or even be possible (and I'm sorry if this sounds daft), to keep the shader 'open' and overwrite the previous data rather than doing whatever it is that takes so much time restarting the shader? Or should I give this up as a method for handling multiple lights, on a mobile device anyway?
Well, I solved the problem of having to swap shaders for each light by using an integer texture as a stencil map, where a certain bit is set to represent each light (so I'm limited to 32 lights). This means step 2a (above) can be looped, then a single change of shader, and then step 2b can be looped. However (ahahaha!), it turns out that this didn't really speed things up, because it's not swapping shaders that's the problem after all, but changing the write destination - that is, multiple calls to glDrawBuffers. I had two such calls in the stencil-creation loop: one to draw nowhere when drawing a sphere to work out which pixels are influenced, and one to draw to the integer texture used as the stencil map. I finally realized that, since I use blending (each write is a colour with a single bit set), it doesn't matter if I also write at the pixel-calculation stage, as long as it's all zeros. Getting rid of the unnecessary calls to glDrawBuffers takes the FPS from single figures to the high twenties.
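To make the restructuring concrete, the passes end up shaped roughly like this (a hedged sketch only, not the actual code; drawLightSphere(), setLightUniforms() and the program/attachment names are placeholders):

    /* Rough shape of the reworked passes (illustrative only; the helper functions
       and attachment arrays stand in for the real code). */
    static const GLenum stencilTarget[] = { GL_COLOR_ATTACHMENT0 }; /* integer stencil-map texture */
    static const GLenum lightTarget[]   = { GL_COLOR_ATTACHMENT0 }; /* light accumulation buffer   */

    /* Stencil-map creation: one program, one glDrawBuffers call, all lights. */
    glUseProgram(stencilMapProgram);
    glDrawBuffers(1, stencilTarget);
    for (int i = 0; i < numLights; ++i) {
        glUniform1ui(lightBitLocation, 1u << i);   /* one bit per light, max 32 lights */
        drawLightSphere(i);                        /* placeholder for the sphere draw  */
    }

    /* Lighting: switch program once, then loop over the lights again. */
    glUseProgram(lightingProgram);
    glDrawBuffers(1, lightTarget);
    for (int i = 0; i < numLights; ++i) {
        setLightUniforms(i);                       /* placeholder */
        drawLightSphere(i);
    }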
In summary, this method of deferred rendering is certainly faster than forward rendering but limited to 32 lights.
I'd like to say that my code was written just to see if this was a viable method, and many small optimizations could still be made. Incidentally, as I was limited to 4 draw buffers, I had to scrap the position map and instead recover the position from gl_FragCoord.xyz. I don't have proper benchmarking tools, so I'd be interested to hear from anyone who can tell me what difference this makes speed-wise.
I'm working on a very simple, yet customizable OpenGL ES 2 rendering engine (I know that stuff like "unity" and "unreal engine" exist, and that reinventing the wheel probably isn't the sanest thing to do, just take it as a given ;-) ).
Now I'm facing object picking: I don't want to do ray-casting and I would like to do color picking instead (free laughs here: I'm using MRTs on ES3 in the current implementation. It works, but only where it works, if you catch my drift...).
AFAIK, when doing color picking you can either have two buffers (one for selection and one for rendering) or write twice to the same one: each approach has its pros and cons.
Assuming that I have an unknown number of objects, is it better:
to create two buffers and draw each object to each buffer before moving on to the next object (thus minimizing the number of uniforms you have to load, but switching buffers twice for each object)
to create a single buffer, draw all objects for selection, do the color picking, then draw everything for rendering (thus limiting the context switching, but increasing program switching and uniform loading)
I think that the question can be summarized as: "is it more expensive to switch buffers repetitively, or to switch programs and load uniforms"?
Oh... and feel free to tell me if the question doesn't make sense :)
On mobile, the number of draw calls you make is one of the most important performance factors to consider; the driver overhead for each draw call is generally huge (bigger on Android than on iOS in general), so drawing the objects twice will not be good for performance (it will use more CPU because of the cost of the driver calls).
If you have no more than 256 different objects, then a simple solution without MRT would be to use an RGBA render target (instead of RGB) and store the object ID in the alpha channel (i.e. as a grayscale value).
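To illustrate, the CPU side of that trick is just a one-pixel read under the cursor (a hedged sketch; the picking pass is assumed to have written objectId / 255.0 into alpha, and the variable names are made up):

    /* After the picking pass, read the single pixel under the cursor; the alpha
       byte is the object ID (0..255). clickX/clickY and the ID scheme are illustrative. */
    GLubyte pixel[4];
    glReadPixels(clickX, viewportHeight - 1 - clickY,   /* GL's origin is bottom-left */
                 1, 1, GL_RGBA, GL_UNSIGNED_BYTE, pixel);
    int pickedId = pixel[3];                            /* e.g. 0 = nothing selected  */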
Otherwise, you should at all costs avoid switching buffers repeatedly, or you will get slow load/store operations (the GPU is forced to memcpy the current buffer to "back it up" when switching to a new one - imagine doing that hundreds of times per frame...).
So, to answer your question: it costs more to switch buffers repeatedly than to switch programs and load uniforms.
PS: if you have more than 256 objects, you could try to render in "slices" of 256 objects; after each slice you glReadPixels the buffer and check for object selection in the alpha channel, then glClear only the alpha and continue with the next slice. But note that this may not be very efficient either, because as soon as you want to read the buffer pixels the CPU has to stall, waiting for the GPU to finish rendering, so you are breaking CPU/GPU parallelism.
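If you can target ES 3.0, one hedged way to soften that stall is to read into a pixel-pack buffer and map it a frame later (sketch only; pbo and the click coordinates are illustrative):

    /* Hedged sketch (needs ES 3.0+): read into a pixel-pack buffer now and map it
       a frame later, so the CPU does not stall on the GPU immediately. */
    GLuint pbo;
    glGenBuffers(1, &pbo);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    glBufferData(GL_PIXEL_PACK_BUFFER, 4, NULL, GL_STREAM_READ);

    /* With a pixel-pack buffer bound, the last argument is an offset into the PBO. */
    glReadPixels(clickX, clickY, 1, 1, GL_RGBA, GL_UNSIGNED_BYTE, (void *)0);

    /* ... ideally on the next frame ... */
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    GLubyte *px = (GLubyte *)glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, 4, GL_MAP_READ_BIT);
    int pickedId = px ? px[3] : 0;
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);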
I created a voxel world using OpenGL ES 2.0, with a VBO that stores a basic cube and a different position matrix for each cube. I can get 30 fps on my Galaxy S3 when 500-600 cubes are being rendered, but with anything more than 1500 cubes it can't run faster than 8 fps. This is unacceptable, because the voxel world should be able to handle more than 5,000 voxels rendered at a stable 30 fps. I have played other mobile games on my phone that run at good frame rates and render far more than 5,000 blocks at a time. What kind of techniques would be best for getting good performance?
Here is what I have set up in more detail:
There is one VBO containing vertex information for a basic cube.
Each block has its own matrix that translates it to the block's position in world space (this matrix is calculated only once, when the block is created). Each block calls glDrawArrays to draw the cube using its position matrix. Unfortunately this means there are thousands of calls to glDrawArrays in each frame.
Is there a better technique for this? I don't know how to group all the blocks into one single call to glDrawArrays, because that would mean the VBO would need a huge allocation to hold the vertex data for every single cube, and it is impossible to know how much space the VBO will need before drawing them. What I was thinking was to allocate a VBO for every 500 or so blocks, so that if more space is needed a new VBO can always be created. That way it wouldn't be allocating too much extra space, since each VBO only holds enough space for 500 blocks, and if there are 5,000 blocks in the world there will be only 10 calls to glDrawArrays instead of thousands.
Another idea I have is that, instead of having a VBO for the cube, I could make a VBO for a quad and use a transformation matrix on each quad. This would require even more calls to glDrawArrays, since I would have to call it for each face of each cube, but the upside is that I can skip the faces that already have a block next to them. At floor level, each block has 4 blocks surrounding it, so those 4 faces don't actually need to be drawn. This would save drawing those 4 quads for each block, but it would require more than double the number of glDrawArrays calls. To reduce the number of glDrawArrays calls I could create a new VBO for every 500 or so quads, and add/remove quads from the current VBOs whenever necessary. This would reduce the number of glDrawArrays calls, but it would mean I have to group each quad by its texture, which is another issue: if I have to create a VBO per texture, I might allocate a lot of unnecessary space, because there might be just one block using a certain texture and I could end up allocating space for 500 blocks for that texture.
These are my thoughts on some of the methods I can think of to optimise the rendering, but I don't think any of these techniques will drastically improve the fps of the game, because every method comes with its own issues. Is there anything that I have not thought of that could be a better solution?
EDIT: I switched to rendering quads instead of cubes, because this way I can skip the faces that are not visible. After that I also added frustum culling, so that only blocks visible inside the frustum are drawn. This increased the performance to the point where I can render a decent-sized world at 30 fps now. But I think there is still a lot of room for improvement, because there are currently 23,000 calls to glDrawArrays(GL_TRIANGLES) (one for each quad rendered on screen). Would switching to glDrawArrays(GL_TRIANGLE_STRIP) make any real difference? And creating VBOs that hold 1,000 quads each instead of just 1 quad is also a possibility, but that would mean allocating a lot more space in the VBOs. (Right now there is only one quad stored in the VBO, which is transformed by a matrix to its position/rotation.)
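For what it's worth, the "1,000 quads per VBO" idea from the edit above could be sketched roughly like this (everything here - the sizes, batchVbo, the appendQuad() helper mentioned in the comments - is illustrative, not working code from the question):

    /* Illustrative batching: instead of one glDrawArrays per quad, append the
       pre-transformed vertices of each visible face (2 triangles = 6 vertices)
       into one array and draw the whole batch with a single call. */
    #define MAX_QUADS_PER_BATCH 1000
    #define FLOATS_PER_VERTEX   5                 /* x, y, z, u, v */
    #define FLOATS_PER_QUAD     (6 * FLOATS_PER_VERTEX)

    static GLfloat batch[MAX_QUADS_PER_BATCH * FLOATS_PER_QUAD];
    int quadCount = 0;                            /* filled by an appendQuad() helper */

    /* Upload whatever the batch currently holds and draw it in one call. */
    glBindBuffer(GL_ARRAY_BUFFER, batchVbo);      /* batchVbo: a VBO sized for the batch */
    glBufferSubData(GL_ARRAY_BUFFER, 0,
                    quadCount * FLOATS_PER_QUAD * sizeof(GLfloat), batch);
    glDrawArrays(GL_TRIANGLES, 0, quadCount * 6); /* one call for up to 1000 quads */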
If using octrees (which is definitely THE WAY) does not suit you, you can optimize the code that issues the VBO draw calls.
In my work, I started with a scene rendering at a 3 fps rate; just by optimizing the OpenGL calls and context switches, it now runs at 53 fps (which is quite fine considering the starting point).
So, try not to change any state inside the GPU between calls:
order all the objects that use the same shader so you render them together, using only one glUseProgram
order objects by transparency, so you only draw the translucent objects at the end
draw objects in such a way that fragments are shaded only once (if an object is behind another, draw the front object first, because the depth test is faster than the fragment calculation)
use shaders without "discard;", which is costly for the GPU to process
use reversed loops to get a little bit of CPU speed
don't bind a texture if it is already the one currently bound in the GPU (a CPU 'if' is less costly than a GPU state change) - see the small sketch after this list
try not to update shader attributes/uniforms if there is no need to (again, a CPU 'if' is less costly)
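For instance, the texture-binding point above boils down to a tiny CPU-side cache, something like this sketch (lastTexture is just a value kept by your renderer):

    /* Tiny illustration of "check on the CPU before touching GPU state";
       lastTexture is just a value cached by your renderer. */
    static GLuint lastTexture = 0;

    void bindTextureCached(GLuint tex)
    {
        if (tex != lastTexture) {                  /* cheap CPU branch...            */
            glBindTexture(GL_TEXTURE_2D, tex);     /* ...instead of a redundant bind */
            lastTexture = tex;
        }
    }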
If you post some pieces of code, I can help you better.
I am currently implementing a voxel world using java on a normal PC with OpenGL 4.x.
At the beginning I had the same issue, but then I followed a very basic tutorial: https://sites.google.com/site/letsmakeavoxelengine/
With one render call per chunk there is no problem having 10 chunks of 32*32*32 blocks rendered (FPS > 30). You should load the chunk and add only those faces which are not occluded by other faces (so that they are visible to the player) to an array, which then gets uploaded to a VBO. That way you have one render call per chunk with the minimum amount of faces.
In 2D it looks like this:
_ _ _
|B B B|
|B B |
|B B B|
- - -
There is no need to draw the faces between the outer faces. In addition you can use frustum culling: How to check if an object lies outside the clipping volume in OpenGL?
So you only need to make a render call for those chunks which are actually inside your frustum. Do not render chunks behind the camera: OpenGL would do a lot of calculations for all the vertices of the chunk, but the chunk isn't visible, so why render it in the first place? This check can happen in your Java code.
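Going back to the per-chunk meshing above, a rough sketch of building such a chunk mesh could look like this (isSolid(), addFaceVertices(), the FACE_* constants and the mesh/chunk/Vertex types are placeholders for your own code, and isSolid() is assumed to handle coordinates that fall outside the chunk):

    /* Illustrative chunk meshing: for each solid block, emit only the faces whose
       neighbour is empty; the result is uploaded once and drawn with one call. */
    for (int x = 0; x < 32; ++x)
      for (int y = 0; y < 32; ++y)
        for (int z = 0; z < 32; ++z) {
          if (!isSolid(chunk, x, y, z)) continue;
          if (!isSolid(chunk, x + 1, y, z)) addFaceVertices(mesh, x, y, z, FACE_POS_X);
          if (!isSolid(chunk, x - 1, y, z)) addFaceVertices(mesh, x, y, z, FACE_NEG_X);
          if (!isSolid(chunk, x, y + 1, z)) addFaceVertices(mesh, x, y, z, FACE_POS_Y);
          if (!isSolid(chunk, x, y - 1, z)) addFaceVertices(mesh, x, y, z, FACE_NEG_Y);
          if (!isSolid(chunk, x, y, z + 1)) addFaceVertices(mesh, x, y, z, FACE_POS_Z);
          if (!isSolid(chunk, x, y, z - 1)) addFaceVertices(mesh, x, y, z, FACE_NEG_Z);
        }

    glBindBuffer(GL_ARRAY_BUFFER, chunkVbo);
    glBufferData(GL_ARRAY_BUFFER, mesh->vertexCount * sizeof(Vertex),
                 mesh->vertices, GL_STATIC_DRAW);
    /* Per frame: a single glDrawArrays(GL_TRIANGLES, 0, mesh->vertexCount) per chunk. */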
A third optimization could be deferred shading: http://en.wikipedia.org/wiki/Deferred_shading
As far as I know, shading is normally done before the depth test throws away the triangles/faces occluded by others, so deferred shading can speed up your shader because you only shade the fragments that actually pass the depth test.
There are a lot more ways to optimize voxel rendering, but for me these are the most basic operations. The tutorial behind the first link isn't finished yet, but it shows a lot of ideas for optimizing voxel rendering.
Edit:
If you want to use textures, with a different texture for each cube, I recommend placing all textures in one big texture (an atlas), so you don't need to swap textures: a simple texture lookup is much faster than swapping a texture (glBindTexture(..)), doing a lookup and then swapping back. Use one big texture and apply the right UV coordinates to your vertices.
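For example, picking the sub-rectangle of a tile inside a square atlas is just a bit of arithmetic (a hedged sketch; ATLAS_TILES and tileIndex are made-up names):

    /* Hedged sketch: for a square atlas of ATLAS_TILES x ATLAS_TILES equal tiles,
       tileIndex selects the sub-rectangle whose UVs go on the quad's corners. */
    #define ATLAS_TILES 16                       /* e.g. a 16x16 grid of block textures */
    float tileSize = 1.0f / ATLAS_TILES;
    float u0 = (tileIndex % ATLAS_TILES) * tileSize;
    float v0 = (tileIndex / ATLAS_TILES) * tileSize;
    float u1 = u0 + tileSize;
    float v1 = v0 + tileSize;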
You should use BSP octrees to discard big blocks of off-screen cubes.
You divide the world into 8 "space cubes" which go along the different axes.
Then you check if the camera can see something inside each cube; if it can't, you discard all the blocks in that section (which can give up to an 8x speed-up). Then, inside each visible block, you divide again into 8 sections and check again if they are visible. And so on, speeding up both the checks and the renders.
http://en.wikipedia.org/wiki/Octree
http://i.ytimg.com/vi/S-oIeUiw2UY/hqdefault.jpg
Octrees can be accelerated using "portals" (and I don't mean GLaDOS ;) ), which discard voxels and octree nodes depending on the visibility through doors and windows, but this is only good for interiors.
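For illustration, culling with such a tree boils down to a recursive walk, roughly like this sketch (nodeVisible() and drawNodeContents() are placeholders for your own frustum test and draw code):

    /* Illustrative recursive culling over an octree; nodeVisible() stands in for
       whatever frustum/bounding-box test you use, drawNodeContents() for the draw. */
    typedef struct OctreeNode {
        struct OctreeNode *children[8];          /* NULL for leaf nodes */
        /* bounding box, block data, ... */
    } OctreeNode;

    void renderOctree(const OctreeNode *node)
    {
        if (node == NULL || !nodeVisible(node))  /* whole subtree off-screen: skip it */
            return;
        drawNodeContents(node);                  /* placeholder: draw this node's blocks */
        for (int i = 0; i < 8; ++i)
            renderOctree(node->children[i]);
    }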
External requirements --- you have to hate them...
I have an OpenGL ES game, which uses EGL and OpenGL ES to draw on the screen. I don't have source to this; it's supplied as a binary blob. I'm implementing the interface layer that mediates between the game's calls to EGL and OpenGL and the platform's implementation.
It works fine. But I now have the unexpected external requirement that I need to be able to rotate the entire game's output 90 degrees.
Can anyone suggest any good (easy, fast) ways to do this? Off the top of my head, I can think of:
insert the appropriate transformation into the game's projection matrix. This seems to me to be the fastest solution; but I don't think I have enough knowledge of the game's manipulation of the projection matrix to do this reliably. Plus it'll confuse the game if it uses any OpenGL calls to access the screen which don't go through the projection matrix. (glReadPixels(), for example.)
give the game a rendering context to an off-screen buffer; it renders there, and then when the game calls eglSwapBuffers() I copy the result onto the screen. Render-to-texture would help here. Problems: this will affect performance as I'm effectively doing two drawing passes instead of one; and render-to-texture isn't standardised in OpenGL ES. (My target platform, Android, doesn't even reliably support shared contexts.)
render into the colour buffer, then use glReadPixels() to copy the data out and do a software rotate onto the screen. Problems: dead slow, and I have no control of the size of the buffer (i.e. if the screen is 640x480 and we're drawing 90° rotated, I really want to give the game a 480x640 colour buffer).
other?
Game-specific hacks aren't an option here because I need to be able to swap out the game binary with another one; this has to be a generic fix. Changing the game isn't an option because we don't have control of the game source code.
Any suggestions? Other than the non-technical one of trying to persuade the requirement to go away?
What is the issue with using glRotate along the z axis?
Approach 1 is the way to go.
Pixel operations are heavy, and it is possible that you could mess up the aspect ratio, etc.
The steps which go into drawing are:
1. Set the transformation matrix (the model/projection). If landscape, apply the glRotate.
2. Set the viewport (this might change each time you rotate the screen): if landscape, set (a, b) as height/width respectively; if portrait, set (b, a) as height/width respectively.
3. Draw the scene.
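For illustration, a minimal fixed-function sketch of those steps (ES 1.x style; the projection values and the landscape flag are placeholders, not from the original answer):

    /* Hedged fixed-function sketch (OpenGL ES 1.x): rotate the whole output 90
       degrees around z and swap the aspect ratio to match. Values are illustrative. */
    float aspect = landscape ? (float)screenHeight / (float)screenWidth
                             : (float)screenWidth  / (float)screenHeight;

    glViewport(0, 0, screenWidth, screenHeight);          /* physical screen size     */

    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    if (landscape)
        glRotatef(90.0f, 0.0f, 0.0f, 1.0f);               /* rotate the output on z   */
    glFrustumf(-aspect, aspect, -1.0f, 1.0f, 1.0f, 100.0f); /* stand-in projection    */

    glMatrixMode(GL_MODELVIEW);
    /* ... draw the scene as usual ... */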
When you rotate the screen, the objects are rendered again. So glRotate is the best way to go.
I was reading this article, and the author writes:
Here's how to write high-performance applications on every platform in two easy steps:
[...]
Follow best practices. In the case of Android and OpenGL, this includes things like "batch draw calls", "don't use discard in fragment shaders", and so on.
I had never heard before that discard could have a bad impact on performance, and I have been using it to avoid blending when a detailed alpha hasn't been necessary.
Could someone please explain why and when using discard might be considered bad practice, and how discard + depth test compares with alpha + blend?
Edit: After having received an answer on this question I did some testing by rendering a background gradient with a textured quad on top of that.
Using GL_DEPTH_TEST and a fragment shader ending with the line "if( gl_FragColor.a < 0.5 ){ discard; }" gave about 32 fps.
Removing the if/discard statement from the fragment shader increased the rendering speed to about 44 fps.
Using GL_BLEND with the blend function (GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA) instead of GL_DEPTH_TEST also resulted in around 44 fps.
It's hardware-dependent. For PowerVR hardware, and other GPUs that use tile-based rendering, using discard means that the TBR can no longer assume that every fragment drawn will become a pixel. This assumption is important because it allows the TBR to evaluate all the depths first, then only evaluate the fragment shaders for the top-most fragments. A sort of deferred rendering approach, except in hardware.
Note that you would get the same issue from turning on alpha test.
"discard" is bad for every mainstream graphics acceleration technique - IMR, TBR, TBDR. This is because visibility of a fragment(and hence depth) is only determinable after fragment processing and not during Early-Z or PowerVR's HSR (hidden surface removal) etc. The further down the graphics pipeline something gets before removal tends to indicate its effect on performance; in this case more processing of fragments + disruption of depth processing of other polygons = bad effect
If you must use discard, make sure that only the triangles that need it are rendered with a shader containing it and, to minimise its effect on overall rendering performance, render your objects in the order: opaque, discard, blended.
Incidentally, only PowerVR hardware determines visibility in the deferred step (hence it's the only GPU termed as "TBDR"). Other solutions may be tile-based (TBR), but are still using Early Z techniques dependent on submission order like an IMR does.
TBRs and TBDRs do blending on-chip (faster, less power-hungry than going to main memory) so blending should be favoured for transparency. The usual procedure to render blended polygons correctly is to disable depth writes (but not tests) and render tris in back-to-front depth order (unless the blend operation is order-independent). Often approximate sorting is good enough. Geometry should be such that large areas of completely transparent fragments are avoided. More than one fragment still gets processed per pixel this way, but HW depth optimisation isn't interrupted like with discarded fragments.
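As a concrete illustration of that procedure, a typical (hedged) state setup for the blended pass might look like this:

    /* Typical state setup for the blended pass described above: keep the depth
       test, disable depth writes, draw transparent geometry back to front. */
    glEnable(GL_DEPTH_TEST);
    glDepthMask(GL_FALSE);                                /* test, but don't write    */
    glEnable(GL_BLEND);
    glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);

    /* ... draw blended objects here, sorted roughly back to front ... */

    glDepthMask(GL_TRUE);                                 /* restore for opaque work  */
    glDisable(GL_BLEND);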
Also, just having an "if" statement in your fragment shader can cause a big slowdown on some hardware. (Specifically, GPUs that are heavily pipelined, or that do single instruction/multiple data, will have big performance penalties from branch statements.) So your test results might be a combination of the "if" statement and the effects that others mentioned.
(For what it's worth, testing on my Galaxy Nexus showed a huge speedup when I switched to depth-sorting my semitransparent objects and rendering them back to front, instead of rendering in random order and discarding fragments in the shader.)
Object A is in front of Object B. Object A has a shader using 'discard'. As such, I can't do 'Early-Z' properly because I need to know which sections of Object B will be visible through Object A. This means that Object A has to pass all the way through the processing pipeline until almost the last moment (until fragment processing is performed) before I can determine if Object B is actually visible or not.
This is bad for HSR and 'Early-Z', as potentially occluded objects have to sit and wait for the depth information to be updated before they can be processed. As has been stated above, it's bad for everyone - or, to put it in a slightly friendlier way, "friends don't let friends use discard".
In your test, your if statement works at the per-pixel level:
if ( gl_FragColor.a < 0.5 ){ discard; }
It would be processed once for every pixel being rendered (pretty sure that's per pixel and not per texel).
If your if statement were testing a uniform or a constant, you'd most likely get a different result, because constants are only processed once at compile time and uniforms are processed once per update.