GLSL IF speed vs multiply factor

I know this has been asked in general terms, but the answer is always "it depends", so I'm asking a concrete question in the hope of getting a concrete answer.
I know that IFs in GLSL can be really expensive; on some hardware both sides of the branch are even executed.
I have a fragment shader from an example (a dual paraboloid shadow map) that uses ifs to decide which map to sample and how to compute the depth. It would be very easy to replace those ifs with a multiplier. The catch is that there is a texture sample inside each branch, so which is faster: using an if, or using a multiplier to filter out the unused data?
These are the proposed codes:
IF version:
// alpha is a variable computed on the fly, cannot be replaced
// (float literals are written as 0.0 / 0.5 so the code also compiles under GLSL ES)
float depth = 0.0;
float mydepth = 0.0;
if (alpha >= 0.5)
{
    depth = texture2D(ShadowFrontS, P0.xy).x;
    mydepth = P0.z;
}
else
{
    depth = texture2D(ShadowBackS, P1.xy).x;
    mydepth = P1.z;
}
Filter version:
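// step(0.5, alpha) is 1.0 when alpha >= 0.5 and 0.0 otherwise, matching the IF version
float mlt = step(0.5, alpha);
float depth = texture2D(ShadowFrontS, P0.xy).x * mlt;
float mydepth = P0.z * mlt;
mlt = 1.0 - mlt;
// the second sample must come from the back map, and both results are accumulated
depth += texture2D(ShadowBackS, P1.xy).x * mlt;
mydepth += P1.z * mlt;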
P.S.: I'm targeting desktop and mobile devices, so performance on low-end hardware is a must.

Branching is not "evil" per se on massively parallel SIMD architectures. If all the threads in a "bunch" (NVIDIA calls them warps) follow the same code path, i.e. they all take the same branches, everything is fine.
Only if a branch is taken by part of that bunch and not by the rest must both sides be executed, with the calculations and data fetches that are irrelevant for a given thread discarded afterwards.
In your case it takes some careful profiling to see which variant benefits your GPU more, but my gut instinct says it is the branching version. Why? Usually the value you branch on depends on the screen-space position, and large contiguous areas of fragments share the same code path. Performance penalties therefore occur only for those bunches that straddle a border region, and bunches usually cover only a small block of pixels (8×8 or 16×16).
The shader you have there is not compute limited (i.e. limited by the computational capabilities of the GPU) but memory-bandwidth limited, i.e. limited by the throughput of the GPU's memory link, because of the texture2D fetch operations. In that case, reducing the number of fetches, and thereby the required memory bandwidth, will probably benefit your program more than reducing the number of computations.
The branchless mix/multiply variant of your shader will always fetch both textures; the branching one will do so only within the border regions. So from that heuristic I'd guess that your branching variant is actually the better choice.
But to be sure, you have to profile it.
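The heuristic above can be made concrete with a toy cost model. The Java sketch below is only an illustration under stated assumptions, not a description of any real GPU: it assumes fragments are grouped into square tiles, that a tile whose fragments all agree on the branch pays one texture fetch per fragment, and that a mixed tile pays for both fetches. The class and method names (FetchCountModel, branchingFetches, branchlessFetches) are hypothetical.

```java
/** Toy cost model: counts texture fetches for a branching vs. a branchless shader. */
final class FetchCountModel {

    /** Fetches when every fragment samples both shadow maps (branchless variant). */
    static long branchlessFetches(int width, int height) {
        return 2L * width * height;
    }

    /**
     * Fetches when fragments branch on frontFacing[y][x].
     * Assumption: a tile whose fragments all agree samples one map per fragment,
     * while a mixed (divergent) tile pays for both code paths.
     */
    static long branchingFetches(boolean[][] frontFacing, int tile) {
        int height = frontFacing.length;
        int width = frontFacing[0].length;
        long fetches = 0;
        for (int ty = 0; ty < height; ty += tile) {
            for (int tx = 0; tx < width; tx += tile) {
                int yEnd = Math.min(ty + tile, height);
                int xEnd = Math.min(tx + tile, width);
                boolean first = frontFacing[ty][tx];
                boolean divergent = false;
                for (int y = ty; y < yEnd && !divergent; y++) {
                    for (int x = tx; x < xEnd; x++) {
                        if (frontFacing[y][x] != first) {
                            divergent = true;
                            break;
                        }
                    }
                }
                long fragments = (long) (yEnd - ty) * (xEnd - tx);
                fetches += divergent ? 2 * fragments : fragments;
            }
        }
        return fetches;
    }

    public static void main(String[] args) {
        // 64x64 screen; columns 0..35 use the front map, so the border cuts through one tile column.
        boolean[][] mask = new boolean[64][64];
        for (int y = 0; y < 64; y++) {
            for (int x = 0; x < 36; x++) {
                mask[y][x] = true;
            }
        }
        System.out.println("branchless fetches: " + branchlessFetches(64, 64));
        System.out.println("branching fetches:  " + branchingFetches(mask, 8));
    }
}
```

In this model only the tiles that straddle the border pay double, which is why the branching count stays close to one fetch per fragment. Real hardware may behave differently, so profiling remains the deciding factor.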

Related

Graphics taking 0.5 GB memory in libgdx

I am developing a simple 2D game. I have multiple sprites. Each sprite has around 80 PNG frames of 265*256. I used libGDX's TexturePacker to package the atlas. I am enabling mipmaps with the following code when packing:
TexturePacker.Settings settings = new TexturePacker.Settings();
settings.combineSubdirectories = true;
settings.filterMin = Texture.TextureFilter.MipMapNearestLinear;
settings.filterMag = Texture.TextureFilter.Linear;
TexturePacker.process(settings, f.getPath(), outputFolderName, atlasFileName);
Questions:
Are 80 images/frames for single sprite too much?
Is 0.5 GB memory usage too much for a simple game like Fruit ninja?
How can I reduce my memory usage?
Any other things I should try?
Update 1:
Here is the screenshot taken from the Android profiler.

80 frames is kind of a lot, but that number has never been a problem in my projects. That said, most of our 60-frame pre-rendered animations are small, like a flashing icon in 48x48 segments, packed on one sheet (so there is minimal context switching). This has never been a performance issue for us ... BUT 256x256 scares me a little, especially if the frames are uncropped and a lot of PNGs are created!
While my projects have sprites with similar frame counts, they have been optimized via the Texture Packer. Make sure Trim mode is set to "Trim" and not "None" (the setting is near the bottom, under Sprites). This strips the transparent borders from each 256x256 frame where possible, so the frames pack into smaller regions, reducing the total number of sheets and texture bindings (and, I think, context switches). I'm not totally sure how you packed your sprites or whether they can even be trimmed, but this could potentially be a performance lifesaver.
Let me show you an example:
[Image: packed atlas before trimming]
[Image: packed atlas after trimming]
Also if you provided the code in how your animations are created, we could double check to make sure you aren't binding 80 Textures when only 1 would be needed. I create my animations via the following process:
TextureAtlas atlas = assetManager.get(atlas_name);
Sprite[] spriteFrames = new Sprite[numFrames];
for (int index = 0; index < numFrames; index++) {
    if (atlas.findRegion(lookup_name, index) == null)
        Engine.console("[ERROR] problem loading and finding region for " + atlas_name + " " + lookup_name + " index: " + index + " not found in spritesheet .. fix immediately");
    else {
        spriteFrames[index] = atlas.createSprite(lookup_name, index);
    }
}
Animation<Sprite> animation = new Animation<Sprite>(timeBetweenFrames, spriteFrames);
I hope this helps provide some insight.
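To get a feel for why frame size and trimming matter, here is a rough back-of-the-envelope estimate in Java. It is only a sketch under assumptions that are not stated in the question: frames are held uncompressed as RGBA8888 (4 bytes per pixel), a full mipmap chain adds about one third on top, and atlas padding is ignored. The class name AtlasMemoryEstimate and the trimmed size of 128x160 are hypothetical.

```java
/** Rough estimate of texture memory for a set of animation frames. */
final class AtlasMemoryEstimate {

    /**
     * Bytes needed for uncompressed RGBA8888 frames.
     * Assumption: a full mipmap chain costs about 4/3 of the base level.
     */
    static long estimateBytes(int frames, int width, int height, boolean mipmaps) {
        long base = (long) frames * width * height * 4L; // 4 bytes per RGBA8888 pixel
        return mipmaps ? base * 4 / 3 : base;
    }

    public static void main(String[] args) {
        long untrimmed = estimateBytes(80, 256, 256, true);
        long trimmed = estimateBytes(80, 128, 160, true); // hypothetical size after trimming
        System.out.println("untrimmed, per sprite: " + untrimmed / (1024 * 1024) + " MiB");
        System.out.println("trimmed, per sprite:   " + trimmed / (1024 * 1024) + " MiB");
    }
}
```

Under these assumptions a single 80-frame sprite at 256x256 already costs roughly 20 MiB before mipmaps, so a couple of dozen such sprites would account for memory usage in the range reported in the question.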

How to do correct timing of Android RenderScript code on Nvidia Shield

I have implemented a small CNN in RenderScript and want to profile the performance on different hardware. On my Nexus 7 the times make sense, but on the NVIDIA Shield they do not.
The CNN (LeNet) is implemented in 9 layers residing in a queue, computation is performed in sequence. Each layer is timed individually.
Here is an example:
(times in ms)
          conv1    pool1    conv2    pool2    resh1    ip1      relu1    ip2      softmax
nexus7    11.177   7.813    13.357   8.367    8.097    2.1      0.326    1.557    2.667
shield    13.219   1.024    1.567    1.081    0.988    14.588   13.323   14.318   40.347
The distribution of the times is about right for the Nexus, with conv1 and conv2 (the convolution layers) taking most of the time. But on the Shield, the times drop far below what's reasonable for layers 2-4 and seem to pile up towards the end. The softmax layer is a relatively small job, so 40 ms is way too large. My timing method must be faulty, or something else is going on.
The code running the layers looks something like this:
double[] times = new double[layers.size()];
int layerindex = 0;
for (Layer a : layers) {
    double t = SystemClock.elapsedRealtime();
    //long t = System.currentTimeMillis(); // makes no difference
    blob = a.forward(blob); // here we call renderscript forEach_(), invoke_() etc
    //mRS.finish(); // makes no difference
    t = SystemClock.elapsedRealtime() - t;
    //t = System.currentTimeMillis() - t; // makes no difference
    times[layerindex] += t; // later we take average etc
    layerindex++;
}
It is my understanding that once forEach_() returns, the job is supposed to be finished. In any case, mRS.finish() should provide a final barrier. But looking at the times, the only reasonable explanation is that jobs are still processed in the background.
The app is very simple, I just run the test from MainActivity and print to logcat. Android Studio builds the app as a release and runs it on the device which is connected by USB.
(1) What is the correct way to time RenderScript processes?
(2) Is it true that when forEach_() returns, the threads spawned by the script are guaranteed to be done?
(3) In my test app, I simply run directly from the MainActivity. Is this a problem (other than blocking the UI thread and making the app unresponsive)? If this influences the timing or causes the weirdness, what is a proper way to set up a test app like this?

I've implemented CNNs in RenderScript myself, and as you explain, it does require chaining multiple processes and calling forEach_*() various times for each layer if you implement them each as a different kernel. As such, I can assure you that the forEach call returning does not really guarantee that the process has completed. In theory, this will only schedule the kernel and all queued up requests will actually run whenever the system determines it's best to, especially if they get processed in the tablet's GPU.
Usually, the only way to make absolutely sure you have some kind of control over a kernel truly running is by explicitly reading the output of the RS kernel in between layers, such as by using .copyTo() on the output allocation object of that kernel. This "forces" any queued up RS jobs that have not run yet (on which that layer's output allocation is dependent), to execute at that time. Granted, that may introduce data transfer overheads and your timing will not be fully accurate -- in fact, the execution time of the full network will quite surely be lower than the sum of the individual layers if timed in this manner. But as far as I know, it's the only reliable way to time individual kernels in a chain and it will give you some feedback to find out where bottlenecks are, and to better guide your optimization, if that's what you're after.
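A minimal sketch of that timing pattern in plain Java follows. It deliberately does not use the RenderScript API: the Layer interface, enqueue() and waitForOutput() are hypothetical stand-ins, where enqueue() would wrap the forEach_*() call and waitForOutput() would wrap a copyTo() on that layer's output allocation. The point is only that the forced read happens inside the timed region.

```java
import java.util.List;

/** Times each layer of a chain, forcing completion before the clock is stopped. */
final class LayerTimer {

    /** Hypothetical stand-in for one network layer; not a RenderScript type. */
    interface Layer {
        void enqueue();       // would schedule the kernel, e.g. via forEach_*(), and may return early
        void waitForOutput(); // would read the output back, e.g. via copyTo(), forcing execution
    }

    /** Returns per-layer wall-clock times in milliseconds. */
    static double[] timeLayers(List<Layer> layers) {
        double[] millis = new double[layers.size()];
        for (int i = 0; i < layers.size(); i++) {
            Layer layer = layers.get(i);
            long start = System.nanoTime();
            layer.enqueue();
            layer.waitForOutput(); // queued work must finish before the timer stops
            millis[i] = (System.nanoTime() - start) / 1e6;
        }
        return millis;
    }
}
```

As noted above, the read-back adds transfer overhead to every measurement, so the per-layer numbers are best used to compare layers with each other rather than summed into a total.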

Maybe a little bit off topic: but for CNN, if you can structure your algorithm using matrix-matrix multiplication as basic computing blocks you can actually use RenderScript IntrinsicBLAS, especially BNNM and SGEMM.
Pros:
High performance implementation of 8bit Matrix Multiplication (BNNM), available in N Preview.
Backward compatible to Android 2.3 through the RenderScript Support Library, when using Build-Tools 24.0.0 rc3 and above.
High performance GPU acceleration of SGEMM on Nexus5X and 6P with N Preview build NPC91K.
If you only use RenderScript Intrinsics, you can code everything in java.
Cons:
Your algorithm may need to be refactored so that it is based on 2D matrix multiplication.
Although BNNM is available in Android 6.0, its performance there is not satisfactory, so it is better to use the support lib for BNNM and set targetSdkVersion to 24.
SGEMM GPU acceleration is currently only available on Nexus 5X and Nexus 6P, and it currently requires the width and height of the matrices to be multiples of 8.
It's worth trying if BLAS fits into your algorithm. And it is easy to use:
import android.support.v8.renderscript.*;
// if you are not using support lib:
// import android.renderscript.*;
private void runBNNM(int m, int n, int k, byte[] a_byte, byte[] b_byte, int c_offset, RenderScript mRS) {
    Allocation A, B, C;
    Type.Builder builder = new Type.Builder(mRS, Element.U8(mRS));
    Type a_type = builder.setX(k).setY(m).create();
    Type b_type = builder.setX(k).setY(n).create();
    Type c_type = builder.setX(n).setY(m).create();
    // If you are reusing the input Allocations, just create and cache them somewhere else.
    A = Allocation.createTyped(mRS, a_type);
    B = Allocation.createTyped(mRS, b_type);
    C = Allocation.createTyped(mRS, c_type);
    A.copyFrom(a_byte);
    B.copyFrom(b_byte);
    ScriptIntrinsicBLAS blas = ScriptIntrinsicBLAS.create(mRS);
    // Computes: C = A * B.Transpose
    int a_offset = 0;
    int b_offset = 0;
    // c_offset comes from the method parameter; redeclaring it here would not compile.
    int c_multiplier = 1;
    blas.BNNM(A, a_offset, B, b_offset, C, c_offset, c_multiplier);
}
SGEMM is similar:
ScriptIntrinsicBLAS blas = ScriptIntrinsicBLAS.create(mRS);
// Construct the Allocations: A, B, C somewhere and make sure the dimensions match.
// Computes: C = 1.0f * A * B + 0.0f * C
float alpha = 1.0f;
float beta = 0.0f;
blas.SGEMM(ScriptIntrinsicBLAS.NO_TRANSPOSE, ScriptIntrinsicBLAS.NO_TRANSPOSE,
alpha, A, B, beta, C);

glsl programming architecture which part is "really" parallel execution?

I am trying to implement image processing algorithms like Gaussian filtering and bilateral filtering on the GPU using GLSL.
I am confused about which part is "really" executed in parallel. For example, I have a 1280*720 preview as a texture, and I am not sure which parts really run 1280*720 times and which do not.
What is the dispatching mechanism for GLSL code?
My Gaussian filtering code looks like this:
#extension GL_OES_EGL_image_external : require
precision mediump float;
varying vec2 vTextureCoord;
uniform samplerExternalOES sTexture;
uniform sampler2D sTextureMask;
void main() {
float r=texture2D(sTexture, vTextureCoord).r;
float g=texture2D(sTexture, vTextureCoord).g;
float b=texture2D(sTexture, vTextureCoord).b;
// a test sample
float test=1.0*0.5;
float width=1280.0;
float height=720.0;
vec4 sum;
//offsets of a 3*3 kernel
vec2 offset0=vec2(-1.0,-1.0); vec2 offset1=vec2(0.0,-1.0); vec2 offset2=vec2(1.0,-1.0);
vec2 offset3=vec2(-1.0,0.0); vec2 offset4=vec2(0.0,0.0); vec2 offset5=vec2(1.0,0.0);
vec2 offset6=vec2(-1.0,1.0); vec2 offset7=vec2(0.0,1.0); vec2 offset8=vec2(1.0,1.0);
//Gaussian kernel with sigma==100.0;
float kernelValue0 = 0.999900; float kernelValue1 = 0.999950; float kernelValue2 = 0.999900;
float kernelValue3 = 0.999950; float kernelValue4 =1.000000; float kernelValue5 = 0.999950;
float kernelValue6 = 0.999900; float kernelValue7 = 0.999950; float kernelValue8 = 0.999900;
vec4 cTemp0;vec4 cTemp1;vec4 cTemp2;vec4 cTemp3;vec4 cTemp4;vec4 cTemp5;vec4 cTemp6;vec4 cTemp7;vec4 cTemp8;
//getting 3*3 pixel values around current pixel
vec2 src_coor_2;
src_coor_2=vec2(vTextureCoord[0]+offset0.x/width,vTextureCoord[1]+offset0.y/height);
cTemp0=texture2D(sTexture, src_coor_2);
src_coor_2=vec2(vTextureCoord[0]+offset1.x/width,vTextureCoord[1]+offset1.y/height);
cTemp1=texture2D(sTexture, src_coor_2);
src_coor_2=vec2(vTextureCoord[0]+offset2.x/width,vTextureCoord[1]+offset2.y/height);
cTemp2=texture2D(sTexture, src_coor_2);
src_coor_2=vec2(vTextureCoord[0]+offset3.x/width,vTextureCoord[1]+offset3.y/height);
cTemp3=texture2D(sTexture, src_coor_2);
src_coor_2=vec2(vTextureCoord[0]+offset4.x/width,vTextureCoord[1]+offset4.y/height);
cTemp4=texture2D(sTexture, src_coor_2);
src_coor_2=vec2(vTextureCoord[0]+offset5.x/width,vTextureCoord[1]+offset5.y/height);
cTemp5=texture2D(sTexture, src_coor_2);
src_coor_2=vec2(vTextureCoord[0]+offset6.x/width,vTextureCoord[1]+offset6.y/height);
cTemp6=texture2D(sTexture, src_coor_2);
src_coor_2=vec2(vTextureCoord[0]+offset7.x/width,vTextureCoord[1]+offset7.y/height);
cTemp7=texture2D(sTexture, src_coor_2);
src_coor_2=vec2(vTextureCoord[0]+offset8.x/width,vTextureCoord[1]+offset8.y/height);
cTemp8=texture2D(sTexture, src_coor_2);
//convolution
sum =kernelValue0*cTemp0+kernelValue1*cTemp1+kernelValue2*cTemp2+
kernelValue3*cTemp3+kernelValue4*cTemp4+kernelValue5*cTemp5+
kernelValue6*cTemp6+kernelValue7*cTemp7+kernelValue8*cTemp8;
float factor=kernelValue0+kernelValue1+kernelValue2+kernelValue3+kernelValue4+kernelValue5+kernelValue6+kernelValue7+kernelValue8;
gl_FragColor = sum/factor;
//gl_FragColor=texture2D(sTexture, vTextureCoord);
}
This code runs at a lower fps than the plain preview on my phone (Galaxy Nexus).
But if I change the last part of the code to output the original pixel value directly, like
//gl_FragColor = sum/factor;
gl_FragColor=texture2D(sTexture, vTextureCoord);
it runs fast, at the same fps as the plain preview.
The question is: for things I wrote as a test that are otherwise useless, like this line near the beginning:
float test=1.0*0.5;
how many times is it executed?
And other parts, like:
sum =kernelValue0*cTemp0+kernelValue1*cTemp1+kernelValue2*cTemp2+
kernelValue3*cTemp3+kernelValue4*cTemp4+kernelValue5*cTemp5+
kernelValue6*cTemp6+kernelValue7*cTemp7+kernelValue8*cTemp8;
would not run 1280*720 times just when I change
gl_FragColor = sum/factor;
to
gl_FragColor=texture2D(sTexture, vTextureCoord);?
What mechanism decides which code runs 1280*720 times and which code is treated as useless when running in parallel across the pixels? Is it done automatically?
What is the architecture and dispatching model, and how is the data organized for the GPU, for a GLSL program?
I am wondering what I should do for more complicated operations like bilateral filtering, with kernel sizes like 9*9, which need 9 times as many samples per pixel as this 3*3 Gaussian kernel.

The entire fragment shader code is executed as a whole for each and every fragment. A fragment corresponds either to an output pixel, if no antialiasing is done, or, with multisample antialiasing, to the samples of the framebuffer. What exactly a fragment is, is not specified in detail by the OpenGL spec, other than that it is the output of the fragment stage, which is then turned into values on the framebuffer bitplanes.
"The rasterizer produces a series of framebuffer addresses and values using a two-dimensional description of a point, line segment, or polygon. Each fragment so produced is fed to the next stage that performs operations on individual fragments before they finally alter the framebuffer. These operations include ..."
[OpenGL-3.3 core spec, section 2.4]
would not run 1280*720 times just when I change
gl_FragColor = sum/factor;
to
gl_FragColor=texture2D(sTexture, vTextureCoord);?
Division is a costly and complex operation. Since the sum of the kernel is a constant and doesn't change per fragment, you shouldn't evaluate it in the shader. Evaluate it on the CPU, supply 1.0/factor as a uniform (a constant that is equal for all fragments), and multiply sum by it, which is much faster than dividing.
Your gaussian kernel is actually a 3×3 matrix, for which there is a dedicated type in GLSL. The calculations you perform can be rewritten in terms of dot products (mathematically correct term would be scalar or inner product), for which GPUs have dedicated, accelerated instructions.
Also you shouldn't split up the components of a texture into individual floats.
All in all you built quite a number of speed bumps into your code.
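As a sketch of the "evaluate it on the CPU" advice, the Java snippet below computes the kernel weights once and normalizes them so that they already sum to 1, which removes the per-fragment division entirely. The class and method names are hypothetical, and the kernel size is assumed to be odd. With size 3 and sigma 100 the unnormalized weights match the values hard-coded in the question (1.0, 0.99995, 0.9999).

```java
/** Precomputes normalized Gaussian weights on the CPU. */
final class GaussianKernel {

    /**
     * Returns size*size weights in row-major order, scaled so that they sum to 1.
     * Assumption: size is odd, so the kernel has a well-defined center.
     */
    static float[] normalizedWeights(int size, double sigma) {
        int radius = size / 2;
        double[] raw = new double[size * size];
        double sum = 0.0;
        for (int y = -radius; y <= radius; y++) {
            for (int x = -radius; x <= radius; x++) {
                double w = Math.exp(-(x * x + y * y) / (2.0 * sigma * sigma));
                raw[(y + radius) * size + (x + radius)] = w;
                sum += w;
            }
        }
        float[] weights = new float[raw.length];
        for (int i = 0; i < raw.length; i++) {
            weights[i] = (float) (raw[i] / sum); // folding 1/sum in here replaces the shader division
        }
        return weights;
    }
}
```

The resulting array could then be uploaded once as a uniform float array (for example with GLES20.glUniform1fv), so the shader only multiplies and adds.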

On a modern (Shader Model 3.0+) GPU, fragment shaders are scheduled to operate on 2x2 blocks of pixels (pixel quads) at a time. Fun fact, this was required in order to implement the derivative instruction in Shader Model 3.0 and it has remained part of GPU architecture design ever since. Pixel quads are the lowest-level of granularity you can ever get in fragment shader scheduling. In fact, if you were to discard in a fragment shader, unless all of the fragments in the pixel quad also discard, then every instance of the fragment shader in the block continues running and the result is thrown out at the end for the individual fragments that requested discard.
In addition to this, most GPUs have multiple stream processing units and will schedule pixel quads into larger workgroups (NV calls them warps, AMD calls them wavefronts). In a nutshell, everything is happening in parallel; that is the entire premise of GPUs. They apply a single task across multiple threads that all run the same instructions on different data in parallel, which is why they scale so well with increasing core counts, as opposed to CPUs.
Put simply, rather than dispatching individual instructions in your GLSL shader to run on separate functional units, what really happens is this. Your GLSL shader is run on multiple processing units simultaneously (conceptually, one thread per-fragment), and these threads all execute the same sequence of instructions in a paradigm known as SIMT (Single Instruction Multiple Thread).
Getting back to the basic scheduling unit (warp/wavefront), if one instance of your shader stalls fetching memory the rest of the instances in said scheduling unit also stall, because they all run the same instruction simultaneously. This is why dependent texture reads and large filter kernels are bad mojo; since the texture memory needed by a particular group of fragments may be indeterminate until run-time or spread too far, efficiently pre-fetching and caching texture data within a scheduling unit can become difficult if not impossible.
The biggest problem with accurately describing the level of parallelism is that the GPU architectures keep changing (most of the discussion above related to Shader Model 3.0+ GPUs). Not too long ago, GPUs had vectorized ISAs but now both AMD and NV have switched to superscalar because it actually improves instruction scheduling efficiency. Throw specialized embedded GPUs into the mix and you have a real nightmare on your hands, it is hard to really say what shader model they run (since derivative is optional in OpenGL ES 2.0).
See this other question on Stack Overflow for a more concise statement of what I just wrote.
For some pretty diagrams, here is a somewhat out of date, but still useful presentation from nVIDIA.

How to collide objects with high speed in Unity

I am trying to create a game for Android and I have a problem with high-speed objects: they don't collide.
I have a Sphere with a Sphere Collider, a Bouncy material, and a Rigidbody with these parameters (Gravity=false, Interpolate=Interpolate, Collision Detection=Continuous Dynamic).
I also have 3 walls with a Box Collider and a Bouncy material.
This is my code for the Sphere:
function IncreaseBallVelocity() {
rigidbody.velocity *= 1.05;
}
function Awake () {
rigidbody.AddForce(4, 4, 0, ForceMode.Impulse);
InvokeRepeating("IncreaseBallVelocity", 2, 2);
}
In Project Settings I set: "Min Penetration For Penalty Force"=0.001, "Solver Iteration Count"=50
When I start playing it works fine (the sphere bounces), but when the speed gets too high, the Sphere just passes through the wall.
Can anyone help me?
Thanks.
Edited
var hit : RaycastHit;
var mainGameScript : MainGame;
var particles_splash : GameObject;
function Awake () {
rigidbody.AddForce(4, 4, 0, ForceMode.Impulse);
InvokeRepeating("IncreaseBallVelocity", 2, 2);
}
function Update() {
if (rigidbody.SweepTest(transform.forward, hit, 0.5))
Debug.Log(hit.distance + "mts distance to obstacle");
if(transform.position.y < -3) {
mainGameScript.GameOver();
//Application.LoadLevel("Menu");
}
}
function IncreaseBallVelocity() {
rigidbody.velocity *= 1.05;
}
function OnCollisionEnter(collision : Collision) {
Instantiate(particles_splash, transform.position, transform.rotation);
}
EDITED: added more info
Fixed Timestep = 0.02, Maximum Allowed Timestep = 0.333
There is no difference between running the game in the editor player and on Android
No. It looks OK when I set 0.01
My paddle is a Box Collider without a Rigidbody; the walls are the same
They are all in the same layer (when the speed is normal it all works); the values in PhysicsManager are the defaults (same as in the image) except "Solver Iteration Co..." = 50
No. When I change the speed it passes through another wall
I am using a standard cube, but I expand/shrink it to fit my screen and other objects; when I expand the wall more, it's OK and the ball bounces
No. It's simple project simple example from Video http://www.youtube.com/watch?v=edfd1HJmKPY
I don't use gravity

See:
1. Similar SO Question
2. A community script that uses ray tracing to help manage fast objects
3. UnityAnswers post leading to the script in (2)
You could also try changing the fixed time step for physics. The smaller this value, the more times Unity calculates the physics of a scene. But be warned, making this value too small, say <= 0.005, will likely result in an unstable game, especially on a portable device.
The script above is best for bullets or small objects. You can also manually force rigid body collision tests:
using UnityEngine;

public class example : MonoBehaviour {
    public RaycastHit hit;
    void Update() {
        // Sweep the rigidbody 10 units along its forward axis and report the first hit
        if (rigidbody.SweepTest(transform.forward, out hit, 10))
            Debug.Log(hit.distance + "mts distance to obstacle");
    }
}
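For intuition on why a smaller fixed time step helps, here is a rough discrete-step heuristic in Java. It is a simplification and not how Unity's physics engine decides collisions: it only compares the distance the ball covers in one physics step with the thickness of the wall, on the assumption that a step longer than the wall can jump over it. The class name, method names, and the example numbers are hypothetical; the 5% growth factor mirrors IncreaseBallVelocity from the question.

```java
/** Rough heuristic for when a fast object may skip over a thin collider. */
final class TunnelingCheck {

    /** Distance covered in one physics step at the given speed (units per second). */
    static double distancePerStep(double speed, double fixedTimestep) {
        return speed * fixedTimestep;
    }

    /**
     * Number of 5% speed-ups until a single step covers more than the wall thickness.
     * Assumption: all three arguments are positive.
     */
    static int boostsUntilRisk(double startSpeed, double fixedTimestep, double wallThickness) {
        if (startSpeed <= 0 || fixedTimestep <= 0 || wallThickness <= 0) {
            throw new IllegalArgumentException("all arguments must be positive");
        }
        int boosts = 0;
        double speed = startSpeed;
        while (distancePerStep(speed, fixedTimestep) <= wallThickness) {
            speed *= 1.05; // same factor as IncreaseBallVelocity in the question
            boosts++;
        }
        return boosts;
    }

    public static void main(String[] args) {
        System.out.println("boosts at step 0.02: " + boostsUntilRisk(5.0, 0.02, 0.2));
        System.out.println("boosts at step 0.01: " + boostsUntilRisk(5.0, 0.01, 0.2));
    }
}
```

Halving the time step or thickening the walls both push the risky speed further out in this model, which is consistent with the asker's observation that expanded walls work.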

I think the main problem is the manipulation of Rigidbody's velocity. I would try the following to solve the problem.
Redesign your code to ensure that IncreaseBallVelocity and every other manipulation of Rigidbody is called within FixedUpdate. Check that there are no other manipulations to Transform.position.
Try to replace setting velocity directly by using AddForce or similar methods so the physics engine has a higher chance to calculate all dependencies.
If there are more items (main player character, ...) involved related to the physics calculation, ensure that their code runs in FixedUpdate too.
Another point I stumbled upon were meshes that are scaled very much. Having a GameObject with scale <= 0.01 or >= 100 has definitely a negative impact on physics calculation. According to the docs and this Unity forum entry from one of the gurus you should avoid Transform.scale values != 1
Still not happy? OK then the next test is starting with high velocities but no acceleration. At this phase we want to know, if the high velocity itself or the acceleration is to blame for the problem. It would be interesting to know the velocities' values at which the physics engine starts to fail - please post them so that we can compare them.
EDIT: Some more things to investigate
6.7 m/sec does not sound like much, so I guess there is a special reason, or a combination of reasons, why things go wrong.
Is your Maximum Allowed Timestep high enough? For testing I suggest 5 to 10x Fixed Timestep. Note that this might kill the frame rate, but that can be fixed later.
Is there any difference between running the game in editor player and on Android?
Did you notice any drops in frame rate because of the 0.01 FixedTimestep? This would indicate that the physics engine might be in trouble.
Could it be that there are static colliders (objects having a collider but no Rigidbody) that are moved around or manipulated otherwise? This would cause heavy recalculations within PhysX.
What about the layers: are all walls on the same layer, and are the involved layers configured appropriately in the collision detection matrix?
Does the no-bounce effect always happen at the same wall? If so, can you just copy the 1st wall and put it in place of the second one to see if there is something wrong with this specific wall.
If it is not too much effort, I would try to set up some standard cubes as walls just to be sure that transform.scale is not to blame for it (I have had really bad experiences with this).
Do you manipulate gravity or TimeManager.timeScale from within a script?
BTW: are you using gravity? (Should be no problem just

Android Game Development: Collision Detection Failing

I am currently developing a game for Android, and I would like your expertise on an issue that I have been having.
Background:
My game incorporates frame rate independent motion, which takes into account the delta time value before performing necessary Velocity calculations.
The game is a traditional 2D platformer.
The Issue:
Here's my issue (simplified). Let's pretend that my character is a square standing on top of a platform (with "gravity" being a constant downward velocity of characterVelocityDown).
I have defined the collision detection as follows (assuming Y axis points downwards):
Given characterFootY is the y-coordinate of the base of my square character, platformSurfaceY is the upper y-coordinate of my platform, and platformBaseY is the lower y-coordinate of my platform:
if (characterFootY + characterVelocityDown > platformSurfaceY && characterFootY + characterDy < platformBaseY) {
    // Collision is true: snap the character onto the platform and stop falling
    characterFootY = platformSurfaceY;
    characterVelocityDown = 0;
} else {
    characterVelocityDown = deltaTime * 6;
}
This approach works perfectly fine when the game is running at regular speed; however, if the game slows down, the deltaTime (which is the elapsed time between the previous frame and the current frame) becomes large, and characterFootY + characterVelocityDown exceed the boundaries that define collision detection and the character just falls straight through (as if teleporting).
How should I approach this issue to prevent this?
Thanks in advance for your help and I am looking forward to learning from you!

What you need to do is run your physics loop with a constant delta time and iterate it as many times as the current frame's elapsed time requires.
const float PHYSICS_TICK = 1/60.f; // 60 FPS
void Update( float dt )
{
    m_dt += dt; // m_dt accumulates the time not yet simulated
    while( m_dt > PHYSICS_TICK )
    {
        UpdatePhysics( PHYSICS_TICK );
        m_dt -= PHYSICS_TICK;
    }
}
There are various techniques for handling the leftover tick ( m_dt ).
Caps for the minimum tick and maximum tick are also a must.
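Since the game in question targets Android, here is a rough Java sketch of the same accumulator loop with a cap on the frame delta. The tick length of 0.02 s and the cap of 0.25 s are assumed values chosen for illustration, and the class name and updatePhysics placeholder are hypothetical.

```java
/** Fixed-timestep loop with a cap on how much time one frame may contribute. */
final class FixedStepLoop {
    static final double PHYSICS_TICK = 0.02; // seconds per physics step (assumed value)
    static final double MAX_FRAME_DT = 0.25; // cap for very slow frames (assumed value)

    private double accumulator = 0.0; // time that has not been simulated yet
    private int totalSteps = 0;

    /** Consumes one frame's delta time and returns how many fixed steps were run. */
    int update(double frameDt) {
        accumulator += Math.min(frameDt, MAX_FRAME_DT);
        int ran = 0;
        while (accumulator >= PHYSICS_TICK) {
            updatePhysics(PHYSICS_TICK);
            accumulator -= PHYSICS_TICK;
            ran++;
        }
        return ran;
    }

    /** Placeholder for the game's own physics; here it only counts invocations. */
    private void updatePhysics(double dt) {
        totalSteps++;
    }

    int totalSteps() {
        return totalSteps;
    }
}
```

Because every physics step uses the same small dt, the per-step movement stays bounded no matter how slow the frame was, so the collision test from the question no longer sees a jump large enough to skip the platform. The cap means the game visibly slows down during long stalls instead of trying to catch up.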

I guess the issue here is that slowdowns are inevitable. You can try and optimize the code but you'll always have users with slow devices or busy sections of your game where it takes a little longer than usual to process it all. Instead of assuming a consistent delta, assume the opposite. Code under the realization that someone could try installing it on an abacus.
So basically, as SeveN says, make your game loop handle slowdowns. The only real way to do this (in my admittedly limited experience) would be to place a cap on how large delta can be. This will result in your clock not running exactly on time, but when you think about it, that's how most games handle slowdown. You don't fire up StarCraft on your Pentium 66 and have it run at 5 FPS but full speed; it slows down and processes everything as normal, albeit as a slideshow.
If you did such a thing, during periods of slowdown in your game, it'd visibly slow down... but the calculations should still all be spot on.
edit: just realised you're SeveN. Well done.
