I've been learning up a little on the cpu features and stumbled upon NEON.
From what I've read, it looks like NEON requires specific programming to use this, but is this completely true, or do the cpus that have this feature still find ways to untilize it and speed media processes for some applications even though there is not specific code for it?
There are a number of ways to make use of the NEON instructions. Some of them are:
Libraries. It is a good chance that your memcpy is handcrafted using NEON. Music/video playback libs in the API are using NEON and/or GPU for acceleration. Aso, there are third-pary libs that use it. FastCV from Qualcomm is a good example
Compiler-issued instructions. Some compilers, when provided with the correct options will issue NEON instructions. Most compilers will use neon for float operations, but not vectorize them. They will use the unit as a single-data unit, just because it is fast and convenient. There are some vectorization capabilities in GCC and ARM compiler, but they are really limited in scope and results.
Hand-coded C with intrinsics http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html It is probably the best way to get started in the NEON world.
Hand-coded assembler. This seems to be the best, if you want to achieve max performance. It also requires a good deal of effort and CS knowledge.
Last but not least, you can use NEON by just downloading apps that use it. Your favourite music player and your camera app put the NEON unit in your smartphone to good use.
Conclusion: NEON has a lot of usages, but it is only used if the code specifically contains NEON instructions. More technically, as #pst said, it must be targeted by a piece of code.
Related
I am developing some numerical software, whose performance, depends a lot on the numerical accuracy (i.e., floats, double etc.).
I have noticed that the ARM NEON does not fully comply with the IEEE754 floating point standard. Is there a way to emulate NEON's floating point precision, on an x86 CPU ? For example a library that emulates the NEON SIMD floating point operations.
Probably.
I'm less familiar with SSE, but you can force many of the SSE modes to behave like NEON. This will depend on your compiler and available libraries, but see some Visual Studio FP unit control functions. This might be good enough for your requirements.
Furthermore, you can use the arm_neon.h header to ensure that you are using similar intrinsics to accomplish similar things.
Finally, if you really require achieving this precision at these boundary conditions, you are going to want a good test suite to verify that you are achieving your results as intended.
Finally finally, even with pure "C" code, which typically complies with IEEE-754, and uses the VFP on ARM as other commenters have mentioned, you will get different results because floating point is a highly... irregular process, subject to the whim of optimization and order of operations. It is challenging to get results to match across different compilers, let alone hardware architectures. For example, to get highly agreeable results on Intel with gcc it's often required to use the -ffloat-store flag, if you want to compare with /fp:precise on CL/MSVS.
In the end, you may need to accept some kind of non-zero error tolerance. Trying to get to zero may be difficult, but it would be awesome to hear your results if you get there. It seems possible... but difficult.
Thanks for your answers.
At last, I used an android phone connected to a desktop, and certain functions were running on the phone.
Renderscript is an Android computation engine that lets you use CPU/GPU native hardware acceleration in order to boost applications, for example in image processing and computer vision algorithms.
Is there a similar thing in iOS and Windows Phone 7/8?
The RenderScript compatibility library is designed to compile for most any posix system. It would be very easy to get it running on other platforms.
I can't speak for Windows Phone but on iOS it would be vImage running on the Accelerate Framework. Just like Renderscript, it is dynamically optimized for the CPU on the target platform.
vImage optimizes image processing by using the CPU’s vector processor.
If a vector processor is not available, vImage uses the next best
available option. This framework allows you to reap the benefits of
vector processors without the need to write vectorized code.
https://developer.apple.com/library/mac/documentation/performance/Conceptual/vImage/Introduction/Introduction.html
I can't speak for Windows Phone but on iOS it would be Apple Metal, its language specification being almost same as renderscript c99.
For iOS it is the newly introduced swift I guess.
Maybe it is worth to try it out, but I'm not an iOS developer so I can't say anything about its performance, but the demos on the WWDC looked promising. Also instead of Renderscript Swift seemes to be designed for graphics, the Renderscript soppurt for graphics has been deprecated and Renderscript turned more into a general computation framework (which of course can be used as a backend for graphic calculations).
https://developer.apple.com/library/prerelease/ios/documentation/swift/conceptual/swift_programming_language/TheBasics.html
I am trying to play a synthesized sound (basically 2 sine waves and some noise) using the AudioTrack class. It doesn't seem to be any different than the SourceDataLine in javax.sound.sampled, BUT the synthesis is REALLY SLOW. Even for ARM standards, it's unrealistic to think that 32768 samples (16 bit, stereo, for a total of 65536) take over 1 second to render on a Nexus 4 (measured with System.nanotime(), write to AudioTrack excluded).
The synthesis part is almost identical to this http://audioprograming.wordpress.com/2012/10/18/a-simple-synth-in-android-step-by-step-guide-using-the-java-sdk/, the only difference is that I play stereo sound (I can't reduce it to mono because it's a binaural tone).
Any ideas? what can I do?
Thanks in advance
Marko's answer seems very good. But if you're still in the experimental/investigational phase of your project, you might want to consider using Pure Data, which already is implemented as a combination Android library/NDK library and which would allow you to synthesize many sounds and interact with them in a relatively simple manner.
The libpd distribution is the Android implementation of Pure Data. Some good starting references can be found at the SoundOnSound site and also at this site.
Addendum: I found a basic but functional implementation of an Android Midi Driver through this discussion link. The relevant code can be found here (github, project by billthefarmer, named mididriver).
You can view how I use it in my Android app (imSynt link leads you to Google Play), or on YouTube.
The performance of audio synthesis on ARM is actually very respectable with native code that makes good use of the NEON unit. The Dalvik's JIT compiler is never going to get close to this level of performance for floating-point intensive code.
A look at the enormous number of soft-synth apps for iOS provides ample evidence of what should be possible on ARM devices with similar levels of performance.
However, the performance you are reporting is several orders of magnitude short of what I would expect. You might consider the following:
Double precision float-point arithmetic is particularly expensive on ARM Cortex A-x NEON units, where as single precision is very fast and highly parallelizable. Math.sin() returns a double, so is unnecessarily precise, and liable to be slow. The 24-mantissa provided by single precision floating point value is substantially larger than the 16-bit int used by the audio subsystem.
You could precompute sin(x) and then perform a table-lookup in your render loop.
There is a previous post on SO concerning Math.sin(x) on android suggesting degrading performance as x becomes large, as it's likely to in this case over time.
For a more advanced table-based synthesiser, you might consider using a DDS Oscillator.
Ultimately, you might consider using native code for synthesis, with the NDK.
You should be able render multiple oscillators with filters and envelopes and still have CPU time left over. Check your inner loops to make sure that there are no system calls.
Are you on a very old phone? You did not mention the hardware or OS version.
You might want to try using JSyn. It is a free modular Java synthesizer that runs on any Java platform including desktops, Raspberry Pi and Android.
https://github.com/philburk/jsyn
Have you tried profiling your code? It sounds like something else is possibly causing your slow down, profiling would help to highlight the cause.
Mike
Is there any other free vector library optimized for neon that math-neon?
I would like to get advantage of neon in my code, i have lot of objects and i am doing lot of simple vector physics-math, like adding vectors, multiplying, dotting them, those are 3d vectors but if i could make it a lot faster 2d should be ok too, the question is, is it worth using neon? for example lets take 100000 points, i need to calculate their movement, collisions etc. I am currently using my own math, and its based on inline functions, lets say that i would like to use my hypothetical neon library with matrices too, currently i am using glm for that, and its doing fine, but could it be faster? Speed advantage between arm-abi and arm7-abi in ndk is about 30 percent in my case, can neon be faster or maybe my code is translated to neon in compile time?
You can check eigen. It has special code that it is activated when neon instruction support is activated.
Like someone else mentioned, you should look into Eigen, it is probably good enough for you. But if you want full performance (much better than 30% gain, more like 300% gain), you should use NEON code yourself and make sure your entire inner loop is written completely with NEON (not any CPU or VFP code).
If you just NEON optimize part of your loop instead of the entire loop, you get major penalties and so the NEON code is perhaps just 30% faster or perhaps even slower than regular C code. But a full NEON loop can often give you 300% - 2000% speedup!
If you are developing for an ARM Cortex-A9 then NEON C Intrinsics should be good enough, but for ARM Cortex-A8 devices you usually need NEON Assembly code to get full performance. I give some more info on how to NEON optimize your whole loop at "http://www.shervinemami.info/armAssembly.html"
Code is compiled for NEON if the target architecture supports it, namely, if it is compiled for armeabi-v7a. To do this, simply add armeabi-v7a to the list of targets in your app's Application.mk file.
I'm trying to decide on whether to primarily use floats or ints for all 3D-related elements in my app (which is C++ for the most part). I understand that most ARM-based devices have no hardware floating point support, so I figure that any heavy lifting with floats would be noticeably slower.
However, I'm planning to prep all data for the most part (i.e. have vertex buffers where applicable and transform using matrices that don't change a lot), so I'm just stuffing data down OpenGL's throat. Can I assume that this goes more or less straight to the GPU and will as such be reasonably fast? (Btw, the minimum requirement is OpenGL ES 2.0, so that presumably excludes older 1.x-based phones.)
Also - how is the penalty when I mix and match ints and floats? Assuming that all my geometry is just pre-built float buffers, but I use ints for matrices since those do require expensive operations like matrix multiplications, how much wrath will I incur here?
By the way, I know that I should keep my expectations low (sounds like even asking for floats on the CPU is asking for too much), but is there anything remotely like 128-bit VMX registers?
(And I'm secretly hoping that fadden is reading this question and has an awesome answer.)
Older Android devices like the G1 and MyTouch have ARMv6 CPUs without floating point support. Most newer devices, like the Droid, Nexus One, and Incredible, use ARMv7-A CPUs that do have FP hardware. If your game is really 3D-intensive, it might demand more from the 3D implementation than the older devices can provide anyway, so you need to decide what level of hardware you want to support.
If you code exclusively in Java, your app will take advantage of the FP hardware when available. If you write native code with the NDK, and select the armv5te architecture, you won't get hardware FP at all. If you select the armv7-a architecture, you will, but your app won't be available on pre-ARMv7-A devices.
OpenGL from Java should be sitting on top of "direct" byte buffers now, which are currently slow to access from Java but very fast from the native side. (I don't know much about the GL implementation though, so I can't offer much more than that.)
Some devices additionally support the NEON "Advanced SIMD" extension, which provides some fancy features beyond what the basic VFP support has. However, you must test for this at runtime if you want to use it (looks like there's sample code for this now -- see the NDK page for NDK r4b).
An earlier answer has some info about the gcc flags used by the NDK for "hard" fp.
Ultimately, the answer to "fixed or float" comes down to what class of devices you want your app to run on. It's certainly easier to code for armv7-a, but you cut yourself off from a piece of the market.
In my opinion you should stick with fixed-point as much as possible.
Not only old phones miss floating point support, but also new ones such as the HTC Wildfire.
Also, if you choose to require ARMv7, please note that for example the Motorola Milestone (Droid for Europe) does feature an ARMv7 CPU, but because of the way Android 2.1 has been built for this device, the device will not use your armeabi-v7a libs (and might hide your app from the Market).
I personally worked around this by detecting ARMv7 support using the new cpufeatures library provided with NDK r4b, to load some armeabi-v7a lib on demand with dlopen().