I want to run a neural network on mobile. Currently, I am exploring the MXNet (http://mxnet.io) framework for deploying it (inference only). As I am concerned about execution-time performance on mobile, I want to know if it runs on the GPU of mobile phones (Android/iOS). The documentation mentions that it can use multiple CPUs as well as GPUs for training, but it is still not clear whether it can use the GPU of a mobile phone for inference. It mentions a dependency on BLAS, because of which it seems to use the CPU on mobile. Could anyone please tell me if I can use the mobile GPU with MXNet for inference? If not, what are my other options?
UPDATE: The Neural Networks API is now available on Android devices starting from API 27 (Oreo 8.1). The API provides a lower-level facility that a higher-level machine learning framework (e.g. Tensorflow, Caffe) can use to build models. It is a C-language API that can be accessed through the Android Native Development Kit (NDK).
NNAPI gives hardware vendors a Service Provider Interface (SPI) to provide drivers for computational hardware such as Graphics Processing Units (GPUs) and Digital Signal Processors (DSPs). As a result, the NNAPI provides an abstraction for high performance computation. There is a CPU fallback in case no hardware acceleration drivers are present.
For those wanting to implement a machine learning model on Android, the framework of choice is now Tensorflow Lite. Tensorflow Lite for Android is implemented on top of the NNAPI, so Tensorflow models will get hardware acceleration when available. Tensorflow Lite has other optimizations to squeeze more performance out of the mobile platform.
The process goes as follows:
Develop and train your model on Keras (using Tensorflow backend)
Or use a pretrained model
Save a "frozen" model in Tensorflow protobuf format
Use the Tensorflow Optimizing Converter to convert the protobuf into a "pre-parsed" tflite model format
See the Tensorflow Lite Developer Guide
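Once the converted .tflite file is in place, loading and running it from Java is only a few lines. A minimal sketch, assuming a hypothetical model path and illustrative input/output shapes, that attaches the NNAPI delegate so hardware acceleration is picked up where drivers exist (ops the driver cannot handle stay on TFLite's own CPU kernels):

```java
import org.tensorflow.lite.Interpreter;
import org.tensorflow.lite.nnapi.NnApiDelegate;

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class TfliteSmokeTest {
    // Hypothetical path; a real Android app would usually memory-map the
    // model straight out of the APK's assets instead.
    private static final String MODEL_PATH = "/data/local/tmp/model.tflite";

    public static void main(String[] args) throws IOException {
        try (FileInputStream fis = new FileInputStream(MODEL_PATH);
             FileChannel channel = fis.getChannel()) {
            MappedByteBuffer model =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());

            // Route supported ops through NNAPI so a vendor driver (GPU/DSP)
            // can accelerate them; TFLite falls back to its CPU kernels otherwise.
            NnApiDelegate nnApiDelegate = new NnApiDelegate();
            Interpreter.Options options = new Interpreter.Options();
            options.addDelegate(nnApiDelegate);

            Interpreter interpreter = new Interpreter(model, options);
            try {
                // Assumed shapes: one 224x224 RGB image in, 1000 class scores out.
                float[][][][] input = new float[1][224][224][3];
                float[][] output = new float[1][1000];
                interpreter.run(input, output);
            } finally {
                interpreter.close();
                nnApiDelegate.close();
            }
        }
    }
}
```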
I went through an exercise of creating a neural net application for Android using Deeplearning4j. Because Deeplearning4j is based on Java, I thought it would be a good match for Android. Based on my experience with this, I can answer some of your questions.
To answer your most basic question:
Could anyone please tell me if I can use the mobile GPU with MXNet for inference?
The answer is: No. The explanation for this follows.
It mentions a dependency on BLAS, because of which it seems to use the CPU on mobile.
BLAS (Basic Linear Algebra Subprograms) is at the heart of AI computation. Because of the sheer amount of number-crunching involved in these complex models, the math routines must be optimized as much as possible. The computational firepower of GPUs makes them ideal processors for AI models.
It appears that MXNet can use Atlas (libblas), OpenBLAS, and MKL. These are CPU-based libraries.
Currently the main (and — as far as I know — only) option for running BLAS on a GPU is CuBLAS, developed specifically for NVIDIA (CUDA) GPUs. Apparently MXNet can use CuBLAS in addition to the CPU libraries.
The GPU in many mobile devices is a lower-power chip designed to work alongside ARM CPUs, and it does not yet have a dedicated BLAS library.
what are my other options?
Just go with the CPU. Since it's the training that's extremely compute-intensive, using the CPU for inference isn't the show-stopper you think it is. In OpenBLAS, the routines are written in assembly and hand-optimized for each CPU it can run on. This includes ARM.
Do the recognition on the server. After working on another demo application that sent an image to a server, which performed the recognition and returned the results to the device, I think this approach has some benefits for the user, such as better overall response time and not having to find space to install a 100 MB(!) application.
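For the server-side option, the client only has to POST the image bytes and read back the result. A rough sketch using nothing but the standard library (the endpoint URL and the image/response contract are placeholders, not part of any framework mentioned above):

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class RemoteRecognizer {
    // Placeholder endpoint; the real service would define its own contract.
    private static final String ENDPOINT = "https://example.com/recognize";

    public static String recognize(String imagePath) throws Exception {
        byte[] jpegBytes = Files.readAllBytes(Paths.get(imagePath));

        HttpURLConnection conn =
                (HttpURLConnection) new URL(ENDPOINT).openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "image/jpeg");

        // Upload the image and read back whatever the server returns,
        // e.g. a JSON list of labels and scores.
        try (OutputStream out = conn.getOutputStream()) {
            out.write(jpegBytes);
        }
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }
}
```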
Since you also tagged iOS, using a C++-based framework like MXNet is probably the best choice if you are trying to go cross-platform. But their method of creating one giant .cpp file and providing a simple native interface might not give you enough flexibility for Android. Deeplearning4j has solved that problem pretty elegantly by using JavaCPP which takes the complexity out of JNI.
Related
I am new to NNAPI and the Snapdragon Neural Processing Engine (SNPE) SDK. I know that with PyTorch Mobile you can train and quantize your NN model for NNAPI, save it as a .pt file, and load it in your app (here). At the same time, you can convert the .pt file to a DLC using snpe-pytorch-to-dlc in the SNPE SDK and then run it using snpe-net-run. As far as I know, we cannot specify which low-level processors (such as the GPU or AIP in Qualcomm chips) are used via NNAPI (here), while with SNPE we can specify the processor. I am wondering if my understanding is correct. Can I target the AIP, GPU, or CPU in a Qualcomm chip using NNAPI? Maybe I have some fundamental misunderstandings about NNAPI and SNPE.
Failing to load the NNAPI PyTorch model in the app
In addition, when I tried to load the NNAPI version of the .pt file, the app failed to load it on both the emulator and a device, while the original one succeeded. I am not sure if there is anything different I need to pay attention to when loading model.pt using NNAPI.
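For reference, the load-and-run path I am attempting looks roughly like this (paths and shapes simplified; this is the plain org.pytorch API, with nothing NNAPI-specific on the Java side):

```java
import org.pytorch.IValue;
import org.pytorch.Module;
import org.pytorch.Tensor;

public class NnapiModelSmokeTest {
    public static void main(String[] args) {
        // Simplified: the real app copies the .pt file out of assets first.
        Module module = Module.load("/data/local/tmp/model_nnapi.pt");

        // Dummy 1x3x224x224 input just to exercise forward().
        float[] data = new float[1 * 3 * 224 * 224];
        Tensor input = Tensor.fromBlob(data, new long[]{1, 3, 224, 224});

        Tensor output = module.forward(IValue.from(input)).toTensor();
        float[] scores = output.getDataAsFloatArray();
        System.out.println("Got " + scores.length + " scores");
    }
}
```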
Any help is greatly appreciated!
I've trained a TensorFlow model which takes my RTX2080 several seconds per action (in addition to 20-30 seconds to initialise the model).
I've been looking into turning this into an iOS/Android app running on TensorFlow Lite, but apart from the technical challenge of converting the model into a TensorFlow Lite model and everything else, I am wondering about the feasibility of running this on phone hardware. Even on a reasonably modern phone with a built-in GPU, would this likely be too slow for practical purposes?
Can anyone who has built an iOS/Android app with tensorflow lite where the phone is responsible for computation comment on performance and other practical considerations?
The only other option of having requests served by my own server(s) on AWS, for example, would turn into a major expense if the app had significant use.
I have created an image recognition app in Swift using TensorFlow Lite and did not find any performance issues with it. The prediction consistently took anywhere from 3 to 5 seconds, which I personally think is not that bad. So I would suggest going ahead with your app using the TF model. Thanks!
I'm about to start working on an augmented reality project that will involve using the EPSON Moverio AR glasses or similar. Assuming we go for those glasses, they run on Android 4.0.4 - API 15. The app will involve near realtime analysis of video (frames from the glasses camera) for feature detection / tracking with markers and overlaying 3d objects in the 'real world' based on the markers.
So far then, on the technical side, it looks like we'll be dealing with:
API 15
OpenCV
OpenGLES2
Considering the above, I'm wondering if it's worth doing it all through the NDK, using a single NativeActivity with the android_native_app_glue code. When I say worth it, I mean performance-wise.
Sure, doing it all on the C/C++ side has for instance the advantage that the code could then potentially be ported with minimal modification to run on other environments. But OpenCV does have Java bindings for Android and GL can also be used to a certain extent from Java. So I'm just wondering if performance-wise it's worth it or it would be about the same as, say, using a GLSurfaceView.
I work in augmented reality. The vast majority of applications I've seen have been native. Google recommends avoiding native applications unless the gains are absolutely necessary. I think AR is one of the relatively few cases where it is necessary. The benefits I'm aware of are:
Native camera access will allow you to get a higher capture framerate. Passing the data up to the Java layer slows this down considerably. Even OpenCV's non-native capture can be slower in Java because OpenCV primarily maintains the data in native objects. Your capture framerate is a limiting factor on how fast you can update the pose information for your objects. Beware, though: OpenCV's native camera will not work on devices running Qualcomm's MSM-optimized fork of Android, which includes many Snapdragon devices.
Every call to an OpenGL method in Java not only has a cost related to dropping into native code, it also performs quite a few additional checks. Look through GLES20.cpp, which contains the native implementation of the GLES20 class's methods. You'll see that you could bypass quite a lot of logic by using native calls. This is fine in most mobile applications, but 3D rendering often benefits significantly from bypassing those checks and the JNI overhead. This is even more important in AR because you will already be swamping the system with CV processing.
You will very likely want your detection-related code in native. OpenCV has samples if you want to see the difference between native and Java detection performance. The former will use fewer resources and be more consistent. Using a native application means that you can call your native functions without paying the cost of passing large amounts of data from Java to native.
Native sensor access is more efficient, and the sampling rate stays far more consistent in native code thanks to the lack of garbage collection and JNI. This is relevant if you will be using IMU data in interesting ways.
You may be able to build a non-native application that keeps most of its code in native libraries and runs well despite being Java-based, but it is considerably more difficult.
Renderscript is an Android computation engine that lets you use CPU/GPU native hardware acceleration in order to boost applications, for example in image processing and computer vision algorithms.
Is there a similar thing in iOS and Windows Phone 7/8?
The RenderScript compatibility library is designed to compile for almost any POSIX system. It would be very easy to get it running on other platforms.
I can't speak for Windows Phone but on iOS it would be vImage running on the Accelerate Framework. Just like Renderscript, it is dynamically optimized for the CPU on the target platform.
vImage optimizes image processing by using the CPU’s vector processor. If a vector processor is not available, vImage uses the next best available option. This framework allows you to reap the benefits of vector processors without the need to write vectorized code.
https://developer.apple.com/library/mac/documentation/performance/Conceptual/vImage/Introduction/Introduction.html
I can't speak for Windows Phone, but on iOS it would be Apple Metal, whose language specification is close to RenderScript's C99-based one.
For iOS, it is the newly introduced Swift, I guess.
Maybe it is worth trying out; I'm not an iOS developer, so I can't say anything about its performance, but the demos at WWDC looked promising. Also, whereas Swift seems to be designed with graphics in mind, RenderScript's graphics support has been deprecated and RenderScript has turned more into a general computation framework (which can of course be used as a backend for graphics calculations).
https://developer.apple.com/library/prerelease/ios/documentation/swift/conceptual/swift_programming_language/TheBasics.html
I am now programming on Android, and I wonder whether we can use GPGPU on Android yet. I once heard that RenderScript might eventually execute on the GPU. But is it possible for us to program for GPGPU now? And if it is possible to program the Android GPU, where can I find some tutorials or sample programs? Thank you for your help and suggestions.
So far, I know that the OpenGL ES library is GPU-accelerated, but I want to use the GPU for computing. What I want to do is accelerate computation, so I was hoping to use APIs such as OpenCL.
April 2021 Update
Google has announced deprecation of the RenderScript API in favor of Vulkan with Android 12.
The option for manufacturers to include the Vulkan API was made available in Android 7.0 Compatibility Definition Document - 3.3.1.1. Graphic Libraries.
Original Answer
Actually, Renderscript Compute doesn't use the GPU at this time, but it is designed for it.
From Romain Guy who works on the Android platform:
Renderscript Compute is currently CPU bound but with the for_each construct it will take advantage of multiple cores immediately
Renderscript Compute was designed to run on the GPU and/or the CPU
Renderscript Compute avoids having to write JNI code and gives you architecture independent, high performance results
Renderscript Compute can, as of Android 4.1, benefit from SIMD optimizations (NEON on ARM)
https://groups.google.com/d/msg/android-developers/m194NFf_ZqA/Whq4qWisv5MJ
Yes, it is possible.
You can use either RenderScript or OpenGL ES 2.0.
RenderScript is available on Android 3.0 and above, and OpenGL ES 2.0 is available on about 95% of devices.
As of Android 4.2, RenderScript can involve the GPU in computations (in certain cases).
More information here: http://android-developers.blogspot.com/2013/01/evolution-of-renderscript-performance.html
As I understand it, ScriptIntrinsic subclasses are well optimized to run on the GPU on compatible hardware (for example, the Nexus 10 with its Mali T604). Documentation:
http://developer.android.com/reference/android/renderscript/ScriptIntrinsic.html
Of course you can decide to use OpenCL, but RenderScript is guaranteed (by Google, as part of Android itself) to run even on hardware that doesn't support GPGPU computation, and it will use whatever other acceleration the hardware it runs on provides.
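To make that concrete, the intrinsics can be driven entirely from Java with a few Allocation objects. A small sketch of a Gaussian blur over a Bitmap (the bitmap and radius are placeholders; whether the work actually lands on the GPU depends on the device's RenderScript driver):

```java
import android.content.Context;
import android.graphics.Bitmap;
import android.renderscript.Allocation;
import android.renderscript.Element;
import android.renderscript.RenderScript;
import android.renderscript.ScriptIntrinsicBlur;

public class RsBlur {
    /** Blurs a bitmap with the ScriptIntrinsicBlur kernel and returns the result. */
    public static Bitmap blur(Context context, Bitmap input, float radius) {
        RenderScript rs = RenderScript.create(context);
        try {
            Bitmap output = Bitmap.createBitmap(
                    input.getWidth(), input.getHeight(), input.getConfig());

            // Wrap the bitmaps in Allocations so the runtime can place them
            // on whatever processor (CPU cores, GPU, DSP) the driver picks.
            Allocation in = Allocation.createFromBitmap(rs, input);
            Allocation out = Allocation.createFromBitmap(rs, output);

            ScriptIntrinsicBlur blur =
                    ScriptIntrinsicBlur.create(rs, Element.U8_4(rs));
            blur.setRadius(radius);   // valid range is (0, 25]
            blur.setInput(in);
            blur.forEach(out);

            out.copyTo(output);
            return output;
        } finally {
            rs.destroy();
        }
    }
}
```

Custom kernels written in .rs files are driven the same way, through forEach calls on the script class that the build generates for you.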
There are several options: You can use OpenGL ES 2.0, which is supported by almost all devices but has limited functionality for GPGPU. You can use OpenGL ES 3.0, with which you can do much more in terms of GPU processing. Or you can use RenderScript, but this is platform-specific and furthermore does not give you any influence on whether your algorithms run on the GPU or the CPU. A summary about this topic can be found in this master's thesis: Parallel Computing for Digital Signal Processing on Mobile Device GPUs.
You should also check out ogles_gpgpu, which allows GPGPU via OpenGL ES 2.0 on Android and iOS.
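For a sense of what "limited functionality for GPGPU" means in practice with ES 2.0: the usual trick is to upload the data as a texture, draw a full-screen quad into a framebuffer object with a fragment shader acting as the kernel, and read the result back with glReadPixels. A stripped-down sketch of just the shader pair (the FBO and quad setup are omitted, and the grayscale kernel is only an illustration):

```java
/**
 * Core idea of GPGPU on OpenGL ES 2.0: the fragment shader runs once per
 * output pixel, so it plays the role of a compute kernel over texture data.
 */
public final class Gles2ComputeShaders {

    public static final String VERTEX_SHADER =
            "attribute vec4 aPosition;\n" +
            "attribute vec2 aTexCoord;\n" +
            "varying vec2 vTexCoord;\n" +
            "void main() {\n" +
            "    vTexCoord = aTexCoord;\n" +
            "    gl_Position = aPosition;\n" +   // full-screen quad, no transform
            "}\n";

    // One fragment shader invocation per output pixel acts as the "kernel".
    public static final String FRAGMENT_SHADER =
            "precision mediump float;\n" +
            "uniform sampler2D uInput;\n" +
            "varying vec2 vTexCoord;\n" +
            "void main() {\n" +
            "    vec4 p = texture2D(uInput, vTexCoord);\n" +
            "    float y = dot(p.rgb, vec3(0.299, 0.587, 0.114));\n" +
            "    gl_FragColor = vec4(y, y, y, 1.0);\n" +
            "}\n";

    private Gles2ComputeShaders() {}
}
```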