ARM NEON Assembler - usage & understanding

I am new to assembler and NEON programming.
My task is to convert part of an algorithm from C to ARM assembler using NEON instructions.
The algorithm takes an int32 array, loads different values from this array, does some bit shifting and XOR, and writes the result to another array.
Later I will use an array with 64-bit values, but for now I am just trying to rewrite the code.
C Pseudo code:
out_array[index] = shiftSome( in_array[index] ) ^ shiftSome( in_array[index] );
So here are my questions regarding NEON Instructions:
1.) If I load a register like this:
vld1.32 d0, [r1]
will it load only 32 bits from memory, or 2x32 bits to fill the 64-bit NEON D register?
2.) How can I access the 2/4/8 (i32, i16, i8) parts of the D register?
3.) I am trying to load different values from the array with an offset, but it doesn't seem to work. What am I doing wrong? Here is my code
(it is an integer array, so I'm trying to load, for example, the third element, which should have an offset of 64 bits = 8 bytes):
asm volatile(
"vld1.32 d0, [%0], #8 \n"
"vst1.32 d0, [%1]" : : "r" (a), "r" (out): "d0", "r5");
where "a" is the array and "out" is a pointer to an integer (for debugging).
4.) After I load a value from the array I need to shift it to the right, but it doesn't seem to work:
vshr.u32 d0, d0, #24 // C code: x >> 24;
5.) Is it possible to load only 1 byte into a NEON register, so that I don't have to shift/mask to get the one byte I need?
6.) I need to use inline assembler, but I am not sure what the last line is for:
input list : output list : what is this for?
7.) Do you know any good NEON references with code examples?
The program should run on a Samsung Galaxy S2 (Cortex-A9 processor), if that makes any difference. Thanks for the help.
----------------edit-------------------
This is what I found out:
1.) It will always load the full register (64 bits).
2.) You can use the "vmov" instruction to transfer part of a NEON register to an ARM register.
3.) The offset should be in an ARM register and will be added to the base address after the memory access.
6.) It is the "clobbered reg list". Every register that is used and is in neither the input nor the output list should be written here.

I can answer most of your questions: (update: clarified "lane" issue)
1) vld1.32 d0, [r1] loads the entire 64-bit D register, i.e. two 32-bit elements. Whole-register loads and stores (64-bit, 128-bit) are the normal case. There is also a VMOV variant that moves single "lanes" to or from ARM registers.
2) You can use the NEON VMOV instruction to access single lanes (for example vmov.32 r0, d0[1]). Performance will suffer when doing too many single-element operations. NEON instructions benefit application performance by doing parallel operations on vectors (groups of floats/ints).
3) Immediate offsets in ARM assembly language are in bytes, not elements/registers. NEON load/store instructions only allow post-increment by a register (or write-back with "!", which advances by the number of bytes transferred), not by an immediate value. For normal ARM instructions, a post-increment of 8 adds 8 (bytes) to the source pointer.
4) Shifts in NEON affect all elements of a vector. A shift right of 24 bits using vshr.u32 shifts both 32-bit unsigned elements by 24 bits and discards the bits that are shifted out.
5) NEON allows moving single elements in and out of normal ARM registers with VMOV. VLD1 also has a single-lane form, e.g. vld1.8 {d0[0]}, [r0], which loads one byte into lane 0 and leaves the other lanes unchanged.
6) ?
7) Start here: http://blogs.arm.com/software-enablement/161-coding-for-neon-part-1-load-and-stores/ The ARM site has a good tutorial on NEON.
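To make points 1) and 4) concrete, here is a small scalar model in plain C of what the two instructions do to a D register. It is only a sketch: the type d_reg_u32 and the functions model_vld1_32 and model_vshr_u32 are hypothetical names invented for illustration, not NEON intrinsics, and the code runs on any machine.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of a 64-bit D register viewed as two u32 lanes. */
typedef struct { uint32_t lane[2]; } d_reg_u32;

/* Models "vld1.32 d0, [r1]": reads 8 bytes, i.e. two 32-bit elements. */
static d_reg_u32 model_vld1_32(const uint32_t *src) {
    d_reg_u32 d;
    memcpy(d.lane, src, sizeof d.lane);
    return d;
}

/* Models "vshr.u32 d0, d0, #n": every lane is shifted independently,
 * and the bits shifted out are discarded. */
static d_reg_u32 model_vshr_u32(d_reg_u32 d, unsigned n) {
    for (int i = 0; i < 2; ++i)
        d.lane[i] = (n >= 32) ? 0u : (d.lane[i] >> n);
    return d;
}
```

With the input {0xAABBCCDD, 0x11223344} and a shift of 24, both lanes are reduced to their top byte (0xAA and 0x11), which is the x >> 24 from the question applied to two elements at once.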

6) Clobbered registers.
asm(code : output operand list : input operand list : clobber list);
If you use registers that were not passed as operands, you need to inform
the compiler about this. The following code will adjust a value to a multiple of four. It
uses r3 as a scratch register and lets the compiler know about this by specifying r3 in the
clobber list. Furthermore, the CPU status flags are modified by the ands instruction.
Adding the pseudo register cc to the clobber list keeps the compiler informed about
this modification as well.
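asm (
"ands r3, %1, #3 \n\t" // r3 = len & 3; also updates the status flags
"eor %0, %0, r3" // clear the low two bits of len
: "=r"(len) // output operand
: "0"(len) // input operand, tied to output operand 0
: "cc", "r3" // clobbers: status flags and scratch register
);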

Related

Understand BINDER_VM_SIZE in Android source code

In file framework/native/libs/binder/ProcessState.cpp
Why is BINDER_VM_SIZE set to 1M-8k?
#define BINDER_VM_SIZE ((1*1024*1024) - (4096 *2))

It was not this value initially. From the git commit log you can see that its first value was
#define BINDER_VM_SIZE (1*1024*1024)
Then someone changed this value to
#define BINDER_VM_SIZE ((1*1024*1024) - (4096 *2))
with the following commit message:
Modify the binder to request 1M - 2 pages instead of 1M. The backing
store in the kernel requires a guard page, so 1M allocations fragment
memory very badly. Subtracting a couple of pages so that they fit in
a power of two allows the kernel to make more efficient use of its
virtual address space.
I don't fully understand this message myself, so I am just pasting it here in the hope that it helps your understanding.
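The arithmetic behind the commit message can be sketched as follows. This is only an illustration under two assumptions that are not confirmed by the source: pages are 4 KiB (suggested by the 4096 in the macro) and the kernel adds exactly one guard page. PAGE_BYTES, pages_for and pages_with_guard are hypothetical names.

```c
#include <stddef.h>

/* Assumption: 4 KiB pages, as suggested by the 4096 in the macro. */
#define PAGE_BYTES 4096u
#define BINDER_VM_SIZE ((1 * 1024 * 1024) - (4096 * 2))

/* Number of whole pages needed to cover `bytes`. */
static size_t pages_for(size_t bytes) {
    return (bytes + PAGE_BYTES - 1) / PAGE_BYTES;
}

/* Pages needed once `guard_pages` extra pages are added (assumed: one). */
static size_t pages_with_guard(size_t bytes, size_t guard_pages) {
    return pages_for(bytes) + guard_pages;
}
```

Under these assumptions a full 1 MiB mapping plus a guard page needs 257 pages, one more than the 256 pages of a 1 MiB power-of-two block, while BINDER_VM_SIZE (1040384 bytes, 254 pages) plus a guard page needs 255 pages and still fits.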

What's the difference between the AUDIO_FORMAT_PCM_32_BIT and AUDIO_FORMAT_PCM_8_24_BIT in Android Lollipop?

AUDIO_FORMAT_PCM_32_BIT and AUDIO_FORMAT_PCM_8_24_BIT are two high-definition audio formats in Android Lollipop.
It seems they both have a 32-bit depth.
Does anyone know the exact difference between them?

You can find that information in audio.h:
/* Audio format consists of a main format field (upper 8 bits) and a sub
format field (lower 24 bits). */
AUDIO_FORMAT_PCM_32_BIT and AUDIO_FORMAT_PCM_8_24_BIT are defined as:
AUDIO_FORMAT_PCM_32_BIT = (AUDIO_FORMAT_PCM |
AUDIO_FORMAT_PCM_SUB_32_BIT),
AUDIO_FORMAT_PCM_8_24_BIT = (AUDIO_FORMAT_PCM |
AUDIO_FORMAT_PCM_SUB_8_24_BIT),
And if we look at the definitions of AUDIO_FORMAT_PCM_SUB_32_BIT and AUDIO_FORMAT_PCM_SUB_8_24_BIT we find some helpful comments:
AUDIO_FORMAT_PCM_SUB_32_BIT = 0x3, /* PCM signed .31 fixed point */
AUDIO_FORMAT_PCM_SUB_8_24_BIT = 0x4, /* PCM signed 7.24 fixed point */

In response to Michael's comment:
signed .31 means 1 bit for sign, 0 bits for the whole part, and 31 bits for the fractional part. signed 7.24 means 1 bit for sign, 7 bits for the whole part, and 24 bits for the fractional part. Read up on fixed-point arithmetic if you want to know more about how it's used.
AUDIO_FORMAT_PCM_8_24_BIT most likely refers to 8 bits of zero padding, as 7.24 fixed point doesn't make sense for PCM data: PCM data ranges over [-1.0, 1.0], so a "whole" number part serves no purpose (and it technically should be 8.23, otherwise 7.24 == 25 bits!).
A single sample of AUDIO_FORMAT_PCM_8_24_BIT will contain 4 bytes, where only 3 bytes will hold any meaningful data and the remaining single byte will be all zeros.
The alternative is AUDIO_FORMAT_PCM_24_BIT_PACKED that only contains 3 bytes per sample and no padding. 24-bit audio has a strange format, and it doesn't fit well in the powers of 2 of digital audio. It is typically easier to handle a 24-bit sample as if it was 32-bit.
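As a rough illustration of the fixed-point reading given in the first answer, the sketch below converts a 32-bit sample to float for a given number of fractional bits. The function name fixed_to_float is hypothetical, and the choice of 31 and 24 fractional bits simply follows the header comments quoted above; the scaling Android actually applies should be checked against its headers.

```c
#include <stdint.h>

/* Hypothetical helper: interprets `sample` as signed fixed point with
 * `frac_bits` fractional bits (31 for ".31", 24 for "7.24") and returns
 * the value as a float. frac_bits must be in 0..31. */
static float fixed_to_float(int32_t sample, unsigned frac_bits) {
    double scale = (double)((int64_t)1 << frac_bits);
    return (float)((double)sample / scale);
}
```

With 31 fractional bits, INT32_MIN maps to -1.0 and 0x40000000 maps to 0.5. With 24 fractional bits, 1 << 24 maps to 1.0, so values beyond the [-1.0, 1.0] range remain representable in the integer part.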

Is it possible to get list of OpenGL ES 2.0 methods which are vendor-specific?

A little backstory: I'm working on an Android application with OpenGL ES 2.0, and some time ago I faced a problem with line width. It finally turned out that the glLineWidth() implementation is vendor-specific and the range of possible values is not guaranteed. For example, for the Adreno 200 it is 1-18, and on the emulator I got 1-100.
I'm wondering if it is possible to get a list of such methods.

The list of limits with vendor specific values is in the spec document. To find that:
Go to https://www.khronos.org/ (Khronos is the consortium responsible for the OpenGL ES standard).
Click on "OpenGL ES" in the tabs above the top pane on the page.
Click on "Specs & Headers" at the bottom of the pane. This will bring you to https://www.khronos.org/registry/gles/.
Find the section "OpenGL ES 2.0 Specifications and Documentation", and click on "Full Specification". Or better yet, download the PDF file to have it handy for future use.
In this PDF file, look for section "6.2 State Tables", which starts on page 134. The information you're looking for is then in "Table 6.18 Implementation Dependent Values".
This table lists the name of each value, and the function to use for querying the value for your specific implementation. Also very useful, it lists the minimum value guaranteed to be supported by all implementations.
For your specific example, you will find a value ALIASED_LINE_WIDTH_RANGE, which is the 6th entry in the table, with GetFloatv for the function name, 1,1 for the minimum supported value, and this for the description:
Range (lo to hi) of aliased line widths
Based on this, you know that implementations can have a limit as low as 1 for the maximum line width (i.e. they do not support wide lines at all), and you can query the limit for the implementation you are using with:
GLfloat widthRange[2];
glGetFloatv(GL_ALIASED_LINE_WIDTH_RANGE, widthRange);

You can get all such data from glGet when running the program.
For example requesting glGetFloatv(GL_ALIASED_LINE_WIDTH_RANGE,lineWidthRange); would return the line width range.
The OpenGL ES 2.0 specification lists in its section 6.2 all the minimum requirements. From there we can see that line width range is guaranteed to be [1,1], everything else is implementation specific.
I am not aware of a list that would compare "all" implementations according to attribute values.

Program generates the same 'random' number with each execution

I'm trying to write a simple ASCII-style game with a randomly generated world for the Android terminal using the c4droid IDE. It has C++ support, and basically I'm generating array[width][height] tiles using the rule rand() % 2: 1 creates a walkable tile, 0 is a wall. But there is a problem. Each time I 'randomly' generate the map it looks the same, because rand() isn't really random.
I heard about using entropy created by HDDs or other hardware. The problem is that I'm on Android, where C++ is not used as much as Java, so I couldn't find a solution on Google.
So, short question: how can I generate "pretty much real" random numbers using C++ on Android?

You need to seed your random number generator with srand(time(NULL)). This seeds the generator from the system time, so each run produces a different pseudo-random sequence.
a link to reference: http://www.cplusplus.com/reference/clibrary/cstdlib/srand/
EDIT: note that you only need to seed rand() once, usually at the beginning of the program.
#include <cstdlib>
#include <ctime>
int main()
{
srand(time(NULL)); // only needs to be called ONCE
// code and rand() calls afterward
}

I think rand() should work for what you're doing. Are you seeding the random number generator?
srand(time(NULL));
// Should be a different set of numbers each time you run.
for(unsigned i = 0; i < 10; ++i) {
cout << rand() % 2; // 1 = walkable tile, 0 = wall
}
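The reason the map looked identical on every run is that rand() produces the same sequence for the same seed. A minimal sketch in plain C, where fill_map is a hypothetical helper and the flat tiles array stands in for the array[width][height] from the question:

```c
#include <stdlib.h>

/* Hypothetical helper: seeds the generator with `seed`, then fills a
 * width*height map with 1 (walkable tile) or 0 (wall). */
static void fill_map(unsigned seed, int *tiles, int width, int height) {
    srand(seed);
    for (int i = 0; i < width * height; ++i)
        tiles[i] = rand() % 2;
}
```

Calling fill_map twice with the same seed yields the same map, which is what happens when srand is never called. Passing (unsigned)time(NULL) as the seed makes the map depend on the start time instead. In a real program the seeding would normally be done once at startup rather than inside the helper.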

"Immediate out of range errors" when assigning 0.0 to a NEON register

If I understand it correctly, because ARM instructions are 32 bits long they can only hold so many bits of immediate value. What I'm trying to do is vmov.f32 s0, #0.0, and I get an "immediate out of range" compiler error. The strange thing is that when I use an immediate value of, say, #0.5 or #0.25 (all very neatly represented in binary), my code compiles. When I try to assign an immediate value of #0.1, I get the "garbage after following instruction" error, which makes sense if it's trying to represent those values with more bits than can fit into an ARM instruction. The #0.0 case is the only one where I get "immediate out of range", so I'm thinking it's got to be a bug if there's no other explanation.
Does anyone know how to assign an immediate value of #0.0 to a single word floating point register without having to convert it from somewhere else? If there's a good reason it shouldn't work in the first place, please let me know as well. I'm using GNU assembler with Android NDK build tool.
Update:
vmov.f32 d0, #0.0 does work. It keeps making less and less sense.
Update 2:
This doesn't work either: vmov.s32 s0, #0

0.0 is not representable as a VFP/NEON floating-point immediate. Representable floating-point immediates are between 1/8 and 31 in magnitude, which zero clearly isn't.
The corresponding bit pattern, however, is representable as an integer NEON immediate. Your assembler is being helpful and generating this encoding for you instead of an (impossible) floating-point immediate; when you write vmov.f32 d0, #0.0 it actually emits vmov.s32 d0, #0, which has the same effect as what you appear to be trying to do, but is actually a legal instruction.
vmov.s32 s0, #0 doesn't make any sense; NEON does not provide any instructions that operate on s registers.
If you just want to zero a NEON register, however, the preferred idiom is usually veor d0, d0. Is there a reason that you aren't using that?
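The range quoted above follows from the encoding rule: as far as I know, the encodable magnitudes are exactly n * 2^-r with integer 16 <= n <= 31 and 0 <= r <= 7, with either sign. The sketch below checks a value against that rule by brute force. is_vfp_float_immediate is a hypothetical helper name, not part of any toolchain.

```c
#include <stdbool.h>

/* Hypothetical helper: returns true if x can be written as
 * +/- n * 2^-r with 16 <= n <= 31 and 0 <= r <= 7, i.e. if it should
 * be encodable as a VFP/NEON floating-point immediate. */
static bool is_vfp_float_immediate(float x) {
    float mag = (x < 0.0f) ? -x : x;
    for (int r = 0; r <= 7; ++r) {
        for (int n = 16; n <= 31; ++n) {
            /* n / 2^r is exact in float for these small integers */
            if (mag == (float)n / (float)(1 << r))
                return true;
        }
    }
    return false;
}
```

This matches the behaviour reported in the question: 0.5 (16 * 2^-5) and 0.25 (16 * 2^-6) are accepted, while 0.1 and 0.0 are not.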

If you want to assign 0 to an s register, you can do so with the following instruction (note that the result is NaN rather than 0 if s0 holds NaN or infinity):
vsub.f32 s0, s0, s0

To assign 0 to a register, XOR it with itself: use eor for a general-purpose register and veor for a NEON register (there is no form that targets a single s register):
"veor d0, d0, d0 \n\t"

You could simply use this:
vmov.u32 d0, #0
because 0x00000000 is interpreted as 0.0f as well.
FYI: in IEEE 754 the all-zero bit pattern is exactly +0.0, so this really is a true zero.
