Wednesday, March 12, 2014

Bug in Undefined Instruction Handler ARM

Problem Statement:

Multimedia team reported random user space crash.
The issue was tough to reproduce. If you run monkey test on, say 7 devices for around 5-10 hours.
They also indicated about a pattern, that SP is shifted by 8bytes in tombstone.

Most of the time crash was reproduced while returning from a bionic libc function strtoimax().
When objdump of strotimax was investigated, found few instruction which may cause corruption.
Out of them the most important one was vpush {d8} and vpop {d8}. This is basically a floating point instruction.

To confirm that vpush and vpop is causing the issue, added a new function called strtoimax_debug in bionic libc. This was a dummy function which does noting, but check for any stack corruption. We added those suspected instruction here,
Like stmdb sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}, vpush and vpos
and check for the expected SP, if not generate an intentional data abort doing an ldr r0 [0]

From the result of this experiment, confirmed that the issue is with vpop instruction.

Now did bit study on how vfp instruction are executed. By default vfp engine is turned off druing a context switch. When a process executes floating point instruction, a undefined exception will be generated.
exception handler enables the vfp engine and jump back to the same instruction which caused the exception.
code looks okey in that perspective.

Tracing __und_usr(arch/arm/kernel/entry-armv.S) revealed that control came out of exception handler without performing do_vfp which is necessary to handle any vfp instruction when the engine is off.
This gave an indication that there can be only one possibility(for deviating the path) and that is an exception has triggered while executing the path.

Looking at the code further, saw that und_user is reading the instruction which cause the undefined exception. And comment says that this can cause a fault. Now the question is why it can falut?. Onepossible option is the code page might have reclaimed by other core. A fixup handler is register in the exception table. This is not the normal exception table. Here what I m talking is about kernel's fixup exception table.
But the fixup handler was not proper. It was calling ret_from_execption which retrun back to the next instruction. The problem is present in latest kernel aswell, but this problem is very rarely hit.

Now the fix is to return to the same instruction which cause fault instead of next instruction.

Some Notes:

The NEON/VFP register file is managed using lazy preserve (on UP systems) and
lazy restore (on both SMP and UP systems). This means that the register file is
kept 'live', and is only preserved and restored when multiple tasks are
contending for the NEON/VFP unit (or, in the SMP case, when a task migrates to
another core). Lazy restore is implemented by disabling the NEON/VFP unit after
every context switch, resulting in a trap when subsequently a NEON/VFP
instruction is issued, allowing the kernel to step in and perform the restore if
necessary.

Difference btw Neon and VFP
Neon donot support double precision
no complex operations like square root and divide.

Managed to push this fix upstream, [Link to kenrel.org]

1 comment:

Unknown said...

Someone pointed me at this post. Can you try this set of patches:

http://thread.gmane.org/gmane.linux.ports.arm.kernel/263765

With a Tested-by, I'll send them to rmk ;).

Thanks.