Bugzilla – Bug 15607
Suspected occasional register corruption (s20) on context switch
Last modified: 2019-01-25 10:35:02 UTC
Analysis for bug 14489 has got to the point that either a compiler (or supporting run-time library) or kernel bug is suspected. This bug has been opened to track investigation into the possibility of the problem being a kernel bug. The issue seems to be that floating-point register s20 is used a temporary storage (for an integer) in the routine squeezeplay_private/src/WMA10Dec/audio/wmaudio/v10/common/fft.c:prvFFT4DCT() and, during the execution of the routine, gets corrupted. It has been suggested that this could be a problem with register-sets save/restore logic somewhere in kernel context-switching code. The problem is difficult to reproduce, typically taking several hours of continuous play of WMA tracks to cause it to occur. I have been able to isolate the specific routine where it occurs only by use of a specially built version of the BSD malloc library, which can use inaccessible mmap-ed guard pages between heap memory pages or allocations. With the standard malloc implementation the problem can take longer to occur and the symptoms are more variable as the issue leads to generalized heap corruption which can manifest itself all over the place. I will soon checkin a change to workaround this problem by compiling the specific module at level -O0 which avoids use of s20. This can only be considered a temporary workaround. It would need to be reverted (or have its effect reverted) when attempting to recreate this problem.
From: http://www.ibm.com/developerworks/linux/library/l-inside.html FPU mode is another case where the state of the CPU should be protected from preemption. When the kernel is executing floating point instructions, the FPU state is not saved. If preemption happens here, then upon reschedule, the FPU state is completely different from what was there before preemption. So FPU code must always be locked against kernel preemption. Locking can be done by disabling preemption for the critical section and re-enabling it afterward. The following #defines have been provided by the 2.6 kernel to disable and enable preemption: * preempt_enable() -- decrements the preempt counter * preempt_disable() -- increments the preempt counter * get_cpu() -- calls preempt_disable() followed by a call to smp_processor_id() * put_cpu() -- re-enables preemption() Using these defines, Listing 1 could be rewritten as: Listing 2. Code with locking against preemption int cpu, arr[NR_CPUS]; arr[get_cpu()] = i; /* disable preemption */ j = arr[smp_processor_id()]; /* do some critical stuff here */ put_cpu() /* re-enable preemption */ Note that preempt_disable()/enable() calls are nested. That is, preempt_disable() can be called n number of times and preemption will only be re-enabled when the nth preempt_enable() is encountered. If the floating pointer register is used in application layer and a context switch happens then the content of the FP is lost.
I checked all the object files to have a good understanding of the usage of s2[0-9] registers. Here is the result: kernel vfp_get_float kernel vfp_put_float. squeezeplay_private/libwmafft_la-fft.o auDctIV WMA10Dec/audio/wmaudio/v10/common/fft.c squeezeplay_private/libwmafft_la-fft.o prvFFT4DCT WMA10Dec/audio/wmaudio/v10/common/fft.c squeezeplay_private/libwma_la-lowrate_commonstd.o prvInverseQuantizeMidRate WMA10Dec/audio/wmaudio/v10/common/lowrate_commonstd.c squeezeplay_private/libwma_la-wmachmtx.o SetMixFromChannelMaskD squeezeplay_private/libwma_la-wmachmtx.o SetMixFromChannelMaskF squeezeplay_private/wmadec-main.o decodeWMAFile2PCM WMA10Dec/audio/wmaudio/v10/wmadec_s/main.c squeezeplay_private/libwma_la-wmaltrt.o ltrtDownmixInit WMA10Dec/audio/common/chanmtx/wmaltrt.c squeezeplay_private/libwma_la-msaudiopro.o prvMultiXIDCTInit WMA10Dec/audio/wmaudio/v10/common/msaudiopro.c alsa-lib-1.0.18-3/alsa-lib-1.0.18/src/pcm/.libs/pcm_route.o snd_pcm_route_convert1_many alsa-lib-1.0.18-3/alsa-lib-1.0.18/src/pcm/pcm_route.c
http://tuxology.net/sf-forum/?forum=3&topic=7 floating point operations are not supported inside the Linux kernel mode, although of course they are supported in user space programs running on Linux. One of the main reasons for this lack of support is that it enables the Linux kernel to not need to save and restore the floating point unit registers on each system call and back, which in turns enables a faster context switch. adding the required code in the scheduler to do save the FPU registers will make all the system call slower, so it will actually slow down your user programs.
It seems to me that changing the context switch in ARM to save/restore floating point registers is not an option(Major impact on performance). By having a better understanding of which components are using the floating point registers, maybe we can come up with some synchronization rules, so that they will not overwrite each other Vector Floating Point single/double precision registers.
If we decide to support the floating point hardware in user mode we need to ensure that the floating point registers are preserved in the context switch code - extending the context switch time. One way of improving this situation is to flag tasks that use floating point so the FP registers are only preserved or restored when switching from or too an FP task.
Changes r8472-8474 switch the builds to a new toolchain. If you want to verify any fixes to the problem you will probably need to stick with r8473 or earlier. Also, it would be good if you did a repeat of your assessment in comment #2 above for the new toolchain. It could be that the problem simply goes away then. An earlier build of mine, with different optimization options, did not make use of s20 as a temporary in the module in question, but I have not verified this with the latest builds. Reading your comments 1, 3, 4, 5 leaves me a little confused. We use FP quite a bit. Its use by some codecs (WMA, ...?) may be incidental and they are indeed capable of operating without it. But perl, used to run TinySC, makes extensive use of FP and TinySC includes many FP operations. I do not see that we have an option not to support FP. Anyway, it seems very strange that FP registers would not be saved as part of a normal context-switch. That would imply that, although the CPU supports FP, one cannot really use it in a normal way in user programs. Comment #1 suggests that one must (which means that the compiler must) bracket all use of FP with preemption-management calls and not store temporary results in FP registers outside of these preemption brackets. But I am also concerned about the issue of context-switching time. I was already of the view that context-switching on this platform is not great, although I do not have any hard evidence of this. Richard once said that he thought that the real-time kernel changes also exacerbated this issue. Making it worse in general seems like a bad idea.
I ran a test of running multiple processes accessing floating point registers on Fab4. I checked the assembly code and made sure that the applications are using s registers. The test is running fine, which means somehow the concurrency is being managed properly. I noticed that mostly the following options are used for FP compilations: "-mfloat-abi=softfp -mfpu=vfp": -mfloat-abi=name Specifies which ABI to use for floating point values. Permissible values are: `soft', `softfp' and `hard'. `soft' and `hard' are equivalent to -msoft-float and -mhard-float respectively. Specifying `soft' causes GCC to generate output containing library calls for floating-point operations. `softfp' allows the generation of code using hardware floating-point instructions, but still uses the soft-float calling conventions. `hard' allows generation of floating-point instructions and uses FPU-specific calling conventions. -mfp=number This specifies what floating point hardware (or hardware emulation) is available on the target. Permissible names are: `fpa', `fpe2', `fpe3', `maverick', `vfp'. -mfp and -mfpe are synonyms for -mfpu=`fpe'number, for compatibility with older versions of GCC. This is an interesting bug.
As I understand it, the prebuilt libraries that come with the toolchain use -mfloat-abi=softfp.
To make sure that the context switching and the FPU is functional on Fab4 I wrote the next program(float_timing.c): #include <stdio.h> #include <stdlib.h> #include <math.h> #include <sys/time.h> void print_usage ( void ) { printf ( "float_timing <repitition>\n" ); printf ( "repitition: The number of iteration of the test\n" ); } main(int argc, char **argv ) { int counter = 0; int rep; float x = 0; int time0, time1; printf ( "version 2.0\n" ); if ( argc != 2 ) { print_usage (); exit ( 0 ); } rep = atoi ( argv[1]); time0 = time ( NULL ); for ( counter = 0; counter < rep; ++counter ) { x += sqrt ( counter * 0.895 ); //printf ( "x is %f\n", x ); } time1 = time ( NULL ); printf ( "Time it took to finish operation is %d, x is %f\n", time1 - time0, x ); } Here is the Makefile to build the float_timing_fast and float_timing_slow: all: arm-none-linux-gnueabi-gcc -march=armv6j -mtune=arm1136jf-s -mfloat-abi=softfp -mfpu=vfp ./float_timing.c -o float_timing_fast -lm arm-none-linux-gnueabi-gcc -march=armv6j -mtune=arm1136jf-s -mfloat-abi=soft ./float_timing.c -o float_timing_slow -lm clean: rm -f float_timing_slow rm -f float_timing_fast I looked at the assembly output of the float_timing_fast to make sure that it is using s?? registers: 00008564 <main>: 8564: e92d4800 push {fp, lr} 8568: e28db004 add fp, sp, #4 ; 0x4 856c: e24dd040 sub sp, sp, #64 ; 0x40 8570: e50b0020 str r0, [fp, #-32] 8574: e50b1024 str r1, [fp, #-36] 8578: e3a03000 mov r3, #0 ; 0x0 857c: e50b3018 str r3, [fp, #-24] 8580: e59f3120 ldr r3, [pc, #288] ; 86a8 <main+0x144> 8584: e50b3010 str r3, [fp, #-16] 8588: e59f011c ldr r0, [pc, #284] ; 86ac <main+0x148> 858c: ebffffba bl 847c <_init+0x74> 8590: e51b3020 ldr r3, [fp, #-32] 8594: e3530002 cmp r3, #2 ; 0x2 8598: 0a000002 beq 85a8 <main+0x44> 859c: ebffffe7 bl 8540 <print_usage> 85a0: e3a00000 mov r0, #0 ; 0x0 85a4: ebffffb7 bl 8488 <_init+0x80> 85a8: e51b3024 ldr r3, [fp, #-36] 85ac: e2833004 add r3, r3, #4 ; 0x4 85b0: e5933000 ldr r3, [r3] 85b4: e1a00003 mov r0, r3 85b8: ebffffa9 bl 8464 <_init+0x5c> 85bc: e1a03000 mov r3, r0 85c0: e50b3014 str r3, [fp, #-20] 85c4: e3a00000 mov r0, #0 ; 0x0 85c8: ebffffa8 bl 8470 <_init+0x68> 85cc: e1a03000 mov r3, r0 85d0: e50b300c str r3, [fp, #-12] 85d4: e3a03000 mov r3, #0 ; 0x0 85d8: e50b3018 str r3, [fp, #-24] 85dc: ea00001c b 8654 <main+0xf0> 85e0: ed5b7a04 vldr s15, [fp, #-16] 85e4: eeb75ae7 vcvt.f64.f32 d5, s15 85e8: ed0b5b0d vstr d5, [fp, #-52] 85ec: e51b3018 ldr r3, [fp, #-24] 85f0: ee053a90 vmov s11, r3 85f4: eeb86be5 vcvt.f64.s32 d6, s11 85f8: ed9f7b28 vldr d7, [pc, #160] ; 86a0 <main+0x13c> 85fc: ed1b5b0b vldr d5, [fp, #-44] 8600: ee265b07 vmul.f64 d5, d6, d7 8604: ed0b5b0b vstr d5, [fp, #-44] 8608: ed1b6b0b vldr d6, [fp, #-44] 860c: eeb16bc6 vsqrt.f64 d6, d6 8610: ed0b6b11 vstr d6, [fp, #-68] 8614: ed1b7b11 vldr d7, [fp, #-68] 8618: ed1b5b11 vldr d5, [fp, #-68] 861c: eeb47b45 vcmp.f64 d7, d5 8620: eef1fa10 vmrs APSR_nzcv, fpscr 8624: 0a000002 beq 8634 <main+0xd0> 8628: e14b02dc ldrd r0, [fp, #-44] 862c: ebffff86 bl 844c <_init+0x44> 8630: e14b04f4 strd r0, [fp, #-68] 8634: ed1b7b11 vldr d7, [fp, #-68] 8638: ed1b6b0d vldr d6, [fp, #-52] 863c: ee367b07 vadd.f64 d7, d6, d7 8640: eef77bc7 vcvt.f32.f64 s15, d7 8644: ed4b7a04 vstr s15, [fp, #-16] 8648: e51b3018 ldr r3, [fp, #-24] 864c: e2833001 add r3, r3, #1 ; 0x1 8650: e50b3018 str r3, [fp, #-24] 8654: e51b2018 ldr r2, [fp, #-24] 8658: e51b3014 ldr r3, [fp, #-20] 865c: e1520003 cmp r2, r3 8660: baffffde blt 85e0 <main+0x7c> 8664: e3a00000 mov r0, #0 ; 0x0 8668: ebffff80 bl 8470 <_init+0x68> 866c: e1a03000 mov r3, r0 8670: e50b3008 str r3, [fp, #-8] 8674: e51b2008 ldr r2, [fp, #-8] 8678: e51b300c ldr r3, [fp, #-12] 867c: e0633002 rsb r3, r3, r2 8680: ed5b7a04 vldr s15, [fp, #-16] 8684: eeb77ae7 vcvt.f64.f32 d7, s15 8688: e59f0020 ldr r0, [pc, #32] ; 86b0 <main+0x14c> 868c: e1a01003 mov r1, r3 8690: ec532b17 vmov r2, r3, d7 8694: ebffff6f bl 8458 <_init+0x50> 8698: e24bd004 sub sp, fp, #4 ; 0x4 869c: e8bd8800 pop {fp, pc} 86a0: 0a3d70a4 .word 0x0a3d70a4 86a4: 3feca3d7 .word 0x3feca3d7 86a8: 00000000 .word 0x00000000 86ac: 00008780 .word 0x00008780 86b0: 0000878c .word 0x0000878c 86b4: e1a00000 nop (mov r0,r0) Here is the result of some operations: # /root/float_timing_slow 1000000 version 2.0 Time it took to finish operation is 5, x is 630544256.000000 # /root/float_timing_slow 10000000 version 2.0 Time it took to finish operation is 49, x is 19180300288.000000 # /root/float_timing_slow 100000000 version 2.0 Time it took to finish operation is 485, x is 274877906944.000000 # /root/float_timing_fast 1000000 version 2.0 Time it took to finish operation is 0, x is 630544256.000000 # /root/float_timing_fast 10000000 version 2.0 Time it took to finish operation is 2, x is 19180300288.000000 # /root/float_timing_fast 100000000 version 2.0 Time it took to finish operation is 27, x is 274877906944.000000 Also I ran the following test to make sure that the concurrency is working fine with floating point operations: # /root/float_timing_fast 100000000 & version 2.0 # /root/float_timing_fast 100000000 & version 2.0 # /root/float_timing_fast 100000000 & version 2.0 # /root/float_timing_fast 100000000 & version 2.0 # /root/float_timing_fast 100000000 & version 2.0 # /root/float_timing_fast 100000000 & version 2.0 # /root/float_timing_fast 100000000 & version 2.0 # /root/float_timing_fast 100000000 & version 2.0 # /root/float_timing_fast 100000000 & version 2.0 # /root/float_timing_fast 100000000 & version 2.0 # /root/float_timing_slow 100000000 & version 2.0 # Time it took to finish operation is 274, x is 274877906944.000000 Time it took to finish operation is 283, x is 274877906944.000000 Time it took to finish operation is 286, x is 274877906944.000000 Time it took to finish operation is 289, x is 274877906944.000000 Time it took to finish operation is 290, x is 274877906944.000000 Time it took to finish operation is 290, x is 274877906944.000000 Time it took to finish operation is 289, x is 274877906944.000000 Time it took to finish operation is 289, x is 274877906944.000000 Time it took to finish operation is 288, x is 274877906944.000000 Time it took to finish operation is 287, x is 274877906944.000000 This proves to me that the FPU is functional and the context switching for floating point related operations is working. There is still a very minor chance of potential intermittent bugs, which I will be looking into next.
The following article describes in detail how the context switching for the VFP is done: http://www.linux-arm.org/LinuxKernel/LinuxVFP FP registers saving and restoring is done lazily in Linux to avoid unnecessary operations when an application does not use the VFP. The vfp_init function registers vfp_notifier to be called during a context switch via the __switch_to and atomic_notifier_call_chain functions. This function disables the VFP by clearing the FPEXC.EN bit, causing any subsequent VFP instruction (except access to the FPEXC, FPSID, MVFR0 and MVFR1 registers) to trigger an Undefined Instruction exception. The vfp_support_entry function called via the Undefined Instruction trap checks whether the VFP is disabled and a new thread requires access (the address of the current VFP state is stored in last_VFP_context defined in vfphw.S). If a new thread requires the VFP, the support entry code saves the current VFP registers state (if any) and restores the state for the new thread. By default, only the double-precision registers, FPEXC and FPSCR have to be saved or restored. If the FPEXC.EX bit is set, the FPINST and FPINST2 registers have to be saved or restored as well. Once the VFP is enabled by setting the FPEXC.EN bit, the execution resumes and the trigger VFP instruction is restarted. If following a context switch the execution would return to the same thread, the VFP support entry code checks for any exceptional VFP state that needs addressing (FPEXC.EX bit set) to avoid an exception re-entry when restarting the trigger instruction.
Since the workaround of changing the optimization level for the fft.c to none seems to have fixed this problem, this bug is postponed untill after the 7.5 release. I have done extensive testing on floating pointers with multiple processes and I have noticed that in normal cases all the floating point operations work fine. There is one indication of potential problem and that seems to be related to the operations that cause overflow.
Vahid is no longer working for us.