Bug 15607 – Suspected occasional register corruption (s20) on context switch

Bug 15607 - Suspected occasional register corruption (s20) on context switch

Summary:

Suspected occasional register corruption (s20) on context switch

Status:	RESOLVED WONTFIX

Product:	SB Touch
Classification:	Unclassified
Component:	OS
Version:	unspecified
Platform:	PC Other

Importance:	P2 major (vote)
Target Milestone:	8.0.0
Assigned To:	Unassigned bug - please assign me!

URL:
Keywords:

Depends on:
Blocks:	14489
	Show dependency tree / graph

Reported:	2010-02-03 22:16 UTC by Alan Young
Modified:	2019-01-25 10:35 UTC (History)
CC List:	2 users (show)

See Also:
Category:	Bug

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Alan Young 2010-02-03 22:16:26 UTC

Analysis for bug 14489 has got to the point that either a compiler (or supporting run-time library) or kernel bug is suspected. This bug has been opened to track investigation into the possibility of the problem being a kernel bug.

The issue seems to be that floating-point register s20 is used a temporary storage (for an integer) in the routine squeezeplay_private/src/WMA10Dec/audio/wmaudio/v10/common/fft.c:prvFFT4DCT() and, during the execution of the routine, gets corrupted. It has been suggested that this could be a problem with register-sets save/restore logic somewhere in kernel context-switching code. 

The problem is difficult to reproduce, typically taking several hours of continuous play of WMA tracks to cause it to occur. I have been able to isolate the specific routine where it occurs only by use of a specially built version of the BSD malloc library, which can use inaccessible mmap-ed guard pages between heap memory pages or allocations. With the standard malloc implementation the problem can take longer to occur and the symptoms are more variable as the issue leads to generalized heap corruption which can manifest itself all over the place.

I will soon checkin a change to workaround this problem by compiling the specific module at level -O0 which avoids use of s20. This can only be considered a temporary workaround. It would need to be reverted (or have its effect reverted) when attempting to recreate this problem.

Comment 1 Vahid Fereydouny 2010-02-08 14:27:26 UTC

From:
http://www.ibm.com/developerworks/linux/library/l-inside.html
FPU mode is another case where the state of the CPU should be protected from preemption. When the kernel is executing floating point instructions, the FPU state is not saved. If preemption happens here, then upon reschedule, the FPU state is completely different from what was there before preemption. So FPU code must always be locked against kernel preemption.

Locking can be done by disabling preemption for the critical section and re-enabling it afterward. The following #defines have been provided by the 2.6 kernel to disable and enable preemption:

    * preempt_enable() -- decrements the preempt counter
    * preempt_disable() -- increments the preempt counter
    * get_cpu() -- calls preempt_disable() followed by a call to smp_processor_id()
    * put_cpu() -- re-enables preemption()

Using these defines, Listing 1 could be rewritten as:

Listing 2. Code with locking against preemption

        int cpu, arr[NR_CPUS];

         arr[get_cpu()] = i;  /* disable preemption */
         j = arr[smp_processor_id()];
         /* do some critical stuff here */
         put_cpu()    /* re-enable preemption */


Note that preempt_disable()/enable() calls are nested. That is, preempt_disable() can be called n number of times and preemption will only be re-enabled when the nth preempt_enable() is encountered.
If the floating pointer register is used in application layer and a context switch happens then the content of the FP is lost.

Comment 2 Vahid Fereydouny 2010-02-09 14:40:21 UTC

I checked all the object files to have a good understanding of the usage of s2[0-9] registers. Here is the result:
kernel vfp_get_float
kernel vfp_put_float.
squeezeplay_private/libwmafft_la-fft.o              auDctIV  WMA10Dec/audio/wmaudio/v10/common/fft.c
squeezeplay_private/libwmafft_la-fft.o              prvFFT4DCT  WMA10Dec/audio/wmaudio/v10/common/fft.c
squeezeplay_private/libwma_la-lowrate_commonstd.o   prvInverseQuantizeMidRate WMA10Dec/audio/wmaudio/v10/common/lowrate_commonstd.c
squeezeplay_private/libwma_la-wmachmtx.o            SetMixFromChannelMaskD
squeezeplay_private/libwma_la-wmachmtx.o            SetMixFromChannelMaskF
squeezeplay_private/wmadec-main.o                   decodeWMAFile2PCM   WMA10Dec/audio/wmaudio/v10/wmadec_s/main.c
squeezeplay_private/libwma_la-wmaltrt.o             ltrtDownmixInit  WMA10Dec/audio/common/chanmtx/wmaltrt.c
squeezeplay_private/libwma_la-msaudiopro.o          prvMultiXIDCTInit  WMA10Dec/audio/wmaudio/v10/common/msaudiopro.c
alsa-lib-1.0.18-3/alsa-lib-1.0.18/src/pcm/.libs/pcm_route.o   snd_pcm_route_convert1_many         alsa-lib-1.0.18-3/alsa-lib-1.0.18/src/pcm/pcm_route.c

Comment 3 Vahid Fereydouny 2010-02-10 19:13:38 UTC

http://tuxology.net/sf-forum/?forum=3&topic=7
floating point operations are not supported inside the Linux kernel mode, although of course they are supported in user space programs running on Linux. One of the main reasons for this lack of support is that it enables the Linux kernel to not need to save and restore the floating point unit registers on each system call and back, which in turns enables a faster context switch.
adding the required code in the scheduler to do save the FPU registers will make all the system call slower, so it will actually slow down your user programs.

Comment 4 Vahid Fereydouny 2010-02-10 19:16:39 UTC

It seems to me that changing the context switch in ARM to save/restore floating point registers is not an option(Major impact on performance). By having a better understanding of which components are using the floating point registers, maybe we can come up with some synchronization rules, so that they will not overwrite each other Vector Floating Point single/double precision registers.

Comment 5 Vahid Fereydouny 2010-02-10 20:01:59 UTC

If we decide to support the floating point hardware in user mode we need to ensure that the floating point registers are preserved in the context switch code - extending the context switch time.
One way of improving this situation is to flag tasks that use floating point so the FP registers are only preserved or restored when switching from or too an FP task.

Comment 6 Alan Young 2010-02-10 22:12:54 UTC

Changes r8472-8474 switch the builds to a new toolchain. If you want to verify any fixes to the problem you will probably need to stick with r8473 or earlier.

Also, it would be good if you did a repeat of your assessment in comment #2 above for the new toolchain. It could be that the problem simply goes away then. An earlier build of mine, with different optimization options, did not make use of s20 as a temporary in the module in question, but I have not verified this with the latest builds.

Reading your comments 1, 3, 4, 5 leaves me a little confused. We use FP quite a bit. Its use by some codecs (WMA, ...?) may be incidental and they are indeed capable of operating without it. But perl, used to run TinySC, makes extensive use of FP and TinySC includes many FP operations. I do not see that we have an option not to support FP.

Anyway, it seems very strange that FP registers would not be saved as part of a normal context-switch. That would imply that, although the CPU supports FP, one cannot really use it in a normal way in user programs. Comment #1 suggests that one must (which means that the compiler must) bracket all use of FP with preemption-management calls and not store temporary results in FP registers outside of these preemption brackets.

But I am also concerned about the issue of context-switching time. I was already of the view that context-switching on this platform is not great, although I do not have any hard evidence of this. Richard once said that he thought that the real-time kernel changes also exacerbated this issue. Making it worse in general seems like a bad idea.

Comment 7 Vahid Fereydouny 2010-02-11 15:48:05 UTC

I ran a test of running multiple processes accessing floating point registers on Fab4. I checked the assembly code and made sure that the applications are using s registers. The test is running fine, which means somehow the concurrency is being managed properly.
I noticed that mostly the following options are used for FP compilations:
"-mfloat-abi=softfp -mfpu=vfp":
-mfloat-abi=name
    Specifies which ABI to use for floating point values. Permissible values are: `soft', `softfp' and `hard'. `soft' and `hard' are equivalent to -msoft-float and -mhard-float respectively. Specifying `soft' causes GCC to generate output containing library calls for floating-point operations. `softfp' allows the generation of code using hardware floating-point instructions, but still uses the soft-float calling conventions. `hard' allows generation of floating-point instructions and uses FPU-specific calling conventions. 

-mfp=number
    This specifies what floating point hardware (or hardware emulation) is available on the target. Permissible names are: `fpa', `fpe2', `fpe3', `maverick', `vfp'. -mfp and -mfpe are synonyms for -mfpu=`fpe'number, for compatibility with older versions of GCC.

This is an interesting bug.

Comment 8 Alan Young 2010-02-11 21:57:34 UTC

As I understand it, the prebuilt libraries that come with the toolchain use -mfloat-abi=softfp.

Comment 9 Vahid Fereydouny 2010-02-16 17:19:03 UTC

To make sure that the context switching and the FPU is functional on Fab4 I wrote the next program(float_timing.c):

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <sys/time.h>

void print_usage ( void )
{
        printf ( "float_timing <repitition>\n" );
        printf ( "repitition: The number of iteration of the test\n" );
}

main(int argc, char **argv )
{
        int counter = 0;
        int rep;
        float x = 0;
        int time0, time1;

        printf ( "version 2.0\n" );
        if ( argc != 2 )
        {
                print_usage ();
                exit ( 0 );
        }

        rep = atoi ( argv[1]);

        time0 = time ( NULL );
        for ( counter = 0; counter < rep; ++counter )
        {
                x += sqrt ( counter * 0.895 );
                //printf ( "x is %f\n", x );
        }
        time1 = time ( NULL );

        printf ( "Time it took to finish operation is %d, x is %f\n", time1 - time0, x );
}
 
Here is the Makefile to build the float_timing_fast and float_timing_slow:

all:
        arm-none-linux-gnueabi-gcc -march=armv6j -mtune=arm1136jf-s -mfloat-abi=softfp -mfpu=vfp  ./float_timing.c -o float_timing_fast -lm
        arm-none-linux-gnueabi-gcc -march=armv6j -mtune=arm1136jf-s -mfloat-abi=soft   ./float_timing.c -o float_timing_slow -lm


clean:
        rm -f float_timing_slow
        rm -f float_timing_fast

I looked at the assembly output of the float_timing_fast to make sure that it is using s?? registers:

00008564 <main>:
    8564:	e92d4800 	push	{fp, lr}
    8568:	e28db004 	add	fp, sp, #4	; 0x4
    856c:	e24dd040 	sub	sp, sp, #64	; 0x40
    8570:	e50b0020 	str	r0, [fp, #-32]
    8574:	e50b1024 	str	r1, [fp, #-36]
    8578:	e3a03000 	mov	r3, #0	; 0x0
    857c:	e50b3018 	str	r3, [fp, #-24]
    8580:	e59f3120 	ldr	r3, [pc, #288]	; 86a8 <main+0x144>
    8584:	e50b3010 	str	r3, [fp, #-16]
    8588:	e59f011c 	ldr	r0, [pc, #284]	; 86ac <main+0x148>
    858c:	ebffffba 	bl	847c <_init+0x74>
    8590:	e51b3020 	ldr	r3, [fp, #-32]
    8594:	e3530002 	cmp	r3, #2	; 0x2
    8598:	0a000002 	beq	85a8 <main+0x44>
    859c:	ebffffe7 	bl	8540 <print_usage>
    85a0:	e3a00000 	mov	r0, #0	; 0x0
    85a4:	ebffffb7 	bl	8488 <_init+0x80>
    85a8:	e51b3024 	ldr	r3, [fp, #-36]
    85ac:	e2833004 	add	r3, r3, #4	; 0x4
    85b0:	e5933000 	ldr	r3, [r3]
    85b4:	e1a00003 	mov	r0, r3
    85b8:	ebffffa9 	bl	8464 <_init+0x5c>
    85bc:	e1a03000 	mov	r3, r0
    85c0:	e50b3014 	str	r3, [fp, #-20]
    85c4:	e3a00000 	mov	r0, #0	; 0x0
    85c8:	ebffffa8 	bl	8470 <_init+0x68>
    85cc:	e1a03000 	mov	r3, r0
    85d0:	e50b300c 	str	r3, [fp, #-12]
    85d4:	e3a03000 	mov	r3, #0	; 0x0
    85d8:	e50b3018 	str	r3, [fp, #-24]
    85dc:	ea00001c 	b	8654 <main+0xf0>
    85e0:	ed5b7a04 	vldr	s15, [fp, #-16]
    85e4:	eeb75ae7 	vcvt.f64.f32	d5, s15
    85e8:	ed0b5b0d 	vstr	d5, [fp, #-52]
    85ec:	e51b3018 	ldr	r3, [fp, #-24]
    85f0:	ee053a90 	vmov	s11, r3
    85f4:	eeb86be5 	vcvt.f64.s32	d6, s11
    85f8:	ed9f7b28 	vldr	d7, [pc, #160]	; 86a0 <main+0x13c>
    85fc:	ed1b5b0b 	vldr	d5, [fp, #-44]
    8600:	ee265b07 	vmul.f64	d5, d6, d7
    8604:	ed0b5b0b 	vstr	d5, [fp, #-44]
    8608:	ed1b6b0b 	vldr	d6, [fp, #-44]
    860c:	eeb16bc6 	vsqrt.f64	d6, d6
    8610:	ed0b6b11 	vstr	d6, [fp, #-68]
    8614:	ed1b7b11 	vldr	d7, [fp, #-68]
    8618:	ed1b5b11 	vldr	d5, [fp, #-68]
    861c:	eeb47b45 	vcmp.f64	d7, d5
    8620:	eef1fa10 	vmrs	APSR_nzcv, fpscr
    8624:	0a000002 	beq	8634 <main+0xd0>
    8628:	e14b02dc 	ldrd	r0, [fp, #-44]
    862c:	ebffff86 	bl	844c <_init+0x44>
    8630:	e14b04f4 	strd	r0, [fp, #-68]
    8634:	ed1b7b11 	vldr	d7, [fp, #-68]
    8638:	ed1b6b0d 	vldr	d6, [fp, #-52]
    863c:	ee367b07 	vadd.f64	d7, d6, d7
    8640:	eef77bc7 	vcvt.f32.f64	s15, d7
    8644:	ed4b7a04 	vstr	s15, [fp, #-16]
    8648:	e51b3018 	ldr	r3, [fp, #-24]
    864c:	e2833001 	add	r3, r3, #1	; 0x1
    8650:	e50b3018 	str	r3, [fp, #-24]
    8654:	e51b2018 	ldr	r2, [fp, #-24]
    8658:	e51b3014 	ldr	r3, [fp, #-20]
    865c:	e1520003 	cmp	r2, r3
    8660:	baffffde 	blt	85e0 <main+0x7c>
    8664:	e3a00000 	mov	r0, #0	; 0x0
    8668:	ebffff80 	bl	8470 <_init+0x68>
    866c:	e1a03000 	mov	r3, r0
    8670:	e50b3008 	str	r3, [fp, #-8]
    8674:	e51b2008 	ldr	r2, [fp, #-8]
    8678:	e51b300c 	ldr	r3, [fp, #-12]
    867c:	e0633002 	rsb	r3, r3, r2
    8680:	ed5b7a04 	vldr	s15, [fp, #-16]
    8684:	eeb77ae7 	vcvt.f64.f32	d7, s15
    8688:	e59f0020 	ldr	r0, [pc, #32]	; 86b0 <main+0x14c>
    868c:	e1a01003 	mov	r1, r3
    8690:	ec532b17 	vmov	r2, r3, d7
    8694:	ebffff6f 	bl	8458 <_init+0x50>
    8698:	e24bd004 	sub	sp, fp, #4	; 0x4
    869c:	e8bd8800 	pop	{fp, pc}
    86a0:	0a3d70a4 	.word	0x0a3d70a4
    86a4:	3feca3d7 	.word	0x3feca3d7
    86a8:	00000000 	.word	0x00000000
    86ac:	00008780 	.word	0x00008780
    86b0:	0000878c 	.word	0x0000878c
    86b4:	e1a00000 	nop			(mov r0,r0)


Here is the result of some operations:

# /root/float_timing_slow 1000000  
version 2.0
Time it took to finish operation is 5, x is 630544256.000000
# /root/float_timing_slow 10000000
version 2.0
Time it took to finish operation is 49, x is 19180300288.000000
# /root/float_timing_slow 100000000
version 2.0
Time it took to finish operation is 485, x is 274877906944.000000
# /root/float_timing_fast 1000000 
version 2.0
Time it took to finish operation is 0, x is 630544256.000000
# /root/float_timing_fast 10000000
version 2.0
Time it took to finish operation is 2, x is 19180300288.000000
# /root/float_timing_fast 100000000
version 2.0
Time it took to finish operation is 27, x is 274877906944.000000

Also I ran the following test to make sure that the concurrency is working fine with floating point operations:

# /root/float_timing_fast 100000000 &
version 2.0
# /root/float_timing_fast 100000000 &
version 2.0
# /root/float_timing_fast 100000000 &
version 2.0
# /root/float_timing_fast 100000000 &
version 2.0
# /root/float_timing_fast 100000000 &
version 2.0
# /root/float_timing_fast 100000000 &
version 2.0
# /root/float_timing_fast 100000000 &
version 2.0
# /root/float_timing_fast 100000000 &
version 2.0
# /root/float_timing_fast 100000000 &
version 2.0
# /root/float_timing_fast 100000000 &
version 2.0
# /root/float_timing_slow 100000000 &
version 2.0
# Time it took to finish operation is 274, x is 274877906944.000000
Time it took to finish operation is 283, x is 274877906944.000000
Time it took to finish operation is 286, x is 274877906944.000000
Time it took to finish operation is 289, x is 274877906944.000000
Time it took to finish operation is 290, x is 274877906944.000000
Time it took to finish operation is 290, x is 274877906944.000000
Time it took to finish operation is 289, x is 274877906944.000000
Time it took to finish operation is 289, x is 274877906944.000000
Time it took to finish operation is 288, x is 274877906944.000000
Time it took to finish operation is 287, x is 274877906944.000000

This proves to me that the FPU is functional and the context switching for floating point related operations is working. There is still a very minor chance of potential intermittent bugs, which I will be looking into next.

Comment 10 Vahid Fereydouny 2010-02-22 19:42:51 UTC

The following article describes in detail how the context switching for the VFP is done: http://www.linux-arm.org/LinuxKernel/LinuxVFP

FP registers saving and restoring is done lazily in Linux to avoid unnecessary operations when an application does not use the VFP.

The vfp_init function registers vfp_notifier to be called during a context switch via the __switch_to and atomic_notifier_call_chain functions. This function disables the VFP by clearing the FPEXC.EN bit, causing any subsequent VFP instruction (except access to the FPEXC, FPSID, MVFR0 and MVFR1 registers) to trigger an Undefined Instruction exception.

The vfp_support_entry function called via the Undefined Instruction trap checks whether the VFP is disabled and a new thread requires access (the address of the current VFP state is stored in last_VFP_context defined in vfphw.S). If a new thread requires the VFP, the support entry code saves the current VFP registers state (if any) and restores the state for the new thread. By default, only the double-precision registers, FPEXC and FPSCR have to be saved or restored. If the FPEXC.EX bit is set, the FPINST and FPINST2 registers have to be saved or restored as well. Once the VFP is enabled by setting the FPEXC.EN bit, the execution resumes and the trigger VFP instruction is restarted.

If following a context switch the execution would return to the same thread, the VFP support entry code checks for any exceptional VFP state that needs addressing (FPEXC.EX bit set) to avoid an exception re-entry when restarting the trigger instruction.

Comment 11 Vahid Fereydouny 2010-02-22 19:47:15 UTC

Since the workaround of changing the optimization level for the fft.c to none seems to have fixed this problem, this bug is postponed untill after the 7.5 release.
I have done extensive testing on floating pointers with multiple processes and I have noticed that in normal cases all the floating point operations work fine. There is one indication of potential problem and that seems to be related to the operations that cause overflow.

Comment 12 Michael Herger 2012-03-12 23:50:21 UTC

Vahid is no longer working for us.