Race condition between vTaskSwitchContext and xTaskNotifyFromISR?

I’m running FreeRTOS 10.0.1 on a Kinetis MK22FN1M0AVLH12 (Cortex M4F core), specifically the MCUonEclipse port, with MCUXpresso and gcc. I’ve been chasing an intermittent crash for weeks and I’m finally narrowing in on it. I have a temporary fix that seems to be working but I don’t see why it’d be necessary. It’s been hard to track down until now because it’d usually take hours to crash. What’s happening is that vTaskSwitchContext is sometimes being interrupted by a flex timer peripheral IRQ which uses xTaskNotifyFromISR. Normally this works fine, but if the interrupt happens at the wrong place in listGETOWNEROFNEXTENTRY it ends up setting pxCurrentTCB to null and the system crashes – with the stack clobbered, which has been slowing down my debugging. The SysTick and PendSV priorities are set to 15 (minimum), and the maximum API call priority is 5. The timer IRQ is set to priority 7 at the moment. To make the problem more apparent and speed up testing, I changed listGETOWNEROFNEXTENTRY to include a delay (line continuation backslashes omitted for clarity): ~~~

define listGETOWNEROFNEXTENTRY( pxTCB, pxList )

{
List_t * const pxConstList = ( pxList ); /* Increment the index to the next item and return the item, ensuring */ /* we don’t return the marker used at the end of the list. */ ( pxConstList )->pxIndex = ( pxConstList )->pxIndex->pxNext; if( ( void * ) ( pxConstList )->pxIndex == ( void * ) &( ( pxConstList )->xListEnd ) ) { ( pxConstList )->pxIndex = ( pxConstList )->pxIndex->pxNext; } for (int ix = 2000; ix; ix–) {}; ( pxTCB ) = ( pxConstList )->pxIndex->pvOwner; } ~~~ This just stretches out the time between the adjacent lines of code by a very large amount, so instead of taking several hours to crash it does so in seconds. My fix for the moment is to globally disable interrupts in vTaskSwitchContext before taskSELECTHIGHESTPRIORITY_TASK: ~~~ __asm (“CPSID i”); taskSELECTHIGHESTPRIORITYTASK(); traceTASKSWITCHED_IN(); __asm (“CPSIE i”); ~~~ I’m assuming there must be something wrong with my configuration, or maybe an oversight in this particular port, but I don’t know the FreeRTOS internals well enough to know where that might be. I’ve been through every resource I can find on Cortex M3/4 ports. Any ideas? Thanks! Scott

Race condition between vTaskSwitchContext and xTaskNotifyFromISR?

Thanks for providing this level of detail. I can try and replicate your scenario – with the additional delay included – but first a question about vTaskSwitchContext() getting interrupted by that has a priority lower than configMAXSYSCALLINTERRUPT_PRIORITY as that should not happen. vTaskSwitchContext() is called from the PendSV handler, at the time of writing it can be seen on line 455 here: https://sourceforge.net/p/freertos/code/HEAD/tree/trunk/FreeRTOS/Source/portable/GCC/ARMCM4F/port.c On lines 451 and 452 you can see that the basepri register is set to configMAXSYSCALLINTERRUPTPRIORITY, which you said it 5, which should then prevent the timer interrupt from executing if it has a priority of 7 (5 being a higher priority than 7 on Cortex-M). Would you agree with my description here? If so, do you have configASSERT() defined? Are there any other interrupts executing that might be the culprit?

Race condition between vTaskSwitchContext and xTaskNotifyFromISR?

Hi Richard, Thanks for the quick reply. My port.c looks different – I believe this one came from Erich Styger’s PEx components at mcuoneclipse.com originally. I know this port has been modified to support Newlib’s malloc() for heap allocation, not sure what else might be different. Here’s the vPortPendSVHandler: ~~~

if (configCOMPILER==configCOMPILERARMGCC)

if configGDB_HELPER

/* prototypes to avoid compiler warnings */ attribute ((naked)) void vPortPendSVHandler_native(void); attribute ((naked)) void PendSV_Handler_jumper(void); attribute ((naked)) void vPortPendSVHandler_native(void) {

elif !MCUC1CONFIGPEXSDKUSED /* the SDK expects different interrupt handler names */

attribute ((naked)) void PendSV_Handler(void) {

else

attribute ((naked)) void vPortPendSVHandler(void) {

endif

if configCPUFAMILYISARMM4M7(configCPUFAMILY) /* Cortex M4 or M7*/

__asm volatile ( ” mrs r0, psp n” ” ldr r3, pxCurrentTCBConst n” /* Get the location of the current TCB. */ ” ldr r2, [r3] n”

if configCPUFAMILYISARMFPU(configCPU_FAMILY)

" tst r14, #0x10             n" /* Is the task using the FPU context?  If so, push high vfp registers. */
" it eq                      n"
" vstmdbeq r0!, {s16-s31}    n"

" stmdb r0!, {r4-r11, r14}   n" /* save remaining core registers */

else

" stmdb r0!, {r4-r11}        n" /* Save the core registers. */

endif

" str r0, [r2]               n" /* Save the new top of stack into the first member of the TCB. */
" stmdb sp!, {r3, r14}       n"
" mov r0, %0                 n"
" msr basepri, r0            n"
" bl vTaskSwitchContext      n"
" mov r0, #0                 n"
" msr basepri, r0            n"
" ldmia sp!, {r3, r14}       n"
" ldr r1, [r3]               n" /* The first item in pxCurrentTCB is the task top of stack. */
" ldr r0, [r1]               n"

if configCPUFAMILYISARMFPU(configCPU_FAMILY)

" ldmia r0!, {r4-r11, r14}   n" /* Pop the core registers */
" tst r14, #0x10             n" /* Is the task using the FPU context?  If so, pop the high vfp registers too. */
" it eq                      n"
" vldmiaeq r0!, {s16-s31}    n"

else

" ldmia r0!, {r4-r11}        n" /* Pop the core registers. */

endif

" msr psp, r0                n"
" bx r14                     n"
"                            n"
" .align 2                   n"
"pxCurrentTCBConst: .word pxCurrentTCB  n"
  ::"i"(configMAX_SYSCALL_INTERRUPT_PRIORITY)
);

else /* Cortex M0+ */

__asm volatile ( ” mrs r0, psp n” ” n” ” ldr r3, pxCurrentTCBConst n” /* Get the location of the current TCB. / ” ldr r2, [r3] n” ” n” ” sub r0, r0, #32 n” / Make space for the remaining low registers. / ” str r0, [r2] n” / Save the new top of stack. / ” stmia r0!, {r4-r7} n” / Store the low registers that are not saved automatically. / ” mov r4, r8 n” / Store the high registers. / ” mov r5, r9 n” ” mov r6, r10 n” ” mov r7, r11 n” ” stmia r0!, {r4-r7} n” ” n” ” push {r3, r14} n” ” cpsid i n” ” bl vTaskSwitchContext n” ” cpsie i n” ” pop {r2, r3} n” / lr goes in r3. r2 now holds tcb pointer. / ” n” ” ldr r1, [r2] n” ” ldr r0, [r1] n” / The first item in pxCurrentTCB is the task top of stack. / ” add r0, r0, #16 n” / Move to the high registers. / ” ldmia r0!, {r4-r7} n” / Pop the high registers. / ” mov r8, r4 n” ” mov r9, r5 n” ” mov r10, r6 n” ” mov r11, r7 n” ” n” ” msr psp, r0 n” / Remember the new top of stack for the task. / ” n” ” sub r0, r0, #32 n” / Go back for the low registers that are not automatically restored. / ” ldmia r0!, {r4-r7} n” / Pop low registers. */ ” n” ” bx r3 n” ” n” “.align 2 n” “pxCurrentTCBConst: .word pxCurrentTCB” );

endif

} ~~~ My ARM assembly language is really weak – what’s the %0 notation? Looks like it assembled as #0x50, which is correct. I’m assuming that must be an assembly-time substitution but I don’t know what it references. When I catch the race condition, BASEPRI is 0. I’ve tried to step through the PendSV handler but the debugger seems determined to make sure I don’t get to see it happening – every time, it hangs before it gets to the msr BASEPRI, r0 instruction. The gdb server sits there with high CPU utilization, whether I’m using an LPC-Link2 or a P&E Cyclone ACP. I’ve attached a screenshot so you can see where it was at. r0 is set to 0x50, so that’s right. There are other interrupts enabled, but I’ve set all of the NVIC priority registers to values in the range of 0x50 to 0x80 whether the interrupt is implemented or not, just in case. I once got the debugger to stop in vPortSwitchContext and saw BASEPRI set, but when it stepped over the stack check it came up 0 again. It also stopped reading registers entirely right after that so it’s hard to say what was actually going on. This is where I’d normally insert a rant about the unreliability of NXP’s debugging tools, but I’ve got at least a partial workaround for the bug that’s been plauging this project for weeks, we just started shipping another new product today, and I had a really great first date last night, so I’m going to quit while I’m still in a good mood. I’ll see what I can find out from Erich about the origin of this port.c, and fuss with the debugger some more tomorrow. Thanks again, Scott

Race condition between vTaskSwitchContext and xTaskNotifyFromISR?

Here’s the vPortPendSVHandler:
Although different, the assembly instructions that set the basepri register are still in place.
My ARM assembly language is really weak – what’s the %0 notation? Looks like it assembled as #0x50, which is correct. I’m assuming that must be an assembly-time substitution but I don’t know what it references.
It is referencing the following constant:
   ::"i"(configMAX_SYSCALL_INTERRUPT_PRIORITY)
When I catch the race condition, BASEPRI is 0. I’ve tried to step through the PendSV handler but the debugger seems determined to make sure I don’t get to see it happening – every time, it hangs before it gets to the msr BASEPRI, r0 instruction. The gdb server sits there with high CPU utilization, whether I’m using an LPC-Link2 or a P&E Cyclone ACP.
Do you set the debugger to step assembly instructions rather than C instructions before stepping through? As this is inline assembly you would need to do that, then open the assembly window in the debugger and step through the code there. That function should never be executing with basepri set to 0 – so that is the curiosity here. Is the function called from anywhere else other than the function that starts the scheduler in the first place? Are you 100% sure there are no other interrupts running? Did you ensure configASSERT() was defined – that should catch misconfigurations – assuming that part of the code has not been changed. You could try running the official port to see if that has the same issues.
I’ve attached a screenshot so you can see where it was at. r0 is set to 0x50, so that’s right. There are other interrupts enabled, but I’ve set all of the NVIC priority registers to values in the range of 0x50 to 0x80 whether the interrupt is implemented or not, just in case. I once got the debugger to stop in vPortSwitchContext and saw BASEPRI set, but when it stepped over the stack check it came up 0 again. It also stopped reading registers entirely right after that so it’s hard to say what was actually going on.
Maybe you are not seeing what you think you are seeing here. Are you 100% sure the issue is occurring inside the tick interrupt? Could the issue actually be being caused elsewhere in the system, and only become evident in the tick interrupt?

Race condition between vTaskSwitchContext and xTaskNotifyFromISR?

Richard, Scott, I’m going to check this one one my side too. That tick handler is a combination of multiple ports to support many architectures. I’ll check if I have missed anything. Erich

Race condition between vTaskSwitchContext and xTaskNotifyFromISR?

Yes, the debugger was set to instruction stepping mode. And no, nothing else calls vTaskSwitchContext. Here’s the sequence of functions from the instruction trace I captured during one crash:
vQueueWaitForMessageRestricted prvProcessTimerOrBlockTask xTaskResumeAll vPortEnterCritical xTaskResumeAll vPortExitCritical xTaskResumeAll prvProcessTimerOrBlockTask vPortYieldFromISR vPortPendSVHandler vTaskSwitchContext xTaskGetTickCountFromISR vPortValidateInterruptPriority xTaskGetTickCountFromISR vTaskSwitchContext ftm0_irq vTaskSwitchContext (crashes because pxCurrentTCB is now null)
configASSERT is defined. I modified it yesterday to add a bkpt instruction to the infinite loop so it’d stop the debugger immediately, but it definitely works, and catches when I screw up interrupt priorities.
Are you 100% sure there are no other interrupts running?
There are 18 peripheral interrupts enabled, in addition to the core interrupts like SysTick and the fault handlers. All of the unused vectors point to a halt function, and all of them, used or not, have priority levels set to be safe with FreeRTOS. There may be other interrupts pending (if I pause the debugger there I get an EWM watchdog interrupt at least) but nothing that should be able to interrupt the kernel. It was definitely the FTM IRQ that caused it in the trace above, and every trace I’ve caught, and the ISR also turns on an LED that gets turned off in a few ticks whenever it generates a task notification. The LED was stuck on every time I checked during a lockup so the FTM ISR had definitely done a task notification very recently every time it locked up.
Are you 100% sure the issue is occurring inside the tick interrupt?
Which tick interrupt are you talking about? The FTM IRQ, or the SysTick that’s presumably kicking off the context switch that’s getting interrupted? I am certain that it’s the FTM IRQ (but not necessarily exclusively that one) that’s interrupting the context switch. There are UART DMA and ADC DMA interrupts also firing more or less frequently depending on what’s going on (for WiFi data and audio samples) and each of those can wake up their respective tasks (FTM sends notifications to the main task) and I haven’t seen them cause this problem, but that could be because of their relative frequencies. Also, the FTM ISR is processing data from a GPS receiver that’s aligned to the start of the GPS second. The SysTick rate is an even 300 Hz and the clock oscillator is going to be off by some ppm relative to GPS, so presumably the start of the NMEA data (and the resulting task notifications) will be slowly slipping relative to the SysTick tick. If there’s an opening for a race condition, a slow slip like that is the way to find it. The WiFi DMA will be very bursty and dependent on traffic (I’ve had four browser clients running to crank up the traffic level) but much less consistent with timing, and the ADC DMA rate divides evenly into the SysTick rate, with both driven by the same clock, so it’s not likely to trigger a race condition. I’m going to work on the debugger today and see if I can get it to step through the PendSV code. Thanks, Scott Edit: Now that I think about it, the GPS time slip shouldn’t be the issue. The FTM overflow still fires at a consistent rate, it’s just the task notification that changes, and for testing I ignored the data entirely and had the notification happen every overflow and it still caused the lockup.

Race condition between vTaskSwitchContext and xTaskNotifyFromISR?

Ok, digging through the instruction trace and I think I’ve got something. You guys are going to have to help me understand why this is. I’ve got configGENERATERUNTIME_STATS enabled, so vTaskSwitchContext() calls xTaskGetTickCountFromISR(), which looks like this: ~~~ uxSavedInterruptStatus = portTICKTYPESETINTERRUPTMASKFROMISR(); { xReturn = xTickCount; } portTICKTYPECLEARINTERRUPTMASKFROMISR( uxSavedInterruptStatus ); ~~~ This is where it’s going off the rails – BASEPRI is being set to 0 by that last line. Here are the macros: ~~~

define portTICKTYPESETINTERRUPTMASKFROMISR() portSETINTERRUPTMASKFROMISR()

define portTICKTYPECLEARINTERRUPTMASKFROMISR( x ) portCLEARINTERRUPTMASKFROMISR( ( x ) )

define portSETINTERRUPTMASKFROMISR() 0;portSETINTERRUPTMASK()

define portCLEARINTERRUPTMASKFROMISR(x) portCLEARINTERRUPTMASK();(void)x

~~~ So for some reason portSETINTERRUPTMASKFROMISR is defined to always return 0, and portCLEARINTERRUPTMASKFROMISR just ignores the mask it’s given. This was clearly someone’s conscious decision. I can’t tell what it was trying to accomplish, or if it was just something that was set for testing and overlooked. Richard, looks like the problem is not on your end. Erich, do you know where this came from? Since I’m not using Processor Expert anymore I suppose I should probably move to the official current port. My big hangup there is that I’m going to have to take another look at my heap usage. I make sparing use of the heap in application code and I don’t like having two different heap mechanisms, so I’ve been using the Newlib malloc option that I don’t think is part of the standard release. I’m going to have to go refresh my memory on heap options. Thanks, Scott

Race condition between vTaskSwitchContext and xTaskNotifyFromISR?

This is actually the standard behaviour – so in the mainline code, not just Erich’s, although I’m not sure what the portTICKTYPECLEARINTERRUPTMASKFROMISR() macro is and its implementation does not appear to do anything other than call portCLEARINTERRUPTMASKFROMISR() (which is our macro). The critical sections used to st the interrupt mask back to the previous mask value, but I was informed by architecture experts that that was not necessary, and that internally the silicon would latch to the value of the current interrupt anyway, so just setting to 0 was the most efficient thing to do on exiting the critical section. Have you tried updating the portSET/CLEARINTERRUPTMASKFROMISR() macros to actually return the current basepri value rather than just hard code to return 0?

Race condition between vTaskSwitchContext and xTaskNotifyFromISR?

portTICKTYPECLEARINTERRUPTMASKFROMISR() seems to be part of the mainline code too. I just grabbed 10.1.1 yesterday. Looks like the portTICK_TYPE wrappers are there to handle atomic or non-atomic tick reads depending on the port. If it’s atomic, the macros do nothing.
and that internally the silicon would latch to the value of the current interrupt anyway
Wouldn’t that still be a problem here? If I understand it correctly, vTaskSwitchContext() is running from the PendSV interrupt, which is set to the lowest priority – in this case 15. If the interrupt priority is what’s latched, then when the tick read critical section within vTaskSwitchContext() completes it’s effectively setting BASEPRI to 15 when it entered at 5, and now the FTM IRQ at priority 7 is able to interrupt what should be a critical section in taskSELECTHIGHESTPRIORITY_TASK(). Next question: Why would reads of tick type variables not be atomic on a Cortex M4? I think maybe that’s where our problems start – here’s the mainline portmacro.h: ~~~ /* 32-bit tick type on a 32-bit architecture, so reads of the tick count do not need to be guarded with a critical section. */ #define portTICK_TYPE_IS_ATOMIC 1 ~~~ That is not defined in Erich’s portmacro.h. That might explain why this hasn’t been a problem for more people – to trigger this bug you have to have portTICKTYPEIS_ATOMIC=0 and have runtime statistics turned on. Setting portTICKTYPEIS_ATOMIC looks like it will let me remove my global interrupt mask, but not knowing the kernel better I’d worry that there might be other places where BASEPRI could get set wrong and it doesn’t feel like a solid fix to me.
Have you tried updating the portSET/CLEARINTERRUPTMASKFROMISR() macros to actually return the current basepri value rather than just hard code to return 0?
I’m not conversant enough in ARM/Thumb assembly to even read more than 70% of what’s there without a reference manual in front of me and I haven’t used gcc’s extended asm syntax before. I’m sure I could get it but I’d feel better if someone who knows what they’re doing showed me the right way, so I don’t wind up chasing down side effects. And if anyone knows a good, complete, concise reference/cheat sheet for the instruction set and assembly syntax, please let me know! Thanks, Scott

Race condition between vTaskSwitchContext and xTaskNotifyFromISR?

So the scenario is like this: 1) PendSV executes. 2) Hardware sets priority in the NVIC to 15 (low priority used by interrupt handler). 3) PendSV software sets basepri to 5 before calling vTaskSwitchContext(). 4) vTaskSwitchContext() calls the stats function, which calls portSETINTERRUPTMASKFROMISR(), which also sets BasePri to 5 (which it already was anyway). 5) When portCLEARINTERRUPTMASKFROMISR() is called software sets BasePri to 0 (unmasks everything) BUT actually basepri used within the hardware itself is the priority of the interrupt – which is 15 – and the critical section has been broken. Makes sense.

Race condition between vTaskSwitchContext and xTaskNotifyFromISR?

Yes, that’s my interpretation. The lack of a portTICKTYPEIS_ATOMIC setting seems to be an oversight in Erich’s port, but it ought to be a harmless one that just costs a few wasted cycles per context switch. I have a few different ways I can avoid this for now. Setting it up so that it does retain the BASEPRI setting seems to me like the way to make it behave in the least-surprising fashion. If that’s not where the fix is made, then the big question is whether there’s anywhere else that this is going to pop up. The most surprising thing about this is that it apparently wasn’t a bug in my application code. Usually the rule is that the longer you spend hunting a bug, the more likely it is to be something absolutely trivial that someone else is going to spot the first time they take a look over your shoulder. Now if SiLabs could just make their WiFi modules not randomly reboot, I could have a pretty solid product…

Race condition between vTaskSwitchContext and xTaskNotifyFromISR?

Richard, do you need me to submit a bug report or anything? I have what I need for now, I just want to make sure this doesn’t bite anyone else.

Race condition between vTaskSwitchContext and xTaskNotifyFromISR?

I have it noted on the backlog, but probably best if you open a bug ticket in SourceForge and point it to this thread so it doesn’t get lost and others have visibility too – thanks.

Race condition between vTaskSwitchContext and xTaskNotifyFromISR?

Will do!