Hard Fault exception in the FreeRTOS context switch on Atmel SAM L21 CPU

We are running Atmel SAM L21J18B gcc version 5.4.1 20160919 (15:5.4.1+svn241155-1) (arm-none-eabi-g++) FreeRTOSv9.0.0 It is our own code configured as tickless and running on our hardware. Eventually we see the HardFault exception (once per few hours). I was able to capture stacks and the trace from the micro trace buffer. It is always the same failure in the same place at the moment, when the context switch loads new stack. I believe that it is pxCurrentTCB (line 1688 in attached stackdebugfreertos.txt file). The crash is pretty consistent and happens several times a day on few different devices. It always happen, when the context switch is initiated by exiting from an interrupt (we see it just after the USB interrupt but I think it is a coincedence – may be any interrupt). Looking to the trace (see the attached file), I decided that the vListInsertEnd() or prvAddCurrentTaskToDelayedList() didn’t complete their operations, when the USB interrupt was followed by the PendSV interrupt. So the RTOS was trying to switch the context while the stack was still invalid. Both vListInsertEnd() and prvAddCurrentTaskToDelayedList() are called from the vTaskPlaceOnEventListRestricted(), which is called by the vTaskPlaceOnEventListRestricted() (see Source/tasks.c). There is no critical section in any of this function. I added taskENTERCRITICAL()/taskEXITCRITICAL() calls at beginning/end of the vTaskPlaceOnEventListRestricted() function. The failure bahavior changed: it started to fail in the vTaskPlaceOnUnorderedEventList(). So I added critical sections to vTaskPlaceOnUnorderedEventList() and vTaskPlaceOnEventList() also. It appears that other functions in tasks.c that call vListInsertEnd() and prvAddCurrentTaskToDelayedList() do have the critical sections. After these three fixes all my systems are running for last 24 hours without the HardFault exception. Not sure if it is a correct fix. Should I open a ticket (not sure how to do it)? I checked v10.1.1 sorces – this part of the code weren’t changed.

Hard Fault exception in the FreeRTOS context switch on Atmel SAM L21 CPU

Forgot to mention: since the code is configured as tickless, if a task is pending on a queue forever, it will be suspended, not delayed (as far as I understand the tasks.c code). And our code has such tasks.

Hard Fault exception in the FreeRTOS context switch on Atmel SAM L21 CPU

Does this still happen when tickless mode is not used? Are you using the default tickless implementation, or one tailored for your hardware. The usual suspect questions to ask also: Can you update to the latest FreeRTOS kernel code and ensure configASSERT() is defined – the newer the code the more assert points there are to catch interrupt priority misconfiguration – the latest version catches nearly all (if not actually all). Does the USB driver, or any other code, touch the basepri or any other interrupt enable/masking registers? Do you have the priority of all interrupts set at or below configMAXSYSCALLINTERRUPT_PRIORITY if they are using FreeRTOS API functions? Sometimes people think they are, but when they switch to the latest FreeRTOS kernel with the extra asserts they realise they are not. Do you have stack overflow protection turned on? If so, be aware that only checks the tasks tacks, not the interrupt stack. The interrupt stack is the stack used by main() (which is then re-used as the interrupt stack to recover the RAM).

Hard Fault exception in the FreeRTOS context switch on Atmel SAM L21 CPU

Hi Richard, It is not easy to catch the problem (takes few hours/days) and we need tickless so I just didn’t have time to try without tickless. Also we adjusted the tickless to use with the L21 RTC. As I said earlier, the trace is pointing to functions that were not changed in v10.1.1. But It is in the plan to go with the latest FreeRTOS soon. The USB driver is from Atmel SDK. It calls its own critical sections: ~~~

define cpuirqenable()

do {                                       
    g_interrupt_enabled = true;            
    __DMB();                               
    __enable_irq();                        
} while (0)

define cpuirqdisable()

do {                                       
    __disable_irq();                       
    __DMB();                               
    g_interrupt_enabled = false;           
} while (0)
~~~ Looks fine to me… It is Cortex m0+… I was playing with priorities before capturing the trace. Finally I set all priorities (including RTOS context switch to the same level). It never changed the outcome. Yes, I turned the stack overflow protection in order to catch the problem – it didn’t show anything. I moved stacks around in the memory also – it still the same: just one pointer in the stack is corrupted. Can you write me in private? Probably we can arrange a meeting and I can walk you through the trace? Thank you, Alex.

Hard Fault exception in the FreeRTOS context switch on Atmel SAM L21 CPU

And here is our RTOS configuration: ~~~

ifndef FREERTOSCONFIGH

define FREERTOSCONFIGH

/* For documentation for all the configuration symbols, go to: * http://www.freertos.org/a00110.html. */

if defined (GNUC) || defined (ICCARM)

/* Important: put #includes here unless they are also meant for the assembler. */

include <stdint.h>

include <shal.h>

void asserttriggered( const char * file, uint32t line );

endif

define configLIST_VOLATILE volatile

define configUSE_PREEMPTION 1

define configUSETICKLESSIDLE 2

define configUSEIDLEHOOK 0

define configUSETICKHOOK 0

define configPRIO_BITS 2

define configCPUCLOCKHZ ( PLATFORMCPUCLK_HZ )

define configTICKRATEHZ ( ( TickType_t ) 1000 )

define configMAXPRIORITIES ( ( uint32t ) 10 ) // 0 == lowestt

define configMINIMALSTACKSIZE ( ( uint16_t ) 100 )

/* configTOTALHEAPSIZE is not used when heap_3.c is used. */

define configTOTALHEAPSIZE ( ( size_t ) ( 18000 ) ) // was 17000

define configMAXTASKNAME_LEN ( 8 )

define configUSETRACEFACILITY 0

define configUSE16BIT_TICKS 0

define configIDLESHOULDYIELD 1

define configUSE_MUTEXES 1

define configQUEUEREGISTRYSIZE 0

ifdef BOOTLOADER

define configCHECKFORSTACK_OVERFLOW 0

else

define configCHECKFORSTACK_OVERFLOW 1

endif

define configUSERECURSIVEMUTEXES 1

define configUSEMALLOCFAILED_HOOK 0

define configUSECOUNTINGSEMAPHORES 1

define configUSEQUEUESETS 1

define configGENERATERUNTIME_STATS 0

define configENABLEBACKWARDCOMPATIBILITY 1

/* Co-routine definitions. */

define configUSECOROUTINES 0

define configMAXCOROUTINE_PRIORITIES ( 2 )

/* Software timer definitions. */

define configUSE_TIMERS 1

define configTIMERTASKPRIORITY ( 2 )

define configTIMERQUEUELENGTH 2

define configTIMERTASKSTACK_DEPTH ( 80 )

/* Set the following definitions to 1 to include the API function, or zero to exclude the API function. */

define INCLUDE_vTaskPrioritySet 1

define INCLUDE_uxTaskPriorityGet 1

define INCLUDE_vTaskDelete 0

define INCLUDE_vTaskSuspend 1

define INCLUDE_xResumeFromISR 1

define INCLUDE_vTaskDelayUntil 1

define INCLUDE_vTaskDelay 1

define INCLUDE_xTaskGetSchedulerState 1

define INCLUDE_xTaskGetCurrentTaskHandle 1

define INCLUDE_uxTaskGetStackHighWaterMark 1 // GHF modified this for DEBUG June 18, 2015

define INCLUDE_xTaskGetIdleTaskHandle 0

define INCLUDE_xTimerGetTimerDaemonTaskHandle 0

define INCLUDE_pcTaskGetTaskName 1 // Alex: test buffers; GHF testing for tick times….

define INCLUDE_eTaskGetState 1

define INCLUDE_xTimerPendFunctionCall 1

/* Normal assert() semantics without relying on the provision of an assert.h header file. */

ifdef BOOTLOADER

define configASSERT(x)

else

extern void shoofASSERT ( void );

define configASSERT(x)

        if( ( x ) == 0 ) { shoofASSERT(); }

endif

/* The lowest interrupt priority that can be used in a call to a “set priority” function. */

define configLIBRARYLOWESTINTERRUPT_PRIORITY 0x0f

/* The highest interrupt priority that can be used by any interrupt service routine that makes calls to interrupt safe FreeRTOS API functions. DO NOT CALL INTERRUPT SAFE FREERTOS API FUNCTIONS FROM ANY INTERRUPT THAT HAS A HIGHER PRIORITY THAN THIS! (higher priorities are lower numeric values. */

define configLIBRARYMAXSYSCALLINTERRUPTPRIORITY 10

/* Interrupt priorities used by the kernel port layer itself. These are generic to all Cortex-M ports, and do not rely on any particular library functions. */

define configKERNELINTERRUPTPRIORITY ( configLIBRARYLOWESTINTERRUPTPRIORITY << (8 – configPRIOBITS) )

define configMAXSYSCALLINTERRUPTPRIORITY ( configLIBRARYMAXSYSCALLINTERRUPTPRIORITY << (8 – configPRIOBITS) )

if defined (GNUC) || defined (ICCARM)

include <portmacro.h>

void vPortSuppressTicksAndSleep( TickType_t xExpectedIdleTime );

endif

define portSUPPRESSTICKSAND_SLEEP vPortSuppressTicksAndSleep

//XXX: Temporary hook to get the malloc boundaries for different tasks

ifndef BOOTLOADER

define traceMALLOC(pvReturn, xWantedSize)

printf("rn Malloc start: 0x%08x; end: 0x%08x; size: 0x%04x ", (int)pvReturn, (int)pvReturn + (int)xWantedSize, (int)xWantedSize)

endif

endif /* FREERTOSCONFIGH */

~~~

Hard Fault exception in the FreeRTOS context switch on Atmel SAM L21 CPU

It is not easy to catch the problem (takes few hours/days) and we need tickless so I just didn’t have time to try without tickless. Also we adjusted the tickless to use with the L21 RTC.
Intention is to see if the issue is related to tickless or not.
The USB driver is from Atmel SDK. It calls its own critical sections:

define cpuirqenable()

 do {                                       
     g_interrupt_enabled = true;            
     __DMB();                               
     __enable_irq();                        
 } while (0)

define cpuirqdisable()

 do {                                       
     __disable_irq();                       
     __DMB();                               
     g_interrupt_enabled = false;           
 } while (0)
This is only ok because it is globally disabling and re-enabling interrupts (assuming the macros are used from inside something that counts nesting, otherwise interrupts are blindly re-enabled by the macro) and FreeRTOS only uses interrupt masking with the basepri register – so there is no conflict with the driver and the RTOS’s accesses to the hardware.

Hard Fault exception in the FreeRTOS context switch on Atmel SAM L21 CPU

Ah, re-reading your original post, I see you are using an M0, which doesn’t have a base pri register, so the critical sections used by the driver could indeed be a problem. If FreeRTOS is inside a critical section when cpuirqenable() is called then the critical section will be exited and you will have trouble. So: 1) Is it possible that cpuirqenable() is called from inside a critical section? Probably not. 2) Is cpuirqenable() called in a may that sets the interrupt mask back to whatever it was when cpuirqdisable() is called? Or does it just enable interrupt regardless of whether they were originally enabled or not?

Hard Fault exception in the FreeRTOS context switch on Atmel SAM L21 CPU

A couple of comments on the configuration file: 1) Set configCHECKFORSTACK_OVERFLOW to 2, not 1. 2) How often is printf() called (how often is memory allocated?). Calling printf() is one of the most common causes of issues in a multithreaded environment with very small stacks.

Hard Fault exception in the FreeRTOS context switch on Atmel SAM L21 CPU

I see your point. Let me check how cpuirqdisabled/enabled are used. Would love to try your suggestion with tickless disabling and a new RTOS but with testing it may take over a week. Thank you, Alex.

Hard Fault exception in the FreeRTOS context switch on Atmel SAM L21 CPU

Hi Richard, So theoretically using two different critical sections may bring a problem – I defined SDK functions to use the FreeRTOS functions. It didn’t solve the problem. It started to happen even more often (probably coincedense). Do you want to look with me into the trace file? IMO, it is pointing to a hole in the OS. Thank you, Alex.

Hard Fault exception in the FreeRTOS context switch on Atmel SAM L21 CPU

We allocate memory just in init. Then the malloc() never called again. Never call free(). We do use printf for debugging. Messages are short. I can disable all of them but I doubt that it will help – I do not see any call to printf while analyzing the trace buffer.

Hard Fault exception in the FreeRTOS context switch on Atmel SAM L21 CPU

I can disable all of them but I doubt that it will help –
Did it change the behaviour?
I do not see any call to printf while analyzing the trace buffer.
Doesn’t have to be for it to have caused an issue.

Hard Fault exception in the FreeRTOS context switch on Atmel SAM L21 CPU

Didn’t try it yet. On Friday I switch the USB driver into another mode (IP over USB rather than single characters). Still testing. Didn’t see an exception for 37 hours.