Question re: sub-microsecond timing, performance tuning

scotthauck
Posts: 20
Joined: Fri Jul 26, 2019 5:50 pm

Postby scotthauck » Tue Jan 14, 2020 1:04 am

I'm working on code to interface to a motor controller with setup and hold time requirements of 1us and 200ns. If I time things with micros(), I'll waste a noticeable amount of time (probably 2-3 microseconds, due to its resolution), so I decided to roll an alternative using CCOUNT. But the routine itself seems to take 2-3us. So, my questions are: (1) Is there a better way? (2) Why do these take so long to run? Each inlined call should be a few dozen cycles, and the ESP32 is set to 240MHz.

The routines:

Code: Select all

// Get an initial time, for use as a baseline
__attribute__((always_inline))
uint32_t getBaseTime()
{
  return xthal_get_ccount();
}

// Busy-wait until at least the given amount of time has elapsed.  Note that if that
// amount of time has already elapsed, returns immediately.
// NOTE: won't work if you have already waited more than about 17 seconds since the baseline, due to counter roll-over.
__attribute__((always_inline))
void waitForElapsedNs(const uint32_t baseline, const uint32_t elapsedTime_ns)
{
  const uint32_t cyclesPerUs = XT_CLOCK_FREQ/(1000*1000);
  const uint32_t elapsedCcounts = (elapsedTime_ns > 1000*1000) ? (elapsedTime_ns / 1000 * cyclesPerUs) : ((elapsedTime_ns * cyclesPerUs) / 1000);

  while ((xthal_get_ccount()-baseline) < elapsedCcounts) {};
}

The test code, in the Arduino setup():

Code: Select all

  uint32_t base = getBaseTime();
  unsigned long startTime = micros();

  waitForElapsedNs(base, 3000);
  unsigned long plus3us = micros();

  waitForElapsedNs(base, 3000*1000);
  unsigned long plus3ms = micros();
  
  waitForElapsedNs(base, 3000);
  unsigned long immediately = micros();

  waitForElapsedNs(base, 1000*1000*1000);
  unsigned long plus1s = micros(); 

  waitForElapsedNs(base, 3000);
  unsigned long another = micros();

  Serial.print("3us later "); Serial.println(plus3us - startTime);
  Serial.print("3ms later "); Serial.println(plus3ms - startTime);
  Serial.print("Immediately return "); Serial.println(immediately - startTime);
  Serial.print("1 second "); Serial.println(plus1s - startTime);
  Serial.print("Immediately "); Serial.println(another - startTime);

The "immediately" and "another" calls should return VERY fast, with the while loop dropping out almost immediately. But I tend to see 3-4 microsecond delays. Hand in-lining seems to show that the while loop takes about 1/2 microsecond. Moving the const computations earlier might be helping, but given that everything is constant and inlined, I'd expect it all to be compiled away.

Thoughts?

P.S. The micros() calls are really quick, so don't seem to be the issue.
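For reference, the roll-over behaviour mentioned in the code comments can be checked off-target. The following is a minimal sketch, not the code from the post: `hasElapsed` and `wrapPeriodSeconds` are hypothetical helper names, and the 240MHz figure is taken from the question. It shows that unsigned subtraction survives a single wrap of a 32-bit counter, and that the counter wraps roughly every 17.9 seconds at that clock rate.

```cpp
#include <cstdint>

// True once at least `cycles` counter ticks have passed since `baseline`.
// Unsigned subtraction is modulo 2^32, so a single wrap of the counter
// between `baseline` and `now` still gives the right tick count.
constexpr bool hasElapsed(uint32_t now, uint32_t baseline, uint32_t cycles)
{
  return static_cast<uint32_t>(now - baseline) >= cycles;
}

// Seconds until a 32-bit cycle counter wraps at the given clock frequency.
constexpr double wrapPeriodSeconds(uint32_t clockHz)
{
  return 4294967296.0 / static_cast<double>(clockHz);
}

// Baseline taken 16 ticks before the counter wraps; target is 720 cycles
// (3us at an assumed 240MHz clock).
static_assert(!hasElapsed(0xFFFFFFF8u, 0xFFFFFFF0u, 720u), "8 ticks elapsed");
static_assert(!hasElapsed(0x00000010u, 0xFFFFFFF0u, 720u), "32 ticks elapsed, across the wrap");
static_assert(hasElapsed(0x000002C0u, 0xFFFFFFF0u, 720u), "720 ticks elapsed, across the wrap");
```

The comparison only breaks down once more than 2^32 cycles separate the baseline from the current count, which is the limit noted in the comment on waitForElapsedNs().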

CollinK
Posts: 18
Joined: Mon Apr 16, 2018 11:38 pm

Re: Question re: sub-microsecond timing, performance tuning

Postby CollinK » Tue Jan 14, 2020 1:22 am

This is just an uneducated guess, but could the scheduler be taking control and running another task behind your back? The processor is dual core, but there could still be other tasks running on your core. Then there are interrupts: if one comes in, it can cause phantom delays that you can't readily explain. You might need to take measures to ensure your code is not interrupted.
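One way to keep a short pulse sequence from being interrupted is a FreeRTOS critical section. The following is a minimal sketch under stated assumptions, not something tested on the poster's hardware: `runUninterrupted`, `pulseMux`, and the `PULSE_*` macros are hypothetical names, and the `#else` branch only provides host stand-ins so the sketch compiles off-target. On the ESP32, `portENTER_CRITICAL()` takes a spinlock argument.

```cpp
#include <cassert>

#if defined(ESP_PLATFORM)
#include "freertos/FreeRTOS.h"
// Spinlock required by the ESP32 port of the critical-section macros.
static portMUX_TYPE pulseMux = portMUX_INITIALIZER_UNLOCKED;
#define PULSE_ENTER_CRITICAL() portENTER_CRITICAL(&pulseMux)
#define PULSE_EXIT_CRITICAL()  portEXIT_CRITICAL(&pulseMux)
#else
// Host stand-ins (assumption for illustration): they only track nesting depth.
static int pulseCriticalDepth = 0;
#define PULSE_ENTER_CRITICAL() (++pulseCriticalDepth)
#define PULSE_EXIT_CRITICAL()  (assert(pulseCriticalDepth > 0), --pulseCriticalDepth)
#endif

// Run a short, timing-critical callable inside a critical section.
// Keep the body to a few microseconds; nothing else on this core is
// serviced while it runs.
template <typename F>
static inline void runUninterrupted(F body)
{
  PULSE_ENTER_CRITICAL();
  body();
  PULSE_EXIT_CRITICAL();
}
```

A caller would wrap just the pin toggles and the busy-wait between them, for example `runUninterrupted([] { /* set pin, wait, clear pin */ });`. This does not remove every source of jitter: cache misses can still stretch the timing.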

ESP_Sprite
Posts: 9020
Joined: Thu Nov 26, 2015 4:08 am

Re: Question re: sub-microsecond timing, performance tuning

Postby ESP_Sprite » Wed Jan 15, 2020 9:20 am

In general, tight timing loops on the ESP32 are not advised: there's an RTOS running in the background, there are caches being refilled, and there are all sorts of other sources of jitter. If you can, use a peripheral that's specifically intended for it. MCPWM, for instance, is very good with motors, and the RMT peripheral is ideal for general-purpose timing.

scotthauck
Posts: 20
Joined: Fri Jul 26, 2019 5:50 pm

Re: Question re: sub-microsecond timing, performance tuning

Postby scotthauck » Sat Jan 18, 2020 9:11 pm

I'm using lots of steppers, so I couldn't find anything pre-built that fits.

Through disassembly and hunting around, I found a few of the issues:
  • xthal_get_ccount() is a subroutine call wrapping a single assembly instruction, so it's better to use the assembly equivalent and avoid the call overhead.
  • XT_CLOCK_FREQ is also a subroutine call, not a defined constant, so math done with it happens at execution time, not compile time.
  • I'm still seeing some random losses of 2us, so it looks like I'm being pre-empted.
However, what I need is a way to make sure I don't violate minimum pulse times on signals (which may be as low as 200ns), with low typical-case overhead, and the following code seems to be working for me.

Note that since accesses of CCOUNT take about 50 cycles (roughly 200ns), I may look into whether fixed delays are available somewhere.

Code: Select all

// Normal Arduino timers and clocks have a resolution of about 2 microseconds.
// This routine allows for precision at more like 200ns on a 240MHz ESP32
// (or worse resolution at slower clock rates).  It uses the ccount register,
// which is updated once a clock cycle.
// 
// Note: looks like there are lots of random interrupts, so although the
//   accesses can get to perhaps 50ns delays, there are random 1us interrupts happening.
//   So, this will do good average-case performance, but can have bad worst-case on some
//   calls.  The joys of an RTOS.

// XT_CLOCK_FREQ is the clock speed.  A 240MHz clock will be 240*1000*1000
void printFrequency()
{
  Serial.print("XT_CLOCK_FREQ: "); Serial.print(XT_CLOCK_FREQ/1000/1000); Serial.println("MHz");
}

// XT_CLOCK_FREQ is a system call, and slow.  We define it here and check it at runtime,
// so that compiler can optimize away the constants.
#define SYSTEM_CLOCK_FREQ  (240*1000*1000)
void checkFrequency()
{
  if (SYSTEM_CLOCK_FREQ != XT_CLOCK_FREQ) {
    Serial.print("*** ERROR: Wrong clock frequency.  XT_CLOCK_FREQ = ");
    Serial.print(XT_CLOCK_FREQ);
    Serial.print(", code assumes ");
    Serial.println(SYSTEM_CLOCK_FREQ);
    while (1) {};
  }
}

// ccount counts once for each clock cycle, and is a 32-bit unsigned.  So, with a 240MHz clock
// this overflows every 17.9 seconds.  Each increment is about 4.2ns.
// We use the assembly language accessor here, since "xthal_get_ccount()" has subroutine overhead.

// Get an initial time, for use as a baseline
__attribute__((always_inline))
uint32_t getCcount()
{
  uint32_t ccount;
  __asm__ __volatile__("rsr %0,ccount":"=a" (ccount));

  return ccount;
}

// Busy-wait until at least the given amount of clock cycles have elapsed.  Note that if
// that amount of cycles have already elapsed, returns immediately.
// Note: if cpu is set to 240MHz, a cycle is roughly 4.2ns.
__attribute__((always_inline))
uint32_t waitForElapsedCycles(const uint32_t baseline, const uint32_t cycles)
{
  uint32_t ccount;
  do {
    ccount = getCcount();
  } while (ccount-baseline < cycles);
  return ccount;
}

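// Compute the number of cycles it takes to last at least a given ns.
// With a constant argument and inlining, the double math is expected to be
// folded at compile time. With a runtime argument it is likely slow, since
// the ESP32 has no hardware double-precision support.
__attribute__((always_inline))
uint32_t NS_TO_CYCLES(const uint32_t ns)
{
  static_assert(sizeof(double) == 8, "NS_TO_CYCLES assumes a 64-bit double");
  const double cyclesPerNs = SYSTEM_CLOCK_FREQ/(1000.0*1000.0*1000.0);
  const double elapsedCcounts = (double)ns * cyclesPerNs;
  return (uint32_t)ceil(elapsedCcounts);
}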

// Busy-wait until at least the given amount of time has elapsed.  Note that if that
// amount of time has already elapsed, returns immediately.
// NOTE: won't work if you have already waited more than about 17 seconds since the baseline, due to counter roll-over.
//
// Returns the last value of Ccount sampled.
__attribute__((always_inline))
uint32_t waitForElapsedNs(const uint32_t baseline, const uint32_t elapsedTime_ns)
{
  return waitForElapsedCycles(baseline, NS_TO_CYCLES(elapsedTime_ns));
}
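As an alternative to the double-based NS_TO_CYCLES() above, the conversion can be done in integer arithmetic and forced to compile time. This is a minimal sketch, not the poster's code: `kCpuHz` and `nsToCyclesCeil` are hypothetical names, and the 240MHz value is assumed to match SYSTEM_CLOCK_FREQ.

```cpp
#include <cstdint>

// Assumed CPU clock, matching the SYSTEM_CLOCK_FREQ define in the post.
constexpr uint32_t kCpuHz = 240u * 1000u * 1000u;

// Convert nanoseconds to CPU cycles, rounding up so a wait is never short.
// The 64-bit intermediate keeps ns * cpuHz from overflowing 32 bits.
constexpr uint32_t nsToCyclesCeil(uint32_t ns, uint32_t cpuHz = kCpuHz)
{
  return static_cast<uint32_t>(
      (static_cast<uint64_t>(ns) * cpuHz + 999999999ull) / 1000000000ull);
}

// Evaluated entirely by the compiler; no runtime math is involved.
static_assert(nsToCyclesCeil(200) == 48, "200ns is 48 cycles at 240MHz");
static_assert(nsToCyclesCeil(1000) == 240, "1us is 240 cycles at 240MHz");
static_assert(nsToCyclesCeil(1) == 1, "a partial cycle rounds up");
```

Because the function is constexpr, a call such as `waitForElapsedCycles(base, nsToCyclesCeil(200))` with a literal argument can be reduced to a constant by the compiler, which is the effect the SYSTEM_CLOCK_FREQ define was introduced to achieve.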

Last edited by scotthauck on Wed Jan 22, 2020 10:59 pm, edited 1 time in total.

ESP_Sprite
Posts: 9020
Joined: Thu Nov 26, 2015 4:08 am

Re: Question re: sub-microsecond timing, performance tuning

Postby ESP_Sprite » Sun Jan 19, 2020 10:05 am

How many steppers are we talking about here?

scotthauck
Posts: 20
Joined: Fri Jul 26, 2019 5:50 pm

Re: Question re: sub-microsecond timing, performance tuning

Postby scotthauck » Wed Jan 22, 2020 10:56 pm

Seven in one case, three in another.

ESP_Sprite
Posts: 9020
Joined: Thu Nov 26, 2015 4:08 am

Re: Question re: sub-microsecond timing, performance tuning

Postby ESP_Sprite » Thu Jan 23, 2020 8:59 pm

I'd suggest using the RMT peripheral, then: it has eight channels and can do the control fully automatically (i.e. no CPU use after you set the values).
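To get a feel for whether the 200ns and 1us requirements fit the RMT's timing resolution, the pulse lengths can be converted to RMT ticks ahead of time. This is a rough sketch under assumptions that should be checked against the RMT documentation: the channel is assumed to be clocked from the 80MHz APB clock, each duration field is assumed to be 15 bits wide, and a zero duration is assumed to be reserved as an end marker. `nsToRmtTicks` and `fitsRmtDuration` are hypothetical helper names, not driver functions.

```cpp
#include <cstdint>

// Assumption: RMT channel clocked from the 80MHz APB clock.
constexpr uint32_t kRmtSourceHz = 80u * 1000u * 1000u;
// Assumption: each pulse duration field is 15 bits wide.
constexpr uint32_t kRmtMaxTicks = 32767u;

// Ticks needed to cover at least `ns` nanoseconds at the given clock divider,
// rounded up so a pulse is never shorter than requested.
constexpr uint32_t nsToRmtTicks(uint32_t ns, uint32_t clkDiv)
{
  return static_cast<uint32_t>(
      (static_cast<uint64_t>(ns) * kRmtSourceHz
       + static_cast<uint64_t>(clkDiv) * 1000000000ull - 1)
      / (static_cast<uint64_t>(clkDiv) * 1000000000ull));
}

// A duration must be non-zero and fit in the assumed 15-bit field.
constexpr bool fitsRmtDuration(uint32_t ticks)
{
  return ticks >= 1 && ticks <= kRmtMaxTicks;
}

static_assert(nsToRmtTicks(200, 1) == 16, "200ns is 16 ticks of 12.5ns");
static_assert(nsToRmtTicks(1000, 1) == 80, "1us is 80 ticks of 12.5ns");
```

Under these assumptions a divider of 1 gives 12.5ns resolution, so both requirements map to small whole tick counts, and a single duration can span up to roughly 409us before a larger divider is needed.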
