I did some testing of the hardware SPI support yesterday, to try to get async send working.
I thought async sends were not implemented, since every post I found was a discussion of how hard it was to use DMA and so on. It turns out I had misunderstood: transfers of 8-bit words are fully supported by the callback version of SPI.transfer! When people say that 16-bit (and larger) transfers are not supported, they mean that 16-bit words (that is, sending arrays of 16-bit ints) are not supported.
I only discovered this after trying to understand what was done in the example from Garug in this post: https://forum.pjrc.com/threads/67247-Teensy-4-0-DMA-SPI/page2
I deleted everything special and it still worked. Here are some outputs:
My main loop output without any SPI sends running, to see what speed I can get without anything else interfering.
Normal "one byte per transfer" blocking SPI makes the main loop output grind to a halt.
SPI transfer using callbacks. Main loop speed is almost the same as with no load :-D (The SPI transfers 0x00 and 0xFF, which shows up as the slow output on line 2.)
Changed to transferring 8 bytes at a time, to measure the delay between transfers.
8-byte transfers, but with the speed changing between 2 and 8MHz. There is not much overhead in doing the switch.
Retriggering send from callback
This seems to work fine:
Moving the start/end transaction so that it only happens when we need a speed change saves us a little time between blocks:
We still get a 1.5uS "penalty" between blocks here, which amounts to 96uS per 64 blocks sent (64 x 1.5uS). During this period it looks like the SPI actually blocks the main thread too. That's not good. I have to look for ways to speed this up, but we're already using attachImmediate, which is supposedly the fastest callback.
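As a rough illustration of the two ideas above (retriggering the next send from the completion callback, and only restarting the transaction when the speed changes), here is a small host-side C++ simulation. MockBus, ChainedSender and runChain are hypothetical names made up for this sketch. MockBus only imitates the general shape of an async SPI bus and is not the Teensy SPI library, so treat this as a way to reason about the logic, not as the code that ran on the board.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for an async SPI bus; NOT the Teensy SPI API.
struct MockBus {
    bool inTransaction = false;
    bool pending = false;            // a transfer is "in flight"
    uint32_t currentHz = 0;
    int transactionsStarted = 0;
    int blocksSent = 0;
    std::function<void()> onDone;    // plays the role of the completion callback

    void beginTransaction(uint32_t hz) {
        assert(!inTransaction);
        inTransaction = true;
        currentHz = hz;
        ++transactionsStarted;
    }
    void endTransaction() {
        assert(inTransaction);
        inTransaction = false;
    }
    void transferAsync(const uint8_t* /*data*/, size_t /*len*/) {
        assert(inTransaction && !pending);
        pending = true;
        ++blocksSent;
    }
    // Simulates the transfer-complete interrupt firing.
    void completePending() {
        if (!pending) return;
        pending = false;
        if (onDone) onDone();
    }
};

struct Block {
    const uint8_t* data;
    size_t len;
    uint32_t hz;   // SPI clock wanted for this block
};

class ChainedSender {
public:
    ChainedSender(MockBus& bus, std::vector<Block> blocks)
        : bus_(bus), blocks_(std::move(blocks)) {}

    void start() {
        next_ = 0;
        bus_.onDone = [this] { sendNext(); };   // retrigger from the callback
        sendNext();
    }

private:
    void sendNext() {
        if (next_ == blocks_.size()) {          // nothing left: release the bus
            if (bus_.inTransaction) bus_.endTransaction();
            return;
        }
        const Block& b = blocks_[next_++];
        // Only pay for end/begin transaction when the speed actually changes.
        if (!bus_.inTransaction || bus_.currentHz != b.hz) {
            if (bus_.inTransaction) bus_.endTransaction();
            bus_.beginTransaction(b.hz);
        }
        bus_.transferAsync(b.data, b.len);
    }

    MockBus& bus_;
    std::vector<Block> blocks_;
    size_t next_ = 0;
};

struct ChainStats { int transactions; int blocks; };

// Sends one 8-byte block per entry in speedsHz and reports what the bus saw.
ChainStats runChain(const std::vector<uint32_t>& speedsHz) {
    static const uint8_t payload[8] = {0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF, 0x00, 0xFF};
    std::vector<Block> blocks;
    for (uint32_t hz : speedsHz) blocks.push_back({payload, sizeof payload, hz});
    MockBus bus;
    ChainedSender sender(bus, blocks);
    sender.start();
    while (bus.pending) bus.completePending();  // stand-in for interrupts firing
    return {bus.transactionsStarted, bus.blocksSent};
}
```

In this model, four blocks sent as two at 2MHz followed by two at 8MHz need only two transactions, while alternating the speed on every block needs four. On real hardware the saving per avoided restart would have to be measured.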
Things to look at:
- There is a lot to save by writing a custom version of bool SPIClass::transfer(const void *buf, void *retbuf, size_t count, EventResponderRef event_responder) (found on line 1199 in SPI.cpp), one that does not use an EventResponder callback and that skips a lot of the checking the library version does.
The best thing would be to make a version that does not require an interrupt for every three bytes sent, and that controls the chip select/sync pin by itself.
Update:
I've managed to do async SPI with 24bit transfers and automatic CS switching. This probably means I can update all 4 channels of the DAC in one go, I just need to make it async. Since I was only sending 3 bytes, DMA transfers were costly and unnecessary.
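To put a number on how short a 3-byte transfer is, here is a small calculation of the time a frame spends on the wire. It assumes the 50MHz DAC bus clock quoted in this post's final timing figures, and it ignores CS gaps and interrupt time. wireTimeUs is a name made up for this sketch.

```cpp
#include <cassert>

// Time on the wire, in microseconds, for `frames` back-to-back frames of
// `bitsPerFrame` bits at an SPI clock of `clockMHz`.
// Ignores chip-select gaps and any time spent in interrupts.
double wireTimeUs(int frames, int bitsPerFrame, double clockMHz) {
    assert(frames >= 0 && bitsPerFrame > 0 && clockMHz > 0.0);
    return (frames * bitsPerFrame) / clockMHz;
}
```

With these assumptions, wireTimeUs(1, 24, 50.0) gives 0.48uS for a single 24bit frame and wireTimeUs(4, 24, 50.0) gives 1.92uS for four. With transfers this short, any fixed per-transfer setup cost is a large share of the total.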
As for Teensy 4.0 vs 4.1: The 4.1 has three CS pins connected to the default SPI port, meaning I can switch between them. 4.0 on the other hand has only one. BUT - If I can manage to turn OFF CS handling after writing to DAC, I can do DCO CS manually. I will only write one 24bit block to DCOs so I would have had to do config changes between sends anyway.
Also, I discovered that the Teensy 4.0 exposes all four SPI1 pins after all (MISO1 and CS1 as pin 0 and 1), so I can use either of the two ports for slave or master functionality.
Update:
The image above shows that background SPI transfers are working as they should, with speed and CS switching. Main loop speed is 20MHz approx.
Updating 4 DAC channels, including time spent in the ISR, takes 3uS. Updating two DCOs takes 13.7uS. In total that is 61.7uS for 64 channels + 2 DCOs (16 DAC updates of 3uS each, plus 13.7uS).
The max update speed is thus 1 000 000 / 61.7 = 16.2kHz when the DAC bus runs at 50MHz.
Reducing the speed to 25MHz makes the total time 93.7uS, and the throughput drops to about 10.7kHz, which is a bit too slow.
Moving the DCO updates to a separate bus would mean DAC transfers at 25MHz would take 80uS, giving us a throughput of 12.5kHz. That would be sufficient.
If we choose to upgrade to an 8 channel DAC running at 50MHz, we would get a total of 53.7uS, for a speed of 18.6kHz.
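The arithmetic behind these figures can be written out as a small C++ helper. The 3uS per 4-channel update and the 13.7uS for the DCOs are the measured values above. The 5uS per update used for the 25MHz case and for the 8-channel DAC is not stated directly: it is inferred here by working backwards from the 93.7uS, 80uS and 53.7uS totals, so treat it as an assumption. The function names are made up for this sketch.

```cpp
#include <cassert>

// Time for one full update cycle, in microseconds.
// groupUs: time to update one DAC's worth of channels, ISR included.
// dcoUs:   time for the DCO updates on the same bus (0 if on a separate bus).
double cycleTimeUs(int totalChannels, int channelsPerGroup,
                   double groupUs, double dcoUs) {
    assert(channelsPerGroup > 0 && totalChannels % channelsPerGroup == 0);
    const int groups = totalChannels / channelsPerGroup;
    return groups * groupUs + dcoUs;
}

// Update rate in kHz for a given cycle time in microseconds.
double updateRateKHz(double cycleUs) {
    assert(cycleUs > 0.0);
    return 1000.0 / cycleUs;
}

// Scenarios from the post:
//   cycleTimeUs(64, 4, 3.0, 13.7)  4-channel DAC at 50MHz (measured)
//   cycleTimeUs(64, 4, 5.0, 13.7)  4-channel DAC at 25MHz (5.0 inferred)
//   cycleTimeUs(64, 4, 5.0, 0.0)   25MHz, DCOs on a separate bus
//   cycleTimeUs(64, 8, 5.0, 13.7)  8-channel DAC at 50MHz (5.0 inferred)
```

Feeding each result into updateRateKHz reproduces the throughput numbers quoted above, which makes it easy to try other combinations, such as an 8 channel DAC with the DCOs on their own bus.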