A to Synth: A closer look at the super saw code

In this post I try to understand most of the super saw code found by the Usual Suspects when reverse engineering the Roland/Toshiba TC170C140 ESP chip. Since I have no real DSP programming experience, it took some time to realize what is actually going on.

Presumably, this is the code for the ESP emulator itself: https://github.com/dsp56300/gearmulator/tree/main/source/ronaldo/esp

Coefficients - floats

The coefficients listed in the Usual Suspects' presentation, when represented as floats, are:

[0, 0.01953125, -0.01953125, 0.06225585, -0.0628662, 0.107421875, -0.10986328125]

If we multiply by 8192 we get

[0, 160, -160, 509.9999, -514.99999, 880, -900]

and if we presume that the strings of nines are the result of a division somewhere, we get

[0, 160, -160, 510, -515, 880, -900]

Pretty neat!

8192 is 2 to the power of 13. The real values may use that scaling, or perhaps 14 or 16 bits. Scaled by 2^14, the coefficients become

[0, 320, -320, 1020, -1030, 1760, -1800]

and scaled by 2^16 they become

[0, 1280, -1280, 4080, -4120, 7040, -7200]
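The scaling above is easy to reproduce. This is just a sketch of the arithmetic; the names FLOAT_COEFFS and scale_coefficients are mine, and rounding to the nearest integer is my assumption about how the strings of nines should be treated.

```python
# Float coefficients as listed in the presentation.
FLOAT_COEFFS = [0, 0.01953125, -0.01953125, 0.06225585, -0.0628662,
                0.107421875, -0.10986328125]

def scale_coefficients(coeffs, bits):
    """Scale each coefficient by 2**bits and round to the nearest integer."""
    return [round(c * (1 << bits)) for c in coeffs]

# Print the candidate integer tables for 13, 14 and 16 bit scaling.
for bits in (13, 14, 16):
    print(bits, scale_coefficients(FLOAT_COEFFS, bits))
```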

Coefficients - integers

The code example lists these integer coefficients

[0, 318, -318, 1020, -1029, 1760, -1800]

Now that's really cool. Those are almost a perfect match for the 14 bit representation of the coefficients.

There are some strange mismatches though - 318, -318 and -1029 instead of 320, -320 and -1030.

I'm not sure WHY this is yet.

ChatGPT suggests:

The small mismatches (318 vs 320 etc) come from:

rounding
deliberate asymmetry to reduce beating regularity
truncation after multiplication
and accumulator overflow behavior

But the mismatches are there in the original integers, not in the calculated values. And if the floats had been derived from those integers, a shared factor of 16384 would have been used for all of them, and 318 / 16384 is not 0.01953125. (Weirdly, even 1020 / 16384 is listed as 0.06225585, when the exact value is 0.062255859375, which rounds to 0.06225586.)

Pitch value

The pitch value is how much we need to increase the saw wave for every sample. We now know that the JP8000 uses a sample rate of 88.2kHz, and 24bit accumulators that have a range of 16,777,216 (bipolar).

The exact pitch range of the JP8000 is not known, but let's for a start consider the MIDI standard.

MIDI note 0 (C-1) has a frequency of approximately 8.18 Hz, while the highest note, MIDI 127 (G9), is around 12,544 Hz.

Running at 88200Hz, every cycle of

8.18Hz is 10782.396 samples long

12544Hz is 7.03125 samples long.

With a 24bit accumulator, each increment will be

for 8.18Hz: 1555.98, i.e. 1556 when rounded

for 12544Hz: 2386092.942, i.e. 2386093 when rounded.

From this, we can assume that, approximately, the

pitch range is 1556 to 2,386,093

Knowing this, the only unknown in the detuning equation is the int24_t detune parameter.
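For reference, the increment calculation can be written out as below. It is a minimal sketch that assumes the 88.2 kHz sample rate and 24 bit accumulator from this post; phase_increment is my own name, not something from the original code.

```python
SAMPLE_RATE = 88_200      # Hz, the sample rate assumed in this post
ACC_RANGE = 1 << 24       # 24 bit accumulator: 16,777,216 steps per cycle

def phase_increment(freq_hz):
    """Accumulator steps to add per sample for a saw at freq_hz."""
    return ACC_RANGE * freq_hz / SAMPLE_RATE

# Samples per cycle and increment for the lowest and highest MIDI notes.
for freq in (8.18, 12544):
    print(freq, SAMPLE_RATE / freq, phase_increment(freq))
```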

Detuning

Given the center oscillator frequency F0,

When using floats, the outer oscillators should have a frequency Fn:

Fn = F0 * (1 + floatCoefficient[n])

When using the integer coefficients, we instead get

Fn = F0 * (1 + integerCoefficient[n] / 2^14)

or, with the division written as a bit shift,

Fn = F0 * (1 + (integerCoefficient[n] >> 14))

Now, doing the bitshift on the coefficient alone would lead to a massive loss of precision (or rather, all the coefficients would become 0, or -1 for the negative ones with an arithmetic shift), so at the very least, we have to do the bitshift after multiplying with F0:

Fn = F0 + ((F0 * integerCoefficient[n]) >> 14)

Ah, this is starting to look like the code, exciting!
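Here is a small sketch of that last form, using the integer table from the code example. The value of f0 is an arbitrary example increment and detuned_increment is my own name. One thing to watch out for: in C (and in Python) + binds tighter than >>, so the shift needs its own parentheses.

```python
DETUNE_TABLE = [0, 318, -318, 1020, -1029, 1760, -1800]  # from the code example

def detuned_increment(f0, coeff):
    """F0 plus the detune offset, shifting only after the multiplication."""
    return f0 + ((f0 * coeff) >> 14)

f0 = 100_000  # arbitrary example increment

# Shifting the coefficient alone throws everything away: positive values
# become 0 and negative ones become -1 (Python's >> is an arithmetic shift).
print([c >> 14 for c in DETUNE_TABLE])

# Without the inner parentheses the whole sum is shifted, which is wrong.
print(f0 + (f0 * 1800) >> 14, detuned_increment(f0, 1800))
```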

Detune amount

There is a third number in the detune calculation. The code calls it "detune" but in reality it's detune amount. It says how much of the detune coefficient to apply.

According to Szabo, pitch * coefficient is the _maximum_ amount to apply. That means the detune amount should be a ratio between 0 and 1. This, of course, is not possible using an integer, without a division following the multiplication.

Let's take a look at the original equation:

int24_t voice_detune = (detune_table[i] * (pitch * detune)) >> 7

We know that detune somehow should give us a scaling between 0 and 1, so let's ignore how we get there for a second and remove it. That leaves us with detune_table[i] * pitch.

We also know that we should divide this result by 2^14, though that is not included in the equation. Let's include it anyway, it has to be there somehow.

A quick check of the multiplication and following bitshift in the detune frequency calculation:

The lowest possible value is (1556 * 318) >> 14, which is 30 (lowest frequency, lowest detune, without detune amount scaling).

The highest possible value is (2386093 * 1800) >> 14, which is 262144 (highest frequency, highest detune, again without detune amount scaling).

The results are within a 24bit int range. However, the intermediate value from the larger multiplication is 4,294,967,400, which is much higher than what can be stored in a 24 bit int. Under normal conditions, this would make everything overflow. To be able to properly store the multiplication result, we need a division, and one that happens before the result is stored. Something strange is clearly going on.

Side note: The max result is even slightly higher than what can be stored in a uint32 (4,294,967,295). At the same time, it's so amazingly close that it's hard to believe that it's just random? And actually, if we go back to the highest frequency, the increment is really 2386092.942. With that fractional value, the product is 4,294,967,295.6 - a mere 0.6 above (and without rounding the increment at all, it is exactly 2^32 = 4,294,967,296, because 12544 * 1800 / 88200 = 256). This is too weird to be a coincidence? Also, it turns out that the max frequency isn't 12544, it's slightly lower. That would keep the product within a uint32 range. Interesting.
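These range checks are quick to verify. The sketch below uses Python's unbounded integers, so nothing overflows and we can look at the true intermediate values; the exact fraction at the end assumes the nominal 12544 Hz and the 88.2 kHz sample rate used in this post.

```python
from fractions import Fraction

UINT32_MAX = (1 << 32) - 1

low = (1556 * 318) >> 14      # lowest pitch, smallest non-zero coefficient
product = 2386093 * 1800      # highest pitch, largest coefficient, before the shift
high = product >> 14

# bit_length() shows how many bits the unshifted product really needs.
print(low, high, product, product.bit_length())

# Unrounded increment for 12544 Hz, times 1800, kept as an exact fraction.
exact = Fraction((1 << 24) * 12544, 88200) * 1800
print(exact, exact == 1 << 32)
```

With the rounded increment the product needs 33 bits; with the unrounded one it lands exactly on 2^32, which is one more than a uint32 can hold.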

Now, let's go back to the equation and reintroduce detune:

int24_t voice_detune = (detune_table[i] * (pitch * detune)) >> 7

pitch * detune, let's call it pitchWithDetuneAmount, will be calculated first. Max pitch is 2386093, so in a signed 24 bit int detune may be at most 3 without overflowing. That doesn't make much sense.

Next, detune_table[i] * pitchWithDetuneAmount is calculated. detune_table[i] is at most 1800, so it will overflow if pitchWithDetuneAmount is larger than 4660.

Finally, everything is divided by 128.

And all of this, with detune at max, should not be larger than 0.10986328125 * pitch.
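The two overflow limits can be checked like this. It is a sketch that assumes plain signed 24 bit storage for every intermediate value (which is exactly the assumption that turns out not to hold); the constant names are mine.

```python
INT24_MAX = (1 << 23) - 1   # 8,388,607, assuming signed 24 bit values

MAX_PITCH = 2386093         # increment for the highest note
MAX_COEFF = 1800            # largest magnitude in detune_table

# Largest detune for which pitch * detune still fits.
max_detune = INT24_MAX // MAX_PITCH

# Largest pitchWithDetuneAmount for which the product with 1800 still fits.
max_pitch_with_detune = INT24_MAX // MAX_COEFF

print(max_detune, max_pitch_with_detune)
```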

In a normal situation, we should divide the detune table value by 2^14, and detune by a maxDetune to make it into a ratio between 0 and 1. It is highly probable that maxDetune would be a power of 2 as well, to make the division a bitshift here too.

Side note: maxDetune needs to be high enough to represent the curve shown by Szabo, at least if there are 128 different values not linearly spaced apart.

But the mystery remains. Where have the remaining divisions gone? We see some division (>>7 is the same as / 128) but that's not enough and it's not in the right place to prevent overflow.

DSP magic

I admit it, I had to ask ChatGPT about this one. At first it was reluctant to admit that there is something that makes division/bitshifts superfluous, but then everything dropped into place!

Enter multiply-high

DSPs have two major tasks - summing and scaling. Summing is +, and scaling is multiplication followed by a division.

In fact, scaling is so important that many fixed-point DSPs do multiplications in a slightly different way. They multiply the two numbers into an accumulator with twice the bit count of the factors, but then only return _the high order_ bits. E.g. if it multiplies, say, two unsigned uint16_t variables a and b, it would store the intermediate result in a uint32_t, but then only return the 16 MSB. This is equal to bit shifting >> 16 or dividing by 65536. If we let a be our signal value, and b a scaling factor, this turns b into a scaling between 0 and 1!

So there you go, we get free bit shifts, invisible in the code.
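To make the multiply-high idea concrete, here is a sketch of the unsigned 16 bit example. It only illustrates the principle; mul_high_u16 is my own name and it is not a model of any particular DSP.

```python
def mul_high_u16(a, b):
    """Multiply two unsigned 16 bit values and keep only the upper 16 bits."""
    assert 0 <= a < (1 << 16) and 0 <= b < (1 << 16)
    wide = a * b          # the full product fits in 32 bits
    return wide >> 16     # same as dividing by 65536 and truncating

signal = 40_000
print(mul_high_u16(signal, 0x8000))   # b is half of full scale
print(mul_high_u16(signal, 0xFFFF))   # b is just below full scale
```

With b at half scale the signal is halved, and with b at its maximum the signal comes back almost unchanged, so b really does behave like a factor between 0 and 1.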

In other words, the code the Usual Suspects are showing is not normal C code, it's DSP code (of course...), meaning the * does not do what it normally does, it also bit shifts.

Implicit bit shifts

Let's go back to the code again and see how this works out

int24_t voice_detune = (detune_table[i] * (pitch * detune)) >> 7

First we do

pitch * detune

which we called pitchWithDetuneAmount earlier.

By making the maximum value of detune equal to the divisor implied by the bitshift included in *, it becomes a ratio between 0 and 1. Hooray!

Then we do

detune_table[i] * pitchWithDetuneAmount

Again, * will introduce bitshifting.

If we go back to the paragraph about multiply-high, an int24_t * int24_t would result in a 48bit intermediate, of which the upper 24 are returned (there may be slight differences when working with signed ints, but the principle is the same).

Shifting >> 24 is fine for detune amount. It would mean we could use the full 24 bits as a factor, getting any detune amount curve we could possibly want.

But shifting detune_table[i] * pitchWithDetuneAmount by 24 is too much. It should only be 14, and even then 7 of those shifts are done outside the parentheses.

Now, it's not important exactly how the bit shifts are done to understand the code. We can just accept that the code

int24_t voice_detune = (detune_table[i] * (pitch * detune)) >> 7

is equal to

pitch * (detune / maxDetune) * detune_table[i] / 2^14

without overflows etc. The exact shifts only matter if we want to run the exact code and use the same range for detune amount.

Different bit shift?

There is however a possibility that the DSP doesn't actually shift by 24. Maybe it shifts by 7? That would make

(detune_table[i] * pitchWithDetuneAmount) >> 7

the same as

(detune_table[i] * pitchWithDetuneAmount) >> 14

in a normal system.

It would mean that detune amount has to be 7 bit if it uses the same multiply-high, leaving 128 steps of detune amount. That does not work well with the Szabo curve, but it is possible. Or maybe the C code is just inaccurately translated from assembly, and the two multiplications use different operations.
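To see whether the shift-by-7 idea gives sensible numbers, here is a sketch where every * in the quoted expression is read as "multiply, then >> 7". That reading is purely my hypothesis at this point, and mul_shift7, voice_detune and voice_detune_reference are names I made up. Python integers do not overflow, so this only checks the scaling, not the bit widths.

```python
def mul_shift7(a, b):
    """Hypothetical ESP-style multiply: full product, then >> 7."""
    return (a * b) >> 7

def voice_detune(pitch, detune, coeff):
    """The quoted expression, with each '*' read as mul_shift7."""
    pitch_with_detune_amount = mul_shift7(pitch, detune)
    return mul_shift7(coeff, pitch_with_detune_amount) >> 7

def voice_detune_reference(pitch, detune, coeff, max_detune=128):
    """The 'normal system' formula, in floating point."""
    return pitch * (detune / max_detune) * coeff / 2**14

pitch, coeff = 100_000, 1800   # arbitrary example pitch, largest coefficient
for detune in (0, 64, 127):
    print(detune, voice_detune(pitch, detune, coeff),
          voice_detune_reference(pitch, detune, coeff))
```

The integer and float results agree to within rounding for a 7 bit detune. Note that at the highest pitch (2386093) with detune at 127, the value after the second multiply is still larger than a signed 24 bit int can hold, so this reading alone does not remove the overflow question.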

The DSP (ESP) emulator code for the JE-8086 is on GitHub, so we can peek at what it actually does. I have not studied it in detail, but there are traces of a configurable bit shift multiplication.

Studying https://github.com/dsp56300/gearmulator/blob/main/source/ronaldo/esp/esp.hpp may shine some light on this.

multResult in esp.hpp shifts the result after multiplication, 5, 6 or 7 places. It uses two bits of the instruction to select the shift, and 0,0 will default to 7 bit shifts. That sounds exactly like what we are looking for.

So, while not proving that 7 is the correct answer here, at least it makes it plausible that the ESP does in fact use a different shift than 24.

UPDATE: The multiplication (kMAC in the code) does two bitshifts. First, it shifts the second term by >> 16, meaning it only uses the 8 MSB. THEN it does a right shift of up to 7 bits. In other words, detune and spread are not 0-127, they are the full 24bit range but only the 8 MSB are used. There is also a double precision multiplication available, so it is possible that it is used for higher resolution.
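As I understand the description above, the single multiplication can be sketched like this. It is written from the description in this post, not copied from esp.hpp, and mac_24x8 is my own name. I am also assuming both operands are signed 24 bit integers.

```python
def mac_24x8(a, b, shift=7):
    """a times the top 8 bits of b, followed by a right shift.

    Python's >> is arithmetic, so for a signed 24 bit b the value
    b >> 16 lands in the range -128..127.
    """
    return (a * (b >> 16)) >> shift

a = 1_000_000
print(mac_24x8(a, 0x400000))   # top byte 0x40 = 64, i.e. 64/128 = 0.5
print(mac_24x8(a, 0x40FFFF))   # the low 16 bits of b are ignored
print(mac_24x8(a, 0x7F0000))   # largest positive top byte, 127/128
```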

Something that may support this theory is that on a slide about the ESP, it says that it has a 24 x 8 bit multiplier. This seems to indicate that it does NOT have a 24 x 24 bit multiplier, and that, combined with signed arithmetic where one bit of the 8 bit variable is used as sign, would make a shift of >> 7 quite plausible and 7 bit (positive value) detune the way to go.

It does however leave a question as to how the multiplication of detune_table and pitch works since both are definitely > 128. Perhaps it does multiple 24 x 8 multiplications?

--> It looks like it is possible to do that. The final bitshift will always be >> 7 which makes the last >> 7 explainable. This would make the detune_table[i] * pitch >> 14 multiplication possible. Detune and spread would still have to be max 127.

Example: 24 × 24 multiply-high using three 24×8 blocks

According to ChatGPT. Test this!

Let the 24-bit multiplier be split into bytes:

B = b2·2^16 + b1·2^8 + b0

You compute:

P0 = (A·b0) >> 7
P1 = (A·b1) >> 7
P2 = (A·b2) >> 7

Then re-align:

Result = P0 + (P1 << 8) + (P2 << 16)

Substituting:

= (A·(b0 + b1·2^8 + b2·2^16)) >> 7
= (A·B) >> 7 (approximately, since each partial product is truncated before it is re-aligned)
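Since this came from ChatGPT, here is a quick way to test it. The sketch compares the three-block version with a plain full-width multiply for non-negative operands; both function names are mine, and signed operands are not covered.

```python
def mul_24x24_via_bytes(a, b):
    """Combine three (a * byte) >> 7 partial products as in the example above."""
    b0, b1, b2 = b & 0xFF, (b >> 8) & 0xFF, (b >> 16) & 0xFF
    p0 = (a * b0) >> 7
    p1 = (a * b1) >> 7
    p2 = (a * b2) >> 7
    return p0 + (p1 << 8) + (p2 << 16)

def mul_24x24_exact(a, b):
    """Reference: full-width product, then >> 7."""
    return (a * b) >> 7

a, b = 2386093, 0x123456   # example operands
approx, exact = mul_24x24_via_bytes(a, b), mul_24x24_exact(a, b)
print(approx, exact, exact - approx)
```

Because each partial product is truncated before it is shifted back up, the lost fractions get multiplied by 2^8 and 2^16. For non-negative inputs the block version is therefore never above the exact result, and it can be below it by at most 65,792. So "approximately" holds, but the error is not just in the last bit.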

Update: Double precision multiplication

The ESP supports "double precision" multiplication. I have not yet fully understood the result, but essentially it does what is suggested above - it first multiplies A with the 8 MSB of B. It then multiplies A >> 7 with (B >> 9) & 0x7F, i.e. bits 15 to 9 (0 indexed) of B:

first time: acc += ((mulInputA_24 * (mulInputB_24 >> 16)) >> shift)

second time: acc += (((mulInputA_24 >> 7) * ((mulInputB_24 >> 9) & 0x7f)) >> shift)

... I am missing something here, TBC.
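One possible reading, and it is only a guess until checked against the emulator: the second pass adds the next 7 bits of B at the same scale as the first, so the sum behaves like A times the top 15 bits of B with 7 extra bits of shift. The sketch below takes the two expressions exactly as quoted and compares their sum with that interpretation, for non-negative operands only. All function names are mine.

```python
def mac_first_pass(a, b, shift=7):
    """First pass: a times the top 8 bits of b."""
    return (a * (b >> 16)) >> shift

def mac_second_pass(a, b, shift=7):
    """Second pass: (a >> 7) times bits 15..9 of b."""
    return ((a >> 7) * ((b >> 9) & 0x7F)) >> shift

def mac_double(a, b, shift=7):
    """Sum of both passes, as accumulated in acc."""
    return mac_first_pass(a, b, shift) + mac_second_pass(a, b, shift)

def mul_24x15(a, b, shift=7):
    """Candidate interpretation: a times the top 15 bits of b."""
    return (a * (b >> 9)) >> (7 + shift)

a, b = 2386093, 0x123456   # example operands
print(mac_first_pass(a, b), mac_double(a, b), mul_24x15(a, b))
```

With shift = 7 and non-negative inputs, the two-pass sum is at most 2 below the 24 x 15 bit product, which would make "double precision" mean 15 significant bits of B instead of 8.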

Spread

There is a third multiplication in the code, saw[i] * spread. Just as with detune, this looked very strange and would lead to an overflow in a normal system. But with multiply-high this too becomes a scaling from 0 to 1, just as we needed. It could, like detune, be a value between 0 and 127 and work fine with >> 7. Again, it's not important to the understanding of the code, we can just accept that it's a scaling factor.

Conclusion

There are some questions left unanswered. Shifting by 7 on * means that detune_table[i] * pitch may still overflow (that 32bit thing above, remember), and the curves of detune and spread can't be explained properly. And finally, as mentioned in the previous post, the summing of the seven saws will overflow.

In general, however, the code looks like it could do exactly what we think. If we were to reimplement it we would just take care of these issues - increasing the sum to 32bit, doing the appropriate bit shifts manually, and selecting whatever resolution for detune and spread that we want.
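As a quick sanity check on the summing issue: if all seven saws sit at positive full scale at the same time (a worst case I am assuming here, before any spread scaling), the sum needs more than 24 bits but fits comfortably in 32.

```python
INT24_MAX = (1 << 23) - 1                        # largest signed 24 bit value

worst_case_sum = 7 * INT24_MAX                   # all seven saws at positive full scale
bits_needed = worst_case_sum.bit_length() + 1    # +1 for the sign bit

print(worst_case_sum, bits_needed)
```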

The only thing I cannot explain at the moment is how Szabo could see an attenuation of the center oscillator when doing mix (spread), as that is not part of the code. Perhaps it is some kind of normalization effect, so that the center oscillator contributes less to the total. Guess that one just has to remain a mystery for the time being.

Tuesday, January 13, 2026