Monday, January 19, 2026

Roland/Toshiba TC170C140 ESP multiplication

The Roland/Toshiba TC170C140 ESP is a 24bit DSP, presumably fixed point. 24 bit fixed point DSPs store numbers as Q1.23, meaning one sign bit and 23 "value bits". The numbers, while strictly large integers, are treated as numbers from -1.0 to +0.9999999.

The ESP has an internal 24 x 8 multiplier, and does multiply-high, meaning it keeps a full multiplication result and then returns the upper 24 bits of the result (excluding any extended sign bits) by shifting the result right once the multiplication is done.

It is common for DSPs to do multiply high, as it makes the second factor behave like it's in the range -1 to 1 (which, I guess, is fixed point). It makes the multiplier a scaler, scaling the multiplicand between plus/minus the original value. 

Multiply high in a 24bit fixed point DSP would multiply two 24bit numbers, then shift >> 23 and store the result as a 24bit integer (keeping the sign bit and 23 MSB of the multiplication).

The fact that the ESP multiplier is 24 x 8 does not mean that the second factor of the multiplication, as seen from the system side, is 8bit. It is actually 24 bits, but only the upper 8 bits (sign + 7 bits) are used during multiplication. The rest of the bits are discarded beforehand using >> 16, to make it fit within an 8 bit variable. 

After the multiplication, the result is shifted a further 7 bits for a total of 23, making it a normal 24bit multiply high. However, the precision of the multiplier is only 7 bits + sign, i.e. only the steps from -128 to +127 are available. This, of course, will often not be good enough.
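To get a feeling for the difference, here is a minimal Python sketch comparing an ideal 24 x 24 multiply-high with the 24 x 8 version described above. The function names and the test values are my own and purely illustrative; the shifts are the ones from the text.

```python
def mul_high_24x24(a, b):
    """Ideal Q1.23 multiply-high: keep the full product, then shift >> 23."""
    return (a * b) >> 23

def mul_24x8(a, b, shift=7):
    """24 x 8 multiply as described above: only the top 8 bits of b are used."""
    return (a * (b >> 16)) >> shift

a = 0x400000   # 0.5 in Q1.23
b = 0x123456   # roughly 0.142 in Q1.23

ideal = mul_high_24x24(a, b)
coarse = mul_24x8(a, b)
print(ideal, coarse, ideal - coarse)
```

The difference between the two results is the contribution of the 16 low bits of b that the 24 x 8 multiply never sees.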

 

Multiplication "by hand"

Let's take a step back and consider how multiplication is done by hand, to get a more intuitive feeling of what precision loss means:

67 * 52

First multiply 6 by 52 and align with the 10-place

6 * 52 = 312

Then multiply by the 7 and align with the 1-place, below 

7 * 52 = 364

Finally, take the sum:

  312_

+ _364  

= 3484

Now let's pretend we're doing a multiply-high in a decimal system. We only have two places available to store the result, so we divide the product by 100 afterwards. The result would be 34.

Then consider a multiplication where the precision of the first factor is a single place only - the resolution is 10s, not 1s. 

In this case, the result of the multiplication would be 3120, and the final result 31. We've lost precision.

This is similar to what happens with the 24 x 8 >> 7 multiplication in the ESP.

But there is something interesting here. The full multiplication is done in two steps, one for the 1's and one for the 10's of the first factor, and the ESP is only doing one of them in its normal multiply and accumulate (MAC).
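The decimal example can be written out in a few lines of Python. This is just the arithmetic from above, with variable names of my own choosing:

```python
a, b = 67, 52

full = (a * b) // 100                     # 3484 -> keep the two upper places
tens_only = ((a // 10) * 10 * b) // 100   # 60 * 52 = 3120 -> reduced precision
ones_part = (a % 10) * b                  # 7 * 52, the partial product that was skipped

print(full, tens_only, ones_part)
```

Adding ones_part back to the tens-only product before dividing gives the full result again, which is the idea behind doing a second multiplication.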


Double precision multiplication

Fortunately, the ESP provides a second multiplication instruction, called  DMAC (Double precision multiply and accumulate). This lets us calculate another part of the multiplication using the bits below the 8 MSB.

To see exactly how this works, let's go back to multiply-high:

With A and B being 24 bit, we want to do A * B >> 23.

But as B cannot be more than sign + 7 bits, we need to divide up the multiplication, just like how we did separate multiplications for 10s and 1s in the decimal example above.

Here are the bits of B grouped in chunks of 7 (X, Y, Z, R) preceded by a sign bit S:

SXXXXXXX YYYYYYYZ ZZZZZZRR 

Now we can write the expression as:

Product P = A * B >> 23 

P = (A * ((B[23:16] << 16) + (B[15:9] << 9) + (B[8:2] << 2) + B[1:0])) >> 23 

 

Substituting the values for the B-bits (b1 keeps the sign bit, since >> on a signed value is an arithmetic shift):

B[23:16] = B >> 16 = b1

B[15:9]  = (B >> 9) & 0x7F = b2

B[8:2]   = (B >> 2) & 0x7F = b3

B[1:0]   = (B & 0x03) = b4

 

P = (A * ((b1 << 16) + (b2 << 9) + (b3 << 2) + b4)) >> 23 

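P = ((A * (b1 << 16)) >> 23) + ((A * (b2 << 9)) >> 23) + ((A * (b3 << 2)) >> 23) + ((A * b4) >> 23)

(Shifting each term separately rounds each one down on its own, so the sum can end up a few counts below the single-shift result.)

and finally

P = ((A * b1) >> 7) + ((A * b2) >> 14) + ((A * b3) >> 21) + ((A * b4) >> 23)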

We now have four separate 24 x 8 bit multiplications.
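Here is a small Python sketch to check the split. It assumes Python's >> on negative integers behaves like the arithmetic shift in the text (it does, it rounds towards minus infinity). The helper names and test values are mine.

```python
def split_b(b):
    """Split a signed 24-bit value into the chunks b1..b4 used above."""
    b1 = b >> 16            # sign + 7 bits (arithmetic shift keeps the sign)
    b2 = (b >> 9) & 0x7F    # next 7 bits
    b3 = (b >> 2) & 0x7F    # next 7 bits
    b4 = b & 0x03           # last 2 bits
    return b1, b2, b3, b4

def partial_products(a, b):
    """The four shifted partial products from the final expression for P."""
    b1, b2, b3, b4 = split_b(b)
    return [(a * b1) >> 7, (a * b2) >> 14, (a * b3) >> 21, (a * b4) >> 23]

a, b = 0x2AAAAA, -0x123456
b1, b2, b3, b4 = split_b(b)
rebuilt = (b1 << 16) + (b2 << 9) + (b3 << 2) + b4   # should give b back

exact = (a * b) >> 23
approx = sum(partial_products(a, b))
print(rebuilt == b, exact, approx)   # approx may be a few counts below exact
```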

The ESP does not have support for doing all the four parts in hardware, but it can do b1 using MAC and then b2 using DMAC. This increases the precision to 15 bits (sign + 2 * 7 bits), at the cost of an additional cycle. 

  

Substituting b1 back to B shows us how:

MAC: 

acc = (A * (B >> 16)) >> 7

During MAC, both A and B are stored in their full 24bit representation so they are available to the next instruction. Then a flag is set that tells the next instruction that the previous one was MAC.

DMAC: 

Now, A and B are still available, and we can substitute b2 to get

acc += (A * ((B >> 9) & 0x7F)) >> 14

We know that the ESP implicitly does a >> 7 after every multiplication*, so we have to keep that separate. That leaves us with 7 more shift rights. We cannot shift b2 7 bits - it is 7 bits only so the result would be 0. Instead, we have to shift A, leaving us with:

acc += ((A >> 7) * ((B >> 9) & 0x7F)) >> 7  

In other words, running MAC and DMAC on A and B means

acc = ((A * (B >> 16)) >> 7) + (((A >> 7) * ((B >> 9) & 0x7F)) >> 7)

The precision has increased from 7 to 14 bits (+sign). It is still not a completely correct 24 x 24 bit multiplication, but it lets us work with a precision that is good enough for rock'n roll.

(* it can do 3,5,6,7 but default is 7)
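To see what the extra DMAC step buys us, here is a hedged Python sketch of the two expressions above. It only models the arithmetic, not the real instruction encoding, flags or accumulator clearing, and the names mac and dmac are simply my labels for the two formulas.

```python
def mac(a, b, shift=7):
    """First step: A times the 8 MSB of B."""
    return (a * (b >> 16)) >> shift

def dmac(a, b, shift=7):
    """Second step: (A >> 7) times the next 7 bits of B."""
    return ((a >> 7) * ((b >> 9) & 0x7F)) >> shift

a, b = 0x2AAAAA, 0x123456

exact = (a * b) >> 23            # ideal 24 x 24 multiply-high
single = mac(a, b)               # MAC only
double = mac(a, b) + dmac(a, b)  # MAC followed by DMAC

print(exact, exact - single, exact - double)
```

For these particular inputs the error drops from a few thousand counts to a few tens, which fits the jump from 7 to 14 bits of multiplier precision.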


PS: the accumulator in the ESP may be cleared between every instruction - the flag that decides this is set either in the instruction itself or in a memory location. For MAC + DMAC to work it has to be set to clear = true before MAC and clear = false before DMAC.

Friday, January 16, 2026

Mix and oscillator amplitudes, more research

I just can't let this one go. The code appears to add a linear amount of the side oscillators to the sum, but Adam Szabo says differently - the center oscillator is attenuated and the outer ones follow a curved gain.

I just had a happy accident. I am trying to find the contribution of each oscillator to the total if one normalizes the sum, i.e. says that the total should always be 1.

However, I only added one single outer oscillator - but the output was very interesting:

Here it is compared to the graph in the article:

 

Here, the value of oscillator 1 is 1 / (1 + mix), whereas the plot for the others is mix / (1 + mix). The plots are quite similar! It really makes me want to understand this even more!
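For reference, the two curves in that accidental plot can be sketched like this. The function name and the sample mix values are mine, and this is only the one-outer-oscillator normalization described above, not a claim about what the JP8000 actually does.

```python
def normalized_gains(mix):
    """Center and outer gain when one outer oscillator is normalized against the center."""
    center = 1 / (1 + mix)
    outer = mix / (1 + mix)
    return center, outer

for mix in (0.0, 0.25, 0.5, 1.0):
    center, outer = normalized_gains(mix)
    print(mix, round(center, 3), round(outer, 3))
```

The center gain falls from 1 to 0.5 while the outer gain rises from 0 to 0.5 along a curve, and the two always add up to 1.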

But - if I assume that ALL 6 outer oscillators should be included in the normalization, everything breaks down, so clearly that's not correct.

Going back to the article, we have this graph (Figure 9):

Note that the amplitudes of 5, 6 and 7 are higher than 1, 2, 3.

In the text, Szabo says that 2, 3, 5, 6 and 7 are removed, leaving 1 and 4, as illustrated in the first graph.

I don't know if he DID measure those too, but there is a chance that they don't follow the exact same curve. We'll see if we can figure that one out.

Also, let's renumber the spikes in the plot to match the order in the detune_table:

[0, 318, -318, 1020, -1029, 1760, -1800], let's call them A-G to keep them separated

That gives us:

A = 4, B = 5, C = 3, D = 6, E = 2, F = 7, G = 1

In case it matters, what is called 1 here is actually the last element added in the summing in the ESP code. 
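The renumbering is just a sort of the detune table by frequency offset. A quick Python check, where the letters A-G follow the table order as above:

```python
detune_table = [0, 318, -318, 1020, -1029, 1760, -1800]
labels = "ABCDEFG"

# Sort table indices from lowest to highest detune, i.e. left to right in the spectrum
order = sorted(range(len(detune_table)), key=lambda i: detune_table[i])

# Spike number 1 is the lowest frequency, 7 the highest
spike_number = {labels[i]: rank + 1 for rank, i in enumerate(order)}
print(spike_number)
```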

 

Now, I'm not entirely sure how to interpret Figure 9 in terms of "max wave amplitude". The spectrum has peaks of a certain width, not just a single frequency, and there are no units on the Y axis. If the scale is linear and one assumes that the amplitude of each wave in the result is actually proportional to the max value of the center oscillator, we get the numbers in the table (each pure frequency is a sine wave, so the highest peak would correspond to the root frequency of the saw waves, wouldn't it?).

It looks like that's what Szabo means that they are, so let's accept that.

If so, the sum of all amplitudes is definitely not 1. Could the perceived total "loudness" be equal if one uses dB instead of a linear scale? The total energy or something?  

And in any case, how does one go from the sum += saw[i] * mix to this thing? 

My own measurements

I wanted to confirm my understanding of Szabo's graph, so I fired up JE-8086 and coded a spectrum analyzer in web audio. I fed the audio from JE-8086 back to the input of my Mac using the virtual microphone/input "VB-cable".

I used a linear Y-axis and a logarithmic X-axis. Then, with detuning at max, for every line on the mix pot (11 in total) I screenshot'ed the spectrum. Finally, I went through every graph, measuring the height in pixels.

Here are the first and last spectrum plots:



The measurements were of course wildly inaccurate, and the spacing between the mix values not quite even as I couldn't see the slider value in the display (and also, the slider resolution wasn't high enough). 

Here are the values:



And, more importantly, the plot of the values relative to max value of the center oscillator:


 



This is indeed very cool. It does confirm most of what Szabo described - the center oscillator is fairly linear and the others are definitely curved, and the center oscillator ends at an amplitude lower than the outer ones. There are some small differences though:

- I don't get the feeling that the outer oscillators actually go DOWN in amplitude at the end. The 7th oscillator appears to go slightly  down again, but I think it's more likely a measuring error. 

- The higher pitch / right hand oscillators have a higher amplitude than the lower ones. This matches what can be seen in Figure 9 in Szabo. The top oscillator is even higher though, and that does not match. I am still not sure if this is an artifact of the spectrum analyzer or if it is real. It could be an artifact of approximate multiplication or running average or something. 

- Something else to note: In both my and Szabo's spectrum analyzers, the minimum amplitudes for the outer oscillators are not zero. If the summing is actually saw[i] * spread, there should be no trace of the oscillator if spread is 0. Very strange. Also - why do they call it spread and not mix?

Tuesday, January 13, 2026

A closer look at the super saw code

In this post I try to understand most of the super saw code found by the Usual Suspects when reverse engineering the Roland/Toshiba TC170C140 ESP chip. Since I have no real DSP programming experience, it took some time to realize what is actually going on.
 
Presumably, this is the code for the ESP emulator itself: https://github.com/dsp56300/gearmulator/tree/main/source/ronaldo/esp 


Coefficients - floats 

The coefficients listed in the usual suspects' presentation, when represented as floats, are:

[0, 0.01953125, -0.01953125, 0.06225585, -0.0628662, 0.107421875, -0.10986328125]

if we multiply by 8192 we get

[0, 160,  -160, 509.9999, -514.99999, 880, -900]

and if we presume that the strings of nines are the result of a division somewhere, we get

[0, 160,  -160, 510, -515, 880, -900]

Pretty neat!

8192 is 2 to the power of 13. The real values may be that, or perhaps either 14 or 16 bits, so either  

[0, 320,  -320, 1020, -1030, 1760, -1800] 

or

[0, 1280, -1280, 4080, -4120, 7040, -7200] 
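The three candidate scalings are easy to reproduce. This Python sketch assumes that rounding to the nearest integer is the right way to get rid of the strings of nines:

```python
floats = [0, 0.01953125, -0.01953125, 0.06225585, -0.0628662,
          0.107421875, -0.10986328125]

# Scale by 2^13, 2^14 and 2^16 and round to the nearest integer
scaled = {bits: [round(f * 2**bits) for f in floats] for bits in (13, 14, 16)}

for bits, values in scaled.items():
    print(bits, values)
```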

 

Coefficients - integers

The code example lists these integer coefficients

[0, 318, -318, 1020, -1029, 1760, -1800]

Now that's really cool. Those are almost a perfect match for the 14 bit representation of the coefficients. 

There are some strange mismatches though - 318, -318 and -1029 instead of 320, -320 and -1030. 

I'm not sure WHY this is yet.

ChatGPT suggests:

The small mismatches (318 vs 320 etc) come from:

  • rounding
  • deliberate asymmetry to reduce beating regularity
  • truncation after multiplication
  • and accumulator overflow behavior

But the mismatches are there in the originals, not the calculated values, and if they had the originals, they would have used a shared factor of 16384 when dividing down to the floats (weirdly, even 1020 / 16384 is listed as 0.06225585, when the real value - without any rounding error - is 0.06225586).
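To make the mismatches concrete, this sketch puts the integer table next to the floats scaled by 16384. The shared factor of 2^14 is the assumption from above, not something confirmed by the presentation.

```python
detune_table = [0, 318, -318, 1020, -1029, 1760, -1800]
floats = [0, 0.01953125, -0.01953125, 0.06225585, -0.0628662,
          0.107421875, -0.10986328125]

# Difference between the float scaled to 14 bits and the integer from the code
diffs = [round(f * 16384) - c for f, c in zip(floats, detune_table)]

for c, f, d in zip(detune_table, floats, diffs):
    print(c, c / 16384, f, d)
```

Only three entries differ, by 2, -2 and -1 counts, and the rest match exactly.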


Pitch value

The pitch value is how much we need to increase the saw wave for every sample. We now know that the JP8000 uses a sample rate of 88.2kHz, and 24bit accumulators that have a range of 16,777,216 (bipolar)

The exact pitch range of the JP8000 is not known, but let's for a start consider the midi standard.

Midi note 0 (C-1) has a frequency of approximately 8.18 Hz, while the highest note, MIDI 127 (G9), is around 12,544 Hz 

Running at 88200Hz, every cycle of

8.18Hz is 10782.396 samples long

12544Hz is 7.03125 samples long.

With a 24bit accumulator, each increment will be

for 8.18Hz:  1555.98, i.e. 1556

for 12544Hz: 2386092.942, i.e. 2386093.

 

From this, we can assume that, approximated

Pitch range is 1556 to 2,386,093 
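The increments can be computed directly. A minimal sketch, assuming the 88.2kHz sample rate and a 24bit phase accumulator as above, and standard 440 Hz equal temperament for the MIDI notes (the function names are mine):

```python
SAMPLE_RATE = 88200
ACC_RANGE = 1 << 24   # 16,777,216 steps in one full saw cycle

def midi_to_hz(note):
    """Equal temperament, A4 (MIDI note 69) = 440 Hz."""
    return 440.0 * 2 ** ((note - 69) / 12)

def pitch_increment(freq_hz):
    """How much to add to the accumulator per sample for a given frequency."""
    return freq_hz * ACC_RANGE / SAMPLE_RATE

low = pitch_increment(8.18)
high = pitch_increment(12544)
print(round(low, 2), round(high, 3), midi_to_hz(0), midi_to_hz(127))
```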

Knowing this, the only unknown in the detuning equation, is the int24_t detune parameter.

 

Detuning

Given the center oscillator frequency F0,

When using floats, the outer oscillators should have a frequency Fn:

Fn = F0 * (1 + floatCoefficient[n])

When using the integer coefficients, we instead get

Fn = F0 * (1 + integerCoefficient[n] / 2^14)

Or 

Fn = F0 * (1 + (integerCoefficient[n] >> 14))

Now, doing the bitshift on the coefficient alone would lead to a massive loss of precision (or rather, all the coefficients would become 0), so at very least, we have to do the bitshift after multiplying with F0:

Fn = F0 + ((F0 * integerCoefficient[n]) >> 14)

Ah, this is starting to look like the code, exciting!  
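A quick Python sketch of that last expression, applied to all seven coefficients. Python integers never overflow, so this only shows the intended result, not what a 24bit register would do. The pitch value of 100000 is an arbitrary example of mine.

```python
DETUNE_TABLE = [0, 318, -318, 1020, -1029, 1760, -1800]

def detuned_increments(pitch):
    """Per-oscillator increment: pitch plus (pitch * coefficient) >> 14."""
    return [pitch + ((pitch * c) >> 14) for c in DETUNE_TABLE]

incs = detuned_increments(100000)
print(incs)
```

Note that >> rounds towards minus infinity, so 318 and -318 do not give exactly mirrored offsets.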

 

Detune amount

There is a third number in the detune calculation. The code calls it "detune" but in reality it's detune amount. It says how much of the detune coefficient to apply. 

According to Szabo, pitch * coefficient is the _maximum_ amount to apply. That means the detune amount should be a ratio between 0 and 1. This, of course, is not possible using an integer, without a division following the multiplication. 

Let's take a look at the original equation:

int24_t voice_detune = (detune_table[i] * (pitch * detune)) >> 7 

We know that detune somehow should give us a scaling between 0 and 1, so let's ignore how we get there for a second and remove it. That leaves us with detune_table[i] * pitch. 

We also know that we should divide this result with 2^14, though that is not included in the equation. Let's include it still, it has to be there somehow.

A quick check of the multiplication and following bitshift in the detune frequency calculation:

The lowest possible value is (1556 * 318) >> 14, which is 30. (lowest frequency, lowest detune without detune amount scaling).

The highest possible value is  (2386093 * 1800) >> 14, which is 262144. (highest frequency, highest detune amount).

The results are within a 24bit int range. However, the intermediate value from the multiplication is 4,294,967,400, which is much higher than what can be stored in a 24 bit int. Under normal conditions, this would make everything overflow. To be able to properly store the multiplication result, we need a division, and one that happens before the result is stored. Something strange is clearly going on.

Side note: The max result is even slightly higher than what can be stored in a uint32 (4,294,967,295). At the same time, it's so amazingly close that it's hard to believe that it's just random? And actually, if we go back to the highest pitch value, it's actually 2386092.942. With that fractional result, the product is 4,294,967,295.6 - a mere 0.6 above. This is too weird to be a coincidence? Also, it turns out that the max frequency isn't 12544, it's slightly lower. That would keep the detune within a uint32 range. Interesting.
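Here is the overflow check written out in Python, where the integers are unbounded so we can look at the intermediate value directly. The last part uses exact fractions and assumes the 12544 Hz / 88200 Hz / 24bit numbers from the pitch section.

```python
from fractions import Fraction

INT24_MAX = (1 << 23) - 1
UINT32_MAX = (1 << 32) - 1

max_pitch, max_coeff = 2386093, 1800
product = max_pitch * max_coeff

print(product, product > INT24_MAX, product - UINT32_MAX, product >> 14)

# Same product using the unrounded increment for 12544 Hz
exact_pitch = Fraction(12544 * (1 << 24), 88200)
exact_product = exact_pitch * max_coeff
print(exact_product, exact_product == 1 << 32)
```

With the unrounded increment the product comes out as exactly 2^32, so the near miss seems to follow from the particular numbers 12544, 88200 and 1800 rather than from anything in the DSP.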


Now, let's go back to the equation and reintroduce detune:

int24_t voice_detune = (detune_table[i] * (pitch * detune)) >> 7

pitch * detune, let's call it pitchWithDetuneAmount will be calculated first. Max pitch is 2386093, so detune may be up to 3 without overflowing. That doesn't make much sense.

Next,  detune_table[i] * pitchWithDetuneAmount is calculated. detune_table[i] is at most 1800, so it will overflow if pitchWithDetuneAmount is larger than 4660.

Finally, everything is divided by 128 

And all of this, with detune at max, should not be larger than 0.10986328125 * pitch.

In a normal situation, we should divide detune table by 2^14, and detune by a maxDetune to make it into a ratio between 0 and 1. It is highly probable that maxDetune would be a power of 2 as well, to make division a bitshift here too.  

Side note: maxDetune needs to be high enough to represent the curve shown by Szabo, at least if there are 128 different values not linearly spaced apart.

But the mystery remains. Where have the remaining divisions gone? We see some division (>>7 is the same as / 128) but that's not enough and it's not in the right place to prevent overflow.

 

DSP magic

I admit it, I had to ask ChatGPT about this one. At first it was reluctant to admit that there is something that makes division/bitshifts superfluous, but then everything dropped into place!

Enter multiply-high

DSPs have two major tasks - summing and scaling. Summing is +, and scaling is multiplication followed by a division. 

In fact, scaling is so important that many fixed point DSPs do multiplications in a slightly different way. They multiply the two numbers into an accumulator with twice the bit count of the factors, but then only return _the high order_ bits. E.g. if one multiplies, say, two unsigned uint16_t variables a and b, it would store the intermediate result in a uint32_t, but then only return the 16 MSB. This is equal to bit shifting >> 16 or dividing by 65536. If we let a be our signal value, and b a scaling factor, this turns b into a scaling between 0 and 1!

So there you go, we get free bit shifts, invisible in the code.
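The uint16 example looks like this in Python. The function name and the signal value are mine; the masking just mimics a 32bit intermediate.

```python
def mul_high_u16(a, b):
    """Multiply two unsigned 16-bit values and return the upper 16 bits of the product."""
    product = (a * b) & 0xFFFFFFFF   # 32-bit intermediate
    return product >> 16

signal = 40000
half = mul_high_u16(signal, 0x8000)    # b = 0.5 of full scale
almost = mul_high_u16(signal, 0xFFFF)  # b just below 1.0
nothing = mul_high_u16(signal, 0)      # b = 0

print(half, almost, nothing)
```

The scaling factor never quite reaches 1.0, so the largest b returns one count less than the input.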

In other words, the code the Usual Suspects are showing is not normal C code, it's DSP code (of course...), meaning the * does not do what it normally does, it also bit shifts.

 

Implicit bit shifts 

Let's go back to the code again and see how this works out

int24_t voice_detune = (detune_table[i] * (pitch * detune)) >> 7

 

First we do 

pitch * detune

which we called pitchWithDetuneAmount earlier.

By making detune max equal to the bitshift included in *, it becomes a ratio between 0 and 1. Hooray! 

Then we do 

detune_table[i] * pitchWithDetuneAmount

Again, * will introduce bitshifting.


If we go back to the paragraph about multiply-high, an int24_t * int24_t would result in a 48bit intermediate, of which the upper 24 are returned (there may be slight differences when working with signed ints but the principle is the same).

Shifting >> 24 is fine for detune amount. It would mean we could use the full 24 bits as a factor, getting any detune amount curve we could possibly want.

But shifting detune_table * detuneAmount by 24 is too much. It should only be 14, and even then 7 of the shifts are done outside the parenthesis.

Now, it's not important exactly how the bit shifts are done to understand the code. We can just accept that the code

int24_t voice_detune = (detune_table[i] * (pitch * detune)) >> 7

is equal to

pitch * (detune / maxDetune)* detune_table[i] / 2^14

without overflows etc. It is only important if we want to run the exact code and use the same range for detune amount.


Different bit shift?

There is however a possibility that the DSP doesn't actually shift by 24. Maybe it shifts by 7? That would make  

(detune_table[i] * pitchWithDetuneAmount ) >> 7

the same as

detune_table[i] * pitchWithDetuneAmount >> 14 

in a normal system.
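A small sanity check of that claim in Python, under the assumption that the DSP's * really does include a >> 7. The function dsp_mul is my stand-in for that operator, not anything from the emulator.

```python
def dsp_mul(a, b):
    """Hypothetical DSP multiply: the product with an implicit >> 7."""
    return (a * b) >> 7

cases = [(1800, 100000), (-1029, 100000), (318, 1556), (-1800, 2386093)]

for coeff, pitch in cases:
    dsp_style = dsp_mul(coeff, pitch) >> 7   # what the code writes
    normal = (coeff * pitch) >> 14           # what a normal system would need
    print(coeff, pitch, dsp_style, normal, dsp_style == normal)
```

Two floor-shifts of 7 always equal one shift of 14, so the two forms agree for negative coefficients as well.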

It would mean that detune amount has to be 7 bit if it uses the same multiplication-high, leaving 128 steps of detune amount. That does not work well with the Szabo curve, but it is possible. Or maybe the C code is just inaccurately translated from assembly and that they use different multiplication operations.


The DSP (ESP) emulator code for the JE-8086 is on github, so we can peek at what it actually does. I have not studied it in detail, but there are traces of a configurable bit shift multiplication.

Studying https://github.com/dsp56300/gearmulator/blob/main/source/ronaldo/esp/esp.hpp may shine some light on this.

multResult in esp.hpp shifts the result after multiplication, 5, 6 or 7 places. It uses two bits of the instruction to select the shift, and 0,0 will default to 7 bit shifts. That sounds exactly like what we are looking for.

So, while not proving that 7 is the correct answer here, at least it makes it plausible that the ESP does in fact use a different shift than 24. 

UPDATE: The multiplication (kMAC in the code) does two bitshifts. First, it shifts the second term by >> 16, meaning it only uses the 8MSB. THEN it does an up-to 7 bit right shift. In other words, detune and spread are not 0-128, they are the full 24bit range but only the 8MSB are used. There is also a double precision multiplication available so it is possible that is used for higher resolution

Something that may support this theory is that on a slide about the ESP, it says that it has a 24 x 8 bit multiplier. This seems to indicate that it does NOT have a 24 x 24 bit multiplier, and that, combined with signed arithmetic where one bit of the 8 bit variable is used as sign, would make a shift of >> 7 quite plausible and 7 bit (positive value) detune the way to go. 

It does however leave a question as to how the multiplication of detune_table and pitch works since both are definitely > 128. Perhaps it does multiple 24 x 8 multiplications?

--> It looks like it is possible to do that. The final bitshift will always be >> 7 which makes the last >> 7 explainable. This would make the detune_table[i] * pitch >> 14 multiplication possible. Detune and spread would still have to be max 127.

Example: 24 × 24 multiply-high using three 24×8 blocks

According to ChatGPT. Test this!

Let the 24-bit multiplier be split into bytes:

B = b2·2^16 + b1·2^8 + b0

You compute:

P0 = (A·b0) >> 7

P1 = (A·b1) >> 7

P2 = (A·b2) >> 7

Then re-align:

Result = P0 + (P1 << 8) + (P2 << 16)

Substituting:

= (A·(b0 + b1·2^8 + b2·2^16)) >> 7

= (A·B) >> 7 (approximately)
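Since this needs testing, here is a first Python attempt. It assumes A and B are non-negative and treats the three bytes as unsigned, which a signed 8bit multiplier input would not actually allow, so it only checks the arithmetic of the suggestion and says nothing about the ESP itself. All names are mine.

```python
def three_block(a, b):
    """Three 24 x 8 products with >> 7 each, re-aligned as suggested above."""
    b0, b1, b2 = b & 0xFF, (b >> 8) & 0xFF, (b >> 16) & 0xFF
    p0, p1, p2 = (a * b0) >> 7, (a * b1) >> 7, (a * b2) >> 7
    return p0 + (p1 << 8) + (p2 << 16)

a, b = 0x2AAAAA, 0x123456

approx = three_block(a, b)
exact = (a * b) >> 7

# The result is (A*B) >> 7, so another >> 16 is needed to get a multiply-high
print(exact - approx, (a * b) >> 23, approx >> 16)
```

The error comes from the bits dropped in P1 and P2 before they are shifted back up, so it stays below 2^16 + 2^8 + 1 and mostly disappears after the final >> 16.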

Update: Double precision multiplication

The ESP supports "double precision" multiplication. I have not yet fully understood the result but essentially it does exactly what is suggested above - it first multiplies A with the 8MSB of B. It then multiplies A >> 7 with (B >> 9) & 0x7F, i.e. bits 15 to 9 (0 indexed) of B

first time: acc += ((mulInputA_24 * (mulInputB_24 >> 16)) >> shift)        

second time: acc += (((mulInputA_24 >> 7) * ((mulInputB_24 >> 9) & 0x7f)) >> shift)

... I am missing something here, TBC. 

Spread 

There is a third multiplication in the code, saw[i] * spread. Just as with detune, this looked very strange and would lead to an overflow in a normal system. But with multiply-high this too becomes a scaling from 0 to 1, just as we needed. It could, as detune, be a value between 0 and 127 and work fine with >> 7. Again, it's not important to the understanding of the code, we can just accept that it's a scaling factor.

 

Conclusion

There are some questions left unanswered. Shifting by 7 on * means that detune_table[i] * pitch may still overflow (that 32bit thing above, remember), and the curves of detune and spread can't be explained properly. And finally, as mentioned in the previous post, summing of the seven saws will overflow.

In general, however, the code looks like it could do exactly what we think. If we were to reimplement it we would just take care of these issues - increasing sum to 32bit and doing the appropriate bit shifts manually, and selecting whatever resolution for detune and spread that we want. 

The only thing I cannot explain at the moment is how Szabo could see an attenuation of the center oscillator when doing mix (spread), as that is not part of the code. Perhaps it is some kind of normalization effect, that the center oscillator contributes less to the total. Guess that one just has to remain a mystery for the time being. 

Thursday, January 8, 2026

The super saw code from the Usual Suspects

At the 39C3 conference, the Usual Suspects talked about how they reverse engineered the Toshiba DSP chip from the JP80x0. In itself an incredible feat, and a super exciting talk, but the one thing that REALLY caught my interest, was what they claim to be the code for the original super saw.

https://www.youtube.com/watch?v=XM_q5T7wTpQ&t=1804s 

They describe it as simply 7 saw waves, high pass filtered, with detuning, running at 88.2kHz to prevent aliasing within the audible range (?).

The code even shows the detuning coefficients, and they state that it's integer maths, which makes the coefficients a bit hard to get right.

 

Now, of course I had to see if I understand the code. Here is a screenshot:

 

Not much. I assume next is run once per DAC update, i.e. 88200 times per second.

The saw oscillators are simply 24bit signed integers used as accumulators. For every round, "pitch" is added to the accumulator. Once the value passes the max value that can be stored in a 24bit int, it overflows and wraps around to the most negative value. This way, by continuously adding to the accumulator, we end up with a saw wave. 

Oh, btw - looking at the code, the initialization of the array is a bit strange. This being a global array, it should automatically be initialized to all 0s. {0} explicitly sets the first element to 0, why is that needed?

Let's for a start ignore detuning. If we set detune to 0, the whole voice_detune parameter goes away and saw[i] is just incremented by pitch for every cycle.

Also, let's set spread to 1, so all oscillators have the same amplitude.

 

Assuming the saw waves are in perfect phase, summing them would give us a saw wave that increases 7 times faster than a single wave. But then there is something weird. 

sum is also defined as an int24. My only way of understanding this is that it will overflow too, just like the saw wave accumulators. And that would lead to a saw wave with the same amplitude as the individual waves, but with a frequency seven times higher!

Let's reduce the number of oscillators to 2 and introduce a phase shift of 25%. Without overflowing, this would lead to some tops higher than max, some lower than min and some cycles where the amplitude is less than min and max. But with overflowing, the parts above and below max/min fill in the gaps, and once again we're back to having a single waveform with a 2x frequency but the same amplitude:

Red horizontal lines are where the sum accumulator overflows.

 Now, the PHASE of the output wave is different from the initial wave. 

Here is a way of thinking about this. 

For every step, each saw wave contributes "pitch" to the sum. The saw waves wrap, but the rest of pitch will be added to the bottom. This is similar to having a single saw wave with 2*pitch increase for every step.

Now consider different pitch values for the two saw waves (=different frequencies). Each wave still contributes its pitch to the sum, creating a single saw wave with pitch equal to the sum of the two other saw waves.

This extends to the rest of the saw waves, adding another saw wave just adds its pitch to the sum. In the end, the seven waves end up as a single wave with its pitch being the sum of all the pitches. 
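This is easy to verify with a small simulation. The Python sketch below assumes all saws start at 0, that the sum wraps as a signed 24bit value, and that spread is 1. The pitch values are arbitrary examples of mine.

```python
def wrap24(x):
    """Wrap an integer to the signed 24-bit range, like a 24-bit register overflowing."""
    x &= 0xFFFFFF
    return x - 0x1000000 if x & 0x800000 else x

def summed_saws(pitches, steps):
    """Run one accumulator per pitch and sum them into a wrapping 24-bit sum."""
    saws = [0] * len(pitches)
    out = []
    for _ in range(steps):
        saws = [wrap24(s + p) for s, p in zip(saws, pitches)]
        out.append(wrap24(sum(saws)))
    return out

def single_saw(pitch, steps):
    """One accumulator that is incremented by pitch every step."""
    acc, out = 0, []
    for _ in range(steps):
        acc = wrap24(acc + pitch)
        out.append(acc)
    return out

pitches = [100000, 101940, 98059]
same = summed_saws(pitches, 1000) == single_saw(sum(pitches), 1000)
print(same)
```

With different start phases the summed wave would only get a constant offset, which wraps in the same way.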

Here is an example. The grey line is all saws, with slight detuning, summed up without overflow (and plotted in a chart where y is at most 8 times that of a single saw wave). The blue line is the same waves summed with overflowing.

The horizontal lines divide the range into 8 parts, each corresponding to "one overflow".

 

 

If you look carefully, you can see that at every discontinuity, the part protruding above a grey line, is exactly the same as the part missing from the bottom and down to the previous grey line. When using overflow (or modulo), the top will wrap and be added to the bottom. Any of the divides that are empty, simply goes away in the wrapping, and we end up with the blue line.

 

Ok, that was a convoluted way of saying -  I don't understand how the sum code is supposed to work. Saw waves of any frequency will always combine to a single saw wave of higher frequency if the sum also overflows. As neither the frequency nor the detune of a wave changes, the sum wave will stay unchanged.

If sum was a 32bit int, this would work fine and we would get an ever changing combination of the waves. 

Detune and detune coefficients

Now, as for the other parts of the code, they have me confused as well, but maybe they use overflow as part of a trick? 

In other parts of the presentation, a comparison between Adam Szabo's coefficients and the "real" ones is done. The coefficients are fractions, small ones too. To calculate a detune frequency, one uses 

basefrequency * (1 + coefficient)

or 

basefrequency + basefrequency * coefficient.

In the code above, 

saw[i] = pitch + voice_detune

or 

saw[i] = pitch + ( detune_table[i] * ( pitch *  detune )) >> 7

Now, I presume the parentheses are placed the way they are for a reason, perhaps the parts inside the parentheses overflow in a certain way that makes things work out, but substituting /128 for >> 7 and reordering gives us

saw[i] = pitch * (1 + detune_table[i] * detune / 128) 

The lowest coefficients are 128, and the lowest integer value for detune is 1. Following that, we end up with 

saw[i] = pitch * 2

This is clearly wrong. Perhaps the overflow inside the parentheses, and the values chosen for detune, will lead to something that, when shifted right by 7, is always much less than (and proportional to) pitch?

As for the coefficients themselves, the individual proportions are not the same as for the fractional coefficients, so something strange is going on there as well.


Spread

Finally, we have "spread"

The outer saw waves are multiplied by spread before adding them to the sum. Presumably, this is the same as "mix" on the JP8000. 

But again, being integers, spread can only INCREASE the amplitude of the saw wave (or perhaps rather the pitch, since the product of saw[i] * spread will overflow). 

In Adam Szabo's paper, the center oscillator amount is reduced linearly, while the outer oscillators are increased by a curve:
 

Perhaps there is some kind of normalization going on, where, by increasing the outer oscillators, the relative contribution from the center one is decreased? 

Again, I'm confused. 

I have a feeling that at least one trick is used here. Since division is probably extremely expensive on a DSP which is built for Multiply and Accumulate, perhaps one instead uses multiply + overflow? (bitshift >> 7 is used to divide by 128, but this only works for powers of two).

 

I really wish someone could confirm a couple of things.

First of all, is the code completely correct - while I don't understand it at the moment, at least that would give me more confidence in looking for the solution

and

confirmation that the output of this code is indeed "samples" that, after filtering, will be output to a DAC (or the next DSP in the case of the JP8000).


My analog super saw

Years ago I built an analog 7 saw oscillator with control circuitry that emulated the control curves seen in Adam Szabo's paper. I had the curve for the detune pot using a three leg approximation, and something that looked close to the mix curves. 

I actually built the whole thing before I realised

1) Mixing the saw waves would lead to clipping if the headroom was not high enough and

2) Part of what makes the supersaw sound the way it does, is that it's digital (D'oh).

 

Adam wrote a few things too, in the paper or on a forum, I can't quite remember. Quoted from memory:

- The naive approach of generating multiple saw waves would not work as the JP8000 was not powerful enough

- He had discovered some kind of trick that Roland would not tell the world about.

Not sure if that was all smoke and mirrors, but I was hoping that this last trick was somehow related to how to prevent the overflow while still keeping gain high.

 

Oh well. Time to go to bed.  

Saturday, January 3, 2026

Power-up sequencing

As there is a significant amount of power needed for the synth, it may be a good idea to not turn on everything at once.

Fortunately, the DC-to-DC converters have CTRL-pins on them. Setting this to GND disables the output, and setting a logic high turns on power.

I have not found the exact voltage needed for logic high on the converters I use (YLPTEC URA2412YMD-20WR3), but elsewhere I've found 3.5V for similar converters. Testing with direct connection to the 3v3 output pins on the Teensy was successful, but I'm still a bit skeptical about doing this "in production", in case it doesn't work.

The solution is to level shift the signals. At the same time, I don't want to invert the signal, so a simple transistor level shifter would require multiple parts.

My solution: Use TTL-compatible CMOS, the 74HCT series. By connecting the supply to 5V, the output logic will be 5V. At the same time, the input is 3v3 compatible. I've tested this with a 74HCT08 quad AND gate and it works perfectly.

Adding 4.7k resistors to the output makes sure the output stays at GND as long as the power is off (? These are not tristate outputs, so I'm not sure this is a good or even needed solution. Perhaps a pulldown on the input instead, or in addition, is better.) 

 I've ordered two CMOS chips for testing:

74HCT164, an octal output shift register - this lets me control all voice card power using two pins of the microcontroller. There is no inhibit though, so the outputs have to be turned on in sequence

CD4504 - hex level shifter, with dual supply pins. 

We'll see what feels best later. I guess a power distribution board with some logic on it will be worth considering.