10

There's this old piece of software, still present in today's Windows, called SAM that enabled the C64 to speak. I remember that we could choose either screen on with worse speech quality, or screen off with better quality.

Why did it work that way? Did SAM use the "varying SID volume = 4-bit digital audio" trick, so that blanking the screen allowed a better sample rate for phonemes?

user100858
  • 203
  • 1
  • 5
  • 1
    As in other similar questions, this is almost certainly down to interrupt conflicts. Blank the screen, no screen interrupts, more time for audio output interrupts. – Chenmunka Oct 31 '17 at 08:46
  • 4
    Whatever Windows does that the C64 also did, it is not because it shares any software whatsoever. Software voice synthesis was produced by a bunch of companies (on the C64 see also e.g. Superior Software's Speech!) because it's a fairly simple idea. The modern stuff is unrelated to the old other than in concept. – Tommy Oct 31 '17 at 10:45
  • 1
    Just to be clear, SAM (Software Automatic Mouth) was not a C64 support program running on an msWindows machine. It was a speech-with-visual-mouth program available on several machines, developed in the early 1980s. I initially interpreted your question the first way. – RichF Oct 31 '17 at 11:58
  • @Tommy - seeing how C64 SAM, Amiga's narrator.library and Windows' SAM share the same phoneme mnemonics it's highly unlikely they aren't related. But then again - who knows... – user100858 Oct 31 '17 at 13:17
  • @user100858 per Andy Hertzfeld, who was on the original Macintosh team, the SAM author visited on account of his audio expertise and the team's awareness of the Apple II version of SAM; Jobs insisted the thing be put into his unveiling demo, and it worked so well it ended up in the OS. The same authors also licensed their work to Commodore. But it's much more likely that Microsoft just hewed towards the de facto standard, as on http://www.text2speech.com/#aboutsv the authors mention the Apple and Commodore deals, but then try to sell you their Windows solution (albeit that the site is ancient). – Tommy Oct 31 '17 at 13:29
  • 3
    @Tommy: True. While working at Apple, I purchased a SAM for my Apple II, and demonstrated it to the Mac team, as well as taking it with me to Hi-Toro/Amiga. Both those teams subsequently contracted with the SAM developers to do a 68000 port. – hotpaw2 Oct 31 '17 at 15:54
  • It was probably using Pulse Width Modulation as many computers did not have volume control. – Stavr00 Nov 09 '17 at 15:06
  • 1
    @user100858: Isn't that just ARPABET? – Ilmari Karonen Nov 23 '17 at 01:25
  • Aaaaaah! Indeed! – user100858 Nov 23 '17 at 09:00
  • @Stavr00: I don't think SAM was supported on systems that didn't have some form of DAC. There was an Apple II version, but it wouldn't do anything useful on systems which didn't have some kind of DAC card installed. – supercat Dec 17 '18 at 19:33

3 Answers

20

How SAM works

SAM was written to be usable on many different computers. So instead of using the SID chip in the usual way, the CPU has to do the work of sampling the phonemes itself. The SID could have taken a lot of that on, since it has its own oscillators, waveform generators, ADSR volume control, and all that. This would all have been very useful in speech synthesis, but none of it is used, because the program was written to be portable.

For most phonemes, the program runs a tight loop which takes two sine waves and one rectangular function, each at a different frequency and each scaled by an amplitude, adds these three together, and stores the result (divided by a constant) in the SID's master volume register. That's good for sounding out continuants¹. For other phonemes, say the sibilants², random data and other tables are used instead. And for plosives³, I'd wager the same thing was done, but with the amplitude simply altered on each loop iteration. This arrangement, with all the looping, scaling, and adding, takes more than a few cycles on the lowly 6502, as you can imagine. And it is very sensitive to timing, because even a few stray cycles here and there will change the resulting sound wave quite drastically.
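The mixing scheme described above can be sketched in Python. This is a rough model only, not the actual hand-timed 6502 code; the sample rate, formant frequencies, and amplitudes here are made-up illustrative values:

```python
import math

SAMPLE_RATE = 8000  # assumed output rate; the real rate falls out of the loop timing

def mix_sample(t, f1, a1, f2, a2, f3, a3):
    """Two sine waves plus one rectangular function, each scaled by its own
    amplitude, summed and reduced to a 4-bit value: the number a SAM-style
    player would store in the SID master volume register ($D418)."""
    s1 = a1 * math.sin(2 * math.pi * f1 * t / SAMPLE_RATE)
    s2 = a2 * math.sin(2 * math.pi * f2 * t / SAMPLE_RATE)
    # rectangular function: +a3 during the first half of each period, -a3 after
    s3 = a3 if math.sin(2 * math.pi * f3 * t / SAMPLE_RATE) >= 0 else -a3
    total = s1 + s2 + s3
    # divide by a constant and offset into the register's 0..15 range
    value = int(total / (a1 + a2 + a3) * 7.5 + 7.5)
    return max(0, min(15, value))

# one short burst of a vowel-like continuant (frequencies chosen arbitrarily)
samples = [mix_sample(t, 700, 1.0, 1200, 0.5, 120, 0.3) for t in range(64)]
```

On the real machine each of those samples is produced by straight-line 6502 code whose cycle count must be constant, which is exactly why stolen cycles matter.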

Fine, but what's it got to do with blanking the screen?

On the C64, if the screen is being displayed⁴, then every so often the CPU gets halted for some time, to allow the video chip to read the memory faster. Between 40 and 62 cycles (depending on various details) are taken from the CPU in this way every 8 scanlines (again, depending on various details). These scanlines are called badlines, and under normal operation there are either 24 or 25 of them per frame. That means the CPU loses 960-1550 cycles every frame.

The CPU on a European C64 ran at 985 kHz, so 1000 machine cycles is about 1.015 milliseconds. That's roughly the amount of time, per frame, during which the CPU may not run if it is stopped by the graphics chip, which means those loops I talked about do not run during that time.
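The arithmetic is easy to check (a quick sketch using the PAL figures above; the per-badline count is taken at the low end of the 40-62 range):

```python
CPU_HZ = 985_000            # PAL C64 CPU clock, ~985 kHz
BADLINES_PER_FRAME = 25     # 24 or 25 under normal operation
CYCLES_PER_BADLINE = 40     # 40-62 depending on sprites and other details

cycles_lost = BADLINES_PER_FRAME * CYCLES_PER_BADLINE  # cycles stolen per frame
ms_lost = cycles_lost / CPU_HZ * 1000                  # time stolen per frame, ms

print(cycles_lost, round(ms_lost, 3))  # 1000 1.015
```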

When it comes to normal human speech, a phoneme lasts only a very short time; one can easily last less than, say, fifty milliseconds. Losing even 40 machine cycles puts a considerable dent in that timeslice, so they made sure it couldn't happen: no text or graphics displayed means no badlines, and no badlines means no stolen cycles.


1: continuants are vowels and consonants which can be elongated, like m, i, s.

2: sibilants (a subclass of continuants) are consonants like s, z, f, th etc. which work by constricting airflow so that it kind of hisses in the mouth.

3: stops (aka plosives) are consonants that involve blocking and then releasing airflow, like p, k, g. These sound very different from start to finish, and cannot be pronounced for a long time.

4: The VIC-II has a bit in register $d011, which blanks the display. It disables all badlines too.

Omar and Lorraine
  • 38,883
  • 14
  • 134
  • 274
  • you're probably correct though (and hence got my vote). The Atari version of SAM also blanked the display, it seems. Almost certainly it's because there's a tight CPU loop repeatedly setting a volume level in order to effect PCM audio, and therefore any CPU timing irregularities would cause audio output slurring. – Tommy Oct 31 '17 at 13:57
  • I wonder if the C64 version could be adapted to play cleanly, at least when certain sprites are disabled, without sounding bad? If one normalized the sample rate to 126 or 130 cycles and set up a CIA interrupt that would fire just after the end of a badline, that should give a stable timebase for audio output. The interrupt would take 36 cycles per sample to play from a rolling 256-byte buffer, but buffering would allow samples to be generated in groups of five at a cost of ~300 cycles/group. – supercat Mar 12 '19 at 17:34
  • @supercat Interesting idea. I don't see any strong evidence why your idea wouldn't work. But if you wanted to go to a great effort to adapt SAM to the C64, it would probably be better to actually use the sound chip, like modern day demos seem to do. I wonder if anyone back in the day had the chutzpah to do that. It would take much less CPU time and memory, leaving these free for useful work. – Omar and Lorraine Mar 12 '19 at 20:56
  • 1
    @Wilson: Looking at the snippet of C64 code shown, there are a lot of load-modify-store sequences for each sample. If running code in zero page, the LMS sequences for five samples could be replaced by `lda phase1a / adc freq1 / sta phase1b / adc freq1 / sta phase1c`, etc., with the `phase` values being the LSBs of addresses stored within instructions that use wavetables: `ldy wavetable1 / lda (mulTable1),y / ldy wavetable2 / adc (mulTable2),y / ldy waveTable3 / adc (mulTable3),y`. Cute, eh? – supercat Mar 12 '19 at 21:01
  • @supercat "running code in zeropage"? – Omar and Lorraine Mar 13 '19 at 08:52
  • @Wilson: Using a piece of self-modifying code stored within the first 256 bytes of memory. The lda wavetable1 etc. instruction would need to have the LSB of the address patched with the phase value. If the instruction is stored in zero page, that would take 3 cycles. If the instruction were stored elsewhere, it would take four. – supercat Mar 13 '19 at 14:03
  • @Wilson: Since all interrupts other than the audio output interrupt would need to be disabled, it should be safe to copy the contents of zero page elsewhere, then copy the routine into zero page, perform the speech, and then restore the state of zero page after speech is complete. – supercat Mar 13 '19 at 14:05
  • @Wilson: The SAM code loads a computed product value, then stores it, loads another computed product value, adds that to the previously-stored one, stores that, loads a third computed product value, and adds that to the previously stored one, so it spends twelve cycles storing and reloading computed products. Further, each phase value would need to be loaded and stored once for each cycle. Using self-patching code in zero page, the stores and re-loads of the product values are eliminated, as are most of the phase loads. One slight quirk of this approach is that for best performance... – supercat Mar 13 '19 at 14:09
  • ...the wave tables should be designed to repeat every 255 cycles rather than 256. This would avoid the need for a clc after each addition; I don't think the audio disruption should be noticeable, but I'm not sure. For a different project, I wrote a four-voice digitized music player whose main output code took 46 cycles to generate and store a pair of four-bit samples generated from wave tables, but that was designed to produce 61 specific pitches in two specific amplitudes. – supercat Mar 13 '19 at 14:17
  • @supercat you may be on to something there. I'd be concerned about the cost of shifting stuff in and out of zeropage though. What if there's an NMI. Then portability becomes a bigger concern. – Omar and Lorraine Mar 14 '19 at 09:25
  • @Wilson: The simplest solution to the NMI problem is to exploit the fact that one of the CIA chips is wired to the NMI. The Restore key might slightly disrupt audio playback, but that would be it. Otherwise, the cost of shifting stuff into and out of zero page would be borne at the starts and ends of utterances. – supercat Mar 14 '19 at 14:45
  • @Wilson: Does the C64 allow rapid enough modulation of the wave generators' amplitudes to make speech workable? – supercat Mar 14 '19 at 16:32
  • @supercat you mean in the SID? – Omar and Lorraine Mar 14 '19 at 16:34
  • @Wilson: Yeah. I meant the built-in sound chip (SID). – supercat Mar 14 '19 at 16:35
  • @supercat I think so. There are demos now which I believe do speech synthesis in that way. For example https://www.youtube.com/watch?v=ENsBJ19YbbI – Omar and Lorraine Mar 14 '19 at 16:38
  • @Wilson: Interesting demo. The articulation doesn't generally seem as clean as SAM, however, so it's not clear that relying upon the sound chip wouldn't require a sacrifice in the quality of at least some allophones. – supercat Mar 15 '19 at 14:28
  • @supercat: The C64 demo "Hardware Accelerated Samples: My Humps" does playback samples using the SID waveforms: https://csdb.dk/release/?id=133968 – DisplayName Mar 31 '19 at 19:13
11

SAM ran on several systems, including the C64, the Atari 400/800, and, via a 6-bit DAC board that went in one of the peripheral slots, the Apple II. The voice synthesis algorithms were later ported to 68000 for the Mac and the Amiga.

Timing jitter in the sample clock of an audio DAC causes distortion (unwanted phase and/or frequency modulation). On 6502 systems, the sample timing for SAM's audio output to the DAC was done in software, using code paths and loops with known CPU cycle counts. So anything that could vary the timing of those software loops could distort the synthesized voice. Leaving the display enabled causes a number of things (interrupts, screen-refresh memory fetches, etc.) that can vary the number of clock cycles taken by the timing loops and paths. Thus, blanking the display reduced potential distortion of the audio output due to DAC sample-timing jitter.
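A small simulation (not from SAM itself; all parameters here are illustrative) shows why sample-clock jitter matters: sampling a sine wave at slightly irregular instants introduces an error that grows with the signal frequency and the jitter magnitude:

```python
import math
import random

def sample_sine(freq, n, rate, jitter):
    """Sample a sine wave at nominal instants t = k/rate, with each
    instant perturbed by up to +/- jitter seconds."""
    random.seed(1)  # fixed seed so runs are repeatable
    return [math.sin(2 * math.pi * freq * (k / rate + random.uniform(-jitter, jitter)))
            for k in range(n)]

RATE = 8000.0  # nominal sample rate, Hz
clean = sample_sine(1000, 256, RATE, jitter=0.0)
dirty = sample_sine(1000, 256, RATE, jitter=20e-6)  # ~20 us of timing slop

# RMS error between ideal sampling and jittered sampling
err = math.sqrt(sum((c - d) ** 2 for c, d in zip(clean, dirty)) / len(clean))
```

Even 20 microseconds of slop per sample, roughly what a 20-cycle timing variation on a 1 MHz 6502 would cause, produces a clearly nonzero error signal on a 1 kHz tone.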

Video blanking was not needed for synthesized audio on the Mac and Amiga because audio DAC timing was done in hardware rather than by software timing loops.

hotpaw2
  • 8,183
  • 1
  • 19
  • 46
1

As mentioned above, timing distortion is an issue, especially if your CPU is heavily involved in live video generation, as it is in "home computer" style machines.

Also, such designs tended not to be very well screened internally, and the power supply wiring (especially ground layout and filtering!) was not exactly optimized for a high-grade "mixed signal" type system. Fast-rise pulses (the currency of digital electronics) and low-level analog signals (the currency of audio) do not readily live in peace together: digital circuitry does not mind if, e.g., a ground is lifted by 100 mV for a microsecond, and can easily CAUSE such interference. If the same ground is used for anything analog, this can cause an analog value at audio level to be misinterpreted by several percent of full scale. Such problems have not even been fully eradicated from budget-grade onboard sound in modern PCs: audible interference from CPU activity is still commonly observed.

rackandboneman
  • 5,710
  • 18
  • 23
  • 2
    The problem on the 64 wasn't caused by the lack of video shielding, but rather the fact that when video is enabled the Commodore 64 disables the processor 25 times per frame, for 43 cycles each. – supercat Nov 02 '17 at 14:43