
For example, on S40 or the 3310, users could set voice tags (also known as voice commands) for contacts. The phone could then recognize the spoken tag later and dial the right number.

With neural networks and much faster computers, this is not rocket science today. However, as far as I know even desktops back in 1999 did not run neural networks, let alone a handheld device designed at that time. The recognition on my 6230i works quite well and appears to be live: it starts listening and displays a result about 1 second after you've spoken the matching sample. I can also speak a whole sentence with only the pre-recorded keyword in it: it patiently waits about 4 seconds for me to mention something it recognizes, and shortly after I do, it displays the match. (If I remain silent, it reports no match after about 2 seconds.)

I've been trying to find out how this worked, but I haven't found anyone talking about it. The search results are all about newer devices and how to use the feature, not how it worked technically. Adding keywords like machine learning or "fft" (with quotes) to the search doesn't yield anything either.

Did it have some very basic machine learning? Then again, that would probably need more training data than just one recording to match against. Perhaps, then, it used a Fast Fourier Transform to extract the frequencies and match against them later? The device can play back the recordings (as an aside, it turns out 13-year-old me sounded like a little girl), perhaps because it uses the raw recording as a reference rather than merely extracting features.

Is it known how these systems (must have) worked?

Luc
  • '90s home computers were able to do at least basic neural networking. See, e.g., Neural Computation and Self-Organizing Maps: An Introduction (1992) or C++ Neural Networks and Fuzzy Logic (1995), both of which I used as programming references. – Alex Hajnal May 19 '21 at 03:26
  • In my recollection: poorly. I remember I had 3 names preprogrammed; the second and third were reasonably accurate. The first, however, would always consider it a match, no matter what noise, words, or silence I'd uttered. Pretty funny to just hiss at my phone to call that person. – Pelle May 19 '21 at 08:11
  • In the late 90s there was desktop speech recognition (dictation) software. It wasn't up to much but in command and control mode didn't tax a 66MHz Windows machine too much. Speech recognition is a fairly slow-moving field overall. – Chris H May 19 '21 at 11:38
  • Limited domain voice recognition has been available for a long time. Even hobbyists could buy hardware to do it from Radio Shack since the late 1970s. It's continuous, full-vocabulary speech recognition that was never quite practical until ~2010. – hobbs May 19 '21 at 14:42
  • @hobbs The chip that Radio Shack sold (the VCP200) had a fixed set of words it could recognize. It had two modes: In command mode the words it could recognize were "go", "stop", "left turn", "turn right", and "reverse". In the other mode the words it could recognize were "no" or "on" (both of which asserted one pin), "yes" or "off" (which both asserted another pin); a third pin would be asserted if the chip wasn't sure what word had been spoken. Dave at EEVblog did a video about it which is well worth a watch. – Alex Hajnal May 19 '21 at 15:53
  • @hobbs, even full-vocabulary speech recognition could work quite well, if you used it properly. My uncle used speech-recognition software on a ~100MHz Windows machine that would fall apart if you spoke casually, but got better than 99% accuracy if you spoke as if you were dictating to a secretary. – Mark May 19 '21 at 22:06
  • I have no idea if it's relevant to cell phones, but in research, statistical methods became dominant around 1992. Neural networks became dominant around 2013. The typical components are:
    1. A language model (something that tells you the probabilities of word sequences), built from a hidden Markov model.

    2. An acoustic model (something that tells you the probabilities of sounds given a word sequence). They used Mel cepstral coefficients, which I think are like Fourier transforms.

    A great textbook is "Statistical Methods for Speech Recognition" by Fred Jelinek.

    – Jetpack May 20 '21 at 06:27

1 Answer


I did basic voice recognition on an Atari ST (8 MHz 68000, 8-bit mono sampling[1]). If it could be done on a 1985 desktop[2], then it should be no problem for an early-noughties cell phone[3].

IIRC[4], the algorithm was roughly as follows:

  • Sample the audio (8-bit mono @ 22kHz?)
  • Split the audio into short (½ second?) pieces
  • Do an FFT on each piece. The results are placed into a 2-dimensional array (piece #, binned frequency intensity)
  • Compare the array against a set of reference patterns (one for each recognizable word, stored in the same format) and return the closest match (along with the strength of the match). A rough sketch of these steps in modern code follows this list; a diagram illustrating the matching is at the end of this answer.
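
To make those steps concrete, here is a minimal sketch in modern Python/NumPy of how such a fingerprint-and-compare scheme could look. The sample rate, piece length, number of frequency bins, and the sum-of-squares distance are illustrative assumptions, not the values the original program used:

    import numpy as np

    SAMPLE_RATE = 22050            # 8-bit mono sampling; the exact rate is a guess
    PIECE_LEN = SAMPLE_RATE // 10  # samples per piece (the piece duration is a guess)
    N_BINS = 16                    # coarse frequency bins per piece (illustrative)

    def fingerprint(samples):
        """Turn one recorded word into a 2-D array: (piece #, binned frequency intensity)."""
        n_pieces = len(samples) // PIECE_LEN
        grid = np.zeros((n_pieces, N_BINS))
        for i in range(n_pieces):
            piece = samples[i * PIECE_LEN:(i + 1) * PIECE_LEN]
            spectrum = np.abs(np.fft.rfft(piece))      # FFT of this piece
            # Collapse the fine-grained spectrum into a few coarse intensity bins.
            grid[i] = [chunk.sum() for chunk in np.array_split(spectrum, N_BINS)]
        return grid

    def recognize(samples, templates):
        """Return the reference word whose pattern is closest, plus the match strength."""
        probe = fingerprint(samples)
        best_word, best_dist = None, float("inf")
        for word, ref in templates.items():
            n = min(len(probe), len(ref))              # compare only the overlapping pieces
            dist = np.sum((probe[:n] - ref[:n]) ** 2)  # plain sum-of-squares distance
            if dist < best_dist:
                best_word, best_dist = word, dist
        return best_word, best_dist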

No neural networks were used (though I undoubtedly experimented with them), just basic arithmetic. Training was done by recording the same word multiple times and then averaging the resulting arrays. Note that the algorithm only worked for discrete words, not continuous speech.
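
The training step could then be little more than averaging, along the lines of this sketch (again illustrative, building on the fingerprint function above; the recordings-keyed-by-word layout is an assumption):

    def train(recordings_per_word):
        """Build one reference pattern per word by averaging several recordings of it."""
        templates = {}
        for word, recordings in recordings_per_word.items():
            grids = [fingerprint(r) for r in recordings]
            n = min(len(g) for g in grids)  # trim every grid to the shortest recording
            templates[word] = np.mean([g[:n] for g in grids], axis=0)
        return templates

    # Hypothetical usage:
    #   templates = train({"Alice": [rec1, rec2, rec3], "Carol": [rec4, rec5]})
    #   word, strength = recognize(new_recording, templates)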


[1] Using the Soundoff! cartridge from E. Arthur Brown/Diverse Data Systems.

[2] Or earlier: the MC68000 was released in 1979.

[3] I don't have detailed specs for the phones you listed, but the Handspring Visor (released in 1999) used an MC68EZ328 (DragonBall) at either 16 or 20 MHz. The DragonBall processors are basically embedded variants of the MC68000.

[4] I used the following book as a guide. I don't have it close at hand, so I can't verify the details.

Based on this review of the book, the code presented in the book was actually in Apple ][ BASIC (6502 processor), which I then adapted to the ST (probably using GFA Basic).



[Diagram showing how recognition is done]

Alex Hajnal
  • Looks like the pieces are around 0.1 sec, not 0.5 sec. An entire word can easily be spoken in 0.5 sec. – nanoman May 19 '21 at 08:12
  • @nanoman Is that from the book? If it is I'll update the answer. (My copy of the book is in a box 5 miles away.) – Alex Hajnal May 19 '21 at 08:14
  • No, I'm just applying common sense to challenge your estimate of "short (½ second?) pieces" -- ½ second is not short in speech. It doesn't take 3.5 sec to say "Alice" or "Carol". – nanoman May 19 '21 at 08:15
  • As I recall the window was pretty long. Not sure exactly what it was though. – Alex Hajnal May 19 '21 at 08:17
  • Related https://youtu.be/RRsq9apr5QY – Tim May 19 '21 at 09:50
  • Tim's video is about how Shazam works. TL;DW is that they assume it uses FFT and didn't actually look at the app's binary code or anything. The speaker remade it using an FFT, and the first 9 minutes explain what an FFT is, for anyone wishing to skip through. Their implementation also drops anything outside of 100Hz--5kHz. At 10:15 they show like one line of code and talk about that for five minutes, then at 16:49 they start to show the reimplementation demo software. – Luc May 19 '21 at 10:18
  • @Luc The fingerprinting (FFT) is the easy part. The rather tricky part is matching it against billions of possible matches. – Alex Hajnal May 19 '21 at 10:20
  • @AlexHajnal ah, is that later in the video? I stopped watching after they started toying with their demo. – Luc May 19 '21 at 10:22
  • @Luc Maybe, I didn't watch it all. Just my US$0.02 based on experience with writing search algorithms. – Alex Hajnal May 19 '21 at 10:24
  • @AlexHajnal completely offtopic, but I just noticed that "US$0.02" and "2 cents" have exactly the same number of characters – htmlcoderexe May 19 '21 at 12:54
  • @Luc They do discuss it a bit in the latter quarter of the video. To make the problem tractable you need to massively reduce your search space. They present one approach to doing this and discuss how to deal with noisy data. – Alex Hajnal May 19 '21 at 14:16
  • Macs in the early nineties had much-touted PlainTalk speech recognition out of the box, also - roundabout the System 7 era, and before PowerPC, so all on the older 680x0 platform. It wasn't perfect, but it worked. – J... May 19 '21 at 17:59
  • @Luc It doesn't actually explain Shazam in particular; the demonstration doesn't work with the actual Shazam database, does it? Here is an actual open-source implementation and an explanation of how it works in detail. – AndreKR May 19 '21 at 20:06
  • @htmlcoderexe: That's why I use "US$2E-2" or "USD2E-2". ;) – Eric Duminil May 20 '21 at 14:53