41

Suppose two people want to transmit some binary data (a png image perhaps) by voice.

Two of my characters want to share binary data but they can only use their voice and no other form of communication is possible.

By "voice", I mean any sounds that can be reliably produced and differentiated by average humans.

They can use computers to encode or decode the message but the transfer of information needs to happen between them. For example: They can't use computers to encode the data to sound, play it back and let a computer on the other side record it.

They could do it by pronouncing every single zero and one. ("One, Zero, Zero, One, ..") This is very slow and inefficient though.

Or they could use the hex encoding. ("B, Four, F, Nine, ...") This is better, but there is still more room for improvement.

Perhaps they could use base64 to encode it. ("D, capital G, H, three, ...") But notice how they have to specify capital letters ("capital D"), which lessens its efficiency.

What's the most efficient way to do it?

Cookie04
  • Can the receiver record the sounds for interpretation by a computer, or do they need to write it down? – wizzwizz4 Apr 16 '21 at 13:12
  • @wizzwizz4 From my question: They can't use computers to encode the data to sound, play it back and let a computer on the other side record it. So no, they can't record it and must write it down – Cookie04 Apr 16 '21 at 14:58
  • @Cookie04 Thanks for the clarification. In my dialect of English that would be “or”; I thought you were referring to the entire process being disallowed. – wizzwizz4 Apr 16 '21 at 15:37
  • Note: A PNG image is massive when transformed into binary-voice data. Even a stupidly simple image, like say - a 20 x 20 black and white picture - can end up a couple hundred bytes in length and be very hard to memorize or speak. If you plan to send conspiracy theories, gold maps or illegal porn over this method, it will take quite a while... – Mermaker Apr 16 '21 at 19:14
  • @T. Sar I am aware that it may take a long time, that's why I'm looking for the most efficient method to see if it even is feasible. Also: No such intentions. – Cookie04 Apr 17 '21 at 08:56

11 Answers

75

This problem has been much studied. While it may be difficult to define what 'best' is, there is a very efficient one already in existence, and that's PGP Words. This method was designed for transmitting long binary keys over a voice link, each word encoding a whole 8 bit byte.

It addresses a number of problems you probably haven't thought of, like reliability over a voice link. What happens if a word is missed, or a word repeated for clarity is mistaken for a second data word, or, reading from a long list, two words get swapped?

With 8 bits per word, you would normally require 256 words. The system is more sophisticated than that, and uses 512 words, an 'even' table of two syllable words, and an 'odd' table with three syllables, that are used alternately. That way, a missed or repeated word, or two swapped words, can be immediately spotted as an error.

Here are the first few and last few, from the wikipedia article linked above. The whole table can be printed on a single sheet of A4. They are in alphabetical order to aid the receiver. Obviously the list is optimised for English. Speakers of other languages may prefer a different list.

Hex  Even Word  Odd Word
---  ---------  --------
00   aardvark   adroitness
01   absurd     adviser
02   accrue     aftermath
03   acme       aggregate
04   adrift     alkali

..   ......     .......
FB   watchword  Wichita
FC   wayside    Wilmington
FD   willow     Wyoming
FE   woodlark   yesteryear
FF   Zulu       Yucatan
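As a sketch of how the alternating tables catch errors, here is a toy encoder/decoder in Python. The two five-entry word lists are just the rows shown above; a real implementation would need all 256 entries in each table.

```python
# Toy version of the PGP word scheme: even-positioned bytes come from the
# two-syllable table, odd-positioned bytes from the three-syllable table.
# Only the five rows shown above are included; a real implementation needs
# all 256 entries in each table.
EVEN = ["aardvark", "absurd", "accrue", "acme", "adrift"]
ODD = ["adroitness", "adviser", "aftermath", "aggregate", "alkali"]

def encode(data: bytes) -> list[str]:
    return [(EVEN if i % 2 == 0 else ODD)[b] for i, b in enumerate(data)]

def decode(words: list[str]) -> bytes:
    out = []
    for i, word in enumerate(words):
        table = EVEN if i % 2 == 0 else ODD
        if word not in table:
            # A missed, repeated, or swapped word breaks the alternation.
            raise ValueError(f"{word!r} is not valid at position {i}")
        out.append(table.index(word))
    return bytes(out)
```

If two words get swapped, at least one of them lands in the wrong table, so `decode` flags the error immediately rather than silently corrupting a byte.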

Quote from the wikipedia article

The PGP Word List was designed in 1995 by Patrick Juola, a computational linguist, and Philip Zimmermann, creator of PGP. The words were carefully chosen for their phonetic distinctiveness, using genetic algorithms to select lists of words that had optimum separations in phoneme space. The candidate word lists were randomly drawn from Grady Ward's Moby Pronunciator list as raw material for the search, successively refined by the genetic algorithms. The automated search converged to an optimized solution in about 40 hours on a DEC Alpha, a particularly fast machine in that era.

An alternative, at 4 bits (one nybble) per word, is simply to use the hex alphabet, perhaps with the ICAO/NATO pronunciation: 'zero' to 'niner', then 'alpha' through 'foxtrot'. There are more than 16 further letters left if needed for even/odd coding. Whether the complexity of the 512 words needed to double the throughput with PGP words is warranted over the simplicity of hex comes back to how you define 'best', and which factors in the setup or operation of the communication matter most.

You could get even higher efficiency by using more bits per word. 12 bits would need a 4096/8192 long dictionary. This would sacrifice much of the hard-won inter-word phonetic distance of the PGP scheme, so would require a higher fidelity voice channel, and more careful speakers.

Noting ruakh's comment, it's worth looking at the speed of the channel. His estimate is two seconds per word, which would probably be quite good for untrained users. That's 4 bits/s. If we compare that with Morse code, the minimum speed required by the FCC to grant a radio operator's license used to be 16 five-letter code groups per minute, which very roughly equates to about 8 bits/s.

The difference between the two systems is that Morse code requires training. I couldn't transcribe Morse, at any speed, without a lot of practice, and probably some tuition as well. Many English speakers could transcribe those words without practice, but what about those with a limited vocabulary, those with English as a second language, or speakers of other languages? PGP words is not really training-free if it's to be used by any human at all; it's only ready-to-go for people like those who invented it: educated, fluent English speakers (being a programmer would help as well). It's probably a skill that's easier to pick up than Morse, though.

With speed in mind, it might be worth reviewing the performance of hexadecimal via ICAO pronunciation. While a word every two seconds would be good going for recording a PGP word manually, I think hexadecimal could be done at easily twice that rate, transcribing as you go, making the bit rate of the two methods equivalent.

Clearly a lot of other assumptions about the training or experience of users, the setup costs, the quality of the audio link, have to be defined before the best system can be determined.

Neil_UK
  • Nice answer, learned something new today – MolbOrg Apr 14 '21 at 15:39
  • Awesome! Always cool to see a worldbuilding problem that it turns out people have already been working on. – Qami Apr 14 '21 at 15:45
  • +1. It may be worth pointing out that although this is much more efficient than reciting a sequence of "zero" and "one", it is still quite slow -- at a rough handwavy guess, maybe half an hour per kilobyte? (That's assuming about two seconds for the speaker to determine the word to say and then say it, or for the listener to identify the word that was said and then transcribe either the word or the byte.) I hope the characters in this story are getting something really nice in exchange for spending so much time doing something so boring! – ruakh Apr 15 '21 at 02:56
  • i bet you could transcribe binary morse! dit=0 dah=1 (or vice versa) – user253751 Apr 16 '21 at 14:00
  • @user253751: You could, but that's 1 bit per syllable; very inefficient. PGP Words is 3.2 bits per syllable. It should be theoretically easy to get up to 7 bits per syllable, if we lose the error correction. – Mooing Duck Apr 16 '21 at 14:57
  • "Longing, rusted, seventeen, daybreak, furnace, nine, benign, homecoming, one, freight car". Obviously number of syllables may differ in Russian, but I wonder if they were thinking of a system like this? – Darrel Hoffman Apr 16 '21 at 15:31
  • @MooingDuck it's binary Morse code. One syllable costs many Morse code characters. Sending syllables that represent binary, in Morse code, is slower. If you can get 7 bits per syllable it doesn't matter because it takes more than 7 bits to encode that syllable back into binary for transmission. – user253751 Apr 16 '21 at 15:35
  • @user253751: After reading your comments several times, I've determined I have no idea what you're saying now :( – Mooing Duck Apr 16 '21 at 17:15
  • @MooingDuck morse code but in binary: 10010101 -> -..-.-.- ;; Your idea: 10010101 -> 1001 0101 -> cow fox -> -.-. --- .-- ..-. --- -..- – user253751 Apr 16 '21 at 17:22
  • @user253751: Morse code has three symbols: dit, dah, and pause, so you can't just directly treat it as binary. You'd have to go from ternary to binary. – Mooing Duck Apr 16 '21 at 18:33
  • @MooingDuck yes, "cow fox" in morse code is even longer than I said, because I didn't include the pauses. If we're sending 0s and 1s we don't need pauses. – user253751 Apr 16 '21 at 19:36
  • @user253751 You have to have the pauses. "SOS HELP" and "I AM HIS DATE" have the exact same 0s and 1s. https://cs.stackexchange.com/q/34067/1429 – Mooing Duck Apr 16 '21 at 20:12
18

Efficiency means something that is easy/quick to encode, easy/quick to pronounce and understand, and easy/quick to decode. base64 was never made to be read aloud by a human, and saying "capital" every now and then will seriously slow you down. Hexadecimal is slightly better than spelling out individual bits, but its alphabet is perhaps too small, and the digit names are not all short: "seven" alone is two syllables.

  • Define 32 (or 64 or 128) short, distinctive words. Which ones depend on the languages and accents of the participants. Cat, Dog, Fox, ... but if Cat is in the list then Bat should not be there, or vice versa.
    The suggestion by the commenter Zeiss Ikon to use the NATO phonetic alphabet has merit, but it is limited to 5-bit units. Finding 64 words might be feasible ...
  • Assign each word a number.
  • Break the binary data into "bytes" of 5 (or 6 or 7) bits, encode and read the words.
  • Listen to the words and then decode into "bytes."

It might be a good idea (even if it decreases efficiency) to add a checksum to your transmission.
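A minimal sketch of these steps in Python, assuming a 32-word list (the 26 NATO letters plus "zero" through "five") and a spoken parity word after every four groups; both choices are illustrative, not a standard.

```python
# Sketch: pack the data into 5-bit groups, speak each group as a word from a
# 32-word list (the 26 NATO letters plus "zero" through "five" here, an
# illustrative choice), and insert a spoken parity word after every four
# groups as a crude checksum.
NATO = ("alfa bravo charlie delta echo foxtrot golf hotel india juliett "
        "kilo lima mike november oscar papa quebec romeo sierra tango "
        "uniform victor whiskey xray yankee zulu "
        "zero one two three four five").split()

def to_words(data: bytes) -> list[str]:
    bits = "".join(f"{b:08b}" for b in data)
    bits += "0" * (-len(bits) % 5)          # pad to a multiple of 5 bits
    out = []
    for i in range(0, len(bits), 5):
        out.append(NATO[int(bits[i:i + 5], 2)])
        if (i // 5) % 4 == 3:               # parity word after every 4th group
            out.append("even" if bits[:i + 5].count("1") % 2 == 0 else "odd")
    return out
```

The receiver can check each parity word by hand while transcribing, which catches single-bit errors within a few seconds of when they happen.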

Matthew
o.m.
  • The checksum is an interesting idea. It would however need to be one that doesn't require a computer to check. Does that still work? (The endgoal is to put the data back into a computer but the transmission needs to be between two humans and any errors need to be identified and corrected there and then) – Cookie04 Apr 13 '21 at 18:20
  • There is an existing 36-character set of very distinct symbols: the Internation Phonetic Alphabet. Alpha, Bravo, Charlie, Delta, Echo, etc. along with the numbers Zero through Niner. This gives five bits for each character, plus leaving a few for special characters. This set was created specifically to avoid confusion of one character for another (hence by Bravo, Charlie, Delta, and Tango sound very different, for instance). Further, they're distinguishable even with an accent, a fair amount of noise, and to those who aren't native English speakers. – Zeiss Ikon Apr 13 '21 at 18:28
  • @Cookie04 A simple checksum such as a parity bit can be checked by hand, though it would be relatively time-consuming. – Cadence Apr 13 '21 at 18:32
  • Yes that could work, although it probably wouldn't be as safe as "real" checksums. But as you rightly pointed out, speed is a concern and depending on the checksum algorithm, checking the checksum could be very slow for a human. – Cookie04 Apr 13 '21 at 18:35
  • IPA has four more words than you need... which should conveniently let you drop some of the more awkward ones (I'd probably go with whichever four of India, Juliet, November, Romeo, Sierra, and Unicorn/Uniform you like least, as those are the three-syllable ones). That said, note that IPA also chose words that are, y'know, actual words (in English), which helps with identification. For your purposes, that may not be useful? – Matthew Apr 13 '21 at 18:45
  • @Matthew Identification is very useful! Remember, they have another person on the other end of the message who has to interpret this by hand. You want to make sure they're not hung up on the difference between two similar phonemes in the middle of decoding your high-speed transmission. – Cadence Apr 13 '21 at 18:52
  • @Cadence, yes, but what I meant is that IPA is somewhat geared to being identifiable by untrained English speakers. It wasn't clear if the OP is asking this for a quasi-real-world use, or for some hypothetical world subject to Translation Convention, or to what extent the receiver is expected to be trained in the use of the system and would therefore be able to recognize made-up (but still distinct from each other) words equally well. – Matthew Apr 13 '21 at 19:32
  • @ZeissIkon Ok, this needs to stop. The acceptable alternative names for the International Radiotelephony Spelling Alphabet (the Alpha, Bravo, Charlie... thing) are the NATO phonetic alphabet, the NATO spelling alphabet, the ICAO phonetic alphabet and the ICAO spelling alphabet. Note that International Phonetic Alphabet (IPA for short) is not on the list. That's because the IPA is something completely different that is not at all suited for this. – No Name Apr 14 '21 at 11:07
  • The checksum might be in the meaning: if you use not words but certain letter combinations as 'characters' (for instance only the first two consonants: 'fish' and 'fast' would both (f-s) be '01', 'table' and any other t-b... word would be '02',... ). you could have a check[sum] by switching category every, say, ten words: animals, verbs, things, adjectives,... or you might have to rhyme pairs, etc. . The checks would not be sums as such, but would still work. i.e. a failure in transmission would stick out most of the time ( a checksum is not foolproof either) – bukwyrm Apr 14 '21 at 12:12
  • @NoName Not going to disagree, but as a US licensed amateur radio operator ("ham") I've never heard the terminology you give. Perhaps, like many other things, the "wrong" way is far more common than the "right" -- but if you use one of those other terms, few if any who actually use this thing will know what you're talking about out of context. – Zeiss Ikon Apr 14 '21 at 12:39
  • @ZeissIkon Fair enough, but the conlangers on this site (myself included) will be confused until they read the comments. My real goal was to get o.m. to change the wording in the actual answer. I only picked on you because they cited you and I can only include one "at". Actually, now that I look closely, you used "Internation Phonetic Alphabet", without the -al suffix, and didn't abbreviate at all. My bad. – No Name Apr 14 '21 at 13:01
  • @NoName, ack, so sorry! You are right of course, and I know better (not a "ham", but I did make myself learn NPA... although somehow I learned "yoyo" instead of "yankee"). I just copied Zeiss Ikon without thinking. If only it was not too late to go back and fix my comment... . Oh, and look, I can edit... and add a helpful link while I'm at it. You're welcome, and thanks for the correction! – Matthew Apr 14 '21 at 13:21
  • @NoName None of us type perfectly, even after (in my case) four and a half decades at it. In any case, the idea seems reasonable even if there's controversy over what to call it. It's a system of hard-to-confuse words that represent letters, which in turn can represent 5-bit words (equivalent to original Baudot code). – Zeiss Ikon Apr 14 '21 at 13:40
  • @Matthew Thanks for the correction, linguists everywhere rejoice! – No Name Apr 14 '21 at 15:10
  • @ZeissIkon I know what the NATO Alphabet is, yes - I'm a navy brat. All I'm saying is that the IPA I, and a good fraction of the people around this particular stack, know is a system for transcribing speech into writing in a way that indicates the sounds spoken without need for an audio recording, rather than a method of transmitting alphanumeric serial numbers over a noisy media channel without loss of information. Wiki link for your reading pleasure: https://en.wikipedia.org/wiki/International_Phonetic_Alphabet – No Name Apr 14 '21 at 15:21
  • @NoName Oh, that thing. Yep, completely useless for this need. The ICAO one is what's needed here. – Zeiss Ikon Apr 14 '21 at 15:51
  • Funny story - after I had written out my answer (which references IPA) and then seen the alpha-bravo-charlie stuff being called IPA here, it sent me running back to the Wikipedia article I had linked to make sure it was the right one. xD – Qami Apr 15 '21 at 15:39
13

Choose X syllables, and then base-X-encode your data.

Choose the largest set of syllables that you consider for your purposes to be mutually distinguishable, and then base-X-encode your binary data, where X is the number of syllables you've chosen.

A simple example: suppose we choose only syllables that begin with a consonant, and are followed by a vowel, from the following sets:

{p,k,t,ch,b,g,d,j}

{a,e,i,o,u}.

We now have 8 x 5 combinations, giving us 40 syllables. You can base-40 encode your binary data into a rapid-fire of syllables that would sound something like:

"pagidachotajapikachutagujiko.... "

This example used Latin-alphabet letters and assumed a more-or-less standard English pronunciation. To make a more rigorous system, it would be advisable to use the International Phonetic Alphabet to choose the sounds for your syllables, and to avoid phonemes that (1) would be difficult to distinguish from one another or (2) do not lend themselves to fast pronunciation.

Note: This answer is similar to o.m.'s...just "encoding" at the level of syllables instead of words. By customizing your list of syllables to all be quickly-speakable and using every possible combination, I expect this would tend to increase the efficiency of information per syllable. However, if the memory and cognitive processing rate of the average humans using it is taken into account, it's possible (probable?) that o.m.'s word-sequence-based transmissions will be more easily remembered without error.
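A minimal Python sketch of the base-40 scheme described above. Treating the data as one big integer keeps the example short, but it silently drops leading zero bytes; a real scheme would also transmit the length.

```python
# Sketch of the base-40 scheme: 8 consonants x 5 vowels = 40 syllables.
# Treating the data as one big integer keeps the example short, but it
# silently drops leading zero bytes; a real scheme would send the length too.
CONSONANTS = ["p", "k", "t", "ch", "b", "g", "d", "j"]
VOWELS = ["a", "e", "i", "o", "u"]
SYLLABLES = [c + v for c in CONSONANTS for v in VOWELS]  # 40 entries

def encode(data: bytes) -> str:
    n = int.from_bytes(data, "big")
    out = []
    while True:
        n, digit = divmod(n, 40)       # peel off base-40 digits
        out.append(SYLLABLES[digit])
        if n == 0:
            break
    return "".join(reversed(out))
```

Decoding would split the stream back into syllables; since "ch" is two letters, a greedy parse that tries "ch" before "c" (or any unambiguous syllable inventory) is needed.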

Qami
11

Sing it

You're thinking too much along programming lines. This is valuable for the sake of compression: but everything else you've mentioned (like hex encoding or base64) is just a way to make binary more consumable by the human eye (and other things).

Frankly, from a programmatic perspective, the only thing you really care about is compression. You want to send as little data as possible to guarantee maximum transmissibility. But other than compression, it doesn't matter how you express your ones and zeros. (It helps to remember the good old days on the Apple II computers where the average geek cared about Assembly Language.)

What you really want to think about is music

If you want the average human to convey ones and zeros, ask them to sing. Generally speaking there are only seven notes, but there are half notes, quarter notes, sharps and flats. Those alone give you 63 notes. Add shifts in octave and you get more. Your average person can span two octaves. Now we're up to 126 notes. 127 with a "pause" (no sung note in the meter of the song). Expand the vocal range just a hair and you get to all 128 positions, allowing you to express every combination of seven bits ("1111111"), or 2^7.

BTW, I'm not a music expert. I wouldn't be at all surprised if you could express a whole lot more data than 2^7 with the magnificent expressiveness of musical notes.

From here, it's just a question of mapping notes to binary combinations and, boom, vocally expressed digital data.

Finally, I'm not suggesting that what you get would sound good... only that it could be done.
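As a sketch, and taking the more conservative estimate from the comments that an average singer can reliably produce closer to 16 distinct pitches than 128, a nibble-per-note mapping might look like this (the two-octave note names are an arbitrary choice of mine):

```python
# Sketch: one 4-bit nibble per sung note. The comments suggest an average
# singer can reliably produce far fewer than 128 distinct pitches, so this
# uses 16 notes over two octaves; the note names are an arbitrary choice.
NOTES = ["C3", "C#3", "D3", "D#3", "E3", "F3", "F#3", "G3",
         "G#3", "A3", "A#3", "B3", "C4", "C#4", "D4", "D#4"]

def to_notes(data: bytes) -> list[str]:
    out = []
    for b in data:
        out.append(NOTES[b >> 4])     # high nibble first
        out.append(NOTES[b & 0x0F])   # then low nibble
    return out
```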


JBH
  • Interesting idea. Wouldn't it be difficult for a non-musician to hit those notes and correctly identify each note though? If someone would sing a note, I probably couldn't tell you which one. As I said in my question I need a solution that works for the average human. Not sure if the average human could do that – Cookie04 Apr 13 '21 at 19:34
  • @Cookie04 Your average person can follow simple to moderately complex songs fairly well. Add lyrics (using an averaging algorithm to "hit the note') and it's even easier. The distinction between each note on the vocal scale is quite high, so we're back to an averaging algorithm (already solved, there are apps that make your singing sound better, that's the basis for the tech to "decode" the data). Frankly, I can't see how this wouldn't work with one possible exception: some note shifts would require a slower tempo or the human voice can't make the shift. But, just slow it down. – JBH Apr 13 '21 at 19:38
  • I'm somewhat musically talented, and I wouldn't be able to transcribe this very fast. You don't need (and don't want, due to differing vocal abilities) to use absolute pitches, so having perfect pitch is not required, but even so, some people may not be able to work out relative pitches, especially if you're throwing sharps into the mix. (Note that there is no such thing as sharps and flats. Yes, pedantically, I'm simplifying a bit, but in practice you can't expect to tell an a-sharp from a b-flat.) – Matthew Apr 13 '21 at 19:45
  • At best, I don't expect adding pitch is a practical way to improve total transmission speed. You'd probably need to be able to record the "song" and decoding would be slow. But if you just need to maximize "transmission" speed and don't care about "reception" speed, then maybe this would work. (But also, don't forget that some people can't carry a tune to save their lives.) OTOH, if your receiver is a computer, you have a totally different set of problems. But that's not how I understood the question? – Matthew Apr 13 '21 at 19:47
  • Oh, and another issue is that an average person probably can't sing an arbitrary tune from a written representation with any reliability. Particularly since this is an application that is going to either ignore all the "rules" about forming a "nice" melody, which is going to make it hard to sing (again, I have some knowledge/experience in this area), or else will involve a very complicated encoding process which will make it much harder to decode if you don't have computer assistance for that. – Matthew Apr 13 '21 at 19:51
  • @Matthew, you're absolutely right... you're also taking all the fun out of this. Nobody in their right mind would convey digital data by voice. Not under any circumstances. But since we're dealing with a fictional world with fictional rules... I think this would be fun. Could you imagine a society that had become reasonably adept at this - and then we defenseless humans met them? Shaka, when the walls fell... – JBH Apr 14 '21 at 02:18
  • It'd be interesting to do this mapping programmatically just to hear what images sound like. I imagine it would be something like a dialup router. – gregsdennis Apr 14 '21 at 04:55
  • This is musically and mathematically quite an overestimate - There are 12 semitones in an octave, including sharps and flats (which are the half-notes), so two octaves gives you about 24 notes, or 48 if you allow quarter-notes (which I think most people would not be able to reliably sing or discern), not 63. Using a pause is also problematic if the data happens to encode to multiple pauses in a row, it may not be easy for the listener to decode exactly how long the pause was. Even so, 128 is 2^7, not 2^8. – kaya3 Apr 14 '21 at 08:11
  • That said, this is a good idea, but having a wider range doesn't necessarily make the communication channel more efficient, because it's harder to hit notes exactly if they are far apart and you have to switch between them quickly. I think 8 or 16 different notes would be reasonable for an average human to sing and hear at a reasonable speed, more than that and it's either going to be unreliable, or the singer would have to slow down so the bandwidth would be lower. – kaya3 Apr 14 '21 at 08:14
  • Hi @kaya3. The OP did not specify that this had to be heard by another human, so I ignored that aspect (as a rule, I suspect that "people" would be terrible at decoding digital data by audio. The encoding action is plausible because we can be taught to sing, indeed to speak. but to hear something like this and allow an image to form in our heads? That's a stretch for me). Good catch about 2^7, I'll correct that (I must have been counting 2^0 as "1", duh...) – JBH Apr 14 '21 at 14:56
  • @JBH The question asks for a way of transmitting which "can be reliably produced and differentiated by average humans", so a human would have to be able to listen to it and tell what the bits are. (Of course the human wouldn't have to also figure out what the bits actually mean as a PNG image or something like that.) – kaya3 Apr 14 '21 at 18:08
  • @kaya3 You're right, I completely overlooked the "differentiated" part... still... no matter how hard it may seem to do this with music, the effort on the receiving human's part to correlate the incoming vocal data and draw a mental picture exceeds the difficulty by orders of magnitude. The idea that a human (or any creature) could parse and recreate a 10"x10" image at 1200dpi is unbelievable to the point of being ludicrous. (Note that knowing what it is, is actually simple. An example already exists in HTTP (internet browser) protocol. It's called a MIME header.) – JBH Apr 15 '21 at 16:50
  • The question doesn't say the human would have to work out what the data means - just that they'd have to receive it. – kaya3 Apr 15 '21 at 21:44
  • @kaya3 That might be sophistry. Simply acknowledging the existing of noise is meaningless. We might need the OP's confirmation, but it seems that either the receiving human must be capable of reproducing the example image in their head - or it's irrelevant that they hear it at all. – JBH Apr 16 '21 at 02:40
  • Or they could write it down and enter it into a computer later... if the problem were really about a human visualising an image in PNG format from the raw bits of the file, I would think the question would ask if that is feasible. It is hard to imagine the OP simply assuming it is. – kaya3 Apr 16 '21 at 06:53
  • To clarify: The receiving human doesn't need to interpret the data, just receive it. They will enter it into a computer later and then the computer does the decoding for them. – Cookie04 Apr 16 '21 at 15:33
  • Thanks, @Cookie04! – JBH Apr 17 '21 at 01:55
5

The best way to communicate binary information over long distances by human voice is a whistle language. The existing whistle languages are used in rough landscapes to communicate over longer distances than the plain voice can carry.

In this case, you can have one pitch for one and another for zero (or work up a common set of tones for numbers 0 through 7).
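The "common set of tones for 0 through 7" variant amounts to reading the data three bits at a time. A sketch, with purely illustrative pitch values:

```python
# Sketch of the octal-whistle variant: read the data three bits at a time
# and whistle one of eight pitches per group. The pitch values (Hz) are
# purely illustrative.
PITCHES = [500, 700, 900, 1100, 1300, 1500, 1700, 1900]

def to_pitches(data: bytes) -> list[int]:
    bits = "".join(f"{b:08b}" for b in data)
    bits += "0" * (-len(bits) % 3)    # pad to a multiple of 3 bits
    return [PITCHES[int(bits[i:i + 3], 2)] for i in range(0, len(bits), 3)]
```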

David R
5

Human speech is astonishingly effective at establishing communication between humans. Not only does it use the vocal cords efficiently, but it uses the language centers of the brain efficiently as well. Our languages have just enough redundancy to suit human listeners.

Accordingly, the best way to have an average human transmit and receive binary data is to map it to human language. Use an AI to construct an injective mapping between strings of bits and sentences in English (or whatever the native language of your characters is). The result should sound like a typical monologue.

This is almost certainly the most efficient way to communicate a binary string between people. If one side has other tools available, such as a tape recorder, there may be more efficient means. But never underestimate the benefit of leveraging a few decades of speech practice on behalf of both parties.

Cort Ammon
  • Amusingly enough, the resulting encoded binary should look something like the content of the answer itself... although probably more nonsensical (or at least, I hope what I wrote is more sensical than the encoded content would be!) – Cort Ammon Apr 14 '21 at 13:29
  • Yeah, markow chans back in business, woohoo... Was looking for the answer – MolbOrg Apr 14 '21 at 15:34
3

What you need is a drum. Actually, any old stick and a hard surface to strike it on will do in a pinch, but that limits you to binary data ("hit" or "no hit") and doesn't allow for any out-of-band information such as "end of message". A drum that makes a different sound on sustains (the stick is kept in contact) vs. rests (the drum is allowed to vibrate on its own) will let you have notes of varying lengths which can form more complex codes. Drumbeats can be very fast and accurate, far more than most singers, and can send longer messages without pausing.

For instance, with such a drum, you could have a code of "short" and "long" beats that encodes letters and punctuation; you may know this as Morse code. Or, you could cut to the chase and encode a binary or quaternary (or any other convenient base) number directly as a series of beats.

The advantage and goal here is to keep the encoding and decoding simple, because the limitation is not in the ability of the human body to make noise, but the speaker's ability to figure out which noises they ought to be making. A base-32 or -64 code is more efficient in terms of symbols, but any speed benefit is undercut by the user's own speed in interpreting those symbols as actual sounds.

In contrast, using only three symbols (short, long, and silence), Morse code is a proven example of encoding and decoding a message, by human operator, in real time. So it makes sense to investigate the area of low-density but high-speed communications to make most efficient use of the human part of the equation.
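Encoding a binary number "directly as a series of beats" could look like the following sketch, using a short beat for 0, a long beat for 1, and a rest at each byte boundary so the listener can keep alignment (the notation is my own):

```python
# Sketch: each byte becomes eight beats, most significant bit first.
# "." is a short beat (0), "-" a long beat (1), and a space is a rest at
# each byte boundary so the listener can keep alignment.
def to_beats(data: bytes) -> str:
    return " ".join(
        "".join("-" if (b >> i) & 1 else "." for i in range(7, -1, -1))
        for b in data
    )
```

Unlike Morse for letters, this needs no lookup table at all: the receiver writes down exactly what they hear, one bit per beat.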

Cadence
  • I specified that I need a system which uses the human voice (Any sound that can be reliably produced and differentiated by the average human) – Cookie04 Apr 13 '21 at 18:54
  • @Cookie04 Ah, I misunderstood. By "reliably produced" I assumed you just meant it couldn't require any specialized or complex equipment, rather than no equipment at all. – Cadence Apr 13 '21 at 18:58
  • Clapping or clicking (e.g. with the tongue) would work in place of a drum. – gregsdennis Apr 14 '21 at 04:51
  • The main trouble here is that in order to identify a non-sound, you need a reliable clock. – gregsdennis Apr 14 '21 at 04:52
3

I'd encode the binary to a huge lookup table of values and concepts, link these concept and values to phonemes that are easy for the human vocal apparatus to process, and for the human ear to hear.

Group these sounds into more complex structures, for improved bit density.

Ideally, I'd assign some contextual meaning to these complex structures of phonemes, to ensure greater ease of comprehension, and cut down on the error rate when producing them. There would be intricate rules regarding juxtaposition of these structures, disallowing obvious errors. We might need to make lookup tables of these structures, as an aide to learning and for error-checking.

Of course, the art of learning such a highly complex and convoluted data pattern will need to commence from childhood, and continue throughout adult life.

For convenience, we will give this convoluted binary-encoded-as-phonetic-sounds-in-structured-groups-with-contextual-meanings scheme an easier name to identify it.

How about "English"?

PcMan
2

I suggest combining the best of o.m.'s accepted answer and JBH's suggestion of singing, to use a spoken tonal approach. Many people use languages in daily life that rely on tone for meaning, so your "average person" may well be able to.

Tone isn't usually employed in English, but combining even a simple high-low distinction, which pretty much anyone can do, with the NATO alphabet's 36 characters gets you 72 distinct values (or 6 bits if you want simpler mapping). I reckon this is optimal for a pair of English speakers with no training.

Three tones (high-mid-low) gives you 108 values (more than enough for something based on ASCII85), but if you can use rising and falling tones too, you get 36×5=180 values. Some Chinese dialects use even more. Unless mapping bits to vocalisations has to be done by people using a look-up table, you don't need your unique vocalisations to add up to a power of two, as demonstrated by ASCII85. The more tones used, the more training will be required, at least if your speakers aren't used to it.

Cooking up a new phonetic alphabet using only one- and two-syllable words is probably helpful for efficiency. Of course you could discard the mapping to the alphabet, but that mapping does make transcription easier (tone markers will be needed).

You should consider the fidelity of the channel - a quiet room, across a deep gorge with a noisy river, a telephone line, a party, etc. will have different characteristics. Some will need a wider distinction between sounds than others. Only in the quietest settings could volume be used as another variable.

Running some numbers on the rate using simple mappings (i.e. rounding the number of unique sounds down to a power of two): English speech is apparently around 4 syllables per second. Using simple mappings, with 6 bits over two tonal variants of the two-syllable A-Z0-9 alphabet, you'd get 12 b/s. With 5 tones rounded down to 7 bits, but upping the rate to a still-reasonable 6 syllables/second you're now at 21 b/s. Pushing it further, if you can come up with 64 distinct single-syllable words and can manage to apply 4 tones to those sounds you'd get 256 values per syllable, or a grand total of 32b/s. Don't forget to breathe.
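Those rate figures can be reproduced with a one-line calculation (using the answer's own estimates for syllable rate and bits per symbol):

```python
# Reproducing the rate figures above from the stated assumptions
# (syllables per second, syllables per symbol, bits per symbol).
def bits_per_second(syl_per_sec, syl_per_symbol, bits_per_symbol):
    return syl_per_sec / syl_per_symbol * bits_per_symbol

print(bits_per_second(4, 2, 6))  # two-tone NATO alphabet, 6 bits per 2-syllable symbol
print(bits_per_second(6, 2, 7))  # five tones, 7 bits per symbol, at 6 syllables/s
print(bits_per_second(4, 1, 8))  # 64 one-syllable words x 4 tones, 8 bits per syllable
```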

Chris H
1

There are approximately 470k English words. $\log(470000)/\log(2)$ yields about 19 bits per word. Good luck teaching kids all English words and their binary counterparts as well.

At that point, it might be easier to create 256 "words" at 1 byte per word. Words that are syllabically short and enunciated carefully. You could likely do 1-syllable each, so, "cabagathadedodunit", where each syllable is a 1-byte value "word". That string would yield 8 bytes or 64 bits.

If you think your kids are just really damn smart, you could bump that up to 65536 words stowing 2 bytes per word with more syllables per word, but that presents a much greater error risk.
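Checking the arithmetic above:

```python
import math

# Bits per spoken word for the three vocabulary sizes discussed above.
print(math.log2(470_000))  # full English vocabulary: about 18.8 bits/word
print(math.log2(256))      # 256-word list: 8 bits/word
print(math.log2(65_536))   # 65536-word list: 16 bits/word
```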

BMF
-1

Morse code may be voiced or whistled. Computers can turn the binary data into letters (base 26, or perhaps a lower base by dropping the longest morse codes.)

Those who train can get their Morse up to 60 wpm or so. The average word is about 5 letters, so 300 letters per minute, each encoding about 4.7 bits: 1410 bits per minute, or 23.5 bps. That is for Morse sent with equipment, though; I found no data for voiced Morse.
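For what it's worth, the arithmetic checks out:

```python
import math

# The rate estimate above: 60 wpm, 5 letters per word, log2(26) bits/letter.
letters_per_minute = 60 * 5
bits_per_minute = letters_per_minute * math.log2(26)
print(bits_per_minute / 60)  # about 23.5 bits per second
```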

Helge Hafting
  • Voiced Morse is considerably slower. You can't move your mouth fast enough to get more than about two bits per second. – Mark Apr 16 '21 at 22:08
  • @Mark have you never heard anything like this https://www.youtube.com/watch?v=lINneylEo0U I recon he'd be competitive with PGP words as a bit rate – Pete Kirkham Apr 16 '21 at 22:53