
ASR systems are usually evaluated with WER (word error rate), which summarizes three types of edits when computing the edit distance: substitutions, deletions and insertions. According to the Wikipedia page, there are two versions:

A. each type is given the same score: 1, 1, 1 (S, D, I)

B. Hunt's version: 1, 0.5, 0.5

I've noticed that sclite has its own weights:

C. sclite: 4, 3, 3

What are the use cases for each weighting scheme? My goal is to compare different speech recognition APIs.
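To make the comparison concrete, here is a minimal sketch (not from any of the tools above) of a weighted edit distance in Python, where the substitution, deletion and insertion costs are parameters, so schemes A and B can be compared on the same reference/hypothesis pair. The example sentences are made up for illustration.

```python
def weighted_edit_cost(ref, hyp, sub=1.0, dele=1.0, ins=1.0):
    """Minimum-cost alignment of hyp to ref with per-operation weights.

    ref and hyp are lists of words; returns the total alignment cost.
    """
    m, n = len(ref), len(hyp)
    # d[i][j] = cheapest way to turn ref[:i] into hyp[:j]
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * dele                 # delete every reference word
    for j in range(1, n + 1):
        d[0][j] = j * ins                  # insert every hypothesis word
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = 0.0 if ref[i - 1] == hyp[j - 1] else sub
            d[i][j] = min(d[i - 1][j - 1] + match,  # match / substitution
                          d[i - 1][j] + dele,       # deletion
                          d[i][j - 1] + ins)        # insertion
    return d[m][n]

ref = "the cat sat on the mat".split()
hyp = "the cat sat the mat mat".split()   # "on" deleted, extra "mat" inserted

# Scheme A: equal weights (S, D, I) = (1, 1, 1)
wer_a = weighted_edit_cost(ref, hyp, 1.0, 1.0, 1.0) / len(ref)
# Scheme B, Hunt's version: (1, 0.5, 0.5)
wer_b = weighted_edit_cost(ref, hyp, 1.0, 0.5, 0.5) / len(ref)
```

Note that under scheme A the one deletion plus one insertion cost the same as two substitutions (2/6 ≈ 0.33), while under Hunt's weights the deletion+insertion alignment is cheaper (1/6 ≈ 0.17), so the weights can change which alignment is chosen, not just the final score.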

dimid

1 Answer


The various weighting schemes are a domain-specific attempt to adapt the raw edit distance metric into something more "perceptual". The needs and error tolerances of one domain, say large-vocabulary open speech, may differ from those of command-and-control "assistant" systems.

Humans weigh different types of ASR errors differently and are less forgiving of errors (such as insertions) that may distort the meaning of the transcription.

Have a look at this paper: Predicting Human Perceived Accuracy of ASR Systems

ruoho ruotsi
  • Thanks, that's an interesting paper, I wonder what were their final weights. My goal is to compare APIs for conversations between people (i.e, open with a large vocabulary). – dimid Dec 06 '16 at 06:04