
I'm currently working on a voice activity detection (VAD) topic, and I started to wonder whether computer-generated speech (e.g. from voice generators) can be discriminated from human speech using classic VAD features (e.g. 4 Hz energy modulation, zero-crossing rate, modulation entropy, short-time energy, MFCC coefficients).

Or, from a mathematical point of view, are they essentially the same signals (or signals with very similar characteristics)?
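
For concreteness, here is a minimal NumPy sketch (my own illustration, not part of the original question, with arbitrary example frame sizes) of two of the features listed above, short-time energy and zero-crossing rate, computed per analysis frame:

```python
import numpy as np

def short_time_features(x, frame_len=400, hop=160):
    """Short-time energy and zero-crossing rate per frame.

    x         : 1-D numpy array of audio samples (assumed len(x) >= frame_len)
    frame_len : samples per analysis frame (e.g. 25 ms at 16 kHz)
    hop       : hop size between frames (e.g. 10 ms at 16 kHz)
    """
    n_frames = 1 + (len(x) - frame_len) // hop
    energy = np.empty(n_frames)
    zcr = np.empty(n_frames)
    for i in range(n_frames):
        frame = x[i * hop : i * hop + frame_len]
        # short-time energy: sum of squared samples in the frame
        energy[i] = np.sum(frame ** 2)
        # zero-crossing rate: fraction of adjacent-sample sign changes
        zcr[i] = np.mean(np.abs(np.diff(np.sign(frame))) > 0)
    return energy, zcr

# Example: a low-frequency tone (voiced-like) vs. noise (unvoiced-like)
# produce very different energy/ZCR profiles.
if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    tone = 0.5 * np.sin(2 * np.pi * 200 * t)
    noise = 0.1 * np.random.randn(sr)
    for name, sig in [("tone", tone), ("noise", noise)]:
        e, z = short_time_features(sig)
        print(name, "mean energy:", e.mean(), "mean ZCR:", z.mean())
```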

mathreadler
A sufficiently advanced synthesizer could possibly be designed to match any desired short-term voice characteristic. Over the longer term, you end up with a Turing test. – hotpaw2 Jan 06 '16 at 13:32

1 Answer


Many current computer speech synthesizers seem to produce less jitter, i.e. less variability in the timing between successive glottal closures, than most typical human speech. (For example, humans often produce subtle vibrato as well as tremolo, though synthesizers could be changed to do this as well.)
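
As a rough illustration (my own sketch, not part of the original answer), and assuming the glottal closure instants have already been extracted by some epoch or pitch detector, the local jitter of the period sequence can be measured as the mean absolute difference between consecutive periods, normalized by the mean period. A perfectly regular excitation gives a value near zero, while a perturbed (more human-like) one gives a larger value:

```python
import numpy as np

def local_jitter(gci_times):
    """Local jitter: mean absolute difference between consecutive
    glottal-closure periods, relative to the mean period.

    gci_times : 1-D array of glottal closure instants in seconds,
                assumed to come from an external epoch/pitch detector.
    """
    periods = np.diff(np.asarray(gci_times, dtype=float))
    if len(periods) < 2:
        return 0.0
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

# A perfectly periodic pulse train (synthesizer-like) gives jitter ~ 0,
# while small timing perturbations raise it.
if __name__ == "__main__":
    t0 = 1 / 120.0                                  # 120 Hz fundamental
    regular = np.arange(100) * t0
    perturbed = regular + np.random.randn(100) * 2e-4
    print("regular:  ", local_jitter(regular))
    print("perturbed:", local_jitter(perturbed))
```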

hotpaw2
  • Accepting this as the answer, in connection with the comment added under my question. And thank you for the explanation. – user3038744 Jan 06 '16 at 14:31