Solving Conversational Voice Recognition

Speech is not a solved problem.

  • All applications will have some errors.
  • Key: design the system so some aspects are easy.

Example products: Voice Dialing, Amazon Echo, Meeting Transcription. Each one makes the speech recognition problem easier along some of the dimensions below (easier condition first, harder condition second).

  • Small vocabulary vs. unlimited vocabulary
  • Quiet environment vs. noisy environment
  • Near microphone vs. distant microphone
  • Deliberate, careful speech vs. spontaneous speech
  • Fixed speaker vs. multiple speakers


Many of the biggest problems facing humanity today, like curing diseases or addressing climate change, would be vastly easier to solve with the help of AI. At Convosense we believe that we can channel this revolutionary technology to radically improve human communications and collaboration.

By making human discourse fully machine-readable, we can open a vast landscape of business, education and consumer opportunities.

Conversational Speech Recognition Is Not A Solved Problem.
It Will Be Soon ...

Massive Growth Market

Starting from a base of $249 million in 2015, global speech and voice biometrics revenue will reach $5.1 billion by 2024, with cumulative revenue for the 10-year period totaling $19 billion at a compound annual growth rate (CAGR) of 40%. Enterprise growth markets are expected to include call centers, government IT, enterprise IT, and healthcare.
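The growth rate quoted above can be sanity-checked from the two endpoint figures. The following is a minimal sketch, assuming nine compounding years between 2015 and 2024; the function name `cagr` is illustrative and not part of any forecast methodology.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1


# Endpoint figures quoted in the text, in millions of US dollars.
revenue_2015 = 249.0
revenue_2024 = 5100.0

# Assumption: 2015 -> 2024 spans nine compounding periods.
rate = cagr(revenue_2015, revenue_2024, years=2024 - 2015)
print(f"Implied CAGR: {rate:.1%}")
```

Under that assumption the implied rate comes out at roughly 40%, in line with the quoted figure.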

NIST STT Benchmark Test History - May 09

Via speakerphone:
80% Word Error Rate
Via individual headsets:
50% Word Error Rate
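
Word Error Rate is conventionally computed as the number of word substitutions, deletions and insertions needed to turn the recognizer output into the reference transcript, divided by the number of words in the reference. The sketch below shows that standard calculation with a word-level edit distance. The function name and the two example sentences are made up for illustration and are not taken from the NIST benchmark.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution ("the" -> "a") and one deletion ("for").
wer = word_error_rate(
    "please schedule the meeting for monday",
    "please schedule a meeting monday",
)
print(f"WER: {wer:.0%}")
```

Because insertions count as errors, WER can exceed 100% when the recognizer produces many more words than the reference contains.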

Solving The Problem Through Deep Learning AI.

The Tipping Point Facilitated by Deep Learning.

Speech Recognition (Google Now): 30% reduction in Word Error Rate for English. Biggest single improvement in 20 years of speech research.
Voicemail transcription (Google Voice / Project Fi): using a long short-term memory (LSTM) deep recurrent neural network, transcription errors were cut by 49%.

How To Reduce The Error Rate To Ca. 20%?
Three broad areas for improvements:

Acoustic Modeling.

Captures how various language sounds appear in audio.

Language Modeling.

Captures frequency information about various words and phrases.
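
One simple way to capture such frequency information is a bigram model, which estimates how often a word follows the previous one. The sketch below is a toy illustration under that assumption; the three-sentence corpus and the function names are hypothetical, and production systems use far larger corpora, smoothing, or neural language models.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count how often each word follows each context word."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split()  # <s> marks sentence start
        context_counts.update(words[:-1])
        bigram_counts.update(zip(words, words[1:]))
    return context_counts, bigram_counts


def bigram_probability(context_counts, bigram_counts, prev_word, word):
    """Unsmoothed relative-frequency estimate of P(word | prev_word)."""
    if context_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / context_counts[prev_word]


# Hypothetical toy corpus, for illustration only.
corpus = ["recognize speech", "recognize speech today", "wreck a nice beach"]
contexts, bigrams = train_bigram_model(corpus)

p_speech = bigram_probability(contexts, bigrams, "recognize", "speech")
p_start = bigram_probability(contexts, bigrams, "<s>", "recognize")
print(p_speech, p_start)
```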


Decoding.

Combines the audio with the acoustic and language models above to produce the most likely word sequences.
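
The combining step described above can be illustrated by scoring a handful of candidate transcripts with both models and keeping the best one. This is a simplified sketch, assuming the candidates and their log-probabilities are already available; the scores, the `lm_weight` parameter and the function name are invented for illustration, and real decoders search over a much larger space of word sequences.

```python
import math


def best_hypothesis(candidates, lm_weight=1.0):
    """Pick the transcript with the highest combined log-score.

    Each candidate is (text, acoustic_log_prob, language_model_log_prob).
    """
    return max(candidates, key=lambda c: c[1] + lm_weight * c[2])[0]


# Hypothetical scores: the first candidate fits the audio slightly better,
# but the language model considers the second far more plausible.
candidates = [
    ("wreck a nice beach", math.log(0.6), math.log(0.001)),
    ("recognize speech", math.log(0.4), math.log(0.05)),
]

best = best_hypothesis(candidates)
acoustic_only = best_hypothesis(candidates, lm_weight=0.0)
print(best)
print(acoustic_only)
```

With the language model included, the more plausible sentence wins even though it matches the audio slightly less well; with `lm_weight=0.0` the choice falls back to the acoustic score alone.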

