The blog post How Just Listen Works describes the basic working model of the Just Listen voice analytics solution. This time we dive into the details behind the speech-to-text aspect of voice analytics with the help of an interesting example: the Omarosa tapes.
First, let’s define the 3 basic processes that form the building blocks of speech-to-text technology:
In selecting technologies for Just Listen, we took into consideration the following:
Let’s try out the speech-to-text capability of Just Listen using the audio clip below, a conversation between US President Donald Trump and Omarosa.
For speaker recognition, we use the open-source library pyAudioAnalysis to identify the two speakers in the clip, President Trump and Omarosa. The library extracts acoustic features from short segments of the audio and clusters them to group the segments by speaker. Its advantages are that it is language independent, fast, and easy to integrate into our program.
from pyAudioAnalysis import audioSegmentation as aS

# Cluster short segments of the audio into two speakers
results = aS.speakerDiarization(audio, n_speakers=2, plot_res=False)
The results from the function give us three lists: speaker IDs, start times, and end times. Each corresponding element across the lists represents a continuous segment of audio in the original clip. For our audio, 3 segments were found:
speaker_ids = [1, 0, 1]
start_times = [0.0, 4.7, 9.6]
end_times = [4.7, 9.6, 21.8]
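Since the three lists run in parallel, a simple way to inspect the diarization output is to zip them together. A minimal sketch, reusing the lists shown above:

# Walk the parallel lists: one (speaker, start, end) triple per segment
for speaker, start, end in zip(speaker_ids, start_times, end_times):
    print("Speaker %d speaks from %.1fs to %.1fs" % (speaker, start, end))

# Speaker 1 speaks from 0.0s to 4.7s
# Speaker 0 speaks from 4.7s to 9.6s
# Speaker 1 speaks from 9.6s to 21.8s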
Once the different segments have been identified, the audio clip can be broken up into chunks. Each chunk is then transcribed, and the resulting text can be sent for text analytics using various Natural Language Processing (NLP) methods.
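As an illustration, the splitting itself could be done with an audio library such as pydub (our choice here is an assumption; any library that can slice audio by time would do). The clip file name below is hypothetical:

from pydub import AudioSegment

# Load the original clip; pydub indexes audio in milliseconds
clip = AudioSegment.from_wav("trump_omarosa.wav")  # hypothetical file name

chunk_files = []
for i, (speaker, start, end) in enumerate(zip(speaker_ids, start_times, end_times)):
    # Cut out one continuous single-speaker segment and save it to its own file
    chunk = clip[int(start * 1000):int(end * 1000)]
    chunk_name = "chunk_%d_speaker_%d.wav" % (i, speaker)
    chunk.export(chunk_name, format="wav")
    chunk_files.append(chunk_name)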
In this example, the Google Speech-to-Text service is used to perform the transcription. To use the service, you sign up and obtain authentication credentials, which are then used with the Google Cloud client libraries for Python. The code block below shows a simplified example of how the library is used.
import io

from google.cloud import speech
from google.cloud.speech import enums, types

# Authentication credentials are defined before this
client = speech.SpeechClient(credentials=scoped_credentials)

# Load the audio into memory
with io.open(file_name, 'rb') as audio_file:
    content = audio_file.read()
audio = types.RecognitionAudio(content=content)

# Define the configuration for the transcription
config = types.RecognitionConfig(
    encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,
    language_code='en-US',
    enable_automatic_punctuation=True,
    model='phone_call')

# Send the audio for transcription; the results are in the response variable
operation = client.long_running_recognize(config, audio)
response = operation.result(timeout=500)
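Once the operation completes, the transcript text can be read from the response object. Each result in the response carries one or more alternatives, the first being the most likely:

# Collect the most likely transcript for each chunk of recognized speech
transcript = ""
for result in response.results:
    transcript += result.alternatives[0].transcript
print(transcript)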
Speaker recognition and transcription for our audio clip give us the following results:
The speaker recognition library was able to identify which parts of the conversation were spoken by President Trump and which by Omarosa. This is not surprising: to human ears, the two voices are clearly distinct in their aural properties.
Subjectively, the transcription here is quite accurate, with only a few words missing or transcribed erroneously (e.g. "Marissa" instead of "Omarosa"). This is probably due in large part to the good audio quality, with no significant noise distorting the speakers' words.
The next logical feature is speaker identification. In our example, we would like to automatically identify each speaker in the audio clip as either President Trump or Omarosa. In a call center scenario, we would want to identify which parts of the conversation are spoken by the agent and which by the customer. This allows a more accurate analysis of the content of the call: for example, we can audit compliance requirements on the agent's side only and perform a more in-depth study of the customer's responses.
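As a purely illustrative sketch, once each transcribed segment is labeled with its speaker, an agent-side compliance check could be as simple as filtering the segments before matching required phrases. The segment data and the required phrase below are hypothetical:

# Hypothetical per-segment transcripts after speaker identification
segments = [
    {"speaker": "agent", "text": "Thank you for calling, this call may be recorded."},
    {"speaker": "customer", "text": "I would like to ask about my last bill."},
]

# Check compliance on the agent's utterances only
required_phrases = ["this call may be recorded"]
agent_text = " ".join(s["text"].lower() for s in segments if s["speaker"] == "agent")
compliant = all(phrase in agent_text for phrase in required_phrases)
print("Agent compliant:", compliant)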
Identification is easy for a human but much harder for an automated system. General speech models are built to recognize all types of voice profiles; identifying a specific person in speech requires building a custom voice model. This is usually done by providing audio samples of the target person together with their text equivalents. You can read about this aspect of the solution in this blog post.