Speaker diarization is a dedicated domain of speech research in which the number of speakers in a recording and the intervals during which each speaker is active are determined in an unsupervised fashion. It is useful in many applications that process audio and/or video documents. We are interested in its application to financial services, where automatically detecting the number of speakers in the phone conversations of traders, wealth brokers and contact center workers, together with the periods during which each speaker is active, could help meet regulatory requirements.
A typical speaker diarization system consists of four primary components (a minimal sketch of how they fit together is given after the list):
1. Speech segmentation, where the input audio is segmented into short sections to reduce the chance of having more than one speaker per section. Non-speech sections are also filtered out at this stage.
2. Audio embedding extraction, where specific features that contain unique characteristics of the speaker are extracted from the segmented sections.
3. Clustering, where the extracted audio embeddings are clustered. One cluster per speaker ID is formed at this stage of the process.
4. Re-segmentation, where the clustering results are further refined to provide the final diarization results.
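To make this pipeline concrete, here is a minimal sketch of how the four components could be chained together. The `segmenter` and `embedder` callables are hypothetical placeholders for components 1 and 2, and the use of agglomerative clustering with a distance threshold for component 3 is an assumption made for illustration, not the actual implementation used in this work:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def diarize(audio, sample_rate, segmenter, embedder, distance_threshold=1.0):
    """Return (start_sec, end_sec, speaker_id) tuples for a mono signal.

    segmenter: callable(audio, sample_rate) -> [(start_sec, end_sec), ...]
    embedder:  callable(audio, start_sec, end_sec, sample_rate) -> 1-D vector
    Both are stand-ins for the first two components of the pipeline.
    """
    # 1. Speech segmentation: short, ideally single-speaker sections,
    #    with non-speech sections already filtered out.
    segments = segmenter(audio, sample_rate)

    # 2. Audio embedding extraction: one fixed-length vector per section.
    embeddings = np.stack([embedder(audio, s, e, sample_rate) for s, e in segments])

    # 3. Clustering: each cluster of embeddings becomes a speaker ID.
    labels = AgglomerativeClustering(
        n_clusters=None, distance_threshold=distance_threshold
    ).fit_predict(embeddings)

    # 4. Re-segmentation would further refine these results; omitted here.
    return [(s, e, int(lbl)) for (s, e), lbl in zip(segments, labels)]
```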
This research concentrates mainly on the second component, i.e. audio embedding extraction.
For many years, identity-vector (i-vector) based audio embeddings were the dominant approach for speaker diarization. Recently, however, neural-network-based audio embeddings have gained attention; for instance, some authors have used an LSTM network to extract audio embeddings. LSTM networks perform best on sequential data, yet the sequential nature of the audio signal has limited impact on the features that identify the unique characteristics of a speaker, so LSTMs may not be the best type of neural network for this problem. Siamese networks, on the other hand, have shown great potential in finding similarities and relationships between two comparable objects, with face recognition
being a good example. This work concentrates on this class of neural networks and proposes a modified Siamese network for the audio embedding extraction problem. Our modification incorporates an auxiliary loss to ensure the convergence of the model; in addition, slow cooling was implemented to facilitate exploration of the search space.
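To make the idea concrete, below is a minimal sketch of a Siamese embedding network trained with a contrastive loss plus an auxiliary loss, written in PyTorch. The layer sizes, the choice of a speaker-classification head as the auxiliary task, the margin and the loss weight are all assumptions made for illustration, and the slow-cooling schedule is not shown; this is not the exact model proposed in this work:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SiameseEmbedder(nn.Module):
    """Shared encoder that maps a segment feature vector to an embedding.

    The layer sizes and the auxiliary speaker-classification head are
    illustrative assumptions, not the architecture used in this work.
    """

    def __init__(self, n_features=40, emb_dim=128, n_train_speakers=1000):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, emb_dim),
        )
        # Auxiliary head: predict which training speaker produced the segment.
        self.aux_head = nn.Linear(emb_dim, n_train_speakers)

    def forward(self, x):
        emb = F.normalize(self.encoder(x), dim=-1)  # unit-length embedding
        return emb, self.aux_head(emb)


def siamese_loss(model, x1, x2, same_speaker, spk1, spk2,
                 margin=1.0, aux_weight=0.1):
    """Contrastive loss on a pair of segments plus an auxiliary loss."""
    e1, logits1 = model(x1)
    e2, logits2 = model(x2)
    dist = F.pairwise_distance(e1, e2)
    # Pull same-speaker pairs together, push different-speaker pairs apart.
    contrastive = (same_speaker * dist.pow(2) +
                   (1.0 - same_speaker) * F.relu(margin - dist).pow(2)).mean()
    # Auxiliary speaker-classification loss, assumed here to aid convergence.
    aux = F.cross_entropy(logits1, spk1) + F.cross_entropy(logits2, spk2)
    return contrastive + aux_weight * aux
```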
To train the model, we used a large number of audio recordings with and without background noise (babble, music, street, car and white noise). The training dataset contains voices split equally between male and female speakers and covers 27 common languages. We then tested the model on a test dataset of two-minute recordings of either one or two speakers speaking the same language, with or without the same accent. The accuracy achieved on this test data is 98.69%.
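As an illustration of how a background-noise clip could be mixed into a clean recording for training, here is a simple sketch that adds noise at a target signal-to-noise ratio. The mixing scheme and the example SNR value are assumptions; the actual augmentation procedure used to build our training data is not shown here:

```python
import numpy as np


def add_noise(speech, noise, snr_db):
    """Mix a noise recording into a speech recording at a target SNR (dB).

    Both inputs are 1-D float arrays at the same sample rate; the noise is
    tiled or truncated to match the length of the speech.
    """
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[:len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10 * log10(speech_power / noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise


# Example: augment a one-second stand-in utterance with babble-like noise at 10 dB SNR.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)   # placeholder for a clean utterance
babble = rng.standard_normal(8000)   # placeholder for a babble-noise clip
noisy = add_noise(clean, babble, snr_db=10.0)
```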
Below is a short video of the performance of the trained model on three cases of one-minute recordings of two speakers, with and without background noise. The speech segmentation was done using our previous work, and no re-segmentation method was applied to refine the clustering results:
It should be noted that the most difficult situation occurs when both speakers are of the same gender and have the same accent. As can be seen in the video, the model performs well in all of these cases. It is also worth mentioning that the voices of these speakers were not used to train the model.
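To illustrate the clustering step that produces the speaker labels shown in the video (the third component of the pipeline), the sketch below clusters pre-extracted segment embeddings and decides between one and two speakers. The k-means/silhouette criterion and the 0.1 threshold are assumptions made for illustration, not the method used in this work:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def label_speakers(embeddings, max_speakers=2, silhouette_threshold=0.1):
    """Assign a speaker label to each segment embedding.

    Starts from the single-speaker hypothesis and only splits into more
    speakers if the clustering is sufficiently well separated.
    """
    best_labels = np.zeros(len(embeddings), dtype=int)  # default: one speaker
    best_score = silhouette_threshold                   # assumed threshold
    for k in range(2, max_speakers + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
        score = silhouette_score(embeddings, labels, metric="cosine")
        if score > best_score:
            best_score, best_labels = score, labels
    return best_labels
```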