multi-speaker-recognition

Multi-Speaker Recognition and Speaker-Specific Question Answering System

Team CryogenX:

Introduction:

In this project, we developed a text-independent multi-speaker recognition system and a speaker-specific question answering system. In real-world scenarios, we interact with voice assistants that currently respond to queries that are either speaker independent or related to the person who is logged into the system. We aim to make such systems more useful by adding the capability to recognize the speaker and respond to queries based on who is speaking. The main approaches in the area of speaker recognition include template matching, nearest neighbor, vector quantization, frequency estimation, hidden Markov models, Gaussian mixture models, pattern matching algorithms, neural networks, decision trees, Support Vector Machines (SVMs), etc.

State-of-the-art speaker classification and verification systems use neural networks to achieve a 100% classification rate and an Equal Error Rate below 6%, using only about 1 second and 5 seconds of data (single speaker per data file), respectively [1].

In our system, we trained a neural network classifier to work with multiple concurrent speakers while providing a limited-domain, speaker-specific question answering system.

What it does:

The Multi-Speaker Recognition and Limited-Domain Question Answering system is an application where people can ask questions and get answers specific to their own knowledge domain.

Suppose three people, A, B, and C, are enrolled in the system (A is a professor, B is a football player, and C is a student). Using voice input, "B" asks our system "What is my schedule for today?", and the system responds "Hello B, your schedule for today is football practice at 5pm".

Later, "A" asks the same system "What is my schedule for today?", and the system responds "Hello A, your schedule for today is grade midterm exam". So our system successfully distinguishes between all the enrolled users and interacts with them based on their specific domain knowledge.
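At its core, the limited-domain QA component behaves like a lookup keyed by the recognized speaker's ID. The sketch below is only a minimal illustration of that idea; the speaker IDs, questions, and answers are hypothetical examples, not our actual enrollment data.

```python
# Minimal sketch of a speaker-keyed, limited-domain QA lookup.
# Speaker IDs, questions, and answers here are hypothetical examples.
KNOWLEDGE_BASE = {
    "speaker_A": {"what is my schedule for today": "Hello A, your schedule for today is grade midterm exam."},
    "speaker_B": {"what is my schedule for today": "Hello B, your schedule for today is football practice at 5pm."},
}

def answer(speaker_id: str, question: str) -> str:
    """Return the answer for this speaker, or a fallback if unknown."""
    entries = KNOWLEDGE_BASE.get(speaker_id, {})
    key = question.strip().lower().rstrip("?")
    return entries.get(key, "Sorry, I don't have an answer for that.")

print(answer("speaker_B", "What is my schedule for today?"))
```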

The enrollment process is as simple as recording your voice by reading a paragraph shown on the UI and answering a few limited-domain questions.

UI/UX:

Main Screen:

(Main Screen screenshot)

Enrollment Screen:

(Enrollment Screen screenshot)

How we built it:

Our system has a speaker classification deep neural network model hosted on the cloud. This model is trained using the data described below. An enrollment module lets a new speaker enroll in the system, after which the neural network is fine-tuned accordingly. The client is desktop-oriented.

Initially, we trained the base model on the TIMIT [2] corpus at an 8 kHz sampling rate. Only the first 200 male speakers from the "train" folder were used to train and test the classifier. After obtaining a satisfactory classifier for speaker recognition on the TIMIT corpus, we fine-tuned the model on our own data, which consists of short audio files (2 to 5 sentences each) from at least the three members of the team.
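For illustration, a minimal sketch of this data preparation step is shown below. It assumes librosa is available and that the TIMIT recordings have already been converted from NIST SPHERE to standard WAV files; the paths and speaker selection are placeholders, not our exact procedure.

```python
import glob
import os

import librosa

TARGET_SR = 8000  # 8 kHz sampling rate used for the base model

def load_utterances(speaker_dir, target_sr=TARGET_SR):
    """Load all WAV utterances of one speaker, resampled to target_sr and converted to mono."""
    signals = []
    for wav_path in sorted(glob.glob(os.path.join(speaker_dir, "*.wav"))):
        y, _ = librosa.load(wav_path, sr=target_sr, mono=True)  # resample + float32 mono
        signals.append(y)
    return signals

# Hypothetical usage: TIMIT is organized as train/<dialect>/<speaker>/<utterance>.wav
speaker_dirs = sorted(glob.glob("TIMIT/train/*/*"))             # placeholder path
train_audio = {os.path.basename(d): load_utterances(d) for d in speaker_dirs[:200]}
```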

For classification, we record audio of multiple speakers talking. We then extract MFCC (mel-frequency cepstral coefficient) features and feed the data into the model for classification. The model returns a speaker ID for each segment of speech, and this ID is passed as a token to the QA system via the client. The limited-domain QA system processes the query and returns an answer appropriate to the speaker.
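A rough sketch of this feature extraction and prediction step is shown below, assuming librosa for MFCC extraction and a scikit-learn-style classifier with a predict method; the per-frame majority vote is illustrative and not necessarily our exact aggregation scheme.

```python
import librosa
import numpy as np

def extract_mfcc_frames(y, sr=8000, n_mfcc=13):
    """Extract per-frame MFCC feature vectors from a mono audio signal."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, n_frames)
    return mfcc.T                                           # shape: (n_frames, n_mfcc)

def predict_speaker_id(model, y, sr=8000):
    """Classify each frame, then take a majority vote over the segment."""
    frames = extract_mfcc_frames(y, sr)
    frame_predictions = model.predict(frames)               # one speaker ID per frame
    values, counts = np.unique(frame_predictions, return_counts=True)
    return values[np.argmax(counts)]

# speaker_id = predict_speaker_id(trained_model, segment_audio, sr=8000)
# The client then sends speaker_id as a token to the limited-domain QA service.
```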

Architecture:

(Architecture diagram)

Timeline:

Detailed steps:

Experiments and Results:

The detailed results of some of our experiments for Speaker Classification are as follows:

MATLAB Experiments (using Rasmussen's conjugate gradient algorithm for training):

Python Experiments (using stochastic gradient descent for training and a 16 kHz audio sampling rate):

Discussion on results:

Using WebRTC VAD for preprocessing the data, the best results (98.5% file-level accuracy on the test data) were obtained with alpha=1e-4 and an adaptive learning rate with learning_rate_init=0.01.
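For reference, a minimal sketch of the WebRTC VAD preprocessing step, assuming the py-webrtcvad package and 16-bit mono PCM input at the 16 kHz rate used in the Python experiments; the frame length and aggressiveness level are illustrative choices:

```python
import webrtcvad

def keep_voiced_frames(pcm_bytes, sample_rate=16000, frame_ms=30, aggressiveness=2):
    """Drop non-speech frames from 16-bit mono PCM audio using WebRTC VAD."""
    vad = webrtcvad.Vad(aggressiveness)                        # 0 (least) to 3 (most aggressive)
    bytes_per_frame = int(sample_rate * frame_ms / 1000) * 2   # 2 bytes per 16-bit sample
    voiced = bytearray()
    for start in range(0, len(pcm_bytes) - bytes_per_frame + 1, bytes_per_frame):
        frame = pcm_bytes[start:start + bytes_per_frame]
        if vad.is_speech(frame, sample_rate):                  # keep only frames flagged as speech
            voiced.extend(frame)
    return bytes(voiced)
```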

Similarly, when using the VAD by Theodoros Giannakopoulos, the best results (92.5% file-level accuracy on the test data) were obtained with alpha=0.5 and an adaptive learning rate with learning_rate_init=0.05.
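The alpha and learning-rate settings above match the hyperparameter names of scikit-learn's MLPClassifier. Assuming that is the classifier behind the Python experiments, the best-performing WebRTC-VAD configuration would be set up roughly as follows (the hidden layer size is a placeholder, not a reported value):

```python
from sklearn.neural_network import MLPClassifier

# Hedged sketch: hyperparameters from the best WebRTC VAD run above;
# hidden_layer_sizes is a placeholder, not a value reported in the experiments.
clf = MLPClassifier(
    solver="sgd",               # stochastic gradient descent, as in the Python experiments
    alpha=1e-4,                 # L2 regularization strength
    learning_rate="adaptive",   # adaptive learning-rate schedule
    learning_rate_init=0.01,    # initial learning rate
    hidden_layer_sizes=(256,),  # placeholder network size
    max_iter=500,
    random_state=0,
)
# clf.fit(train_features, train_speaker_ids)
# accuracy = clf.score(test_features, test_speaker_ids)
```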

Challenges we ran into:

What we learned:

Conclusion:

We achieved 92.5% accuracy for multi-speaker recognition with our system. This was achieved by combining the preprocessing components (VAD, normalization, feature extraction) with the neural network. There is room for improvement: experimenting with and applying better deep learning algorithms could further improve accuracy. Also, the enrollment process currently handles one person at a time; in the future, we could enroll multiple speakers at once. The main objective of the project, to classify multiple speakers in real time and answer speaker-specific questions, was achieved.

References: