With the evolution of the concept of Speaker diarization using LSTM, it is relatively easier to understand the speaker identities for specific segments of input audio stream data than manually tagging the data. With such a concept, it is highly desirable to consider the possibility of using the identified speaker identities to aid in recognizing the Speaker States in a conversation. In this study, the Markov Chains are used to identify and update the Speaker States for the next conversations between the same set of speakers, to enable identification of their states in the most natural and long conversations. The model is based on several audio samples from natural conversations of three or greater than three speakers in two datasets with overall total error percentages for recognized states being lesser than or equal to 12%. The findings imply that the proposed extension to the Speaker diarization is effective to predict the states for a conversation.
View on arXiv