To improve the performance of microphone array-based sound source localization (SSL), a robust SSL algorithm using a convolutional neural network (CNN) is proposed in this paper. The Gammatone sub-band steered response power-phase transform (SRP-PHAT) spatial spectrum is adopted as the localization cue because of the feature correlation between consecutive sub-bands. Since the CNN has the weight-sharing property and is well suited to processing tensor data, it is adopted to extract spatial location information from the localization cues. The Gammatone sub-band SRP-PHAT spatial spectrum is calculated from the microphone signals decomposed in the frequency domain by a Gammatone filter bank. The proposed algorithm takes as CNN input a two-dimensional feature matrix assembled from the Gammatone sub-band SRP-PHAT spatial spectra within a frame. Taking advantage of the powerful modeling capability of the CNN, two-dimensional feature matrices from diverse environments are used together to train the CNN model, which captures the mapping between the feature matrix and the azimuth of the sound source. The azimuth of a test signal is then predicted by the trained CNN model. Experimental results show the superiority of the proposed algorithm for the SSL problem: it achieves significantly improved localization performance and exhibits robustness and generality in various acoustic environments.

The aim of microphone array-based sound source localization (SSL) is to determine the location of a sound source by applying a series of signal processing operations to the multichannel received signals. It plays an important role in numerous application fields, including speech enhancement, speech recognition, human-computer interaction, autonomous robots, smart home monitoring systems, etc. [

Over the past decades, many microphone array-based SSL approaches have been presented. In general, the traditional approaches for SSL can be divided into two categories [

With the development of artificial neural networks (ANNs), the use of deep learning for the SSL task has been proposed in recent years, and it can be performed in two ways. The first way is to apply deep learning techniques within traditional methods, while the second way treats SSL as a classification task and uses deep learning to map the input features to the azimuth. For the first way, the related research is as follows. Wang et al. [

The second way of applying deep learning to the SSL task has been more widely studied, and these approaches involve a variety of input feature types, such as the inter-aural level difference (ILD), inter-aural phase difference (IPD), cross-correlation function (CCF), generalized cross-correlation (GCC), and so on. The related research is as follows. An SSL approach based on a CNN with multitask learning has been proposed in [

In this paper, we focus on SSL in the far field and propose a novel robust SSL approach. As our previous work described in [

The rest of the paper is organized as follows. Section 2 describes the proposed CNN-based SSL algorithm, including the system overview, feature extraction, the CNN architecture, and the training of the CNN. The simulation results and analysis are presented in Section 3. The conclusions follow in Section 4.

The proposed algorithm treats the sound source localization problem as a multi-class classification task and uses a CNN model to construct the mapping between the spatial feature matrix and the azimuth of the sound source.

The physical model for the signal received by the m-th microphone can be expressed as:

x_m(t) = h_m(t) * s(t) + n_m(t)

where x_m(t) denotes the signal received by the m-th microphone, s(t) denotes the source signal, h_m(t) denotes the room impulse response from the source to the m-th microphone, * denotes convolution, and n_m(t) denotes the additive noise at the m-th microphone.

As described in our previous work [

R_{mn}(τ) = (1/2π) ∫ Ψ_{mn}(ω) X_m(ω) X_n^*(ω) e^{jωτ} dω

where R_{mn}(τ) denotes the GCC-PHAT function of the microphone pair (m, n), τ denotes the time lag, Ψ_{mn}(ω) = 1/|X_m(ω) X_n^*(ω)| denotes the PHAT weighting, X_m(ω) denotes the Fourier transform of the microphone signal x_m(t), and (·)^* denotes complex conjugation.
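As an illustration of the GCC-PHAT computation, the following is a minimal NumPy sketch (the function name, the FFT length, and the small regularizing constant are our own choices, not from the paper):

```python
import numpy as np

def gcc_phat(x_m, x_n, fs, max_tau=None):
    """Delay of x_n relative to x_m (positive if x_n lags x_m), via GCC-PHAT."""
    n = len(x_m) + len(x_n)
    X_m = np.fft.rfft(x_m, n=n)
    X_n = np.fft.rfft(x_n, n=n)
    cross = X_n * np.conj(X_m)
    cross /= np.abs(cross) + 1e-12        # PHAT weighting: keep phase only
    r = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # rearrange so that index max_shift corresponds to zero lag
    r = np.concatenate((r[-max_shift:], r[:max_shift + 1]))
    return (np.argmax(np.abs(r)) - max_shift) / fs
```

The PHAT weighting discards the magnitude spectrum and keeps only phase information, which sharpens the correlation peak under reverberation.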

The Gammatone filter bank, whose filters have different central frequencies and bandwidths, is used to simulate the time-frequency analysis of acoustic signals performed by the human auditory system. The impulse response of the i-th Gammatone filter is defined as:

g_i(t) = A t^{N-1} e^{-2π b_i t} cos(2π f_i t + φ) u(t)

where A denotes the amplitude, N denotes the filter order, b_i denotes the decay coefficient, f_i denotes the central frequency of the i-th filter, φ denotes the initial phase, and u(t) denotes the unit step function.
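To make the filter concrete, here is a minimal NumPy sketch of a 4th-order Gammatone impulse response; the ERB-based choice of the decay coefficient, b_i = 1.019·ERB(f_i), is a common convention assumed here rather than taken from the paper:

```python
import numpy as np

def gammatone_ir(fc, fs, order=4, duration=0.016):
    """Impulse response of a Gammatone filter centred at fc (Hz)."""
    t = np.arange(int(duration * fs)) / fs
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)  # equivalent rectangular bandwidth
    b = 1.019 * erb                          # decay coefficient (common choice)
    g = t ** (order - 1) * np.exp(-2.0 * np.pi * b * t) * np.cos(2.0 * np.pi * fc * t)
    return g / np.max(np.abs(g))             # peak-normalised
```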

The feature parameter extracted from the array signals is the basis of sound source localization. Considering the spatial location information contained in the SRP-PHAT spatial spectrum and the time-frequency analysis capability of the Gammatone filter, the SRP-PHAT spatial spectrum in each band of Gammatone filter bank is exploited as the feature for sound source localization in this paper.

The microphone signals are decomposed into consecutive sub-bands in the frequency domain by the Gammatone filter bank. The central frequencies of the Gammatone filters range from 100 to 8000 Hz on the equivalent rectangular bandwidth (ERB) scale. The SRP-PHAT function of a Gammatone sub-band is defined as:
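The ERB-scale spacing of the central frequencies can be sketched as follows; the Glasberg–Moore ERB-rate formula and the filter count of 32 (implied by the 32 × 72 feature matrix dimension) are our assumptions:

```python
import numpy as np

def erb_space(f_low, f_high, num):
    # Centre frequencies equally spaced on the ERB-rate scale
    # E(f) = 21.4 * log10(4.37e-3 * f + 1)  (Glasberg & Moore convention)
    e_low = 21.4 * np.log10(4.37e-3 * f_low + 1.0)
    e_high = 21.4 * np.log10(4.37e-3 * f_high + 1.0)
    e = np.linspace(e_low, e_high, num)
    return (10.0 ** (e / 21.4) - 1.0) / 4.37e-3

fc = erb_space(100.0, 8000.0, 32)   # 32 sub-bands from 100 Hz to 8 kHz
```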

P_i(q) = Σ_{m=1}^{M-1} Σ_{n=m+1}^{M} ∫ Ψ_{mn}(ω) X_{i,m}(ω) X_{i,n}^*(ω) e^{jω τ_{mn}(q)} dω

where P_i(q) denotes the SRP-PHAT value of the i-th sub-band at steering position q, X_{i,m}(ω) = G_i(ω) X_m(ω) denotes the i-th sub-band component of the m-th microphone signal, G_i(ω) denotes the frequency response of the i-th Gammatone filter, Ψ_{mn}(ω) denotes the PHAT weighting, and τ_{mn}(q) denotes the time difference of arrival between microphones m and n for steering position q.
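A minimal NumPy sketch of the sub-band SRP-PHAT computation over a grid of candidate steering positions (the variable names and the TDOA sign convention are ours):

```python
import numpy as np

def srp_phat_subband(X, tdoa, omega):
    """Sub-band SRP-PHAT spatial spectrum (a minimal sketch).

    X     : (M, F) complex sub-band spectra of the M microphones
    tdoa  : (Q, M, M) candidate TDOAs in seconds; tdoa[q, m, n] is the
            expected delay of mic n relative to mic m at position q
    omega : (F,) angular frequencies of the F bins
    Returns the (Q,) spatial spectrum, one value per steering position.
    """
    M = X.shape[0]
    P = np.zeros(tdoa.shape[0])
    for m in range(M - 1):
        for n in range(m + 1, M):
            cross = X[m] * np.conj(X[n])
            cross = cross / (np.abs(cross) + 1e-12)   # PHAT weighting
            # steer every candidate position at once: (Q, F) phase terms
            steer = np.exp(-1j * np.outer(tdoa[:, m, n], omega))
            P += np.real(steer @ cross)
    return P
```

The position whose candidate TDOAs best align the phases of all microphone pairs yields the largest P(q).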

The microphone signals are divided into frames of 32 ms without frame shift. The sub-band SRP feature matrix of each frame is then assembled as

P = [P_i(q_l)],  i = 1, …, 32,  l = 1, …, 72

where P_i(q_l) denotes the SRP-PHAT value of the i-th sub-band at the steering position q_l. In this paper, q_l is simplified to the azimuth, with a distance of 1.5 m from the steering position to the microphone array, and the azimuth ranges from 0° to 360° with a step of 5°, corresponding to 72 steering positions. Thus the dimension of the SRP-PHAT feature matrix is 32 × 72.
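The 32 ms, no-overlap framing can be sketched as follows (512 samples per frame at the 16 kHz sampling rate used in the experiments):

```python
import numpy as np

def split_frames(x, fs, frame_ms=32):
    """Split a signal into non-overlapping frames of frame_ms milliseconds."""
    n = int(fs * frame_ms / 1000)       # 512 samples at 16 kHz
    num = len(x) // n                   # drop the trailing partial frame
    return x[:num * n].reshape(num, n)
```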

A CNN is introduced and trained on the set of SRP-PHAT feature matrices constructed in Section 2.2. To improve the robustness and generality of the model, training signals with known azimuth information from diverse environments are used together to train the CNN model. The training azimuth ranges from 0° to 360° with a step of 10°, corresponding to 36 training positions.

As depicted in

The training of the CNN includes a forward propagation process and a back propagation process. Forward propagation is the process of transferring features layer by layer; in it, the output of the network under the current model parameters is calculated for the input signal. The output of the d-th convolutional layer is calculated as:

a^d = f(W^d * a^{d-1} + b^d)

where a^d denotes the output of the d-th layer, W^d denotes the weight of the convolution kernel in the d-th layer, b^d denotes the bias of the d-th layer, * denotes the convolution operation, and f(·) denotes the activation function.
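The convolutional layer's forward computation can be illustrated with a naive single-channel sketch; like most CNN frameworks, it computes cross-correlation rather than flipped convolution, and ReLU is assumed as the activation:

```python
import numpy as np

def conv2d_forward(a, w, b):
    """'Valid' 2-D cross-correlation of one feature map, followed by ReLU."""
    H, W = a.shape
    kh, kw = w.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(a[i:i + kh, j:j + kw] * w) + b
    return np.maximum(out, 0.0)   # ReLU activation
```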

The output of the

The expression of the output layer is as follows:

y^D = softmax(W^D a^{D-1} + b^D)

where W^D and b^D denote the weight and bias of the fully connected layer respectively, and y^D is a vector with size of 36, whose elements correspond to the probabilities of the 36 candidate azimuth classes.

The cross-entropy loss function is defined as:

E = − Σ_k t_k ln y_k^D

where the subscript k indexes the output classes, t_k denotes the k-th element of the one-hot target label, and y_k^D denotes the k-th element of the network output y^D.
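A minimal sketch of the softmax output and the cross-entropy loss over the 36 azimuth classes (the small offset inside the logarithm is our numerical-safety addition):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(y, t):
    # t is a one-hot target vector over the azimuth classes
    return -np.sum(t * np.log(y + 1e-12))
```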

The stochastic gradient descent with momentum (SGDM) algorithm is adopted to minimize the loss function. The momentum is set to 0.9, the L2 regularization coefficient to 0.0001, the mini-batch size to 200, and the initial learning rate to 0.01. The learning rate is multiplied by a factor of 0.2 every 6 epochs.
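The stated optimizer settings can be sketched as a plain SGDM update with L2 regularization and a piecewise learning-rate schedule (a simplified single-parameter illustration, not the paper's actual implementation):

```python
import numpy as np

def sgdm_step(w, v, grad, lr, momentum=0.9, l2=1e-4):
    """One SGDM update with L2 weight decay; returns (new weights, new velocity)."""
    v = momentum * v - lr * (grad + l2 * w)
    return w + v, v

def lr_schedule(epoch, lr0=0.01, drop=0.2, every=6):
    # learning rate multiplied by 0.2 every 6 epochs
    return lr0 * drop ** (epoch // every)
```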

Over-fitting often occurs during the construction of complex network models, so cross validation and dropout are utilized to prevent over-fitting in the training phase. The training data is randomly divided into a training set and a validation set at a ratio of 7:3 for cross validation. Dropout is applied in the fully connected layer, with the dropout ratio set to 0.5.

Simulation experiments are conducted to evaluate the performance of the proposed algorithm. The dimensions of the simulated room are 7 m × 7 m × 3 m. A uniform circular array with a radius of 10 cm is located at (3.5 m, 3.5 m, 1.6 m) in the room. The array consists of six omnidirectional microphones. Clean speech signals sampled at 16 kHz, taken randomly from the TIMIT database, are adopted as the sound source signals. The Image method [

The source is placed in the far field, and the azimuth ranges from 0° to 360° with a step of 10°, corresponding to 36 training positions. During the training phase, the SNR is varied from 0 to 20 dB with a step of 5 dB, and the reverberation time T60 is set to two levels, 0.5 and 0.8 s. The microphone array signals from the different reverberation and noise environments are taken together as training data to enhance the generalization ability of the CNN model.

The localization performance is measured by the percentage of correct estimates, which is defined as follows:

P_correct = (N_c / N_all) × 100%

where N_all is the total number of testing frames, N_c is the number of correctly estimated frames, and an estimate is counted as correct when the estimated azimuth equals the true azimuth. The performance of the proposed algorithm is compared with two related algorithms, namely the SRP-PHAT [
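The metric is straightforward to compute; a trivial sketch:

```python
import numpy as np

def correct_percentage(est, true):
    """Percentage of frames whose estimated azimuth equals the true azimuth."""
    est, true = np.asarray(est), np.asarray(true)
    return 100.0 * np.sum(est == true) / len(true)
```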

In this section, the localization performance is investigated and analyzed when the test signals are generated with the same settings as the training signals.

From

From

In this section, we investigate the robustness and generality of the proposed algorithm in untrained environments. For the testing signals, the untrained SNR is varied from −2 to 18 dB with a step of 5 dB, and the untrained reverberation time T60 is set to two levels, 0.6 and 0.9 s.

As shown in

In this work, a robust microphone array-based SSL algorithm using a convolutional neural network has been presented. Considering the feature correlation between consecutive sub-bands, the sub-band SRP-PHAT spatial spectrum based on the Gammatone filter bank is exploited as the localization feature in the proposed algorithm. The CNN is adopted to establish the mapping relationship between the spatial feature matrix and the azimuth of the sound source owing to its advantage in processing tensor data. Experimental results show that the proposed algorithm provides better localization performance in both trained and untrained environments, especially under high reverberation, and achieves superior robustness and generality.