Listening at the Cocktail Party with Deep Neural Networks and TensorFlow

Most people are remarkably good at focusing their attention on one person or one voice in a multi-speaker scenario and ‘muting’ other people and background noise. This is known as the cocktail party effect. For others, separating audio sources is a real challenge.

In this presentation I will focus on solving this problem with deep neural networks and TensorFlow. I will share technical and implementation details with the audience, and discuss the gains, pain points, and merits of the solutions as they relate to:

  • Preparing, transforming and augmenting relevant data for speech separation and noise removal.
  • Creating, training and optimizing various neural network architectures.
  • Hardware options for running networks on tiny devices.
  • And the end goal: real-time speech separation on a small embedded platform.

I will present a vision of future smart earbuds, smart headsets and smart hearing aids running deep neural networks.

Participants will gain insight into some of the latest advances and limitations in speech separation with deep neural networks on embedded devices with regard to:

  • Data transformation and augmentation.
  • Deep neural network models for speech separation and for removing noise.
  • Training smaller and faster neural networks.
  • Creating a real-time speech separation pipeline.

1. WiFi SSID: Spark+AISummit | Password: UnifiedDataAnalytics

2. Listening at the Cocktail Party with Deep Neural Networks and TensorFlow
Christian Grant, tcpip001@gmail.com
#UnifiedDataAnalytics #SparkAISummit

3. Agenda
  • The cocktail party problem
  • Solving the cocktail party problem with deep neural networks (DNNs)
  • Future vision and use cases
  • Barriers to adopting deep neural networks at the edge
  • Overcoming infrastructure, dataset, AI chip, and edge AI software barriers

4. Vision
Real-time speech separation at the edge

5. Problem

6. The Cocktail Party Problem
When multiple people are speaking at the same time
  • at a restaurant
  • at an airport
  • or at a cocktail party
tuning in to one speaker is relatively easy for individuals with no hearing impairment. Individuals with hearing impairment have difficulty understanding speech in the presence of competing voices.

7. Problem - Mixed Audio
M1: It was a great Halloween party
F1: The nest was built with small twigs

8. Solution – Separated Tracks
M1: It was a great Halloween party
F1: The nest was built with small twigs

9. Speech Separation Approach
(Pipeline diagram)
  • Speaker 1 and speaker 2 sources are mixed into the input audio
  • STFT (short-time Fourier transform) converts the mixture into a spectrogram
  • A deep neural network (weights + model) predicts the speaker 1 mask
  • ISTFT converts the masked spectrograms back into the speaker 1 and speaker 2 predictions


15. (Diagram of the end-to-end ML workflow: publications, tool & platform selection, partner collaboration, configuration, monitoring, evaluation, data verification, ML training code, process management, inference, transformations + data collection, feature extraction, machine tasks, resource management.)

16. Tasks
Platform:
  • Deep learning virtual machine
  • Real-time prediction platform
  • Demo platform
  • Tiny platform
Data:
  • Generalized data set
  • Noise data
Transformation:
  • STFT
  • ISTFT
  • Spectrogram
Code:
  • Theano to Keras + TF
  • Keras + TF to tf.keras
  • Estimator API
  • User-friendly code
Training:
  • HINT dataset
  • Training lots of models
Evaluation:
  • Lab listening tests
  • Metric: signal-to-distortion ratio
  • Predict 1000s of examples for lots of models
Predictions:
  • Prediction pipeline
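The evaluation metric named on this slide, the signal-to-distortion ratio, can be computed in its simplest SNR-style form as below. This is a sketch: the BSS Eval variant commonly used in the separation literature additionally projects the estimate onto allowed distortions of the reference, which this simple version omits.

```python
import numpy as np

def sdr(reference, estimate):
    """Simple signal-to-distortion ratio in dB (higher is better)."""
    noise = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

ref = np.sin(np.linspace(0.0, 100.0, 16000))
good = sdr(ref, ref + 0.001)   # small distortion -> high SDR
bad = sdr(ref, ref + 0.1)      # 100x larger distortion -> 40 dB lower SDR
```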

17. Platform & Tool Selection

18. Tool Selection
Keras on Theano:
  • Keras
  • Easy to convert
  • No further development (Theano)
  • GPU
Keras on TensorFlow:
  • Keras
  • Very easy to convert
  • Google
  • Large ecosystem
  • TensorFlow Lite
  • GPU
TensorFlow Keras API:
  • Distributed and local
  • Keras models
  • Google
  • Large ecosystem
  • TensorFlow Lite
  • TensorFlow Extended
  • Production ready


22. Data

23. Data
  • 6 speakers: 3 males + 3 females
  • 1560 sentences / files: 13 lists × 20 sentences × 6 speakers
  • 200 KB to 312 MB: 2–3 seconds per sentence
  • 260 examples per speaker: ~10 minutes of speech per speaker
  • 44.1 kHz sampling rate, 16 bits per sample
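Training examples for speech separation are typically created by mixing clean recordings of two speakers, which also serves as data augmentation (one clean set yields many mixtures). A minimal sketch of mixing at a chosen target-to-interferer ratio follows; the function name `mix_at_snr` and its parameters are illustrative, not from the talk.

```python
import numpy as np

def mix_at_snr(target, interferer, snr_db):
    """Scale the interfering speaker so the mixture has the requested
    target-to-interferer ratio (in dB), then add the two signals."""
    p_target = np.mean(target ** 2)
    p_interf = np.mean(interferer ** 2)
    scale = np.sqrt(p_target / (p_interf * 10.0 ** (snr_db / 10.0)))
    return target + scale * interferer

# White noise stands in for two clean speaker recordings (1 s at 44.1 kHz)
rng = np.random.default_rng(0)
speaker_a = rng.standard_normal(44100)
speaker_b = rng.standard_normal(44100)
mixture = mix_at_snr(speaker_a, speaker_b, snr_db=0.0)
```

Sweeping `snr_db` over a range (e.g. -5 to +5 dB) is a common way to multiply the number of distinct training mixtures.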

24. Feature Engineering

25. Short-time Fourier Transform

    import librosa

    # Load audio file (mono, native sample rate, first 2 seconds)
    wav1, sr1 = librosa.load('voice.wav', sr=None, mono=True, duration=2)

    # Short-time Fourier transform
    stft1 = librosa.stft(wav1)

26. Models

27.

28. Fully Connected Neural Network

    import tensorflow as tf
    from tensorflow.keras.layers import Input, Dense, Activation, BatchNormalization, Dropout
    from tensorflow.keras.models import Model

    int_in = int(in_dim[0])                  # 1032 input features
    inputs = Input(shape=(int_in,))
    x = inputs
    for i in range(n_hidden_layers):         # 4 hidden layers
        x = Dense(units=1024)(x)
        x = Activation('sigmoid')(x)
        x = BatchNormalization()(x)
        x = Dropout(dropout_val)(x)
    int_out = int(op_dim[-1])                # 129 output bins
    final_output = Dense(int_out, activation='sigmoid')(x)
    model = Model(inputs, final_output)

    AO = tf.keras.optimizers.Adam(lr=lr, beta_1=beta_1, beta_2=beta_2, epsilon=epsilon)
    loss_func = 'mse'
    model.compile(loss=loss_func, optimizer=AO)
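The sigmoid output layer and MSE loss above imply a regression target in [0, 1] per time-frequency bin. A common choice for such a target (an assumption here, not stated on the slide) is the ideal ratio mask computed from the clean sources:

```python
import numpy as np

def ideal_ratio_mask(mag1, mag2, eps=1e-8):
    """Fraction of each time-frequency bin's magnitude belonging to source 1.
    Values lie in [0, 1], matching a sigmoid output layer; eps avoids
    division by zero in silent bins."""
    return mag1 / (mag1 + mag2 + eps)

mag1 = np.array([[3.0, 0.0], [1.0, 1.0]])   # |STFT| of clean speaker 1
mag2 = np.array([[1.0, 2.0], [1.0, 0.0]])   # |STFT| of clean speaker 2
mask = ideal_ratio_mask(mag1, mag2)
```

At inference time the network's predicted mask is multiplied with the mixture spectrogram before the ISTFT, as in the pipeline slides.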

29. Training