05_Speech Recognition
1. Ch. 5: Speech Recognition. An example of a speech recognition system using dynamic programming. (Speech recognition techniques, v.8b)
2. Overview: (A) Recognition procedures; (B) Dynamic programming.
3. (A) Recognition procedures: preprocessing for recognition (endpoint detection, pre-emphasis, frame blocking and windowing); distortion measure methods; comparison methods (vector quantization, dynamic programming, hidden Markov model).
4. LPC processor for a 10-word isolated speech recognition system. Steps: (1) endpoint detection; (2) pre-emphasis (high-pass filtering); (3a) frame blocking and (3b) windowing; (4) feature extraction: find cepstral coefficients by LPC (autocorrelation analysis, LPC analysis, then find the cepstral coefficients), or find the cepstral coefficients directly; (5) distortion measure calculations.
5. Example of speech signal analysis. The frame size is 20 ms, adjacent frames are separated by 10 ms, and the signal S is sampled at 44.1 kHz; the whole utterance lasts about 1 second. Calculate n (samples per frame) and m (samples between frame starts). Answer: n = 20 ms / (1/44100 s) = 882, m = 10 ms / (1/44100 s) = 441. (Figure: frames 1 to 5 of signal S, each frame containing n samples, with successive frame starts separated by m samples; each frame yields one set of LPC coefficients, i.e., one code word.)
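The frame/shift calculation above can be sketched in C; the function name `duration_to_samples` is an illustrative assumption, not from the slides.

```c
/* Convert a duration in milliseconds to a sample count at the given
 * sampling rate, as in the worked example above (the function name is
 * an illustrative assumption). */
int duration_to_samples(double duration_ms, double rate_hz)
{
    return (int)(duration_ms / 1000.0 * rate_hz + 0.5);  /* round to nearest */
}
```

With the slide's numbers, `duration_to_samples(20, 44100)` gives n = 882 and `duration_to_samples(10, 44100)` gives m = 441.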
6. Step 1: Get one frame and execute endpoint detection, to determine the start and end points of the speech sound. This is not always easy, since the energy at the start of an utterance is often low. The endpoints are determined by energy and zero-crossing rate. (Figure: recorded signal s(n) with the detected endpoints marked; in our example the utterance is about 1 second long.)
7. A simple endpoint detection algorithm. At the beginning the energy level is low. If the energy level and zero-crossing rate of 3 successive frames are high, it is a starting point. After the starting point, if the energy and zero-crossing rate of 5 successive frames are low, it is the end point.
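The starting-point rule above can be sketched in C as follows; the function name `find_start` and the thresholds `e_th`, `z_th` are illustrative assumptions, since the slides do not give concrete values.

```c
/* Return the index of the first frame that begins a run of 3 successive
 * frames whose energy and zero-crossing rate both exceed the thresholds,
 * or -1 if no such run exists.  A sketch of the slide's starting-point
 * rule; the thresholds are assumptions, not given in the slides. */
int find_start(const float *energy, const float *zcr, int nframes,
               float e_th, float z_th)
{
    int run = 0;
    for (int i = 0; i < nframes; i++) {
        if (energy[i] > e_th && zcr[i] > z_th)
            run++;
        else
            run = 0;
        if (run == 3)
            return i - 2;   /* first frame of the 3-frame run */
    }
    return -1;
}
```

The end point would be found symmetrically, searching after the starting point for 5 successive low-energy, low-zero-crossing frames.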
8. Energy calculation: E(n) = s(n)·s(n). For a frame of size N, a program segment to calculate the energy level:
    for (n = 0; n < N; n++) {
        energy[n] = s[n] * s[n];
    }
9. Energy plot. (Figure: energy of the signal over time.)
10. Zero-crossing calculation. A zero-crossing point is obtained when sign[s(n)] != sign[s(n-1)], i.e., at a change of sign. In the example, s(n) has 6 zero-crossing points (you may exclude n = 0 and n = N-1). (Figure: s(n) with its zero crossings numbered 1 to 6 between n = 0 and n = N-1.)
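The zero-crossing rule above can be written directly in C. Treating a zero sample as positive is one common convention and an assumption here, since the slide does not define sign[0].

```c
/* Count sign changes between successive samples, i.e. positions where
 * sign[s(n)] != sign[s(n-1)] as on the slide.  Zero is treated as
 * positive here (an assumed convention, not specified in the slide). */
int zero_crossings(const int *s, int n)
{
    int count = 0;
    for (int i = 1; i < n; i++) {
        int prev = (s[i-1] >= 0) ? 1 : -1;
        int curr = (s[i]   >= 0) ? 1 : -1;
        if (prev != curr)
            count++;
    }
    return count;
}
```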
11. Step 2: Pre-emphasis (high-pass filtering), to reduce noise, average out transmission conditions, and flatten the signal spectrum. Tutorial: write a program segment to perform pre-emphasis on a speech frame stored in an array int s[1000].
12. Pre-emphasis program segment (input = sig1, output = sig2):
    void pre_emphasize(char *sig1, float *sig2)
    {
        int j;
        sig2[0] = (float)sig1[0];
        for (j = 1; j < WINDOW; j++)
            sig2[j] = (float)sig1[j] - 0.95 * (float)sig1[j-1];
    }
13. Pre-emphasis. (Figure: effect of pre-emphasis on the speech signal.)
14. Step 3(a): Frame blocking. Choose the frame size (N samples), with adjacent frames separated by m samples. E.g., for a signal sampled at 16 kHz, a 10 ms window has N = 160 samples, with m = 40 samples between frame starts. (Figure: windows l = 1 and l = 2, each of length N, with starts separated by m samples.)
15. Step 3(b): Windowing, to smooth out the discontinuities at the beginning and end of each frame. Hamming or Hanning windows can be used. Hamming window: w(n) = 0.54 - 0.46·cos(2πn/(N-1)), for n = 0, ..., N-1. Tutorial: write a program segment to find the result of passing a speech frame, stored in an array int s[1000], through the Hamming window.
16. Effect of the Hamming window. (For the Hanning window, see http://en.wikipedia.org/wiki/Window_function.)
17. Matlab code segment:
    x1 = wavread('violin3.wav');
    for i = 1:N
        hamming_window(i) = abs(0.54 - 0.46*cos(i*(2*pi/N)));
        y1(i) = hamming_window(i)*x1(i);
    end
18. Cepstrum vs spectrum. The spectrum is sensitive to the glottal excitation (E), but we are only interested in the filter H. In the frequency domain, speech wave (X) = excitation (E) · filter (H), so log(X) = log(E) + log(H). The cepstrum is the inverse Fourier transform of the log of the signal's power spectrum. In the cepstrum, the log(E) term can easily be isolated and removed.
19. Step 4a: Autocorrelation analysis. The autocorrelation of every frame (l = 1, 2, ...) of the windowed signal is calculated. If the required output is a pth-order LPC, the autocorrelation for the lth frame is r_l(k) = Σ_{n=0}^{N-1-k} x_l(n)·x_l(n+k), for k = 0, 1, ..., p.
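The autocorrelation computation above can be sketched in C; the function name `autocorr` is an illustrative assumption.

```c
/* Autocorrelation r[k] = sum_{n=0}^{N-1-k} x[n]*x[n+k] for k = 0..p,
 * as in the formula above (function name is an assumption).  For LPC
 * of order p, exactly the values r[0]..r[p] are needed. */
void autocorr(const float *x, int N, float *r, int p)
{
    for (int k = 0; k <= p; k++) {
        float sum = 0.0f;
        for (int n = 0; n < N - k; n++)
            sum += x[n] * x[n + k];
        r[k] = sum;
    }
}
```

The resulting r[0..p] array is exactly the `coeff` input consumed by the Durbin recursion on the next slide.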
20. Step 4a: LPC calculation (if you use LPC as features; less accurate than MFCC). Recall: calculate the LPC coefficients a[ ] from the autocorrelation values in *coeff using Durbin's method (solving equation 2):
    void lpc_coeff(float *coeff)
    {
        int i, j;
        float sum, E, K, a[ORDER+1][ORDER+1];
        if (coeff[0] == 0.0) coeff[0] = 1.0E-30;  /* avoid division by zero */
        E = coeff[0];
        for (i = 1; i <= ORDER; i++) {
            sum = 0.0;
            for (j = 1; j < i; j++)
                sum += a[j][i-1] * coeff[i-j];
            K = (coeff[i] - sum) / E;
            a[i][i] = K;
            E *= (1 - K*K);
            for (j = 1; j < i; j++)
                a[j][i] = a[j][i-1] - K * a[i-j][i-1];
        }
        for (i = 1; i <= ORDER; i++)
            coeff[i] = a[i][ORDER];
    }
21. Step 4b: Cepstral coefficients calculation (more accurate). If the cepstral coefficients are weighted by the Mel scale, they are called MFCC (Mel-scale cepstral coefficients). C(n) = IDFT[log10 S(w)] = IDFT[log10{E(w)} + log10{H(w)}]. In C(n), the excitation term E and the filter term H appear at two different positions. Application: useful for (i) glottal excitation removal and (ii) vocal tract filter analysis. (Pipeline: s(n) → windowing → DFT → X(w) → log10 X(w) → IDFT → C(n); n = time index, w = frequency, IDFT = inverse discrete Fourier transform.)
22. MFCC implementation. Detailed implementation of using filter banks and the Fourier transform to obtain the MFCC parameters: signal s(n) → (each filter band m) → power cepstrum c(m). See https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html or http://kom.aau.dk/group/04gr742/pdf/MFCC_worksheet.pdf
23. Step 5: Distortion measure, the difference between two signals. To measure how different two signals are: the cepstral distance between one frame (described by cepstral coefficients c_1, c_2, ..., c_p) and another frame (c'_1, c'_2, ..., c'_p) is d = Σ_{i=1}^{p} (c_i - c'_i)^2. With MFCC (Mel-scale cepstral coefficients), weighted cepstral distances give different weightings to different cepstral coefficients (more accurate).
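The (unweighted) cepstral distance above can be sketched in C; the function name is an illustrative assumption, and the squared-sum form matches the formula stated on this slide.

```c
/* Squared cepstral distance d = sum_i (c1[i] - c2[i])^2 between two
 * frames of p cepstral coefficients, matching the formula above
 * (function name is an assumption).  A weighted variant would multiply
 * each squared term by a per-coefficient weight w[i]. */
float cepstral_dist(const float *c1, const float *c2, int p)
{
    float d = 0.0f;
    for (int i = 0; i < p; i++) {
        float diff = c1[i] - c2[i];
        d += diff * diff;
    }
    return d;
}
```

These per-frame distances are exactly the dist(i,j) entries fed into the dynamic programming step that follows.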
24. Matching method: dynamic programming (DP). Correlation is a simple method for pattern matching, BUT the most difficult problem in speech recognition is time alignment: no two speech sounds are exactly the same, even when produced by the same person. So we align the speech features by an elastic matching method: DP.
25. Example: a 10-word speech recognizer. A small-vocabulary (10-word) DP speech recognition system stores 10 templates of standard sounds, such as the sounds of "one", "two", "three", ... The unknown input is compared (using DP) with each stored sound; each comparison generates an error, and the template with the smallest error is the result.
26. (B) Dynamic programming algorithm. Step 1: calculate the distortion matrix dist(i,j). Step 2: calculate the accumulated matrix D(i,j) using D(i,j) = dist(i,j) + min{D(i-1,j), D(i-1,j-1), D(i,j-1)}.
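Step 2 can be sketched in C as follows. The function name and the boundary handling for the first row and column (accumulating along the only available predecessor) are assumptions, since the slides do not spell out the initialization.

```c
/* Fill the accumulated matrix D from the distortion matrix dist
 * (both I x J, row-major) using the recurrence
 *   D(i,j) = dist(i,j) + min{D(i-1,j), D(i-1,j-1), D(i,j-1)}.
 * The first row and column accumulate along their only predecessor
 * (an assumed boundary convention).  Returns D(I-1, J-1). */
static float min3(float a, float b, float c)
{
    float m = (a < b) ? a : b;
    return (m < c) ? m : c;
}

float dp_accumulate(const float *dist, float *D, int I, int J)
{
    for (int i = 0; i < I; i++) {
        for (int j = 0; j < J; j++) {
            float d = dist[i*J + j];
            if (i == 0 && j == 0)
                D[i*J + j] = d;
            else if (i == 0)
                D[i*J + j] = d + D[i*J + (j-1)];
            else if (j == 0)
                D[i*J + j] = d + D[(i-1)*J + j];
            else
                D[i*J + j] = d + min3(D[(i-1)*J + j],
                                      D[(i-1)*J + (j-1)],
                                      D[i*J + (j-1)]);
        }
    }
    return D[(I-1)*J + (J-1)];
}
```

The filled matrix D is then used by the backtracking step described below to recover the optimal alignment path.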
27. Example in DP (from Lea, Trends in Speech Recognition). Step 1: the distortion matrix dist(i,j). Step 2: the accumulated score matrix D(i,j). (Figure: the two matrices side by side, with the unknown input along one axis and the reference along the other.)
28. To find the optimal path in the accumulated matrix (and the minimum accumulated distortion/distance) (updated): Starting from the top row and right-most column, find the lowest-cost cell D(i,j)_t; here it is the cell at (i,j) = (3,5) in the top row, with D(3,5) = 7. (This cost is called the "minimum accumulated distance", or "minimum accumulated distortion".) From the lowest-cost position (i,j)_t, find the next position (i,j)_{t-1} = argmin_{i,j}{D(i-1,j), D(i-1,j-1), D(i,j-1)}; e.g., (i,j)_{t-1} = argmin_{i,j}{11, 5, 12}, so the cell with value 5 is selected. Repeat until the path reaches the left-most column or the lowest row. Note: argmin_{i,j}{cell1, cell2, cell3} means the position (i,j) of the cell with the lowest value is selected.
29. Optimal path. The path should run from some element in the top row or right-most column to some element in the bottom row or left-most column, because noise may corrupt elements at the beginning or the end of the input sequence. In practice, however, the path should be constrained to lie near the 45-degree diagonal (from bottom-left to top-right): the path cannot pass through the restricted regions shown in the attached diagram. The user can set these regions manually, which is a way to prohibit unrecognizable matches. See next page.