the nerd's backyard -- 伪技术杂谈: 四月 2015

1、原理简述

为了对GMM-HMM在语音识别上的应用有个宏观认识，花了些时间读了下HTK（用htk完成简单的孤立词识别）的部分源码，对该算法总算有了点大概认识，达到了预期我想要的。不得不说，网络上关于语音识别的通俗易懂教程太少，都是各种公式满天飞，很少有说具体细节的，当然了，那需要有实战经验才行。下面总结以下几点，对其有个宏观印象即可（以孤立词识别为例）。

一、每个单词的读音都对应一个HMM模型，大家都知道HMM模型中有个状态集S，那么每个状态用什么来表示呢，数字？向量？矩阵？其实这个状态集中的状态没有具体的数学要求，只是一个名称而已，你可以用’1’,’2’,‘3’…表示，也可以用’a’,‘b’,’c’表示。另外每个HMM模型中到底该用多少个状态，是通过先验知识人为设定的。

二、HMM的每一个状态都对应有一个观察值，这个观察值可以是一个实数，也可以是个向量，且每个状态对应的观察值的维度应该相同。假设现在有一个单词的音频文件，首先需要将其进行采样得到数字信息（A/D转换），然后分帧进行MFCC特征提取，假设每一帧音频对应的MFCC特征长度为39，则每个音频文件就转换成了N个MFCC向量（不同音频文件对应的N可能不同），这就成了一个序列，而在训练HMM模型的参数时（比如用Baum-Welch算法），每次输入到HMM中的数据要求就是一个观测值序列。这时，每个状态对应的观测值为39维的向量，因为向量中元素的取值是连续的，需要用多维密度函数来模拟，通常情况下用的是多维高斯函数。在GMM-HMM体系中，这个拟合函数是用K个多维高斯混合得到的。假设知道了每个状态对应的K个多维高斯的所有参数，则该GMM生成该状态上某一个观察向量（一帧音频的MFCC系数）的概率就可以求出来了。

三、对每个单词建立一个HMM模型，需要用到该单词的训练样本，这些训练样本是提前标注好的，即每个样本对应一段音频，该音频只包含这个单词的读音。当有了该单词的多个训练样本后，就用这些样本结合Baum-Welch算法和EM算法来训练出GMM-HMM的所有参数，这些参数包括初始状态的概率向量，状态之间的转移矩阵，每个状态对应的观察矩阵（这里对应的是GMM，即每个状态对应的K个高斯的权值，每个高斯的均值向量和方差矩阵）。

四、在识别阶段，输入一段音频，如果该音频含有多个单词，则可以手动先将其分割开（考虑的是最简单的方法），然后提取每个单词的音频MFCC特征序列，将该序列输入到每个HMM模型（已提前训练好的）中，采用前向算法求出每个HMM模型生成该序列的概率，最后取最大概率对应的那个模型，而那个模型所表示的单词就是我们识别的结果。

五、在建立声学模型时，可以用Deep Learning的方法来代替GMM-HMM中的GMM，因为GMM模拟任意函数的功能取决于混合高斯函数的个数，所以具有一定的局限性，属于浅层模型。而Deep Network可以模拟任意的函数，因而表达能力更强。注意，这里用来代替GMM的Deep Nets模型要求是产生式模型，比如DBN，DBM等，因为在训练HMM-DL网络时，需要用到HMM的某个状态产生一个样本的概率。

六、GMM-HMM在具体实现起来还是相当复杂的。

七、一般涉及到时间序列时才会使用HMM，比如这里音频中的语音识别，视频中的行为识别等。如果我们用GMM-HMM对静态的图片分类，因为这里没涉及到时间信息，所以HMM的状态数可设为1，那么此时的GMM-HMM算法就退化成GMM算法了。

2、使用说明

一、离散输出的隐马尔科夫模型（DHMM，HMM with discrete outputs）

最大似然参数估计EM（Baum Welch算法）

The script dhmm_em_demo.m gives an example of how to learn an HMM with discrete outputs. Let there be Q=2 states and O=3 output symbols. We create random stochastic matrices as follows.

O = 3;

Q = 2;

prior0 = normalise(rand(Q,1));

transmat0 = mk_stochastic(rand(Q,Q));

obsmat0 = mk_stochastic(rand(Q,O));

Now we sample nex=20 sequences of length T=10 each from this model, to use as training data.

T=10; %序列长度

nex=20; %样本序列数目

data = dhmm_sample(prior0, transmat0, obsmat0, nex, T);

Here data is 20x10. Now we make a random guess as to what the parameters are,

prior1 = normalise(rand(Q,1)); %初始状态概率

transmat1 = mk_stochastic(rand(Q,Q)); %初始状态转移矩阵

obsmat1 = mk_stochastic(rand(Q,O)); %初始观测状态到隐藏状态间的概率转移矩阵

and improve our guess using 5 iterations of EM...

[LL, prior2, transmat2, obsmat2] = dhmm_em(data, prior1, transmat1, obsmat1, 'max_iter', 5);

%prior2, transmat2, obsmat2 为训练好后的初始概率，状态转移矩阵及混合状态概率转移矩阵

LL(t) is the log-likelihood after iteration t, so we can plot the learning curve.

序列分类

To evaluate the log-likelihood of a trained model given test data, proceed as follows:

loglik = dhmm_logprob(data, prior2, transmat2, obsmat2) %HMM测试

Note: the discrete alphabet is assumed to be {1, 2, ..., O}, where O = size(obsmat, 2). Hence data cannot contain any 0s.

To classify a sequence into one of k classes, train up k HMMs, one per class, and then compute the log-likelihood that each model gives to the test sequence; if the i'th model is the most likely, then declare the class of the sequence to be class i.

Computing the most probable sequence (Viterbi)

First you need to evaluate B(i,t) = P(y_t | Q_t=i) for all t,i:

B = multinomial_prob(data, obsmat);

Then you can use

[path] = viterbi_path(prior, transmat, B)

二、具有高斯混合输出的隐马尔科夫模型（GHMM，HMM with mixture of Gaussians outputs）

Maximum likelihood parameter estimation using EM (Baum Welch)

Let us generate nex=50 vector-valued sequences of length T=50; each vector has size O=2.

O = 2;

T = 50;

nex = 50;

data = randn(O,T,nex);

Now let use fit a mixture of M=2 Gaussians for each of the Q=2 states using K-means.

M = 2;

Q = 2;

left_right = 0;

prior0 = normalise(rand(Q,1));

transmat0 = mk_stochastic(rand(Q,Q));

[mu0, Sigma0] = mixgauss_init(Q*M, reshape(data, [O T*nex]), cov_type);

mu0 = reshape(mu0, [O Q M]);

Sigma0 = reshape(Sigma0, [O O Q M]);

mixmat0 = mk_stochastic(rand(Q,M));

Finally, let us improve these parameter estimates using EM.

[LL, prior1, transmat1, mu1, Sigma1, mixmat1] = ...

mhmm_em(data, prior0, transmat0, mu0, Sigma0, mixmat0, 'max_iter', 2);

Since EM only finds a local optimum, good initialisation is crucial. The initialisation procedure illustrated above is very crude, and is probably not adequate for real applications... Click here for a real-world example of EM with mixtures of Gaussians using BNT.

What to do if the log-likelihood becomes positive?

It is possible for p(x) > 1 if p(x) is a probability density function, such as a Gaussian. (The requirements for a density are p(x)>0 for all x and int_x p(x) = 1.) In practice this usually means your covariance is shrinking to a point/delta function, so you should increase the width of the prior (see below), or constrain the matrix to be spherical or diagonal, or clamp it to a large fixed constant (not learn it at all). It is also very helpful to ensure the components of the data vectors have small and comparable magnitudes (use e.g., KPMstats/standardize).

This is a well-known pathology of maximum likelihood estimation for Gaussian mixtures: the global optimum may place one mixture component on a single data point, and give it 0 covariance, and hence infinite likelihood. One usually relies on the fact that EM cannot find the global optimum to avoid such pathologies.

What to do if the log-likelihood decreases during EM?

Since I implicitly add a prior to every covariance matrix (see below), what increases is loglik + log(prior), but what I print is just loglik, which may occasionally decrease. This suggests that one of your mixture components is not getting enough data. Try a better initialization or fewer clusters (states).

What to do if the covariance matrix becomes singular?

Estimates of the covariance matrix often become singular if you have too little data, or if too few points are assigned to a cluster center due to a bad initialization of the means. In this case, you should constrain the covariance to be spherical or diagonal, or adjust the prior (see below), or try a better initialization.

How do I add a prior to the covariance matrix?

Buried inside of KPMstats/mixgauss_Mstep you will see that cov_prior is initialized to 0.01*I. This is added to the maximum likelihood estimate after every M step. To change this, you will need to modify the mhmm_em function so it calls mixgauss_Mstep with a different value.

Sequence classification

To classify a sequence (e.g., of speech) into one of k classes (e.g., the digits 0-9), proceed as in the DHMM case above, but use the following procedure to compute likelihood:

loglik = mhmm_logprob(data, prior, transmat, mu, Sigma, mixmat);

Computing the most probable sequence (Viterbi)

First you need to evaluate B(t,i) = P(y_t | Q_t=i) for all t,i:

B = mixgauss_prob(data(:,:,ex), mu, Sigma, mixmat);

where data(:,:,ex) is OxT where O is the size of the observation vector. Finally, use

[path] = viterbi_path(prior, transmat, B);

三、具有高斯输出的HMM

This is just like the mixture of Gaussians case, except we have M=1, and hence there is no mixing matrix.

Online EM for discrete HMMs/ POMDPs

For some applications (e.g., reinforcement learning/ adaptive control), it is necessary to learn a model online. The script dhmm_em_online_demo gives an example of how to do this.

the nerd's backyard -- 伪技术杂谈

Google Translate

2015年4月28日星期二

HMM原理简述和使用说明【转】