WebRTC's noise suppression implements SpeechProbabilityEstimator, a per-frequency-bin speech probability estimator that is essentially a linear classifier fusing multiple features. It computes the following three features:

  • LRT (Likelihood Ratio Test)
  • Spectral Flatness
  • Spectral Difference

Each feature is mapped to a probability-like value through a tanh function, with different width parameters tuning the sensitivity, and the results are linearly weighted into the final speech probability. What follows is a full walkthrough of the three core features WebRTC uses for speech probability estimation (indicator0, indicator1, indicator2).

```
# Source locations
modules/audio_processing/ns/speech_probability_estimator.h
modules/audio_processing/ns/speech_probability_estimator.cc
modules/audio_processing/ns/signal_model_estimator.cc
modules/audio_processing/ns/signal_model_estimator.h
```

indicator0: Likelihood Ratio Test (LRT)

```cpp
// Updates the log LRT measures.
void UpdateSpectralLrt(rtc::ArrayView<const float, kFftSizeBy2Plus1> prior_snr,
                       rtc::ArrayView<const float, kFftSizeBy2Plus1> post_snr,
                       rtc::ArrayView<float, kFftSizeBy2Plus1> avg_log_lrt,
                       float* lrt) {
  RTC_DCHECK(lrt);

  for (size_t i = 0; i < kFftSizeBy2Plus1; ++i) {
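    // tmp1 = 1 + 2*xi_k and tmp2 = 2*xi_k / (1 + 2*xi_k), so bessel_tmp is
    // the (gamma_k + 1) * 2*xi_k / (1 + 2*xi_k) term of the LRT approximation
    // derived below.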
    float tmp1 = 1.f + 2.f * prior_snr[i];
    float tmp2 = 2.f * prior_snr[i] / (tmp1 + 0.0001f);
    float bessel_tmp = (post_snr[i] + 1.f) * tmp2;
    avg_log_lrt[i] +=
        .5f * (bessel_tmp - LogApproximation(tmp1) - avg_log_lrt[i]);
  }

  float log_lrt_time_avg_k_sum = 0.f;
  for (size_t i = 0; i < kFftSizeBy2Plus1; ++i) {
    log_lrt_time_avg_k_sum += avg_log_lrt[i];
  }
  *lrt = log_lrt_time_avg_k_sum * kOneByFftSizeBy2Plus1;
}
```

The LRT measures whether the currently observed spectrum looks more like speech or more like noise, by comparing two hypotheses:

  • $H_0$: the current frame is noise only
  • $H_1$: the current frame is speech plus noise

It computes the log-likelihood ratio:

$$ \Lambda = \log \left( \frac{P(X | H_1)}{P(X | H_0)} \right) $$

where:

  • $P(X|H_1)$: the probability of observing X given that speech is present;
  • $P(X|H_0)$: the probability of observing X given noise only.

This ultimately simplifies to the following approximation (the derivation is skipped here; I dug through a few references but never fully followed it 🙈):

$$ \text{LRT}(k) \approx \frac{(1 + \gamma_k) \cdot 2 \cdot \xi_k}{1 + 2 \cdot \xi_k} - \log(1 + 2 \cdot \xi_k) $$

where:

  • $\xi_k$: the prior SNR at frequency bin k;
  • $\gamma_k$: the posterior SNR at bin k;
  • a larger LRT means the signal looks more like speech;
  • a smaller LRT means it looks more like noise.
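
As a quick sanity check of the formula (numbers are illustrative): a strongly voiced bin with $\xi_k = \gamma_k = 10$ gives

$$ \text{LRT} \approx \frac{(1 + 10) \cdot 2 \cdot 10}{1 + 2 \cdot 10} - \log(1 + 2 \cdot 10) \approx 10.48 - 3.04 \approx 7.4 $$

while a noise-only bin with $\xi_k = \gamma_k = 0$ gives exactly 0, so speech pushes the feature strongly positive.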

The frame-averaged log LRT is then mapped into the [0, 1] range through a tanh function:

```cpp
  // Width parameter in sigmoid map for prior model.
  constexpr float kWidthPrior0 = 4.f;
  // Width for pause region: lower range, so increase width in tanh map.
  constexpr float kWidthPrior1 = 2.f * kWidthPrior0;

  // Average LRT feature: use larger width in tanh map for pause regions.
  float width_prior = model.lrt < prior_model.lrt ? kWidthPrior1 : kWidthPrior0;

  // Compute indicator function: sigmoid map.
  float indicator0 =
      0.5f * (tanh(width_prior * (model.lrt - prior_model.lrt)) + 1.f);
```

indicator1: Spectral Flatness

```cpp
// Updates the spectral flatness based on the input spectrum.
void UpdateSpectralFlatness(
    rtc::ArrayView<const float, kFftSizeBy2Plus1> signal_spectrum,
    float signal_spectral_sum,
    float* spectral_flatness) {
  RTC_DCHECK(spectral_flatness);

  // Compute log of ratio of the geometric to arithmetic mean (handle the log(0)
  // separately).
  constexpr float kAveraging = 0.3f;
  float avg_spect_flatness_num = 0.f;
  for (size_t i = 1; i < kFftSizeBy2Plus1; ++i) {
    if (signal_spectrum[i] == 0.f) {
      *spectral_flatness -= kAveraging * (*spectral_flatness);
      return;
    }
  }

  for (size_t i = 1; i < kFftSizeBy2Plus1; ++i) {
    avg_spect_flatness_num += LogApproximation(signal_spectrum[i]);
  }

  float avg_spect_flatness_denom = signal_spectral_sum - signal_spectrum[0];

  avg_spect_flatness_denom = avg_spect_flatness_denom * kOneByFftSizeBy2Plus1;
  avg_spect_flatness_num = avg_spect_flatness_num * kOneByFftSizeBy2Plus1;

  float spectral_tmp =
      ExpApproximation(avg_spect_flatness_num) / avg_spect_flatness_denom;

  // Time-avg update of spectral flatness feature.
  *spectral_flatness += kAveraging * (spectral_tmp - *spectral_flatness);
}
```
  • Definition of spectral flatness:

$$ \text{SFM} = \frac{\left( \prod_{i=1}^{N} X_i \right)^{1/N}}{\frac{1}{N} \sum_{i=1}^{N} X_i} = \frac{\text{geometric mean}}{\text{arithmetic mean}} $$

  • Why spectral flatness separates speech from noise. From a purely mathematical standpoint, the AM-GM inequality guarantees: geometric mean ≤ arithmetic mean.

  1. When all elements are equal, the two means coincide.
  2. The more the elements differ, the smaller the geometric mean relative to the arithmetic mean, i.e. the less flat the spectrum.

Mapping this to noise suppression:

  1. Noise (especially white noise): energy is spread evenly across the band, geometric mean ≈ arithmetic mean, flatness ≈ 1.
  2. Speech: energy is concentrated at formants, geometric mean ≪ arithmetic mean, flatness is close to 0.

Spectral flatness is therefore a gauge of how evenly energy is distributed, which makes it directly usable as a speech/noise classification feature; the small example below makes the contrast concrete.
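A minimal standalone sketch (the spectra and values here are made up for illustration, not taken from WebRTC):

```cpp
#include <cmath>
#include <cstdio>

// Spectral flatness: geometric mean / arithmetic mean, computed in the log
// domain for numerical stability (the same trick as the WebRTC code above).
float SpectralFlatness(const float* spectrum, int n) {
  float log_sum = 0.f;
  float sum = 0.f;
  for (int i = 0; i < n; ++i) {
    log_sum += std::log(spectrum[i]);
    sum += spectrum[i];
  }
  return std::exp(log_sum / n) / (sum / n);
}

int main() {
  // Flat, noise-like spectrum: every bin carries the same energy.
  const float noise[4] = {1.f, 1.f, 1.f, 1.f};
  // Peaky, speech-like spectrum: energy concentrated in one "formant" bin.
  const float speech[4] = {8.f, 0.1f, 0.1f, 0.1f};
  std::printf("noise flatness:  %.2f\n", SpectralFlatness(noise, 4));   // 1.00
  std::printf("speech flatness: %.2f\n", SpectralFlatness(speech, 4));  // ~0.14
}
```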

  • Implementation walkthrough:
  1. Compute the geometric mean
```cpp
  for (size_t i = 1; i < kFftSizeBy2Plus1; ++i) {
    avg_spect_flatness_num += LogApproximation(signal_spectrum[i]);
  }
  avg_spect_flatness_num = avg_spect_flatness_num * kOneByFftSizeBy2Plus1;
```

This is equivalent to first computing the log-domain version of the geometric mean:

$$ \exp\left( \frac{1}{N} \sum \log(X_i) \right) $$

and then exponentiating to recover it:

```cpp
ExpApproximation(avg_spect_flatness_num)
```
  2. Compute the arithmetic mean (excluding the DC bin)
```cpp
  float avg_spect_flatness_denom = signal_spectral_sum - signal_spectrum[0];

  avg_spect_flatness_denom = avg_spect_flatness_denom * kOneByFftSizeBy2Plus1;
```
  3. Form the spectral flatness ratio

```cpp
  float spectral_tmp =
      ExpApproximation(avg_spect_flatness_num) / avg_spect_flatness_denom;
```
  4. Smoothed (time-averaged) update
```cpp
  // Time-avg update of spectral flatness feature.
  *spectral_flatness += kAveraging * (spectral_tmp - *spectral_flatness);
```

indicator2: Spectral Difference

Spectral Difference measures how much the current frame's spectrum deviates from the learned noise template. The basic idea:

If the current frame's spectral shape is similar to the noise template, it is probably noise; if the difference is large, it is probably speech.

  1. Overall formula:

$$ \text{SpectralDiff} = \text{Var}(X) - \frac{[\text{Cov}(X, Y)]^2}{\text{Var}(Y)} $$

where:

  • X: the current frame's signal spectrum;
  • Y: the historically averaged noise spectrum (the conservative noise spectrum);
  • $\mathrm{Var}$: variance (how strongly values fluctuate);
  • $\mathrm{Cov}$: covariance (whether the two move together).

  2. Why does this measure similarity?

Statistically, Var(X) - Cov(X, Y)² / Var(Y) is the energy in the current frame that is inconsistent with the past template:

  • if signal ≈ noise (a noise frame): covariance² / noise_variance ≈ signal_variance → spectral_diff ≈ 0
  • if the signal contains a speech component (structured differently from the noise): the covariance is small and spectral_diff grows

This formula is essentially equivalent to

Var(Residual) = Var(Signal) - Var(ProjectedNoiseComponent)

i.e. how much "anomalous energy" remains in the current frame beyond what the noise template can explain; see the sketch below.
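The post does not quote the corresponding WebRTC source, so here is a minimal sketch of the formula above (the function and variable names are mine, purely illustrative):

```cpp
#include <array>
#include <cstddef>

// Spectral difference between the current spectrum and a noise template:
//   Var(X) - Cov(X, Y)^2 / Var(Y)
// i.e. the part of the current frame's variance that a linear fit onto the
// noise template cannot explain. Illustrative sketch, not the WebRTC code.
template <size_t N>
float SpectralDifference(const std::array<float, N>& signal,        // X
                         const std::array<float, N>& noise_tmpl) {  // Y
  float mean_x = 0.f, mean_y = 0.f;
  for (size_t i = 0; i < N; ++i) {
    mean_x += signal[i];
    mean_y += noise_tmpl[i];
  }
  mean_x /= N;
  mean_y /= N;

  float var_x = 0.f, var_y = 0.f, cov_xy = 0.f;
  for (size_t i = 0; i < N; ++i) {
    const float dx = signal[i] - mean_x;
    const float dy = noise_tmpl[i] - mean_y;
    var_x += dx * dx;
    var_y += dy * dy;
    cov_xy += dx * dy;
  }
  // Small floor avoids division by zero for an all-flat template.
  return var_x - (cov_xy * cov_xy) / (var_y + 0.0001f);
}
```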

Feature fusion

```cpp
  // Combine the indicator function with the feature weights.
  float ind_prior = prior_model.lrt_weighting * indicator0 +
                    prior_model.flatness_weighting * indicator1 +
                    prior_model.difference_weighting * indicator2;
```

Finally the three indicators are combined:

  • each feature carries its own weight
  • the features are complementary, which improves robustness

Some examples:

| Case | LRT | Flatness | Spectral Diff | Verdict |
| --- | --- | --- | --- | --- |
| Short plosive [k] | – | high (noise-like) | low (template-like) | flatness alone is not enough; indicator2 compensates |
| Background noise burst | – | – | – | indicator2 suppresses the false trigger |
| Speech pause | low | low | low | all three are low → VAD reports silence |
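
Note that this post only shows the tanh mapping for indicator0; indicator1 and indicator2 are squashed into [0, 1] by the same sigmoid pattern before the weighted sum above. A generic sketch of that mapping (illustrative, not the verbatim WebRTC code):

```cpp
#include <cmath>

// Generic sigmoid map shared by all three indicators: squash the distance of
// a feature from its learned threshold into (0, 1). `width` controls how
// sharp the transition is; `sign` is +1.f when a larger feature value means
// "more speech-like" (LRT, spectral difference) and -1.f when a smaller value
// does (spectral flatness). Illustrative sketch, not verbatim WebRTC code.
float IndicatorFromFeature(float feature, float threshold, float width,
                           float sign) {
  return 0.5f * (std::tanh(sign * width * (feature - threshold)) + 1.f);
}
```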

Computing the per-bin posterior speech probability

  • Smoothed update of the prior speech probability
```cpp
  // Compute the prior probability.
  prior_speech_prob_ += 0.1f * (ind_prior - prior_speech_prob_); 

  // Make sure probabilities are within range: keep floor to 0.01.
  prior_speech_prob_ = std::max(std::min(prior_speech_prob_, 1.f), 0.01f);
```
  • Computing the posterior speech probability. Since prior_speech_prob_ is an estimate of the current frame's speech probability built from historical information, it is called the prior speech probability. What we actually want in practice is the probability given the observation of the current frame, i.e. the posterior speech probability.

Bayes' theorem gives the posterior probability:

$$ P(H_1 \mid X) = \frac{P(H_1) \cdot P(X \mid H_1)}{P(H_1) \cdot P(X \mid H_1) + P(H_0) \cdot P(X \mid H_0)} $$

Introducing:

  • $\text{Prior} = P(H_1)$
  • $\text{Gain} = \frac{1 - \text{Prior}}{\text{Prior}}$
  • $\text{LRT} = \frac{P(X \mid H_1)}{P(X \mid H_0)}$

and dividing the numerator and denominator by $P(H_1) \cdot P(X \mid H_1)$ yields the posterior speech probability:

$$ P(H_1 \mid X) = \frac{1}{1 + \text{Gain} \cdot \frac{1}{\text{LRT}}} $$

This is exactly what the following code implements:

```cpp
  // Final speech probability: combine prior model with LR factor:.
  float gain_prior =
      (1.f - prior_speech_prob_) / (prior_speech_prob_ + 0.0001f);

  std::array<float, kFftSizeBy2Plus1> inv_lrt;
  ExpApproximationSignFlip(model.avg_log_lrt, inv_lrt);
  for (size_t i = 0; i < kFftSizeBy2Plus1; ++i) {
    speech_probability_[i] = 1.f / (1.f + gain_prior * inv_lrt[i]);
  }
```

where:

  • gain_prior = (1 - prior_speech_prob_) / (prior_speech_prob_ + ε)
  • inv_lrt[i] = e^{-avg_log_lrt[i]} ≈ 1 / LRT (exponential approximation)
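
A quick numerical check (illustrative values): with $\text{Prior} = 0.5$ we get $\text{Gain} = 1$; a bin whose averaged log LRT is 2 then lands at

$$ P(H_1 \mid X) = \frac{1}{1 + 1 \cdot e^{-2}} \approx 0.88 $$

while a bin with log LRT of 0 stays at exactly 0.5, i.e. the prior alone decides.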

Using the speech probability to refine the noise spectrum

```cpp
void NoiseEstimator::PostUpdate(
    rtc::ArrayView<const float> speech_probability,
    rtc::ArrayView<const float, kFftSizeBy2Plus1> signal_spectrum) {
  // Time-avg parameter for noise_spectrum update.
  constexpr float kNoiseUpdate = 0.9f;

  float gamma = kNoiseUpdate;
  for (size_t i = 0; i < kFftSizeBy2Plus1; ++i) {
    const float prob_speech = speech_probability[i];
    const float prob_non_speech = 1.f - prob_speech;

    // Temporary noise update used for speech frames if update value is less
    // than previous.
    float noise_update_tmp =
        gamma * prev_noise_spectrum_[i] +
        (1.f - gamma) * (prob_non_speech * signal_spectrum[i] +
                         prob_speech * prev_noise_spectrum_[i]);

    // Time-constant based on speech/noise_spectrum state.
    float gamma_old = gamma;

    // Increase gamma for frames likely to be speech.
    constexpr float kProbRange = .2f;
    gamma = prob_speech > kProbRange ? .99f : kNoiseUpdate;

    // Conservative noise_spectrum update.
    if (prob_speech < kProbRange) {
      conservative_noise_spectrum_[i] +=
          0.05f * (signal_spectrum[i] - conservative_noise_spectrum_[i]);
    }

    // Noise_spectrum update.
    if (gamma == gamma_old) {
      noise_spectrum_[i] = noise_update_tmp;
    } else {
      noise_spectrum_[i] =
          gamma * prev_noise_spectrum_[i] +
          (1.f - gamma) * (prob_non_speech * signal_spectrum[i] +
                           prob_speech * prev_noise_spectrum_[i]);
      // Allow for noise_spectrum update downwards: If noise_spectrum update
      // decreases the noise_spectrum, it is safe, so allow it to happen.
      noise_spectrum_[i] = std::min(noise_spectrum_[i], noise_update_tmp);
    }
  }
}
```

The code above is where WebRTC takes the noise spectrum previously obtained from quantile-based estimation and further corrects it using the per-bin speech probability.

Why is a further correction in PostUpdate() needed when a quantile-based noise estimate already exists?

The table below contrasts the two stages:

| Aspect | Quantile estimation in PreUpdate() | Time-averaged update in PostUpdate() |
| --- | --- | --- |
| Principle | tracks the noise-floor distribution over past frames (log domain) | recursive per-frame update driven by the current frame's speech probability |
| Update dimension | across frames (distribution over history) | along time within the stream (recursive smoothing) |
| Response | follows slow background changes, robust to speech bursts | suppresses updates during speech frames, gently corrects in non-speech frames |
| Variables | quantile_noise_estimator_.Estimate(…) → initial noise_spectrum_[] | noise_spectrum_[] → smoothed, dynamically tracked correction |
| Purpose | establish the initial noise model | refine and dynamically track the noise spectrum |

Summary: why is PostUpdate needed?

  • Quantile estimation (PreUpdate) is robust but slow.
  • PostUpdate provides a fast, smooth, probability-driven adjustment mechanism.
  • It prevents speech energy from contaminating the noise estimate;
  • keeps the noise spectrum tracking non-stationary noise (e.g. an air conditioner switching on or off, changing wind);
  • and feeds the downstream Wiener filter a more reliable noise spectrum.

OK! At this point WebRTC has truly finished estimating the noise spectrum. The next post will continue through WebRTC's noise suppression code.

To Be Continued!!!