Why

Conformer has become the default ASR encoder, but its self-attention scales quadratically with sequence length — a real bottleneck for streaming and long-form recognition. HyperMixer is a linear-complexity alternative to attention that had been shown to work well on NLP tasks; we asked whether it generalizes to speech.

What we did

Replaced Conformer’s self-attention with a multi-head HyperMixer block, keeping the convolution module and macaron-style feed-forward layers unchanged. Trained on LibriSpeech and CommonVoice with SpeechBrain.
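The core idea of HyperMixer token mixing can be sketched in a few lines: a hypernetwork generates the token-mixing MLP weights from the tokens themselves, so every matrix product stays linear in sequence length. A minimal single-head numpy sketch, assuming a single linear map as the hypernetwork (the actual method uses small MLPs, positional embeddings, and multiple heads) and illustrative dimensions not taken from the paper:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 16  # illustrative sizes, not the paper's

# Stand-in hypernetworks: one linear map per token. The real method uses
# small MLPs and adds positional embeddings to X first; this is a sketch.
Wh_in = 0.1 * rng.normal(size=(d_model, d_hidden))
Wh_out = 0.1 * rng.normal(size=(d_model, d_hidden))

def hypermixer_token_mixing(X):
    """X: (N, d_model) token features -> (N, d_model) mixed tokens."""
    W1 = X @ Wh_in   # (N, d_hidden): mixing weights generated from the tokens
    W2 = X @ Wh_out  # (N, d_hidden)
    # Token-mixing MLP with the generated weights. Each matmul costs
    # O(N * d_model * d_hidden) -- linear in sequence length N, unlike
    # self-attention's O(N^2 * d_model) score matrix.
    return W2 @ gelu(W1.T @ X)

X = rng.normal(size=(50, d_model))  # a 50-frame "utterance"
Y = hypermixer_token_mixing(X)
print(Y.shape)  # (50, 8)
```

In the model described above, this block simply takes the place of the multi-head self-attention inside each Conformer layer; everything else in the layer is left as-is.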

Results

  • On par with Conformer on LibriSpeech (comparable WER, no regression)
  • Better than Conformer in limited-data settings (CommonVoice)
  • Linear time complexity — meaningful wins for longer utterances

Status

Merged into the SpeechBrain recipes. Published at Interspeech 2023.