Abstract
Speech separation is the task of separating target speech from background interference. Traditionally, speech separation is studied as a signal processing problem. A more recent approach formulates speech separation as a supervised learning problem, where the discriminative patterns of speech, speakers, and background noise are learned from training data. Over the past decade, many supervised separation algorithms have been put forward. In particular, the recent introduction of deep learning to supervised speech separation has dramatically accelerated progress and boosted separation performance. This paper provides a comprehensive overview of the research on deep learning based supervised speech separation in the last several years. We first introduce the background of speech separation and the formulation of supervised separation. Then, we discuss three main components of supervised separation: learning machines, training targets, and acoustic features. Much of the overview is on separation algorithms where we review monaural methods, including speech enhancement (speech-nonspeech separation), speaker separation (multitalker separation), and speech dereverberation, as well as multimicrophone techniques. The important issue of generalization, unique to supervised learning, is discussed. This overview provides a historical perspective on how advances are made. In addition, we discuss a number of conceptual issues, including what constitutes the target source.
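The abstract mentions training targets as one of the main components of supervised separation. As a purely illustrative sketch (not taken from the paper itself), one commonly cited training target in this literature is the ideal ratio mask (IRM), computed per time-frequency unit from the clean-speech and noise magnitude spectrograms:

```python
import numpy as np

def ideal_ratio_mask(speech_mag, noise_mag, eps=1e-8):
    """Per time-frequency-unit ideal ratio mask: the fraction of local
    energy belonging to speech. Values lie in [0, 1]; a learner trained
    on this target predicts the mask from noisy-speech features."""
    return speech_mag**2 / (speech_mag**2 + noise_mag**2 + eps)

# Toy magnitudes: 2 frequency bins x 3 frames
speech = np.array([[1.0, 2.0, 0.5],
                   [0.1, 1.0, 3.0]])
noise = np.array([[1.0, 0.5, 0.5],
                  [1.0, 1.0, 1.0]])
mask = ideal_ratio_mask(speech, noise)
# Separation applies the (estimated) mask to the noisy magnitude
# before resynthesis; e.g. equal speech and noise energy gives ~0.5.
```

The function names and toy arrays here are hypothetical; the survey itself reviews this and other targets (e.g., the ideal binary mask) in detail.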
Related Publications
- Improving deep neural networks for LVCSR using rectified linear units and dropout
- Prewhitening for intelligibility gain in hearing aid arrays
- Investigation of full-sequence training of deep belief networks for speech recognition
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing
- Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition
Publication Info
- Year: 2018
- Type: article
- Volume: 26
- Issue: 10
- Pages: 1702-1726
- Citations: 1453
- Access: Closed
Identifiers
- DOI: 10.1109/taslp.2018.2842159
- PMID: 31223631
- PMCID: PMC6586438
- arXiv: 1708.07524