Abstract

Contextual word embedding models such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018) have dramatically improved performance for many natural language processing (NLP) tasks in recent months. However, these models have been minimally explored on specialty corpora, such as clinical text; moreover, in the clinical domain, no publicly-available pre-trained BERT models yet exist. In this work, we address this need by exploring and releasing BERT models for clinical text: one for generic clinical text and another for discharge summaries specifically. We demonstrate that using a domain-specific model yields performance improvements on three common clinical NLP tasks as compared to nonspecific embeddings. These domain-specific models are not as performant on two clinical de-identification tasks, and we argue that this is a natural consequence of the differences between de-identified source text and synthetically non-de-identified task text.
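
The abstract describes releasing domain-specific BERT models whose contextual embeddings feed downstream clinical NLP tasks. As a minimal sketch of how such embeddings could be extracted, the snippet below uses the Hugging Face transformers library; the hub identifiers emilyalsentzer/Bio_ClinicalBERT and emilyalsentzer/Bio_Discharge_Summary_BERT, the example note text, and the choice of transformers as the loading interface are assumptions not stated on this page.

```python
# Sketch: extract contextual token embeddings from a clinical BERT checkpoint.
# The hub identifiers below are assumptions, not taken from this page.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "emilyalsentzer/Bio_ClinicalBERT"  # or "emilyalsentzer/Bio_Discharge_Summary_BERT"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

# Hypothetical example sentence in the style of a discharge summary.
note = "Patient admitted with community-acquired pneumonia; started on ceftriaxone."
inputs = tokenizer(note, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per WordPiece token (hidden size 768 for BERT-base).
token_embeddings = outputs.last_hidden_state  # shape: (1, num_tokens, 768)
print(token_embeddings.shape)
```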

Keywords

Computer science, Natural language processing, Task (project management), Artificial intelligence, Embedding, Domain (mathematical analysis), Identification (biology), Language model, Word embedding, Word (group theory), Named-entity recognition, Linguistics, Biology

Related Publications

Universal Sentence Encoder

We present models for encoding sentences into embedding vectors that specifically target transfer learning to other NLP tasks. The models are efficient and result in accurate pe...

2018 · arXiv (Cornell University) · 1289 citations

Publication Info

Year: 2019
Type: preprint
Citations: 715
Access: Closed

Citation Metrics

715 citations (OpenAlex)

Cite This

Emily Alsentzer, John R. Murphy, Willie Boag et al. (2019). Publicly Available Clinical BERT Embeddings. arXiv (Cornell University). https://doi.org/10.48550/arxiv.1904.03323

Identifiers

DOI: 10.48550/arxiv.1904.03323