Abstract

Contextual word embedding models such as ELMo and BERT have dramatically improved performance for many natural language processing (NLP) tasks in recent months. However, these models have been minimally explored on specialty corpora, such as clinical text; moreover, in the clinical domain, no publicly available pre-trained BERT models yet exist. In this work, we address this need by exploring and releasing BERT models for clinical text: one for generic clinical text and another for discharge summaries specifically. We demonstrate that using a domain-specific model yields performance improvements on 3/5 clinical NLP tasks, establishing a new state-of-the-art on the MedNLI dataset. We find that these domain-specific models are not as performant on 2 clinical de-identification tasks, and argue that this is a natural consequence of the differences between de-identified source text and synthetically non-de-identified task text.
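
The two released models are standard BERT checkpoints, so they can be dropped into existing pipelines. As a minimal sketch (assuming the checkpoints commonly distributed on the HuggingFace Hub under the identifiers emilyalsentzer/Bio_ClinicalBERT and emilyalsentzer/Bio_Discharge_Summary_BERT, which are not named in this abstract), clinical text can be embedded with the transformers library:

    import torch
    from transformers import AutoModel, AutoTokenizer

    # Hub identifier assumed for the generic clinical-text model; the
    # discharge-summary variant is assumed to be available as
    # "emilyalsentzer/Bio_Discharge_Summary_BERT".
    MODEL_ID = "emilyalsentzer/Bio_ClinicalBERT"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModel.from_pretrained(MODEL_ID)
    model.eval()

    note = "Patient was discharged home in stable condition on hospital day 3."
    inputs = tokenizer(note, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    # One contextual 768-dimensional vector per wordpiece token.
    token_embeddings = outputs.last_hidden_state  # shape: (1, seq_len, 768)
    print(token_embeddings.shape)

These token-level embeddings (or a pooled sentence vector) would then feed the downstream task heads, e.g. for MedNLI inference or named-entity recognition.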

Keywords

Computer science, Natural language processing, Artificial intelligence, Language model, Word embedding, Embedding, Biomedical text mining, Named-entity recognition, Text mining, Linguistics

Publication Info

Year: 2019
Type: article
Citations: 1422
Access: Closed

Citation Metrics

OpenAlex: 1422
Influential: 307
CrossRef: 1052

Cite This

Emily Alsentzer, John R. Murphy, William Boag et al. (2019). Publicly Available Clinical BERT Embeddings. Proceedings of the 2nd Clinical Natural Language Processing Workshop. https://doi.org/10.18653/v1/w19-1909

Identifiers

DOI: 10.18653/v1/w19-1909
arXiv: 1904.03323
