arxiv:2402.13963

Towards Building Multilingual Language Model for Medicine

Published on Feb 21, 2024

Abstract

In this paper, we aim to develop an open-source, multilingual language model for medicine that benefits a wider, linguistically diverse audience across regions. Our contributions are threefold: first, for multilingual medical-specific adaptation, we construct a new multilingual medical corpus, termed MMedC, containing approximately 25.5B tokens across six main languages, which enables auto-regressive training of existing general LLMs; second, to monitor the development of multilingual LLMs in medicine, we propose a new multilingual medical multiple-choice question-answering benchmark with rationales, termed MMedBench; third, we assess a number of popular open-source large language models (LLMs) on our benchmark, along with those further auto-regressively trained on MMedC. As a result, our final model, MMedLM 2, with only 7B parameters, achieves superior performance compared to all other open-source models and even rivals GPT-4 on MMedBench. We will make the resources publicly available, including code, model weights, and datasets.
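Since the abstract promises publicly released model weights, the snippet below is a minimal sketch of how such a checkpoint could be loaded and prompted with the Hugging Face transformers library. The model ID "MMedLM/MMedLM2-7B" and the prompt format are hypothetical placeholders, not confirmed by this page; substitute the authors' actual repository name once published.

# Minimal sketch (not from the paper): load a released causal LM checkpoint
# and ask a multiple-choice medical question in the style of MMedBench.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repository name; replace with the authors' published model ID.
model_id = "MMedLM/MMedLM2-7B"

# trust_remote_code is often required for checkpoints with custom model code.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Example multiple-choice prompt; the exact template used by the paper may differ.
prompt = (
    "Question: Which vitamin deficiency causes scurvy?\n"
    "Options: A. Vitamin A  B. Vitamin B12  C. Vitamin C  D. Vitamin D\n"
    "Answer:"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))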
