A research team from the Tokyo Institute of Technology and the National Institute of Advanced Industrial Science and Technology has released Swallow, a large language model with excellent Japanese proficiency that serves as a foundation for generative AI. It is the largest large language model that supports Japanese, and it is open and available for commercial use.
In recent years, research and development of large language models such as OpenAI's ChatGPT and GPT-4 and Google's PaLM 2 and Gemini has progressed rapidly. Models that are strong in Japanese are also being developed, but few of them have been both open and high-performing.
The Llama 2 series developed by Meta AI performs well in English but is weak at reading and writing Japanese. The research team therefore built Swallow on top of several Llama 2 models. Continual pre-training, that is, additional pre-training applied to an already trained large language model, yielded high performance in Japanese.
In addition, because Llama 2 is an English-focused model, its vocabulary (the set of tokens a language model can handle) lacks major Japanese words and characters. Japanese text is therefore split into unnatural units (tokens) and represented with more tokens, which lowers training and generation efficiency and raises computational cost. By adding Japanese characters and words to the vocabulary, the team reduced the token length of Japanese text to 56.2% of what it was with the original vocabulary.
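The effect of vocabulary expansion on token length can be illustrated with a toy example. The sketch below is not the Swallow tokenizer: it assumes a simple greedy longest-match tokenizer that falls back to one token per UTF-8 byte when no vocabulary entry matches, and the vocabularies, the sample sentence, and the function name tokenize are hypothetical. The resulting ratio is specific to this toy input and is unrelated to the 56.2% figure reported by the team.

```python
def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match tokenization with UTF-8 byte fallback (toy model)."""
    max_len = max((len(t) for t in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # No vocabulary entry matched: emit one token per UTF-8 byte.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


# Hypothetical English-only vocabulary (assumption, for illustration only).
base_vocab = {"the", "model", "language", " "}
# The same vocabulary after adding a few Japanese words and characters.
expanded_vocab = base_vocab | {"東京", "工業", "大学", "の", "研究"}

text = "東京工業大学の研究"
base_tokens = tokenize(text, base_vocab)
expanded_tokens = tokenize(text, expanded_vocab)

print(len(base_tokens), len(expanded_tokens))
print(f"token length ratio: {len(expanded_tokens) / len(base_tokens):.3f}")
```

With the English-only vocabulary, every Japanese character falls back to three byte tokens, so the sequence is much longer than with the expanded vocabulary. Fewer tokens per sentence means more text fits in the same context window and fewer steps are needed for training and generation.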
Furthermore, the research team independently extracted and refined Japanese text from archives distributed by the non-profit organization Common Crawl, and built a Japanese web corpus of approximately 312.1 billion characters (approximately 173 million pages). This is the largest commercially usable corpus for training Japanese language models.
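The article does not describe how the extraction and refinement were done. As a rough illustration of what such a step can look like, the sketch below keeps pages that are mostly written in Japanese script and drops exact duplicates. The function names japanese_ratio and build_corpus, the thresholds, and the sample pages are all hypothetical and are not taken from the team's pipeline.

```python
import hashlib


def japanese_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are hiragana, katakana, or CJK ideographs."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0

    def is_japanese(c: str) -> bool:
        cp = ord(c)
        return (
            0x3040 <= cp <= 0x309F      # hiragana
            or 0x30A0 <= cp <= 0x30FF   # katakana
            or 0x4E00 <= cp <= 0x9FFF   # CJK unified ideographs
        )

    return sum(is_japanese(c) for c in chars) / len(chars)


def build_corpus(pages, min_ratio=0.5, min_chars=10):
    """Keep sufficiently long, mostly Japanese pages and drop exact duplicates.

    min_ratio and min_chars are arbitrary illustrative thresholds (assumptions).
    """
    seen = set()
    kept = []
    for page in pages:
        text = page.strip()
        if len(text) < min_chars:
            continue  # too short to be useful
        if japanese_ratio(text) < min_ratio:
            continue  # not predominantly Japanese
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier page
        seen.add(digest)
        kept.append(text)
    return kept


# Hypothetical sample pages standing in for text extracted from a web archive.
pages = [
    "スワローは日本語に強い大規模言語モデルです。",
    "スワローは日本語に強い大規模言語モデルです。",
    "Swallow is a large language model.",
    "こんにちは",
]
corpus = build_corpus(pages)
print(len(corpus), sum(len(p) for p in corpus))
```

A real web-scale pipeline would typically also need quality filtering and near-duplicate detection; this sketch only shows the basic idea of filtering by script and removing exact copies.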
The release of an open large language model that is strong in Japanese is expected to further promote the research, development, and use of large language models in Japan, leading to further product development and technological innovation.