While reading "Typhoon: Thai Large Language Models" by Kunat Pipatanakul, Phatrasek Jirabovonvisut, Potsawee Manakul, Sittipong Sripaisarnmongkol, Ruangsak Patomwong, Pathomporn Chokchainant, and Kasima Tharnpipitchai, I became curious about how the tokenizer of this "Typhoon-7B" model actually works. The reason is that the paper states: In this work, we base our tokenizer on Mistral-7B tokenizer, but we further train an additional Thai subword tokenizer with 5,000 tokens and integrate