data.statmt.org [B!] New entries and ratings - Hatena Bookmark

"data.statmt.org"

CC-100: Monolingual Datasets from Web Crawl Data
13 users
data.statmt.org

This corpus is an attempt to recreate the dataset used for training XLM-R. It comprises monolingual data for 100+ languages and also includes data for romanized languages (indicated by *_rom). It was constructed using the URLs and paragraph indices provided by the CC-Net repository by processing January-December 2018 Commoncrawl snapshots. Each file comprises documents separated b
- Technology
- 2020/11/02 17:44
- Natural language processing
- dataset
