[B! dataset] Ryobot's Bookmarks

Ryobot id:Ryobot

Ryobot's bookmarks related to dataset (15)

The C4 Multilingual Dataset · allenai/allennlp · Discussion #5265
Ryobot 2021/08/01
The C4 Multilingual Dataset

dataset
Link
CoVoST V2: Expanding the largest, most diverse multilingual speech-to-text translation dataset
What the research is: CoVoST V2 expands on our CoVoST dataset, a speech-to-text translation (ST) corpus targeted at multilingual translation. This new release makes available the largest multilingual ST dataset to date. CoVoST V2 will facilitate translating 21 languages into English, as well as English i
Ryobot 2021/07/31
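“A speech-to-text translation (ST) corpus targeted at multilingual translation. CoVoST V2 will facilitate translating 21 languages into English, as well as English into 15 languages.”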

dataset
Link
VoxCeleb
VoxCeleb is an audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube. 7,000+ speakers: VoxCeleb contains speech from speakers spanning a wide range of different ethnicities, accents, professions and ages. 1 million+ utterances: All speaking face-tracks are captured "in the wild", with background chatter, laughter, overl
Ryobot 2021/07/31
“An audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube.” 7,000+ speakers, 1 million+ utterances, 2,000+ hours

YouTube

dataset
Link
AVSpeech: Audio Visual Speech dataset
Ryobot 2021/07/31
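“A large-scale collection of single-speaker video clips. The dataset is based on publicly available educational YouTube videos, from which short clips of 3 to 10 seconds were automatically extracted.”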

YouTube

dataset
Link
Mozilla Common Voice
Ryobot 2021/06/06
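"The English Common Voice database is the second-largest freely accessible speech database, after LibriSpeech", "As of December 2020, the database had accumulated 9,283 hours of voice recordings in 60 languages"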

dataset

ASR
Link
A new open data set for multilingual speech research
What it is: Facebook AI is releasing Multilingual LibriSpeech (MLS), a large-scale, open source data set designed to help advance research in automatic speech recognition (ASR). MLS is designed to help the speech research community’s work in languages beyond just English so people around the world can benefit from improvements in a wide range of AI-powered services. MLS provides more than 50,000 h
Ryobot 2021/06/06
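A multilingual version of LibriSpeech. “Provides more than 50,000 hours of audio across 8 languages. It also provides language model training data and pretrained language models, along with baselines.”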

dataset

ASR
Link
GitHub - robvanvolt/DALLE-datasets: This is a summary of easily available datasets for generalized DALLE-pytorch training.
Ryobot 2021/06/04
“general datasets: Conceptual Images 12m, Wikipedia, Filtered yfcc100m, Open Images”

dataset
Link
WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning
The milestone improvements brought about by deep representation learning and pre-training techniques have led to large performance gains across downstream NLP, IR and Vision tasks. Multimodal modeling techniques aim to leverage large high-quality visio-linguistic datasets for learning complementary information (across image and text modalities). In this paper, we introduce the Wikipedia-based Imag
Ryobot 2021/06/04
“WIT is composed of a curated set of 37.6 million entity-rich image-text examples with 11.5 million unique images across 108 Wikipedia languages.”

dataset
Link
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
- 2 users
- arxiv.org
- Learning
With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have system
Ryobot 2021/06/04
“We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4), and audit the correctness of language codes in a sixth (JW300).”

dataset
Link
A Repository of Conversational Datasets
Ryobot 2019/04/17
“collection of conversational datasets consisting of hundreds of millions of examples”

dataset

chatbot
Link
A Repository of Conversational Datasets
Ryobot 2019/04/17
“conversational datasets consisting of hundreds of millions of examples”

dataset

chatbot
Link
Danbooru2021: A Large-Scale Crowdsourced & Tagged Anime Illustration Dataset · Gwern.net
Danbooru2021 is a large-scale anime image database with 4.9m+ images annotated with 162m+ tags; it can be useful for machine learning purposes such as image recognition and generation. Deep learning for computer vision relies on large annotated datasets. Classification/categorization has benefited from the creation of I
Ryobot 2019/02/17
~2.5 TB of 3.33m images with 92.7m tag instances (of 365k defined tags, ~27.8/image)

dataset

GAN
Link
Danbooru: Anime Image Board
Ryobot 2019/02/17
dataset

GAN
Link
ALAGIN Language Resources and Speech Resources Site - JPO Corpus Overview
Ryobot 2017/11/13
347,954,067 Japanese-English parallel sentences!

dataset
Link
Wikispaces
We are sorry, but the site you are looking for no longer exists. Wikispaces was founded in 2005 and has since been used by educators, companies and individuals across the globe. Unfortunately, the time has come where we have had to make the difficult business decision to end the Wikispaces service. We first announced the site closure in January 2018, through a site-wide banner that appeared to all
Ryobot 2017/02/17
dataset
Link