[B! speech] imyutaro's Bookmarks

imyutaro id:imyutaro

imyutaro's bookmarks on speech (42)

Research Trends in Cross-Modal Representation Learning: Focusing on Speech
[KDD2023 Paper Reading Session] BERT4CTR: An Efficient Framework to Combine Pre-trained Language Model with Non-textual Features for CTR Prediction / KDD2023 LY Tech Reading
imyutaro 2024/03/12
speech

research_paper
Link
Voice Conversion and Generative AI: A 1.5-Year Retrospective from a Developer's Perspective
imyutaro 2024/01/30
speech

generative_model
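Link
Notes from Fine-Tuning Text to Speech with ESPNet
I wanted to fine-tune text to speech so that I could generate speech freely, so I gave it a try. I mostly followed the Qiita article below, but I'll note the places where I got a little stuck and the workarounds I came up with. Using this as a reference. Text to speech with ESPNet, for starters: first, I confirmed that I could feed in text and convert it to speech. https://github.com/espnet/espnet Run it on Colaboratory: the espnet GitHub repository has "Real-time TTS demo with ESPnet2", so run that. For some reason pyopenjtalk can't be pip installed on Colab. "Can't install pyopenjtalk on Google Colab" !pip install pyopenjtalk --no-build-i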
imyutaro 2023/07/12
speech

generative_model

ai_vtuber
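Link
Notes on the Structure of RVC
Introduction. Hello, I'm nadare. I'm a machine learning engineer, and I usually work on recommendation and search. I enjoy all kinds of competitive programming. I became interested in a technology called Retrieval-based-Voice-Conversion (hereafter RVC), and I've been studying it while sending PRs to the original Retrieval-based-Voice-Conversion-WebUI, ddPn08's RVC-WebUI, and VC Client. Lately I've been tinkering with the RVC model architecture on my own. It's more of a testing ground for techniques I've recently studied, so I don't expect to send PRs upstream, but along the way I've learned a lot about how RVC training works, so I want to write it up for my own reference. RVC architecture: RVC is based on VITS, a model for TTS (text to speech) and VC (Voice Conversion), specialized for VC, and the "imitation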
imyutaro 2023/06/18
speech
Link
GitHub - buriburisuri/speech-to-text-wavenet: Speech-to-Text-WaveNet : End-to-end sentence level English speech recognition based on DeepMind's WaveNet and tensorflow
imyutaro 2023/06/16
research_paper

speech
Link
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
- 2 users
- arxiv.org
- Learning
We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments
imyutaro 2023/06/16
research_paper

speech
Link
wav2vec: Unsupervised Pre-training for Speech Recognition
- 1 user
- arxiv.org
- Learning
imyutaro 2023/06/16
research_paper

speech
Link
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
- 4 users
- arxiv.org
- Learning
Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training phase, and (3) sound units have variable lengths with no explicit segmentation. To deal with these three problems, we propose the Hidden-Unit BERT (HuBERT) approach for
imyutaro 2023/06/16
research_paper

speech
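Link
Let's Do Real-Time Speech Synthesis with DDSP-SVC.
I posted this tweet. So I'm going to be promoting DDSP (?). What is DDSP-SVC? 👆 This. Apparently it is a "Real-time end-to-end singing voice conversion system based on DDSP (Differentiable Digital Signal Processing)." It reportedly trains about as fast as RVC while delivering better quality than RVC (probably). On top of that, its latency in real-time voice conversion is said to be lower than RVC's, and combining it with a diffusion model apparently improves quality further. Sounds interesting, doesn't it? So let's try it out. Installation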
imyutaro 2023/06/12
speech
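Link
[Illustrated] How the Ultra-High-Performance AI Voice Changer "RVC" Works, and Tips for Using It
Introduction. The video posted above shows conversion examples from "RVC", the much-discussed high-performance voice changer (the models were trained on voice data under Creative Commons licenses that permit distribution and modification, and are available for free on BOOTH). In this article, I'd like to share what I learned about how RVC works, along with tips for using it and for generating models, from building a total of five models: the four in the video plus one more. Trained models are available for free on BOOTH (licenses differ per model and follow the source data). Caveats: before getting into the main content, let me make a few assumptions clear. I have almost no specialist knowledge of RVC or speech recognition. I first learned about RVC about two weeks ago, so please consider my level to be no more than a beginner's. Furthermore, I don't know much about HuBERT or transformers, the models used in RVC (and I haven't properly read the papers). Therefore, this article's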
imyutaro 2023/05/24
speech

dl
Link
Introducing speech-to-text, text-to-speech, and more for 1,100+ languages
Equipping machines with the ability to recognize and produce speech can make information accessible to many more people, including those who rely entirely on voice to access information. However, producing good-quality machine learning models for these tasks requires large amounts of labeled data; in this case, many thousan
imyutaro 2023/05/23
speech

dl

research_paper
Link
A survey of speaker and style representation methods in speech synthesis
Kentaro Tachibana (LINE Corporation), "A survey of speaker and style representation methods in speech synthesis." Presentation slides from Tokyo BISH Bash #04 (2021/03/30). https://tokyo-bish-bash.connpass.com/event/205884/
imyutaro 2023/05/11
speech

research_paper
Link
NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers
- 1 user
- arxiv.org
- Learning
imyutaro 2023/04/21
research_paper

speech
Link
Retrieval-based-Voice-Conversion-WebUI/README_en.md at main · liujing04/Retrieval-based-Voice-Conversion-WebUI
imyutaro 2023/04/11
speech
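Link
Recent AI Voice Changers (RVC, so-vits-svc)
I am a beginner studying machine learning as a hobby, so my explanations may contain errors or misunderstandings. If you find any, I would appreciate it if you pointed them out in the comments. Also, since neither so-vits-svc nor RVC has been presented in a paper, what follows includes guesses based on the code and surrounding information. Revision history: 2023/04/15 Corrected an error about how RVC works. Thank you to nadare🌱 for pointing it out. What is an AI voice changer? My impression is that the term refers to deep-learning-based approaches for converting input speech into speech with the voice quality of a specific speaker, as if that speaker had spoken it. Deep-learning-based real-time voice changers such as MMVC have existed for some time. Recently (from around November 2022 to around April 2023), Retrieval-based-Voice-Conversion, known as RVC, and Soft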
imyutaro 2023/04/11
speech
Link
GitHub - reazon-research/ReazonSpeech: Massive open Japanese speech corpus
imyutaro 2023/04/04
speech

pretrained_model
Link
GitHub - mmorise/World: A high-quality speech analysis, manipulation and synthesis system
imyutaro 2023/04/04
speech
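Link
Improving Spoken Language Model Performance with HuBERT
Yahoo Japan Corporation became LY Corporation on October 1, 2023. LY Corporation's new blog is here: LINEヤフー Tech Blog. Hello. I'm Maekaku, and I work on research and development of "YJVOICE", Yahoo's speech recognition engine. In this article, drawing on Yahoo's latest research and development efforts in speech processing, I cover a method for improving spoken language models using self-supervised learning. This time, by using "HuBERT", a representation learning model for speech, we outperformed our previously proposed method on every evaluation metric, even when training data is scarce. This article is a continuation of the previously published "Language understanding from unlabeled speech data? Introducing a method for improving spoken language model performance", so please read that as well. In addition, this research with Associate Professor Shinji Watanabe of Carnegie Mellon University in the United States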
imyutaro 2023/03/29
speech
Link
SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing
- 2 users
- arxiv.org
- Learning
Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning. The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After prepro
imyutaro 2023/03/27
dl

speech

research_paper
Link
GitHub - ggerganov/whisper.cpp: Port of OpenAI's Whisper model in C/C++
Stable: v1.5.4. High-performance inference of OpenAI's Whisper automatic speech recognition (ASR) model: plain C/C++ implementation without dependencies; Apple Silicon first-class citizen, optimized via ARM NEON, Accelerate framework, Metal and Core ML; AVX intrinsics support for x86 architectures; VSX intrinsics support for POWER architectures; mixed F16 / F32 precision; 4-bit and 5
imyutaro 2023/03/23
speech

dl
Link