I'm Chiba, a data engineer in the Retail Media collaboration division. Today I'll write about "data modeling," one part of the AI Division new-hire training we recently ran in-house. yassun7010, who also presented as an instructor, has published a blog post on "the history of databases," so I'd be glad if you read that as well. *Some additions and corrections have been made for this article. Core systems and informational systems: in this training we split the systems that data modeling targets into core (transactional) systems and informational (analytical) systems, because the two differ fundamentally in how data is handled and in their system characteristics. A core system is an OLTP (online transaction processing) system: it adds and updates data online and in real time, so what matters most is processing a large volume of transactions accurately. A representative example is a bank's
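The OLTP property described above, processing many add/update transactions accurately, can be sketched with a toy transfer between two accounts. This is illustrative only: the table and helper names are hypothetical, and sqlite3 merely stands in for a production core-system database.

```python
import sqlite3

# Toy OLTP-style update: either both sides of the transfer commit,
# or neither does (sqlite3 stands in for a real RDBMS here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 1000), ("bob", 500)])

def transfer(conn, src, dst, amount):
    with conn:  # opens a transaction; rolls back on any exception
        cur = conn.execute(
            "UPDATE accounts SET balance = balance - ? "
            "WHERE id = ? AND balance >= ?",
            (amount, src, amount))
        if cur.rowcount != 1:
            raise ValueError("insufficient funds")
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                     (amount, dst))

transfer(conn, "alice", "bob", 300)
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(balances)  # → {'alice': 700, 'bob': 800}
```

The `with conn:` block is what gives the all-or-nothing behavior: if the second UPDATE fails, the first is rolled back, which is exactly the correctness guarantee OLTP workloads depend on.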
Today, we’re launching Claude 3.5 Sonnet—our first release in the forthcoming Claude 3.5 model family. Claude 3.5 Sonnet raises the industry bar for intelligence, outperforming competitor models and Claude 3 Opus on a wide range of evaluations, with the speed and cost of our mid-tier model, Claude 3 Sonnet. Claude 3.5 Sonnet is now available for free on Claude.ai and the Claude iOS app, while Clau
roberta-long-japanese (jumanpp + sentencepiece, mC4 Japanese) is the longer-input version of a RoBERTa Japanese model pretrained on approximately 200M Japanese sentences. max_position_embeddings has been increased to 1282, allowing it to handle much longer inputs
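Why the max_position_embeddings limit matters: any tokenized sequence longer than that limit must be truncated or split into windows before being fed to the model. A minimal sketch of the windowing step (the helper name and stride handling are illustrative, not from the model card):

```python
def chunk_token_ids(token_ids, max_positions=1282, stride=None):
    """Split a token-id sequence into windows that each fit within
    the model's max_position_embeddings (optionally overlapping)."""
    stride = stride or max_positions
    return [token_ids[i:i + max_positions]
            for i in range(0, len(token_ids), stride)]

ids = list(range(3000))  # stand-in for real token ids
chunks = chunk_token_ids(ids)
print(len(chunks), [len(c) for c in chunks])  # → 3 [1282, 1282, 436]
```

With the default 512-position models, the same 3000-token document would need six windows instead of three, which is the practical benefit of the longer input version.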
William Brown @willccbb | willcb.com v0.1 (June 5, 2024) Introduction This document aims to serve as a handbook for learning the key concepts underlying modern artificial intelligence systems. Given the speed of recent development in AI, there really isn’t a good textbook-style source for getting up-to-speed on the latest-and-greatest innovations in LLMs or other generative models, yet there is an
Omost is a project that converts an LLM's coding capability into image generation (or, more accurately, image composing) capability. The name Omost (pronounced "almost") has two meanings: 1) every time you use Omost, your image is almost there; 2) the "O" means "omni" (multi-modal) and "most" means we want to get the most out of it. Omost provides LLM models that write code to compose image visual
This paper introduces xRAG, an innovative context compression method tailored for retrieval-augmented generation. xRAG reinterprets document embeddings in dense retrieval--traditionally used solely for retrieval--as features from the retrieval modality. By employing a modality fusion methodology, xRAG seamlessly integrates these embeddings into the language model representation space, effectively
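The core move the abstract describes, treating a dense-retrieval document embedding as a feature from a "retrieval modality" and fusing it into the language model's representation space, can be sketched as a learned projection that turns one document vector into a single soft token. All names and dimensions below are hypothetical stand-ins, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

d_retrieval = 768  # dense retriever embedding size (assumed)
d_model = 1024     # LM hidden size (assumed)

# Hypothetical modality-fusion projector: a learned linear map from
# the retrieval embedding space into the LM representation space.
W = rng.normal(scale=0.02, size=(d_retrieval, d_model))

doc_embedding = rng.normal(size=(d_retrieval,))  # one retrieved document
soft_token = doc_embedding @ W                   # shape (d_model,)

# Prepend the projected document as a single "soft token" in front of the
# prompt's token embeddings, instead of splicing in the full document text.
token_embeddings = rng.normal(size=(5, d_model))  # 5 prompt tokens
fused_input = np.vstack([soft_token[None, :], token_embeddings])
print(fused_input.shape)  # → (6, 1024)
```

The compression win is visible in the shapes: a document that might tokenize to hundreds of positions occupies exactly one position in the fused input.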
Additionally, you can use SequentialEvaluator to combine multiple evaluators into one, which can then be passed to the SentenceTransformerTrainer. If you don't have the necessary evaluation data but still want to track the model's performance on common benchmarks, you can use these evaluators with data from Hugging Face: EmbeddingSimilarityEvaluator with STSb The STS Benchmark (a.k.a. STSb) is a c
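The "combine multiple evaluators into one" pattern can be sketched in plain Python. This is only an illustration of the idea; the real sentence-transformers SequentialEvaluator has a richer interface, and the stand-in evaluators below are hypothetical:

```python
# Minimal sketch of combining several evaluators into a single callable
# whose merged metrics a trainer could consume in one pass.
class SequentialEvaluatorSketch:
    def __init__(self, evaluators):
        self.evaluators = evaluators

    def __call__(self, model):
        # Run every evaluator in order and merge their metric dicts.
        results = {}
        for evaluator in self.evaluators:
            results.update(evaluator(model))
        return results

# Hypothetical stand-in evaluators, each returning a metric dict.
sts_eval = lambda model: {"sts_spearman": 0.82}
nli_eval = lambda model: {"nli_accuracy": 0.74}

combined = SequentialEvaluatorSketch([sts_eval, nli_eval])
print(combined(model=None))  # → {'sts_spearman': 0.82, 'nli_accuracy': 0.74}
```

Merging into one dict is what lets downstream code (a trainer's checkpoint-selection logic, for example) see every benchmark's score under a single evaluation call.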