Turn production traces into evals, compare prompts and models, simulate end-to-end agentic systems, and improve quality with every release.
Published Sep 11, 2025

Agents are only as effective as the tools we give them. We share how to write high-quality tools and evaluations, and how you can boost performance by using Claude to optimize its tools for itself. The Model Context Protocol (MCP) can empower LLM agents with potentially hundreds of tools to solve real-world tasks. But how do we make those tools maximally effective?
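As a rough illustration of the kind of tool the post is talking about, here is a minimal sketch of an MCP tool server using the official Python SDK's FastMCP class. The server name, the tool, and its behavior are illustrative assumptions, not taken from the post:

```python
from mcp.server.fastmcp import FastMCP

# Hypothetical example server; the name and tool below are illustrative.
mcp = FastMCP("weather-tools")

@mcp.tool()
def get_forecast(city: str, days: int = 3) -> str:
    """Return a short weather forecast for a city.

    A clear name, typed parameters, and a focused docstring are the
    kinds of details that make a tool easy for an agent to use well.
    """
    # A real tool would call a weather API; stubbed out for this sketch.
    return f"Forecast for {city} over the next {days} days: mild and clear."

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio by default
```

In FastMCP, the docstring and parameter annotations become the tool description the model sees when deciding whether and how to call it, which is one reason tool descriptions carry so much weight.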
Findings from a Pilot Anthropic–OpenAI Alignment Evaluation Exercise
Samuel R. Bowman, Megha Srivastava, Jon Kutasov, Rowan Wang, Trenton Bricken, Benjamin Wright, Ethan Perez, and Nicholas Carlini

tl;dr: In early summer 2025, Anthropic and OpenAI agreed to evaluate each other's public models using in-house misalignment-related evaluations. We are now releasing our findings in parallel.