Turn production traces into evals, compare prompts and models, simulate end-to-end agentic systems, and improve quality with every release.
Published Sep 11, 2025

Agents are only as effective as the tools we give them. We share how to write high-quality tools and evaluations, and how you can boost performance by using Claude to optimize its tools for itself. The Model Context Protocol (MCP) can empower LLM agents with potentially hundreds of tools to solve real-world tasks. But how do we make those tools maximally effective?
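As a rough illustration of the kind of tool the post is talking about, here is a minimal sketch of an MCP tool server using the official Python SDK's FastMCP class. The server name, the tool, and its behavior are illustrative assumptions, not taken from the post:

```python
from mcp.server.fastmcp import FastMCP

# Hypothetical example server; the name and tool below are illustrative.
mcp = FastMCP("weather-tools")

@mcp.tool()
def get_forecast(city: str, days: int = 3) -> str:
    """Return a short weather forecast for a city.

    A clear name, typed parameters, and a focused docstring are the
    kinds of details that make a tool easy for an agent to use well.
    """
    # A real tool would call a weather API; stubbed out for this sketch.
    return f"Forecast for {city} over the next {days} days: mild and clear."

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio by default
```

In FastMCP, the docstring and parameter annotations become the tool description the model sees when deciding whether and how to call it, which is one reason tool descriptions carry so much weight.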
Findings from a Pilot Anthropic–OpenAI Alignment Evaluation Exercise
Samuel R. Bowman, Megha Srivastava, Jon Kutasov, Rowan Wang, Trenton Bricken, Benjamin Wright, Ethan Perez, and Nicholas Carlini

tl;dr: In early summer 2025, Anthropic and OpenAI agreed to evaluate each other's public models using in-house misalignment-related evaluations. We are now releasing our findings in parallel.