Search results for "evaluation": 1 - 40 of 51 entries

Because the tag search matched too few entries, title-search results are shown instead.

There are 51 entries about "evaluation". Related tags include "LLM" and "組織マネジメント" (organizational management). Popular entries include 『エンジニア組織30人の壁を超えるための 評価システムとマネジメントのスケール / Scaling evaluation system and management』.
• エンジニア組織30人の壁を超えるための 評価システムとマネジメントのスケール / Scaling evaluation system and management

  2024夏のジンジニアMeetup! 〜みんなで学ぼう!開発組織の評価制度と運用〜 https://jinjineer.connpass.com/event/323746/

• 新米マネージャーの初めての目標設定と評価 / New manager's first goal setting and evaluation

  2024/03/01: EMゆるミートアップ vol.6 〜LT会〜 https://em-yuru-meetup.connpass.com/event/308552/ (speaker: 倉澤直弘, EM)

• 定量データと定性評価を用いた技術戦略の組織的実践 / Systematic implementation of technology strategies using quantitative data and qualitative evaluation

  CNDS2024 https://event.cloudnativedays.jp/cnds2024/

• Best Practices for LLM Evaluation of RAG Applications

• 作るだけなら簡単なLLMを“より優れたもの”にするには 「Pretraining」「Fine-Tuning」「Evaluation & Analysis」構築のポイント

  Takuya Akiba of Stability AI Japan spoke from the front lines of open LLM development about the realities of the work and the challenges it faces, presenting the key points of LLM development at the Weights & Biases user conference, the "W&B Conference." Two articles in total; this first half covers what it takes to build a better LLM. Akiba: "So we've happily gotten Fine-Tuning working too. This may surprise you: the amount of code isn't quite zero, but you can in fact build an LLM while writing almost none. You might think, 'surely that alone would only produce a garbage model,' but as long as you don't do anything unnecessary, even just this will probably get you a reasonably plausible LLM. So, relative to Professor Suzuki's (Jun Suzuki's) earlier talk…"

• Off-Policy Evaluationの基礎とZOZOTOWN大規模公開実データおよびパッケージ紹介 - ZOZO TECH BLOG

  Note: equations do not render correctly in the AMP view; please use the normal view to check them. Note: on November 7, 2020, the section "How to use Open Bandit Pipeline" was revised to bring the example code up to date with a new package version; see the corresponding release notes for details. Future updates to the dataset, package, and papers will be announced via a Google Group, so please consider following it. A new chapter, "Reception at international conference workshops," has also been added. I am Yuta Saito of Tokyo Institute of Technology, doing joint research with ZOZO Research; my work bridges the theory and application of counterfactual machine learning (see my survey article for an overview). This article covers the basics of evaluating, offline, the performance of decision making built on machine learning…

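  For readers who want to try the package the article introduces, a minimal sketch following the Open Bandit Pipeline quickstart (class and key names below come from the obp README and may have changed in later versions):

    import numpy as np
    from obp.dataset import OpenBanditDataset
    from obp.ope import OffPolicyEvaluation, InverseProbabilityWeighting as IPW

    # Load the small sample of ZOZOTOWN logged bandit feedback bundled with obp.
    dataset = OpenBanditDataset(behavior_policy="random", campaign="all")
    bandit_feedback = dataset.obtain_batch_bandit_feedback()

    # Evaluate a uniform-random counterfactual policy with the IPW estimator.
    n_rounds, n_actions = bandit_feedback["n_rounds"], dataset.n_actions
    action_dist = np.ones((n_rounds, n_actions, dataset.len_list)) / n_actions
    ope = OffPolicyEvaluation(bandit_feedback=bandit_feedback, ope_estimators=[IPW()])
    print(ope.estimate_policy_values(action_dist=action_dist))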
• GitHub - yahoojapan/JGLUE: JGLUE: Japanese General Language Understanding Evaluation

• GitHub - Stability-AI/lm-evaluation-harness: A framework for few-shot evaluation of autoregressive language models.

• GitHub - Arize-ai/phoenix: AI Observability & Evaluation

  Phoenix provides MLOps and LLMOps insights at lightning speed with zero-config observability, in a notebook-first experience for monitoring your models and LLM applications. LLM Traces let you trace the execution of an LLM application to understand its internals and to troubleshoot problems related to things like retrieval and tool execution.

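  A minimal sketch of the notebook workflow, assuming the arize-phoenix package and its launch_app entry point (per the project README; the API may differ by version):

    import phoenix as px

    # Launch the local Phoenix app; traces emitted by instrumented LLM code
    # appear here with no further configuration.
    session = px.launch_app()
    print(session.url)  # open this URL in a browser to inspect LLM traces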
• GitHub - confident-ai/deepeval: The LLM Evaluation Framework

  DeepEval is a simple-to-use, open-source LLM evaluation framework. It is similar to Pytest but specialized for unit testing LLM outputs. DeepEval incorporates the latest research to evaluate LLM outputs based on metrics such as G-Eval, hallucination, answer relevancy, RAGAS, etc., which use LLMs and various other NLP models that run locally on your machine for evaluation. Whether your applicatio…

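  A minimal sketch of the Pytest-style workflow the snippet describes, following deepeval's quickstart (metric and class names from the project README; thresholds and details may differ by version):

    from deepeval import assert_test
    from deepeval.metrics import AnswerRelevancyMetric
    from deepeval.test_case import LLMTestCase

    def test_answer_relevancy():
        test_case = LLMTestCase(
            input="What if these shoes don't fit?",
            actual_output="We offer a 30-day full refund at no extra cost.",
        )
        # Fails the test if the answer's relevancy score falls below the threshold.
        assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])

  Per the README, such files are run with deepeval's test runner (e.g. deepeval test run test_example.py).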
• COVID-19 vaccine efficacy summary | Institute for Health Metrics and Evaluation

• GitHub - explodinggradients/ragas: Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines

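  The repository title summarizes the tool; as a rough illustration, a usage sketch modeled on ragas' documented evaluate() entry point (dataset columns and metric names follow older ragas docs and have changed across versions):

    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import faithfulness, answer_relevancy

    # One RAG interaction: question, generated answer, retrieved contexts, reference.
    data = Dataset.from_dict({
        "question": ["When was the first Super Bowl?"],
        "answer": ["The first Super Bowl was held on January 15, 1967."],
        "contexts": [["The First AFL-NFL World Championship Game was played on January 15, 1967."]],
        "ground_truth": ["January 15, 1967"],
    })
    print(evaluate(data, metrics=[faithfulness, answer_relevancy]))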
• PMスキル・評価制度を導入し、アウトカムを生み出すプロダクトマネジメント集団へ進化する道のりの共有 / How we introduced the PM skills and evaluation system and evolved into a product management group that produces outcomes

  Slides from a pmconf2021 talk, tracing how Retty grew from an organization devoted entirely to project management into one where outcome-driven product management took root. As one concrete example of the effort to develop strong product managers, we defined PM skill…

• Estimation of total and excess mortality due to COVID-19 | Institute for Health Metrics and Evaluation

  Estimation of total and excess mortality due to COVID-19. Published October 15, 2021. This page was updated on October 15, 2021 to reflect changes in our modeling strategy; view our previous methods, published May 13, 2021, here. In our October 15 release, we introduced three major changes. First, we have very substantially updated the data and methods used to estimate excess mortality related to the…

• Evaluation of science advice during the COVID-19 pandemic in Sweden - Humanities and Social Sciences Communications

  Sweden was well equipped to prevent the pandemic of COVID-19 from becoming serious. Over 280 years of collaboration between political bodies, authorities, and the scientific community had yielded many successes in preventive medicine. Sweden’s population is literate and has a high level of trust in authorities and those in power. During 2020, however, Sweden had ten times higher COVID-19 death rat…

• GitHub - st-tech/zr-obp: Open Bandit Pipeline: a python library for bandit algorithms and off-policy evaluation

• Top Evaluation Metrics for RAG Failures

  [Figure 1: Root Cause Workflows for LLM RAG Applications (flowchart created by author)] If you have been experimenting with large language models (LLMs) for search and retrieval tasks, you have likely come across retrieval augmented generation (RAG) as a technique to add relevant contextual information to LLM generated responses. By connecting an LLM to private data, RAG can enable a better response…

• 雰囲気で理解するtidy evaluation(1): tidy evaluationの導入 - Qiita

  R users, how well do you know the rlang package and tidy evaluation (tidy eval)? Version 0.4.2 of rlang was released just yesterday. It has not yet reached 1.0.0, but it has been on CRAN for more than two years, so I have decided it is time to study it in earnest. Starting with this post, over the next several articles I will offer a gentle introduction alongside my own study of rlang and tidy eval. Properly I ought to begin by defining terms and explaining the background, but here I focus on conveying the feel of it: what tidy eval makes possible and what advantages the rlang package offers. If you want the details, please read the documentation and reference materials. I, too, am still on the road…

• GitHub - pfnet-research/japanese-lm-fin-harness: Japanese Language Model Financial Evaluation Harness

• LLM Evaluation Tutorial

  Grounding and Evaluation for Large Language Models (Tutorial). With the ongoing rapid adoption of Artificial Intelligence (AI) based systems in high-stakes domains such as financial services, healthcare and life sciences, hiring and human resources, education, societal infrastructure, and national security, it is crucial to develop and deploy the underlying AI models and systems in a responsible ma…

• 論文紹介 Towards a Fair Marketplace: Counterfactual Evaluation of the trade-off between Relevance, Fairness & Satisfaction in Recommendation Systems

  Slides from an internal paper-reading session. Mehrotra, Rishabh, et al. "Towards a fair marketplace: Counterfactual evaluation of the trade-off between relevance, fairness & satisfac…"

• Evaluation method of UX “The User Experience Honeycomb” | blog / bookslope

  Evaluating or reviewing a website requires a variety of perspectives, and one view, given the direction of the market, holds that a "UX" perspective is essential. For research firms that have long included the user's viewpoint among their evaluation methods this is a natural progression, but in those cases UX evaluation usually means running user tests with actual participants. Such considerations mostly arise when writing user-test scenarios, yet when thinking of UX as an evaluation method, the "UX honeycomb" seemed to me the natural foundation. The article "User Experience Design - Semantic Studios" presents "The User Experience Honeycomb"; among the elements that make up UX are Useful, Usa…

• GitHub - FreedomIntelligence/LLMZoo: ⚡LLM Zoo is a project that provides data, models, and evaluation benchmark for large language models.⚡

• Humanloop: Collaboration and evaluation for LLM applications

  A shared workspace where PMs, engineers, and domain experts collaborate on building AI features. Humanloop is the first platform that combines software best practices with the needs of LLMs, empowering your whole team to drive AI improvement.

• 長期の評価に最適なWindows 10/11 Enterprise Evaluationともっと長く付き合う“裏ワザ”

  山市良のうぃんどうず日記 (277). Windows 10/11 Enterprise has an "Evaluation" edition that can be evaluated free of charge for 90 days. Because it lets you test and evaluate the enterprise editions of Windows 10/11 without buying a license, the author uses it often. This article introduces a few tricks for keeping the Evaluation edition usable for as long as possible.

• Misplaced trust: When trust in science fosters belief in pseudoscience and the benefits of critical evaluation

  At a time when pseudoscience threatens the survival of communities, understanding this vulnerability, and how to reduce it, is paramount. Four preregistered experiments (N = 532, N = 472, N = 605, N = 382) with online U.S. samples introduced false claims concerning a (fictional) virus created as a bioweapon, mirroring conspiracy theories about COVID-19, and carcinogenic effects of GMOs (Geneticall…

• OpenTofu 1.8.0 is out with Early Evaluation, Provider Mocking, and a Coder-Friendly Future | OpenTofu

  July 29, 2024. Since the 1.7 release, the OpenTofu community and core team have been hard at work on much-requested features, making .tf code easier to write, reducing unnecessary boilerplate, improving performance, and more. We are happy to announce the immediate availability of OpenTofu 1.8 with the followin…

• 【ML Tech RPT. 】第11回 機械学習のモデルの評価方法 (Evaluation Metrics) を学ぶ (2) - Sansan Tech Blog

  I'm Yoshimura, a researcher at DSOC. Our company has an internal program of club-like activities called "よいこ," and I belong to its tennis club; we meet about once a month, and newly joined members have lately been bringing fresh energy. Now, continuing from last time, this article focuses on evaluation metrics for machine-learning models (as before, "model" means a machine-learning model). Last time we reviewed the perspectives and caveats involved in evaluating a model; from this installment we look at which evaluation metrics exist for each problem setting and what they mean, beginning with binary classification. At the end of the previous article I wrote that multiclass classification and regression would also be covered here, but the volume grew too large, so…

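  As a concrete illustration of the binary-classification metrics the series covers, a minimal scikit-learn sketch (toy data, not taken from the article):

    from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

    y_true = [0, 1, 1, 0, 1, 0, 1, 1]                      # ground-truth labels
    y_pred = [0, 1, 0, 0, 1, 1, 1, 1]                      # hard predictions
    y_score = [0.2, 0.9, 0.4, 0.1, 0.8, 0.6, 0.7, 0.95]    # predicted probabilities

    print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
    print("recall:   ", recall_score(y_true, y_pred))      # TP / (TP + FN)
    print("F1:       ", f1_score(y_true, y_pred))          # harmonic mean of the two
    print("ROC AUC:  ", roc_auc_score(y_true, y_score))    # threshold-free ranking quality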
• International evaluation of an AI system for breast cancer screening - Nature

• The Generative AI Evaluation Company - Galileo

  Evaluate, observe, and protect your GenAI applications. Go beyond ‘vibe checks’ and asking GPT with the first end-to-end GenAI Stack, powered by Evaluation Foundation Models.

• Windows Server 2022 | Microsoft Evaluation Center

  In addition to your trial experience of Windows Server 2022, you can more easily add and manage languages and Features on Demand with the new Languages and Optional Features ISO. Download this ISO. This ISO is only available on Windows Server 2022, combines the previously separate Features on Demand and Language Packs ISOs, and can be used as a FOD and Language Pack repository. To learn about F…

• 「Microsoft Evaluation Center」に障害、評価版ソフトがダウンロード不能に/コミュニティサイトでダウンロードリンクを案内中

• Evaluation of Retrieval-Augmented Generation: A Survey

  Retrieval-Augmented Generation (RAG) has recently gained traction in natural language processing. Numerous studies and real-world applications are leveraging its ability to enhance generative models through external information retrieval. Evaluating these RAG systems, however, poses unique challenges due to their hybrid structure and reliance on dynamic knowledge sources. To better understand thes…

• Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation

  Off-policy evaluation (OPE) aims to estimate the performance of hypothetical policies using data generated by a different policy. Because of its huge potential impact in practice, there has been growing research interest in this field. There is, however, no real-world public dataset that enables the evaluation of OPE, making its experimental studies unrealistic and irreproducible. With the goal of…
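  The abstract's core task, estimating a counterfactual policy's value from logged interaction data, is commonly illustrated with the inverse propensity score (IPS) estimator; a self-contained NumPy sketch on synthetic data (not the paper's code; variable names are hypothetical):

    import numpy as np

    rng = np.random.default_rng(0)
    n, n_actions = 10_000, 5

    # Logged data: actions chosen by a uniform-random behavior policy.
    actions = rng.integers(n_actions, size=n)
    propensities = np.full(n, 1.0 / n_actions)        # P(action | context) under logging policy
    rewards = rng.binomial(1, 0.05 + 0.05 * actions)  # synthetic click feedback

    # Target policy to evaluate: deterministically picks action 4.
    target_action = 4

    # IPS: reweight each logged reward by target-policy probability / propensity.
    weights = (actions == target_action).astype(float) / propensities
    print("IPS estimate of target policy value:", np.mean(weights * rewards))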

• Mandoline: Model Evaluation under Distribution Shift

  Machine learning models are often deployed in different settings than they were trained and validated on, posing a challenge to practitioners who wish to predict how well the deployed model will perform on a target distribution. If an unlabeled sample from the target distribution is available, along with a labeled sample from a possibly different source distribution, standard approaches such as im…

• Terms of Evaluation

  Terms of Evaluation for HashiCorp Software. Before you download and/or use our enterprise software for evaluation purposes, you will need to agree to a special set of terms (“Agreement”), which will be applicable for your use of the HashiCorp, Inc.’s (“HashiCorp”, “we”, or “us”) enterprise software. PLEASE READ THIS AGREEMENT CAREFULLY BEFORE INSTALLING OR USING THE SOFTWARE. THESE TERMS AND CONDITI…

• 論文紹介:ChatGPT で情報抽出タスクは解けるのか? Is information extraction solved by ChatGPT? An analysis of performance, evaluation criteria, robustness and errors

• GitHub - EleutherAI/lm-evaluation-harness: A framework for few-shot evaluation of language models.

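  A minimal sketch of the harness's Python entry point (the lm_eval CLI wraps the same call); argument names follow the project README and may differ by version:

    import lm_eval

    # Zero-shot HellaSwag with a small HuggingFace model.
    results = lm_eval.simple_evaluate(
        model="hf",                                      # HuggingFace backend
        model_args="pretrained=EleutherAI/pythia-160m",  # model checkpoint to load
        tasks=["hellaswag"],                             # benchmark task(s)
        num_fewshot=0,
    )
    print(results["results"]["hellaswag"])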
• U.S. AI Safety Institute Signs Agreements Regarding AI Safety Research, Testing and Evaluation With Anthropic and OpenAI

  GAITHERSBURG, Md. — Today, the U.S. Artificial Intelligence Safety Institute at the U.S. Department of Commerce’s National Institute of Standards and Technology (NIST) announced agreements that enable formal collaboration on AI safety research, testing and evaluation with both Anthropic and OpenAI. Each company’s Memorandum of Understanding establishes the framework for the U.S. AI Safety Institut…

• CAE (Continuous Access Evaluation: 継続的アクセス評価)

  Hello, this is Kanamori from the Azure Identity team. Are you familiar with the feature called CAE (Continuous Access Evaluation)? As of November 2021 there have been announcements like the following, which many of you have probably seen: information published in the Microsoft 365 admin portal message center as MC255540 (Continuous access evaluation on by default), and an email from Microsoft Azure (azure-noreply@microsoft.com, TRACKING ID: 5T93-LTG) with the subject "Continuous access evaluation will be enabled in premium Azu…"

