[B! Dataset] xef's Bookmarks

Google Dataset Search

xef 2024/06/13

Dataset

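Release of a Japanese Named Entity Recognition Dataset Built from Wikipedia

I am Takahiro Omi (近江崇宏) of the ML division. Stockmark uses a variety of natural language processing techniques in its products, and named entity recognition is one of the core ones. Named entity recognition extracts named entities (proper nouns) from text; for example, the product "Astrategy" uses it to extract company names from news articles. (For company name extraction, see our earlier blog post.) In general, named entity recognition requires training a machine learning model on training data in which named entities have been annotated across a large amount of text. Stockmark is now releasing a Japanese training dataset for named entity recognition. Feel free to use it. Repository: https://github.com/stockmarkteam/ner-wikipedia-dataset A sample with named entities highlighted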
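As a rough illustration of what annotated training data for named entity recognition can look like, the sketch below marks entity spans inside a sentence. The record layout (`text`, `entities`, `span`, `type`) and the `highlight` helper are assumptions made for this example, not the documented format of the published dataset; check the repository for the actual schema.

```python
def highlight(text, entities):
    """Wrap each annotated span as [surface|type].

    Assumes each entity has a "span" of (start, end) character offsets,
    end exclusive, and that spans do not overlap (hypothetical layout).
    """
    pieces = []
    cursor = 0
    for entity in sorted(entities, key=lambda e: e["span"][0]):
        start, end = entity["span"]
        pieces.append(text[cursor:start])          # text before the entity
        pieces.append(f"[{text[start:end]}|{entity['type']}]")
        cursor = end
    pieces.append(text[cursor:])                   # trailing text
    return "".join(pieces)


# Hypothetical record, written by hand for this sketch.
record = {
    "text": "Stockmark released a dataset built from Wikipedia.",
    "entities": [
        {"span": (0, 9), "type": "company"},
        {"span": (40, 49), "type": "product"},
    ],
}

highlighted = highlight(record["text"], record["entities"])
print(highlighted)
```

Storing character offsets rather than copies of the entity strings keeps the annotation unambiguous when the same name occurs more than once in a sentence.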

xef 2020/12/18

NLP
Dataset

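Stop Preprocessing Wikipedia and Use "Wiki-40B" Instead - Ahogrammer

Recent natural language processing trains word embeddings and language models on large text corpora. Wikipedia is often chosen as the training text because of its size and ease of use, but preprocessing it is surprisingly tedious and time-consuming. This article introduces "Wiki-40B", a relatively recently released preprocessed dataset, and how to use it. What is Wiki-40B? Wiki-40B is a dataset built by preprocessing Wikipedia in more than 40 languages. Because it is split into training, validation, and test sets for each language, it can be used to train and evaluate word embeddings and language models. For the languages covered, see the following page: wiki40b | TensorFlow Datasets. The preprocessing falls broadly into the following two steps. Page filtering; page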

xef 2020/09/26

NLP
Dataset

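Free "Address Master Data" Released, Usable for Normalizing Address Notation and Converting to Latitude/Longitude: Contains 189,540 Address Records Nationwide at the Town and Chome Level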

xef 2020/08/22

Dataset

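About the Release of 3DDB Viewer | Research Teams | Artificial Intelligence Research Center

In recent years, the use of three-dimensional data has been expanding worldwide as part of social and corporate activities, and both data providers and data users need a system that makes it easy to search and browse a wide variety of data. 3DDB Viewer is a web user interface developed for AIST's 3D database that lets you search, display, and download various kinds of three-dimensional data (point clouds, meshes, structures, and so on). The manual is available here.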

xef 2020/08/14

Dataset

TechCrunch | Startup and Technology News

Line Man Wongnai, an on-demand food delivery service in Thailand, is considering an initial public offering on a Thai exchange or the U.S. in 2025.

xef 2020/08/07

Dataset

How a File Format Led to a Crossword Scandal - Saul Pwanson

xef 2020/06/16

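KMNIST / Kuzushiji-MNIST: A Dataset of Kuzushiji (Handwritten Cursive Characters) from Classical Japanese Books

From the "AI and Machine Learning Dataset Dictionary" series: an explanation of the "KMNIST" dataset. Its 70,000 "image + label" samples of handwritten characters (kuzushiji) can be downloaded for free and used for deep learning tasks such as image recognition. The article also introduces a Python file that can download the dataset.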

xef 2020/01/27

Dataset

Asahi Shimbun Word Vectors

Asahi Shimbun Media Lab: Asahi Shimbun Word Vectors. This site has moved.

xef 2017/11/07

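ROIS-DS Center for Open Data in the Humanities (CODH)

The ROIS-DS Center for Open Data in the Humanities (CODH) pursues new humanities research for the era of open science. This includes "data-driven humanities," which analyzes humanities materials (historical sources) with the latest techniques from informatics and statistics, and "humanities big data," which puts datasets built on the results of humanities research to transdisciplinary use. Important notices: 2023-11-02 The first History Tech study meeting, a meeting place for history and technology, will be held online on November 22. Participation is free. 2023-10-18 Large-scale open data release of the "administrative division changes" of historical place names: "Nihon Rekishi Chimei Taikei" (日本歴史地名大系) enhanced as machine-readable data in collaboration with Heibonsha Chizu Shuppan.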

xef 2016/11/20

Yahoo Releases the Largest-ever Machine Learning... | Yahoo Labs

Yahoo Releases the Largest-ever Machine Learning Dataset for Researchers By Suju Rajan Data is the lifeblood of research in machine learning. However, access to truly large-scale datasets is a privilege that has been traditionally reserved for machine learning researchers and data scientists working at large companies – and out of reach for most academic researchers. Research scientists at Yahoo L

xef 2016/01/15

Dataset

Reddit - Dive into anything

I have every publicly available Reddit comment for research. ~1.7 billion comments @ 250 GB compressed. Any interest in this? I am currently doing a massive analysis of Reddit's entire publicly available comment dataset. The dataset is ~1.7 billion JSON objects complete with the comment, score, author, subreddit, position in comment tree and other fields that are available through Reddit's API. I
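A dump of JSON comment objects like the one described here could be summarized with a few lines of Python. This is only a sketch under stated assumptions: it supposes one JSON object per line, and the field names `subreddit`, `author`, `score`, and `body` as well as the sample rows are illustrative, not taken from the actual files.

```python
import json
from collections import Counter


def comments_per_subreddit(lines):
    """Count comments per subreddit from an iterable of JSON lines.

    Assumes one JSON object per line with a "subreddit" field
    (hypothetical layout for this sketch).
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        comment = json.loads(line)
        counts[comment["subreddit"]] += 1
    return counts


# Hand-written sample rows standing in for the real dump.
sample_lines = [
    '{"author": "alice", "subreddit": "datasets", "score": 3, "body": "Nice."}',
    '{"author": "bob", "subreddit": "python", "score": 1, "body": "Thanks."}',
    '{"author": "carol", "subreddit": "datasets", "score": 7, "body": "Link?"}',
]

counts = comments_per_subreddit(sample_lines)
print(counts.most_common())
```

Because the function consumes any iterable of lines, it can read from an open file handle one line at a time, which matters for a dump far larger than memory.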

xef 2015/07/12

Data Science for Research - Microsoft Research

Microsoft Research provides a continuously refreshed collection of free datasets, tools, and resources designed to advance academic research in many areas of computer science, such as natural language processing and computer vision. Access these datasets at https://msropendata.com. Our programs over the years have supported academics to push the state-of-the-art with data science and cloud: NSF Bi

xef 2015/07/07

Dataset

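We Are Releasing Cookpad Data to Researchers - Cookpad Developer Blog

Hello, this is Harashima from the Search and Editing Department. When I meet university researchers, they sometimes ask, "We would like to use Cookpad's data for our research..." They are doing cooking-related research, but without real data the research makes little progress. It is unfortunate for Cookpad too when cooking-related research stalls. Such research is also a "sprout" for improving Cookpad's services, and it is very sad when a sprout fails to grow simply because data is unavailable. To break this impasse, starting today we are releasing Cookpad's data to researchers. This entry explains the data release terms we have prepared, in Q&A format. Who can use it? Researchers who have applied. However, this is limited to researchers at public institutions (e.g. universities, independent administrative agencies). At application time, a review by Cookpad and the National Institute of Informatics (described later)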

xef 2015/02/25

Dataset

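Seikatsu Teiten 1992-2018 | Hakuhodo Institute of Life and Living

What is Seikatsu Teiten? It is a survey of consumer attitudes conducted every other year since 1992. The same questions are asked each time, and changes in the responses are tracked as a fixed-point observation.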

xef 2014/10/23

Dataset

Twitter Natural Language Processing -- Noah's ARK

We provide a tokenizer, a part-of-speech tagger, hierarchical word clusters, and a dependency parser for tweets, along with annotated corpora and web-based annotation tools. Contributors: Archna Bhatia, Dipanjan Das, Chris Dyer, Jacob Eisenstein, Jeffrey Flanigan, Kevin Gimpel, Michael Heilman, Lingpeng Kong, Daniel Mills, Brendan O'Connor, Olutobi Owoputi, Nathan Schneider, Noah Smith, Swabha Swa

xef 2014/10/22

NLP
Dataset

The Data Incubator is Now Pragmatic Data | Pragmatic Institute

Updated January 2024 We are excited to announce a series of semi-technical data courses and two new data certification programs from Pragmatic Institute. Available in 2024, these courses are designed for data professionals aiming to sharpen their skills and beginners eager to break into the data science field. Learn About Pragmatic Data Welcome to Pragmatic Data In 2019, The Data Incubator officia

xef 2014/10/18

Dataset

The Big Mac Index — Historical Data from the Economist’s Big Mac Index

The Economist has been publishing the Big Mac Index since 1986. In grad school I was studying Purchasing Power Parity and decided to use data from the Big Mac Index as part of my final paper. The problem was that the data available online only went back to the year 2000 and the years 1986 through 2000 were nowhere to be found. I went to the University of Michigan library and spent about 14 hours l

xef 2014/10/12

Dataset

Stanford Large Network Dataset Collection

Open research positions in SNAP group are available at undergraduate, graduate and postdoctoral levels. Social networks: online social networks, edges represent interactions between people. Networks with ground-truth communities: ground-truth network communities in social and information networks. Communication networks: email communication networks with edges representing communication. Citation

xef 2014/07/22

Dataset

A collection of datasets originally distributed in various R packages

Rdatasets is a collection of 2279 datasets which were originally distributed alongside the statistical software environment R and some of its add-on packages. The goal is to make these data more broadly accessible for teaching and statistical software development. What is included? The list of available datasets (csv and docs) is available here: HTML index CSV index On the github repository you wi

xef 2014/05/31

R
Dataset

xef's bookmarks about Dataset (53)
