[B! crawler] kaorun's Bookmarks

kaorun id:kaorun

kaorun's bookmarks about crawler (5)

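OpenAI explains how to block its web data-collection crawler "GPTBot"
US-based OpenAI has described how website administrators can keep its web crawler "GPTBot" from collecting data from their sites. The document is undated, but outlets including Maginative, a US online publication focused on AI, found and reported on it on August 7 (local time). GPTBot is a web crawler that collects public data for training the company's AI models. In the document, OpenAI explains the steps for blocking GPTBot's crawling. Unless a site owner adds GPTBot to robots.txt or blocks its IP addresses directly, the site's data, including data that users enter on the site, is collected as training data for AI models. Even without blocking, sources that require paywall access, sources known to collect personally identifiable information, and policy-violating te…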
kaorun 2023/08/09
openai

chatgpt

llm

internet

crawler

bot
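The robots.txt approach mentioned in the summary above can be checked locally before deploying it. The following is a minimal sketch, assuming a rule that disallows the `GPTBot` user agent for the whole site; the helper name `is_allowed` and the `example.com` URL are hypothetical and used only for illustration. It relies on Python's standard `urllib.robotparser`, which may not interpret every rule exactly as a given crawler does.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: deny GPTBot everywhere, say nothing about others.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper: parses the text in memory instead of
    downloading a live robots.txt file.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"  # placeholder URL
    print("GPTBot allowed:", is_allowed(ROBOTS_TXT, "GPTBot", page))
    print("Googlebot allowed:", is_allowed(ROBOTS_TXT, "Googlebot", page))
```

robots.txt is advisory, so this only confirms what a compliant crawler would conclude from the file; blocking by IP address, the other option the summary mentions, has to be configured at the server or firewall level.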
The person who analyzed the robots.txt files of the top one million websites, and the results
The blog post "An Analysis of the World's Leading robots.txt Files" features Ben Frederickson, who analyzed the robots.txt files of the world's top one million sites. From the results, Frederickson shares three interesting findings. Sites shown only to Googlebot: a surprising number of sites are configured to deny every bot except Googlebot. Among major sites, for example, Facebook (robots
kaorun 2017/11/24
robots.txt

crawler

webcrawler
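The "Googlebot only" pattern described in the summary above can be detected programmatically. The sketch below is an illustration, not the method used in the linked analysis: the function name `is_exclusive_to`, the probe agent `ExampleOtherBot`, and both sample robots.txt texts are assumptions. It treats a file as exclusive when the favored agent may fetch the site root while an arbitrary other agent may not.

```python
from urllib.robotparser import RobotFileParser

# Assumed sample: Googlebot is allowed everywhere, every other bot is denied.
GOOGLEBOT_ONLY = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

# Assumed sample: no restrictions for any bot.
OPEN_TO_ALL = """\
User-agent: *
Disallow:
"""


def is_exclusive_to(robots_txt: str, favored_agent: str,
                    probe_agent: str = "ExampleOtherBot") -> bool:
    """Return True if favored_agent may fetch "/" but probe_agent may not.

    probe_agent is a made-up name standing in for "any other bot".
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return (parser.can_fetch(favored_agent, "/")
            and not parser.can_fetch(probe_agent, "/"))


if __name__ == "__main__":
    print(is_exclusive_to(GOOGLEBOT_ONLY, "Googlebot"))
    print(is_exclusive_to(OPEN_TO_ALL, "Googlebot"))
```

Checking only the root path keeps the heuristic simple; a file that restricts other bots only in subdirectories would not be flagged.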
Phone + Cloud Series: Polling Stock Quotes with an Azure Worker Role | Game Theory
kaorun 2016/03/30
azure

crawler

bot

notification

workerrole
How to Write a Web Crawler in C# - ericsowell.com
A few months ago I drastically changed how the urls on my site were built. I moved to using the ASP.NET 2.0 virtual path provider to make more friendly urls. See the discussions in April 2007 if you’re interested. There were several posts that month about it. One problem with a change like this is that it can wreak havoc on your urls, especially your relative ones. Using the url rewriting features
kaorun 2016/03/30
c#

bot

crawler
The Front Lines of Web Crawling & Scraping (public version)
MySQL, PostgreSQL, and Japanese full-text search - You want to use Mroonga and PGroonga on Azure Database, don't you!? Kouhei Sutou
kaorun 2013/06/28
scraping

crawler

spider

robot