[B! scraping] manboubird's bookmarks

manboubird id:manboubird

manboubird's bookmarks about scraping (82)

X, the parent company of Twitter, sues four people for allegedly "scraping data on Twitter and causing damage," seeking more than 130 million yen in damages | au Web Portal
manboubird 2023/11/03
law

sue

twitter

scraping

crawler
Link
Trying web scraping with PlayWright Browser Toolkit | keywalker
Introduction: This post introduces PlayWright Browser Toolkit. Here we used PlayWright Browser Toolkit to do some simple web scraping. Contents: Overview; Verifying operation: controlling PlayWright from Python; PlayWright Browser Toolkit; Simple web scraping; Summary; References. 1. Overview. What is PlayWright? PlayWright (link) is a framework for web testing and automation developed by Microsoft. With PlayWright, you can control Chrome, Firefox, and WebKit from the command line. PlayWright Browser Toolkit: PlayWright Browser Toolkit (link) is LangChain (link)'s Age
manboubird 2023/10/28
playwright

scraping

crawler

openAi

chatGpt

agent

langchain
Link
GitHub - scrapinghub/article-extraction-benchmark: Article extraction benchmark: dataset and evaluation scripts
manboubird 2023/08/28
scraping

crawler

benchmark

informationExtraction
Link
GitHub - adbar/trafilatura: Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Trafilatura is a cutting-edge Python package and command-line tool designed to gather text on the Web and simplify the process of turning raw HTML into structured, meaningful data. It includes all necessary discovery and text processing components to perform web crawling, downloads, scraping, and extraction of main texts, metadata and comments. It aims at staying handy and modular: no database is
manboubird 2023/08/28
crawler

scraping

python

trafilatura
Link
shot-scraper
manboubird 2023/07/16
scraping

datasette

crawler

playwrite
Link
datasette-scraper - a plugin for Datasette
manboubird 2023/07/16
datasette

scraping

plugin
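Link
Prototyping an information extraction system using AI | keywalker - blog
This article introduces a minimum viable product of a system that uses LangChain to extract information from arbitrary URLs. When we ran extraction on specific pages, the results suggested that a reasonable level of extraction accuracy can be expected as a baseline (we also plan a quantitative evaluation on a wider variety of pages). On the other hand, extraction errors were seen for some queries. For information where extraction errors are unacceptable, such as phone numbers and stock prices, I was reminded that we need to consider operating the system strictly as extraction support, with a human in the loop. In conclusion, I would like to build a setup that combines it with conventional crawlers capable of high-accuracy extraction, so that each covers the other's weak areas. Disclaimer: The author is a natural language processing engineer who is still very much learning. I would appreciate it if you could point out any errors in the article or recommended approaches. This article, having clearly assumed a readership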
manboubird 2023/06/23
keywaker

serpapi

scraping

langchain
Link
GitHub - serpapi/google-search-results-python: Google Search Results via SERP API pip Python Package
manboubird 2023/06/19
searchEngine

serpapi

google

python

scraping

crawler
Link
Instantly create a GitHub repository to take screenshots of a web page
Instantly create a GitHub repository to take screenshots of a web page 14th March 2022 I just released shot-scraper-template, a GitHub repository template that helps you start taking automated screenshots of a web page by filling out a form. shot-scraper is my command line tool for taking screenshots of web pages and scraping data from them using JavaScript. One of its uses is to help create and m
manboubird 2023/06/11
scraping

datasette

github
Link
Airtable | Everyone's app platform
manboubird 2023/02/18
extractGpt

scraping

chatGpt

service
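Link
Ireland fines Meta 38 billion yen over data leak affecting 500 million people - Nikkei (日本経済新聞)
[London, Minoru Satake (佐竹実)] Ireland's Data Protection Commission (DPC) announced on the 28th that it will impose a fine of 265 million euros (about 38 billion yen) on US-based Meta (formerly Facebook) for failing to handle personal information appropriately. It determined that the company violated the European Union (EU) General Data Protection Regulation (GDPR). According to the BBC, the personal information of up to 533 million users had been viewable online. In April 2021, the DPC, in 2018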
manboubird 2022/12/04
regulation

meta

instagram

gdpr

facebook

scraping
Link
GitHub - twintproject/twint: An advanced Twitter scraping & OSINT tool written in Python that doesn't use Twitter's API, allowing you to scrape a user's followers, following, Tweets and more while evading most API limitations.
manboubird 2021/11/06
twint

twitter

scraping

python
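Link
[Do not misuse] A technique for distributing IP addresses when scraping with Tor – Python | Let's Hack Tech
I made a Python module that makes Tor easier to use for scraping. I created a module that makes it easier to repurpose Tor for scraping in Python. Using Tor for scraping makes it easy to get around IP-based restrictions. Web scraping with Tor: By repurposing Tor's SOCKS proxy for web scraping, you can easily change your IP address. In other words, you can send bot traffic to various websites from an IP that is not your own. When is scraping with Tor useful? Year by year, more sites are tightening their restrictions on automated web access, scraping, and bot access. For example, try pressing F5 about 20 times in a row on the site Bookoff Online. Bookoff Online has, since fairly long ago,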
manboubird 2021/08/31
tor

scraping

crawler
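Link
Deleting "unethical" AI training datasets is not enough
AI training once relied on data collected from the internet without permission. That practice was later criticized and datasets were withdrawn one after another, but withdrawal alone does not solve the problem. by Karen Hao, 2021.08.23. In 2016, Microsoft released the world's largest face database, hoping to spur advances in facial recognition. The database, called "MS-Celeb-1M," contained 10 million images of the faces of 100,000 celebrities. But the definition of "celebrity" was vague. Three years later, researchers Adam Harvey and Jules LaPlace scrutinized the dataset and found many ordinary people who are active online for their work, including journalists, artists, activists, and academics. All of them, the datab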
manboubird 2021/08/28
artificialIntelligence

crawler

scraping

dataset

privacy

research
Link
(PDF) Legality and Ethics of Web Scraping
manboubird 2021/07/01
scraping

paper
Link
https://scrapfly.io/is-web-scraping-legal
manboubird 2021/07/01
scraping
Link
Newspaper3k: Article scraping & curation — newspaper documentation
Newspaper3k: Article scraping & curation¶ Inspired by requests for its simplicity and powered by lxml for its speed: “Newspaper is an amazing python library for extracting & curating articles.” – tweeted by Kenneth Reitz, Author of requests “Newspaper delivers Instapaper style article extraction.” – The Changelog
manboubird 2020/02/23
newspapwer

scraping

python

crawler
Link
GitHub - avidLearnerInProgress/python-automation-scripts: Simple yet powerful automation stuffs.
manboubird 2020/02/23
crawler

scraping

python

webdriver
Link
GitHub - fhamborg/news-please: news-please - an integrated web crawler and information extractor for news that just works
manboubird 2020/02/23
crawler

scraping

python

news

newsPlease

scrapy
Link
80legs – Customizable Web Scraping
manboubird 2020/02/23
scraping

crawler

80legs
Link