Search body text for "scraping" - Hatena Bookmark

Search results for "scraping": 1 - 30 of 30

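"PAPERWALL" operation spreads pro-China disinformation through news sites posing as media outlets in various countries
- 133 users
- gigazine.net
- Politics & Economy
- 2024/03/16
A network of at least 123 websites operated by Chinese companies has been found to be running an operation called "PAPERWALL," which spreads pro-China disinformation and emotionally charged attacks through news sites posing as media outlets in 30 countries. PAPERWALL: Chinese Websites Posing as Local News Outlets Target Global Audiences with Pro-Beijing Content - The Citizen Lab https://citizenlab.ca/2024/02/paperwall-chinese-websites-posing-as-local-news-outlets-with-pro-beijing-content/ China is running operations to increase its influence both online and offline. Believed to be one of these
- China (中国)
- fakenews
- information warfare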
- China
- security
- politics
- world
Migrating to OpenTelemetry | Airplane
- 57 users
- www.airplane.dev
- Technology
- 2023/11/17
At Airplane, we collect observability data from our own systems as well as remote “agents” that are running in our customers’ infrastructure. The associated outputs, which include the standard “three pillars of observability” (logs, metrics, and traces) are essential for us to monitor our infrastructure and also help customers debug problems in theirs. Over the last year, we’ve made a concerted ef
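"Linkrot": The Internet Is a Library Built on Sand | p2ptk[.]org
- 49 users
- p2ptk.org
- Technology
- 2024/05/23
The following is a translation of Cory Doctorow's article "Linkrot." Pluralistic There is an underrated cognitive virtue: "object permanence," that is, continuing to remember how you previously perceived things. As Riley Quinn often reminds us, the left is the ideology of object permanence. To be on the left is to hate and distrust the CIA even when it is temporarily tormenting Trump, or to remember that workers could once support a family on their wages. https://pluralistic.net/2023/10/27/six-sells/#youre-holding-it-wrong The problem is that object permanence is hard. Time flies. It is hard to remember facts, and in what order those facts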
LogLog Games
- 36 users
- loglog.games
- Technology
- 2024/04/27
The article is also available in Chinese. Disclaimer: This post is a very long collection of thoughts and problems I've had over the years, and also addresses some of the arguments I've been repeatedly told. This post expresses my opinion that has been formed over using Rust for gamedev for many thousands of hours over many years, and multiple finished games. This isn't meant to brag or indicate su
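In response to claims that "Perplexity's AI ignores robots.txt files that block crawlers," CEO argues "we are not ignoring it, but we rely on third-party crawlers"
- 30 users
- gigazine.net
- Technology
- 2024/06/24
Perplexity, a search engine that uses generative AI, has been accused of ignoring the instructions in "robots.txt," a text file that can control bots (crawlers) used for search engines, AI training, and so on, and of accessing websites whose administrators have prohibited Perplexity from crawling. In response, Perplexity CEO Aravind Srinivas explained that "we are not ignoring robots.txt instructions" and that "we rely not only on our own crawlers but also on third-party crawlers." Perplexity AI CEO Aravind Srinivas on plagiarism accusations - Fast Company https://www.fastcompany.com/91144894/perplexity-ai-ceo-aravind-sriniv
- AI
- artificial intelligence
- illust
- trouble
- search
- search (検索)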
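Saving and comparing artifact hash values to avoid unnecessary deploys, for GitHub Actions
- 20 users
- zenn.dev/cybozu_ept
- Technology
- 2023/12/01
As the title says. This post describes how to save and compare artifact hash values in GitHub Actions so that unnecessary deploys are skipped. TL;DR Target audience: you build and deploy with GitHub Actions; you want to avoid unnecessary deploys. When building a static site, compute the hash value (sha256) of the artifacts and skip the deploy if it matches the value from the previous deploy. Command that computes a hash value for each file and then a further hash value over all of those hash values: find <path to the artifact directory> -type f -print0 | sort --zero-terminated | xargs -0 sha256sum | cut -d ' ' -f 1 | sha256sum | cut -d ' ' -f 1 For the hash value computed at deploy time, GitHub Action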
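The hash-of-hashes idea in the snippet above can be sketched in Python as follows. This is a minimal illustration, not the article's implementation: the names `artifact_digest` and `should_deploy` are hypothetical, and because Python's path ordering may differ from the locale-dependent ordering of `sort`, the digest is not guaranteed to equal the one produced by the shell pipeline. It only shows the same technique: hash every file, then hash the ordered list of per-file hashes.

```python
import hashlib
from pathlib import Path
from typing import Optional


def artifact_digest(root: str) -> str:
    """Return one SHA-256 digest summarising every file under `root`.

    Hypothetical helper: each file is hashed, and the per-file hex digests
    (in sorted path order, one per line) are hashed again.
    """
    paths = sorted(p for p in Path(root).rglob("*") if p.is_file())
    outer = hashlib.sha256()
    for path in paths:
        file_hash = hashlib.sha256(path.read_bytes()).hexdigest()
        outer.update(file_hash.encode("ascii") + b"\n")
    return outer.hexdigest()


def should_deploy(previous: Optional[str], current: str) -> bool:
    """Deploy when no previous digest is stored or the digest has changed."""
    return previous != current
```

A workflow step could call `artifact_digest` on the build output directory, compare the result with the digest stored from the last deploy, and skip the deploy when `should_deploy` returns `False`. Where the previous digest is stored is left open here.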
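Migrating Solr to the Cloud: An AWS ECS Fargate Case Study - LIVESENSE ENGINEER BLOG
- 20 users
- made.livesense.co.jp
- Technology
- 2024/02/21
Introduction I am 春日 of the Infrastructure Group in the Engineering Department. As of 2024, マッハバイト, which our company operates, has finished migrating from on-premises to the cloud, with a few exceptions. This article wraps up Apache Solr, one of the migration targets. Because this project prioritized the migration itself, we set the scope as follows: do not switch from Apache Solr to another search engine; limit application-side changes to repointing endpoints; do not upgrade Apache Solr itself; make configuration changes that reduce operational load; weigh migration speed against post-migration operating cost; add as few new components to operate as possible; keep monitoring and alerting accuracy as intact as possible. With that in mind, the following sections start by introducing where Apache Solr is used within our service, then cover infrastructure configuration, deployment, モニ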
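Getting Started with Scraping in Deno
- 16 users
- zenn.dev/ame_x
- Technology
- 2023/10/26
Introduction Do you all know about scraping? Scraping is defined as follows: web scraping is a computer software technique for extracting information from websites. Such software programs usually obtain WWW content by implementing low-level HTTP or by embedding a web browser. In short, rather than fetching from a browser with Fetch or the like, it means sending HTTP requests from Python or Cpp and parsing the responses to obtain a site's information. In Python there are BeautifulSoup, Requests, Selenium, and others. You can parse the HTML source of the response as a DOM to obtain information. Deno is a well-known heavyweight among JavaScript runtimes. It is better at DOM parsing than Python. アプロ
- programming
- read later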
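The "parse the response and pull out information" step described in the entry above can be sketched with Python's standard library alone. This is an illustrative sketch, not code from the article: `LinkCollector`, `extract_links`, and `SAMPLE_HTML` are hypothetical names, and the HTML is a hard-coded string standing in for the body of an HTTP response so that the example runs without network access.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag seen while parsing."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def extract_links(html: str) -> list[str]:
    """Return all link targets found in an HTML document."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Stand-in for a downloaded page (assumption: real code would fetch this).
SAMPLE_HTML = """
<html><body>
  <h1>Example</h1>
  <a href="/first">First</a>
  <p>No link here</p>
  <a href="https://example.com/second">Second</a>
</body></html>
"""

print(extract_links(SAMPLE_HTML))
```

In a real scraper the string would come from an HTTP client instead of a constant, and a fuller DOM library would usually replace the hand-written parser.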
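The Reverse-Centaur Problem: Can Humans Keep Monitoring AI? (No) | p2ptk[.]org
- 10 users
- p2ptk.org
- Technology
- 2024/04/19
The following is a translation of Cory Doctorow's article "Humans are not perfectly vigilant." Pluralistic Here is a funny AI story. A security researcher noticed that a large company's AI-generated source code repeatedly referenced libraries that do not exist (AI "hallucinations"), then created and uploaded a malicious (harmless) library under that name. Thousands of developers then automatically downloaded and incorporated the library when compiling that code. https://www.theregister.com/2024/03/28/ai_bots_hallucinate_software_packages/ These "hallucinations" are an ineradicable feature of large language models. That is because AI models only pretend to understand, and in reality 高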
- ai
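The New York Times bans use of its articles for AI training in principle, considers legal action against OpenAI
- 9 users
- gigazine.net
- Technology
- 2023/08/18
The New York Times, a major American newspaper, changed its terms of service on August 3, 2023, deciding to prohibit, in principle, the unauthorized use of its articles, photos, and other content for AI development. As the debate over AI training and copyright infringement heats up, The New York Times is reportedly considering legal action against OpenAI, the developer of the chat AI "ChatGPT." Terms of Service – Help https://help.nytimes.com/hc/en-us/articles/115014893428-Terms-of-Service New York Times considers legal action against OpenAI as copyright tensions swirl : NPR https://www.npr.org/2023/08/16/11942025
- AI
- law and ethics
- OpenAI
- ChatGPT
- artificial intelligence
- GIGAZINE
- news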
Largest Dataset Powering AI Images Removed After Discovery of Child Sexual Abuse Material
- 9 users
- www.404media.co
- Technology
- 2023/12/20
AI Largest Dataset Powering AI Images Removed After Discovery of Child Sexual Abuse Material The model is a massive part of the AI-ecosystem, used by Stable Diffusion and other major generative AI products. The removal follows discoveries made by Stanford researchers, who found thousands of instances of suspected child sexual abuse material in the dataset. This piece is published with support from Th
- read later
Top 12 OSINT Tools for the Dark Web
- 9 users
- infosecwriteups.com
- Technology
- 2023/11/13
1) TORBOT This tool is an OSINT resource designed specifically for the dark web. Crafted using Python, its primary aim is to systematically gather comprehensive information using data mining algorithms. Its capabilities extend to meticulous data retrieval and the generation of a tree graph, enabling in-depth exploration. Operating as an Onion Crawler (.onion), it extracts page titles, site address
- read later
Image-generation AI Midjourney detects "mass collection of prompts and images by a bot from Stable Diffusion's developer" and permanently bans the accounts involved
- 8 users
- gigazine.net
- Technology
- 2024/03/12
Midjourney, which develops and operates an image-generation AI, has reportedly banned accounts owned by employees of Stability AI, developer of the competing AI Stable Diffusion, from its service indefinitely. Midjourney explains that this is because Stability AI employees are suspected of data scraping, using bots to collect large numbers of prompt-image pairs. Midjourney bans all Stability AI employees over alleged data scraping - The Verge https://www.theverge.com/2024/3/11/24097495/midjourney-bans-stability-ai-employees-data-theft-outage Image-scraping Midjou
- trouble
- ai
- pun
I Tested Whether Selenium Really Is That Easy to Detect - Qiita
- 8 users
- qiita.com/Guz9N9KLASTt
- Technology
- 2023/08/12
Purpose: I previously learned from this article that scraping is detected right away. I wanted to see whether that is really true, so I tried it. Verification steps: build a simple web page, then scrape it and check the behavior. Environment setup: anything would do, but as a trial I set up the environment with React. npx create-react-app check-scraping cd check-scraping code . npm run start import React, { useEffect } from 'react'; function App() { useEffect(() => { if (window.navigator.webdriver) { alert("Webdriverを検出しました"); } }, []); return ( <div className="App"> <h1>WebDriver
The state of HTTP clients, or why you should use httpx · honeyryder
- 8 users
- honeyryderchuck.gitlab.io
- Technology
- 2023/10/17
The state of HTTP clients, or why you should use httpx 15 Oct 2023 TL;DR most http clients you’ve been using since the ruby heyday are either broken, unmaintained, or stale, and you should be using httpx nowadays. Every year, a few articles come out with a title similar to “the best ruby http clients of the year of our lord 20xx”. Most of the community dismisses them as clickbait, either because o
- ruby
- HTTP
Efforts are underway to block "GPTBot," the web crawler OpenAI uses to collect content from the internet
- 7 users
- gigazine.net
- Technology
- 2023/08/14
OpenAI, developer of the conversational AI ChatGPT, published details in August 2023 about "GPTBot," a web crawler that collects from the internet the datasets needed to train large language models. The online documentation for GPTBot also describes how to prevent GPTBot from collecting content, and some websites have reportedly already moved to block GPTBot. Now you can block OpenAI’s web crawler - The Verge https://www.theverge.com/2023/8/7/23823046/openai-data-scrape-block-ai OpenAI launches web crawling GPTBot, sparking blocking effort by website
- OpenAI
- artificial intelligence
- data
- AI
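As a rough illustration of how a robots.txt rule blocks a named crawler such as GPTBot, the sketch below feeds an example robots.txt to Python's standard `urllib.robotparser` and asks whether a given user agent may fetch a URL. The file contents, the `is_allowed` helper, and the `example.com` URLs are assumptions made for illustration. A crawler is only blocked this way if it chooses to honour the file.

```python
from urllib.robotparser import RobotFileParser

# Assumed example file: one group that disallows everything for GPTBot.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/article"))
print(is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/article"))
```

With this file, the first call reports that GPTBot is disallowed, while the second reports that an agent with no matching group is allowed, since the file contains no wildcard (`*`) group.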
GitHub - adbar/trafilatura: Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
- 6 users
- github.com/adbar
- Technology
- 2023/08/15
Trafilatura is a cutting-edge Python package and command-line tool designed to gather text on the Web and simplify the process of turning raw HTML into structured, meaningful data. It includes all necessary discovery and text processing components to perform web crawling, downloads, scraping, and extraction of main texts, metadata and comments. It aims at staying handy and modular: no database is
- Python
- OSS
- text
- tool
- web
AWS Lambda for the containers developer | Amazon Web Services
- 5 users
- aws.amazon.com
- Technology
- 2023/09/13
Amazon Web Services Blog AWS Lambda for the containers developer This article is a translation of "AWS Lambda for the containers developer" (published May 9, 2023). Introduction When building applications on AWS, one of the common decisions customers face is whether to build on AWS Lambda or on a container service such as Amazon Elastic Container Service (Amazon ECS) or Amazon Elastic Kubernetes Service (Amazon EKS). Making this decision involves many factors to consider, such as cost, scaling characteristics, and how much control developers have over hardware options. ファン
PAPERWALL: Chinese Websites Posing as Local News Outlets Target Global Audiences with Pro-Beijing Content - The Citizen Lab
- 4 users
- citizenlab.ca
- Technology
- 2024/02/11
Key Findings A network of at least 123 websites operated from within the People’s Republic of China while posing as local news outlets in 30 countries across Europe, Asia, and Latin America, disseminates pro-Beijing disinformation and ad hominem attacks within much larger volumes of commercial press releases. We name this campaign PAPERWALL. PAPERWALL has similarities with HaiEnergy, an influence
- read later
How bad are search results? Let's compare Google, Bing, Marginalia, Kagi, Mwmbl, and ChatGPT
- 4 users
- danluu.com
- Technology
- 2023/12/31
Marginalia does relatively well by sometimes providing decent but not great answers and then providing no answers or very obviously irrelevant answers to the questions it can't answer, with a relatively low rate of scams, lower than any other search engine (although, for these queries, ChatGPT returns zero scams and Marginalia returns some). Interestingly, Mwmbl lets users directly edit search res
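Summarizing blogs and news on the web with Agents for Amazon Bedrock - Qiita
- 4 users
- qiita.com/nasuvitz
- Technology
- 2023/12/11
Introduction Until now, the mainstream way to use generative AI to summarize long texts or get a take on web articles (blogs, news, and so on) was OpenAI's Function Calling. With the release of "Agents for Amazon Bedrock" at AWS re:Invent 2023, nearly the same functionality can now be achieved entirely within Amazon Bedrock. As the title says, this article explains how to build a system that summarizes blogs and news on the web using "Agents for Amazon Bedrock." When retrieving (scraping) articles or files from the web, take care not to infringe the rights of the original authors or violate terms of service. <Reference> https://pig-data.jp/blog_news/blog/scraping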
Puppeteer in Node.js: Common Mistakes to Avoid | AppSignal Blog
- 4 users
- blog.appsignal.com
- Technology
- 2023/08/10
Puppeteer is a powerful Node.js browser automation library for integration testing and web scraping. However, like any complex software, it comes with plenty of potential pitfalls. In this article, I'll discuss a variety of common Puppeteer mistakes I've encountered in personal and consulting projects, as well as when monitoring the Puppeteer tag on Stack Overflow. Once you're aware of these probl
- article
"Spy Pet," which spied on more than 600 million Discord users, shuts down; Discord considers legal action
- 3 users
- gigazine.net
- Technology
- 2024/04/30
"Spy.pet," which extracted and sold more than 4 billion messages and the data of about 620 million users from Discord, has been shut down. Discord has announced that it has suspended accounts associated with Spy.pet and is considering legal action. Discord Shuts Down ‘Spy Pet’ Bots That Scraped, Sold User Messages https://www.404media.co/discord-shuts-down-spy-pet-bots-that-scraped-sold-user-messages/ Discord drops the hammer on data-scraping 'Spy.pet' website, says it is 'considering appropriate legal action' | PC
Declare your AIndependence: block AI bots, scrapers and crawlers with a single click
- 3 users
- blog.cloudflare.com
- Technology
- 2024/07/04
We see website operators completely block access to these AI crawlers using robots.txt. However, these blocks are reliant on the bot operator respecting robots.txt and adhering to RFC9309 (ensuring variations on user agent all match the product token) to honestly identify who they are when they visit an Internet property, but user agents are trivial for bot operators to change. How we find AI bo
- artificial intelligence
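The Uncanniness of AI "Art": AI Is Not "Taking Jobs" | p2ptk[.]org
- 3 users
- p2ptk.org
- Technology
- 2024/05/29
The following is a translation of Cory Doctorow's article "AI “art” and uncanniness." Pluralistic When it comes to AI art (or "art"), it is hard to find a nuanced position that respects creative workers' labor rights, free expression, copyright law's vital exceptions and limitations, and aesthetics. On balance, I am opposed to AI art, but that position comes with important caveats. First of all, saying that scraping works to train a model is copyright infringement is, as a matter of law, plainly wrong. This is not from a moral standpoint (more on that later) but from a technical one. Break down the steps of training a model and it quickly becomes clear why calling this copyright infringement is technically wrong. First, the act of making temporary copies of works, even billions of works, is clearly fair use. Search engines and
- copyright
- ai
- labor
- companies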
Nitter Instance Health
- 3 users
- status.d420.de
- Society
- 2023/08/15
About Please use the API for bots. Please do NOT use these instances for scraping, host nitter yourself. Last Updated 2024.04.07 03:27 UTC. Customize the visible columns down below. Instance Country Healthy Health History Average Time All Time % RSS Nitter Version Connectivity Points
Create an Azure OpenAI, LangChain, ChromaDB, and Chainlit chat app in AKS using Terraform
- 3 users
- techcommunity.microsoft.com
- Technology
- 2024/01/09
In this sample, I demonstrate how to quickly build chat applications using Python and leveraging powerful technologies such as OpenAI ChatGPT models, Embedding models, LangChain framework, ChromaDB vector database, and Chainlit, an open-source Python package that is specifically designed to create user interfaces (UIs) for AI applications. These applications are hosted in an Azure Kubernetes Servi
- Azure
Understanding the Polyfill Attack (Polykill)
- 3 users
- pulse.latio.tech
- Technology
- 2024/07/01
Supply chain threats are growing. Most concerningly, it seems more and more like we’re dealing with nation level threats taking over small unmaintained open source projects. Once again, I’ve got to start by talking about Tidelift being the only company focusing on the real problem here - helping companies treat maintainers like the contractors/vendors they are. If maintainers had any financial ben
- security
Introduction to Machine Learning
- 3 users
- www.ejable.com
- Technology
- 2023/11/14
Machine Learning is making a buzz in the industry. And it’s the right time to get familiar with it. Let’s get the basics right. Let’s get started. What is Machine Learning What the heck is machine learning? If I had to quote it in a single sentence, I would say, ‘Machine Learning is a way to find a pattern in data to predict the future.’ The above is not the only definition of machine learning. The
GitHub - fr0gger/Awesome-GPT-Agents: A curated list of GPT agents for cybersecurity
- 3 users
- github.com/fr0gger
- Technology
- 2024/02/01
The "Awesome GPTs (Agents) Repo" represents an initial effort to compile a comprehensive list of GPT agents focused on cybersecurity (offensive and defensive), created by the community. Please note, this repository is a community-driven project and may not list all existing GPT agents in cybersecurity. Contributions are welcome – feel free to add your own creations! Disclaimer: Users should exerci
- ai
- github