[B! crawler] jun_okuno's bookmarks

On the livedoor Reader crawler, the Streaming API, and related topics


jun_okuno 2011/10/18

api
crawler

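Link

Techniques and design for building crawlers (materials from a weekly hands-on study group)

These are the presentation slides from the memorable first PHP Conference. The material dates from 2000, so it is technically past its shelf life, but it is here for anyone who simply wants to feel nostalgic.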

jun_okuno 2011/10/18

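Link

A framework for building crawlers · Anemone MOONGIFT

As terms like RSS feeds, Web APIs, and mashups draw attention, it has become common to gather data from external websites with a web crawler, analyze it, and turn it into another form. [Figure: specify a URL and list the URLs linked from it] Among the many such systems, the crawler at the core does not differ much: it fetches a website's data, identifies the next links, and fetches those in turn. Anemone is a framework that factors out this shared behavior. The open source software introduced this time is Anemone, a framework for developing web crawlers. Anemone is a web crawler that accesses any website and analyzes its content. For example, it can easily retrieve a list of the links found at a given URL. It can also tell whether a link points to an external site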

jun_okuno 2009/07/08
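
The behavior described in the Anemone entry above (list the links found at a URL and tell internal ones from external ones) can be sketched roughly as follows. This is a minimal Python illustration using only the standard library, not Anemone's actual Ruby API. The names `LinkCollector` and `list_links` and the example hosts are hypothetical, and the HTML is passed in as a string so that no network access is needed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def list_links(base_url, html):
    """Return (absolute_url, is_external) pairs for the links in html."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    links = []
    for href in collector.hrefs:
        # Resolve relative links against the page they were found on.
        absolute = urljoin(base_url, href)
        # Assumption: "external" simply means a different host than the page.
        is_external = urlparse(absolute).netloc != base_host
        links.append((absolute, is_external))
    return links


# Hypothetical page content used only for illustration.
sample_html = '<a href="/about">About</a> <a href="http://other.example/x">Other</a>'
for url, external in list_links("http://site.example/index.html", sample_html):
    print(url, "external" if external else "internal")
```

A real crawler would feed the internal links back into a queue of pages to fetch next; that loop is left out here to keep the sketch short.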

Link

https://labs.cybozu.co.jp/blog/kazuho/archives/2008/04/q4m_crawler.php

jun_okuno 2008/06/11

perl
crawler

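Link

Crawlers go distributed computing too · Grub MOONGIFT

Via Open Tech Press | US-based Wikia acquires a distributed web crawling tool and open-sources it. Distributed computing is an interesting approach. Older examples include SETI@HOME and UD Agent. As computers grow more powerful and their numbers surge, it is quite possible that their utilization is actually falling. Now web crawlers, too, have thrown their hat into the distributed computing ring. The open source software introduced this time is Grub, a web crawler that uses distributed computing. Note that although it is said to be going open source, the currently distributed version is still under Looksmart's license. Grub is available for Windows and Linux, and once installed it stays resident in the system tray. It crawls while the PC is not in use.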

jun_okuno 2008/06/11

[*?][windows][linux] Crawler. A web crawler that uses distributed computing. Note the proprietary license!

crawler

Link

Open Source Crawlers in Java

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. WebSPHINX ( Website-Specific Processors for HTML INformation eXtraction) is a Java class library and interactive development environment for Web crawlers that browse and process Web pages automatically.

jun_okuno 2008/06/11

crawler

Link

Grub | Help crawl it all


jun_okuno 2008/06/11

crawler

Link

Hyper Estraier: a full-text search system for communities


jun_okuno 2008/06/11

The massive-scale Wikipedia search really shows how impressive it is. http://athlon64.fsij.org/~mikio/wikipedia/estseek.cgi

crawler
oss

Link

crawler: crawler.dev.java.net

Overview: What is the Smart and Simple Web Crawler? A smart and easy framework that crawls a web site. Integrated Lucene support. It's simple to integrate the framework into your own applications. The crawler can start from one link or from a list of links. Two crawling models are available: Max Iterations: crawls a web site through a limited number of links; a fast model with a small memory footprint and CPU usage. Max

jun_okuno 2008/06/11

crawler

Link

RubyForge: Rcrawl: Project Info

Rcrawl is a web crawler written in Ruby. Development Status: 3 - Alpha; Environment: Console (Text Based); Intended Audience: Developers, System Administrators; License: MIT/X Consortium License; Natural Language: English; Operating System: OS Independent; Programming Language: Ruby; Topic: Indexing/Search. Registered: 2006-09-20 00:49. Activity Percentile: 0%.

jun_okuno 2008/06/11

ruby
crawler

Link

A survey of crawlers for Python - takanory.net

I needed this for work, so I looked into crawlers (tools that collect large numbers of web pages) that run on Python. First I searched the Python Cheese Shop with the keyword "crawler". The following turned up: HarvestMan 1.4.6 final: Multithreaded Offline Browser/Web Crawler; Orchid 1.0: Generic Multi Threaded Web Crawler; spider.py 0.5: Multithreaded crawling, reporting, and mirroring for Web and FTP; webstemmer 0.6.0: A web crawler and HTML layout analyzer; SpideyAgent 0.75: Each use

jun_okuno 2008/06/11

Link

Manageability - Open Source Web Crawlers Written in Java

I was recently quite pleased to learn that the Internet Archive's new crawler is written in Java. Coincidentally, in addition to putting together a list of open source projects for full-text search engines, I put together a list of crawlers written in Java to complement that list. Here's the list: Heritrix - Heritr

jun_okuno 2008/06/11

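Link

mixi Engineers’ Blog » Behind the scenes of the new RSS Crawler

This is Masahiro Nagano (kazeburo), writing on this blog for the first time. I handle application operations in the operations group of mixi's development department. Since December 12, mixi's RSS crawler has been improved, and many of you have probably noticed that external blogs are now reflected far faster than before. I would like to write about what is behind this improved RSS crawler. About the previous crawler: it was designed as follows. cron launched a program called the broker; the broker fetched every record from the member DB, incrementing the id as it went, and launched (forked) a crawler whenever an external blog was configured; the crawler fetched the RSS, stored it in the DB, and exited. The problems with this design were the wasteful full scan of the member DB and, because a crawler was launched for each record one at a time, the over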

jun_okuno 2008/05/08

crawler
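
The previous mixi crawler design described above (a broker that walks every member id and starts a crawler wherever an external blog is set) can be sketched roughly as follows. This is not mixi's code: the in-memory `MEMBER_DB` dictionary, the `crawl` stand-in, and the feed URLs are all hypothetical, and the crawler is called in-process rather than forked. The sketch only makes the cost of the full scan visible by counting scanned ids against crawlers actually started.

```python
# Hypothetical stand-in for the member DB: id -> external blog feed URL (or None).
MEMBER_DB = {
    1: None,
    2: "http://blog.example/a.rss",
    3: None,
    4: "http://blog.example/b.rss",
}


def crawl(feed_url, store):
    """Stand-in for the crawler: pretend to fetch the feed and store it."""
    store[feed_url] = f"contents of {feed_url}"


def broker(member_db, max_id, store):
    """Walk every id from 1 to max_id and crawl members that have a feed.

    Returns (ids_scanned, crawlers_started).
    """
    scanned = 0
    started = 0
    for member_id in range(1, max_id + 1):
        scanned += 1
        feed_url = member_db.get(member_id)
        if feed_url:
            # The design described above forked a separate crawler here.
            crawl(feed_url, store)
            started += 1
    return scanned, started


store = {}
scanned, started = broker(MEMBER_DB, 4, store)
print(f"scanned {scanned} ids, started {started} crawlers")
```

Every id is visited even though only some members have a feed, which is the wasteful full scan the entry mentions.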

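Link

MOONGIFT: » "InfoCrawler", a crawling system written in Java: Introducing open source every day

When building a web service, you often want to fetch external data and do something with it. And it is not limited to external data: wanting to fetch and search local data is a common request as well. [Figure: user-side search screen] In such cases you might write your own crawler, but learning how to interpret robots.txt and how to crawl efficiently is hard work. That is where this is worth trying. The open source software introduced this time is InfoCrawler, a web crawler written in Java. InfoCrawler has many configuration options and looks likely to make an excellent crawling system. It also appears that it can be distributed across multiple servers. You can specify, by file type (HTML, images, various binaries, and so on), whether or not to crawl. [Figure: screen for specifying the files to index] It also supports servers that require authentication, and filtering by language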
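
As a small illustration of the robots.txt interpretation mentioned in the InfoCrawler entry above, here is a minimal Python sketch using the standard library's `urllib.robotparser`. It is unrelated to InfoCrawler's own Java implementation. The robots.txt content, the `example-crawler` user agent, and the `allowed` helper are assumptions made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def allowed(url, user_agent="example-crawler"):
    """Return True if the parsed robots.txt rules permit fetching url."""
    return parser.can_fetch(user_agent, url)


print(allowed("http://site.example/index.html"))
print(allowed("http://site.example/private/data.html"))
```

A crawler would call a check like `allowed` before each request and skip any URL that the rules disallow.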