[B! crawler] Hayato's bookmarks

Implementing a cloud-based language with Lambda

A story about getting stuck while trying to invoke Lambda asynchronously from API Gateway on AWS to upload a file to S3.

Hayato 2016/08/27

I used to daydream about building a crawler like this.

Released "cosmicrawler", a library for writing concurrent crawlers cleanly in Ruby, as a gem - koeだめ過去アーカイブ[〜2013-12-14]

http://rubygems.org/gems/cosmicrawler The source is at https://github.com/bash0C7/cosmicrawler. After running gem install cosmicrawler, or adding gem 'cosmicrawler' to your Gemfile and running bundle install, you can write
    require 'cosmicrawler'
    Cosmicrawler.http_crawl(%w(http://example.com/1 http://example.com/2)) {|request|
      get = request.get
      puts get.response if get.response_header.status == 200
    }
and, just by passing a block like this, crawl with 8 concurrent requests by default. If you want more concurrency

Hayato 2013/03/13

crawler
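
The "8 concurrent requests by default" behaviour described in the cosmicrawler entry above can be approximated with plain threads. What follows is a minimal sketch using only the Ruby standard library, not cosmicrawler's actual implementation; `crawl_concurrently`, `fetch_status`, and the `concurrency:` parameter are hypothetical names introduced for illustration.

```ruby
require "net/http"
require "uri"

# Hypothetical helper (not part of cosmicrawler): calls the block once per URL,
# using at most `concurrency` worker threads, and returns { url => block result }.
def crawl_concurrently(urls, concurrency: 8, &block)
  queue = Queue.new
  urls.each { |url| queue << url }
  queue.close # once drained, pop returns nil instead of blocking

  results = {}
  lock = Mutex.new
  workers = Array.new([concurrency, urls.size].min) do
    Thread.new do
      while (url = queue.pop)
        value = block.call(url)
        lock.synchronize { results[url] = value }
      end
    end
  end
  workers.each(&:join)
  results
end

# Hypothetical fetcher: returns the HTTP status code for a URL.
def fetch_status(url)
  Net::HTTP.get_response(URI(url)).code.to_i
end

# Usage (performs real HTTP requests, so it is left commented out):
# statuses = crawl_concurrently(%w(http://example.com/1 http://example.com/2)) { |url| fetch_status(url) }
# statuses.each { |url, status| puts "#{url} #{status}" }
```

The worker count is capped at the number of URLs, so a short list does not spawn idle threads; the closed queue lets each worker exit cleanly once the work runs out.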

Google Code Archive - Long-term storage for Google Code Project Hosting.

Hayato 2010/02/12

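Full-Text Search System: Fess - Open Source Full-Text Search Server Fess

Overview: The site currently displayed is the old site. The new site is http://fess.codelibs.org/ja/. Fess is a "full-text search server that can be set up easily in 5 minutes". It runs on any OS that has a Java runtime environment. Fess is provided under the Apache License and is free to use (free software). It is built on Seasar2, and its search engine component uses Solr, which is said to be able to index 200 million documents. By using S2Robot for document crawling, it can crawl the web and file systems, and it can also make MS Office documents and compressed files such as zip searchable. Features: a full-text search server that can be set up easily in 5 minutes; provided under the Apache License (free software); OS-independent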

Hayato 2009/11/17

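Building an Apache Solr-based full-text search server with Fess: Introduction

Introduction: Documents keep increasing day by day. The more documents there are, the harder it becomes to find the information you need, so you need a way to manage them efficiently. One solution is to introduce a "full-text search server" that can search across multiple documents (files). Fess is an easy-to-install, Java-based open source full-text search server. Fess uses Apache Solr for its search engine component. Solr is a highly capable search engine that is said to be able to index 200 million documents. On the other hand, when building a search system with Apache Solr, you need to implement parts such as the crawler yourself. Fess uses S2Robot, provided by the Seasar Project, as its crawler, so that various kinds of documents on the web and on file systems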

Hayato 2009/11/14

I didn't know something like s2robot existed.

solr
crawler

Overview - Anemone - chriskite

skip_links_like example was updated by Alex Johnson Friday Nov 18 #36 / new ticket New to Ruby: Need basic Toturial on How to use anemone to crawl a site? was updated by Alex Johnson 02:54 AM #35 / new ticket New to Ruby: Need basic Toturial on How to use anemone to crawl a site? was updated by Alex Johnson 02:52 AM #35 / new ticket New to Ruby: Need basic Toturial on How to use anemone to crawl a

Hayato 2009/09/08

GitHub - chriskite/anemone at master

Hayato 2009/09/08

ruby
crawler

fizx's robots at master - GitHub

A simple Ruby library to parse robots.txt. Usage:
    robots = Robots.new "Some User Agent"
    assert robots.allowed?("http://www.yelp.com/foo")
    assert !robots.allowed?("http://www.yelp.com/mail?foo=bar")
    robots.other_values("http://foo.com") # gets misc. key/values (i.e. sitemaps)
If you want caching, you're on your own. I suggest marshalling an instance of the parser. Copyright (c) 2008 Kyle Maxwell, c

Hayato 2009/09/08

robots.txt parser

ruby
crawler
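
To illustrate what a check like `allowed?` in the robots.txt entry above has to do, here is a minimal sketch using only the standard library. It is not the fizx/robots implementation: it assumes simple prefix matching, handles only `User-agent` and `Disallow` lines, and the function names and the sample robots.txt content are hypothetical.

```ruby
require "uri"

# Hypothetical helper: returns the Disallow prefixes that apply to user_agent.
# A group naming the agent wins over the "*" group.
def disallowed_prefixes(robots_txt, user_agent)
  rules = {}
  current_agents = []
  last_was_agent = false
  robots_txt.each_line do |line|
    line = line.sub(/#.*/, "").strip # drop comments and surrounding whitespace
    next if line.empty?
    field, value = line.split(":", 2).map(&:strip)
    next if value.nil?
    case field.downcase
    when "user-agent"
      # Consecutive User-agent lines share one group; otherwise start a new one.
      current_agents = [] unless last_was_agent
      agent = value.downcase
      current_agents << agent
      rules[agent] ||= []
      last_was_agent = true
    when "disallow"
      current_agents.each { |a| rules[a] << value } unless value.empty?
      last_was_agent = false
    else
      last_was_agent = false # other fields (e.g. Sitemap) are ignored here
    end
  end
  ua = user_agent.downcase
  agent = rules.keys.find { |a| a != "*" && ua.include?(a) }
  agent ? rules[agent] : rules.fetch("*", [])
end

# Hypothetical helper: true when no Disallow prefix matches the URL's path and query.
def allowed?(robots_txt, user_agent, url)
  path = URI(url).request_uri
  disallowed_prefixes(robots_txt, user_agent).none? { |prefix| path.start_with?(prefix) }
end

# Sample robots.txt content, made up for this example.
SAMPLE_ROBOTS = <<~TXT
  User-agent: *
  Disallow: /mail

  User-agent: GoodBot
  Disallow:
TXT

puts allowed?(SAMPLE_ROBOTS, "Some User Agent", "http://example.com/foo")
puts allowed?(SAMPLE_ROBOTS, "Some User Agent", "http://example.com/mail?foo=bar")
```

A real parser would also need `Allow` lines, wildcards, and fetching and caching the file per host, all of which this sketch leaves out.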

Anemone - Ruby Web-Spider Framework

An easy-to-use Ruby web spider framework What is it? Anemone is a Ruby library that makes it quick and painless to write programs that spider a website. It provides a simple DSL for performing actions on every page of a site, skipping certain URLs, and calculating the shortest path to a given page on a site. The multi-threaded design makes Anemone fast. The API makes it simple. And the expressive

Hayato 2009/07/20

crawler
ruby
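
The Anemone description above mentions calculating the shortest path to a given page on a site. One common way to do that is a breadth-first search over the link graph; the sketch below shows the idea on an in-memory hash. It is not Anemone's code or API, and `shortest_path` and the sample link data are hypothetical.

```ruby
# Breadth-first search over a link graph given as { page => [linked pages] }.
# Returns the list of pages from start to goal, or nil if goal is unreachable.
def shortest_path(links, start, goal)
  previous = { start => nil } # also serves as the "already seen" set
  queue = [start]
  until queue.empty?
    page = queue.shift
    if page == goal
      # Walk the predecessor chain back to the start page.
      path = []
      while page
        path.unshift(page)
        page = previous[page]
      end
      return path
    end
    links.fetch(page, []).each do |nxt|
      next if previous.key?(nxt)
      previous[nxt] = page
      queue << nxt
    end
  end
  nil
end

# Made-up site structure for illustration.
links = {
  "/" => ["/a", "/b"],
  "/a" => ["/c"],
  "/b" => ["/c", "/d"],
  "/c" => ["/d"]
}
p shortest_path(links, "/", "/d")
```

Because breadth-first search visits pages in order of link distance, the first time the goal is dequeued the recorded chain is a shortest one.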

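[How-to] Drive a web browser from Java! Let's try WebDriver (1) What is WebDriver | Enterprise | マイコミジャーナル

What is WebDriver? WebDriver is a Java library for controlling web browsers. It is intended to be used as a UI testing tool for web applications, and it is effective for testing applications that make heavy use of JavaScript to provide a rich UI. The supported browsers are Firefox, Safari (Mac OS X only), and Internet Explorer (Windows only). It is also possible to use HtmlUnit instead of a real browser. In that case, Rhino (a JavaScript engine implemented in Java) can be used to emulate the behavior of JavaScript that runs in the browser. A driver for the iPhone also appears to be under experimental development. Similar testing tools such as Selenium already exist and are used by many users.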

Hayato 2009/05/26

org.archive.crawler.extractor (Heritrix 1.15.5-201106092337)

Hayato 2009/05/26

Types of extractors: css, word, html, http, js, pdf, ...
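
One heuristic a JavaScript extractor can use is to scan script code for string literals that look like absolute or relative URIs and treat them as outlinks. The sketch below shows that idea with regular expressions. It is not Heritrix's implementation (Heritrix is written in Java); the patterns, the list of file extensions, and the names `extract_likely_uris` and `resolve` are assumptions made for illustration.

```ruby
require "uri"

# Quoted string literals without whitespace, quotes, or backslashes inside.
STRING_LITERAL = /"([^"\\\s]+)"|'([^'\\\s]+)'/

# Assumed heuristic: starts with http(s)://, /, ./ or ../, or ends in a
# file extension that commonly appears in links.
LIKELY_URI = %r{\A(?:https?://|(?:\.\.?)?/)\S+\z|\A[\w\-./]+\.(?:html?|js|css|php|pdf)\z}

# Resolves ref against base; returns nil when the result is not a valid URI.
def resolve(base, ref)
  URI.join(base, ref).to_s
rescue URI::InvalidURIError
  nil
end

# Returns the absolute URIs guessed from string literals in script.
def extract_likely_uris(script, base)
  script.scan(STRING_LITERAL)
        .map { |double_quoted, single_quoted| double_quoted || single_quoted }
        .select { |literal| literal.match?(LIKELY_URI) }
        .map { |literal| resolve(base, literal) }
        .compact
        .uniq
end

# Made-up script snippet for illustration.
js = %q{var logo = "/img/logo.png"; load('page2.html'); var msg = "hello";}
puts extract_likely_uris(js, "http://example.com/app/index.html")
```

A heuristic like this trades precision for recall: it will pick up some strings that are not links, which a crawler then discovers when the request fails.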

Log in with Atlassian account

Hayato 2009/05/26

"It does scan script code for strings that appear likely to be absolute or relative URIs, and will treat these the same as other discovered outlinks."

Log in with Atlassian account

Hayato 2009/05/24

A Crawler in Ruby - BitArts Blog

I wanted a crawler that would thoroughly collect not just links but also forms, images, and frames, but it seems wget can't do that, so I decided to write my own. I thought collecting form fields would be a bit of a pain, but using a library called WWW::Mechanize made it super easy. Viva Mechanize!
    require "rubygems"
    require "mechanize"
    class CrawlerListener
      def notify_begin
      end
      def pre_request
      end
      def notify_response(result)
        puts %Q{#{result[:method]} #{result[:uri]} #{result[:query] ? result[:query].inspect : ""}}
      end
      def post_

Hayato 2009/05/12

ruby
crawler
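
The excerpt above defines a listener class whose callbacks the crawler invokes as it works. As a rough sketch of how such a listener can plug into a crawl loop, the example below uses only the three callbacks visible in the excerpt and a stand-in fetcher instead of Mechanize. The `crawl` method, the `RecordingListener` class, and the fake site data are hypothetical and are not the blog author's code.

```ruby
# Listener with the callbacks visible in the excerpt; the bodies are illustrative.
class RecordingListener
  attr_reader :events

  def initialize
    @events = []
  end

  def notify_begin
    @events << "begin"
  end

  def pre_request
  end

  def notify_response(result)
    @events << "#{result[:method]} #{result[:uri]}"
  end
end

# Hypothetical crawl loop. `fetch` is any callable that takes a URI string and
# returns the list of links found there. Returns the URIs in discovery order.
def crawl(start, listener, fetch)
  listener.notify_begin
  seen = { start => true }
  queue = [start]
  until queue.empty?
    uri = queue.shift
    listener.pre_request
    links = fetch.call(uri)
    listener.notify_response({ method: "GET", uri: uri, query: nil })
    links.each do |link|
      next if seen[link]
      seen[link] = true
      queue << link
    end
  end
  seen.keys
end

# A made-up site stands in for real HTTP requests.
site = { "/" => ["/a", "/b"], "/a" => ["/"], "/b" => ["/c"], "/c" => [] }
listener = RecordingListener.new
crawl("/", listener, ->(uri) { site.fetch(uri, []) })
puts listener.events
```

Passing the fetcher in as a callable keeps the loop testable without a network; a Mechanize-based fetcher could be swapped in at the same place.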

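MOONGIFT: » "InfoCrawler", a crawling system written in Java: introducing open source every day

When building a web service, you often want to fetch external data and do something with it. And it isn't limited to external data: there is often a need to fetch even local data and search it. [Image: user-side search screen] In such cases you might write your own crawler, but learning how to interpret robots.txt and how to crawl efficiently is hard work. That is where this is worth trying. The open source software introduced this time is InfoCrawler, a web crawler written in Java. InfoCrawler has many configuration options and looks like it will make an excellent crawling system. It also seems it can be distributed across multiple servers. You can specify whether or not to crawl each file type, such as HTML, images, and various binaries. [Image: screen for specifying the files to index] It also supports servers that require authentication, and filtering by language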

Hayato 2008/04/28

crawler

Heritrix Review MOONGIFT

Heritrix review. Screenshots: login, index, jobs, name entry, modules, submodules, settings, overrides, job created, error, running, reports, logs, 404s only.

Hayato 2008/01/14

Manageability - Open Source Web Crawlers Written in Java

I was recently quite pleased to learn that the Internet Archive's new crawler is written in Java. Coincidentally, in addition to putting together a list of open source projects for full-text search engines, I put together a list of crawlers written in Java to complement that list. Here's the list: Heritrix - Heritr

Hayato 2007/07/10

crawler

Open Source Crawlers in Java

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. WebSPHINX (Website-Specific Processors for HTML INformation eXtraction) is a Java class library and interactive development environment for Web crawlers that browse and process Web pages automatically.

Hayato 2007/07/10

crawler

http://blog.windy.ac/archives/onailab_crawler_study_meeting_part1.pdf
http://blog.windy.ac/archives/onailab_crawler_study_meeting_part2.pdf
http://blog.windy.ac/archives/onailab_crawler_study_meeting_part3.pdf
http://blog.windy.ac/archives/onailab_crawler_study_meeting_part4.pdf
http://blog.windy.ac/archives/onailab_crawler_study_meeting_part5.pdf

Hayato 2007/07/10

crawler

Hayato's bookmarks about crawler (19)
