[B! scraper] kamawada's bookmarks

https://www.openvista.jp/archives/note/251/?251/

kamawada 2008/02/11

scraper
php

Link

Today's CPAN Module (former site): Table of Contents


kamawada 2007/12/30

The update is here!

perl
scraper

Link

Journal of miyagawa (1653) - Web::Scraper with filters, and thought about Text filters

Web::Scraper with filters, and thought about Text filters. A developer release of Web::Scraper is pushed to CPAN, with "filters" support. Let me explain how this filters stuff is useful for a bit. Since an early version, Web::Scraper has been having a callback mechanism which is pretty neat, so you can extract "data" out of HTML, not limited to the string. For instance, if you have an HTML

kamawada 2007/10/05

!

Link

Tried out Web::Scraper. - 月日は百代の過客にして

So that is how it works. #!/usr/bin/perl use Web::Scraper; use URI; my $t = scraper { process '//table[@summary="upinfo"]//tr', 'columns[]' => scraper { process '//td[2]', file_name => 'TEXT'; process '//td[3]', comment => 'TEXT'; process '//td[4]', file_size => 'TEXT'; process '//td[5]', date => 'TEXT'; process '//td[6]', mime => 'TEXT'; result qw/file_name comment file_size date mime/; }; result qw/columns/; };

kamawada 2007/10/01

Link
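The excerpt in the entry above nests one scraper inside another: the outer rule selects each table row, and the inner rules pick cells by position (td[2], td[3], and so on). As a rough illustration of that row-then-cell pattern, here is a minimal sketch in Python using only the standard library's xml.etree.ElementTree. It is an analogue, not Web::Scraper's API; the sample markup, the `extract_rows` helper, and the reduced field list are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup; the real page is not shown in the excerpt.
SAMPLE = """
<html><body>
<table summary="upinfo">
  <tr><td>1</td><td>a.zip</td><td>first upload</td><td>10KB</td></tr>
  <tr><td>2</td><td>b.png</td><td>screenshot</td><td>25KB</td></tr>
</table>
</body></html>
"""

# Field name -> 1-based cell position, mirroring td[2], td[3], td[4] in the excerpt.
FIELDS = {"file_name": 2, "comment": 3, "file_size": 4}


def extract_rows(markup):
    """Return one dict per table row, with cells picked by position."""
    root = ET.fromstring(markup)
    rows = []
    # Outer step: select every row of the table whose summary is "upinfo".
    for tr in root.findall(".//table[@summary='upinfo']/tr"):
        record = {}
        # Inner step: pick each cell relative to the current row.
        for name, pos in FIELDS.items():
            cell = tr.find(f"td[{pos}]")
            has_text = cell is not None and cell.text
            record[name] = cell.text.strip() if has_text else None
        rows.append(record)
    return rows


columns = extract_rows(SAMPLE)
```

ElementTree only parses well-formed XML, so real-world HTML would need a tolerant parser first; the point here is only the two-level selection.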

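Hatena Blog | Create a free blog

A solo eating trip to Taiwan (part 1). I went to Taiwan. I had booked my ticket and lodging in advance, but cancelled the reservation once when the Taiwan earthquake struck. Later I saw the statement from Taiwan's tourism authority that it was fine to come and visit, so I decided to go after all. The goal was simple: eat lots of good food around Taipei. And at home…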

kamawada 2007/09/28

Link

Sbox Error

The sbox program encountered an error while processing this request. Please note the time of the error, anything you might have been doing at the time to trigger the problem, and forward the information to this site's Webmaster (webmaster@www.ac.cyberhome.ne.jp). Stat failed. /usr/local/apache2/cgi-bin/~mattn: No such file or directory sbox version 1.10 $Id: sbox.c,v 1.16 2005/12/05 14:58:01 lstein

kamawada 2007/09/20

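So "process '/tr/td[2]/a'" is the way to do it

Link

hide-k.net#blog: Web::Scraper 0.15 and cisco_scraper.pl

My earlier article, Scraping CISCO RECORDS with Web::Scraper, was picked up in Big Sky :: What changed in Web::Scraper 0.15... and, as a bonus, used as an example of code corrected for Web::Scraper 0.15, so here is a further reply. If you want to do this without breaking the tree, I think referring to the TextNode is the way to go. For example, use XPath's node() and fetch it by position. However, the current Web::Scraper cannot reference a TextNode through a shortcut, so modifying it to return string_value as follows makes it work. There is one problem. The corrected patch has process '//li/node()[4]', 'title' => sub {$_->string_value;}; but the node is not necessarily the fourth one, so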

kamawada 2007/09/20

Link

Sbox Error

The sbox program encountered an error while processing this request. Please note the time of the error, anything you might have been doing at the time to trigger the problem, and forward the information to this site's Webmaster (webmaster@www.ac.cyberhome.ne.jp). Stat failed. /usr/local/apache2/cgi-bin/~mattn: No such file or directory sbox version 1.10 $Id: sbox.c,v 1.16 2005/12/05 14:58:01 lstein

kamawada 2007/09/18

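Link

ゆーすけべー日記 (Yusukebe's diary)

I arranged to meet Saki at a supermarket in front of Shonandai Station, near her home. She said she would catch up later by bicycle, and I parked the car in a large coin-operated parking lot. After smoking a cigarette I headed for the supermarket, where housewife-like women and old ladies were going in and out of the entrance without a break. It was almost 5 p.m. When I looked up from my watch, Saki arrived empty-handed, saying "Sorry to keep you waiting" without seeming particularly sorry. She was going to cook for me as a thank-you, but her place apparently did not have enough ingredients, which is why we ended up stopping by the supermarket. Saki inspected everything from the vegetable section to the meat section without a wasted step, as if guided by an excellent car navigation system. When she found an ingredient she wanted, she stared at it for about two seconds, then tossed the potatoes or pork she had picked up into the basket I was carrying without hesitation. Finally she went to the shelves where the alcoholic drinks were chilled and, saying she would be the one drinking, …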

kamawada 2007/09/16

Thanks, miyagawa-san

Link

Scraping JAGRA BB with Web::Scraper

Scraping JAGRA BB with Web::Scraper. I scraped the JAGRA BB pages with Web::Scraper. It is amazingly handy! JAGRA BB - a web learning site for the printing industry: HOME [www.jagra.or.jp] script:jagrabb.pl #!/usr/bin/perl use strict; use warnings; use Web::Scraper; use URI; my $uri = 'http://www.jagra.or.jp/jagrabb/home/top/'; my $scraper; $scraper->{'item'} = scraper { process 'h3>a', title => 'TEXT', url => sub { return URI -> new_abs( $_->att

kamawada 2007/09/15

Link

Journal of miyagawa (1653) - Web::Scraper 0.14

Web::Scraper 0.14 is released along with a couple of neat features. First of all, I incorporated HTML::Tagset's linkElements hash into '@attr' accessor of elements, so if you do this: $s = scraper { process "a", "links[]" => '@href' }; $s->scrape(URI->new("http://www.example.com/")); because a@href is known to be link elements, they're automatically converted to absolute URI using http://www.exampl

kamawada 2007/09/15

Link
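The Web::Scraper 0.14 entry above describes link attributes such as a@href being converted to absolute URIs against the URL that was scraped. The underlying idea, resolving each href against a base URL, can be sketched in Python with the standard library's html.parser and urllib.parse. This is an analogue and not Web::Scraper's implementation; the `LinkCollector` class, the `absolute_links` helper, and the sample markup are hypothetical names and data introduced for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                # urljoin keeps absolute URLs as they are and resolves relative ones.
                self.links.append(urljoin(self.base_url, value))


def absolute_links(html, base_url):
    """Return every <a href> in html as an absolute URL."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links


links = absolute_links(
    '<a href="/about">About</a> <a href="news/1.html">News</a> '
    '<a href="http://other.example.org/">Other</a>',
    "http://www.example.com/docs/",
)
```

Doing the resolution at extraction time means later code never has to remember which page a relative link came from.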

Playing with the scraper CLI, part 2 - へたっぴ日記

pushing Web::Scraper 0.13 that has code generation and more examples in eg/ http://twitter.com/miyagawa/statuses/243570942 This time it is code generation, apparently. I had not checked 0.12 either, so I looked at the new features of both. This is more or less a follow-up to Playing with the scraper CLI - へたっぴ日記. Today's subject is Square Enix on Yahoo! Finance. hetappi@violet ~ $ scraper 'http://quote.yahoo.co.jp/q?s=9684.t&d=t' The s command displays the HTML source. scraper> s <html> <head> <title> Yahoo!ファイナン&#x30B9

kamawada 2007/09/15

scraper

Link

Journal of miyagawa (1653) - Web::Scraper hacks #2: Extract javascript and css content

This is inspired by an email from Renée Bäcker asking how to get content inside javascript tag. Because Web::Scraper's 'TEXT' mapping calls as_text method of HTML::Element, it doesn't get the content inside script and style tag. Here's the code that works. It's kinda clumsy, and it'd be nice if there's much cleaner way to do this: #!/usr/bin/perl # extract Javascript code into 'code' use strict; u

kamawada 2007/09/10

scraper

Link
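The "hacks #2" entry above notes that a plain text mapping skips whatever sits inside script and style tags, so that content has to be collected explicitly. The Perl code from the post is cut off in the excerpt, so the following is a separate minimal sketch in Python using only the standard library's html.parser, not a reconstruction of it. The `ScriptExtractor` class, the `extract_scripts` helper, and the sample page are hypothetical names and data made up for this example.

```python
from html.parser import HTMLParser


class ScriptExtractor(HTMLParser):
    """Collect the raw text found inside each <script> element."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.code = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.code.append("")  # start a new buffer for this script element

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Data may arrive in several chunks, so append to the current buffer.
        if self._in_script:
            self.code[-1] += data


def extract_scripts(html):
    """Return the stripped contents of every <script> element in html."""
    parser = ScriptExtractor()
    parser.feed(html)
    parser.close()
    return [chunk.strip() for chunk in parser.code]


scripts = extract_scripts(
    '<html><head><script type="text/javascript">'
    'var greeting = "hello";'
    "</script></head><body><p>visible text</p></body></html>"
)
```

The visible paragraph text is ignored on purpose: only data seen between a script start tag and its end tag is kept.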

Sbox Error

The sbox program encountered an error while processing this request. Please note the time of the error, anything you might have been doing at the time to trigger the problem, and forward the information to this site's Webmaster (webmaster@www.ac.cyberhome.ne.jp). Stat failed. /usr/local/apache2/cgi-bin/~mattn: No such file or directory sbox version 1.10 $Id: sbox.c,v 1.16 2005/12/05 14:58:01 lstein

kamawada 2007/09/07

scraper

Link

unwind-protect: Tried scraping the last.fm shoutbox

I just wrote this on a whim; it does not amount to much. Still, Web::Scraper really does make things easy. use strict; use warnings; use Web::Scraper; use URI; use YAML; my $url = 'http://www.last.fm/user/saltyduck/shoutbox'; my $messages = scraper { process "li.hentry", 'message[]' => scraper { process "p.entry-content", 'message' => 'TEXT'; process "span.fn", 'from' => "TEXT"; result 'from', 'message'; }; }->scrape(URI->new($url)); print YAML::

kamawada 2007/09/04

Well, last.fm does support microformats, after all

scraper

Link

Playing with the scraper CLI - へたっぴ日記

via the Web::Scraper presentation at YAPC::EU. A command-line interface has been added to Web::Scraper, so I tried it out right away. The task: extract book information from the list of books published by O'Reilly Japan. Far too easy... The HTML source looks like this. It is clean source, well suited to scraping. ... <table class="booklist" width="100%" cellspacing="0" cellpadding="0" border="0"> <tr class="booklist defaultcolor"> ... </tr> <tr class="up"> <td class="booklistisbn"> <a name="4-87311-094-7" /> 4-87311-094-7 </td> <td class="booklisttitle"><a href="

kamawada 2007/09/04

scraper

Link

B10[mg]: Scraping Yahoo! Search with Web::Scraper

Yet another non-informative, useless blog As seen on TV! Scraping websites is usually pretty boring and annoying, but for some reason it always comes back. Tatsuhiko Miyagawa comes to the rescue! His Web::Scraper makes scraping the web easy and fast. Since the documentation is scarce (there are the POD and the slides of a presentation I missed), I'll post this blog entry in which I'll show how to

kamawada 2007/09/03

scraper

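Link

An example of mixing XPath and CSS selectors in Web::Scraper - Tociyuki::Diary

Web::Scraper comes with every convenience built in, which makes it very handy. The two features I use most often are the following. The first argument to process accepts XPath expressions as well as CSS selectors. However, an XPath expression must always begin with a slash (/). In the second and later arguments to process, which specify where a value is taken from, you can also place a code reference. This lets you transform values from the DOM tree as you extract them. As a concrete example, let's extract the entries by Rei Betsuyaku from the Daily Portal Z archive list. First, pulling out the entry portion of the archive page gives this. <TD width="580" valign="top" class="tx12px"> <P> <B><FONT c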