[B! scrape][html] ishideo's bookmarks

ishideo id:ishideo

ishideo's bookmarks on scrape and html (11)

Under Construnction
ishideo 2010/04/06
feedparser

rss

python

scrape

html

email

encoding

gmail
Link
NAL Lab Graduation Research Notes:: I tried porting the Ruby module ExtractContent to Python
ExtractContent is a Ruby module that extracts the main text from HTML. RubyForge: ExtractContent: Project Info. Extracting the main text of web pages (nakatani @ cybozu labs). There is also a Perl module of the same name, but this time I ported it to Python, working from the Ruby module. # -*- coding:utf-8 -*- import re import unicodedata class ExtractContent(object): # convert character to entity references CHARREF = { "nbsp" :" ", "lt" :"<", "gt" :">", "amp" :"&", "laquo":u"\xc2\xab", "raquo":u"\xc2\xbb", }
ishideo 2010/02/11
python

ExtractContent

ruby

scraper

scrape

html
Link
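The CHARREF table in the excerpt above maps entity names to characters. As a rough illustration of how such a table might be used, here is a minimal sketch; it is not the ported module's code. The function name decode_entities, the regular expression, and the handling of numeric references are assumptions made for this example, and the laquo/raquo values are written as single Unicode code points.

```python
import re

# Hypothetical table modeled on the CHARREF excerpt; not the module's own data.
CHARREF = {
    "nbsp": " ",
    "lt": "<",
    "gt": ">",
    "amp": "&",
    "laquo": "\u00ab",
    "raquo": "\u00bb",
}

# Matches named (&amp;), decimal (&#65;) and hexadecimal (&#x41;) references.
_ENTITY = re.compile(r"&(#x[0-9a-fA-F]+|#[0-9]+|[A-Za-z]+);")


def decode_entities(text):
    """Replace character references in text using CHARREF (assumed helper)."""

    def repl(match):
        name = match.group(1)
        if name.startswith("#x"):
            return chr(int(name[2:], 16))
        if name.startswith("#"):
            return chr(int(name[1:]))
        # Unknown names are left untouched rather than guessed at.
        return CHARREF.get(name, match.group(0))

    # A single pass means "&amp;lt;" becomes "&lt;" and is not decoded twice.
    return _ENTITY.sub(repl, text)
```

Doing the substitution in one pass is a deliberate choice in this sketch: decoding "&amp;" first and then rescanning would turn escaped text such as "&amp;lt;" into "<".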
ElementTree Tidy HTML Tree Builder
July 6, 2003 | Fredrik Lundh The TidyHTMLTreeBuilder parser can read (almost) arbitrary HTML files, and turn them into well-formed element trees. This parser uses a library version of Dave Raggett’s HTML Tidy utility to fix any problems with the HTML before converting it to XHTML (the XML version of HTML). Note: If you don’t want to (or cannot) install binary Python extensions, you can use the Tid
ishideo 2008/08/18
TidyHTMLTreeBuilder

python

scrape

html

dom

tidy

ElementTree

ElementTidy
Link
Wrestling HTML
September 8, 2004 Uche Ogbuji Lately I've seen HTML parsing problems everywhere. One project needed a web crawler with specialized features provided through Python code that processed arbitrary HTML. There have also been several threads on mailing lists I frequent (including XML-SIG) featuring discussions of mechanisms for dealing with broken HTML by converting it to decent XHTML. This article foc
ishideo 2008/08/18
python

BeautifulSoup

module

easy_install

scrape

html

dom

tidy

ElementTree

ElementTidy
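Link
[Infoseek] Infoseek: a portal site operated by Rakuten
Thank you for using Rakuten's services. We are very sorry for the inconvenience, but we are currently carrying out emergency maintenance. We sincerely apologize for the trouble this is causing our customers. Service will be restored as soon as the maintenance is complete, so we ask for your patience a little longer.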
ishideo 2008/08/07
xpath

scrape

xml

html

cheatsheet

dev

japanese

javascript

js

reference
Link
XPather – Get this Extension for 🦊 Firefox (en-US)
Not compatible with Firefox Quantum. Feature rich XPath generator, editor, inspector and simple extraction tool...
ishideo 2008/08/07
extension

firefox

xpath

XPather

add-on

plugin

scrape

xml

html
Link
XPath Checker :: Firefox Add-ons
Add-ons extend Firefox, letting you personalize your browsing experience. Take a look around and make Firefox your own.
ishideo 2008/08/07
extension

firefox

xpath

XPath-Checker

add-on

plugin

scrape

xml

html
Link
scrAPI, a scraping toolkit for ruby - 川o・-・）＜2nd life
http://blog.labnotes.org/category/scrapi/ Until now I had been using regular expressions or xpath to scrape information from the web in ruby, which was fairly tedious. Then I happened to learn about a scraping toolkit called scrAPI, and it looks quite handy. With this toolkit you can retrieve elements by writing CSS3 selectors. For example, if you want to get all the links on some site: require 'rubygems' require 'scrapi' require 'open-uri' require 'nkf' require 'pp' $KCODE = 'u' links = Scraper.define do process "a[href]", "urls[]"=>"@href" result :urls e
ishideo 2007/01/24
ruby

scrAPI

module

scrape

html
Link
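The scrAPI excerpt above selects every a element that has an href attribute and collects the attribute values. For comparison, a minimal Python sketch of the same selection is shown below, using the limited XPath subset of the standard library's ElementTree. It assumes the input is already well-formed XML without namespaces; the sample markup and the function name collect_hrefs are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed sample; real pages are often not this clean.
XHTML = """<html><body>
<a href="http://example.com/one">one</a>
<a name="anchor-only">no href</a>
<p><a href="http://example.com/two">two</a></p>
</body></html>"""


def collect_hrefs(markup):
    """Return href values of all <a href=...> elements, in document order."""
    root = ET.fromstring(markup)
    # ".//a[@href]" plays the role of the CSS selector "a[href]".
    return [a.get("href") for a in root.iterfind(".//a[@href]")]
```

Calling collect_hrefs(XHTML) returns the two example.com URLs and skips the anchor without an href. Real XHTML usually declares a namespace, in which case the tag names in the path would need to be namespace-qualified.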
Beautiful Soup: We called him Tortoise because he taught us.
You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it's been saving programmers hours or days of work on quick-turnaround screen scraping projects. Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping. Three features make it powerful: Beautiful Soup provides a few simple methods and
ishideo 2007/01/24
python

BeautifulSoup

module

easy_install

scrape

html
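Link
How to extract information from an HTML file with Python - 傀儡師の館.Python: Rakuten Blog
2006.10.22 How to extract information from an HTML file with Python. Category: Python. I tried to process the Rakuten Blog access-log page with ElementTree, but got an error like ExpatError: mismatched tag: line 244, column 2 and could not parse the page as XML. So I simply decided to look for another approach. Incidentally, Python's ElementTree is said to be faster than ruby's rexml. Looking up the AbstractLightInfantry units in proto.xml, or rather ElementTree (Python) vs. REXML (Ruby). Parse times of REXML and ElementTree. For a primitive approach, the SGMLPar included in the standard library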
ishideo 2007/01/24
python

BeautifulSoup

module

easy_install

scrape

html

ElementTidy
Link
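The excerpt above describes ElementTree rejecting a real-world page with a "mismatched tag" error. The following sketch reproduces that kind of failure on a small made-up document and then extracts links with the standard library's tolerant html.parser instead. It is only an illustration under stated assumptions: the sample markup, parses_as_xml, LinkCollector and extract_links are hypothetical names, and current Python versions report the error as xml.etree.ElementTree.ParseError rather than a bare ExpatError.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical malformed page: <br> is never closed and </p> arrives
# while <b> is still open, so a strict XML parser rejects it.
BROKEN_HTML = (
    '<html><body><p>Log<br><b>entry</p>'
    '<a href="/a">A</a><a href="/b">B</a></body></html>'
)


def parses_as_xml(markup):
    """Return True if ElementTree accepts the markup as well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class LinkCollector(HTMLParser):
    """Collects href values from <a> start tags, tolerating broken markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(markup):
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.links
```

With these definitions, parses_as_xml(BROKEN_HTML) is False while extract_links(BROKEN_HTML) still yields both links. An event-based parser like this does not build a tree, which is one reason the bookmarked articles turn to libraries such as BeautifulSoup or Tidy-based builders.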
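sh1.2 pyblosxom : Scraping with python
After seeing "Released HTML::Selector::XPath: blog.bulknews.net" and "川o・-・）＜2nd life - scrAPI, a scraping toolkit for ruby", I thought the same might be possible in python with ElementTree, but parsing fails with an error unless the input is proper XML. So I figured I could convert the HTML to XHTML before handing it to ElementTree and looked through the standard library, but apparently nothing in the standard library does that. Asking google turned up the following entry: "How to extract information from an HTML file with Python - 傀儡師の館 - Rakuten Blog (Blog)". The author had exactly the same problem and had looked into various options, and that is where I learned about BeautifulSoup. A module that has been around for quite a while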
ishideo 2007/01/24
python

BeautifulSoup

module

easy_install

scrape

html
Link