[B! python][html] ishideo's Bookmarks

ishideo id:ishideo

ishideo's bookmarks about python and html (27)

GitHub - noptrix/httpgrep: Scans for HTTP servers and finds given strings in HTTP body and HTTP response headers.
ishideo 2023/10/09
httpgrep

grep

http

html

body

headers

python

cli

github

scan
GitHub - axcheron/httpq: Simple tool to get HTTP status and page title from a list of URLs
ishideo 2022/12/07
httpq

python

cli

parse

html

github
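I want to crawl with Scrapy, extract the main text from the html files uploaded to S3, and save it to an Elasticsearch index. | teratail
### Environment: Mac OS 10.13.6, Python 3.8.5, Scrapy 2.2.1, botocore/2.0.0dev38, scrapy-s3pipeline 0.3.0, readability-lxml 0.8.1 Background / what I want to achieve: I would like to know the correct way to have a Python program read the crawl-result html files that I uploaded to an AWS S3 bucket using the crawling framework Scrapy, extract the main text from the html, and index it in the search engine Elasticsearch. For this experiment I am combining material from the following book: 「Python クローリング&スクレイピング データ収集・解析のための実践開発ガイド」 https://scraping-book.com/ [Crawl & upload to S3] Hatena Bookmark's…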
ishideo 2020/11/09
python

scrapy

s3

aws

teratail

html
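The question above describes a pipeline: read crawled html, extract the text, and index it in Elasticsearch. The following is a minimal, standard-library-only sketch of the last two stages, not the asker's actual code. It assumes a crude `html.parser`-based extractor as a stand-in for readability-lxml, and it only builds the newline-delimited JSON body that the Elasticsearch bulk API accepts. Downloading from S3 and sending the request are left out, and the names `html_to_text`, `bulk_payload`, the index name `pages`, and the field name `content` are all hypothetical.

```python
import json
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html_text):
    """Return the text content of an HTML string (stand-in for readability-lxml)."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    return " ".join(parser.chunks)


def bulk_payload(pages, index_name="pages"):
    """Build a bulk-API body: one action line plus one document line per page.

    `pages` maps a document id (for example an S3 object key) to its html.
    """
    lines = []
    for key, html_text in pages.items():
        lines.append(json.dumps({"index": {"_index": index_name, "_id": key}}))
        lines.append(json.dumps({"content": html_to_text(html_text)}, ensure_ascii=False))
    return "\n".join(lines) + "\n"
```

In a real setup the html strings would come from the S3 objects written by the crawler, and the returned body would be sent to the cluster's `_bulk` endpoint.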
Formasaurus — Formasaurus 0.10.0 documentation
ishideo 2019/09/03
html

formasaurus

doc

python

form

detect

field

api

github
RemoteStance.com is for sale | HugeDomains
ishideo 2019/05/30
python

regrex

html

re

compile
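Running pytest tests on CircleCI - Qiita
Introduction: In this article I explain how to test Python with CircleCI. I chose this topic for two reasons. The first is that CircleCI can be used free of charge for up to 1,000 hours per month, and I wanted more people to know that. Note that it is free only when you run automated Linux tests in the cloud without parallelism. It is free whether the GitHub repository belongs to an organization or is private, but running tests in parallel or testing on macOS is paid. CircleCI runs CI as soon as you push program updates to the remote GitHub repository. As I explain later, all this takes is registering a CircleCI account, selecting a repository on GitHub, and preparing a simple configuration file. That alone lets you run tests without using your own PC's resources. I…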
ishideo 2019/03/26
pytest

python

circleci

coverage

pytest-cov

qiita

html

report
[python] Summary of ways to load data with pandas - おじさんAのプログラムメモ
First, the import: import pandas as pd CSV, TSV pd.read_csv(filename, header=None, names=['A', 'B'], index_col='A', ...) # everything except filename is optional # There is also a pd.read_table() function, which lets you specify the delimiter with the sep="" parameter. The default is a tab. Excel xls = pd.ExcelFile(filename) df = xls.parse('sheet_name') JSON import json json_data = json.loads(json_text) name = json_data[0]['name'] XML from lxml import objectify parsed = objectify.parse(open(x
ishideo 2017/06/06
python

pandas

excel

csv

request

html

webapi

xml

dataframe

json
BeautifulSoup and html5lib - The jonki
Have you ever hit this error in a parsing program that uses BeautifulSoup? "HTMLParser.HTMLParseError: malformed start tag" In short, the HTML contains tags that BeautifulSoup is not smart enough to understand. Apparently it cannot cope with slightly complex markup such as script tags. $ python hoge.py Traceback (most recent call last): File "hoge.py", line 78, in <module> main() File "hoge.py", line 40, in main soup = BeautifulSoup(html) File "build/bdist.linux-i686/egg/BeautifulSoup.py", line 1499, in __i
ishideo 2011/11/18
html5lib

BeautifulSoup

python

html

parse

module
ishideo's Bookmarks / December 17, 2008 - Hatena Bookmark
ishideo 2010/10/13
Text-MicroTemplate

cpan

template

html

perl
ishideo's Bookmarks / January 21, 2009 - Hatena Bookmark
The public-domain version of cflow lacks GNU cflow's -T, --tree * Draw ASCII art tree option, so I wrote a command that converts a call structure expressed by indentation (the off-side rule?) into tree form. #!/usr/bin/python import sys def getlevel(s): return len(s) - len(s.lstrip()) def parselist(lines): if len(lines) == 0: return [[], 0] tree = [] i = 0 currentlevel = getlevel(lines[0]) while i < len(lines): level = getlevel(lines[i]) if level > currentlevel: # Indent incr
ishideo 2010/10/13
python

cflow

indent2tree

indent

tree

unix
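The cflow excerpt above is cut off before the tree-drawing logic. As a rough illustration of the idea only, here is a minimal Python 3 sketch that turns indentation-structured lines into an ASCII tree. It is not the bookmarked script: the function names `get_level` and `render_tree` and the connector characters are assumptions made for this example.

```python
def get_level(line):
    """Return the number of leading whitespace characters of a line."""
    return len(line) - len(line.lstrip())


def render_tree(lines):
    """Render indentation-structured lines as an ASCII art tree."""
    items = [(get_level(line), line.strip()) for line in lines if line.strip()]
    out = []
    stack = []  # (indent level, has_later_sibling) for each open ancestor
    for i, (level, text) in enumerate(items):
        # Close ancestors that are not shallower than the current line.
        while stack and stack[-1][0] >= level:
            stack.pop()
        # A later sibling exists if the same level reappears before a dedent.
        has_more = False
        for later_level, _ in items[i + 1:]:
            if later_level < level:
                break
            if later_level == level:
                has_more = True
                break
        if not stack:
            out.append(text)  # top-level entries are printed without a connector
        else:
            # Ancestors below the root contribute a vertical bar or blank padding.
            prefix = "".join("|   " if more else "    " for _, more in stack[1:])
            out.append(prefix + ("|-- " if has_more else "`-- ") + text)
        stack.append((level, has_more))
    return "\n".join(out)


if __name__ == "__main__":
    sample = ["main", "    parse", "        getlevel", "    output"]
    print(render_tree(sample))
```

Run as a script, the sample prints `main` followed by `|-- parse`, `|   `-- getlevel`, and `` `-- output``, one per line.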
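I made a tool for viewing HTML source on iPhone and iPad - Webtech Walker
I built something called i-sourceview with GAE/Python. i-sourceview hokaccha's i-sourceview at master - GitHub You can do the same kind of thing with an app or with JS (a bookmarklet), but an app has to be launched separately, and JS cannot get the DOCTYPE and has no syntax highlighting. So I had the server request the HTML, then apply syntax highlighting and add line numbers. After building it, though, I realized that it cannot fetch pages behind authentication and that sometimes you want to see the source after JS has rewritten it, so in the end I also provided a way to fetch the source with JS. Using the two together should cover most cases. I matched the syntax colors and so on to Chrome's source view. It looks like this.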
ishideo 2010/09/28
bookmarklet

python

gae

chrome

iphone

ipad

html

dev
Nothing is impossible : Easy! A Python script that fetches and parses HTML in just 8 lines of code
June 07, 2010 10:49 Category: work Easy! A Python script that fetches and parses HTML in just 8 lines of code I saw "Easy! A Perl script that fetches and parses HTML in just 13 lines of code" and thought Python would make it even simpler, so I wrote one. import urllib2 from lxml import etree url = 'http://www.yahoo.co.jp' opener = urllib2.build_opener() opener.addheaders = [('User-agent','Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)')] tree = etree.parse(opener.open(url),parser=etree.HTMLParser()
ishideo 2010/07/26
python

urllib2

lxml

xpath

scraping

html
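The script quoted above uses `urllib2`, which exists only in Python 2; in Python 3 the same functionality lives in `urllib.request`. The following is a hedged Python 3 sketch of the same fetch-and-parse idea using only the standard library, with `html.parser` swapped in for lxml. The excerpt is cut off before it shows what the original extracts, so collecting link targets here is purely an illustrative choice, and `LinkCollector`, `extract_links`, and `fetch` are hypothetical names.

```python
import urllib.request
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html_text):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


def fetch(url, user_agent="Mozilla/5.0"):
    """Download a page and decode it, sending a custom User-Agent header."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A typical call would be `extract_links(fetch(url))` for some page `url`; the parsing half can be exercised on a literal string without network access.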
Under Construnction
ishideo 2010/04/06
feedparser

rss

python

scrape

html

email

encoding

gmail
NAL研卒業研究ノート:: Porting the Ruby module ExtractContent to Python
ExtractContent is a Ruby module that extracts the main text from HTML. RubyForge: ExtractContent: Project Info Main-text extraction from web pages (nakatani @ cybozu labs) There is also a Perl module of the same name, but this time I ported it to Python based on the Ruby module. # -*- coding:utf-8 -*- import re import unicodedata class ExtractContent(object): # convert character to entity references CHARREF = { "nbsp" :" ", "lt" :"<", "gt" :">", "amp" :"&", "laquo":u"\xab", "raquo":u"\xbb", }
ishideo 2010/02/11
python

ExtractContent

ruby

scraper

scrape

html
Google Code Archive - Long-term storage for Google Code Project Hosting.
ishideo 2009/12/02
embedded

python

pystachio

html

javascript

js
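Hatena Blog | Create a free blog
The story of handing my Cub over to a junior colleague: this was the first article I wrote when I started on Hatena Blog. I was living in Osaka at the time, so I ran the blog under the name 大阪生活 (Osaka Life). Now I live in company housing, so it is 社宅生活 (Company Housing Life). When I move, I will move to the next blog as well. Back when I was going to the gym I rode it four or more times a week, but lately I have been busy and for the past…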
ishideo 2009/07/27
parse

html

python

scraping
htmltotext converter w/ tty support for bold/underline « Python recipes « ActiveState Code
ishideo 2009/01/21
python

htmltotext

convert

html

text
Python Package Index : pyquery 1.1
ishideo 2008/12/17
python

jquery

pyquery

javascript

library

html
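西尾泰和のブログ: Converting bullet lists to HTML with Python
Even when using COREBlog I wrote entries in raw DTML, and I chose hand-typing tags in MovableType over tDiary's arbitrary formatting, but even so I dislike writing bullet lists in HTML because it is far too unintuitive. So I decided to add to MovableType a "tag that converts bullet lists written in a Wiki-like notation into HTML", and as a first step I wrote the algorithm in Python. (I was not confident about writing it in Perl right away.) To avoid any misunderstanding: if you simply want to handle structured text in Python, it is of course easier to search for StructuredText or reStructuredText and use an appropriate library. This is strictly a prototype for a Perl program that converts my preferred bullet-list format into HTML. Features: hierarchical structure by indentation depth…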
ishideo 2008/09/25
python

parse

html
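The post above describes converting an indentation-based bullet list into HTML, with indentation depth giving the hierarchy. The excerpt does not show the author's notation or code, so the following is only a minimal sketch under stated assumptions: each item sits on its own line, an optional leading `-` or `*` is treated as the bullet marker, and deeper indentation opens a nested `<ul>`. The function name `bullets_to_html` is hypothetical.

```python
from html import escape


def bullets_to_html(text):
    """Convert an indented bullet list into nested <ul>/<li> markup."""
    out = []
    stack = []  # indentation widths of the currently open <ul> elements
    for raw in text.splitlines():
        if not raw.strip():
            continue  # ignore blank lines
        indent = len(raw) - len(raw.lstrip())
        item = raw.strip()
        if item[:1] in ("-", "*"):
            item = item[1:].strip()  # drop one assumed bullet marker
        if not stack or indent > stack[-1]:
            # Deeper than the current list: open a nested list inside the open <li>.
            out.append("<ul>")
            stack.append(indent)
        else:
            # Dedent: close nested lists until we are back at a matching level.
            while len(stack) > 1 and indent < stack[-1]:
                out.append("</li></ul>")
                stack.pop()
            out.append("</li>")  # close the previous sibling
        out.append("<li>" + escape(item))
    # Close whatever is still open at the end of the input.
    while stack:
        out.append("</li></ul>")
        stack.pop()
    return "".join(out)


if __name__ == "__main__":
    print(bullets_to_html("- a\n  - b\n  - c\n- d"))
```

Keeping a stack of indentation widths rather than a depth counter means the input does not have to use a fixed number of spaces per level.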
eGenix.com: mxTidy - HTML Tidy for Python
ishideo 2008/09/13
mxtidy

python

tidy

HTML-Tidy

html