[B! content extraction][tech] yuiseki's bookmarks

yuiseki id:yuiseki

yuiseki's bookmarks tagged content extraction and tech (11)

GitHub - buriy/python-readability: fast python port of arc90's readability tool, updated to match latest readability.js!
yuiseki 2015/08/04
tech

main-text extraction

content extraction
Link
GitHub - grangier/python-goose: Html Content / Article Extractor, web scrapping lib in Python
yuiseki 2015/08/04
tech

main-text extraction

content extraction
Link
Main-text (article) extraction engine for web pages - Features and usage of the Boilerpipe Java library : NETBUFFALO
Kindle, Programming, Network, Linux, iPhone/iPad/Apple TV, etc
yuiseki 2015/08/04
tech

main-text extraction

content extraction

boilerpipe
Link
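Trying out boilerpipe : mwSoft blog
■ Overview: I needed to extract the main text from HTML and register it in Solr, so I looked for a Java main-text extraction library and found one called boilerpipe. It looks strongly oriented toward English, but it seemed likely to give reasonable accuracy, so I tried it. ■ Installation: For now, download the jar from http://code.google.com/p/boilerpipe/downloads/list or get it from Maven at http://mvnrepository.com/artifact/de.l3s.boilerpipe/boilerpipe ■ Running main-text extraction: URL url = new URL("http://www.yahoo.co.jp/"); String text = DefaultExtractor.getInstance().getText(url); System.out.println(te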
yuiseki 2015/08/04
tech

main-text extraction

content extraction

boilerpipe
Link
[SmartNews] Globally Scalable Web Document Classification Using Word2Vec
yuiseki 2015/08/04
SmartNews's algorithm for main-text extraction and document classification using machine learning

tech

main-text extraction

content extraction (コンテンツ抽出)

content extraction
Link
Discover Open Source Projects
Suggested keywords: Java, Docker, Git, React, NextJs, Spring boot, Laravel
yuiseki 2015/07/30
tech

main-text extraction

content extraction
Link
GitHub - kohlschutter/boilerpipe: Work in progress transmit from Google Code
yuiseki 2015/07/30
tech

main-text extraction

content extraction

boilerpipe
Link
Makoto In's Reading List — Readability
All news articles published by Readability team can be found on this page.
yuiseki 2015/07/30
tech

main-text extraction

content extraction

api
Link
Safari Reader Source Code
I’m totally in love with Safari’s Reader feature. But sometimes, on some web article, Reader doesn’t display anything (or Reader’s button is greyed). If you’re like me, and want to see why Reader doesn’t always work properly, there is a very simple way to get Safari Reader source code. The crazy thing is that the functionality is all Javascript based (maybe due to its grand parent Arc90 Readabilit
yuiseki 2015/07/30
tech

main-text extraction

content extraction
Link
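Investigating Safari Reader's content extraction process
Safari (and Mobile Safari on iOS) has a Reader feature that pulls out just the content portion of pages such as blog posts and displays it. I knew iOS had it and used it now and then because it makes desktop-oriented pages easier to read, but it turns out the desktop version has it too. Pocket and Readability offer similar features. However, the Reader button sometimes appears and sometimes does not. I can guess that it does not appear when content extraction fails, but how is the content actually extracted? There is a Perl module called HTML::ExtractContent, and I assumed Reader does something similar, but when I looked into it I found various observations written up, such as that the block element containing h1-h6 with the most characters seems to be the one selected, something about block size, and so on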
yuiseki 2015/07/30
tech

main-text extraction

content extraction
Link
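Text extraction from HTML documents with CETR - やた＠はてな日記
It has been quite a while since n-yo told me about it, but I have implemented CETR and turned it into a web service. HTML text extraction (CETR) http://s-yata.jp/apps/nwc-toolkit/cetr-text-extractor CETR stands for "Content Extraction via Tag Ratios", a method that extracts content using the proportion of tags contained in each line of an HTML document. In brief, it works as follows. Remove comments, scripts, and styles. If the document consists of a single line, split it into 65-character chunks. Correction (2010-11-10) Compute the tag ratio (Ti) of each line. Smooth the tag ratio (Ti) to obtain Ti'. Compute the difference (Gi) between each Ti' and its neighbors. Smooth Gi to obtain Gi'.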
yuiseki 2010/11/11
tech

main-text extraction

content extraction (コンテンツ抽出)

Content Extraction

algorithm
Link
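
The CETR steps listed in the entry above can be sketched roughly as follows. This is only an illustration under several assumptions, not the implementation behind the linked web service: the excerpt does not define the tag ratio precisely, so the sketch assumes text characters per tag (a tag-free line is treated as having one tag); a plain moving average stands in for whatever smoothing the original uses; and the neighbor difference is taken as the mean absolute difference to the previous and next line. All function names are hypothetical. The sketch stops at Gi' because the excerpt is cut off there.

```python
import re

TAG = re.compile(r"<[^>]*>")


def strip_non_content(html):
    """Remove HTML comments, <script> blocks, and <style> blocks."""
    html = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)
    return re.sub(r"<(script|style)\b.*?</\1\s*>", "", html,
                  flags=re.DOTALL | re.IGNORECASE)


def split_lines(html, width=65):
    """Split into lines; a single-line document is cut into 65-char chunks.

    Dropping blank lines is a choice of this sketch. Fixed-width chunking
    can cut a tag in half, which the simple tag regex will then miss.
    """
    lines = [ln for ln in html.splitlines() if ln.strip()]
    if len(lines) == 1:
        only = lines[0]
        lines = [only[i:i + width] for i in range(0, len(only), width)]
    return lines


def tag_ratio(line):
    """Ti (assumed definition): non-tag characters divided by tag count."""
    tags = len(TAG.findall(line))
    text_chars = len(TAG.sub("", line).strip())
    return text_chars / max(tags, 1)


def smooth(values, radius=1):
    """Moving average over a window of up to 2 * radius + 1 values."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - radius):i + radius + 1]
        out.append(sum(window) / len(window))
    return out


def neighbour_diff(values):
    """Gi (assumed definition): mean absolute difference to both neighbors."""
    out = []
    for i, v in enumerate(values):
        prev = values[i - 1] if i > 0 else v
        nxt = values[i + 1] if i + 1 < len(values) else v
        out.append((abs(v - prev) + abs(nxt - v)) / 2)
    return out


def cetr_features(html):
    """Return (lines, Ti', Gi') for an HTML string."""
    lines = split_lines(strip_non_content(html))
    t_smooth = smooth([tag_ratio(ln) for ln in lines])
    g_smooth = smooth(neighbour_diff(t_smooth))
    return lines, t_smooth, g_smooth
```

Calling `cetr_features(html)` returns the cleaned lines together with one Ti' and one Gi' value per line; deciding which lines count as content based on those values is left out, since the excerpt ends before describing it.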