[B! LDA] xef's bookmarks

The Little Book of LDA

xef 2018/11/16

Classify text with Python and don't email me. | @DataSci

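Notes on classifying text. I use the Python library gensim, with LDA. berobero-sensei's explanation of LDA here is extremely detailed, so I skip it. Train on Wiki data and classify arbitrary text. After reading this article, I thought it would be handy to be able to classify as well! Cleansing the classification training data and tokenizing with compound words: since I ended up publishing it after all, the script below does the cleansing and the tokenization in one go. # -*- coding: utf-8 -*- import MeCab import re import unicodedata class Cleanser(): def __init__(self): self.patUrl = re.compile("https?://[\w/:%#\$&\?~\.=\+\-]+") self.patXml = re.compile("<(\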

xef 2014/11/17
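
The cleansing step described in the gensim entry above could look roughly like the sketch below. Only the URL pattern comes from the excerpt; the function name `cleanse`, the NFKC normalization, the whitespace collapsing, and the `re.ASCII` flag are assumptions made for illustration, and the MeCab tokenization step is left out.

```python
import re
import unicodedata

# URL pattern copied from the excerpt. re.ASCII is an assumption: it keeps
# \w from matching Japanese characters, so a URL followed directly by
# Japanese text is not over-matched under Python 3.
PAT_URL = re.compile(r"https?://[\w/:%#\$&\?~\.=\+\-]+", re.ASCII)


def cleanse(text):
    """Normalize width variants, drop URLs, and collapse whitespace."""
    # NFKC folds full-width letters and the ideographic space to ASCII forms.
    text = unicodedata.normalize("NFKC", text)
    # Replace each URL with a space so neighbouring words stay separated.
    text = PAT_URL.sub(" ", text)
    # Collapse runs of whitespace left behind by the substitutions.
    return re.sub(r"\s+", " ", text).strip()


sample = "ＬＤＡ　の解説は http://example.com/lda?id=1を参照"
cleaned = cleanse(sample)
print(cleaned)
```

The cleaned string would then be passed to a tokenizer before building a gensim dictionary and corpus.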

Multicore LDA in Python: from over-night to over-lunch | RARE Technologies

Latent Dirichlet Allocation (LDA), one of the most used modules in gensim, has received a major performance revamp recently. Using all your machine cores at once now, chances are the new LdaMulticore class is limited by the speed you can feed it input data. Make sure your CPU fans are in working order! The person behind this implementation is Honza Zikeš. Honza kindly agreed to write a few words a

xef 2014/09/24

LDA
Python

Predicting what user reviews are about with LDA and gensim

I was rather impressed with the impressions and feedback I received for my Opinion phrases prototype - code repository here. So yesterday, I decided to rewrite my previous post on topic prediction for short reviews using Latent Dirichlet Allocation and its implementation in gensim. I have previously worked with topic modeling for my MSc thesis but there I used the Semilar toolkit and a looot of

xef 2014/09/18

LDA
Gensim

These Are Your Tweets on LDA (Part I) – wellecks

How can we get a sense of what someone tweets about? One way would be to identify themes, or topics, that tend to occur in a user’s tweets. Perhaps we can look through the user’s profile, continually scrolling down and getting a feel for the different topics that they tweet about. But what if we could use machine learning to discover topics automatically, to measure how much each topic occurs, and

xef 2014/09/08

Finding the natural number of topics for LDA - Christopher Grainger

Update (July 13, 2014): I’ve been informed that I should be looking at hierarchical topic models (see Blei’s papers here and here). Thanks to Reddit users /u/GratefulTony and /u/EdwardRaff for bringing this to my attention. However, Redditor /u/NOTWorthless says HDPs do not provide a ‘posterior on the correct number of topics in any meaningful sense’. I’ll do more research and do a follow-up post.

xef 2014/07/16

2nd Machine Learning Algorithm Implementation Meetup - LDA

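2. Self-introduction • Masayuki Isobe (礒部正幸) • Occupation: software engineer • Currently: representative of Adfive Inc. http://www.adfive.net – For now, a company with one representative and no other staff – Ad tech and data-driven marketing business • Software consulting and contract development • Holds a graduate degree in science/engineering • Internet activity – TwitterID: @chiral – (Blog: アドファイブ日記) http://d.hatena.ne.jp/isobe1978/ • Recently implemented algorithms – Kalman filter, particle filter, Restricted Boltzmann Machine, Bayesian logistic regression, uplift modeling, SCW, LDA 3. What is Topic Modeling? • Clustering that mainly assumes document data – Clustering = unsupervised classi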

xef 2014/06/17

Topic Model Series 4: LDA (Latent Dirichlet Allocation)

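This post explains LDA ([Blei+ 2003]), which is really the centerpiece of this series. The shortcoming of the UM from the previous post is that assigning only one topic to a document is clearly wasteful in some cases and too restrictive in others. LDA's big advance is to treat a document as a mixture of various topics. The notation used in this post is as follows; it is the same as for the UM last time. The two right-hand columns give the numeric value for constants and the variable name in R for everything else. For the data, see the previous post. The graphical model is shown below (left: LDA; right, for reference: the previous UM). At first glance, the rectangular plate has merely been extended as far as […]. This is the tricky part, however, and it is quite a jump from the UM. I explain it following the order of the callouts below. ① Here, […] is generated from the hyperparameter […] according to a Dirichlet distribution, "one for each document". This […] is something like the following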

xef 2014/03/08
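
The generative step described in the entry above (a Dirichlet draw producing one topic mixture per document) can be sketched with the standard library alone. The symbols are missing from the excerpt, so the names `alpha` and `theta` follow the usual LDA notation as an assumption. The topic-word table `TOPIC_WORD` is made up for illustration and held fixed here, whereas full LDA would also draw it from a Dirichlet prior.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(num_docs, doc_len, alpha, topic_word, seed=0):
    """Generate documents as word-id lists via the LDA generative process."""
    rng = random.Random(seed)
    topic_ids = range(len(alpha))
    vocab_ids = range(len(topic_word[0]))
    thetas, docs = [], []
    for _ in range(num_docs):
        # One topic mixture per document, drawn from the Dirichlet prior.
        theta = sample_dirichlet(alpha, rng)
        words = []
        for _ in range(doc_len):
            # Pick a topic for this word position, then a word from that topic.
            z = rng.choices(topic_ids, weights=theta)[0]
            w = rng.choices(vocab_ids, weights=topic_word[z])[0]
            words.append(w)
        thetas.append(theta)
        docs.append(words)
    return thetas, docs


# Hypothetical settings: 3 topics over a 4-word vocabulary.
ALPHA = [0.5, 0.5, 0.5]
TOPIC_WORD = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.4, 0.4],
]

thetas, docs = generate_corpus(num_docs=5, doc_len=20, alpha=ALPHA, topic_word=TOPIC_WORD)
for theta in thetas:
    print([round(p, 2) for p in theta])
```

With `alpha` values below 1, most sampled mixtures tend to concentrate on one or two topics; raising them pushes documents toward more even blends.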

Trying Twitter spam detection with unsupervised LDA (preliminary experiment) - 病みつきエンジニアブログ

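* Note: people do not normally say "unsupervised LDA". Motivation: originally, I wanted to collect tweets containing URLs from Twitter. A certain news app delivers news articles that are being talked about on Twitter (?) (leaving aside whether that is legally acceptable), and I wanted to do that kind of mining. However, a plain Twitter search for "http,https" returned an enormous volume of results. On top of that, most of it was spam. "This site is too amazing lololol", my foot. So I want to remove spammy keywords at the search stage and mine URL posts of high purity, but the keywords are not limited to known ones. For example, "無料" (free) and "アフィリエイト" (affiliate) are obviously spam, but that approach cannot cope when unknown keywords such as "パズドラ" (Puzzle & Dragons) or "魔法石" (magic stones) appear. So, using an unsupervised learning approach, to extract spammy keywords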

xef 2014/02/18

Classifying news articles with Latent Dirichlet Allocation (LDA) | SmartNews Developer Blog

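This is 中路 of Gocro, Inc. As mentioned in the earlier post on SmartNews channel classification based on Bayesian classification, the articles SmartNews delivers are sorted into channels such as "Sports", "Entertainment", and "Columns" by a machine, not by people. The algorithm introduced last time was the naive Bayes classifier, but there are various other methods for classifying articles. This time I introduce one of them, Latent Dirichlet Allocation (hereafter LDA), based on material prepared for a joint in-house study session held recently with doctoral students from the University of Tokyo. Examples of what LDA can do: to build the naive Bayes classifier introduced last time, we had to prepare text already labeled with topics as training data. With LDA, on the other hand, A soccer tournament was held in Tokyo. Player xx's goal was spectacular.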

xef 2013/08/20
