BERTopic is a Python library for topic modeling of documents using Transformers. This article collects the code I often reach for as a first-pass EDA in Kaggle competitions.

Input and output at a glance

Input: a list of documents (e.g. ["I am sure some bashers of Pens fans ...", "My brother is in the market for a high-performance video card that supports VESA local bus with 1-2MB RAM. Does anyone hav...", ...])
Output: a 2D scatter plot showing how the documents relate to one another

Source code

Also available here: GitHub, Google Colab

import pandas as pd
from umap import UMA
![A first-pass visualization for Kaggle NLP competitions: document clustering and visualization with BERTopic](https://cdn-ak-scissors.b.st-hatena.com/image/square/a36e323b8c1ef295b8c0edf94b24f7054724587f/height=288;version=1;width=512/https%3A%2F%2Fres.cloudinary.com%2Fzenn%2Fimage%2Fupload%2Fs--pR28Xd-0--%2Fc_fit%252Cg_north_west%252Cl_text%3Anotosansjp-medium.otf_55%3AKaggle%2525E3%252581%2525AENLP%2525E3%252582%2525B3%2525E3%252583%2525B3%2525E3%252583%25259A%2525E3%252581%2525A7%2525E5%252588%25259D%2525E6%252589%25258B%2525E3%252581%2525AB%2525E4%2525BD%2525BF%2525E3%252581%252588%2525E3%252582%25258B%2525E5%25258F%2525AF%2525E8%2525A6%252596%2525E5%25258C%252596%252520%2525E3%252580%25259CBERTopic%2525E3%252582%252592%2525E7%252594%2525A8%2525E3%252581%252584%2525E3%252581%25259F%2525E6%252596%252587%2525E6%25259B%2525B8%2525E3%252582%2525AF%2525E3%252583%2525A9%2525E3%252582%2525B9%2525E3%252582%2525BF%2525E3%252583%2525AA%2525E3%252583%2525B3%2525E3%252582%2525B0%2525E3%252581%2525A8%2525E5%25258F%2525AF%2525E8%2525A6%252596%2525E5%25258C%252596%2525E3%252580%25259C%252Cw_1010%252Cx_90%252Cy_100%2Fg_south_west%252Cl_text%3Anotosansjp-medium.otf_37%3Anishimoto%252Cx_203%252Cy_121%2Fg_south_west%252Ch_90%252Cl_fetch%3AaHR0cHM6Ly9zdG9yYWdlLmdvb2dsZWFwaXMuY29tL3plbm4tdXNlci11cGxvYWQvYXZhdGFyLzZjZWNmNDMwYWMuanBlZw%3D%3D%252Cr_max%252Cw_90%252Cx_87%252Cy_95%2Fv1627283836%2Fdefault%2Fog-base-w1200-v2.png)
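
To make the input/output picture above concrete, here is a minimal sketch of how a list of documents could be turned into a 2D document map with BERTopic. It is not the article's own code, which is cut off in this excerpt. The helper names `prepare_docs` and `plot_document_map`, the embedding model `all-MiniLM-L6-v2`, and all UMAP parameter values are assumptions chosen for illustration.

```python
from typing import List, Sequence


def prepare_docs(raw_docs: Sequence[object], max_chars: int = 2000) -> List[str]:
    """Keep non-empty string documents, stripped and truncated to max_chars.

    Hypothetical helper: competition CSVs often contain NaN or blank text cells.
    """
    docs = []
    for doc in raw_docs:
        if not isinstance(doc, str):
            continue  # skip None / NaN and other non-string values
        text = doc.strip()
        if text:
            docs.append(text[:max_chars])
    return docs


def plot_document_map(docs: List[str], embedding_model_name: str = "all-MiniLM-L6-v2"):
    """Fit BERTopic on docs and return a Plotly figure of the documents in 2D.

    Hypothetical helper; the model name and UMAP settings are assumptions.
    """
    # Heavy third-party imports stay local so prepare_docs works without them.
    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer
    from umap import UMAP

    # Embed once and reuse the vectors for both topic fitting and plotting.
    encoder = SentenceTransformer(embedding_model_name)
    embeddings = encoder.encode(docs, show_progress_bar=False)

    # UMAP used inside BERTopic before clustering (5 dimensions is an assumption).
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                      metric="cosine", random_state=42)
    topic_model = BERTopic(umap_model=umap_model)
    topic_model.fit_transform(docs, embeddings)

    # A separate 2D projection that is used only for the scatter plot.
    reduced = UMAP(n_neighbors=15, n_components=2, min_dist=0.0,
                   metric="cosine", random_state=42).fit_transform(embeddings)
    return topic_model.visualize_documents(docs, reduced_embeddings=reduced)
```

With a pandas DataFrame `df` that has a text column (the column name `text` is hypothetical), usage would look like `fig = plot_document_map(prepare_docs(df["text"].tolist()))` followed by `fig.show()`. The sketch assumes a corpus of at least a few hundred documents, since both UMAP and the clustering step need enough points to work with.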