lsa pr
SmirkCao committed Jun 12, 2019
1 parent e485f78 commit bc14ffd
Showing 4 changed files with 30 additions and 3 deletions.
3 changes: 3 additions & 0 deletions CH17/README.md
@@ -220,6 +220,9 @@ $$
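After removing stop words and counting word frequencies, we obtain the data table. The same data is used again in the exercises of the [Probabilistic Latent Semantic Analysis](../CH18/README.md) chapter.
More can be done with this data; for example, we can try visualizing it.

![fig_lsa_radar](assets/fig_lsa_radar.png)
The figure above shows how the three topics A, B, and C relate to the individual words. A word-topic radar chart can be drawn in the same way.

In this example, the reference result given in the book has its signs adjusted according to V, guaranteeing that the largest value in each row of V has a positive sign.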
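
The sign adjustment described above can be sketched as follows. This is only a minimal sketch, not the book's or the repository's code: `flip_signs_by_v` is a hypothetical helper and the matrix is made up. It also assumes that "each row of V" refers to the rows of the `vh` array returned by `np.linalg.svd`, and that "largest" means largest in absolute value.

```python
import numpy as np


def flip_signs_by_v(u, vh):
    """Flip paired singular vectors so each row of vh peaks at a positive value.

    Hypothetical helper: u has shape (m, k), vh has shape (k, n).
    """
    rows = np.arange(vh.shape[0])
    peak = np.argmax(np.abs(vh), axis=1)  # column of the largest |entry| per row
    signs = np.sign(vh[rows, peak])
    signs[signs == 0] = 1.0               # leave all-zero rows untouched
    # Flipping column j of u together with row j of vh leaves
    # u @ diag(s) @ vh unchanged.
    return u * signs, vh * signs[:, np.newaxis]


# Made-up word-document counts, only for illustration.
x = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 2.0, 4.0]])
u, s, vh = np.linalg.svd(x, full_matrices=False)
u_adj, vh_adj = flip_signs_by_v(u, vh)
```

Because the same sign is applied to a column of `u` and the matching row of `vh`, the product `(u_adj * s) @ vh_adj` still reconstructs `x`; only the orientation of each topic axis changes.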


2 changes: 2 additions & 0 deletions CH17/lsa.py
@@ -14,6 +14,7 @@ def __init__(self, n_components):
         self.components = None
         self.singular_values = None
         self.explained_variance_ratio = None
+        self.u = None

     def fit(self, x):
         u, s, vh = np.linalg.svd(x, full_matrices=False)
@@ -30,5 +31,6 @@ def fit(self, x):
         self.explained_variance = np.var(x_transformed, axis=0)
         self.explained_variance_ratio = (self.explained_variance/self.explained_variance.sum())[:k]
         self.explained_variance = self.explained_variance[:k]
+        self.u = u
         return x_transformed
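
To see how the stored `u` relates to the value returned by `fit`, here is a minimal sketch. It is not the repository's `lsa.py`: `truncated_svd` is a hypothetical function and the input matrix is made up. It assumes the transformed data is the first `k` columns of `u` scaled by the first `k` singular values, as in truncated SVD.

```python
import numpy as np


def truncated_svd(x, k):
    """Hypothetical sketch of a rank-k truncated SVD."""
    u, s, vh = np.linalg.svd(x, full_matrices=False)
    components = vh[:k]               # top-k right singular vectors, shape (k, n)
    x_transformed = u[:, :k] * s[:k]  # U_k Sigma_k, shape (m, k)
    return u, s[:k], components, x_transformed


# Made-up data, only for illustration.
x = np.array([[2.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 3.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 2.0, 4.0, 1.0],
              [3.0, 0.0, 0.0, 1.0]])
u, s_k, components, x_transformed = truncated_svd(x, k=2)

# Projecting x onto the retained right singular vectors gives the same result.
projected = x @ components.T
```

Keeping the full `u` next to `x_transformed`, as the diff does, makes the left singular vectors available for plotting without repeating the decomposition.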

24 changes: 22 additions & 2 deletions CH17/unit_test.py
@@ -124,9 +124,29 @@ def test_lsa(self):
         print(svd.explained_variance_ratio_.sum())

         svd_1 = lsa_test(n_components=3)
-        svd_1.fit(x)
+        rst = svd_1.fit(x)
         print("lsa test")
         print(svd_1.components)
         print(svd_1.singular_values)
         print(svd_1.explained_variance)
-        print(svd_1.explained_variance_ratio)
+        print(svd_1.explained_variance_ratio)
+
+        import utils
+        labels = ["A", "B", "C", "D", "E", "F"][:3]
+        feas = ["Book", "Dads", "Dummies", "Estate", "Guide", "Investing",
+                "Market", "Real", "Rich", "Stock", "Value"]
+        radar = utils.Radar(feas=feas, labels=labels)
+        print(svd_1.u[:, :3].shape)
+        radar.plot(svd_1.u[:, :3].T)
+
+    def test_plot_radar(self):
+        import utils
+        import pandas as pd
+        import sys
+
+        base_path = sys.path[0]
+        data = pd.read_csv(base_path+"/data/cities_ranking.csv")
+        print(data.head())
+        feas = ["A", "B", "C", "D", "E", "F"]
+        radar = utils.Radar(feas=feas, labels=["SH", "BJ"])
+        radar.plot(data[feas].values)
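
`utils.Radar` is not part of this diff, so its implementation is unknown. As a rough idea of what such a helper might do, here is a hypothetical sketch using matplotlib's polar axes. `plot_radar`, its arguments, and the numbers are all assumptions for illustration and are not the repository's API or data.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt


def plot_radar(values, feas, labels):
    """Draw one closed polygon per row of values (hypothetical helper).

    values: array of shape (len(labels), len(feas)).
    """
    values = np.asarray(values, dtype=float)
    angles = np.linspace(0, 2 * np.pi, len(feas), endpoint=False)
    closed_angles = np.concatenate([angles, angles[:1]])
    fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
    for row, label in zip(values, labels):
        closed = np.concatenate([row, row[:1]])  # repeat first point to close the polygon
        ax.plot(closed_angles, closed, label=label)
        ax.fill(closed_angles, closed, alpha=0.1)
    ax.set_xticks(angles)
    ax.set_xticklabels(feas)
    ax.legend(loc="upper right")
    return fig, ax


# Made-up scores, only for illustration.
feas = ["A", "B", "C", "D", "E", "F"]
values = np.array([[3.0, 4.0, 2.0, 5.0, 1.0, 4.0],
                   [4.0, 2.0, 5.0, 3.0, 3.0, 2.0]])
fig, ax = plot_radar(values, feas, labels=["SH", "BJ"])
```

Each row of `values` becomes one polygon, which matches the shape passed in the test above: three topic rows over eleven words in `test_lsa`, and one row per city over six features in `test_plot_radar`.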
4 changes: 3 additions & 1 deletion README.md
@@ -261,12 +261,14 @@
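## CH16 Principal Component Analysis

- An orthogonal transformation converts observations expressed in linearly correlated variables into data expressed in a small number of linearly uncorrelated variables; these uncorrelated variables are called **principal components**.
- A principal component does not correspond to any single feature of the original data; the factor loadings show how each principal component relates to the original features.
- This chapter does not yet mention the concept of a **topic**. The following chapters cover a lot of topic-analysis material: LSA, PLSA, and LDA are all topic-related, and MCMC is a tool used in LDA.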
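
The factor loadings mentioned in the list above can be checked numerically. The following is a minimal sketch on made-up random data, assuming PCA is carried out on the sample covariance matrix; the loading of component $k$ on feature $i$ is taken as $\sqrt{\lambda_k}\,e_{ik}/\sqrt{\sigma_{ii}}$, which is the correlation between the component and the feature.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 200 samples, 3 features, the first two correlated.
latent = rng.normal(size=(200, 1))
x = np.hstack([latent + 0.3 * rng.normal(size=(200, 1)),
               2.0 * latent + 0.5 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 1))])

# Eigen-decomposition of the sample covariance, sorted by decreasing variance.
cov = np.cov(x, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Principal components of the centered data.
y = (x - x.mean(axis=0)) @ eigvecs

# loadings[i, k] = sqrt(lambda_k) * e_ik / sqrt(sigma_ii)
loadings = eigvecs * np.sqrt(eigvals) / np.sqrt(np.diag(cov))[:, np.newaxis]

# Empirical correlation between feature i and component k, for comparison.
empirical = np.corrcoef(np.hstack([x, y]), rowvar=False)[:3, 3:]
```

Here `loadings` and `empirical` agree, and the squared loadings of each feature sum to 1 across all components.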

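## CH17 Latent Semantic Analysis

- In sklearn's definition, LSA is simply truncated singular value decomposition.
- Note the difference between LSA and PCA.
- Note the difference between LSA and PCA, which mainly comes down to whether the data is mean-centered.
- In LSA, the topic vector space is $U$, and the representation of the documents in the topic vector space is $SV^\mathrm{T}$. In sklearn, however, `x_transformed` is $U\mit\Sigma$.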
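
The two points in the list above, centering and the two layout conventions, can be illustrated with plain NumPy. This is a sketch under stated assumptions rather than sklearn's implementation: the count matrix is made up, and the "sklearn-style" convention is modeled simply by decomposing the transposed matrix, with documents as rows.

```python
import numpy as np

# Made-up word-document count matrix: rows are words, columns are documents.
x = np.array([[2.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 3.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 2.0, 4.0, 1.0],
              [3.0, 0.0, 0.0, 1.0]])

# Book convention: X = U Sigma V^T with words as rows.
u, s, vh = np.linalg.svd(x, full_matrices=False)
docs_in_topic_space = np.diag(s) @ vh   # Sigma V^T, shape (topics, documents)

# sklearn-style convention: documents as rows, transformed data is U Sigma.
docs = x.T
u_d, s_d, vh_d = np.linalg.svd(docs, full_matrices=False)
x_transformed = u_d * s_d               # shape (documents, topics)

# PCA-style: the same decomposition after subtracting each column mean.
centered = docs - docs.mean(axis=0)
_, s_centered, _ = np.linalg.svd(centered, full_matrices=False)
pca_variances = s_centered ** 2 / (docs.shape[0] - 1)

# Eigenvalues of the sample covariance, in decreasing order, for comparison.
cov_eigvals = np.sort(np.linalg.eigvalsh(np.cov(docs, rowvar=False)))[::-1]
```

Up to the sign of each topic axis, `x_transformed` is the transpose of `docs_in_topic_space`, so the two conventions describe the same document coordinates. The singular values `s` and `s_centered` differ, and only the centered ones correspond to covariance eigenvalues, which is the LSA versus PCA distinction.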

## CH18 Probabilistic Latent Semantic Analysis
