SVD算法

"Python机器学习实战笔记"

Posted by jhljx on January 30, 2018

目录

1. 矩阵分解
2. 基于协同过滤的推荐系统
2.1. 相似度计算
2.2. 推荐未评分的物品
2.3. 利用SVD提高推荐效果
3. 基于SVD的图像压缩

矩阵分解

import numpy as np
U, Sigma, VT = np.linalg.svd([[1, 1], [7, 7]])
U
array([[-0.14142136, -0.98994949],
       [-0.98994949,  0.14142136]])
Sigma #Sigma是一个对角矩阵,这样返回是为了节省空间
array([  1.00000000e+01,   2.82797782e-16])
VT
array([[-0.70710678, -0.70710678],
       [ 0.70710678, -0.70710678]])
import svdRec
Data = svdRec.loadExData()
U, Sigma, VT = np.linalg.svd(Data)
Sigma
array([  9.72140007e+00,   5.29397912e+00,   6.84226362e-01,
         1.97142765e-16,   1.57042877e-16])

矩阵分解的原理如下:

Sigma3 = np.mat([[Sigma[0], 0, 0], [0, Sigma[1], 0], [0, 0, Sigma[2]]])
retMat = U[:,:3] * Sigma3 * VT[:3,:]
retMat[np.abs(retMat)<1e-8] = 0
retMat
matrix([[ 1.,  1.,  1.,  0.,  0.],
        [ 2.,  2.,  2.,  0.,  0.],
        [ 1.,  1.,  1.,  0.,  0.],
        [ 5.,  5.,  5.,  0.,  0.],
        [ 1.,  1.,  0.,  2.,  2.],
        [ 0.,  0.,  0.,  3.,  3.],
        [ 0.,  0.,  0.,  1.,  1.]])

基于协同过滤的推荐系统

相似度计算

reload(svdRec)
<module 'svdRec' from 'svdRec.pyc'>
myMat = np.mat(svdRec.loadExData())
svdRec.ecludSim(myMat[:,0], myMat[:,4])
0.13367660240019172
svdRec.ecludSim(myMat[:,0], myMat[:,0])
1.0
svdRec.cosSim(myMat[:,0], myMat[:,4])
0.54724555912615336
svdRec.cosSim(myMat[:,0], myMat[:,0])
0.99999999999999989
svdRec.pearsSim(myMat[:,0], myMat[:,4])
0.23768619407595826
svdRec.pearsSim(myMat[:,0], myMat[:,0])
1.0

推荐未评分的物品

reload(svdRec)
<module 'svdRec' from 'svdRec.pyc'>
myMat = np.mat(svdRec.loadExData())
myMat[0, 1] = myMat[0, 0] = myMat[1, 0] = myMat[2, 0] = 4
myMat[3, 2] = 2
myMat
matrix([[4, 4, 1, 0, 0],
        [4, 2, 2, 0, 0],
        [4, 1, 1, 0, 0],
        [5, 5, 2, 0, 0],
        [1, 1, 0, 2, 2],
        [0, 0, 0, 3, 3],
        [0, 0, 0, 1, 1]])
svdRec.recommend(myMat, 2)
[(3, 2.5), (4, 2.5)]
svdRec.recommend(myMat, 2, simMeas=svdRec.ecludSim)
[(3, 2.5), (4, 2.5)]
svdRec.recommend(myMat, 2, simMeas=svdRec.pearsSim)
[(3, 2.5), (4, 2.5)]

利用SVD提高推荐效果

reload(svdRec)
<module 'svdRec' from 'svdRec.pyc'>
myMat = np.mat(svdRec.loadExData2())
U, Sigma, VT = np.linalg.svd(myMat)
Sigma
array([ 15.77075346,  11.40670395,  11.03044558,   4.84639758,
         3.09292055,   2.58097379,   1.00413543,   0.72817072,
         0.43800353,   0.22082113,   0.07367823])
Sig2 = Sigma ** 2
Sig2
array([  2.48716665e+02,   1.30112895e+02,   1.21670730e+02,
         2.34875695e+01,   9.56615756e+00,   6.66142570e+00,
         1.00828796e+00,   5.30232598e-01,   1.91847092e-01,
         4.87619735e-02,   5.42848136e-03])
np.sum(Sig2)
541.99999999999955
sum(Sig2) * 0.9
487.79999999999961
sum(Sig2[:2])
378.82955951135796
sum(Sig2[:3])
500.50028912757932

所以我们可以将一个11维矩阵转换成一个3维矩阵,使得这三个元素包含的能量高于总能量的90%。svdEst函数中以下语句:

xformedItems = dataMat.T * U[:, :4] * Sig4.I  #构建转换后的物品

这条语句就是用来计算矩阵奇异值分解后的$V$矩阵,因此需要先将$Data_{m \times n}$转置成$Data_{m \times n}^{T}$的$n \times m$的矩阵。

reload(svdRec)
<module 'svdRec' from 'svdRec.pyc'>
svdRec.recommend(myMat, 1, estMethod=svdRec.svdEst)
the 0 and 3 similarity is : 0.490950
the 0 and 5 similarity is : 0.484274
the 0 and 10 similarity is : 0.512755
the 1 and 3 similarity is : 0.491294
the 1 and 5 similarity is : 0.481516
the 1 and 10 similarity is : 0.509709
the 2 and 3 similarity is : 0.491573
the 2 and 5 similarity is : 0.482346
the 2 and 10 similarity is : 0.510584
the 4 and 3 similarity is : 0.450495
the 4 and 5 similarity is : 0.506795
the 4 and 10 similarity is : 0.512896
the 6 and 3 similarity is : 0.743699
the 6 and 5 similarity is : 0.468366
the 6 and 10 similarity is : 0.439465
the 7 and 3 similarity is : 0.482175
the 7 and 5 similarity is : 0.494716
the 7 and 10 similarity is : 0.524970
the 8 and 3 similarity is : 0.491307
the 8 and 5 similarity is : 0.491228
the 8 and 10 similarity is : 0.520290
the 9 and 3 similarity is : 0.522379
the 9 and 5 similarity is : 0.496130
the 9 and 10 similarity is : 0.493617





[(4, 3.3447149384692283), (7, 3.3294020724526971), (9, 3.328100876390069)]
svdRec.recommend(myMat, 1, estMethod=svdRec.svdEst, simMeas=svdRec.pearsSim)
the 0 and 3 similarity is : 0.341942
the 0 and 5 similarity is : 0.124132
the 0 and 10 similarity is : 0.116698
the 1 and 3 similarity is : 0.345560
the 1 and 5 similarity is : 0.126456
the 1 and 10 similarity is : 0.118892
the 2 and 3 similarity is : 0.345149
the 2 and 5 similarity is : 0.126190
the 2 and 10 similarity is : 0.118640
the 4 and 3 similarity is : 0.450126
the 4 and 5 similarity is : 0.528504
the 4 and 10 similarity is : 0.544647
the 6 and 3 similarity is : 0.923822
the 6 and 5 similarity is : 0.724840
the 6 and 10 similarity is : 0.710896
the 7 and 3 similarity is : 0.319482
the 7 and 5 similarity is : 0.118324
the 7 and 10 similarity is : 0.113370
the 8 and 3 similarity is : 0.334910
the 8 and 5 similarity is : 0.119673
the 8 and 10 similarity is : 0.112497
the 9 and 3 similarity is : 0.566918
the 9 and 5 similarity is : 0.590049
the 9 and 10 similarity is : 0.602380





[(4, 3.3469521867021732), (9, 3.3353796573274699), (6, 3.307193027813037)]

上述代码的运行效率并不高,一个原因是不必在每次估计评分时都做SVD分解。此外,在上面的矩阵中,含有很多0元素,在实际系统中0元素更多。我们可以只存储非零元素来节省内存和计算开销。另一个潜在的计算资源浪费来自于相似度得分。解决方法可以是离线计算并保存相似度得分。

基于SVD的图像压缩

reload(svdRec)
<module 'svdRec' from 'svdRec.pyc'>
svdRec.imgCompress(2)
****origin matrix*****
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0  
0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0  
0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0  
0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0  
0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0  
0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0  
0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0  
****reconstructed matrix using 2 singular values******
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0