矩阵分解

In [1]:
import numpy as np
In [2]:
U, Sigma, VT = np.linalg.svd([[1, 1], [7, 7]])
In [3]:
U
Out[3]:
array([[-0.14142136, -0.98994949],
       [-0.98994949,  0.14142136]])
In [4]:
Sigma #Sigma是一个对角矩阵,这样返回是为了节省空间
Out[4]:
array([  1.00000000e+01,   2.82797782e-16])
In [5]:
VT
Out[5]:
array([[-0.70710678, -0.70710678],
       [ 0.70710678, -0.70710678]])
In [6]:
import svdRec
In [7]:
Data = svdRec.loadExData()
In [8]:
U, Sigma, VT = np.linalg.svd(Data)
In [9]:
Sigma
Out[9]:
array([  9.72140007e+00,   5.29397912e+00,   6.84226362e-01,
         1.97142765e-16,   1.57042877e-16])

矩阵分解的原理如下: $$Data_{m \times n} \approx U_{m \times 3} \Sigma_{3 \times 3} V_{3 \times n}^{T}$$

In [10]:
Sigma3 = np.mat([[Sigma[0], 0, 0], [0, Sigma[1], 0], [0, 0, Sigma[2]]])
In [11]:
retMat = U[:,:3] * Sigma3 * VT[:3,:]
In [12]:
retMat[np.abs(retMat)<1e-8] = 0
In [13]:
retMat
Out[13]:
matrix([[ 1.,  1.,  1.,  0.,  0.],
        [ 2.,  2.,  2.,  0.,  0.],
        [ 1.,  1.,  1.,  0.,  0.],
        [ 5.,  5.,  5.,  0.,  0.],
        [ 1.,  1.,  0.,  2.,  2.],
        [ 0.,  0.,  0.,  3.,  3.],
        [ 0.,  0.,  0.,  1.,  1.]])

基于协同过滤的推荐系统

相似度计算

In [14]:
reload(svdRec)
Out[14]:
<module 'svdRec' from 'svdRec.pyc'>
In [15]:
myMat = np.mat(svdRec.loadExData())
In [16]:
svdRec.ecludSim(myMat[:,0], myMat[:,4])
Out[16]:
0.13367660240019172
In [17]:
svdRec.ecludSim(myMat[:,0], myMat[:,0])
Out[17]:
1.0
In [18]:
svdRec.cosSim(myMat[:,0], myMat[:,4])
Out[18]:
0.54724555912615336
In [19]:
svdRec.cosSim(myMat[:,0], myMat[:,0])
Out[19]:
0.99999999999999989
In [20]:
svdRec.pearsSim(myMat[:,0], myMat[:,4])
Out[20]:
0.23768619407595826
In [21]:
svdRec.pearsSim(myMat[:,0], myMat[:,0])
Out[21]:
1.0

推荐未评分的物品

In [22]:
reload(svdRec)
Out[22]:
<module 'svdRec' from 'svdRec.pyc'>
In [23]:
myMat = np.mat(svdRec.loadExData())
In [24]:
myMat[0, 1] = myMat[0, 0] = myMat[1, 0] = myMat[2, 0] = 4
In [25]:
myMat[3, 2] = 2
In [26]:
myMat
Out[26]:
matrix([[4, 4, 1, 0, 0],
        [4, 2, 2, 0, 0],
        [4, 1, 1, 0, 0],
        [5, 5, 2, 0, 0],
        [1, 1, 0, 2, 2],
        [0, 0, 0, 3, 3],
        [0, 0, 0, 1, 1]])
In [27]:
svdRec.recommend(myMat, 2)
Out[27]:
[(3, 2.5), (4, 2.5)]
In [28]:
svdRec.recommend(myMat, 2, simMeas=svdRec.ecludSim)
Out[28]:
[(3, 2.5), (4, 2.5)]
In [29]:
svdRec.recommend(myMat, 2, simMeas=svdRec.pearsSim)
Out[29]:
[(3, 2.5), (4, 2.5)]

利用SVD提高推荐效果

In [30]:
reload(svdRec)
Out[30]:
<module 'svdRec' from 'svdRec.pyc'>
In [31]:
myMat = np.mat(svdRec.loadExData2())
In [32]:
U, Sigma, VT = np.linalg.svd(myMat)
In [33]:
Sigma
Out[33]:
array([ 15.77075346,  11.40670395,  11.03044558,   4.84639758,
         3.09292055,   2.58097379,   1.00413543,   0.72817072,
         0.43800353,   0.22082113,   0.07367823])
In [34]:
Sig2 = Sigma ** 2
In [35]:
Sig2
Out[35]:
array([  2.48716665e+02,   1.30112895e+02,   1.21670730e+02,
         2.34875695e+01,   9.56615756e+00,   6.66142570e+00,
         1.00828796e+00,   5.30232598e-01,   1.91847092e-01,
         4.87619735e-02,   5.42848136e-03])
In [36]:
np.sum(Sig2)
Out[36]:
541.99999999999955
In [37]:
sum(Sig2) * 0.9
Out[37]:
487.79999999999961
In [38]:
sum(Sig2[:2])
Out[38]:
378.82955951135796
In [39]:
sum(Sig2[:3])
Out[39]:
500.50028912757932

所以我们可以将一个11维矩阵转换成一个3维矩阵,使得这三个元素包含的能量高于总能量的90%。svdEst函数中以下语句:

xformedItems = dataMat.T * U[:, :4] * Sig4.I  #构建转换后的物品

这条语句就是用来计算矩阵奇异值分解后的$V$矩阵,因此需要先将$Data_{m \times n}$转置成$Data_{m \times n}^{T}$的$n \times m$的矩阵。

In [40]:
reload(svdRec)
Out[40]:
<module 'svdRec' from 'svdRec.pyc'>
In [41]:
svdRec.recommend(myMat, 1, estMethod=svdRec.svdEst)
the 0 and 3 similarity is : 0.490950
the 0 and 5 similarity is : 0.484274
the 0 and 10 similarity is : 0.512755
the 1 and 3 similarity is : 0.491294
the 1 and 5 similarity is : 0.481516
the 1 and 10 similarity is : 0.509709
the 2 and 3 similarity is : 0.491573
the 2 and 5 similarity is : 0.482346
the 2 and 10 similarity is : 0.510584
the 4 and 3 similarity is : 0.450495
the 4 and 5 similarity is : 0.506795
the 4 and 10 similarity is : 0.512896
the 6 and 3 similarity is : 0.743699
the 6 and 5 similarity is : 0.468366
the 6 and 10 similarity is : 0.439465
the 7 and 3 similarity is : 0.482175
the 7 and 5 similarity is : 0.494716
the 7 and 10 similarity is : 0.524970
the 8 and 3 similarity is : 0.491307
the 8 and 5 similarity is : 0.491228
the 8 and 10 similarity is : 0.520290
the 9 and 3 similarity is : 0.522379
the 9 and 5 similarity is : 0.496130
the 9 and 10 similarity is : 0.493617
Out[41]:
[(4, 3.3447149384692283), (7, 3.3294020724526971), (9, 3.328100876390069)]
In [42]:
svdRec.recommend(myMat, 1, estMethod=svdRec.svdEst, simMeas=svdRec.pearsSim)
the 0 and 3 similarity is : 0.341942
the 0 and 5 similarity is : 0.124132
the 0 and 10 similarity is : 0.116698
the 1 and 3 similarity is : 0.345560
the 1 and 5 similarity is : 0.126456
the 1 and 10 similarity is : 0.118892
the 2 and 3 similarity is : 0.345149
the 2 and 5 similarity is : 0.126190
the 2 and 10 similarity is : 0.118640
the 4 and 3 similarity is : 0.450126
the 4 and 5 similarity is : 0.528504
the 4 and 10 similarity is : 0.544647
the 6 and 3 similarity is : 0.923822
the 6 and 5 similarity is : 0.724840
the 6 and 10 similarity is : 0.710896
the 7 and 3 similarity is : 0.319482
the 7 and 5 similarity is : 0.118324
the 7 and 10 similarity is : 0.113370
the 8 and 3 similarity is : 0.334910
the 8 and 5 similarity is : 0.119673
the 8 and 10 similarity is : 0.112497
the 9 and 3 similarity is : 0.566918
the 9 and 5 similarity is : 0.590049
the 9 and 10 similarity is : 0.602380
Out[42]:
[(4, 3.3469521867021732), (9, 3.3353796573274699), (6, 3.307193027813037)]

上述代码的运行效率并不高,一个原因是不必在每次估计评分时都做SVD分解。此外,在上面的矩阵中,含有很多0元素,在实际系统中0元素更多。我们可以只存储非零元素来节省内存和计算开销。另一个潜在的计算资源浪费来自于相似度得分。解决方法可以是离线计算并保存相似度得分。

基于SVD的图像压缩

In [43]:
reload(svdRec)
Out[43]:
<module 'svdRec' from 'svdRec.pyc'>
In [44]:
svdRec.imgCompress(2)
****origin matrix*****
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0  
0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0  
0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0  
0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0  
0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0  
0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0  
0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0  
****reconstructed matrix using 2 singular values******
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0  
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0