Scikit-learn's decomposition module includes many algorithms that can be used for dimensionality reduction. PCA and TruncatedSVD are two of them, and they differ in how they interpret their parameters. PCA uses its n_components parameter in different ways depending on its type:
n_components int, float or ‘mle’, default=None
Notably:
If 0 < n_components < 1 and svd_solver == 'full', select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components.
So if you set n_components to .95 and svd_solver to 'full', you get the smallest number of components that explains at least 95% of the variance. Great.
(See my wrapper function for PCA, get_min_pca.)
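For instance, a minimal sketch of this behaviour (the digits dataset is used here purely for illustration, it is not part of the discussion above):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data  # 1797 samples, 64 features

# Ask PCA for enough components to explain at least 95% of the variance:
pca = PCA(n_components=0.95, svd_solver='full')
X_reduced = pca.fit_transform(X)

# The number of components actually selected is exposed afterwards:
print(pca.n_components_, pca.explained_variance_ratio_.sum())
```

No manual search needed: the fitted estimator reports how many components it kept via its n_components_ attribute.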
On the other hand, the n_components parameter for TruncatedSVD is used only to set the output dimensionality:
n_components int, default=2
Desired dimensionality of output data. Must be strictly less than the number of features. The default value is useful for visualisation. For LSA, a value of 100 is recommended.
We can set this parameter to any integer from 1 up to (the number of features - 1), without knowing whether that number is optimal.
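A quick sketch of that blind choice (again with the digits dataset standing in as illustrative input):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import TruncatedSVD

X = load_digits().data  # 1797 samples, 64 features

# n_components is taken at face value; no variance threshold is involved:
svd = TruncatedSVD(n_components=2)  # the default, handy for visualisation
X_2d = svd.fit_transform(X)
print(X_2d.shape)                           # one row per sample, 2 columns
print(svd.explained_variance_ratio_.sum())  # whatever those 2 components happen to explain
```

Unlike PCA with a float n_components, nothing here tells us whether 2 components are enough for our purposes.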
The function get_min_svd defined below calls sklearn.decomposition.TruncatedSVD twice: first with the maximum number of components (the number of features - 1) to compute the cumulative explained variance, then with the smallest number of components found to exceed the given explained variance threshold. That number of components and the threshold are returned along with the transformed data.
def get_min_svd(data, min_var_explained=0.95,
show=True,
points_style='c--',
line_color='m'):
"""
    Decompose `data` using truncated singular value decomposition.
    Return the number of components needed to exceed the `min_var_explained`
    threshold, the threshold itself and the transformed data; the cumulative
    explained variance is plotted if `show` is True (default).
"""
    from sklearn.decomposition import TruncatedSVD
    # Use the max number of components: number of features - 1:
    d = data.shape[1]
    tsvd = TruncatedSVD(d - 1)
    tsvd.fit(data)
# Cumulative explained variance
cum_var = tsvd.explained_variance_ratio_.cumsum()
# Number of components for % variance explained
comp_min = list(cum_var > min_var_explained).index(True) + 1
    if show:
        import matplotlib.pyplot as plt
        fig = plt.figure()
        ax = fig.add_subplot()
        x_vals = list(range(1, d))
ax.plot(x_vals, cum_var, points_style, markersize=8, linewidth=1)
ax.hlines(min_var_explained, 0, comp_min, colors=line_color)
ax.vlines(comp_min, cum_var.min(), min_var_explained, colors=line_color)
ax.plot(comp_min, cum_var.min(), 'k+', markersize=12,
label=f'Components for {min_var_explained:.0%} of\nvariance explained: {comp_min}')
        ax.set_xlim(1 - .2, d - .2)
ax.set(xlabel='svd components', ylabel='cumulative explained variance')
ax.legend(markerscale=0, handlelength=0)
return comp_min, min_var_explained, TruncatedSVD(n_components=comp_min).fit_transform(data)
# Example (assuming `from sklearn.datasets import load_digits`):
# X = load_digits().data / 255
n_comps, thresh, reduced = get_min_svd(X)
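The same search can be condensed into a few lines without the plotting machinery (a sketch, once more using the digits dataset as a stand-in for `data`, and assuming the 95% threshold is actually reached):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import TruncatedSVD

X = load_digits().data
d = X.shape[1]

# Cumulative explained variance with the maximum number of components:
cum_var = TruncatedSVD(d - 1).fit(X).explained_variance_ratio_.cumsum()

# First index where the 95% threshold is crossed
# (np.argmax returns 0 if it never is, so this assumes it is reached):
comp_min = int(np.argmax(cum_var > 0.95)) + 1

reduced = TruncatedSVD(comp_min).fit_transform(X)
print(comp_min, reduced.shape)
```

This is the same two-pass idea as get_min_svd: one full decomposition to locate the threshold, then a second, truncated one to produce the reduced data.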