Dimensionality reduction

Principal Component Analysis

class DLL.MachineLearning.UnsupervisedLearning.DimensionalityReduction.PCA(n_components=2, epsilon=1e-10)

Bases: object

Principal Component Analysis (PCA) class for dimensionality reduction.

Parameters: n_components (int) – Number of principal components to keep. Must be a positive integer.

components

Principal components extracted from the data.

Type: torch.Tensor

explained_variance

Variance explained by the selected components.

Type: torch.Tensor

fit(X, normalize=True, from_covariance=False)

Fits the PCA model to the input data by calculating the principal components.

The input data is always centered. If normalize=True, it is also scaled so that each feature has a standard deviation of 1.
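As a minimal sketch of the preprocessing described above, the following plain-Python function centers each column and optionally rescales it to unit standard deviation. The helper name `standardize` is hypothetical, and the use of the sample standard deviation (dividing by n - 1) is an assumption; the exact normalization used by the library is not stated here.

```python
import math

def standardize(rows, normalize=True):
    """Center each column; optionally scale it to unit standard deviation."""
    n = len(rows)
    d = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    if not normalize:
        return centered
    # Assumption: sample standard deviation (n - 1 in the denominator).
    stds = [math.sqrt(sum(row[j] ** 2 for row in centered) / (n - 1))
            for j in range(d)]
    return [[row[j] / stds[j] for j in range(d)] for row in centered]

# Rows are samples, columns are features.
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
Z = standardize(X)
```

A constant column has zero standard deviation and would cause a division by zero in this sketch; a real implementation needs a guard for that case.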

Parameters:

X (torch.Tensor of shape (n_samples, n_features) or (n_features, n_features)) – If from_covariance is False, the data matrix, where each row is a sample and each column is a feature. If from_covariance is True, the covariance matrix.
normalize (bool, optional) – Whether to normalize the data before computing the PCA. Defaults to True. Ignored if from_covariance is True.
from_covariance (bool, optional) – Whether X is treated as the covariance matrix (True) or as the data matrix (False). Must be a boolean. Defaults to False.

Returns:

None

Raises:

TypeError – If the input matrix is not a PyTorch tensor or if the normalize parameter is not a boolean.
ValueError – If the input matrix is not the correct shape.
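To illustrate what fitting involves, here is a dependency-free sketch that builds the sample covariance matrix of the data and extracts its leading eigenvector, which is the first principal component. Power iteration is swapped in for a full eigendecomposition to keep the example short; it is not claimed to be the method this class uses. The function names are hypothetical.

```python
import math

def covariance_matrix(rows):
    """Sample covariance of the columns of `rows` (each row is a sample)."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def leading_component(cov, n_iter=200):
    """Power iteration: unit eigenvector and eigenvalue of the largest eigenvalue."""
    d = len(cov)
    v = [1.0] * d  # assumes this start vector is not orthogonal to the answer
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cov_v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    # The Rayleigh quotient v^T C v is the variance explained by v.
    return v, sum(v[i] * cov_v[i] for i in range(d))

# Points on the line y = 2x: all variance lies along the direction (1, 2).
X = [[float(t), 2.0 * t] for t in range(5)]
component, variance = leading_component(covariance_matrix(X))
```

Because `leading_component` takes a covariance matrix directly, it can also be called on a precomputed covariance, which is the idea behind from_covariance=True.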

fit_transform(X, normalize=True)

Fits the principal components of X and then transforms X into the principal component space.

Parameters:

X (torch.Tensor of shape (n_samples, n_features)) – The input data to be transformed.
normalize (bool, optional) – Whether to normalize the data before computing the PCA. Defaults to True.

Returns:

The data transformed into the principal component space.

Return type:

X_new (torch.Tensor of shape (n_samples, n_components))

transform(X)

Applies the fitted PCA model to the input data, transforming it into the reduced feature space. If the model was fitted from a covariance matrix, the input is assumed to have been normalized appropriately.

Parameters:

X (torch.Tensor of shape (n_samples, n_features)) – The input data to be transformed.

Returns:

The data transformed into the principal component space.

Return type:

X_new (torch.Tensor of shape (n_samples, n_components))

Raises:

NotFittedError – If the PCA model has not been fitted before transforming.
TypeError – If the input matrix is not a PyTorch tensor.
ValueError – If the input matrix is not the correct shape.
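Conceptually, transforming data means subtracting the training mean and taking dot products with the components. The sketch below shows that projection in plain Python. The `project` helper and the layout of `components` (one unit vector per row) are assumptions for illustration, not the library's internal representation.

```python
def project(rows, means, components):
    """Project centered samples onto each component (one unit vector per row)."""
    return [
        [sum((x - m) * c for x, m, c in zip(row, means, comp))
         for comp in components]
        for row in rows
    ]

means = [2.0, 4.0]                           # feature means seen during fitting
components = [[1 / 5 ** 0.5, 2 / 5 ** 0.5]]  # one component over two features
X_new = project([[2.0, 4.0], [3.0, 6.0]], means, components)
```

The result has one row per sample and one column per component, matching the (n_samples, n_components) shape documented above.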

Robust Principal Component Analysis

class DLL.MachineLearning.UnsupervisedLearning.DimensionalityReduction.RobustPCA(n_components=2, method='mcd', epsilon=1e-10)

Bases: object

Robust version of Principal Component Analysis (PCA).

Parameters:

n_components (int) – Number of principal components to keep. The number must be a positive integer.
method (str, optional) – The method used for the robust estimator. Must be one of “mcd” or “decomposition”. Defaults to “mcd”.

components

Principal components extracted from the data.

Type: torch.Tensor

explained_variance

Variance explained by the selected components.

Type: torch.Tensor

fit(X, epochs=1000, normalize=True, proportion=0.5, n_tries=20)

Fits the RobustPCA model to the input data by calculating the principal components. If method is “decomposition”, finds matrices L and S such that X = L + S, where L is low rank and S is sparse, and then applies PCA to L. If method is “mcd”, searches for a subset of the data with the minimum covariance determinant and uses the covariance of that subset as the covariance estimate.

The input data is always centered. If normalize=True, it is also scaled so that each feature has a standard deviation of 1.

Parameters:

X (torch.Tensor of shape (n_samples, n_features)) – The input data, where each row is a sample and each column is a feature.
epochs (int, optional) – The number of iterations used for finding the L and S matrices. Must be a positive integer. Defaults to 1000.
normalize (bool, optional) – Whether to normalize the data before computing the PCA. Defaults to True.
proportion (float, optional) – The proportion of the data used to estimate the covariance matrix. Ignored unless method is “mcd”. Must be in the open interval (0, 1). Defaults to 0.5.
n_tries (int, optional) – The number of random subsets of samples tried when estimating the covariance. Must be a positive integer. Defaults to 20.
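The roles of proportion and n_tries can be sketched as follows: draw several random subsets, compute the determinant of each subset's covariance matrix, and keep the subset with the smallest one. This is a simplified illustration restricted to 2-D points, with hypothetical function names and an added `seed` argument for reproducibility. Established minimum covariance determinant algorithms refine each subset further, and this sketch does not claim to match the library's procedure.

```python
import random

def cov_det_2d(points):
    """Determinant of the 2x2 sample covariance matrix of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return sxx * syy - sxy ** 2

def min_cov_det_subset(points, proportion=0.5, n_tries=20, seed=0):
    """Try n_tries random subsets and keep the one with the smallest determinant."""
    rng = random.Random(seed)
    size = max(3, int(proportion * len(points)))
    best_idx, best_det = None, float("inf")
    for _ in range(n_tries):
        idx = rng.sample(range(len(points)), size)
        det = cov_det_2d([points[i] for i in idx])
        if det < best_det:
            best_idx, best_det = idx, det
    return best_idx, best_det

# Twenty tightly grouped points plus two gross outliers.
inliers = [(0.1 * i, 0.05 * (i % 4)) for i in range(20)]
outliers = [(50.0, -40.0), (-60.0, 45.0)]
idx, det = min_cov_det_subset(inliers + outliers)
```

Subsets that happen to exclude the outliers have a far smaller determinant, so they are the ones the search tends to keep; the covariance of the winning subset then stands in for the covariance of the full data.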

fit_transform(X, epochs=1000, normalize=True)

Fits the principal components of X and then transforms X into the principal component space.

Parameters:

X (torch.Tensor of shape (n_samples, n_features)) – The input data to be transformed.
epochs (int, optional) – The number of iterations used for finding the L and S matrices. Must be a positive integer. Defaults to 1000.
normalize (bool, optional) – Whether to normalize the data before computing the PCA. Defaults to True.

Returns:

The data transformed into the principal component space.

Return type:

X_new (torch.Tensor of shape (n_samples, n_components))

transform(X)

Applies the fitted RobustPCA model to the input data, transforming it into the reduced feature space.

Parameters:

X (torch.Tensor of shape (n_samples, n_features)) – The input data to be transformed.

Returns:

The data transformed into the principal component space.

Return type:

X_new (torch.Tensor of shape (n_samples, n_components))

Raises:

NotFittedError – If the RobustPCA model has not been fitted before transforming.
TypeError – If the input matrix is not a PyTorch tensor.
ValueError – If the input matrix is not the correct shape.

t-Distributed Stochastic Neighbor Embedding

class DLL.MachineLearning.UnsupervisedLearning.DimensionalityReduction.TSNE(n_components=2, init='pca', p=2, early_exaggeration=12.0, perplexity=30.0, learning_rate='auto')

Bases: object

t-distributed Stochastic Neighbor Embedding (t-SNE) class for dimensionality reduction. This implementation is based on this paper and this article. The main difference is that this implementation uses vectorized matrix operations, which makes it considerably faster than the loop-based approach used in the article.

Parameters:

n_components (int) – The number of dimensions of the embedding. Must be a positive integer.
init (str, optional) – The method for initializing the embedding. Must be in ["pca", "random"]. Defaults to "pca".
p (int, optional) – The order of the chosen metric. Must be a positive integer. Defaults to 2, which corresponds to the Euclidean metric.
early_exaggeration (float | int, optional) – Determines how far apart the clusters are in the embedding space. Must be a positive real number. Defaults to 12.0.
perplexity (float | int, optional) – Determines how far apart samples can be while still being considered neighbors. Must be a positive real number. Defaults to 30.0. Values between 5 and 50 are a reasonable starting point.
learning_rate (float | int | str, optional) – The step size of the gradient updates. Must be a positive real number or "auto". Values between 10.0 and 1000.0 are recommended. Defaults to "auto", which uses max(n_samples / (4 * early_exaggeration), 50).
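The "auto" rule for learning_rate can be written out directly. The helper below is a hypothetical illustration of the formula quoted in the parameter description, not a function exposed by the library.

```python
def resolve_learning_rate(learning_rate, n_samples, early_exaggeration=12.0):
    """Return a numeric step size, resolving the "auto" rule quoted above."""
    if learning_rate == "auto":
        return max(n_samples / (4 * early_exaggeration), 50)
    return learning_rate

# 1000 / 48 is below 50, so the floor of 50 applies.
small = resolve_learning_rate("auto", n_samples=1000)
# 12000 / 48 = 250, which exceeds the floor.
large = resolve_learning_rate("auto", n_samples=12000)
```

With the default early_exaggeration of 12.0, the floor of 50 is in effect for any dataset with fewer than 2400 samples.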

history

The value of the KL-divergence loss at each epoch. Available after fitting the model.

Type: list[float]
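For reference, the quantity tracked in history is a Kullback-Leibler divergence between two pairwise similarity distributions, commonly written P for the input space and Q for the embedding. The sketch below computes KL(P || Q) for two small matrices in plain Python. The matrices and the `eps` guard are illustrative assumptions, not values taken from the library.

```python
import math

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q) = sum over i, j of p_ij * log(p_ij / q_ij); zero p_ij are skipped."""
    total = 0.0
    for p_row, q_row in zip(P, Q):
        for p, q in zip(p_row, q_row):
            if p > 0:
                total += p * math.log(p / max(q, eps))
    return total

# Symmetric pairwise similarities with a zero diagonal, each summing to 1.
P = [[0.0, 0.2, 0.1], [0.2, 0.0, 0.2], [0.1, 0.2, 0.0]]
Q = [[0.0, 0.1, 0.2], [0.1, 0.0, 0.2], [0.2, 0.2, 0.0]]
loss = kl_divergence(P, Q)
```

The divergence is zero when the embedding similarities match the input similarities exactly and positive otherwise.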

fit(X, epochs=100, verbose=False)

Wrapper for the TSNE.fit_transform(X) method.

Parameters:

X (torch.Tensor of shape (n_samples, n_features)) – The input data, where each row is a sample and each column is a feature.
epochs (int, optional) – The number of training epochs after early exaggeration. Must be a positive integer. Defaults to 100.
verbose (bool, optional) – Determines if the loss is printed each epoch. Must be a boolean. Defaults to False.

fit_transform(X, epochs=100, verbose=False)

Fits the t-SNE model to the input data and returns the embedding.

Parameters:

X (torch.Tensor of shape (n_samples, n_features)) – The input data, where each row is a sample and each column is a feature.
epochs (int, optional) – The number of training epochs after early exaggeration. Must be a positive integer. Defaults to 100. Due to early exaggeration, the embedding is updated epochs + 250 times.
verbose (bool, optional) – Determines if the loss is printed each epoch. Must be a boolean. Defaults to False.

Returns:

The embedded samples of shape (n_samples, n_components).

Return type:

embedding (torch.Tensor)

Uniform Manifold Approximation and Projection

class DLL.MachineLearning.UnsupervisedLearning.DimensionalityReduction.UMAP(n_components=2, init='spectral', p=2, n_neighbor=15, min_dist=0.25, learning_rate=1)

Bases: object

Uniform Manifold Approximation and Projection (UMAP) class for dimensionality reduction. This implementation is based on this paper and this article.

Parameters:

n_components (int) – The number of dimensions of the embedding. Must be a positive integer.
init (str, optional) – The method for initializing the embedding. Must be in ["spectral", "pca", "random"]. Defaults to "spectral".
p (int, optional) – The order of the chosen metric. Must be a positive integer. Defaults to 2, which corresponds to the Euclidean metric.
n_neighbor (int, optional) – Controls how UMAP balances local and global structure in the data. Larger values preserve global structure better; smaller values preserve fine detail but may lose global structure. Must be a positive integer. Defaults to 15.
min_dist (float | int, optional) – Controls the minimum distance between samples in the low-dimensional space. Must be a non-negative real number. Defaults to 0.25.
learning_rate (float | int, optional) – The step size of the gradient updates. Must be a positive real number. Defaults to 1.
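The p parameter selects the order of the distance metric. Assuming this refers to the Minkowski distance of order p, which is consistent with p = 2 giving the Euclidean metric, a minimal sketch looks like this. The function name is hypothetical.

```python
def minkowski_distance(a, b, p=2):
    """Distance of order p between two points; p=2 is the Euclidean metric."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

euclidean = minkowski_distance((0.0, 0.0), (3.0, 4.0), p=2)  # straight-line distance
manhattan = minkowski_distance((0.0, 0.0), (3.0, 4.0), p=1)  # sum of coordinate gaps
```

Larger orders put more weight on the coordinate with the largest difference, so the choice of p changes which samples count as neighbors.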

history

The value of the cross-entropy loss at each epoch. Available after fitting the model.

Type: list[float]

fit(X, epochs=100, verbose=False)

Wrapper for the UMAP.fit_transform(X) method.

Parameters:

X (torch.Tensor of shape (n_samples, n_features)) – The input data, where each row is a sample and each column is a feature.
epochs (int, optional) – The number of training epochs. Must be a positive integer. Defaults to 100.
verbose (bool, optional) – Determines if the loss is printed each epoch. Must be a boolean. Defaults to False.

fit_transform(X, epochs=100, verbose=False)

Fits the UMAP model to the input data.

Parameters:

X (torch.Tensor of shape (n_samples, n_features)) – The input data, where each row is a sample and each column is a feature.
epochs (int, optional) – The number of training epochs. Must be a positive integer. Defaults to 100.
verbose (bool, optional) – Determines if the loss is printed each epoch. Must be a boolean. Defaults to False.

Returns:

The embedded samples of shape (n_samples, n_components).

Return type:

embedding (torch.Tensor)

Discriminant Analysis

Linear Discriminant Analysis

class DLL.MachineLearning.UnsupervisedLearning.DimensionalityReduction.LDA(n_components=2)

Bases: object

Linear discriminant analysis (LDA) class for dimensionality reduction.

Parameters: n_components (int) – Number of components to keep. Must be a positive integer.

components

Components extracted from the data.

Type: torch.Tensor

n_features

The number of features in the input.

Type: int

fit(X, y)

Fits the LDA model to the input data by calculating the components.

Parameters:

X (torch.Tensor of shape (n_samples, n_features)) – The input data, where each row is a sample and each column is a feature.
y (torch.Tensor of shape (n_samples,)) – The class labels of the samples.

Raises:

TypeError – If the input matrix is not a PyTorch tensor.
ValueError – If the input matrix is not the correct shape.
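As a rough illustration of what the components represent, the sketch below computes the classical two-class Fisher direction for 2-D data: the within-class scatter matrix Sw is accumulated, and the direction is proportional to Sw^{-1} (mu_1 - mu_0). This is a textbook special case with a hypothetical function name, not the library's general multi-class implementation.

```python
import math

def fisher_direction(X, y):
    """Two-class Fisher direction, proportional to Sw^{-1} (mu_1 - mu_0), for 2-D inputs."""
    classes = sorted(set(y))
    means = []
    sw = [[0.0, 0.0], [0.0, 0.0]]  # within-class scatter matrix
    for c in classes:
        pts = [x for x, label in zip(X, y) if label == c]
        mu = [sum(p[j] for p in pts) / len(pts) for j in range(2)]
        means.append(mu)
        for p in pts:
            d = [p[0] - mu[0], p[1] - mu[1]]
            for i in range(2):
                for j in range(2):
                    sw[i][j] += d[i] * d[j]
    diff = [means[1][0] - means[0][0], means[1][1] - means[0][1]]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    # Closed-form 2x2 inverse of Sw applied to the difference of class means.
    w = [(sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
         (-sw[1][0] * diff[0] + sw[0][0] * diff[1]) / det]
    norm = math.hypot(w[0], w[1])
    return [w[0] / norm, w[1] / norm]

# Two unit squares of points, separated along the first feature only.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (4, 0), (4, 1), (5, 0), (5, 1)]
y = [0, 0, 0, 0, 1, 1, 1, 1]
w = fisher_direction(X, y)
```

Unlike PCA, the direction depends on the labels y: it is the axis along which the class means are far apart relative to the spread within each class.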

fit_transform(X, y)

Fits the components to X and then transforms X into the component space.

Parameters:

X (torch.Tensor of shape (n_samples, n_features)) – The input data to be transformed.
y (torch.Tensor of shape (n_samples,)) – The class labels of the samples.

Returns:

The data transformed into the component space.

Return type:

X_new (torch.Tensor of shape (n_samples, n_components))

predict(X)

Applies the fitted LDA model to the input data, predicting a class label for each sample.

Parameters: X (torch.Tensor of shape (n_samples, n_features)) – The input data to be classified.
Returns: The predicted labels.
Return type: y (torch.Tensor of shape (n_samples,))
Raises: NotFittedError – If the LDA model has not been fitted before predicting.

transform(X)

Applies the fitted LDA model to the input data, transforming it into the reduced feature space.

Parameters:

X (torch.Tensor of shape (n_samples, n_features)) – The input data to be transformed.

Returns:

The data transformed into the component space.

Return type:

X_new (torch.Tensor of shape (n_samples, n_components))

Raises:

NotFittedError – If the LDA model has not been fitted before transforming.
TypeError – If the input matrix is not a PyTorch tensor.
ValueError – If the input matrix is not the correct shape.

Quadratic Discriminant Analysis

class DLL.MachineLearning.UnsupervisedLearning.DimensionalityReduction.QDA

Bases: object

Quadratic discriminant analysis (QDA) class for classification.

n_features

The number of features in the input.

Type: int

fit(X, y)

Fits the QDA model to the input data by calculating the class means and covariances.

Parameters:

X (torch.Tensor of shape (n_samples, n_features)) – The input data, where each row is a sample and each column is a feature.
y (torch.Tensor of shape (n_samples,)) – The class labels of the samples.

Raises:

TypeError – If the input matrix is not a PyTorch tensor.
ValueError – If the input matrix is not the correct shape.

predict(X)

Applies the fitted QDA model to the input data, predicting a class label for each sample.

Parameters: X (torch.Tensor of shape (n_samples, n_features)) – The input data to be classified.
Returns: The predicted labels.
Return type: y (torch.Tensor of shape (n_samples,))
Raises: NotFittedError – If the QDA model has not been fitted before predicting.
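The idea of fitting class means and covariances and then predicting labels can be sketched in one dimension, where each class covariance reduces to a variance. The example below scores each class by its Gaussian log-density plus the log prior and picks the largest. Function names are hypothetical, and this is a simplified illustration rather than the library's implementation.

```python
import math

def fit_qda_1d(xs, ys):
    """Per-class prior, mean and sample variance for one-dimensional inputs."""
    params = {}
    for c in sorted(set(ys)):
        pts = [x for x, label in zip(xs, ys) if label == c]
        mu = sum(pts) / len(pts)
        var = sum((x - mu) ** 2 for x in pts) / (len(pts) - 1)
        params[c] = (len(pts) / len(xs), mu, var)
    return params

def predict_qda_1d(params, x):
    """Pick the class with the largest Gaussian log-posterior (up to a constant)."""
    def score(c):
        prior, mu, var = params[c]
        return math.log(prior) - 0.5 * math.log(var) - 0.5 * (x - mu) ** 2 / var
    return max(params, key=score)

xs = [-1.0, 0.0, 1.0, 8.0, 10.0, 12.0]
ys = [0, 0, 0, 1, 1, 1]
params = fit_qda_1d(xs, ys)
label_near_zero = predict_qda_1d(params, 0.5)
label_near_ten = predict_qda_1d(params, 9.0)
```

Each class keeps its own variance, which is what makes the decision boundary quadratic; sharing a single covariance across classes would give the linear boundary of LDA instead.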