Clustering
KMeans
- class DLL.MachineLearning.UnsupervisedLearning.Clustering.KMeansClustering(k=3, max_iters=100, init='kmeans++', n_init=10, tol=1e-05)[source]
Bases:
object
KMeansClustering implements the K-Means clustering algorithm, which partitions data points into k clusters.
- Parameters:
k (int, optional) – The number of clusters. Defaults to 3. Must be a positive integer.
max_iters (int, optional) – The maximum number of iterations for training the model. Defaults to 100. Must be a positive integer.
init ({kmeans++, random}, optional) – The method for initialising the centroids. Defaults to kmeans++.
n_init (int, optional) – The number of different centroid initialisations to try. Defaults to 10. Must be a positive integer.
- centroids
The final chosen centroids.
- Type:
torch.Tensor
- inertia
The sum of squared distances from each sample to its nearest centroid.
- Type:
float
- fit(X)[source]
Fits the KMeansClustering model to the input data by finding the best centroids.
- Parameters:
X (torch.Tensor of shape (n_samples, n_features)) – The input data, where each row is a sample and each column is a feature. The number of samples must be at least k.
- Returns:
None
- Raises:
TypeError – If the input matrix is not a PyTorch tensor.
ValueError – If the input matrix is not the correct shape.
- predict(X)[source]
Applies the fitted KMeansClustering model to the input data, partitioning it into k clusters.
- Parameters:
X (torch.Tensor of shape (n_samples, n_features)) – The input data to be partitioned.
- Returns:
The cluster corresponding to each sample.
- Return type:
labels (torch.Tensor of shape (n_samples,))
- Raises:
NotFittedError – If the KMeansClustering model has not been fitted before predicting.
TypeError – If the input matrix is not a PyTorch tensor.
ValueError – If the input matrix is not the correct shape.
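The relationship between the centroids, the predicted labels and the inertia can be illustrated with a minimal plain-Python sketch. It is not DLL's implementation: it uses lists instead of torch.Tensor, takes the centroids as given rather than fitting them, and the helper names squared_distance and assign_and_inertia are hypothetical.

```python
# Illustration only: nearest-centroid assignment and inertia for fixed centroids.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_and_inertia(samples, centroids):
    """Return (labels, inertia) for the given samples and centroids."""
    labels = []
    inertia = 0.0
    for sample in samples:
        distances = [squared_distance(sample, c) for c in centroids]
        # Index of the closest centroid is the cluster label of this sample.
        nearest = min(range(len(centroids)), key=distances.__getitem__)
        labels.append(nearest)
        # Inertia accumulates the squared distance to that closest centroid.
        inertia += distances[nearest]
    return labels, inertia


samples = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 12.0)]
centroids = [(0.0, 0.5), (10.0, 11.0)]
labels, inertia = assign_and_inertia(samples, centroids)
print(labels, inertia)
```

A lower inertia means the samples lie closer to their assigned centroids, which is one plausible criterion for comparing runs started from different initialisations.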
Gaussian mixture models
- class DLL.MachineLearning.UnsupervisedLearning.Clustering.GaussianMixture(k=3, max_iters=10, tol=1e-05)[source]
Bases:
object
Gaussian mixture model. Fits k Gaussian distributions onto the data using maximum likelihood estimation.
- Parameters:
k (int, optional) – The number of Gaussian distributions (clusters). Must be a positive integer. Defaults to 3.
max_iters (int, optional) – The maximum number of iterations. Must be a positive integer. Defaults to 10.
- fit(X, verbose=False)[source]
Fits the k Gaussian distributions to the data using maximum likelihood estimation.
- Parameters:
X (torch.Tensor of shape (n_samples, n_features)) – The input data to be clustered.
verbose (bool, optional) – Determines if the likelihood should be calculated and printed during training. Must be a boolean. Defaults to False.
- predict(X)[source]
Predicts the clusters of the data according to the fitted distributions.
- Parameters:
X (torch.Tensor of shape (n_samples, n_features)) – The input data to be clustered.
- Returns:
A tensor containing the cluster label of each sample.
- Return type:
torch.Tensor of shape (n_samples,)
- predict_proba(X)[source]
Predicts the probability of each sample belonging to each of the fitted distributions.
- Parameters:
X (torch.Tensor of shape (n_samples, n_features)) – The input data to be clustered.
- Returns:
A tensor of probabilities, with one row per sample and one column per fitted distribution.
- Return type:
torch.Tensor of shape (n_samples, k)
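The shape of the predict_proba output can be illustrated with a small sketch of how membership probabilities are commonly computed for a Gaussian mixture. The sketch makes several assumptions that are not taken from DLL: the data is one-dimensional, the mixture weights, means and variances are already known, and the hard label is taken to be the most probable component. The function names are hypothetical.

```python
import math

# Illustration only: posterior component probabilities for a 1-D Gaussian mixture.

def gaussian_pdf(x, mean, variance):
    """Density of a 1-D normal distribution at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


def responsibilities(samples, weights, means, variances):
    """Return one row per sample and one column per component; rows sum to 1."""
    rows = []
    for x in samples:
        # Weighted density of each component at x.
        joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        rows.append([j / total for j in joint])
    return rows


def hard_labels(probabilities):
    """Index of the most probable component for each sample (assumed rule)."""
    return [max(range(len(row)), key=row.__getitem__) for row in probabilities]


samples = [-2.1, -1.9, 3.0, 3.2]
proba = responsibilities(samples, weights=[0.5, 0.5], means=[-2.0, 3.0], variances=[1.0, 1.0])
labels = hard_labels(proba)
print(proba)
print(labels)
```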
Spectral clustering
- class DLL.MachineLearning.UnsupervisedLearning.Clustering.SpectralClustering(kernel=<DLL.MachineLearning.SupervisedLearning.Kernels.RBF object>, k=3, max_iters=100, normalise=True, use_kmeans=True, **kwargs)[source]
Bases:
object
SpectralClustering implements the spectral clustering algorithm, which partitions data points into k clusters.
- Parameters:
kernel (Kernels, optional) – The similarity function for fitting the model. Defaults to RBF(correlation_length=0.1).
k (int, optional) – The number of clusters. Defaults to 3. Must be a positive integer.
max_iters (int, optional) – The maximum number of iterations for training the model. Defaults to 100. Must be a positive integer.
normalise (bool, optional) – Determines if the Laplacian matrix is calculated using L = I - sqrt(inv(D)) A sqrt(inv(D)) or just L = D - A. Defaults to True.
use_kmeans (bool, optional) – Determines if the clustering in the embedded space is done using kmeans or discretisation. Defaults to True.
**kwargs – Other arguments are passed into the KMeansClustering algorithm.
Note
The result depends heavily on the chosen kernel function. In particular, the correlation_length parameter should be tuned for optimal performance.
- fit(X)[source]
Fits the algorithm to the given data. Transforms the data into the embedding space using the kernel function and clusters the data in that space.
- Parameters:
X (torch.Tensor of shape (n_samples, n_features)) – The input data, where each row is a sample and each column is a feature. The number of samples must be at least k.
- predict()[source]
Applies the fitted SpectralClustering model to the data passed to fit, partitioning it into k clusters.
- Returns:
The cluster corresponding to each sample.
- Return type:
labels (torch.Tensor of shape (n_samples,))
- Raises:
NotFittedError – If the SpectralClustering model has not been fitted before predicting.
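The two Laplacian definitions selected by the normalise parameter can be compared on a small example. The sketch below is plain Python rather than DLL code: it takes a precomputed affinity matrix A as a list of lists (in the class this would come from the kernel), assumes every row of A has a positive sum, and uses the hypothetical function name laplacian.

```python
import math

# Illustration only: L = I - sqrt(inv(D)) A sqrt(inv(D)) versus L = D - A,
# where D is the diagonal matrix of the row sums of A.

def laplacian(affinity, normalise=True):
    """Return the graph Laplacian of a symmetric affinity matrix."""
    n = len(affinity)
    degrees = [sum(row) for row in affinity]
    if not normalise:
        # Unnormalised form: degree on the diagonal minus the affinity.
        return [[(degrees[i] if i == j else 0.0) - affinity[i][j] for j in range(n)]
                for i in range(n)]
    # Normalised form: scale each entry by 1 / sqrt(d_i * d_j).
    inv_sqrt = [1.0 / math.sqrt(d) for d in degrees]
    return [[(1.0 if i == j else 0.0) - inv_sqrt[i] * affinity[i][j] * inv_sqrt[j]
             for j in range(n)]
            for i in range(n)]


# Affinity of a three-node path graph: node 1 is connected to nodes 0 and 2.
A = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
L_unnormalised = laplacian(A, normalise=False)
L_normalised = laplacian(A, normalise=True)
print(L_unnormalised)
print(L_normalised)
```

In the unnormalised form every row sums to zero. The normalised form rescales entries by the node degrees, which reduces the influence of samples with many strong connections.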
Density-based clustering
- class DLL.MachineLearning.UnsupervisedLearning.Clustering.DBScan(eps=0.5, min_samples=5)[source]
Bases:
object
Density-based spatial clustering of applications with noise (DBSCAN) algorithm.
- Parameters:
eps (float | int, optional) – The distance within which data points are considered to be neighbours. Must be a positive real number. Defaults to 0.5.
min_samples (int, optional) – The minimum number of neighbours a sample must have to be treated as a non-leaf node (a core point in the usual DBSCAN terminology). Must be a positive integer. Defaults to 5.
Note
The algorithm is very sensitive to changes in eps. One should fine-tune the value of eps for optimal results.
- fit(X)[source]
Fits the algorithm to the given data. Recursively finds clusters by connecting nearby samples into the same cluster.
- Parameters:
X (torch.Tensor of shape (n_samples, n_features)) – The input data, where each row is a sample and each column is a feature.
- predict()[source]
Applies the fitted DBScan model to the input data. Splits the training data into clusters.
- Returns:
The cluster corresponding to each sample. A label of -1 indicates that the algorithm considers that specific sample to be noise.
- Return type:
labels (torch.Tensor of shape (n_samples,))
- Raises:
NotFittedError – If the DBScan model has not been fitted before predicting.
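The roles of eps, min_samples and the -1 noise label can be seen in a compact plain-Python sketch of the general DBSCAN procedure. It is not DLL's implementation: it uses a queue instead of recursion, works on lists of tuples rather than torch.Tensor, and assumes that a sample counts as its own neighbour, which may differ from the library's convention. The function name dbscan is hypothetical.

```python
from collections import deque

# Illustration only: a minimal DBSCAN over a list of points.

def dbscan(samples, eps=0.5, min_samples=5):
    """Label each sample with a cluster index, or -1 for noise."""
    n = len(samples)

    def neighbours(i):
        # Assumption: a sample is included in its own neighbourhood.
        return [j for j in range(n)
                if sum((a - b) ** 2 for a, b in zip(samples[i], samples[j])) <= eps ** 2]

    labels = [None] * n  # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisional noise; may be claimed by a cluster later
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a dense sample joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nearby = neighbours(j)
            if len(nearby) >= min_samples:
                queue.extend(nearby)  # j is itself dense, so keep expanding from it
    return labels


samples = [(0.0, 0.0), (0.0, 0.1), (0.1, 0.0),
           (5.0, 5.0), (5.0, 5.1), (5.1, 5.0),
           (20.0, 20.0)]
labels = dbscan(samples, eps=0.5, min_samples=3)
print(labels)
```

The isolated point at (20, 20) has too few neighbours within eps and cannot be reached from any dense sample, so it keeps the label -1.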