Clustering

KMeans

class DLL.MachineLearning.UnsupervisedLearning.Clustering.KMeansClustering(k=3, max_iters=100, init='kmeans++', n_init=10, tol=1e-05)

Bases: object

KMeansClustering implements the K-Means clustering algorithm, which partitions data points into k clusters.

Parameters:
  • k (int, optional) – The number of clusters. Defaults to 3. Must be a positive integer.

  • max_iters (int, optional) – The maximum number of iterations for training the model. Defaults to 100. Must be a positive integer.

  • init ({kmeans++, random}, optional) – The method for initialising the centroids. Defaults to kmeans++.

  • n_init (int, optional) – The number of different centroid initialisations to try. Defaults to 10. Must be a positive integer.
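The kmeans++ option refers to the k-means++ seeding scheme, which spreads the initial centroids out by sampling each new centroid with probability proportional to its squared distance from the centroids already chosen. The following is a minimal sketch of that scheme in plain Python (lists of tuples rather than torch tensors, so it runs without dependencies). The helper names squared_distance and kmeans_pp_init are hypothetical, and KMeansClustering may implement the seeding differently.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centroids using k-means++ style seeding (illustrative only)."""
    rng = random.Random(seed)
    # The first centroid is drawn uniformly at random.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each point to its closest already-chosen centroid.
        d2 = [min(squared_distance(p, c) for c in centroids) for p in points]
        # Points far from every chosen centroid are more likely to be picked;
        # already-chosen points have weight 0 and cannot be picked again.
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0), (10.1, 0.0)]
centroids = kmeans_pp_init(points, k=3)
```

With init set to random, the assumption would be that all k centroids are instead drawn uniformly, which tends to need more restarts (n_init) to reach a good solution.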

centroids

The final chosen centroids.

Type:

torch.Tensor

inertia

The sum of the squared distances from each sample to its nearest centroid.

Type:

float
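As an illustration of this quantity, the sketch below computes the sum of squared distances to the nearest centroid for a small hand-made data set. It uses plain Python lists instead of torch tensors, and the function name compute_inertia is hypothetical rather than part of the library.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def compute_inertia(points, centroids):
    """Sum over all points of the squared distance to the closest centroid."""
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)


points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
centroids = [(1.0, 0.0), (11.0, 0.0)]

# Every point lies exactly 1 unit from its nearest centroid, so the total is 4.0.
total = compute_inertia(points, centroids)
```

A lower value means the samples sit closer to their assigned centroids.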

fit(X)

Fits the KMeansClustering model to the input data by finding the best centroids.

Parameters:

X (torch.Tensor of shape (n_samples, n_features)) – The input data, where each row is a sample and each column is a feature. The number of samples must be at least k.

Returns:

None

Raises:
  • TypeError – If the input matrix is not a PyTorch tensor.

  • ValueError – If the input matrix is not the correct shape.

predict(X)

Applies the fitted KMeansClustering model to the input data, partitioning it into k clusters.

Parameters:

X (torch.Tensor of shape (n_samples, n_features)) – The input data to be partitioned.

Returns:

The cluster corresponding to each sample.

Return type:

labels (torch.Tensor of shape (n_samples,))

Raises:
  • NotFittedError – If the KMeansClustering model has not been fitted before predicting.

  • TypeError – If the input matrix is not a PyTorch tensor.

  • ValueError – If the input matrix is not the correct shape.
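In K-Means, assigning a sample to a cluster conventionally means picking the index of its nearest centroid. The sketch below shows that assignment step in plain Python, assuming Euclidean distance. The function name assign_labels is hypothetical, and the sketch is not taken from the KMeansClustering source.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_labels(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    labels = []
    for p in points:
        distances = [squared_distance(p, c) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels


centroids = [(0.0, 0.0), (10.0, 10.0)]
new_points = [(1.0, 1.0), (9.0, 8.0), (0.0, -2.0)]
labels = assign_labels(new_points, centroids)
```

Because only the centroids are needed, new samples that were not seen during fitting can be labelled in the same way.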

Gaussian mixture models

class DLL.MachineLearning.UnsupervisedLearning.Clustering.GaussianMixture(k=3, max_iters=10, tol=1e-05)

Bases: object

Gaussian mixture model. Fits k Gaussian distributions to the data using maximum likelihood estimation.

Parameters:
  • k (int, optional) – The number of Gaussian distributions (clusters). Must be a positive integer. Defaults to 3.

  • max_iters (int, optional) – The maximum number of iterations. Must be a positive integer. Defaults to 10.

fit(X, verbose=False)

Fits the k Gaussian distributions to the data using maximum likelihood estimation.

Parameters:
  • X (torch.Tensor of shape (n_samples, n_features)) – The input data to be clustered.

  • verbose (bool, optional) – Determines if the likelihood should be calculated and printed during training. Must be a boolean. Defaults to False.

predict(X)

Predicts the clusters of the data according to the fitted distributions.

Parameters:

X (torch.Tensor of shape (n_samples, n_features)) – The input data to be clustered.

Returns:

A tensor of cluster labels, one for each sample.

Return type:

torch.Tensor of shape (n_samples,)

predict_proba(X)

Predicts, for each sample, the probability of it belonging to each of the fitted distributions.

Parameters:

X (torch.Tensor of shape (n_samples, n_features)) – The input data to be clustered.

Returns:

A tensor containing, for each sample, the probability of it belonging to each of the k fitted distributions.

Return type:

torch.Tensor of shape (n_samples, k)
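For a Gaussian mixture, these per-cluster probabilities are usually the posterior probabilities obtained from Bayes' rule: the weighted density of each component divided by the sum over all components. The sketch below illustrates this for a one-dimensional mixture in plain Python. The function names, the mixture weights, and the choice of taking the most probable component as the label are assumptions for illustration, not a description of the GaussianMixture internals.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def mixture_proba(xs, weights, means, variances):
    """Posterior probability of each component for every sample (rows sum to 1)."""
    rows = []
    for x in xs:
        joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        rows.append([j / total for j in joint])
    return rows


def mixture_labels(xs, weights, means, variances):
    """Label each sample with its most probable component (ties go to the lowest index)."""
    return [row.index(max(row)) for row in mixture_proba(xs, weights, means, variances)]


# Hypothetical fitted parameters for k = 2 components.
weights, means, variances = [0.5, 0.5], [0.0, 10.0], [1.0, 1.0]
xs = [0.0, 5.0, 10.0]
probs = mixture_proba(xs, weights, means, variances)
labels = mixture_labels(xs, weights, means, variances)
```

The sample at 5.0 lies exactly between the two components, so both probabilities are 0.5 there, while the samples at 0.0 and 10.0 are assigned almost entirely to one component each.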

Spectral clustering

class DLL.MachineLearning.UnsupervisedLearning.Clustering.SpectralClustering(kernel=RBF(correlation_length=0.1), k=3, max_iters=100, normalise=True, use_kmeans=True, **kwargs)

Bases: object

SpectralClustering implements the spectral clustering algorithm, which partitions data points into k clusters.

Parameters:
  • kernel (Kernels, optional) – The similarity function for fitting the model. Defaults to RBF(correlation_length=0.1).

  • k (int, optional) – The number of clusters. Defaults to 3. Must be a positive integer.

  • max_iters (int, optional) – The maximum number of iterations for training the model. Defaults to 100. Must be a positive integer.

  • normalise (bool, optional) – Determines whether the Laplacian matrix is calculated as L = I - sqrt(inv(D)) A sqrt(inv(D)) (True) or as L = D - A (False). Defaults to True.

  • use_kmeans (bool, optional) – Determines whether the clustering in the embedded space is done using kmeans or discretisation. Defaults to True.

  • **kwargs – Other arguments are passed into the KMeansClustering algorithm.
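The two Laplacian formulas named under normalise can be illustrated as follows, where A is a symmetric similarity matrix and D is the diagonal matrix of its row sums. This is a minimal sketch in plain Python with nested lists. The function name laplacian is hypothetical, and whether the library includes self-similarities on the diagonal of A is not stated in this documentation.

```python
import math


def laplacian(A, normalise=True):
    """Graph Laplacian of a symmetric similarity matrix A (nested lists)."""
    n = len(A)
    # Diagonal of D: the degree of node i is the sum of row i of A.
    degrees = [sum(row) for row in A]
    if not normalise:
        # L = D - A
        return [[(degrees[i] if i == j else 0.0) - A[i][j] for j in range(n)]
                for i in range(n)]
    # L = I - sqrt(inv(D)) A sqrt(inv(D)); assumes every degree is positive.
    inv_sqrt = [1.0 / math.sqrt(d) for d in degrees]
    return [[(1.0 if i == j else 0.0) - inv_sqrt[i] * A[i][j] * inv_sqrt[j]
             for j in range(n)]
            for i in range(n)]


# A path graph on three nodes: 0 - 1 - 2.
A = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
L_unnormalised = laplacian(A, normalise=False)
L_normalised = laplacian(A, normalise=True)
```

In the unnormalised case every row sums to zero, whereas the normalised form rescales each entry by the degrees of the two nodes involved, which makes it less sensitive to nodes with very different total similarity.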

Note

The result depends heavily on the chosen kernel function. In particular, the correlation_length parameter should be fine-tuned for optimal performance.
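To see why the length scale matters so much, the sketch below evaluates a radial basis function similarity for two points one unit apart. It assumes the common form exp(-||x - y||^2 / (2 * correlation_length^2)); the exact parameterisation of the library's RBF kernel is not given here and may differ. The function name rbf_similarity is hypothetical.

```python
import math


def rbf_similarity(x, y, correlation_length):
    """RBF similarity under the assumed form exp(-||x - y||^2 / (2 * l^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2 * correlation_length ** 2))


x, y = (0.0, 0.0), (1.0, 0.0)

# With a short length scale the two points look completely unrelated ...
short = rbf_similarity(x, y, correlation_length=0.1)
# ... while a longer one gives them a substantial similarity.
long = rbf_similarity(x, y, correlation_length=1.0)
```

If the length scale is much smaller than the typical distance between samples, the similarity matrix is close to the identity and carries little cluster structure. If it is much larger, all samples look alike.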

fit(X)

Fits the algorithm to the given data. Transforms the data into the embedding space using the kernel function and clusters the data in the space.

Parameters:

X (torch.Tensor of shape (n_samples, n_features)) – The input data, where each row is a sample and each column is a feature. The number of samples must be at least k.

predict()

Applies the fitted SpectralClustering model to the data passed to fit, partitioning it into k clusters.

Returns:

The cluster corresponding to each sample.

Return type:

labels (torch.Tensor of shape (n_samples,))

Raises:

NotFittedError – If the SpectralClustering model has not been fitted before predicting.

Density based clustering

class DLL.MachineLearning.UnsupervisedLearning.Clustering.DBScan(eps=0.5, min_samples=5)

Bases: object

Density-based spatial clustering of applications with noise (DBSCAN) algorithm.

Parameters:
  • eps (float | int, optional) – The maximum distance within which two data points are considered to be neighbours. Must be a positive real number. Defaults to 0.5.

  • min_samples (int, optional) – The minimum number of neighbours for a non-leaf node. Must be a positive integer. Defaults to 5.

Note

The algorithm is very sensitive to changes in eps. One should fine-tune the value of eps for optimal results.

fit(X)

Fits the algorithm to the given data. Recursively finds clusters by connecting nearby samples into the same cluster.

Parameters:

X (torch.Tensor of shape (n_samples, n_features)) – The input data, where each row is a sample and each column is a feature.
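The cluster-growing idea can be sketched as follows. This is a simplified DBSCAN in plain Python, written with an explicit stack instead of recursion. It assumes Euclidean distance and counts a point as its own neighbour, which may differ from the DBScan class; the function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dbscan_sketch(points, eps, min_samples):
    """Label each point with a cluster index, or -1 for noise (illustrative only)."""
    n = len(points)
    # Indices of all points within eps of point i (including i itself).
    neighbours = [[j for j in range(n)
                   if squared_distance(points[i], points[j]) <= eps ** 2]
                  for i in range(n)]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        # Start a new cluster only from an unlabelled point with enough neighbours.
        if labels[i] != -1 or len(neighbours[i]) < min_samples:
            continue
        labels[i] = cluster
        stack = list(neighbours[i])
        while stack:
            j = stack.pop()
            if labels[j] != -1:
                continue
            labels[j] = cluster
            # Only points with enough neighbours keep expanding the cluster.
            if len(neighbours[j]) >= min_samples:
                stack.extend(neighbours[j])
        cluster += 1
    return labels


points = [(0.0, 0.0), (0.0, 0.3), (0.3, 0.0),
          (5.0, 5.0), (5.0, 5.3), (5.3, 5.0),
          (20.0, 20.0)]
labels = dbscan_sketch(points, eps=0.5, min_samples=3)
labels_small_eps = dbscan_sketch(points, eps=0.2, min_samples=3)
```

With eps=0.5 the two tight groups become clusters 0 and 1 and the isolated point is labelled -1. With eps=0.2 no point has enough neighbours, so every sample is treated as noise, which illustrates how sensitive the result is to eps.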

predict()

Applies the fitted DBScan model to the data passed to fit, splitting the training data into clusters.

Returns:

The cluster corresponding to each sample. A label of -1 indicates that the algorithm considers that specific sample to be noise.

Return type:

labels (torch.Tensor of shape (n_samples,))

Raises:

NotFittedError – If the DBScan model has not been fitted before predicting.