Outlier detection

IsolationForest

class DLL.MachineLearning.UnsupervisedLearning.OutlierDetection.IsolationForest(n_trees=10, max_depth=25, min_samples_split=2, bootstrap=False, threshold=4)[source]

Bases: object

IsolationForest implements an algorithm that detects outliers by fitting many isolation trees to the data.

Parameters:
  • n_trees (int, optional) – The number of trees used for prediction. Defaults to 10. Must be a positive integer.

  • max_depth (int, optional) – The maximum depth of each tree. Defaults to 25. Must be a positive integer.

  • min_samples_split (int, optional) – The minimum number of samples a node must contain to be split. Defaults to 2. Must be a positive integer.

  • bootstrap (bool, optional) – Determines if the samples for fitting are bootstrapped from the given data. Must be a boolean. Defaults to False.

  • threshold (int | float, optional) – Determines how many standard deviations away from the mean score a datapoint must be to be considered an outlier. Must be a non-negative real number. Defaults to 4.
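The constraints listed above can be summarised as a small validation routine. The following is a minimal sketch in plain Python, not the library's own code: the helper name validate_params is hypothetical, the defaults are copied from the class signature, and the choice of exception types is an assumption, since the documentation does not say which errors the constructor raises.

```python
def validate_params(n_trees=10, max_depth=25, min_samples_split=2,
                    bootstrap=False, threshold=4):
    """Hypothetical helper: check arguments against the documented constraints."""
    positive_ints = (
        ("n_trees", n_trees),
        ("max_depth", max_depth),
        ("min_samples_split", min_samples_split),
    )
    for name, value in positive_ints:
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 1:
            raise ValueError(f"{name} must be a positive integer.")

    if not isinstance(bootstrap, bool):
        raise TypeError("bootstrap must be a boolean.")

    if (isinstance(threshold, bool)
            or not isinstance(threshold, (int, float))
            or threshold < 0):
        raise ValueError("threshold must be a non-negative real number.")

    return {
        "n_trees": n_trees,
        "max_depth": max_depth,
        "min_samples_split": min_samples_split,
        "bootstrap": bootstrap,
        "threshold": threshold,
    }


# The defaults pass validation unchanged.
params = validate_params()
```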

fit(X)[source]

Fits the IsolationForest model to the input data by generating trees that split the data randomly.

Parameters:

X (torch.Tensor of shape (n_samples, n_features)) – The input data, where each row is a sample and each column is a feature.

Returns:

None

Raises:
  • TypeError – If the input matrix is not a PyTorch tensor.

  • ValueError – If the input matrix is not the correct shape.
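To illustrate what fitting by random splits means, here is a minimal pure-Python sketch of a single isolation tree. It is not the library's implementation: the functions build_tree and path_length, the dictionary-based node layout, and the use of nested lists instead of a torch.Tensor are all assumptions made for the example. The idea shown is the general isolation-tree one: a point far from the rest of the data tends to be separated after fewer random splits.

```python
import random


def build_tree(rows, rng, depth=0, max_depth=25, min_samples_split=2):
    """Recursively split rows on a random feature at a random value."""
    if depth >= max_depth or len(rows) < min_samples_split:
        return {"size": len(rows)}  # leaf

    feature = rng.randrange(len(rows[0]))
    low = min(r[feature] for r in rows)
    high = max(r[feature] for r in rows)
    if low == high:
        return {"size": len(rows)}  # nothing to split on this feature

    split = rng.uniform(low, high)
    left = [r for r in rows if r[feature] < split]
    right = [r for r in rows if r[feature] >= split]
    if not left or not right:
        return {"size": len(rows)}  # degenerate split, stop here

    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, rng, depth + 1, max_depth, min_samples_split),
        "right": build_tree(right, rng, depth + 1, max_depth, min_samples_split),
    }


def path_length(node, row):
    """Count how many splits are passed before the row reaches a leaf."""
    depth = 0
    while "feature" in node:
        go_left = row[node["feature"]] < node["split"]
        node = node["left"] if go_left else node["right"]
        depth += 1
    return depth


rng = random.Random(0)
# 50 points clustered around the origin plus one distant point.
cluster = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
outlier = [10.0, 10.0]
data = cluster + [outlier]

# Averaging over several trees mirrors the role of n_trees.
trees = [build_tree(data, rng) for _ in range(50)]
outlier_avg = sum(path_length(t, outlier) for t in trees) / len(trees)
inlier_avg = sum(path_length(t, cluster[0]) for t in trees) / len(trees)
```

In this sketch the distant point ends up with a clearly shorter average path than a point inside the cluster, which is the signal an isolation-based score relies on. How the library converts path lengths into its scores is not stated in this documentation.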

fit_predict(X, return_scores=False)[source]

First fits the model to the input and then predicts which of the inputs are outliers.

Parameters:
  • X (torch.Tensor of shape (n_samples, n_features)) – The input data, where each row is a sample and each column is a feature.

  • return_scores (bool, optional) – Determines if the scores of each datapoint are returned. Defaults to False.

predict(X, return_scores=False)[source]

Predicts the outliers in the input by flagging datapoints whose scores lie at least threshold standard deviations away from the mean score.

Parameters:
  • X (torch.Tensor of shape (n_samples, n_features)) – The input data, where each row is a sample and each column is a feature.

  • return_scores (bool, optional) – Determines if the scores of each datapoint are returned. Defaults to False.
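The thresholding rule described for predict can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the library's code: the function flag_outliers is hypothetical, the deviation is treated as two-sided, the population standard deviation is used, and a score exactly at the threshold counts as an outlier. The documentation does not specify any of these details.

```python
from statistics import mean, pstdev


def flag_outliers(scores, threshold=4.0):
    """Flag scores lying at least `threshold` standard deviations from the mean."""
    mu = mean(scores)
    sigma = pstdev(scores)  # population standard deviation (assumption)
    if sigma == 0:
        # All scores are identical, so nothing can stand out.
        return [False] * len(scores)
    return [abs(s - mu) / sigma >= threshold for s in scores]


# 29 similar scores and one that differs strongly.
scores = [0.5] * 29 + [0.9]
flags = flag_outliers(scores, threshold=4.0)
```

Under the population standard deviation, a single value among n can be at most sqrt(n - 1) standard deviations from the mean, so with the default threshold of 4 this sketch cannot flag anything unless it is given at least 17 scores. Whether the same bound applies to the library depends on how it computes the standard deviation.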