Outlier detection
IsolationForest
- class DLL.MachineLearning.UnsupervisedLearning.OutlierDetection.IsolationForest(n_trees=10, max_depth=25, min_samples_split=2, bootstrap=False, threshold=4)
Bases: object
IsolationForest implements an algorithm that detects outliers by fitting many isolation trees to the data.
- Parameters:
n_trees (int, optional) – The number of trees used for prediction. Defaults to 10. Must be a positive integer.
max_depth (int, optional) – The maximum depth of each tree. Defaults to 25. Must be a positive integer.
min_samples_split (int, optional) – The minimum number of samples a node must contain for it to be split. Defaults to 2. Must be a positive integer.
bootstrap (bool, optional) – Determines if the samples for fitting are bootstrapped from the given data. Must be a boolean. Defaults to False.
threshold (int | float, optional) – Determines how many standard deviations from the mean score a datapoint's score must lie for the datapoint to be considered an outlier. Must be a non-negative real number. Defaults to 4.
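The bootstrap option can be illustrated with a short plain-Python sketch. It is not the library's implementation: the helper name draw_training_rows is hypothetical, and drawing a resample of the same size as the input is an assumption.

```python
import random


def draw_training_rows(rows, bootstrap, rng):
    """Return the rows a tree would be fitted on (hypothetical helper)."""
    if bootstrap:
        # Assumption: sample with replacement, keeping the original sample count.
        return rng.choices(rows, k=len(rows))
    # Without bootstrapping, every row is used exactly once.
    return list(rows)


rng = random.Random(0)
rows = [[0.1, 0.2], [0.3, 0.1], [5.0, 5.0], [0.2, 0.4]]
sample = draw_training_rows(rows, bootstrap=True, rng=rng)
unchanged = draw_training_rows(rows, bootstrap=False, rng=rng)
```

With bootstrap=True some rows may appear several times in the sample and others not at all, so the individual trees would see slightly different data.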
- fit(X)
Fits the IsolationForest model to the input data by generating trees, each of which splits the data randomly.
- Parameters:
X (torch.Tensor of shape (n_samples, n_features)) – The input data, where each row is a sample and each column is a feature.
- Returns:
None
- Raises:
TypeError – If the input matrix is not a PyTorch tensor.
ValueError – If the input matrix is not the correct shape.
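The random splitting performed during fitting can be sketched in plain Python. This follows the generic isolation forest idea and is not taken from the library's source: the helpers build_tree and path_length, the dictionary node layout, and the toy data are all assumptions, and the score here is simply the mean path length without any normalisation.

```python
import random


def build_tree(rows, depth, max_depth, min_samples_split, rng):
    """Recursively split rows on a random feature at a random value."""
    # Stop when the node is too small to split or the depth limit is reached.
    if len(rows) < min_samples_split or depth >= max_depth:
        return {"size": len(rows)}
    feature = rng.randrange(len(rows[0]))
    lo = min(r[feature] for r in rows)
    hi = max(r[feature] for r in rows)
    if lo == hi:
        return {"size": len(rows)}
    value = rng.uniform(lo, hi)
    left = [r for r in rows if r[feature] < value]
    right = [r for r in rows if r[feature] >= value]
    if not left or not right:
        return {"size": len(rows)}
    return {
        "feature": feature,
        "value": value,
        "left": build_tree(left, depth + 1, max_depth, min_samples_split, rng),
        "right": build_tree(right, depth + 1, max_depth, min_samples_split, rng),
    }


def path_length(node, row):
    """Count the splits needed to reach the leaf that contains row."""
    depth = 0
    while "feature" in node:
        branch = "left" if row[node["feature"]] < node["value"] else "right"
        node = node[branch]
        depth += 1
    return depth


rng = random.Random(0)
inliers = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(60)]
outlier = [10.0, 10.0]
data = inliers + [outlier]

# Values mirror the documented defaults: n_trees=10, max_depth=25, min_samples_split=2.
trees = [build_tree(data, 0, 25, 2, rng) for _ in range(10)]


def mean_depth(row):
    return sum(path_length(t, row) for t in trees) / len(trees)


outlier_depth = mean_depth(outlier)
inlier_depth = sum(mean_depth(r) for r in inliers) / len(inliers)
```

A point far from the bulk of the data tends to be separated after only a few random splits, so its mean path length is short compared with that of typical points.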
- fit_predict(X, return_scores=False)
First fits the model to the input and then predicts which of the inputs are outliers.
- Parameters:
X (torch.Tensor of shape (n_samples, n_features)) – The input data, where each row is a sample and each column is a feature.
return_scores (bool, optional) – Determines if the scores of each datapoint are returned. Defaults to False.
- predict(X, return_scores=False)
Predicts the outliers in the input by flagging datapoints whose scores lie at least threshold standard deviations away from the mean score.
- Parameters:
X (torch.Tensor of shape (n_samples, n_features)) – The input data, where each row is a sample and each column is a feature.
return_scores (bool, optional) – Determines if the scores of each datapoint are returned. Defaults to False.
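The thresholding rule used by predict can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's code: the helper name flag_outliers, the use of the population standard deviation, the two-sided comparison, and the example scores are all hypothetical.

```python
import statistics


def flag_outliers(scores, threshold=4.0):
    """Flag scores lying at least `threshold` standard deviations from the mean."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)  # assumption: population standard deviation
    if std == 0:
        # All scores are identical, so nothing stands out.
        return [False] * len(scores)
    # Assumption: deviations in either direction count.
    return [abs(s - mean) >= threshold * std for s in scores]


# Thirty identical scores and one that is clearly different.
scores = [0.50] * 30 + [0.95]
flags = flag_outliers(scores, threshold=4)
```

With a threshold as large as the default of 4, a single point can only be flagged when the dataset is reasonably large, because one extreme value among n points can deviate by at most about sqrt(n - 1) population standard deviations.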