lof

Calculates the Local Outlier Factor (LOF) of each record. The command first computes the Local Reachability Density (LRD) of each point from its k nearest neighbors, then takes the ratio of the neighbors' average LRD to the point's own LRD.
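To make the two steps concrete, here is a minimal brute-force sketch of the textbook LOF computation in Python. It is an illustration under stated assumptions, not the command's actual implementation: the function name lof_scores is hypothetical, distances are Euclidean, ties at the k-th neighbor are cut at exactly k points, and duplicate points (zero reachability distance) are not handled specially.

```python
import math


def lof_scores(points, k=10):
    """Return one LOF score per point (illustrative O(n^2) sketch)."""
    n = len(points)
    if n <= k:
        raise ValueError("the number of points must be greater than k")

    # Pairwise Euclidean distances.
    dist = [[math.dist(p, q) for q in points] for p in points]

    # k nearest neighbors of each point (excluding itself) and its k-distance.
    neighbors, k_dist = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        knn = order[:k]  # simplification: ties beyond the k-th point are dropped
        neighbors.append(knn)
        k_dist.append(dist[i][knn[-1]])

    # LRD: inverse of the mean reachability distance to the k neighbors,
    # where reach(i, j) = max(k-distance of j, distance from i to j).
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbors[i]]
        mean_reach = sum(reach) / k
        lrd.append(1.0 / mean_reach if mean_reach > 0 else math.inf)

    # LOF: mean LRD of the neighbors divided by the point's own LRD.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]


# Four points on a unit square plus one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = lof_scores(points, k=2)
```

In this toy input the four square corners have the same density as their neighbors, while the isolated point at (10, 10) has a much lower density than its neighbors and therefore scores well above 1.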

Syntax

lof [k=INT] FIELD, ... [by GRP_FIELD, ...]
Required Parameter
FIELD, ...
Fields that contain numeric data such as integers, real numbers, and dates. Use comma (,) as a separator.
Optional Parameter
k=INT

Number of nearest neighbors to use in the calculation (default: 10).

by GRP_FIELD, ...

Fields to group by, separated by commas (,). The by clause must come after FIELD, ....

When you use the by clause to score each group separately, every group must contain more records than the number of neighbors specified by k=INT. Otherwise, the LOF value in the _lof field is not calculated as intended for that group.
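Before relying on grouped scores, it can help to check the group sizes against k. The following Python sketch is one way to do that outside the query; the function name groups_too_small and the representation of records as dictionaries are assumptions made for illustration only.

```python
from collections import Counter


def groups_too_small(records, group_fields, k=10):
    """Return {group_key: record_count} for groups without more than k records.

    records      -- iterable of dicts, one per record (assumed representation)
    group_fields -- field names that would be passed to the by clause
    """
    counts = Counter(tuple(r.get(f) for f in group_fields) for r in records)
    return {key: n for key, n in counts.items() if n <= k}


# Hypothetical data: group "a" has 3 records, group "b" has only 1.
records = [{"species": "a"}] * 3 + [{"species": "b"}]
small = groups_too_small(records, ["species"], k=2)
```

Any group reported by this check has too few records for the chosen k, so either lower k or exclude that group before scoring.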

Description

The command writes the LOF score of each record to the _lof field. The score can be interpreted as follows:

  • Greater than 1 (LOF(k) > 1): The record is located outside the cluster. The further the value exceeds 1, the more likely the record is an anomaly.
  • Approximately 1 (LOF(k) ≈ 1): The record is located at the boundary of the cluster.
  • Less than 1 (LOF(k) < 1): The record is located inside the cluster.
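The three cases above can be sketched as a small Python helper. The name classify_lof and the tolerance parameter are assumptions: the command does not define how close to 1 counts as "approximately 1", so the 0.1 default here is arbitrary and should be tuned to your data.

```python
def classify_lof(score, tolerance=0.1):
    """Map an LOF score to the three cases described above.

    tolerance is a hypothetical band around 1; it is not part of the command.
    """
    if score > 1 + tolerance:
        return "outside"   # lower density than neighbors, possible anomaly
    if score < 1 - tolerance:
        return "inside"    # higher density than neighbors
    return "boundary"      # density comparable to neighbors


labels = [classify_lof(s) for s in (0.8, 1.02, 2.5)]
```

In practice a stricter cut-off, such as the _lof > 2 filter in the usage example, is often more useful for flagging anomalies than the band around 1.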

Usage

Score each record of the iris dataset using the sepal_length and sepal_width fields, then keep only the records whose LOF exceeds 2 (download: https://raw.githubusercontent.com/illinois-cse/data-fa14/gh-pages/data/iris.csv).

wget url="https://raw.githubusercontent.com/illinois-cse/data-fa14/gh-pages/data/iris.csv" 
| eval line = split(line, "\n") 
| explode line 
| split sep="," sepal_length,sepal_width,petal_length,petal_width,species
| eval sepal_length = double(sepal_length), sepal_width = double(sepal_width)
| lof sepal_length, sepal_width
| search _lof > 2