kmeans

Classifies input data into k clusters using the K-Means++ algorithm based on Euclidean distance.

Command properties

| Property | Value |
| --- | --- |
| Command type | Transforming |
| Required permission | None |
| License usage | N/A |
| Parallel execution | Not supported |
| Distributed execution | Not supported |

Syntax

kmeans [k=INT] [iter=INT] FIELD, ...

Options

k=INT
Number of clusters. The maximum value is 100. (Default: 3)
iter=INT
Maximum number of iterations the algorithm may run. (Default: 10000)

Target

FIELD, ...
List of fields to use for clustering. Separate multiple fields with commas (,). Field values must be numeric. Records that contain a non-numeric value in any of the specified fields are excluded.

Output fields

| Field | Type | Description |
| --- | --- | --- |
| _cluster | integer | Cluster number assigned to the record. Values range from 1 to k. |

Error codes

Parsing errors

| Error code | Message | Description |
| --- | --- | --- |
| 40804 | 머신러닝 라이선스가 필요합니다. (A machine learning license is required.) | Machine learning license is not available |
| too-large-k | - | k value exceeds the maximum (100) |
| missing-kmeans-fields | - | No target field was specified |

Runtime errors

N/A

Description

The kmeans command collects all input records, then runs the K-Means++ algorithm using the numeric values of the specified fields to assign each record to its nearest cluster. The classification result is output in the _cluster field as a cluster number starting from 1.
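The clustering step described above can be illustrated with a minimal Python sketch. It is not the command's actual implementation: the function name `kmeans_pp`, its parameters, and the `seed` argument are hypothetical, and only the defaults (`k=3`, 10000 iterations) and the 1-based cluster numbers are taken from this page.

```python
import random


def kmeans_pp(points, k=3, max_iter=10000, seed=0):
    """Return a 1-based cluster number for each point (illustrative sketch only)."""
    rng = random.Random(seed)  # hypothetical: fixed seed to keep the sketch repeatable

    def dist2(a, b):
        # Squared Euclidean distance between two equal-length vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def nearest(p, centers):
        # Index of the center closest to point p.
        return min(range(len(centers)), key=lambda j: dist2(p, centers[j]))

    # K-Means++ seeding: the first center is picked uniformly at random; each
    # later center is drawn with probability proportional to the squared
    # distance from a point to its nearest already-chosen center.
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        weights = [min(dist2(p, c) for c in centers) for p in points]
        if sum(weights) == 0:
            # Every point coincides with a chosen center; fall back to any point.
            centers.append(list(rng.choice(points)))
        else:
            centers.append(list(rng.choices(points, weights=weights, k=1)[0]))

    labels = [nearest(p, centers) for p in points]
    for _ in range(max_iter):
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
        # Assignment step: attach each point to its nearest center.
        new_labels = [nearest(p, centers) for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the run has converged
        labels = new_labels

    # Report cluster numbers starting from 1, like the _cluster field.
    return [label + 1 for label in labels]
```

For example, `kmeans_pp([(1.0, 1.0), (1.2, 0.9), (8.0, 8.1), (7.9, 8.3)], k=2)` groups the first two points together and the last two together. Which group receives number 1 depends on the random seeding.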

Records that contain a non-numeric value in any of the specified fields are excluded from clustering. Up to 100,000 input records can be processed. If the number of valid input records exceeds 100,000, the query is terminated.
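The exclusion rule and the record limit can be pictured with the following Python sketch. It is an illustration under stated assumptions, not the command's code: `select_valid_records` and `MAX_RECORDS` are hypothetical names, and treating strings (even numeric-looking ones) and missing fields as non-numeric is an assumption rather than documented behavior.

```python
from numbers import Real

MAX_RECORDS = 100_000  # limit on valid input records stated in this section


def select_valid_records(records, fields, limit=MAX_RECORDS):
    """Keep records whose listed fields are all numeric; fail if too many remain."""

    def is_numeric(value):
        # Assumption: booleans, strings, and missing values are not numeric.
        return isinstance(value, Real) and not isinstance(value, bool)

    valid = [r for r in records if all(is_numeric(r.get(f)) for f in fields)]
    if len(valid) > limit:
        # Stands in for the query being terminated.
        raise RuntimeError(f"{len(valid)} valid records exceed the limit of {limit}")
    return valid
```

Under these assumptions, a record such as `{"bytes": "n/a", "pkts": 3}` is dropped before clustering, while `{"bytes": 2, "pkts": 1.0}` is kept.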

Examples

  1. Classify iris data into 3 clusters

    csvfile /opt/logpresso/iris.csv
    | eval sepal_length = double(sepal_length), sepal_width = double(sepal_width)
    | kmeans sepal_length, sepal_width
    

    Classifies records using the sepal_length and sepal_width fields with the default number of clusters (3).

  2. Specify the number of clusters and iterations

    csvfile /opt/logpresso/iris.csv
    | eval sepal_length = double(sepal_length), sepal_width = double(sepal_width)
    | kmeans k=4 iter=100000 sepal_length, sepal_width
    

    Classifies records into 4 clusters and raises the iteration limit from the default 10,000 to 100,000, giving the algorithm more iterations in which to converge.

  3. Cluster network traffic data

    json "[{'src_ip': '192.0.2.1', 'bytes': 1024, 'pkts': 10}, {'src_ip': '192.0.2.2', 'bytes': 52000, 'pkts': 300}, {'src_ip': '192.0.2.3', 'bytes': 980, 'pkts': 8}, {'src_ip': '198.51.100.1', 'bytes': 48000, 'pkts': 280}]"
    | kmeans k=2 bytes, pkts
    

    Classifies traffic patterns into 2 clusters based on the bytes and pkts fields.
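To see why these four records would be expected to split into a low-traffic pair and a high-traffic pair, the Euclidean distances between them can be computed directly. The Python sketch below only reproduces the distance arithmetic on the sample values from the query above; the `euclidean` helper and the `records` dictionary are illustrative names, and the sketch makes no claim about which cluster number the command assigns to each group.

```python
import math

# (bytes, pkts) values copied from the sample query, keyed by src_ip.
records = {
    "192.0.2.1": (1024, 10),
    "192.0.2.2": (52000, 300),
    "192.0.2.3": (980, 8),
    "198.51.100.1": (48000, 280),
}


def euclidean(a, b):
    # Straight-line distance between two equal-length numeric vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


low_pair = euclidean(records["192.0.2.1"], records["192.0.2.3"])
high_pair = euclidean(records["192.0.2.2"], records["198.51.100.1"])
cross = euclidean(records["192.0.2.1"], records["192.0.2.2"])

print(f"192.0.2.1 vs 192.0.2.3:    {low_pair:.2f}")
print(f"192.0.2.2 vs 198.51.100.1: {high_pair:.2f}")
print(f"192.0.2.1 vs 192.0.2.2:    {cross:.2f}")
```

The two within-pair distances (about 44 and about 4,000) are far smaller than the distance across the pairs (about 51,000). Because the values are used as given, the bytes field contributes almost all of each distance in this data.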