kmeans
Classifies input data into k clusters using the K-Means++ algorithm based on Euclidean distance.
Command properties
| Property | Value |
|---|---|
| Command type | Transforming |
| Required permission | None |
| License usage | N/A |
| Parallel execution | Not supported |
| Distributed execution | Not supported |
Syntax
Options
- `k=INT`: Number of clusters. The maximum value is 100. (Default: 3)
- `iter=INT`: Number of calculation iterations. (Default: 10000)
Target
- `FIELD, ...`: List of fields to use for clustering. Separate multiple fields with commas (`,`). Field values must be numeric. Records that contain a non-numeric value in any of the specified fields are excluded.
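
The exclusion rule above can be pictured with a small Python sketch. It is not the product's implementation: the function name `select_valid_records` is hypothetical, and treating a missing field or a boolean as non-numeric is an assumption made here for illustration.

```python
def select_valid_records(records, fields):
    """Keep only records whose value for every target field is numeric."""
    valid = []
    for record in records:
        values = [record.get(name) for name in fields]
        # Assumption: missing fields (None) and booleans count as non-numeric.
        if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
            valid.append(record)
    return valid


rows = [
    {"sepal_length": 5.1, "sepal_width": 3.5},    # kept
    {"sepal_length": "5.1", "sepal_width": 3.5},  # string value, excluded
    {"sepal_length": 4.9},                        # missing field, excluded
]
kept = select_valid_records(rows, ["sepal_length", "sepal_width"])
print(kept)
```

Values read from a CSV file typically arrive as strings, which is presumably why the iris examples on this page convert the fields with `double()` before calling `kmeans`.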
Output fields
| Field | Type | Description |
|---|---|---|
| _cluster | integer | Cluster number assigned to the record. Values range from 1 to k. |
Error codes
Parsing errors
| Error code | Message | Description |
|---|---|---|
| 40804 | 머신러닝 라이선스가 필요합니다. (A machine learning license is required.) | The machine learning license is not available |
| too-large-k | - | k value exceeds the maximum (100) |
| missing-kmeans-fields | - | No target field was specified |
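
As a rough illustration of the two argument-related parsing errors in the table, the following Python sketch checks the options before any data is read. The exception class, the function name, and the order of the checks are assumptions; only the error codes, the limit of 100, and the defaults come from this page.

```python
MAX_K = 100  # maximum stated in the Options section


class KmeansParseError(ValueError):
    """Hypothetical exception type carrying one of the documented error codes."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


def check_kmeans_args(fields, k=3, iterations=10000):
    """Validate the options; the order of the checks is an assumption."""
    if k > MAX_K:
        raise KmeansParseError("too-large-k")
    if not fields:
        raise KmeansParseError("missing-kmeans-fields")
    return {"k": k, "iter": iterations, "fields": list(fields)}


print(check_kmeans_args(["bytes", "pkts"], k=2))
```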
Runtime errors
N/A
Description
The kmeans command collects all input records, then runs the K-Means++ algorithm using the numeric values of the specified fields to assign each record to its nearest cluster. The classification result is output in the _cluster field as a cluster number starting from 1.
Records that contain a non-numeric value in any of the specified fields are excluded from clustering. Up to 100,000 input records can be processed. If the number of valid input records exceeds 100,000, the query is terminated.
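
The behavior described above can be sketched in Python as follows. This is a minimal illustration of textbook K-Means++ seeding followed by standard nearest-centroid iterations, not the command's actual code. The function name `kmeans_pp`, the `seed` parameter, the stop-on-no-change rule, and the exception raised at the record limit are all assumptions.

```python
import math
import random

MAX_RECORDS = 100_000  # input limit stated in the description


def kmeans_pp(points, k=3, iterations=10000, seed=0):
    """Return a 1-based cluster number for each point (illustrative sketch)."""
    if len(points) > MAX_RECORDS:
        raise RuntimeError("too many input records")  # the real query is terminated
    rng = random.Random(seed)

    # K-Means++ seeding: the first centroid is picked uniformly at random; each
    # further centroid is picked with probability proportional to the squared
    # Euclidean distance to the nearest centroid chosen so far.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(math.dist(p, c) ** 2 for c in centroids) for p in points]
        if sum(weights) == 0:
            centroids.append(rng.choice(points))  # all points coincide with a centroid
        else:
            centroids.append(rng.choices(points, weights=weights, k=1)[0])

    labels = None
    for _ in range(max(1, iterations)):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))

    return [label + 1 for label in labels]  # cluster numbers start from 1


sample = [(0.0, 0.0), (0.0, 1.0), (100.0, 100.0), (100.0, 101.0)]
print(kmeans_pp(sample, k=2))
```

Which group receives cluster number 1 depends on the random seeding, so the numbers identify groups but carry no ordering.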
Examples
- Classify iris data into 3 clusters

      csvfile /opt/logpresso/iris.csv | eval sepal_length = double(sepal_length), sepal_width = double(sepal_width) | kmeans sepal_length, sepal_width

  Classifies records using the `sepal_length` and `sepal_width` fields with the default number of clusters (3).

- Specify the number of clusters and iterations

      csvfile /opt/logpresso/iris.csv | eval sepal_length = double(sepal_length), sepal_width = double(sepal_width) | kmeans k=4 iter=100000 sepal_length, sepal_width

  Classifies into 4 clusters with up to 100,000 iterations for improved convergence accuracy.

- Cluster network traffic data

      json "[{'src_ip': '192.0.2.1', 'bytes': 1024, 'pkts': 10}, {'src_ip': '192.0.2.2', 'bytes': 52000, 'pkts': 300}, {'src_ip': '192.0.2.3', 'bytes': 980, 'pkts': 8}, {'src_ip': '198.51.100.1', 'bytes': 48000, 'pkts': 280}]" | kmeans k=2 bytes, pkts

  Classifies traffic patterns into 2 clusters based on the `bytes` and `pkts` fields.
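
To see why the four traffic records separate naturally into two groups, the following Python sketch computes Euclidean distances between their (`bytes`, `pkts`) points. It only illustrates the distance measure; the helper `nearest_neighbor` is hypothetical and no part of the command itself.

```python
import math

# (bytes, pkts) per source IP, copied from the traffic example above.
traffic = {
    "192.0.2.1": (1024, 10),
    "192.0.2.2": (52000, 300),
    "192.0.2.3": (980, 8),
    "198.51.100.1": (48000, 280),
}


def nearest_neighbor(ip):
    """Return the other source IP whose (bytes, pkts) point is closest."""
    others = (other for other in traffic if other != ip)
    return min(others, key=lambda other: math.dist(traffic[ip], traffic[other]))


for ip in traffic:
    print(ip, "->", nearest_neighbor(ip))
```

The two low-volume sources are each other's nearest neighbors, and so are the two high-volume sources. Because `bytes` spans a far larger range than `pkts`, it dominates the Euclidean distance in this data.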