rforest
Uses a random forest model (an ensemble method that trains multiple randomized decision trees) to predict classification values for input data.
Command properties
| Item | Description |
|---|---|
| Command type | Processing query |
| Required permission | None |
| License usage | N/A |
| Parallel execution | Not supported |
| Distributed execution | Runs on Control Node (reducer) |
Syntax
To predict using a pre-trained model:
To train on data from a subquery and then predict:
Options
- size=INT: Number of decision trees in the forest (default: 100)
- model=STR: Name of the random forest model. Machine learning models can be created and trained through the Logpresso Sonar web console or shell.
- target=STR: Target field name. Specifies the field to use as the prediction target in random forest classification.
- timeout=INT{s|m|h|d}: Subquery execution time limit. If the subquery does not complete within the specified time, it is cancelled.
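The timeout notation combines an integer with a unit suffix (s, m, h, or d). As an illustration only, the Python sketch below converts such a value to seconds. The helper name parse_timeout is hypothetical and not part of Logpresso, and reading the suffixes as seconds, minutes, hours, and days is an assumption based on the notation.

```python
import re

# Seconds per unit suffix in the INT{s|m|h|d} notation
# (assumed meaning: seconds, minutes, hours, days).
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_timeout(value: str) -> int:
    """Convert a timeout such as '30s' or '2h' into seconds (hypothetical helper)."""
    match = re.fullmatch(r"([0-9]+)([smhd])", value)
    if match is None:
        raise ValueError(f"invalid timeout: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit]
```

Under these assumptions, a value without a unit suffix (for example 30) is rejected rather than silently interpreted.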
Target
- FIELD, ...: List of feature fields to use for random forest analysis, separated by commas (,). Field values can be numbers, dates, IP addresses, or strings. String values are internally encoded for processing.
- [ SUBQUERY ]: Subquery that retrieves training data. The random forest model is built using the subquery results, then predictions are made for input records.
Output fields
| Field | Type | Description |
|---|---|---|
| _guess | string | Predicted classification value of the target field |
| _rforest_error | string | In subquery mode, contains the error message if an error occurs during training |
Error codes
Parse errors
| Error code | Message | Description |
|---|---|---|
| 40804 | A machine learning license is required. | No machine learning license |
| 41100 | Enter the machine learning model. | The model option value is empty |
| 41101 | The machine learning model cannot be found. | A model with the specified name does not exist |
| 41102 | The machine learning model handler cannot be found. | The model handler cannot be found |
| 90204 | [ is not paired. | Subquery brackets are not properly paired |
| 90206 | There is no subquery. | No subquery is specified and no model option is given |
Runtime errors
N/A
Description
The rforest command uses the random forest algorithm to predict the target field value for input records. Random forest is an ensemble learning method that randomly constructs multiple decision trees and combines each tree's prediction to produce the final classification.
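The ensemble idea can be sketched in a few lines of Python. This is a toy illustration, not the implementation used by rforest: each "tree" is a single-split stump trained on a bootstrap sample with one randomly chosen numeric feature, and the final class is chosen by majority vote, which is one common way to combine tree predictions. All function names are hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    """Return the most frequent label (first seen wins ties)."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(rows, labels, rng):
    """Fit a one-split tree on one randomly chosen feature (toy stand-in for a full tree)."""
    feature = rng.randrange(len(rows[0]))
    fallback = majority(labels)
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
        right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
        left_label = majority(left) if left else fallback
        right_label = majority(right) if right else fallback
        # Score the split by how many training rows it classifies correctly.
        correct = left.count(left_label) + right.count(right_label)
        if best is None or correct > best[0]:
            best = (correct, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return feature, threshold, left_label, right_label


def train_forest(rows, labels, size=100, seed=0):
    """Train `size` stumps, each on a bootstrap sample of the training rows."""
    rng = random.Random(seed)
    forest = []
    for _ in range(size):
        # Bootstrap sample: draw len(rows) indices with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(train_stump([rows[i] for i in idx], [labels[i] for i in idx], rng))
    return forest


def predict(forest, row):
    """Combine the per-tree predictions by majority vote."""
    votes = Counter(
        left if row[feature] <= threshold else right
        for feature, threshold, left, right in forest
    )
    return votes.most_common(1)[0][0]
```

The size parameter in this sketch plays the same role as the size option of the command: it sets how many trees vote on each input record.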
Two modes are available:
- Pre-trained model: Specify a pre-trained model with the model option. Input records are passed directly to the model and prediction results are assigned to the _guess field.
- Subquery training: Specify the target option and a subquery. The random forest model is first trained on the subquery results, then predictions are made for input records.
Feature field values of null are treated as missing values. String values are internally encoded as integers for training and prediction.
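To make the encoding step concrete, the following Python sketch maps string values to integer codes and leaves null values as missing. The actual encoding scheme used by rforest is not documented here, so the first-appearance ordering, the handling of unseen values, and the function names are all assumptions made for illustration.

```python
def build_encoder(values):
    """Map each distinct non-null string to an integer code, in order of first appearance."""
    codes = {}
    for value in values:
        if value is not None and value not in codes:
            codes[value] = len(codes)
    return codes


def encode(value, codes):
    """Return the integer code for a value.

    None (null) stays None so it can be treated as a missing value.
    Values not seen when the encoder was built also map to None (an assumption).
    """
    if value is None:
        return None
    return codes.get(value)
```

The same code table has to be used for training and prediction, otherwise identical strings would be presented to the model as different integers.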
In a distributed environment, model training and prediction are performed on the Control Node.
Examples
- Predicting with a pre-trained model

  table duration=1d test_data | rforest model=rforest_titanic

  Uses the pre-trained rforest_titanic model to predict classification values for input data.

- Predicting with a pre-trained model using a specified tree count

  table duration=1d test_data | rforest size=200 model=rforest_titanic

  Uses a model composed of 200 decision trees to predict.

- Training and predicting with a subquery

  csvfile /opt/logpresso/titanic_test.csv | rforest target=Survived Pclass, Sex, Age, Fare, Embarked [ csvfile /opt/logpresso/titanic_train.csv | eval Age = double(Age), Fare = double(Fare) ]

  Trains the model on data from titanic_train.csv and predicts the Survived field value for data in titanic_test.csv.

- Training with a subquery using a timeout

  table duration=1d test_data | rforest timeout=30s target=category feature1, feature2, feature3 [ table duration=30d training_data ]

  Limits the subquery execution time to 30 seconds to retrieve training data and predicts the category field value.