rforest

Uses a random forest model (an ensemble of randomly constructed decision trees) to predict classification values for input records.

Command properties

Item                   Description
Command type           Processing query
Required permission    None
License usage          N/A
Parallel execution     Not supported
Distributed execution  Runs on Control Node (reducer)

Syntax

To predict using a pre-trained model:

rforest [size=INT] model=STR

To train on data from a subquery and then predict:

rforest [size=INT] [timeout=INT{s|m|h|d}] target=STR FIELD, ... [ SUBQUERY ]

Options

size=INT
Number of decision trees in the forest (default: 100)
model=STR
Name of the random forest model. Machine learning models can be created and trained through the Logpresso Sonar web console or shell.
target=STR
Target field name. Specifies the field to use as the prediction target in random forest classification.
timeout=INT{s|m|h|d}
Subquery execution time limit. If the subquery does not complete within the specified time, it is cancelled.
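
To make the timeout=INT{s|m|h|d} format concrete, the following Python sketch converts such a value into seconds. It is only an illustration and not Logpresso's parser: parse_timeout and UNIT_SECONDS are hypothetical names, and reading the suffixes as seconds, minutes, hours, and days is an assumption based on the 30s example on this page.

```python
import re

# Assumed meaning of each suffix in timeout=INT{s|m|h|d}, expressed in seconds.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_timeout(value: str) -> int:
    """Convert a value such as '30s' or '2h' into seconds (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)([smhd])", value)
    if match is None:
        raise ValueError(f"invalid timeout: {value!r}")
    amount, unit = match.groups()
    return int(amount) * UNIT_SECONDS[unit]
```

Under these assumptions, parse_timeout("30s") yields 30, and a value without a unit suffix, such as "30", is rejected.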

Target

FIELD, ...
List of feature fields to use for random forest analysis, separated by commas (,). Field values can be numbers, dates, IP addresses, or strings. String values are internally encoded for processing.
[ SUBQUERY ]
Subquery that retrieves training data. The random forest model is built using the subquery results, then predictions are made for input records.

Output fields

Field           Type    Description
_guess          string  Predicted classification value of the target field
_rforest_error  string  In subquery mode, contains the error message if an error occurs during training

Error codes

Parse errors

Error code  Message                                              Description
40804       A machine learning license is required.              No machine learning license
41100       Enter the machine learning model.                    The model option value is empty
41101       The machine learning model cannot be found.          A model with the specified name does not exist
41102       The machine learning model handler cannot be found.  The model handler cannot be found
90204       [ is not paired.                                     Subquery brackets are not properly paired
90206       There is no subquery.                                No subquery is specified and no model option is given

Runtime errors

N/A

Description

The rforest command uses the random forest algorithm to predict the target field value for input records. Random forest is an ensemble learning method that randomly constructs multiple decision trees and combines each tree's prediction to produce the final classification.
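
The idea of combining many randomized trees by vote can be sketched in Python as follows. This is a conceptual illustration only, not the implementation behind rforest: one-split trees (stumps) stand in for full decision trees, each stump picks its feature and threshold at random, and all function names are hypothetical.

```python
import random
from collections import Counter


def train_stump(rows, labels, rng):
    """Fit a one-split tree on a randomly chosen feature and threshold."""
    feature = rng.randrange(len(rows[0]))
    threshold = rng.choice(rows)[feature]
    left = [y for x, y in zip(rows, labels) if x[feature] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[feature] > threshold]
    # An empty side falls back to the most common label in the sample.
    fallback = Counter(labels).most_common(1)[0][0]
    left_label = Counter(left).most_common(1)[0][0] if left else fallback
    right_label = Counter(right).most_common(1)[0][0] if right else fallback
    return feature, threshold, left_label, right_label


def train_forest(rows, labels, size=100, seed=0):
    """Train `size` stumps, each on its own bootstrap sample of the data."""
    rng = random.Random(seed)
    forest = []
    for _ in range(size):
        # Bootstrap sample: draw len(rows) records with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        forest.append(train_stump(sample_rows, sample_labels, rng))
    return forest


def predict(forest, row):
    """Let every stump vote and return the label with the most votes."""
    votes = Counter()
    for feature, threshold, left_label, right_label in forest:
        votes[left_label if row[feature] <= threshold else right_label] += 1
    return votes.most_common(1)[0][0]
```

The size parameter plays the same role here as the size option of the command: it sets how many trees take part in the vote. The final label corresponds to what rforest writes to the _guess field.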

Two modes are available:

  • Pre-trained model: Specify a pre-trained model with the model option. Input records are passed directly to the model and prediction results are assigned to the _guess field.
  • Subquery training: Specify the target option and a subquery. The random forest model is first trained on the subquery results, then predictions are made for input records.

Null feature field values are treated as missing values. String values are internally encoded as integers for training and prediction.
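
One way to picture the string encoding and null handling is the Python sketch below. The actual encoding scheme used by rforest is not documented here, so the details are assumptions: codes are assigned in first-seen order, and values not seen during training are treated like missing values. fit_encoder and encode are hypothetical names.

```python
def fit_encoder(values):
    """Assign each distinct string an integer code, in first-seen order."""
    codes = {}
    for value in values:
        if value is not None and value not in codes:
            codes[value] = len(codes)
    return codes


def encode(codes, value):
    """Return the integer code, or None for a null or unseen value."""
    if value is None:
        return None  # null is kept as a missing value
    return codes.get(value)  # unseen strings also map to None (assumption)
```

For example, fitting on the Sex values "male", "female", null, "male" produces two codes, and the null entry stays missing rather than receiving a code of its own.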

In a distributed environment, model training and prediction are performed on the Control Node.

Examples

  1. Predicting with a pre-trained model

    table duration=1d test_data
    | rforest model=rforest_titanic
    

    Uses the pre-trained rforest_titanic model to predict classification values for input data.

  2. Predicting with a pre-trained model using a specified tree count

    table duration=1d test_data
    | rforest size=200 model=rforest_titanic
    

    Predicts with the rforest_titanic model, with the forest size set to 200 decision trees.

  3. Training and predicting with a subquery

    csvfile /opt/logpresso/titanic_test.csv
    | rforest target=Survived Pclass, Sex, Age, Fare, Embarked [
        csvfile /opt/logpresso/titanic_train.csv
        | eval Age = double(Age), Fare = double(Fare)
      ]
    

    Trains the model on data from titanic_train.csv and predicts the Survived field value for data in titanic_test.csv.

  4. Training with a subquery using a timeout

    table duration=1d test_data
    | rforest timeout=30s target=category feature1, feature2, feature3 [
        table duration=30d training_data
      ]
    

    Limits the subquery that retrieves training data to 30 seconds of execution time, then predicts the category field value. If the subquery does not finish within 30 seconds, it is cancelled.