ML Dataset

Overview

An ML dataset prepares the log data used to train a machine learning model before you create the model. Under Policies > ML Datasets, security analysts can define a dataset once and then manage the model's training input consistently.

In practice, you first explore the data you want to use in Analysis > Queries or Analysis > Pivots, keep only the fields you need, preprocess them as required, and then save the result as an ML dataset. You then select the saved dataset when training a machine learning model. If string fields are difficult to use directly for training, consider converting them into numeric vectors with a query command such as tfidf.
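To make the vectorization idea concrete, the following is a minimal Python sketch of textbook TF-IDF weighting (term frequency multiplied by log(N / document frequency)). It is an illustration only: this page does not specify the exact weighting used by the tfidf query command, and the function name tfidf_vectors and the sample messages are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) using textbook TF-IDF weighting.

    tf  = term count / number of tokens in the document
    idf = log(N / df), where df is the number of documents containing the term
    """
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty strings
        vectors.append([
            (counts[term] / total) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors


# Hypothetical string field values taken from log records.
messages = [
    "login failed for admin",
    "login succeeded for alice",
    "file deleted by admin",
]
vocab, vecs = tfidf_vectors(messages)
```

Each message becomes a numeric vector with one position per vocabulary term. Terms that appear in fewer messages receive larger weights, and terms absent from a message receive 0.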

All users, including administrators, can view the ML dataset list and its contents. Administrators and cluster administrators can add, edit, and delete ML datasets.

Considerations

Prepare ML datasets to reflect the data generated in your actual operating environment as closely as possible. Keep the following in mind when creating a dataset.

  • Include diverse information: Prepare data that covers the range of characteristics found in the real operating environment, such as time, location, user information, event type, and success/failure status.
  • Select target variables for supervised learning: Supervised learning models such as Random Forest require a target variable field to predict. When you create a dataset for supervised learning, include that field.
  • Preprocess data: Remove or correct erroneous data and extreme values. If value ranges differ greatly between fields, apply normalization or scaling.
  • Vectorize strings: Ordinary strings are difficult to train on directly, so convert them to numbers. If needed, vectorize them with a query command such as tfidf.
  • Data quality and quantity: Data must be accurate, complete, and consistent. Collect enough data that both normal and abnormal patterns are adequately represented.
  • Prevent overfitting: Consider techniques such as cross-validation and regularization so the model does not fit too closely to specific training data.
  • Reflect changing patterns: Security threat patterns change over time, so retrain models in operation periodically.
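
The preprocessing and overfitting considerations above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how the product implements them: the function names, the clipping bounds, and the sample values are all hypothetical, and in practice the equivalent work may be done in the query statement itself.

```python
def clip_outliers(values, lower, upper):
    """Clamp each value into [lower, upper] to limit the effect of extreme values."""
    return [min(max(v, lower), upper) for v in values]


def min_max_scale(values):
    """Rescale values linearly into [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def k_fold_indices(n_rows, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train, validation
        start += size


# Hypothetical field: bytes transferred per session, with one extreme value.
bytes_sent = [120, 300, 250, 9_000_000, 180]
clipped = clip_outliers(bytes_sent, lower=0, upper=1000)
scaled = min_max_scale(clipped)

# Each row is used for validation exactly once across the folds.
folds = list(k_fold_indices(n_rows=len(scaled), k=5))
```

Clipping before scaling keeps a single extreme value from compressing every other value toward 0. The fold splitter does not shuffle rows, so time-ordered log data may need shuffling or a time-aware split first.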

View/search ML datasets

You can view or search the ML dataset list under Policies > ML Datasets.

ML dataset list

  • Name: Name of the ML dataset. Click the name to go to the dataset detail screen.
  • Description: Description of the purpose or structure of the ML dataset.
  • Count: Number of data entries currently in the ML dataset.
  • Modified At: Date the ML dataset was created or last modified.

To find a specific ML dataset in the list, use the search tool in the toolbar. The search tool finds ML datasets whose Name or Description contains the entered keyword, and is not case-sensitive.
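
The matching rule described above (keyword contained in Name or Description, ignoring case) behaves roughly like the following Python sketch. The dictionary keys and sample entries are hypothetical and only illustrate the rule; they do not represent the product's internal data model.

```python
def search_datasets(datasets, keyword):
    """Return datasets whose name or description contains keyword, ignoring case."""
    needle = keyword.casefold()
    return [
        d for d in datasets
        if needle in d["name"].casefold() or needle in d["description"].casefold()
    ]


# Hypothetical list entries; field names are illustrative only.
datasets = [
    {"name": "VPN_Login_Train", "description": "Login events for anomaly detection"},
    {"name": "dns_queries", "description": "DNS request logs"},
]
matches = search_datasets(datasets, "LOGIN")
```

Here the keyword "LOGIN" matches the first entry even though the stored name uses different capitalization.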

Download list

To save the ML dataset list to your local PC, click Download in the toolbar and select the desired file format.

Refresh list

To reload the ML dataset list with the latest data, click Refresh in the toolbar.

Export ML datasets

To back up or move an ML dataset to another environment, you can export it as a file.

  1. Select the checkbox on the row of the ML dataset to export in the list.
  2. Click Export in the toolbar.
  3. In the Export ML Dataset dialog, set the file name and click OK.

Import ML datasets

To re-register a previously exported ML dataset file, use the import feature.

  1. Click Import in the toolbar.
  2. In the Import ML Dataset dialog, click Select File and choose the previously saved ML dataset file.
  3. After selecting the file, click OK.

Add an ML dataset

To add an ML dataset for reuse as input data for machine learning models, follow these steps.

  1. Under Policies > ML Datasets, click Add in the toolbar.

  2. On the Add ML Dataset screen, configure the settings.

    Add ML dataset

    • Name: Unique name to identify the ML dataset (up to 50 characters).
    • Description: Description of the dataset's purpose or structure (up to 2,000 characters).
    • Query Statement: Query that generates the ML dataset (up to 10,000 characters). The field values this query returns are used to train the model.

  3. Review the information for accuracy and click OK.

Note
ML dataset names must be unique. The ML dataset also cannot be added if the query cannot run, for example because the query statement is invalid or it references a data table that does not exist.
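
If you prepare dataset definitions outside the console, for example in a review checklist, the documented limits can be checked ahead of time with a small Python sketch like the one below. The function and constant names are hypothetical. The sketch assumes limits are counted in characters and that name uniqueness is an exact, case-sensitive match, neither of which this page confirms. It cannot tell whether a query will actually run.

```python
MAX_NAME_LENGTH = 50
MAX_DESCRIPTION_LENGTH = 2_000
MAX_QUERY_LENGTH = 10_000


def validate_dataset_form(name, description, query, existing_names):
    """Return a list of problems; an empty list means the documented limits are met."""
    problems = []
    if len(name) > MAX_NAME_LENGTH:
        problems.append("name exceeds 50 characters")
    if name in existing_names:  # assumption: exact, case-sensitive comparison
        problems.append("name is already in use")
    if len(description) > MAX_DESCRIPTION_LENGTH:
        problems.append("description exceeds 2,000 characters")
    if len(query) > MAX_QUERY_LENGTH:
        problems.append("query statement exceeds 10,000 characters")
    return problems


# Hypothetical values; the query text is a placeholder, not real query syntax.
problems = validate_dataset_form(
    name="vpn_login_train",
    description="Login events for anomaly detection",
    query="<query statement>",
    existing_names={"dns_queries"},
)
```

An empty result only means the length and uniqueness rules are satisfied. Query validity is still checked when you click OK.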

View an ML dataset

To check what data is actually included in an ML dataset, view the dataset detail screen.

  1. Click the Name of the ML dataset to view in the list.

    View ML dataset

  2. Review the data on the detail screen. If needed, click Add to the right of Filter to add filtering conditions.

    Add ML dataset filter condition

Edit an ML dataset

To update the description of an ML dataset to match current operating standards, use the edit feature.

  1. Click the Name of the ML dataset to edit in the list.
  2. On the Edit ML Dataset screen, update the information and click OK. Only the Description can be edited.

Note
The ML dataset name and query statement cannot be changed on the edit screen.

Delete an ML dataset

To clean up ML datasets that are no longer in use, use the delete feature.

  1. Select the checkbox on the row of the ML dataset to delete in the list.
  2. Click Delete from the action menu that appears in the toolbar.
  3. In the Delete ML Dataset dialog, review the list of ML datasets to delete and click Delete. Click Cancel if you do not want to delete.

Note
ML datasets linked to a machine learning model cannot be deleted. To delete one, first remove the ML dataset setting from the corresponding machine learning model.
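
The deletion rule in the note above can be pictured with the following Python sketch, which separates a selection into datasets that can be deleted and datasets still referenced by a model. The mapping from model to dataset and all names are hypothetical and are used only to illustrate the rule.

```python
def split_deletable(selected_names, model_to_dataset):
    """Split selected dataset names into (deletable, blocked_by_model)."""
    linked = set(model_to_dataset.values())
    deletable = [n for n in selected_names if n not in linked]
    blocked = [n for n in selected_names if n in linked]
    return deletable, blocked


# Hypothetical state: one model trains on "vpn_login_train".
model_to_dataset = {"login_anomaly_rf": "vpn_login_train"}
deletable, blocked = split_deletable(
    ["vpn_login_train", "dns_queries"], model_to_dataset
)
```

Once the dataset setting is removed from the model (an empty mapping in this sketch), the previously blocked dataset becomes deletable.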