ML Dataset
Overview
An ML dataset is log data prepared in advance for training a machine learning model. Under Policies > ML Datasets, security analysts can define datasets and consistently manage the input data that each model trains on.
In practice, you first explore the data you want to use in Analysis > Queries or Analysis > Pivots, keep only the necessary fields or preprocess them, and then save the result as an ML dataset. You then select the saved dataset when training a machine learning model. If string fields are difficult to use directly for training, consider converting them into numeric vectors with a query command such as tfidf.
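To make the idea of vectorizing strings concrete, here is a minimal Python sketch of textbook TF-IDF weighting (term frequency multiplied by the log of inverse document frequency). It is an illustration only: the syntax and exact weighting of the tfidf query command are not described on this page and may differ. The function name, the whitespace tokenization, and the sample values are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors): one TF-IDF vector per input string.

    Assumption: tokens are lowercase, whitespace-separated words.
    """
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)

    # Document frequency: in how many strings each token appears.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))

    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty strings
        vectors.append([
            (counts[tok] / total) * math.log(n_docs / doc_freq[tok])
            for tok in vocabulary
        ])
    return vocabulary, vectors


# Hypothetical values of a string field in the log data.
request_lines = [
    "GET /login HTTP/1.1",
    "GET /index.html HTTP/1.1",
    "POST /login HTTP/1.1",
]
vocabulary, vectors = tfidf_vectors(request_lines)
```

A token that occurs in every string, such as `http/1.1` here, gets a weight of zero, while rarer tokens such as `post` get a positive weight. That is why TF-IDF tends to highlight the parts of a string that distinguish one event from another.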
All users, including administrators, can view the ML dataset list and its contents. Administrators and cluster administrators can add, edit, and delete ML datasets.
Considerations
Prepare ML datasets to reflect the data generated in your actual operating environment as closely as possible. Keep the following in mind when creating a dataset.
- Include diverse information
  - Prepare data that includes a variety of characteristics found in the real operating environment, such as time, location, user information, event type, and success/failure status.
- Select target variables for supervised learning
  - Supervised learning models such as Random Forest require a target variable field to predict. When creating a dataset for supervised learning, make sure to include the target variable.
- Preprocess data
  - Remove or correct data errors and extreme values. If the value ranges between fields differ too greatly, apply normalization or scaling.
- Vectorize strings
  - Because ordinary strings are difficult to train on directly, they need to be converted to numbers. Vectorize them using a query command such as tfidf if needed.
- Data quality and quantity
  - Data must be accurate, complete, and consistent. It is advisable to secure a sufficient quantity so that both normal and abnormal patterns are adequately represented.
- Prevent overfitting
  - Consider techniques such as cross-validation and regularization to avoid fitting too closely to specific training data.
- Reflect changing patterns
  - Because security threat patterns can change over time, it is advisable to retrain operating models periodically.
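The preprocessing advice above (correct extreme values, then bring fields onto a comparable range) can be sketched in Python as follows. This is one possible approach under stated assumptions, not the product's built-in behavior: the function names, the 10%/90% clipping bounds, and the sample field values are all hypothetical.

```python
def clip_to_quantiles(values, lower=0.1, upper=0.9):
    """Clamp values outside the given quantiles to the quantile bounds.

    Uses a simple nearest-rank quantile; expects a non-empty list.
    """
    ordered = sorted(values)
    last = len(ordered) - 1
    low = ordered[round(lower * last)]
    high = ordered[round(upper * last)]
    return [min(max(v, low), high) for v in values]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]; expects a non-empty list."""
    low, high = min(values), max(values)
    if high == low:
        # A constant field carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Hypothetical numeric field with one extreme value.
bytes_sent = [120, 150, 130, 140, 125, 135, 145, 155, 128, 9_000_000]
clipped = clip_to_quantiles(bytes_sent)
scaled = min_max_scale(clipped)
```

Clipping before scaling matters: if the extreme value were left in place, min-max scaling would compress every ordinary value into a tiny interval near zero.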
View/search ML datasets
You can view or search the ML dataset list under Policies > ML Datasets.
- Name: Name of the ML dataset. Click the name to go to the dataset detail screen.
- Description: Description of the purpose or structure of the ML dataset.
- Count: Number of data entries currently in the ML dataset.
- Modified At: Date the ML dataset was created or last modified.
To find a specific ML dataset in the list, use the search tool in the toolbar. The search tool finds ML datasets whose Name or Description contains the entered keyword, and is not case-sensitive.
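The matching rule described above (keyword contained in Name or Description, ignoring case) can be modeled with a short Python sketch. It only illustrates the documented behavior; the class name, function name, and sample rows are hypothetical, and details such as whitespace trimming are not specified here.

```python
from dataclasses import dataclass


@dataclass
class DatasetRow:
    """Hypothetical stand-in for one row of the ML dataset list."""
    name: str
    description: str


def search_datasets(rows, keyword):
    """Return rows whose name or description contains keyword, ignoring case."""
    needle = keyword.casefold()
    return [
        row for row in rows
        if needle in row.name.casefold() or needle in row.description.casefold()
    ]


rows = [
    DatasetRow("vpn_login_features", "Login events for anomaly detection"),
    DatasetRow("Proxy-Traffic", "Outbound HTTP requests"),
]
matches = search_datasets(rows, "LOGIN")
```

Here the keyword `LOGIN` matches only the first row, because both its name and its description contain "login" in some letter case.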
Download list
To save the ML dataset list to your local PC, click the download icon in the toolbar and select the desired file format.
Refresh list
To reload the ML dataset list with the latest data, click the refresh icon in the toolbar.
Export ML datasets
To back up or move an ML dataset to another environment, you can export it as a file.
- Select the checkbox on the row of the ML dataset to export in the list.
- Click Export in the toolbar.
- In the Export ML Dataset dialog, set the file name and click OK.
Import ML datasets
To re-register a previously exported ML dataset file, use the import feature.
- Click Import in the toolbar.
- In the Import ML Dataset dialog, click Select File and choose the previously saved ML dataset file.
- After selecting the file, click OK.
Add an ML dataset
To add an ML dataset for reuse as input data for machine learning models, follow these steps.
- Under Policies > ML Datasets, click Add in the toolbar.
- On the Add ML Dataset screen, configure the settings.
  - Name: Unique name to identify the ML dataset (up to 50 characters).
  - Description: Description of the dataset's purpose or structure (up to 2,000 characters).
  - Query Statement: Query to run when generating the ML dataset. The field values produced by this query are used when training the model (up to 10,000 characters).
- Review the information for accuracy and click OK.
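If you prepare dataset definitions outside the console, for example in a script or a review checklist, the limits listed above can be checked ahead of time. The Python sketch below is a hypothetical helper, not part of the product: the function name and error messages are invented, and it assumes the limits count characters rather than bytes.

```python
MAX_NAME = 50            # Name: up to 50 characters
MAX_DESCRIPTION = 2_000  # Description: up to 2,000 characters
MAX_QUERY = 10_000       # Query Statement: up to 10,000 characters


def validate_dataset_form(name, description, query, existing_names=()):
    """Return a list of problems; an empty list means the limits are met.

    Assumption: limits are measured in characters (len), not bytes.
    """
    errors = []
    if len(name) > MAX_NAME:
        errors.append(f"Name exceeds {MAX_NAME} characters")
    if name in existing_names:
        errors.append("Name must be unique")
    if len(description) > MAX_DESCRIPTION:
        errors.append(f"Description exceeds {MAX_DESCRIPTION} characters")
    if len(query) > MAX_QUERY:
        errors.append(f"Query Statement exceeds {MAX_QUERY} characters")
    return errors


# Hypothetical definition; the query text is a placeholder, not real syntax.
problems = validate_dataset_form(
    name="vpn_login_features",
    description="Login events prepared for anomaly detection",
    query="placeholder query text",
    existing_names={"proxy_traffic"},
)
```

Because only the Description can be changed after creation, checking the Name and Query Statement before clicking OK avoids having to delete and recreate the dataset.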
View an ML dataset
To check what data is actually included in an ML dataset, view the dataset detail screen.
- Click the Name of the ML dataset to view in the list.
- Review the data on the detail screen. If needed, click Add to the right of Filter to add filtering conditions.
Edit an ML dataset
To update the description of an ML dataset to match current operating standards, use the edit feature.
- Click the Name of the ML dataset to edit in the list.
- On the Edit ML Dataset screen, update the information and click OK. Only the Description can be edited.
Delete an ML dataset
To clean up ML datasets that are no longer in use, use the delete feature.
- Select the checkbox on the row of the ML dataset to delete in the list.
- Click Delete from the action menu that appears in the toolbar.
- In the Delete ML Dataset dialog, review the list of ML datasets to delete and click Delete. Click Cancel if you do not want to delete.



