Machine Learning Training Dataset

Overview

A machine learning training dataset (ML dataset) is a collection of data used for machine learning model training. Since machine learning models learn patterns from given datasets and use them to predict values or patterns in similar data, the quality of the training dataset significantly impacts the model's outcome.

Considerations

Training datasets should closely match the data generated in the environment where the machine learning model will be applied. Consider the following aspects when creating a dataset:

  • Including Diverse Information: Collect and prepare data that encompasses various characteristics such as time, location, user information, event type, and success/failure. It is recommended to use real-world data.

  • Selecting Target Variables for Supervised Learning: Supervised learning models, such as Random Forest, require a target variable for prediction. When creating a training dataset, ensure that it includes the target variable (the value the model is expected to predict).

  • Data Preprocessing: Remove or correct erroneous data and extreme values. Normalize or scale features so that features with large numeric ranges do not dominate features with small ranges.

  • String Vectorization: Machine learning models cannot process raw text, so convert textual data into numerical representations using a method such as the tfidf query command.

  • Data Quality and Quantity: Ensure data is accurate, complete, and consistent. A sufficient volume of data is needed to learn diverse patterns; in particular, include enough examples of abnormal events.

  • Preventing Overfitting: Apply techniques such as cross-validation and regularization to avoid overfitting to the training data.

  • Changing Patterns: Since security threat patterns evolve over time, periodically retrain the machine learning model.
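
The Data Preprocessing and String Vectorization items above can be illustrated with a minimal Python sketch. It is not the product's implementation: the function names, the percentile cut-offs, and the tf × log(N / df) weighting are assumptions chosen for illustration, and the tfidf query command may compute its weights differently.

```python
import math
from collections import Counter


def clip_outliers(values, lower_pct=0.05, upper_pct=0.95):
    """Clamp extreme values to a percentile range (assumed cut-offs)."""
    ordered = sorted(values)
    low = ordered[int(lower_pct * (len(ordered) - 1))]
    high = ordered[int(upper_pct * (len(ordered) - 1))]
    return [min(max(v, low), high) for v in values]


def min_max_scale(values):
    """Rescale values to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(tokenized)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical sample values: one extreme byte count and two short messages.
scaled = min_max_scale(clip_outliers([120, 130, 125, 128, 9000]))
weights = tf_idf(["login failed", "login ok"])
```

In this sketch a term that appears in every document (here, "login") receives a weight of zero, which reflects the idea that terms common to all records do not help distinguish them.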

Search ML Dataset

You can view or search the training dataset list under Policies > ML Datasets.

The ML dataset list shows the following columns:

  • Name: Training dataset name
  • Description: Additional information about the training dataset
  • Count: Number of data entries in the training dataset
  • Modified At: Creation or last modification date of the training dataset

To find a specific ML dataset, use the search tool in the toolbar. The search tool filters datasets based on keywords in the Name and Description fields. The search is not case-sensitive.
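
The filtering behavior described above can be sketched in Python as follows. This is only an illustration of case-insensitive keyword matching on the Name and Description fields; the function name, the dictionary keys, and the use of substring matching are assumptions, not the product's actual search logic.

```python
def search_datasets(datasets, keyword):
    """Return datasets whose name or description contains the keyword.

    Matching ignores case. Substring matching is an assumption.
    """
    needle = keyword.casefold()
    return [
        d for d in datasets
        if needle in d["name"].casefold()
        or needle in d["description"].casefold()
    ]


# Hypothetical list entries.
datasets = [
    {"name": "Login Failures", "description": "Failed SSH logins"},
    {"name": "dns", "description": "DNS query volume"},
]
matches = search_datasets(datasets, "LOGIN")
```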

Download ML Dataset List

To download the training dataset list as a file to your local PC, click Download in the toolbar and select the desired file format.

Refresh ML Dataset List

To update the training dataset list with the latest information, click Refresh in the toolbar.

Import and Export ML Dataset

You can export or import training datasets as files, which can be used for backup and restoration.

To export a training dataset:

  1. Select the checkbox of the dataset to export from the list.
  2. Click Export in the toolbar.
  3. In the Export Training Dataset dialog box, set the name and click OK.

To import a training dataset:

  1. Click Import in the toolbar.
  2. In the Import Training Dataset dialog box, click Select File and choose a previously exported dataset file.
  3. After selecting the file, click OK.

Add ML Dataset

To add a training dataset:

  1. Go to Policies > ML Datasets and click Add in the toolbar.

  2. In the Add ML Dataset screen, enter the required values and click OK.

    • Name: Training dataset name (up to 50 characters)
    • Description: Detailed description of the dataset (up to 2,000 characters)
    • Query Statement: Query used to generate the training dataset (up to 10,000 characters). The field values returned by the query are used when training a machine learning model. You can modify the field names if needed.

Note
Dataset names must be unique. In addition, a dataset cannot be added if its query statement is invalid or refers to a data table that does not exist.
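
If you prepare dataset definitions in a script before entering them, the limits above can be checked in advance. The following Python sketch is hypothetical: the function and constant names are invented for illustration, it assumes that Name and Query Statement are the required fields, and it assumes that name uniqueness is an exact, case-sensitive comparison. It cannot tell whether a query statement is valid or whether the data table exists; only the product can verify that.

```python
MAX_NAME_LENGTH = 50
MAX_DESCRIPTION_LENGTH = 2000
MAX_QUERY_LENGTH = 10000


def validate_new_dataset(name, description, query, existing_names):
    """Return a list of problems; an empty list means the checks passed."""
    errors = []
    if not name:
        errors.append("name is required")  # assumed to be a required field
    elif len(name) > MAX_NAME_LENGTH:
        errors.append("name exceeds 50 characters")
    elif name in existing_names:
        errors.append("name already exists")
    if len(description) > MAX_DESCRIPTION_LENGTH:
        errors.append("description exceeds 2,000 characters")
    if not query:
        errors.append("query statement is required")  # assumed required
    elif len(query) > MAX_QUERY_LENGTH:
        errors.append("query statement exceeds 10,000 characters")
    return errors


# Hypothetical values; "web_logins" is not a name taken from the product.
problems = validate_new_dataset(
    "web_logins", "Login events per user", "table web_logins", {"dns"}
)
```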

View ML Dataset

Click the name of a training dataset in the list to view its data.

Filters can be applied to refine displayed data. To add a filter, click Add next to Filter.

Edit ML Dataset

To edit a training dataset:

  1. Click the name of the training dataset in the list.
  2. In the Edit ML Dataset screen, update the information and click OK. The only editable field is Description.

Delete ML Dataset

To delete a training dataset:

  1. Select the checkbox of the dataset(s) to delete from the list.
  2. Click Delete in the toolbar.
  3. In the Delete ML Dataset dialog box, review the selected datasets and click Delete. To cancel, click Cancel.

Note
A dataset that is linked to a machine learning model cannot be deleted. To delete it, first remove the dataset from every machine learning model that uses it.