ngram

Splits the string in a specified field into consecutive n-character units (n-grams) and outputs the token list. Use this for character-level feature extraction such as text similarity comparison and anomalous domain detection.

Command properties

| Property | Description |
| --- | --- |
| Command type | Transforming |
| Required permission | None |
| License usage | N/A |
| Parallel execution | Supported |
| Distributed execution | Not supported |

Syntax

ngram n=INT field=FIELD

Options

n=INT
N-gram size: the number of characters in each unit. Must be an integer from 1 to 10, inclusive.
field=FIELD
Name of the field to split into n-grams.

Output fields

| Field | Type | Description |
| --- | --- | --- |
| ngrams | array | List of token results from the n-gram split. If the input string length is n or less, a single-element array containing the original string is returned. |

Error codes

Parse errors
| Error code | Message | Description |
| --- | --- | --- |
| 40810 | Specify the n option for the ngram command. | The n option was not specified. |
| 40811 | The n value for the ngram command must be between 1 and 10. | The n value is less than 1 or greater than 10. |
| 40812 | Specify the field option for the ngram command. | The field option was not specified. |
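
The parse-time checks above can be pictured with a short Python sketch. This is only an illustration of the documented rules, not the command's actual implementation: the names `NgramParseError` and `validate_ngram_options` are hypothetical, the order in which the checks run is an assumption, and the behavior for a non-integer n is not documented here.

```python
from typing import Dict, Tuple


class NgramParseError(Exception):
    """Hypothetical exception carrying a documented parse error code."""

    def __init__(self, code: int, message: str) -> None:
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def validate_ngram_options(options: Dict[str, str]) -> Tuple[int, str]:
    """Check the n and field options; the check order is an assumption."""
    if "n" not in options:
        raise NgramParseError(40810, "Specify the n option for the ngram command.")
    # A non-integer n raises ValueError here; the reference does not say
    # which error the real command reports in that case.
    n = int(options["n"])
    if not 1 <= n <= 10:
        raise NgramParseError(
            40811, "The n value for the ngram command must be between 1 and 10."
        )
    if "field" not in options:
        raise NgramParseError(40812, "Specify the field option for the ngram command.")
    return n, options["field"]
```

For instance, `validate_ngram_options({"n": "3", "field": "domain"})` returns `(3, "domain")`, while omitting `field` raises the hypothetical error with code 40812.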
Runtime errors

None

Description

The ngram command splits the value of the specified field in each input record using the n-gram method. An n-gram is a substring of n consecutive characters; the command emits every such substring in order, moving the window forward one character at a time, so adjacent tokens overlap.

For example, if n=3 and the input string is "google", the result is ["goo", "oog", "ogl", "gle"].

If the input field value is not a string, the record passes through unchanged. If the input string length is n or less, an array containing the original string as a single element is assigned to the ngrams field.
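
The rules described above can be modeled in a few lines of Python. This is a minimal sketch of the documented behavior, not the command's implementation: `ngram_tokens` and `apply_ngram` are hypothetical names, records are represented as plain dictionaries, and "character" is taken to mean one Python string element, which may differ from how the command counts characters.

```python
from typing import Any, Dict, List


def ngram_tokens(text: str, n: int) -> List[str]:
    """Return every window of n consecutive characters, in order."""
    if len(text) <= n:
        # Length n or less: a single-element list with the original string.
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def apply_ngram(record: Dict[str, Any], field: str, n: int) -> Dict[str, Any]:
    """Model one record passing through the command."""
    value = record.get(field)
    if not isinstance(value, str):
        # Non-string (or missing) value: the record passes through unchanged.
        return record
    # Copy the record and assign the token list to the ngrams field.
    return {**record, "ngrams": ngram_tokens(value, n)}
```

With this sketch, `ngram_tokens("google", 3)` yields `["goo", "oog", "ogl", "gle"]`, matching the example above, and `ngram_tokens("ab", 3)` yields `["ab"]`.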

The generated n-gram token list can be used as input for the tfidf command, or to analyze character patterns in domain names.

Examples

  1. Split domain names into 3-grams

    json "[{'domain': 'google.com'}, {'domain': 'xkcd123a.net'}, {'domain': 'example.org'}]"
    | ngram n=3 field=domain
    

    Splits the string in the domain field into 3-character units and stores the result in the ngrams field.

  2. Calculate TF-IDF scores from n-gram tokens

    table duration=1d dns_logs
    | ngram n=3 field=domain
    | eval line = strjoin(" ", ngrams)
    | tfidf line
    

    Splits domain names into 3-grams, then joins the tokens with spaces to calculate TF-IDF scores.
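
To make the data flow of this pipeline concrete, the following Python sketch mirrors its three steps outside the query language. It rests on assumptions: the scoring formula of the tfidf command is not described in this reference, so the sketch substitutes the textbook form (term frequency times log(N / document frequency)); the function names are hypothetical; and the sample domains are borrowed from the first example.

```python
import math
from collections import Counter
from typing import Dict, List


def ngram_tokens(text: str, n: int) -> List[str]:
    """Overlapping n-character windows, per the documented splitting rules."""
    if len(text) <= n:
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_scores(documents: List[List[str]]) -> List[Dict[str, float]]:
    """Textbook TF-IDF (an assumption, not the tfidf command's formula)."""
    n_docs = len(documents)
    doc_freq: Counter = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each token once per document
    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append(
            {t: (c / total) * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return scores


domains = ["google.com", "xkcd123a.net", "example.org"]
# Step 1: ngram n=3; step 2: join the tokens with spaces into one line.
lines = [" ".join(ngram_tokens(d, 3)) for d in domains]
# Step 3: score each line, treating the space-separated tokens as terms.
scores = tfidf_scores([line.split(" ") for line in lines])
```

Under this formula, a trigram that appears in several domains (here `le.`, shared by `google.com` and `example.org`) scores lower than one unique to a single domain, which is the property that makes rare character patterns stand out.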