ngram
Splits the string in a specified field into consecutive n-character units (n-grams) and outputs the token list. Use this for character-level feature extraction such as text similarity comparison and anomalous domain detection.
Command properties
| Property | Description |
|---|---|
| Command type | Transforming |
| Required permission | None |
| License usage | N/A |
| Parallel execution | Supported |
| Distributed execution | Not supported |
Syntax
`ngram n=INT field=FIELD`
Options
- `n=INT`: N-gram size. Splits the string into consecutive n-character units. Must be an integer between 1 and 10.
- `field=FIELD`: Name of the field to split into n-grams.
Output fields
| Field | Type | Description |
|---|---|---|
| ngrams | array | List of token results from the n-gram split. If the input string length is n or less, a single-element array containing the original string is returned. |
Error codes
Parse errors
| Error code | Message | Description |
|---|---|---|
| 40810 | Specify the n option for the ngram command. | The n option was not specified. |
| 40811 | The n value for the ngram command must be between 1 and 10. | The n value is less than 1 or greater than 10. |
| 40812 | Specify the field option for the ngram command. | The field option was not specified. |
Runtime errors
None
Description
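The ngram command splits the value of the specified field in each input record using the n-gram method. An n-gram is a substring of n consecutive characters; the command slides a window of n characters across the string one character at a time and collects every such substring, in order, into a list.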
For example, if n=3 and the input string is "google", the result is ["goo", "oog", "ogl", "gle"].
If the input field value is not a string, the record passes through unchanged. If the input string length is n or less, an array containing the original string as a single element is assigned to the ngrams field.
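The splitting rule above can be sketched in a few lines of Python. This is a minimal illustration of the documented behavior, not the command's actual implementation: the function name `char_ngrams` is hypothetical, and returning `None` for a non-string value is only a stand-in for "the record passes through unchanged".

```python
def char_ngrams(value, n):
    """Return the character n-grams of value, following the rules described above."""
    # The documented range for n is 1 to 10.
    if not 1 <= n <= 10:
        raise ValueError("n must be between 1 and 10")
    # Assumption: None models a record that passes through without an ngrams field.
    if not isinstance(value, str):
        return None
    # Strings of length n or less yield a single-element list with the original string.
    if len(value) <= n:
        return [value]
    # Slide a window of n characters across the string, one character at a time.
    return [value[i:i + n] for i in range(len(value) - n + 1)]


print(char_ngrams("google", 3))  # ['goo', 'oog', 'ogl', 'gle']
print(char_ngrams("go", 3))      # ['go']
```

A string of length m longer than n produces m - n + 1 tokens, which is why "google" (6 characters) yields 4 tokens for n=3.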
The generated n-gram token list can be used as input for the tfidf command, or to analyze character patterns in domain names.
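To show why n-gram tokens are a useful input for TF-IDF scoring, here is a rough Python sketch that scores the 3-grams of a few domain names. It assumes the textbook formula (term frequency times log(N / document frequency)); the weighting actually used by the tfidf command may differ, and the helper names `char_ngrams` and `tfidf` below are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Character n-grams; short strings yield a single-element list."""
    if len(text) <= n:
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf(docs):
    """docs: list of token lists. Returns one {token: score} dict per document."""
    doc_count = len(docs)
    # Document frequency: the number of documents that contain each token.
    df = Counter(token for tokens in docs for token in set(tokens))
    scores = []
    for tokens in docs:
        tf = Counter(tokens)
        scores.append({
            # Assumed textbook weighting: (count / length) * log(N / df).
            token: (count / len(tokens)) * math.log(doc_count / df[token])
            for token, count in tf.items()
        })
    return scores


domains = ["google.com", "xkcd123a.net", "example.org"]
token_lists = [char_ngrams(d, 3) for d in domains]
for domain, scores in zip(domains, tfidf(token_lists)):
    # Show the three highest-scoring 3-grams for each domain.
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3]
    print(domain, top)
```

Under this weighting, tokens that occur in every document score zero, while tokens that appear in only a few documents score higher. This is the intuition behind using character n-grams to surface unusual domain names.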
Examples
- Split domain names into 3-grams

  `json "[{'domain': 'google.com'}, {'domain': 'xkcd123a.net'}, {'domain': 'example.org'}]" | ngram n=3 field=domain`

  Splits the string in the `domain` field into 3-character units and stores the result in the `ngrams` field.

- Calculate TF-IDF scores from n-gram tokens

  `table duration=1d dns_logs | ngram n=3 field=domain | eval line = strjoin(" ", ngrams) | tfidf line`

  Splits domain names into 3-grams, then joins the tokens with spaces to calculate TF-IDF scores.