
Datajson v5.2 #12863


Draft: wants to merge 12 commits into master
118 changes: 114 additions & 4 deletions doc/userguide/rules/datasets.rst
@@ -3,8 +3,8 @@
Datasets
========

Using the ``dataset`` and ``datarep`` keywords it is possible
to match on large amounts of data against any sticky buffer.

For example, to match against a DNS black list called ``dns-bl``::

@@ -79,7 +79,9 @@ Syntax::
dataset:<cmd>,<name>,<options>;

dataset:<set|unset|isset|isnotset>,<name> \
[, type <string|md5|sha256|ipv4|ip>, save <file name>, load <file name>, state <file name>, memcap <size>, hashsize <size>
, format <csv|json|jsonline>, enrichment_key <output_key>, value_key <json_key>, array_key <json_path>,
remove_key];

type <type>
    the data type: string, md5, sha256, ipv4, ip

@@ -94,6 +96,23 @@ memcap <size>
    maximum memory limit for the respective dataset
hashsize <size>
    allowed size of the hash for the respective dataset
format <type>
    the format of the file: csv, json or jsonline. Defaults to csv. See
    :ref:`dataset with json format <datasets_json>` for the json
    and jsonline options
enrichment_key <key>
    the key under which the JSON data attached to the matching value is
    added to the alert event (json and jsonline formats)
value_key <key>
    the key in the JSON data that holds the value to add to the set
    (json and jsonline formats)
array_key <key>
    the key of the JSON array that contains the elements to add to the
    set (json format only)
remove_key
    if set, the JSON key pointed to by value_key is removed from the
    data added to the alert event


.. note:: 'type' is mandatory and needs to be set.

@@ -146,6 +165,47 @@ The rules will only match if the data is in the list and the reputation
value is higher than 200.


.. _datasets_json:

dataset with json
~~~~~~~~~~~~~~~~~

DataJSON allows matching data against a set and outputting data attached
to the matching value in the event.

Two formats are supported: ``json`` and ``jsonline``. The difference is that
the ``json`` format expects the file to be a single JSON object, while
``jsonline`` handles files with one JSON object per line. The ``jsonline``
format is useful for large files as the parsing is done line by line.
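
For instance, a ``jsonline`` file for an ``ip`` dataset with ``value_key ip``
could look like this (entries hypothetical)::

    {"ip": "192.0.2.1", "origin": "honeypot"}
    {"ip": "198.51.100.2", "origin": "scanner"}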

Syntax::

dataset:<cmd>,<name>,<options>;

dataset:<isset|isnotset>,<name> \
[, type <string|md5|sha256|ipv4|ip>, load <file name>, format <json|jsonline>, memcap <size>, hashsize <size>, enrichment_key <json_key> \
, value_key <json_key>, array_key <json_path>];

Example rules could look like::

alert http any any -> any any (msg:"IP match"; ip.dst; dataset:isset,bad_ips, type ip, load bad_ips.json, format json, enrichment_key bad_ones, value_key ip; sid:8000001;)

In this example, the match will occur if the destination IP is in the set and the
alert will have an ``alert.extra.bad_ones`` subobject that will contain the JSON
data associated with the value (``bad_ones`` coming from the ``enrichment_key`` option).
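
A sketch of the resulting enrichment in the alert event, assuming the
hypothetical entries above::

    "alert": {
        "extra": {
            "bad_ones": {"ip": "192.0.2.1", "origin": "honeypot"}
        }
    }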

When format is ``json`` or ``jsonline``, the ``value_key`` is used to get
the value in the line (``jsonline`` format) or in the array (``json`` format).
At least one element in the data file needs to have the ``value_key`` present
for the load to succeed.

Member
    Do we need both?

Contributor Author
    Some threat intel software such as MISP produces a list of IOCs with
    context in the reply to a REST API call, so the json format can be used
    with them. But if scripting is involved to crunch the data, the jsonline
    format can be convenient to use as it requires less memory to generate.
If ``array_key`` is present, Suricata will extract the corresponding
subobject, which has to be a JSON array, and search this array for elements
to add to the set. This is only valid for the ``json`` format.

If you don't want to have the ``value_key`` in the alert, you can use the
``remove_key`` option. This will remove the key from the alert event.
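
For instance, with the hypothetical ``bad_ones`` enrichment above and
``remove_key`` set, the ``value_key`` (here ``ip``) would be stripped::

    "bad_ones": {"origin": "honeypot"}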

See :ref:`Datajson format <datajson_data>` for more information.

Rule Reloads
------------

@@ -243,6 +303,28 @@ Syntax::

dataset-dump

dataset-add-json
~~~~~~~~~~~~~~~~

Unix Socket command to add data to a set. On success, the addition becomes
active instantly.

Syntax::

dataset-add-json <set name> <set type> <data> <json_info>

set name
    Name of an already defined dataset
set type
    Data type: string, md5, sha256, ipv4, ip
data
    Data to add in serialized form (base64 for string, hex notation for
    md5/sha256, string representation for ipv4/ip)
json_info
    JSON object to attach to the value in the set

Example adding 'google.com' to set 'myset'::

dataset-add-json myset string Z29vZ2xlLmNvbQ== {"city":"Mountain View"}
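
As a sketch, the base64 value can be produced with a shell one-liner and the
command sent through ``suricatasc`` (assuming a running Suricata with the
unix socket enabled)::

    $ echo -n "google.com" | base64
    Z29vZ2xlLmNvbQ==
    $ suricatasc -c 'dataset-add-json myset string Z29vZ2xlLmNvbQ== {"city":"Mountain View"}'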


File formats
------------

@@ -285,13 +367,41 @@ which when piped to ``base64 -d`` reveals its value::
datarep
~~~~~~~

The datarep format follows the dataset format, except that there is one
extra CSV field:

Syntax::

<data>,<value>
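
For example, a datarep file for a ``string`` dataset could look like this
(values hypothetical)::

    Z29vZ2xlLmNvbQ==,100
    ZXhhbXBsZS5jb20=,250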

.. _datajson_data:

dataset with JSON enrichment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If ``format json`` is used in the parameters of a dataset keyword, then the
loaded file has to contain a valid JSON object.

If the ``value_key`` option is present, then the file has to contain a valid
JSON object containing an array whose elements have a key equal to the
``value_key`` value.

For example, if the file ``file.json`` looks like the following (typical of the return of a REST API call) ::

    {
        "time": "2024-12-21",
        "response": {
            "threats": [
                {"host": "toto.com", "origin": "japan"},
                {"host": "grenouille.com", "origin": "french"}
            ]
        }
    }

then the match checking the list of threats using datajson can be defined
as ::

    http.host; dataset:isset,threats, type string, load file.json, format json, enrichment_key threat, value_key host, array_key response.threats;
Member
    So this matches on hosts like toto.com, etc? I don't see the key
    ``threat``? Where does this format come from?

Contributor Author
    I see the confusion here, ``threat`` will be used in the output:

    {
        "alert": {
            "context": {
                "threat": {
                    "host": "toto.com",
                    "origin": "japan"
                }
            }
        }
    }

    I'm adding this example to the updated doc.


.. _datasets_file_locations:

File Locations
61 changes: 61 additions & 0 deletions doc/userguide/rules/payload-keywords.rst
@@ -774,6 +774,67 @@ qualities of pcre as well. These are:
.. note:: The following characters must be escaped inside the content:
``;`` ``\`` ``"``

PCRE extraction
~~~~~~~~~~~~~~~

It is possible to capture groups from the regular expression and log them
into the alert events.

There are 3 capabilities:

* pkt: the extracted group is logged as a pkt variable in ``metadata.pktvars``
* alert: the extracted group is logged to the ``alert.extra`` subobject
* flow: the extracted group is stored in a flow variable and ends up in ``metadata.flowvars``

To use the feature, the parameters of the pcre keyword need to be extended.
After the regular pcre regex and options comes a comma-separated list of
variable names. The prefix is ``flow:``, ``pkt:`` or ``alert:``, and the
names can now contain special characters. The names map to the capturing
substring expressions in order ::

pcre:"/([a-z]+)\/[a-z]+\/(.+)\/(.+)\/changelog$/GUR, \
flow:ua/ubuntu/repo,flow:ua/ubuntu/pkg/base, \
flow:ua/ubuntu/pkg/version";

This would result in the alert event containing something like ::

"metadata": {
"flowvars": [
{"ua/ubuntu/repo": "fr"},
{"ua/ubuntu/pkg/base": "curl"},
{"ua/ubuntu/pkg/version": "2.2.1"}
]
}

The other events on the same flow such as the ``flow`` one will
also have the flow vars.
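
For example, the end-of-flow record could then carry (a sketch, reusing the
values above)::

    {
        "event_type": "flow",
        "metadata": {
            "flowvars": [
                {"ua/ubuntu/repo": "fr"},
                {"ua/ubuntu/pkg/base": "curl"},
                {"ua/ubuntu/pkg/version": "2.2.1"}
            ]
        }
    }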

If this is not wanted, you can use the ``alert:`` construct to only
get the event in the alert ::

pcre:"/([a-z]+)\/[a-z]+\/(.+)\/(.+)\/changelog$/GUR, \
alert:ua/ubuntu/repo,alert:ua/ubuntu/pkg/base, \
alert:ua/ubuntu/pkg/version";

With that syntax, the result of the extraction will appear like ::

"alert": {
"extra": {
"ua/ubuntu/repo": "fr",
"ua/ubuntu/pkg/base": "curl",
"ua/ubuntu/pkg/version": "2.2.1"
]
}

The extraction scopes can also be combined.

It is also possible to extract a key/value pair in the ``pkt`` scope.
One capture is the key, the second the value. The notation is similar to
the previous one ::

    pcre:"/^([A-Z]+) (.*)\r\n/, pkt:key,pkt:value";

``key`` and ``value`` are simply hardcoded names that trigger the key/value
extraction. As a consequence, they can't be used as names for the variables.
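
As a sketch, matching a hypothetical payload starting with
``GET /index.html HTTP/1.1`` could yield something like ::

    "metadata": {
        "pktvars": [
            {"GET": "/index.html HTTP/1.1"}
        ]
    }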

Suricata's modifiers
~~~~~~~~~~~~~~~~~~~~

7 changes: 6 additions & 1 deletion etc/schema.json
@@ -217,6 +217,11 @@
"xff": {
"type": "string"
},
"extra": {
"type": "object",
"additionalProperties": true,
"description": "Extra data created by keywords such as datajson"
},
"metadata": {
"type": "object",
"properties": {
@@ -2886,7 +2891,7 @@
"type": "string"
}
},
"additionalProperties": false
"additionalProperties": true
}
},
"flowints": {
32 changes: 30 additions & 2 deletions rust/suricatasc/src/unix/commands.rs
@@ -71,12 +71,11 @@ impl<'a> CommandParser<'a> {
}

pub fn parse(&self, input: &str) -> Result<serde_json::Value, CommandParseError> {
let mut parts: Vec<&str> = input.split(' ').map(|s| s.trim()).collect();
if parts.is_empty() {
return Err(CommandParseError::Other("No command provided".to_string()));
}
let command = parts[0];

let spec = self
.commands
@@ -91,6 +90,13 @@

// Calculate the number of required arguments for better error reporting.
let required = spec.iter().filter(|e| e.required).count();
let optional = spec.iter().filter(|e| !e.required).count();
// Handle the case where the command has only required arguments and allow
// last one to contain spaces.
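// E.g. for `dataset-add-json myset string Z29vZ2xlLmNvbQ== {"city": "Mountain View"}`,
// splitn keeps the trailing JSON argument, which may itself contain
// spaces, intact as the final element.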
if optional == 0 {
parts = input.splitn(required + 1, ' ').collect();
}
let args = &parts[1..];

let mut json_args = HashMap::new();

@@ -386,6 +392,28 @@ fn command_defs() -> Result<HashMap<String, Vec<Argument>>, serde_json::Error> {
"type": "string",
},
],
"dataset-add-json": [
{
"name": "setname",
"required": true,
"type": "string",
},
{
"name": "settype",
"required": true,
"type": "string",
},
{
"name": "datavalue",
"required": true,
"type": "string",
},
{
"name": "datajson",
Member
    datajson

Contributor Author
    we have the option datavalue just before and this part is the json, so
    the naming is ok I think.

"required": true,
"type": "string",
},
],
"get-flow-stats-by-id": [
{
"name": "flow_id",
2 changes: 2 additions & 0 deletions src/Makefile.am
@@ -48,6 +48,7 @@ noinst_HEADERS = \
conf.h \
conf-yaml-loader.h \
counters.h \
datajson.h \
datasets.h \
datasets-ipv4.h \
datasets-ipv6.h \
@@ -627,6 +628,7 @@ libsuricata_c_a_SOURCES = \
conf.c \
conf-yaml-loader.c \
counters.c \
datajson.c \
datasets.c \
datasets-ipv4.c \
datasets-ipv6.c \