
Datajson v5.2 #12863


Draft: wants to merge 12 commits into master
118 changes: 114 additions & 4 deletions doc/userguide/rules/datasets.rst
@@ -3,8 +3,8 @@
Datasets
========

Using the ``dataset`` and ``datarep`` keywords it is possible
to match on large amounts of data against any sticky buffer.

For example, to match against a DNS black list called ``dns-bl``::

@@ -79,7 +79,9 @@ Syntax::
dataset:<cmd>,<name>,<options>;

dataset:<set|unset|isset|isnotset>,<name> \
[, type <string|md5|sha256|ipv4|ip>, save <file name>, load <file name>, state <file name>, memcap <size>, hashsize <size>
, format <csv|json|jsonline>, enrichment_key <output_key>, value_key <json_key>, array_key <json_path>,
remove_key];

type <type>
    the data type: string, md5, sha256, ipv4, ip

@@ -94,6 +96,23 @@ memcap <size>
    maximum memory limit for the respective dataset
hashsize <size>
    allowed size of the hash for the respective dataset
format <type>
    the format of the file: csv, json or jsonline. Defaults to csv. See
    :ref:`dataset with json format <datasets_json>` for the json
    and jsonline options
enrichment_key <key>
    the key under which the JSON data attached to the matching value is
    added to the alert event (json and jsonline formats)
value_key <key>
    the key in the JSON data that holds the value to add to the set
    (json and jsonline formats)
array_key <key>
    the key of the JSON array that contains the elements to add to the
    set (json format only)
remove_key
    if set, the JSON key pointed to by value_key is removed from the
    data added to the alert event


.. note:: 'type' is mandatory and needs to be set.

@@ -146,6 +165,47 @@ The rules will only match if the data is in the list and the reputation
value is higher than 200.


.. _datasets_json:

dataset with json
~~~~~~~~~~~~~~~~~

DataJSON allows matching data against a set and outputting data attached
to the matching value in the event.

Two formats are supported: ``json`` and ``jsonline``. The difference is that
the ``json`` format expects the file to be a single JSON object, while
``jsonline`` handles files with one JSON object per line. The ``jsonline``
format is useful for large files as the parsing is done line by line.
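
For instance, a ``jsonline`` file for an ``ip`` dataset with ``value_key ip``
could look like this (entries hypothetical)::

    {"ip": "192.0.2.1", "origin": "honeypot"}
    {"ip": "198.51.100.2", "origin": "scanner"}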

Syntax::

dataset:<cmd>,<name>,<options>;

dataset:<isset|isnotset>,<name> \
[, type <string|md5|sha256|ipv4|ip>, load <file name>, format <json|jsonline>, memcap <size>, hashsize <size>, enrichment_key <json_key> \
, value_key <json_key>, array_key <json_path>];

Example rules could look like::

alert http any any -> any any (msg:"IP match"; ip.dst; dataset:isset,bad_ips, type ip, load bad_ips.json, format json, enrichment_key bad_ones, value_key ip; sid:8000001;)

In this example, the match will occur if the destination IP is in the set and the
alert will have an ``alert.extra.bad_ones`` subobject that will contain the JSON
data associated with the value (``bad_ones`` coming from the ``enrichment_key`` option).
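
A sketch of the resulting enrichment in the alert event, assuming the
hypothetical entries above::

    "alert": {
        "extra": {
            "bad_ones": {"ip": "192.0.2.1", "origin": "honeypot"}
        }
    }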

When format is ``json`` or ``jsonline``, the ``value_key`` is used to get
the value in the line (``jsonline`` format) or in the array (``json`` format).
At least one element in the data file needs to have the ``value_key`` present
for the load to succeed.

Member
    Do we need both?

Contributor Author
    Some threat intel software such as MISP produces a list of IOCs with
    context in the reply to a REST API call, so the json format can be used
    with them. But if scripting is involved to crunch the data, the jsonline
    format can be convenient to use as it requires less memory to generate.
If ``array_key`` is present, Suricata will extract the corresponding
subobject, which has to be a JSON array, and search this array for elements
to add to the set. This is only valid for the ``json`` format.

If you don't want to have the ``value_key`` in the alert, you can use the
``remove_key`` option. This will remove the key from the alert event.
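
For instance, with the hypothetical ``bad_ones`` enrichment above and
``remove_key`` set, the ``value_key`` (here ``ip``) would be stripped::

    "bad_ones": {"origin": "honeypot"}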

See :ref:`Datajson format <datajson_data>` for more information.

Rule Reloads
------------

@@ -243,6 +303,28 @@ Syntax::

dataset-dump

dataset-add-json
~~~~~~~~~~~~~~~~

Unix Socket command to add data to a set. On success, the addition becomes
active instantly.

Syntax::

dataset-add-json <set name> <set type> <data> <json_info>

set name
    Name of an already defined dataset
set type
    Data type: string, md5, sha256, ipv4, ip
data
    Data to add in serialized form (base64 for string, hex notation for
    md5/sha256, string representation for ipv4/ip)
json_info
    JSON object to attach to the value in the set

Example adding 'google.com' to set 'myset'::

dataset-add-json myset string Z29vZ2xlLmNvbQ== {"city":"Mountain View"}
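
As a sketch, the base64 value can be produced with a shell one-liner and the
command sent through ``suricatasc`` (assuming a running Suricata with the
unix socket enabled)::

    $ echo -n "google.com" | base64
    Z29vZ2xlLmNvbQ==
    $ suricatasc -c 'dataset-add-json myset string Z29vZ2xlLmNvbQ== {"city":"Mountain View"}'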


File formats
------------

@@ -285,13 +367,41 @@ which when piped to ``base64 -d`` reveals its value::
datarep
~~~~~~~

The datarep format follows the dataset format, except that there is one
extra CSV field:

Syntax::

<data>,<value>
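
For example, a datarep file for a ``string`` dataset could look like this
(values hypothetical)::

    Z29vZ2xlLmNvbQ==,100
    ZXhhbXBsZS5jb20=,250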

.. _datajson_data:

dataset with JSON enrichment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If ``format json`` is used in the parameters of a dataset keyword, then the
loaded file has to contain a valid JSON object.

If the ``value_key`` option is present, then the file has to contain a valid
JSON object containing an array whose elements have a key equal to the
``value_key`` value.

For example, if the file ``file.json`` looks like the following (typical of the return of a REST API call) ::

    {
        "time": "2024-12-21",
        "response": {
            "threats": [
                {"host": "toto.com", "origin": "japan"},
                {"host": "grenouille.com", "origin": "french"}
            ]
        }
    }

then the match checking the list of threats using datajson can be defined
as ::

    http.host; dataset:isset,threats, type string, load file.json, format json, enrichment_key threat, value_key host, array_key response.threats;
Member
    So this matches on hosts like toto.com, etc? I don't see the key
    ``threat``? Where does this format come from?

Contributor Author
    I see the confusion here, ``threat`` will be used in the output:

    {
        "alert": {
            "context": {
                "threat": {
                    "host": "toto.com",
                    "origin": "japan"
                }
            }
        }
    }

    I'm adding this example to the updated doc.


.. _datasets_file_locations:

File Locations
61 changes: 61 additions & 0 deletions doc/userguide/rules/payload-keywords.rst
@@ -774,6 +774,67 @@ qualities of pcre as well. These are:
.. note:: The following characters must be escaped inside the content:
``;`` ``\`` ``"``

PCRE extraction
~~~~~~~~~~~~~~~

It is possible to capture groups from the regular expression and log them
into the alert events.

There are 3 capabilities:

* pkt: the extracted group is logged as a pkt variable in ``metadata.pktvars``
* alert: the extracted group is logged to the ``alert.extra`` subobject
* flow: the extracted group is stored in a flow variable and ends up in ``metadata.flowvars``

To use the feature, the parameters of the pcre keyword need to be extended.
After the regular pcre regex and options comes a comma-separated list of
variable names. The prefix is ``flow:``, ``pkt:`` or ``alert:``, and the
names can now contain special characters. The names map to the capturing
substring expressions in order ::

pcre:"/([a-z]+)\/[a-z]+\/(.+)\/(.+)\/changelog$/GUR, \
flow:ua/ubuntu/repo,flow:ua/ubuntu/pkg/base, \
flow:ua/ubuntu/pkg/version";

This would result in the alert event containing something like ::

"metadata": {
"flowvars": [
{"ua/ubuntu/repo": "fr"},
{"ua/ubuntu/pkg/base": "curl"},
{"ua/ubuntu/pkg/version": "2.2.1"}
]
}

The other events on the same flow such as the ``flow`` one will
also have the flow vars.
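
For example, the end-of-flow record could then carry (a sketch, reusing the
values above)::

    {
        "event_type": "flow",
        "metadata": {
            "flowvars": [
                {"ua/ubuntu/repo": "fr"},
                {"ua/ubuntu/pkg/base": "curl"},
                {"ua/ubuntu/pkg/version": "2.2.1"}
            ]
        }
    }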

If this is not wanted, you can use the ``alert:`` construct to only
get the event in the alert ::

pcre:"/([a-z]+)\/[a-z]+\/(.+)\/(.+)\/changelog$/GUR, \
alert:ua/ubuntu/repo,alert:ua/ubuntu/pkg/base, \
alert:ua/ubuntu/pkg/version";

With that syntax, the result of the extraction will appear like ::

"alert": {
"extra": {
"ua/ubuntu/repo": "fr",
"ua/ubuntu/pkg/base": "curl",
"ua/ubuntu/pkg/version": "2.2.1"
]
}

The extraction scopes can also be combined.

It is also possible to extract a key/value pair in the ``pkt`` scope.
One capture is the key, the second the value. The notation is similar to
the previous one ::

    pcre:"/^([A-Z]+) (.*)\r\n/, pkt:key,pkt:value";

``key`` and ``value`` are simply hardcoded names that trigger the key/value
extraction. As a consequence, they can't be used as names for the variables.
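
As a sketch, matching a hypothetical payload starting with
``GET /index.html HTTP/1.1`` could yield something like ::

    "metadata": {
        "pktvars": [
            {"GET": "/index.html HTTP/1.1"}
        ]
    }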

Suricata's modifiers
~~~~~~~~~~~~~~~~~~~~

7 changes: 6 additions & 1 deletion etc/schema.json
@@ -217,6 +217,11 @@
"xff": {
"type": "string"
},
"extra": {
"type": "object",
"additionalProperties": true,
"description": "Extra data created by keywords such as datajson"
},
"metadata": {
"type": "object",
"properties": {
@@ -2886,7 +2891,7 @@
"type": "string"
}
},
"additionalProperties": false
"additionalProperties": true
}
},
"flowints": {
32 changes: 30 additions & 2 deletions rust/suricatasc/src/unix/commands.rs
@@ -71,12 +71,11 @@ impl<'a> CommandParser<'a> {
}

pub fn parse(&self, input: &str) -> Result<serde_json::Value, CommandParseError> {
let mut parts: Vec<&str> = input.split(' ').map(|s| s.trim()).collect();
if parts.is_empty() {
return Err(CommandParseError::Other("No command provided".to_string()));
}
let command = parts[0];

let spec = self
.commands
@@ -91,6 +90,13 @@

// Calculate the number of required arguments for better error reporting.
let required = spec.iter().filter(|e| e.required).count();
let optional = spec.iter().filter(|e| !e.required).count();
// Handle the case where the command has only required arguments and allow
// last one to contain spaces.
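// E.g. for `dataset-add-json myset string Z29vZ2xlLmNvbQ== {"city": "Mountain View"}`,
// splitn keeps the trailing JSON argument, which may itself contain
// spaces, intact as the final element.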
if optional == 0 {
parts = input.splitn(required + 1, ' ').collect();
}
let args = &parts[1..];

let mut json_args = HashMap::new();

@@ -386,6 +392,28 @@ fn command_defs() -> Result<HashMap<String, Vec<Argument>>, serde_json::Error> {
"type": "string",
},
],
"dataset-add-json": [
{
"name": "setname",
"required": true,
"type": "string",
},
{
"name": "settype",
"required": true,
"type": "string",
},
{
"name": "datavalue",
"required": true,
"type": "string",
},
{
"name": "datajson",
Member
    datajson

Contributor Author
    we have the option datavalue just before and this part is the json, so
    the naming is ok I think.

"required": true,
"type": "string",
},
],
"get-flow-stats-by-id": [
{
"name": "flow_id",
2 changes: 2 additions & 0 deletions src/Makefile.am
@@ -48,6 +48,7 @@ noinst_HEADERS = \
conf.h \
conf-yaml-loader.h \
counters.h \
datajson.h \
datasets.h \
datasets-ipv4.h \
datasets-ipv6.h \
@@ -627,6 +628,7 @@ libsuricata_c_a_SOURCES = \
conf.c \
conf-yaml-loader.c \
counters.c \
datajson.c \
datasets.c \
datasets-ipv4.c \
datasets-ipv6.c \