
Datajson v5.2 #12863


Draft: wants to merge 12 commits into base master

Conversation


@regit regit commented Mar 29, 2025

Update of #12707 addressing comments and adding some features.

Changes (if applicable):

Link to ticket: https://redmine.openinfosecfoundation.org/issues/7372

Describe changes:

  • Add options to dataset instead of introducing datajson keyword
  • Support for jsonline format
  • Add remove_key option

Provide values to any of the below to override the defaults.

SV_BRANCH=OISF/suricata-verify#2383

@regit regit requested review from jasonish, jufajardini, victorjulien and a team as code owners March 29, 2025 08:47
@regit regit marked this pull request as draft March 29, 2025 08:47

regit commented Mar 29, 2025

Setting to draft as I'm waiting for:

  • Github CI checks
  • Feedback and data from Stamus' team

@regit regit force-pushed the datajson-v5.2 branch 2 times, most recently from 63c6cd9 to 916efaf on March 29, 2025 09:30
@victorjulien (Member):

Why are there so many datajson references if this consolidates the logic into dataset+jsonline?


regit commented Mar 29, 2025

> Why are there so many datajson references if this consolidates the logic into dataset+jsonline?

There is still a datajson.c file that handles the reading and parsing functions, so the datasets.c file does not trigger an OOM when you open it. For the user, only the dataset keyword is used.

I chose to prefix the commits with datajson because I was a bit nostalgic and because it was easier to track this way. I can reword to datasets/json if that is more suitable.

Here is the diff stat of the PR.

 doc/userguide/rules/datasets.rst         |  118 ++++++++++++++++-
 doc/userguide/rules/payload-keywords.rst |   61 +++++++++
 etc/schema.json                          |    7 +-
 rust/suricatasc/src/unix/commands.rs     |   32 ++++-
 src/Makefile.am                          |    2 +
 src/datajson.c                           | 1009 ++++++++++++++++++++++++++++++++++++++++
 src/datajson.h                           |   52 ++++++++
 src/datasets-ipv4.c                      |   25 ++++
 src/datasets-ipv4.h                      |    9 +-
 src/datasets-ipv6.c                      |   26 ++++
 src/datasets-ipv6.h                      |    9 +-
 src/datasets-md5.c                       |   25 ++++
 src/datasets-md5.h                       |    9 +-
 src/datasets-sha256.c                    |   25 ++++
 src/datasets-sha256.h                    |    9 +-
 src/datasets-string.c                    |   33 +++++
 src/datasets-string.h                    |    9 +-
 src/datasets.c                           |   67 ++++------
 src/datasets.h                           |   15 +++
 src/decode.c                             |   29 +++++
 src/decode.h                             |    7 ++
 src/detect-dataset.c                     |  143 ++++++++++++++++++++-
 src/detect-dataset.h                     |    5 +
 src/detect-engine-alert.c                |   23 ++++
 src/detect-engine-content-inspection.c   |    1 -
 src/detect-engine-register.h             |    1 +
 src/detect-pcre.c                        |   85 +++++++++++--
 src/detect.c                             |    1 +
 src/detect.h                             |   13 ++
 src/output-json-alert.c                  |    9 ++
 src/packet.c                             |    1 +
 src/runmode-unix-socket.c                |   69 ++++++++++
 src/runmode-unix-socket.h                |    1 +
 src/unix-manager.c                       |    2 +
 src/util-byte.c                          |   30 +++++
 src/util-byte.h                          |    2 +
 src/util-ip.h                            |    3 +
 src/util-var.h                           |    2 +
 38 files changed, 1902 insertions(+), 67 deletions(-)

regit added 8 commits March 29, 2025 17:24
This patch introduces a new option to the dataset keyword.
Where a regular dataset only allows matching against sets, a
dataset with json format allows the same but also adds JSON data
to the alert event. This data comes from the set definition itself.
For example, an ipv4 set will look like:

  [{"ip": "10.16.1.11", "test": "success","context":3}]

The syntax is a JSON array, but it can also be a JSON object
with an array inside. The idea is to directly use data coming
from the API of a threat intel management software.

The syntax of the keyword is the following:

  dataset:isset,src_ip,type ip,load src.lst,format json, \
       enrichment_key src_ip, value_key ip;

Compared to dataset, it just has supplementary options that
indicate in which subobject the JSON value should be added.

The information is added to the event under the alert.extra
subobject:

  "alert": {
    "extra": {
      "src_ip": {
        "ip": "10.6.1.11",
        "test": "success",
        "context": 3
      },

The main interest of the feature is the ability to contextualize
a match. For example, if you have an IOC source, you can do

 [
   {"buffer": "value1", "actor":"APT28","Country":"FR"},
   {"buffer": "value2", "actor":"APT32","Country":"NL"}
 ]

This way, a single dataset is able to add context to the
event, where previously this was not possible and multiple
signatures had to be used.

The format introduced in datajson is an evolution of the
historical datarep format, which has some limitations. For
example, if a user fetches IOCs from a threat intel server,
there is a large chance that the format will be JSON or XML.
Suricata has no support for the latter but can support the
former.

Keeping the key value may seem redundant, but it is useful to
have it directly accessible in the extra data so it can be
queried independently of the signature (where it can be
multiple metadata or even a transformed metadata).
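The mapping described above (value_key selects the field to match on, and the full source object is attached under alert.extra) can be sketched in Python. This is an illustrative model, not Suricata's C implementation; the function names are hypothetical:

```python
import json

def load_json_set(text, value_key):
    # Parse a "format json" set: a JSON array of objects, where
    # value_key names the field holding the value to match on.
    entries = json.loads(text)
    return {e[value_key]: e for e in entries}

def enrich_alert(alert, dataset, matched_value, enrichment_key):
    # On a match, attach the full source object under
    # alert["extra"][enrichment_key].
    obj = dataset.get(matched_value)
    if obj is not None:
        alert.setdefault("extra", {})[enrichment_key] = obj
    return alert

data = '[{"ip": "10.16.1.11", "test": "success", "context": 3}]'
ds = load_json_set(data, "ip")
alert = enrich_alert({}, ds, "10.16.1.11", "src_ip")
```

Note that the matched value stays inside the attached object here, mirroring the default behaviour before the `remove_key` option.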

In some cases, when interacting with data (mostly coming from
threat intel servers), the JSON array containing the data
to use is not at the root of the object, and it is necessary
to access a subobject.

This patch implements this with support for keys of the form
level1.level2. This is done via the `array_key` option that
contains the path to the data.

Ticket: OISF#7372
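A minimal sketch of the dotted-path lookup that `array_key` describes, in illustrative Python (the helper name is hypothetical):

```python
import json

def resolve_array_key(doc, array_key):
    # Walk a dotted path such as "response.data" down to the
    # array holding the set entries.
    node = doc
    for part in array_key.split("."):
        node = node[part]
    return node

doc = json.loads('{"response": {"data": [{"ip": "10.16.1.11"}]}}')
rows = resolve_array_key(doc, "response.data")
```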
With the datajson infrastructure in place, it is now possible to
add data to the extra information section. Following an idea
by Jason Ish, this patch adds the feature for pcre extraction.

A PCRE such as pcre:"/(?P<alert_ua>[a-zA-Z]+)\//" will add the
content of the captured group to alert.extra.ua.
It can contain any vars, so it needs additional properties.
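The capture-to-extra behaviour can be sketched as follows. The `alert_` prefix convention matches the example above; the helper name and prefix handling are illustrative assumptions, not the actual Suricata code:

```python
import re

ALERT_PREFIX = "alert_"  # capture names with this prefix go to alert.extra

def extract_pcre_extra(pattern, payload):
    # Named capture groups whose name starts with ALERT_PREFIX are
    # copied into the extra dict, with the prefix stripped, so that
    # (?P<alert_ua>...) lands under "ua".
    extra = {}
    m = re.search(pattern, payload)
    if m:
        for name, value in m.groupdict().items():
            if name.startswith(ALERT_PREFIX) and value is not None:
                extra[name[len(ALERT_PREFIX):]] = value
    return extra

extra = extract_pcre_extra(r"(?P<alert_ua>[a-zA-Z]+)/", "Mozilla/5.0")
```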
There is just a change in the iterator to go from json to jsonline
so let's factorize the parsing functions.

Ticket: OISF#7372
This format allows using one valid JSON object per line in the
data file.

Ticket: OISF#7372
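The jsonline format is the same per-entry parsing applied line by line instead of to one array. A rough Python equivalent (names illustrative, not the C implementation):

```python
import json

def load_jsonline_set(text, value_key):
    # jsonline: one valid JSON object per line; blank lines skipped.
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)
        out[obj[value_key]] = obj
    return out

data = '{"ip": "10.16.1.11", "context": 3}\n{"ip": "10.16.1.12", "context": 5}\n'
ds = load_jsonline_set(data, "ip")
```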

codecov bot commented Mar 29, 2025

Codecov Report

Attention: Patch coverage is 60.74672% with 389 lines in your changes missing coverage. Please review.

Project coverage is 81.04%. Comparing base (834378f) to head (37f45a4).

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #12863      +/-   ##
==========================================
- Coverage   81.11%   81.04%   -0.07%     
==========================================
  Files         940      943       +3     
  Lines      260665   261620     +955     
==========================================
+ Hits       211431   212023     +592     
- Misses      49234    49597     +363     
Flag             Coverage         Δ
fuzzcorpus       56.94% <6.35%>   (-0.32%) ⬇️
livemode         19.29% <4.03%>   (-0.10%) ⬇️
pcap             43.77% <2.01%>   (-0.30%) ⬇️
suricata-verify  63.60% <60.14%>  (-0.03%) ⬇️
unittests        58.41% <2.52%>   (-0.21%) ⬇️



regit added 2 commits March 29, 2025 18:04
This option allows removing the key corresponding to the match
value from the JSON object before creating the JSON object that
will be added to the `extra` data.

For example, matching on the following JSON on the `ip` key:

```json
{"ip": "10.16.1.11", "test": "success", "context":3}
```

with a match like:

```
dataset:isset,src_ip,type ip,load src.lst,format jsonline,enrichment_key src_ip,value_key ip;
```

will produce the following:

```json
"extra": {
  "src_ip": {
    "ip": "10.16.1.11",
    "test": "success",
    "context": 3
  }
}
```

If we add the `remove_key` option to the match:

```
dataset:isset,src_ip,type ip,load src.lst,format jsonline,enrichment_key src_ip,value_key ip, remove_key;
```

it will produce the following:

```json
"extra": {
  "src_ip": {
    "test": "success",
    "context": 3
  }
}
```

The option is set to false by default.

Ticket: OISF#7372
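The `remove_key` behaviour described above amounts to dropping the matched field from the entry before attaching it to `extra`. A small sketch (helper name hypothetical):

```python
def enrichment_object(entry, value_key, remove_key=False):
    # Copy the set entry; with remove_key set, drop the field that
    # was matched on (value_key) before attaching it to alert.extra.
    obj = dict(entry)
    if remove_key:
        obj.pop(value_key, None)
    return obj

entry = {"ip": "10.16.1.11", "test": "success", "context": 3}
kept = enrichment_object(entry, "ip")
trimmed = enrichment_object(entry, "ip", remove_key=True)
```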
Patch adds ``remove_key`` option and clarifies the text.
@catenacyber catenacyber mentioned this pull request Mar 30, 2025
@suricata-qa

Information:

ERROR: QA failed on SURI_TLPR1_suri_time.

field                  baseline  test  %
SURI_TLPR1_stats_chk
  .uptime              619       655   105.82%

Pipeline 25429


@victorjulien victorjulien left a comment


While I like the feature, I don't think the current state of the implementation is close to what I would merge. I think there is still far too much duplication, and some of the implementation is too intrusive in critical detect/alert paths.

"type": "string",
},
{
"name": "datajson",
Member:

datajson

Contributor Author:

We have the option datavalue just before, and this part is the JSON, so I think the naming is OK.

json_error_t jerror;
json_t *msg = json_loads(in, 0, &jerror);
if (msg == NULL) {
/* JANSSON does not see an integer, float or a string as valid JSON.
Member:

should we validate the data before sending it to jansson then?

Contributor Author:

My idea was to leave the JSON interpretation to Jansson, which specializes in that, and just have some code for the corner cases.

* \retval 0 data was not added to the hash as it is already there
* \retval -1 failed to add data to the hash
*/
static int DatajsonAddString(
Member:

Looking at this it should just be in datasets.c as DatasetAddStringwJSON, similar to DatasetAddStringwRep I think

Contributor Author:

The datasets.c file is really too long, with more than 1500 lines, and with a lot of functions having similar names it is complicated to navigate. Having a separate file is convenient for these reasons.

SCLogDebug("r found: %d, len: %zu", r.found, r.json.len);
if (!r.found)
return 0;
if (r.json.len > 0) {
Member:

this looks like a thread safety violation. DatajsonLookup unlocks the dataset hash entry, but the code here still references data in the hash

@@ -641,7 +641,6 @@ static int DetectEngineContentInspectionInternal(DetectEngineThreadCtx *det_ctx,
goto match;
}
goto no_match_discontinue;

Member:

nit

@@ -145,9 +145,38 @@ PacketAlert *PacketAlertCreate(void)
return pa_array;
}

void PacketAlertRecycle(PacketAlert *pa_array)
Member:

I think this is pretty intrusive for a niche feature... thinking that there should be a simpler way to handle it. For example a thread local storage that is shared between detect and output (these are always in the same thread). Then we could have some preallocated buffers there that we can just reuse.

Contributor Author:

Do you think we can drop the capability to separate detection and output tasks? That capability looks interesting to me, even if I have nothing concrete to prove it.
If we drop it, then I agree we can do something like you propose, and potentially more, as alerts can be removed from the packet structure.
