
Datajson v5.2 #12863


Draft: wants to merge 12 commits into base master

Conversation


@regit regit commented Mar 29, 2025

Update of #12707 addressing comments and adding some features.

Changes (if applicable):

Link to ticket: https://redmine.openinfosecfoundation.org/issues/7372

Describe changes:

  • Add options to dataset instead of introducing datajson keyword
  • Support for jsonline format
  • Add remove_key option

Provide values to any of the below to override the defaults.

SV_BRANCH=OISF/suricata-verify#2383

@regit regit requested review from jasonish, jufajardini, victorjulien and a team as code owners March 29, 2025 08:47
@regit regit marked this pull request as draft March 29, 2025 08:47

regit commented Mar 29, 2025

Setting to draft as I'm waiting for:

  • Github CI checks
  • Feedback and data from Stamus' team

@regit regit force-pushed the datajson-v5.2 branch 2 times, most recently from 63c6cd9 to 916efaf on March 29, 2025 09:30
@victorjulien (Member):

Why are there so many datajson references if this consolidates the logic into dataset+jsonline?


regit commented Mar 29, 2025

> Why are there so many datajson references if this consolidates the logic into dataset+jsonline?

There is still a datajson.c file that handles the reading and parsing functions, so the datasets.c file does not trigger an OOM when you open it. For the user, only the dataset keyword is used.

I chose to prefix the commits with datajson because I was a bit nostalgic and because it was easier to track this way. I can reword to datasets/json if that is more suitable.

Here is the diff stat of the PR.

 doc/userguide/rules/datasets.rst         |  118 ++++++++++++++++-
 doc/userguide/rules/payload-keywords.rst |   61 +++++++++
 etc/schema.json                          |    7 +-
 rust/suricatasc/src/unix/commands.rs     |   32 ++++-
 src/Makefile.am                          |    2 +
 src/datajson.c                           | 1009 ++++++++++++++++++++++++++++++++++++++++
 src/datajson.h                           |   52 ++++++++
 src/datasets-ipv4.c                      |   25 ++++
 src/datasets-ipv4.h                      |    9 +-
 src/datasets-ipv6.c                      |   26 ++++
 src/datasets-ipv6.h                      |    9 +-
 src/datasets-md5.c                       |   25 ++++
 src/datasets-md5.h                       |    9 +-
 src/datasets-sha256.c                    |   25 ++++
 src/datasets-sha256.h                    |    9 +-
 src/datasets-string.c                    |   33 +++++
 src/datasets-string.h                    |    9 +-
 src/datasets.c                           |   67 ++++------
 src/datasets.h                           |   15 +++
 src/decode.c                             |   29 +++++
 src/decode.h                             |    7 ++
 src/detect-dataset.c                     |  143 ++++++++++++++++++++-
 src/detect-dataset.h                     |    5 +
 src/detect-engine-alert.c                |   23 ++++
 src/detect-engine-content-inspection.c   |    1 -
 src/detect-engine-register.h             |    1 +
 src/detect-pcre.c                        |   85 +++++++++++--
 src/detect.c                             |    1 +
 src/detect.h                             |   13 ++
 src/output-json-alert.c                  |    9 ++
 src/packet.c                             |    1 +
 src/runmode-unix-socket.c                |   69 ++++++++++
 src/runmode-unix-socket.h                |    1 +
 src/unix-manager.c                       |    2 +
 src/util-byte.c                          |   30 +++++
 src/util-byte.h                          |    2 +
 src/util-ip.h                            |    3 +
 src/util-var.h                           |    2 +
 38 files changed, 1902 insertions(+), 67 deletions(-)

regit added 8 commits March 29, 2025 17:24
This patch introduces a new option to the dataset keyword.
Where a regular dataset only allows matching against sets, a
dataset with json format allows the same but also adds JSON data
to the alert event. This data comes from the set definition itself.
For example, an ipv4 set will look like:

  [{"ip": "10.16.1.11", "test": "success","context":3}]

The syntax is a JSON array, but it can also be a JSON object
with an array inside. The idea is to directly use data coming
from the API of a threat intel management software.

The syntax of the keyword is the following:

  dataset:isset,src_ip,type ip,load src.lst,format json, \
       enrichment_key src_ip, value_key ip;

Compared to dataset, it just has supplementary options that
indicate in which subobject the JSON value should be added.

The information is added to the event under the alert.extra
subobject:

  "alert": {
    "extra": {
      "src_ip": {
        "ip": "10.6.1.11",
        "test": "success",
        "context": 3
      },

The main interest of the feature is the ability to contextualize
a match. For example, if you have an IOC source, you can do

 [
   {"buffer": "value1", "actor":"APT28","Country":"FR"},
   {"buffer": "value2", "actor":"APT32","Country":"NL"}
 ]

This way, a single dataset is able to add context to the
event, where previously this was not possible and multiple
signatures had to be used.

The format introduced in datajson is an evolution of the
historical datarep format, which has some limitations. For
example, if a user fetches IOCs from a threat intel server,
there is a large chance that the format will be JSON or XML.
Suricata has no support for the latter but can support the
former.

Keeping the key value may seem redundant, but it is useful to
have it directly accessible in the extra data so it can be
queried independently of the signature (where it can be
multiple metadata or even a transformed metadata).
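The mapping described above (value_key selects the field to match on, and the full source object is attached under alert.extra) can be sketched in Python. This is an illustrative model, not Suricata's C implementation; the function names are hypothetical:

```python
import json

def load_json_set(text, value_key):
    # Parse a "format json" set: a JSON array of objects, where
    # value_key names the field holding the value to match on.
    entries = json.loads(text)
    return {e[value_key]: e for e in entries}

def enrich_alert(alert, dataset, matched_value, enrichment_key):
    # On a match, attach the full source object under
    # alert["extra"][enrichment_key].
    obj = dataset.get(matched_value)
    if obj is not None:
        alert.setdefault("extra", {})[enrichment_key] = obj
    return alert

data = '[{"ip": "10.16.1.11", "test": "success", "context": 3}]'
ds = load_json_set(data, "ip")
alert = enrich_alert({}, ds, "10.16.1.11", "src_ip")
```

Note that the matched value stays inside the attached object here, mirroring the default behaviour before the `remove_key` option.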

In some cases, when interacting with data (mostly coming from
threat intel servers), the JSON array containing the data
to use is not at the root of the object, and it is necessary
to access a subobject.

This patch implements this with support for keys of the form
level1.level2. This is done via the `array_key` option that
contains the path to the data.

Ticket: OISF#7372
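A minimal sketch of the dotted-path lookup that `array_key` describes, in illustrative Python (the helper name is hypothetical):

```python
import json

def resolve_array_key(doc, array_key):
    # Walk a dotted path such as "response.data" down to the
    # array holding the set entries.
    node = doc
    for part in array_key.split("."):
        node = node[part]
    return node

doc = json.loads('{"response": {"data": [{"ip": "10.16.1.11"}]}}')
rows = resolve_array_key(doc, "response.data")
```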
With the datajson infrastructure in place, it is now possible to
add data to the extra information section. Following an idea
by Jason Ish, this patch adds the feature for pcre extraction.

A PCRE such as pcre:"/(?P<alert_ua>[a-zA-Z]+)\//" will add the
content of the captured group to alert.extra.ua.
It can contain any vars, so it needs additional properties.
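The capture-to-extra behaviour can be sketched as follows. The `alert_` prefix convention matches the example above; the helper name and prefix handling are illustrative assumptions, not the actual Suricata code:

```python
import re

ALERT_PREFIX = "alert_"  # capture names with this prefix go to alert.extra

def extract_pcre_extra(pattern, payload):
    # Named capture groups whose name starts with ALERT_PREFIX are
    # copied into the extra dict, with the prefix stripped, so that
    # (?P<alert_ua>...) lands under "ua".
    extra = {}
    m = re.search(pattern, payload)
    if m:
        for name, value in m.groupdict().items():
            if name.startswith(ALERT_PREFIX) and value is not None:
                extra[name[len(ALERT_PREFIX):]] = value
    return extra

extra = extract_pcre_extra(r"(?P<alert_ua>[a-zA-Z]+)/", "Mozilla/5.0")
```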
There is just a change in the iterator to go from json to jsonline
so let's factorize the parsing functions.

Ticket: OISF#7372
This format allows using one valid JSON object per line in the
data file.

Ticket: OISF#7372
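The jsonline format is the same per-entry parsing applied line by line instead of to one array. A rough Python equivalent (names illustrative, not the C implementation):

```python
import json

def load_jsonline_set(text, value_key):
    # jsonline: one valid JSON object per line; blank lines skipped.
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)
        out[obj[value_key]] = obj
    return out

data = '{"ip": "10.16.1.11", "context": 3}\n{"ip": "10.16.1.12", "context": 5}\n'
ds = load_jsonline_set(data, "ip")
```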

codecov bot commented Mar 29, 2025

Codecov Report

Attention: Patch coverage is 60.74672% with 389 lines in your changes missing coverage. Please review.

Project coverage is 81.04%. Comparing base (834378f) to head (37f45a4).

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #12863      +/-   ##
==========================================
- Coverage   81.11%   81.04%   -0.07%     
==========================================
  Files         940      943       +3     
  Lines      260665   261620     +955     
==========================================
+ Hits       211431   212023     +592     
- Misses      49234    49597     +363     
Flag             Coverage         Δ
fuzzcorpus       56.94% <6.35%>   (-0.32%) ⬇️
livemode         19.29% <4.03%>   (-0.10%) ⬇️
pcap             43.77% <2.01%>   (-0.30%) ⬇️
suricata-verify  63.60% <60.14%>  (-0.03%) ⬇️
unittests        58.41% <2.52%>   (-0.21%) ⬇️



regit added 2 commits March 29, 2025 18:04
This option allows removing the key corresponding to the match
value from the JSON object before creating the JSON object that
will be added to the `extra` data.

For example, matching on the following JSON on the `ip` key:

```json
{"ip": "10.16.1.11", "test": "success", "context":3}
```

with a match like:

```
dataset:isset,src_ip,type ip,load src.lst,format jsonline,enrichment_key src_ip,value_key ip;
```

will produce the following:

```json
"extra": {
  "src_ip": {
    "ip": "10.16.1.11",
    "test": "success",
    "context": 3
  }
}
```

If we add the `remove_key` option to the match:

```
dataset:isset,src_ip,type ip,load src.lst,format jsonline,enrichment_key src_ip,value_key ip, remove_key;
```

it will produce the following:

```json
"extra": {
  "src_ip": {
    "test": "success",
    "context": 3
  }
}
```

The option is set to false by default.

Ticket: OISF#7372
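The `remove_key` behaviour described above amounts to dropping the matched field from the entry before attaching it to `extra`. A small sketch (helper name hypothetical):

```python
def enrichment_object(entry, value_key, remove_key=False):
    # Copy the set entry; with remove_key set, drop the field that
    # was matched on (value_key) before attaching it to alert.extra.
    obj = dict(entry)
    if remove_key:
        obj.pop(value_key, None)
    return obj

entry = {"ip": "10.16.1.11", "test": "success", "context": 3}
kept = enrichment_object(entry, "ip")
trimmed = enrichment_object(entry, "ip", remove_key=True)
```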
Patch adds ``remove_key`` option and clarifies the text.
@catenacyber catenacyber mentioned this pull request Mar 30, 2025
@suricata-qa

Information:

ERROR: QA failed on SURI_TLPR1_suri_time.

field                  baseline  test  %
SURI_TLPR1_stats_chk
  .uptime              619       655   105.82%

Pipeline 25429


@victorjulien victorjulien left a comment


While I like the feature, I don't think the current state of the implementation is close to what I would merge. I think there is still far too much duplication, and some of the implementation is too intrusive in critical detect/alert paths.

"type": "string",
},
{
"name": "datajson",
Member:

datajson

Contributor Author:

We have the option datavalue just before, and this part is the JSON, so I think the naming is OK.

json_error_t jerror;
json_t *msg = json_loads(in, 0, &jerror);
if (msg == NULL) {
/* JANSSON does not see an integer, float or a string as valid JSON.
Member:

should we validate the data before sending it to jansson then?

Contributor Author:

My idea was to leave the JSON interpretation to Jansson, which specializes in that, and just have some code for the corner cases.

* \retval 0 data was not added to the hash as it is already there
* \retval -1 failed to add data to the hash
*/
static int DatajsonAddString(
Member:

Looking at this it should just be in datasets.c as DatasetAddStringwJSON, similar to DatasetAddStringwRep I think

Contributor Author:

The datasets.c file is really too long, with more than 1500 lines, and with a lot of functions having similar names it is complicated to navigate. Having a separate file is convenient for these reasons.

SCLogDebug("r found: %d, len: %zu", r.found, r.json.len);
if (!r.found)
return 0;
if (r.json.len > 0) {
Member:

this looks like a thread safety violation. DatajsonLookup unlocks the dataset hash entry, but the code here still references data in the hash

@@ -641,7 +641,6 @@ static int DetectEngineContentInspectionInternal(DetectEngineThreadCtx *det_ctx,
goto match;
}
goto no_match_discontinue;

Member:

nit

@@ -145,9 +145,38 @@ PacketAlert *PacketAlertCreate(void)
return pa_array;
}

void PacketAlertRecycle(PacketAlert *pa_array)
Member:

I think this is pretty intrusive for a niche feature... thinking that there should be a simpler way to handle it. For example a thread local storage that is shared between detect and output (these are always in the same thread). Then we could have some preallocated buffers there that we can just reuse.

Contributor Author:

Do you think we can drop the capability to separate detection and output tasks? That capability looks interesting to me, even if I have nothing concrete to prove it.
If we drop it, then I agree we can do something like you propose, and potentially more, as alerts can be removed from the packet structure.
