Skip to content

Datajson v6.5 #13343

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 33 commits into
base: master
Choose a base branch
from
Open

Datajson v6.5 #13343

wants to merge 33 commits into from

Conversation

regit
Copy link
Contributor

@regit regit commented Jun 1, 2025

Update of #13194.

Scan-build report an issue that seems like a false positive to me. Second pair of eyes welcome here.

Contribution style:

Our Contribution agreements:

Changes (if applicable):

Link to ticket: https://redmine.openinfosecfoundation.org/issues/7372

Describe changes:

  • Address comments from latest review
  • Some refactoring to reduce code size and complexity
  • Upload scan-build artifacts for easy retrieval of analysis

SV_BRANCH=OISF/suricata-verify#2535

regit added 30 commits June 1, 2025 10:12
This patch introduces new option to dataset keyword.
Where regular dataset allows match from sets, dataset with json
format allows the same but also adds JSON data to the alert
event. This data is coming from the set definition it self.
For example, an ipv4 set will look like:

  [{"ip": "10.16.1.11", "test": "success","context":3}]

The syntax is a JSON array but it can also be a JSON object
with an array inside. The idea is to directly used data coming
from the API of a threat intel management software.

The syntax of the keyword is the following:

  dataset:isset,src_ip,type ip,load src.lst,format json, \
       enrichment_key src_ip, value_key ip;

Compare to dataset, it just have a supplementary option key
that is used to indicate in which subobject the JSON value
should be added.

The information is added in the even under the alert.extra
subobject:

  "alert": {
    "extra": {
      "src_ip": {
        "ip": "10.6.1.11",
        "test": "success",
        "context": 3
      },

The main interest of the feature is to be able to contextualize
a match. For example, if you have an IOC source, you can do

 [
   {"buffer": "value1", "actor":"APT28","Country":"FR"},
   {"buffer": "value2", "actor":"APT32","Country":"NL"}
 ]

This way, a single dataset is able to produce context to the
event where it was not possible before and multiple signatures
had to be used.

The format introduced in datajson is an evolution of the
historical datarep format. This has some limitations. For example,
if a user fetch IOCs from a threat intel server there is a large
change that the format will be JSON or XML. Suricata has no support
for the second but can support the first one.

Keeping the key value may seem redundant but it is useful to have it
directly accessible in the extra data to be able to query it
independantly of the signature (where it can be multiple metadata
or even be a transformed metadata).

In some case, when interacting with data (mostly coming from
threat intel servers), the JSON array containing the data
to use is not at the root of the object and it is ncessary
to access a subobject.

This patch implements this with support of key in level1.level2.
This is done via the `array_key` option that contains the path
to the data.

Ticket: OISF#7372
With datajson infrastructure in place, it is now possible to
add data in the extra information section. Following an idea
by Jason Ish, this patch adds the feature for pcre extraction.

A PCRE such as pcre:"/(?P<alert_ua>[a-zA-Z]+)\//" will add the
content of the captured group to alert.extra.ua.
It can contain any vars so need addition properties.
There is just a change in the iterator to go from json to jsonline
so let's factorize the parsing functions.

Ticket: OISF#7372
This format allows to use a one valid JSON object per line in the
data file.

Ticket: OISF#7372
This option allows to remove the key corresponding to the match
value from the JSON object before creating the JSON object that
will be added to the `extra` data.

For example, matching on the following JSON on the `ip` key:

```json
{"ip": "10.16.1.11", "test": "success", "context":3}
```

with a match like:

```
dataset:isset,src_ip,type ip,load src.lst,format jsonline,enrichment_key src_ip,value_key ip;
```

will produce the following:

```json
"extra": {
  "src_ip": {
    "ip": "10.16.1.11",
    "test": "success",
    "context": 3
  }
```

if we add the `remove_key` option to the match:

```
dataset:isset,src_ip,type ip,load src.lst,format jsonline,enrichment_key src_ip,value_key ip, remove_key;
```

it will produce the following:

```json
"extra": {
  "src_ip": {
    "test": "success",
    "context": 3
  }
```

The option is set to false by default.

Ticket: OISF#7372
Patch adds ``remove_key`` option and clarifies the text.
Using `alert.extra` was not really reflecting the nature of what
was added. So renaming it to `alert.context`.
They are is renamed to datasets-context-json.* so we see that
it is about context and not about a new datasets type.
This will be used to factorize the code with datajson.
Factorize DatasetGet and DatajsonGet to only have the difference
between the two in the respective function.
regit added 2 commits June 1, 2025 10:12
DatajsonAdd function is now responsible of the handling of the mem
of datajsontype passed as argument.
@regit regit requested review from jasonish, jufajardini, victorjulien and a team as code owners June 1, 2025 19:00
@victorjulien
Copy link
Member

@regit Can you split out independent commits into a separate PR to make review easier?

@victorjulien
Copy link
Member

scan-build

   CC       datasets-context-json.o
datasets-context-json.c:928:13: warning: Potential leak of memory pointed to by 'jvalue.value' [unix.Malloc]
  928 |             return DatajsonAdd(set, (uint8_t *)&in.s_addr, SC_IPV4_LEN, &jvalue);
      |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 warning generated.

@jasonish jasonish self-assigned this Jun 3, 2025

Example adding 'google.com' to set 'myset'::

dataset-add-json myset string Z29vZ2xlLmNvbQ== {"city":"Mountain View"}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On exit after trying this, I get a leak:

==165801==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 27 byte(s) in 1 object(s) allocated from:
    #0 0x7fcec1120e15 in malloc /usr/src/debug/gcc/gcc/libsanitizer/asan/asan_malloc_linux.cpp:67
    #1 0x7fcec107cc6a in strndup /usr/src/debug/gcc/gcc/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:450
    #2 0x558814f38868 in SCStrndupFunc /home/jason/oisf/dev/suricata/review/src/util-mem.c:90
    #3 0x558815851c67 in ParseJsonLine /home/jason/oisf/dev/suricata/review/src/datasets-context-json.c:85
    #4 0x558815859561 in DatajsonAddSerialized /home/jason/oisf/dev/suricata/review/src/datasets-context-json.c:892
    #5 0x558815602fd7 in UnixSocketDatajsonAdd /home/jason/oisf/dev/suricata/review/src/runmode-unix-socket.c:865
    #6 0x558814ef35d4 in UnixCommandExecute /home/jason/oisf/dev/suricata/review/src/unix-manager.c:499
    #7 0x558814ef4380 in UnixCommandRun /home/jason/oisf/dev/suricata/review/src/unix-manager.c:615
    #8 0x558814ef4ab9 in UnixMain /home/jason/oisf/dev/suricata/review/src/unix-manager.c:666
    #9 0x558814ef836a in UnixManager /home/jason/oisf/dev/suricata/review/src/unix-manager.c:1163
    #10 0x558814ee093f in TmThreadsManagement /home/jason/oisf/dev/suricata/review/src/tm-threads.c:571
    #11 0x7fcec105e29a in asan_thread_start /usr/src/debug/gcc/gcc/libsanitizer/asan/asan_interceptors.cpp:239
    #12 0x7fcec0be77ea  (/usr/lib/libc.so.6+0x957ea) (BuildId: 468e3585c794491a48ea75fceb9e4d6b1464fc35)

SUMMARY: AddressSanitizer: 27 byte(s) leaked in 1 allocation(s).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this, can you provide an example of how that dataset was setup? It appears to be accepted, but I can't get the JSON into my alert.

@jasonish
Copy link
Member

jasonish commented Jun 3, 2025

On rule loading, if I forget the enrichment_key, I get an error:

Error: detect-dataset: json format needs an 'enrichment_key' parameter [DetectDatasetSetup:detect-dataset.c:506]
Error: detect: error parsing signature "alert http any any -> any any (http.host; dataset: isset, http-host.json, type string, format ndjson, load http-host.json, enchrichment_key asdf; sid:1;)" from file ./test.rules at line 2 [DetectLoadSigFile:detect-engine-loader.c:197]

But if I forget the value_key, I don't get an error, but probably should:

Warning: datasets-context-json: No valid entries for key '' found in the file './http-host.json' [DatajsonLoadTypeFromJsonline:datasets-context-json.c:359]
Warning: suricata: "security.limit-noproc" (setrlimit()) not set when using address sanitizer [SuricataPostInit:suricata.c:3050]

I believe the value_key is more critical here.


Example rules could look like::

alert http any any -> any any (msg:"IP match"; ip.dst; dataset:isset,bad_ips, type ip, load bad_ips.json, format json, enrichment_key bad_ones, value_key ip; sid:8000001;)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could value_key simply be key? For the JSON format, we'll always need a 'primary' key of sorts, so just simplify it to key.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could enrichment_key be context_key, since it lands in the context object of the alert?

@@ -85,6 +85,7 @@ enum DetectKeywordId {
DETECT_BYTE_EXTRACT,
DETECT_DATASET,
DETECT_DATAREP,
DETECT_DATAJSON,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer used?

@jasonish
Copy link
Member

jasonish commented Jun 3, 2025

This is nice.. I converted https://threatview.io/Downloads/High-Confidence-CobaltStrike-C2%20-Feeds.txt to a NDJSON dataset, and can now get extra context like:

  "alert": {
    "action": "allowed",
    "gid": 1,
    "signature_id": 2,
    "rev": 0,
    "signature": "Threatview C2",
    "category": "",
    "severity": 3,
    "context": {
      "threatview": {
        "ip": "107.148.1.188",
        "date": "16 June 2024 02:37 PM UTC",
        "host": "support.whatsappsignup.com",
        "protocol": "https",
        "beacon_config": "support.whatsappsignup.com,/jquery-4.8.1.min.js",
        "comment": "Generated by Threatview[.]io"
      }
    }
  },

Instead of just an IP address and the message ET Threatview.io High Confidence Cobalt Strike C2 IP group 1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants