[9.0] [Automatic Import] Fix unstructured syslog flow (#213042) #213118

Merged on Mar 4, 2025 (1 commit)
Original file line number Diff line number Diff line change
@@ -14,12 +14,14 @@ export const unstructuredLogState = {
jsonSamples: ['{"message":"dummy data"}'],
finalized: false,
ecsVersion: 'testVersion',
errors: { test: 'testerror' },
errors: [{ test: 'testerror' }],
additionalProcessors: [],
isFirst: false,
unParsedSamples: ['dummy data'],
currentPattern: '%{GREEDYDATA:message}',
};

export const unstructuredLogResponse = {
grok_patterns: [
grok_pattern:
'####<%{MONTH} %{MONTHDAY}, %{YEAR} %{TIME} (?:AM|PM) %{WORD:timezone}> <%{WORD:log_level}> <%{WORD:component}> <%{DATA:hostname}> <%{DATA:server_name}> <%{DATA:thread_info}> <%{DATA:user}> <%{DATA:empty_field}> <%{DATA:empty_field2}> <%{NUMBER:timestamp}> <%{DATA:message_id}> <%{GREEDYDATA:message}>',
],
};
@@ -12,3 +12,35 @@ export const EX_ANSWER_LOG_TYPE: SamplesFormat = {
header: false,
columns: ['ip', 'timestamp', 'request', 'status', '', 'bytes'],
};
export const LOG_FORMAT_EXAMPLE_LOGS = [
{
example:
'[18/Feb/2025:22:39:16 +0000] CONNECT conn=20597223 from=10.1.1.1:1234 to=10.2.3.4:4389 protocol=LDAP',
format: 'Structured',
},
{
example:
'2021-10-22 22:12:09,871 DEBUG [org.keycloak.events] (default task-3) operationType=CREATE, realmId=test, clientId=abcdefgh userId=sdfsf-b89c-4fca-9088-sdfsfsf, ipAddress=10.1.1.1, resourceType=USER, resourcePath=users/07972d16-b173-4c99-803d-90f211080f40',
format: 'Structured',
},
{
example:
'<166>Aug 21 22:08:13 myfirewall.my-domain.tld (squid-1)[6802]: 1598040493.253 325 175.16.199.1 TCP_MISS/304 2912 GET https://github.com/3ilson/pfelk/file-list/master - HIER_DIRECT/81.2.69.145 -',
format: 'Unstructured',
},
{
example:
'<30>1 2021-07-03T23:01:56.547105-05:00 pfSense.example.com charon 18610 - - 08[CFG] ppk_id = (null)',
format: 'Unstructured',
},
{
example:
'2016/10/25 14:49:34 [error] 54053#0: *1 open() "/usr/local/Cellar/nginx/1.10.2_1/html/favicon.ico" failed (2: No such file or directory)',
format: 'Unstructured',
},
{
example:
'2025/02/12|14:42:42:871|FAKePolicyNumber-ws-sharedendorsement-autocore-54--fhfh-rghrg-0|INFO |http-nio-8080-exec-58 |RatingHelper.sendToPolicyPro:1521 |-call to PolicyPro for /rest/v2/actions/ISSUEEXT successful',
format: 'Unstructured',
},
];
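
The diff passes `LOG_FORMAT_EXAMPLE_LOGS` into a prompt slot that sits inside a `json` block. How the prompt template serializes an array of objects is not shown in this diff, so the following is only a sketch of one way to render the examples explicitly as JSON before substitution. The `ExampleLog` type and `renderExampleLogs` helper are hypothetical names, not part of the PR.

```typescript
// Hypothetical type mirroring the shape of the entries in LOG_FORMAT_EXAMPLE_LOGS.
interface ExampleLog {
  example: string;
  format: 'Structured' | 'Unstructured';
}

// Two entries copied from the constant in the diff, for illustration only.
const exampleLogs: ExampleLog[] = [
  {
    example:
      '[18/Feb/2025:22:39:16 +0000] CONNECT conn=20597223 from=10.1.1.1:1234 to=10.2.3.4:4389 protocol=LDAP',
    format: 'Structured',
  },
  {
    example:
      '<30>1 2021-07-03T23:01:56.547105-05:00 pfSense.example.com charon 18610 - - 08[CFG] ppk_id = (null)',
    format: 'Unstructured',
  },
];

// Assumption: the examples are serialized explicitly, so the prompt receives
// readable JSON regardless of how the template library coerces objects.
function renderExampleLogs(logs: ExampleLog[]): string {
  return JSON.stringify(logs, null, 2);
}

const renderedExamples = renderExampleLogs(exampleLogs);
```

Serializing up front keeps the prompt text predictable and makes it easy to assert on in a unit test.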
@@ -10,6 +10,7 @@ import { LOG_FORMAT_DETECTION_PROMPT } from './prompts';
import type { LogDetectionNodeParams } from './types';
import { SamplesFormat } from '../../../common';
import { LOG_FORMAT_DETECTION_SAMPLE_ROWS } from '../../../common/constants';
import { LOG_FORMAT_EXAMPLE_LOGS } from './constants';

export async function handleLogFormatDetection({
state,
@@ -26,6 +27,7 @@ export async function handleLogFormatDetection({
const logFormatDetectionResult = await logFormatDetectionNode.invoke({
ex_answer: state.exAnswer,
log_samples: samples.join('\n'),
example_logs: LOG_FORMAT_EXAMPLE_LOGS,
package_title: state.packageTitle,
datastream_title: state.dataStreamTitle,
});
@@ -27,11 +27,18 @@ Follow these steps to do this:
* 'leef': If the log samples have Log Event Extended Format (LEEF) then classify it as "name: leef".
* 'fix': If the log samples have Financial Information eXchange (FIX) then classify it as "name: fix".
* 'unsupported': If you cannot put the format into any of the above categories then classify it with "name: unsupported".
2. Header: for structured and unstructured format:
2. You can look at the example_logs in the context to understand different log formats.
3. Header: for structured and unstructured format:
- if the samples have any or all of priority, timestamp, loglevel, hostname, ipAddress, messageId in the beginning information then set "header: true".
- if the samples have a syslog header then set "header: true"
- else set "header: false". If you are unable to determine the syslog header presence then set "header: false".
3. Note that a comma-separated list should be classified as 'csv' if its rows only contain values separated by commas. But if it looks like a list of comma separated key-values pairs like 'key1=value1, key2=value2' it should be classified as 'structured'.
4. Note that a comma-separated list should be classified as 'csv' if its rows only contain values separated by commas. But if it looks like a list of comma separated key-values pairs like 'key1=value1, key2=value2' it should be classified as 'structured'.
<example_logs>
\`\`\`json
{example_logs}
\`\`\`
</example_logs>
You ALWAYS follow these guidelines when writing your response:
<guidelines>
@@ -7,13 +7,10 @@

export const GROK_EXAMPLE_ANSWER = {
rfc: 'RFC2454',
regex:
'/(?:(d{4}[-]d{2}[-]d{2}[T]d{2}[:]d{2}[:]d{2}(?:.d{1,6})?(?:[+-]d{2}[:]d{2}|Z)?)|-)s(?:([w][wd.@-]*)|-)s(.*)$/',
grok_patterns: ['%{WORD:key1}:%{WORD:value1};%{WORD:key2}:%{WORD:value2}:%{GREEDYDATA:message}'],
grok_pattern: '%{WORD:key1}:%{WORD:value1};%{WORD:key2}:%{WORD:value2}:%{GREEDYDATA:message}',
};

export const GROK_ERROR_EXAMPLE_ANSWER = {
grok_patterns: [
grok_pattern:
'%{TIMESTAMP:timestamp}:%{WORD:value1};%{WORD:key2}:%{WORD:value2}:%{GREEDYDATA:message}',
],
};

This file was deleted.

This file was deleted.

@@ -10,7 +10,6 @@ import { StateGraph, END, START } from '@langchain/langgraph';
import type { UnstructuredLogState } from '../../types';
import { handleUnstructured } from './unstructured';
import type { UnstructuredGraphParams, UnstructuredBaseNodeParams } from './types';
import { handleUnstructuredError } from './error';
import { handleUnstructuredValidate } from './validate';

const graphState: StateGraphArgs<UnstructuredLogState>['channels'] = {
@@ -30,6 +29,10 @@ const graphState: StateGraphArgs<UnstructuredLogState>['channels'] = {
value: (x: string[], y?: string[]) => y ?? x,
default: () => [],
},
currentPattern: {
value: (x: string, y?: string) => y ?? x,
default: () => '',
},
grokPatterns: {
value: (x: string[], y?: string[]) => y ?? x,
default: () => [],
@@ -42,8 +45,12 @@ const graphState: StateGraphArgs<UnstructuredLogState>['channels'] = {
value: (x: boolean, y?: boolean) => y ?? x,
default: () => false,
},
unParsedSamples: {
value: (x: string[], y?: string[]) => y ?? x,
default: () => [],
},
errors: {
value: (x: object, y?: object) => y ?? x,
value: (x: object[], y?: object[]) => y ?? x,
default: () => [],
},
additionalProcessors: {
@@ -54,11 +61,16 @@ const graphState: StateGraphArgs<UnstructuredLogState>['channels'] = {
value: (x: string, y?: string) => y ?? x,
default: () => '',
},
isFirst: {
value: (x: boolean, y?: boolean) => y ?? x,
default: () => false,
},
};
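
Every channel above uses the same reducer shape, `(x, y?) => y ?? x`: keep the previous value unless a node returns a new one. The following standalone sketch illustrates that behaviour without any LangGraph dependency. `SketchState`, `lastWriteWins`, and `applyUpdate` are hypothetical names used only for illustration, and the state is trimmed to three of the channels.

```typescript
// Trimmed, hypothetical state shape; the real UnstructuredLogState has more fields.
interface SketchState {
  currentPattern: string;
  unParsedSamples: string[];
  isFirst: boolean;
}

// Same reducer shape as the channels in the diff: prefer the update, else keep the old value.
// `??` only falls back on null/undefined, so an explicit `false` or '' update still wins.
const lastWriteWins = <T>(x: T, y?: T): T => y ?? x;

// Hypothetical helper: merge a node's partial update into the state, channel by channel.
function applyUpdate(state: SketchState, update: Partial<SketchState>): SketchState {
  return {
    currentPattern: lastWriteWins(state.currentPattern, update.currentPattern),
    unParsedSamples: lastWriteWins(state.unParsedSamples, update.unParsedSamples),
    isFirst: lastWriteWins(state.isFirst, update.isFirst),
  };
}

const prevState: SketchState = {
  currentPattern: '',
  unParsedSamples: ['dummy data'],
  isFirst: true,
};

// A node that returns only two fields leaves unParsedSamples untouched.
const nextState = applyUpdate(prevState, {
  currentPattern: '%{GREEDYDATA:message}',
  isFirst: false,
});
```

Because the reducer uses `??` rather than `||`, a node can reset `isFirst` to `false` or a pattern to the empty string without the old value leaking back in.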

function modelInput({ state }: UnstructuredBaseNodeParams): Partial<UnstructuredLogState> {
return {
finalized: false,
isFirst: true,
lastExecutedChain: 'modelInput',
};
}
@@ -72,10 +84,10 @@ function modelOutput({ state }: UnstructuredBaseNodeParams): Partial<Unstructure
}

function validationRouter({ state }: UnstructuredBaseNodeParams): string {
if (Object.keys(state.errors).length === 0) {
if (Object.keys(state.unParsedSamples).length === 0) {
return 'modelOutput';
}
return 'handleUnstructuredError';
return 'handleUnparsed';
}
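
The router now keys off `unParsedSamples` rather than `errors`. A minimal standalone sketch of the same decision follows, assuming a trimmed state with only the field the router reads. `RouterState`, `Route`, and `routeAfterValidation` are hypothetical names. For a plain array, `Object.keys(arr).length` and `arr.length` agree, so the sketch uses `.length` directly.

```typescript
// Hypothetical, trimmed state: only the field the router reads.
interface RouterState {
  unParsedSamples: string[];
}

// The two route labels used by the conditional edge in the diff.
type Route = 'modelOutput' | 'handleUnparsed';

// Finish once every sample parses; otherwise loop back for another attempt.
function routeAfterValidation(state: RouterState): Route {
  return state.unParsedSamples.length === 0 ? 'modelOutput' : 'handleUnparsed';
}

const doneRoute = routeAfterValidation({ unParsedSamples: [] });
const retryRoute = routeAfterValidation({ unParsedSamples: ['dummy data'] });
```

Typing the return value as a union of the route labels lets the compiler catch a mismatch between the router and the edge mapping.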

export async function getUnstructuredGraph({ model, client }: UnstructuredGraphParams) {
@@ -84,9 +96,6 @@ export async function getUnstructuredGraph({ model, client }: UnstructuredGraphP
})
.addNode('modelInput', (state: UnstructuredLogState) => modelInput({ state }))
.addNode('modelOutput', (state: UnstructuredLogState) => modelOutput({ state }))
.addNode('handleUnstructuredError', (state: UnstructuredLogState) =>
handleUnstructuredError({ state, model, client })
)
.addNode('handleUnstructured', (state: UnstructuredLogState) =>
handleUnstructured({ state, model, client })
)
@@ -100,11 +109,10 @@ export async function getUnstructuredGraph({ model, client }: UnstructuredGraphP
'handleUnstructuredValidate',
(state: UnstructuredLogState) => validationRouter({ state }),
{
handleUnstructuredError: 'handleUnstructuredError',
handleUnparsed: 'handleUnstructured',
modelOutput: 'modelOutput',
}
)
.addEdge('handleUnstructuredError', 'handleUnstructuredValidate')
.addEdge('modelOutput', END);

const compiledUnstructuredGraph = workflow.compile();
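
With the error node removed, the graph cycles between `handleUnstructured` and `handleUnstructuredValidate` until no unparsed samples remain. The sketch below simulates that cycle in plain TypeScript under several assumptions that are not confirmed by this diff: a `RegExp` stands in for a grok pattern, `propose` stands in for the LLM call, patterns that match at least one sample are accumulated, and a `maxAttempts` cap is added so the example always terminates. All names are hypothetical.

```typescript
// Hypothetical stand-in for the LLM node: given the remaining samples and the
// last errors, return a candidate pattern (a RegExp here instead of grok).
type ProposePattern = (samples: string[], errors: object[]) => RegExp;

interface LoopResult {
  patterns: RegExp[];
  unParsedSamples: string[];
}

function runUnstructuredLoop(
  samples: string[],
  propose: ProposePattern,
  maxAttempts = 5 // assumption: a retry cap, not shown in the diff
): LoopResult {
  const patterns: RegExp[] = [];
  let unParsed = samples;
  let errors: object[] = [];

  for (let attempt = 0; attempt < maxAttempts && unParsed.length > 0; attempt++) {
    const pattern = propose(unParsed, errors);
    // "Validation": keep only the samples the candidate pattern fails to match.
    const stillUnparsed = unParsed.filter((sample) => !pattern.test(sample));
    if (stillUnparsed.length < unParsed.length) {
      patterns.push(pattern);
      errors = [];
    } else {
      // Feed the failure back so the next proposal can react to it.
      errors = [{ message: `pattern ${pattern.source} matched no samples` }];
    }
    unParsed = stillUnparsed;
  }
  return { patterns, unParsedSamples: unParsed };
}

// Toy proposer: choose a pattern based on the first remaining sample.
const digitsOnly = /^\d+$/;
const lettersOnly = /^[a-z]+$/;
const loopResult = runUnstructuredLoop(['123', 'abc', '456'], (remaining) =>
  /^\d/.test(remaining[0] ?? '') ? digitsOnly : lettersOnly
);
```

In this toy run the first proposal covers the numeric samples and the second covers the remaining one, which mirrors how routing on `unParsedSamples` lets each pass focus on what is still unmatched.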
@@ -14,6 +14,9 @@ export const GROK_MAIN_PROMPT = ChatPromptTemplate.fromMessages([
<samples>
{samples}
</samples>
<errors>
{errors}
</errors>
</context>`,
],
[
@@ -22,18 +25,19 @@ export const GROK_MAIN_PROMPT = ChatPromptTemplate.fromMessages([
Your goal is to accurately extract key components such as timestamps, hostnames, priority levels, process names, events, VLAN information, MAC addresses, IP addresses, STP roles, port statuses, messages and more.
Follow these steps to help improve the grok patterns and apply it step by step:
1. Familiarize yourself with various syslog message formats.
2. PRI (Priority Level): Encoded in angle brackets, e.g., <134>, indicating the facility and severity.
3. Timestamp: Use \`SYSLOGTIMESTAMP\` for RFC 3164 timestamps (e.g., Aug 10 16:34:02). Use \`TIMESTAMP_ISO8601\` for ISO 8601 (RFC 5424) timestamps. For epoch time, use \`NUMBER\`.
4. If the timestamp could not be categorized into a predefined format, extract the date time fields separately and combine them with the format identified in the grok pattern.
5. Make sure to identify the timezone component in the timestamp.
6. Hostname/IP Address: The system or device that generated the message, which could be an IP address or fully qualified domain name
7. Process Name and PID: Often included with brackets, such as sshd[1234].
8. VLAN information: Usually in the format of VLAN: 1234.
9. MAC Address: The network interface MAC address.
10. Port number: The port number on the device.
11. Look for status codes ,interface ,log type, source ,User action, destination, protocol, etc.
12. message: This is the free-form message text that varies widely across log entries.
1. If there are errors try to identify the root cause and provide a solution.
2. Familiarize yourself with various syslog message formats.
3. PRI (Priority Level): Encoded in angle brackets, e.g., <134>, indicating the facility and severity.
4. Timestamp: Use \`SYSLOGTIMESTAMP\` for RFC 3164 timestamps (e.g., Aug 10 16:34:02). Use \`TIMESTAMP_ISO8601\` for ISO 8601 (RFC 5424) timestamps. For epoch time, use \`NUMBER\`.
5. If the timestamp could not be categorized into a predefined format, extract the date time fields separately and combine them with the format identified in the grok pattern.
6. Make sure to identify the timezone component in the timestamp.
7. Hostname/IP Address: The system or device that generated the message, which could be an IP address or fully qualified domain name
8. Process Name and PID: Often included with brackets, such as sshd[1234].
9. VLAN information: Usually in the format of VLAN: 1234.
10. MAC Address: The network interface MAC address.
11. Port number: The port number on the device.
12. Look for status codes ,interface ,log type, source ,User action, destination, protocol, etc.
13. message: This is the free-form message text that varies widely across log entries.
You ALWAYS follow these guidelines when writing your response:
@@ -54,54 +58,3 @@ export const GROK_MAIN_PROMPT = ChatPromptTemplate.fromMessages([
],
['ai', 'Please find the JSON object below:'],
]);

export const GROK_ERROR_PROMPT = ChatPromptTemplate.fromMessages([
[
'system',
`You are an expert in Syslogs and identifying the headers and structured body in syslog messages. Here is some context for you to reference for your task, read it carefully as you will get questions about it later:
<context>
<current_pattern>
{current_pattern}
</current_pattern>
</context>`,
],
[
'human',
`Please go through each error below, carefully review the provided current grok pattern, and resolve the most likely cause to the supplied error by returning an updated version of the current_pattern.
<errors>
{errors}
</errors>
Follow these steps to help improve the grok patterns and apply it step by step:
1. Familiarize yourself with various syslog message formats.
2. PRI (Priority Level): Encoded in angle brackets, e.g., <134>, indicating the facility and severity.
3. Timestamp: Use \`SYSLOGTIMESTAMP\` for RFC 3164 timestamps (e.g., Aug 10 16:34:02). Use \`TIMESTAMP_ISO8601\` for ISO 8601 (RFC 5424) timestamps. For epoch time, use \`NUMBER\`.
4. If the timestamp could not be categorized into a predefined format, extract the date time fields separately and combine them with the format identified in the grok pattern.
5. Make sure to identify the timezone component in the timestamp.
6. Hostname/IP Address: The system or device that generated the message, which could be an IP address or fully qualified domain name
7. Process Name and PID: Often included with brackets, such as sshd[1234].
8. VLAN information: Usually in the format of VLAN: 1234.
9. MAC Address: The network interface MAC address.
10. Port number: The port number on the device.
11. Look for status codes ,interface ,log type, source ,User action, destination, protocol, etc.
12. message: This is the free-form message text that varies widely across log entries.
You ALWAYS follow these guidelines when writing your response:
<guidelines>
- Make sure to map the remaining message part to \'message\' in grok pattern.
- Make sure to add \`{packageName}.{dataStreamName}\` as a prefix to each field in the pattern. Refer to example response.
- Do not respond with anything except the processor as a JSON object enclosed with 3 backticks (\`), see example response above. Use strict JSON response format.
</guidelines>
You are required to provide the output in the following example response format:
<example_response>
A: Please find the JSON object below:
\`\`\`json
{ex_answer}
\`\`\`
</example_response>`,
],
['ai', 'Please find the JSON object below:'],
]);
@@ -25,7 +25,7 @@ export interface HandleUnstructuredNodeParams extends UnstructuredNodeParams {
}

export interface GrokResult {
grok_patterns: string[];
grok_pattern: string;
message: string;
}
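
Since `GrokResult` changes from `grok_patterns: string[]` to a single `grok_pattern: string`, any code that consumes a parsed model response has to handle the new shape. The following is a hedged sketch of a runtime check for it. `SinglePatternResult` and `hasSinglePattern` are hypothetical names, and the check covers only the `grok_pattern` field because the fixtures in this diff do not include `message`.

```typescript
// Hypothetical minimal shape: just the field that changed in this PR.
interface SinglePatternResult {
  grok_pattern: string;
}

// Type guard: true only when the value carries a string `grok_pattern`.
// A response still using the old `grok_patterns: string[]` shape is rejected.
function hasSinglePattern(value: unknown): value is SinglePatternResult {
  if (typeof value !== 'object' || value === null) {
    return false;
  }
  const candidate = value as Record<string, unknown>;
  return typeof candidate['grok_pattern'] === 'string';
}

const newShape: unknown = JSON.parse('{"grok_pattern":"%{GREEDYDATA:message}"}');
const oldShape: unknown = JSON.parse('{"grok_patterns":["%{GREEDYDATA:message}"]}');

const acceptsNewShape = hasSinglePattern(newShape);
const acceptsOldShape = hasSinglePattern(oldShape);
```

A guard like this fails fast when a response still follows the old array form, instead of passing `undefined` downstream as a pattern.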
