[Streams] Replay loghub data with synthtrace #212120

dgieselaar · 2025-02-21T16:31:09Z

Download, parse, and replay Loghub data with Synthtrace, for use in the Streams project. In summary:

adds a @kbn/sample-log-parser package which parses Loghub sample data and, using the LLM, creates valid parsers for extracting and replacing timestamps
adds a sample_logs scenario which uses the parsed data sets to replay Loghub data continuously as if it were live data
refactors some parts of Synthtrace (follow-up work captured in [Synthtrace] Consolidate clients #212179)
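
To illustrate the replay idea in the second point, here is a minimal sketch. It is not the code in this PR: `ParsedLine` and `replayFrom` are hypothetical names, and the sketch assumes the parser has already extracted an epoch-millisecond timestamp from each line. It shifts the original timestamps so the data set repeats back-to-back from a chosen start time.

```typescript
// Hypothetical shape of a parsed log line; not the actual @kbn/sample-log-parser type.
interface ParsedLine {
  timestamp: number; // epoch millis extracted from the original line
  message: string;
}

// Replays `lines` (assumed sorted by timestamp) repeatedly, shifting timestamps so
// the first line of the first pass lands on `start`, until `end` is reached.
function replayFrom(lines: ParsedLine[], start: number, end: number): ParsedLine[] {
  if (lines.length === 0) return [];
  const first = lines[0].timestamp;
  const last = lines[lines.length - 1].timestamp;
  // Length of one pass; the extra 1 ms keeps consecutive passes from overlapping.
  const span = last - first + 1;
  const out: ParsedLine[] = [];
  for (let pass = 0; ; pass++) {
    const offset = start - first + pass * span;
    for (const line of lines) {
      const shifted = line.timestamp + offset;
      if (shifted >= end) return out;
      out.push({ ...line, timestamp: shifted });
    }
  }
}
```

Because each pass keeps the original gaps between lines, a sparse data set stays sparse when replayed this way.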

flash1293

Implementation-wise this looks pretty good to me. Some meta questions:

Should we rely on the public loghub repo or fork it off? I'm a little worried about this breaking at some point because loghub changes its layout. A fork would also make it easier to expand it by our own means. In both cases we should cite loghub and the paper somewhere appropriate (like a readme file), as required by the license
I'm not so sure about the different speeds. I'm running it via node scripts/synthtrace.js sample_logs --live --kibana=http://localhost:5601 --target=http://localhost:9200 --liveBucketSize=1000 and the liveBucketSize is essentially not considered because the scenario computes its own speed. Can we take it into account? Different speeds for different data sets are a nice touch as it mirrors reality, but I would like to control the factor of data intake (and speed everything up by a factor of 1k, for example). Maybe that's already possible and I just don't know the right command
I spot-checked some aspects of the refactoring and it makes sense to me, but I didn't dig through everything, and as I'm not super familiar with the code base it's likely I'm missing something in there

github-actions · 2025-02-24T15:24:38Z

🤖 GitHub comments

Just comment with:

/oblt-deploy : Deploy a Kibana instance using the Observability test environments.
run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

dgieselaar · 2025-02-24T15:38:32Z

Should we rely on the public loghub repo or fork it off? I'm a little worried about this breaking at some point because loghub changes its layout. A fork would also make it easier to expand it by our own means. In both cases we should cite loghub and the paper somewhere appropriate (like a readme file), as required by the license

I'm fine with either - but maybe good to do that as a follow-up, I'm not sure what the legal ramifications are.

I'm not so sure about the different speeds. I'm running it via node scripts/synthtrace.js sample_logs --live --kibana=http://localhost:5601/ --target=http://localhost:9200/ --liveBucketSize=1000 and the liveBucketSize is essentially not considered because the scenario computes its own speed. Can we take it into account? Different speeds for different data sets are a nice touch as it mirrors reality, but I would like to control the factor of data intake (and speed everything up by a factor of 1k, for example). Maybe that's already possible and I just don't know the right command

Yes, I totally forgot about this setting; I should be able to use it. Would we use a constant indexing rate for each generator, or keep the relative rate per generator (e.g. Android indexes at a way higher rate than Macbook)?

flash1293 · 2025-02-24T17:48:29Z

I'm fine with either - but maybe good to do that as a follow-up, I'm not sure what the legal ramifications are.

Sounds good, then we should add a backlink to the repo and paper and follow up later.

Would we use a constant indexing rate for each generator, or keep the relative rate per generator (e.g. Android indexes at a way higher rate than Macbook)?

I would prefer the latter, in practice this kind of thing happens all the time.
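
A minimal sketch of what keeping the relative rate per generator could look like, assuming a single overall target rate is split in proportion to each data set's original rate. `GeneratorRate` and `distributeRpm` are hypothetical names, and the numbers in the usage note are made up for illustration. This is not the implementation in this PR.

```typescript
// Hypothetical description of a generator and its rate in requests per minute.
interface GeneratorRate {
  name: string;
  rpm: number;
}

// Splits `targetRpm` across generators in proportion to their original rates,
// so a data set that originally logged 9x more than another still does.
function distributeRpm(original: GeneratorRate[], targetRpm: number): GeneratorRate[] {
  const total = original.reduce((sum, g) => sum + g.rpm, 0);
  return original.map((g) => ({
    name: g.name,
    // Guard against an all-zero input; otherwise scale by the data set's share.
    rpm: total === 0 ? 0 : (g.rpm / total) * targetRpm,
  }));
}
```

For example, with made-up original rates of 900 rpm for Android and 100 rpm for Macbook and a target of 2000 rpm, Android would get roughly 90% of the target and Macbook roughly 10%.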

elasticmachine · 2025-02-25T10:47:27Z

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

dgieselaar · 2025-02-25T10:49:32Z

@flash1293 I've added runOptions.rpm to set a target rpm for all the generators. Unless I'm mistaken or misunderstand what you mean, I don't think liveBucketSize matters: it defines the time range that synthtrace requests data for, while interval('5s') defines the granularity of the timestamps. We don't really have a native concept for rpm in synthtrace; each scenario itself defines the volume of requests.

MiriamAparicio

LGTM 🌟
I've been thinking about the clients in synthtrace; it makes more sense to consolidate them, as there was a lot of repeated code. Thanks for this 👏

flash1293 · 2025-02-26T14:38:01Z

Discussed offline:

Requests shouldn't just shift the original timestamps, but should instead be smoothed out over the target time frame (because otherwise some data sets log very few docs)
The scenario should not disable streams as part of the setup routine, as this drops existing configuration
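
One possible reading of "smoothed out" in the first point, as a sketch only: rather than reusing the original gaps between log lines, spread a fixed number of documents evenly over the target window. `smoothTimestamps` is a hypothetical helper, not a Synthtrace API, and the even spacing is an assumption about what was agreed offline.

```typescript
// Returns `count` timestamps (epoch millis) spread evenly over [start, end),
// ignoring the original gaps between log lines.
function smoothTimestamps(count: number, start: number, end: number): number[] {
  if (count <= 0 || end <= start) return [];
  const step = (end - start) / count;
  const out: number[] = [];
  for (let i = 0; i < count; i++) {
    // Floor so every timestamp is a whole millisecond inside the window.
    out.push(Math.floor(start + i * step));
  }
  return out;
}
```

With this approach a data set that originally logged a handful of lines per hour still produces a steady flow of documents in the target time frame.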

pheyos

src/platform/packages/private/kbn-journeys/services/synthtrace.ts changes LGTM

achyutjhunjhunwala · 2025-03-04T11:19:52Z

...atform/packages/shared/kbn-apm-synthtrace/src/lib/apm/client/apm_synthtrace_kibana_client.ts

-      if (responseJson.response && responseJson.response.latestVersion) {
-        return responseJson.response.latestVersion as string;
+      if (!response.item.latestVersion) {
+        throw new Error(`Failed to fetch APM package version`);


Are we not supporting Synthtrace for 7.x?

dgieselaar · 2025-03-05

I removed it entirely by accident, but yeah, let's forget about 7.x, or do you have concerns?

elasticmachine · 2025-03-05T16:48:19Z

💔 Build Failed

Buildkite Build
Commit: 8d977ef

Failed CI Steps

Check Types

History

💔 Build #279401 failed 38a6cd6
💔 Build #279048 failed 74681f2
💚 Build #278662 succeeded f02eafb
💔 Build #278653 failed 40ca31b
💔 Build #278649 failed 57909f0

cc @dgieselaar

dgieselaar added 2 commits February 21, 2025 17:30

[Streams] Replay loghub data with synthtrace

57bbd8e

Add tests & ensure repo for synthtrace

fbf8063

dgieselaar mentioned this pull request Feb 22, 2025

[Synthtrace] Consolidate clients #212179

Open

dgieselaar and others added 5 commits February 22, 2025 14:18

Fix type issues

b17ae1c

Merge branch 'main' of github.com:elastic/kibana into sample-log-parser

57909f0

[CI] Auto-commit changed files from 'node scripts/generate codeowners'

be38355

Fix type issues

40ca31b

Use kibanaClient.fetch for entities client

f02eafb

dgieselaar added the v9.0.0, v9.1.0, v8.19.0, and Feature:Streams (This is the label for the Streams Project) labels Feb 23, 2025

dgieselaar self-assigned this Feb 23, 2025

dgieselaar marked this pull request as ready for review February 23, 2025 14:22

dgieselaar requested review from a team as code owners February 23, 2025 14:22

dgieselaar added the backport:version (Backport to applied version labels) and release_note:skip (Skip the PR/issue when compiling release notes) labels Feb 23, 2025

flash1293 reviewed Feb 24, 2025

botelastic bot added the ci:project-deploy-observability (Create an Observability project) label Feb 24, 2025

dgieselaar added 2 commits February 24, 2025 19:20

Merge branch 'main' of github.com:elastic/kibana into sample-log-parser

0d6ade8

Distribute RPM amongst scenarios

6d8253a

botelastic bot added the Team:obs-ux-infra_services (Observability Infrastructure & Services User Experience Team) label Feb 25, 2025

Merge branch 'main' of github.com:elastic/kibana into sample-log-parser

74681f2

Ikuni17 approved these changes Feb 25, 2025

dgieselaar added 2 commits February 26, 2025 12:10

Merge branch 'main' of github.com:elastic/kibana into sample-log-parser

e58b73e

Fix type issues

38a6cd6

MiriamAparicio approved these changes Feb 26, 2025

viduni94 approved these changes Feb 26, 2025

pheyos approved these changes Feb 26, 2025

achyutjhunjhunwala reviewed Mar 4, 2025

dgieselaar added 5 commits March 5, 2025 17:10

Query generation, graceful worker shutdown, progress reporters

2e69704

Removed comment for 7.x

7af77c5

Don't clean sample_logs

133a749

Merge branch 'main' of github.com:elastic/kibana into sample-log-pars…

269425c

…er # Please enter a commit message to explain why this merge is necessary,

Enable stream manager before bootstrap

8d977ef
