[Streams] Replay loghub data with synthtrace #212120

Open. dgieselaar wants to merge 17 commits into base: main


dgieselaar
Member

@dgieselaar dgieselaar commented Feb 21, 2025

Download, parse and replay loghub data with Synthtrace, for use in the Streams project. In summary:

  • adds a @kbn/sample-log-parser package, which parses Loghub sample data and uses the LLM to create valid parsers for extracting and replacing timestamps
  • adds a sample_logs scenario, which uses the parsed data sets to replay Loghub data continuously as if it were live data
  • refactors some parts of Synthtrace (follow-up work captured in [Synthtrace] Consolidate clients #212179)

@dgieselaar dgieselaar added the v9.0.0, v9.1.0, v8.19.0 and Feature:Streams (This is the label for the Streams Project) labels Feb 23, 2025
@dgieselaar dgieselaar self-assigned this Feb 23, 2025
@dgieselaar dgieselaar marked this pull request as ready for review February 23, 2025 14:22
@dgieselaar dgieselaar requested review from a team as code owners February 23, 2025 14:22
@dgieselaar dgieselaar added the backport:version (Backport to applied version labels) and release_note:skip (Skip the PR/issue when compiling release notes) labels Feb 23, 2025
Contributor

@flash1293 flash1293 left a comment

Implementation-wise this looks pretty good to me. Some meta questions:

  • Should we rely on the public loghub repo or fork it? I'm a little worried about this breaking at some point because loghub changes its layout. A fork would also make it easier to expand the data by our own means. In both cases we should cite loghub and the paper somewhere appropriate (like a readme file), as per the license.
  • I'm not so sure about the different speeds. I'm running via node scripts/synthtrace.js sample_logs --live --kibana=http://localhost:5601 --target=http://localhost:9200 --liveBucketSize=1000, and liveBucketSize is essentially ignored because the scenario computes its own speed. Can we take it into account? Different speeds for different data sets are a nice touch, as that mirrors reality, but I would like to control the factor of data intake (and speed everything up by a factor of 1k, for example). Maybe that's already possible and I just don't know the right command.
  • I spot-checked some aspects of the refactoring and it makes sense to me, but I didn't dig through everything, and as I'm not super familiar with the code base, it's likely I'm missing something in there.

@botelastic botelastic bot added the ci:project-deploy-observability (Create an Observability project) label Feb 24, 2025
Contributor

🤖 GitHub comments

Just comment with:

  • /oblt-deploy : Deploy a Kibana instance using the Observability test environments.
  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

@dgieselaar
Member Author

Should we rely on the public loghub repo or fork it? I'm a little worried about this breaking at some point because loghub changes its layout. A fork would also make it easier to expand the data by our own means. In both cases we should cite loghub and the paper somewhere appropriate (like a readme file), as per the license.

I'm fine with either - but maybe good to do that as a follow-up, I'm not sure what the legal ramifications are.

I'm not so sure about the different speeds. I'm running via node scripts/synthtrace.js sample_logs --live --kibana=http://localhost:5601/ --target=http://localhost:9200/ --liveBucketSize=1000, and liveBucketSize is essentially ignored because the scenario computes its own speed. Can we take it into account? Different speeds for different data sets are a nice touch, as that mirrors reality, but I would like to control the factor of data intake (and speed everything up by a factor of 1k, for example). Maybe that's already possible and I just don't know the right command.

Yes, totally forgot about this setting, I should be able to use it. Would we use a constant indexing rate for each generator, or keep the relative rate per generator (e.g. Android indexes at a way higher rate than Macbook)?

@flash1293
Contributor

I'm fine with either - but maybe good to do that as a follow-up, I'm not sure what the legal ramifications are.

Sounds good, then we should add a backlink to the repo and paper and follow up later.

Would we use a constant indexing rate for each generator, or keep the relative rate per generator (e.g. Android indexes at a way higher rate than Macbook)?

I would prefer the latter, in practice this kind of thing happens all the time.

@botelastic botelastic bot added the Team:obs-ux-infra_services (Observability Infrastructure & Services User Experience Team) label Feb 25, 2025
@elasticmachine
Contributor

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

@dgieselaar
Member Author

@flash1293 I've added runOptions.rpm to set a target rpm for all the generators. Unless I'm mistaken or misunderstand what you mean, I don't think liveBucketSize matters: it defines the time range that synthtrace requests data for, while interval('5s') defines the granularity of the timestamps. We don't really have a native concept of rpm in synthtrace; each scenario defines its own volume of requests.
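
To make the rate discussion concrete, here is a minimal sketch of how a single target rpm could be split across generators while keeping their relative rates, which is the option preferred earlier in the thread. It is not the actual Synthtrace implementation: `GeneratorRate`, `scaleToTargetRpm` and the sample numbers are hypothetical names and values chosen for illustration.

```typescript
// Hypothetical shape: each data set's native rate, e.g. derived from its original timestamps.
interface GeneratorRate {
  name: string;
  nativeRpm: number;
}

// Scale every generator by the same factor so the rates sum to targetRpm
// while the ratio between any two generators stays unchanged.
function scaleToTargetRpm(generators: GeneratorRate[], targetRpm: number): Map<string, number> {
  const totalNativeRpm = generators.reduce((sum, g) => sum + g.nativeRpm, 0);
  if (totalNativeRpm <= 0) {
    throw new Error('Cannot scale: total native rpm must be positive');
  }
  const factor = targetRpm / totalNativeRpm;
  return new Map(generators.map((g): [string, number] => [g.name, g.nativeRpm * factor]));
}

// Illustrative numbers only, not measured from Loghub.
const scaled = scaleToTargetRpm(
  [
    { name: 'Android', nativeRpm: 900 },
    { name: 'Macbook', nativeRpm: 100 },
  ],
  10000
);
console.log(scaled.get('Android'), scaled.get('Macbook'));
```

With these inputs the 9:1 ratio between the two data sets is preserved and the total matches the requested target. A constant rate per generator would instead divide the target by the number of generators.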

Contributor

@MiriamAparicio MiriamAparicio left a comment

LGTM 🌟
I've been thinking about the clients in synthtrace; it makes more sense to consolidate them, as there was a lot of repeated code. Thanks for this 👏

@flash1293
Contributor

Discussed offline:

  • Requests shouldn't just shift the original timestamps, but should instead be smoothed out over the target time frame (otherwise some data sets log very few docs)
  • The scenario should not disable streams as part of the setup routine, as this drops existing configuration
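
The first point can be illustrated with a small sketch. Assuming the goal is to spread a data set's documents evenly across the requested window, rather than shifting each original timestamp by a fixed offset, one option looks like this. `smoothTimestamps` and its parameters are hypothetical and not taken from the PR.

```typescript
// Hypothetical helper: assign evenly spaced timestamps within [from, to)
// to `count` documents, preserving their original order.
function smoothTimestamps(count: number, from: number, to: number): number[] {
  if (count <= 0 || to <= from) {
    return [];
  }
  const step = (to - from) / count;
  return Array.from({ length: count }, (_, i) => Math.floor(from + i * step));
}

// Four documents spread over a 1000 ms window starting at t = 0.
console.log(smoothTimestamps(4, 0, 1000)); // values: 0, 250, 500, 750
```

Even spacing means a sparse data set still produces documents throughout the window, at the cost of losing the original burst pattern.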

Member

@pheyos pheyos left a comment

src/platform/packages/private/kbn-journeys/services/synthtrace.ts changes LGTM

-    if (responseJson.response && responseJson.response.latestVersion) {
-      return responseJson.response.latestVersion as string;
+    if (!response.item.latestVersion) {
+      throw new Error(`Failed to fetch APM package version`);
Contributor

Are we not supporting Synthtrace for 7.x?

Member Author

I totally accidentally removed it, but yeah, let's forget about 7.x, or do you have concerns?
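
For context on the 7.x question: the quoted diff lines show two response shapes, `response.latestVersion` and `item.latestVersion`, and the question suggests the former is the older format. If both had to be supported, a tolerant reader could look roughly like the sketch below. The `PackageResponse` type, the `getLatestVersion` function and the version strings are hypothetical; only the two property paths and the error message come from the diff.

```typescript
// Hypothetical type covering both shapes that appear in the diff.
interface PackageResponse {
  response?: { latestVersion?: string }; // shape read via responseJson.response
  item?: { latestVersion?: string }; // shape read via response.item
}

// Prefer the `item` shape and fall back to the `response` shape.
function getLatestVersion(json: PackageResponse): string {
  const version = json.item?.latestVersion ?? json.response?.latestVersion;
  if (!version) {
    throw new Error('Failed to fetch APM package version');
  }
  return version;
}

// Illustrative version strings only.
console.log(getLatestVersion({ item: { latestVersion: '8.0.0' } }));
console.log(getLatestVersion({ response: { latestVersion: '7.17.0' } }));
```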

@elasticmachine
Contributor

💔 Build Failed

cc @dgieselaar

Labels

  • backport:version (Backport to applied version labels)
  • ci:project-deploy-observability (Create an Observability project)
  • Feature:Streams (This is the label for the Streams Project)
  • release_note:skip (Skip the PR/issue when compiling release notes)
  • Team:obs-ux-infra_services (Observability Infrastructure & Services User Experience Team)
  • v8.19.0
  • v9.0.0
  • v9.1.0

9 participants