[Streams] Replay loghub data with synthtrace #212120

Open
wants to merge 18 commits into
base: main

dgieselaar
Member

@dgieselaar dgieselaar commented Feb 21, 2025

Download, parse and replay loghub data with Synthtrace, for use in the Streams project. In summary:

  • add a @kbn/sample-log-parser package that parses Loghub sample data and uses an LLM to create valid parsers for extracting and replacing timestamps
  • add a sample_logs scenario that uses the parsed data sets to replay Loghub data continuously as if it were live data
  • refactor some parts of Synthtrace (follow-up work captured in [Synthtrace] Consolidate clients #212179)
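
To illustrate what a parser for "extracting and replacing timestamps" might look like, here is a minimal sketch. The `TimestampParser` interface, the `isoParser` object, and the ISO 8601 log format are assumptions made for this example; the parsers generated by @kbn/sample-log-parser target the actual Loghub formats and may be shaped differently.

```typescript
// Hypothetical shape of a timestamp parser; not the real @kbn/sample-log-parser API.
interface TimestampParser {
  // Extracts the timestamp of a log line as epoch milliseconds.
  getTimestamp(line: string): number;
  // Returns the same log line with its timestamp replaced.
  replaceTimestamp(line: string, timestamp: number): string;
}

// Assumed format for illustration: an ISO 8601 UTC timestamp somewhere in the line.
const ISO_PATTERN = /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d{3})?Z/;

const isoParser: TimestampParser = {
  getTimestamp(line: string): number {
    const match = line.match(ISO_PATTERN);
    if (!match) {
      throw new Error(`No timestamp found in: ${line}`);
    }
    return Date.parse(match[0]);
  },
  replaceTimestamp(line: string, timestamp: number): string {
    // Only the first match is replaced, so the rest of the message stays untouched.
    return line.replace(ISO_PATTERN, new Date(timestamp).toISOString());
  },
};

const original = '2020-01-01T00:00:00.000Z INFO service started';
const replayed = isoParser.replaceTimestamp(original, Date.UTC(2025, 1, 21, 12, 0, 0));
console.log(replayed);
```

Keeping extraction and replacement in one object means a replay loop can read the original time, compute a new one, and rewrite the line without knowing anything about the format.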

Synthtrace changes

  • Replace custom Logger object with Kibana-standard ToolingLog
  • Report progress and estimated time to completion for long-running jobs
  • Simplify scenarioOpts (allow comma-separated key-value pairs instead of just JSON)
  • Simplify client initialization
  • When using workers, only bootstrap once (in the main thread)
  • Allow workers to shut down gracefully
  • Downgrade some logging levels for less noise
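
As a rough illustration of the scenarioOpts change, the sketch below accepts either a JSON object or comma-separated key-value pairs. The function names `parseScenarioOpts` and `coerce`, the `=` separator, and the type coercion rules are assumptions for this example and may differ from what Synthtrace actually does.

```typescript
// Hypothetical helper; not the actual Synthtrace implementation.
type ScenarioOpts = Record<string, string | number | boolean>;

// Turns "true"/"false" into booleans and numeric strings into numbers.
function coerce(value: string): string | number | boolean {
  if (value === 'true') return true;
  if (value === 'false') return false;
  const asNumber = Number(value);
  return value !== '' && Number.isFinite(asNumber) ? asNumber : value;
}

function parseScenarioOpts(raw: string | undefined): ScenarioOpts {
  const trimmed = (raw ?? '').trim();
  if (trimmed === '') return {};

  // Stay backwards compatible: input that looks like a JSON object is parsed as JSON.
  if (trimmed.startsWith('{')) {
    return JSON.parse(trimmed) as ScenarioOpts;
  }

  const opts: ScenarioOpts = {};
  for (const pair of trimmed.split(',')) {
    const separatorIndex = pair.indexOf('=');
    const key = (separatorIndex === -1 ? pair : pair.slice(0, separatorIndex)).trim();
    if (key === '') continue;
    // Assumption: a bare key without "=" is treated as a boolean flag.
    const value = separatorIndex === -1 ? 'true' : pair.slice(separatorIndex + 1).trim();
    opts[key] = coerce(value);
  }
  return opts;
}

console.log(parseScenarioOpts('rpm=1000,distribution=uniform'));
console.log(parseScenarioOpts('{"rpm": 1000}'));
```

Accepting both forms keeps existing JSON invocations working while making the common case shorter to type on the command line.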

@dgieselaar dgieselaar added the v9.0.0, v9.1.0, v8.19.0, and Feature:Streams labels Feb 23, 2025
@dgieselaar dgieselaar self-assigned this Feb 23, 2025
@dgieselaar dgieselaar marked this pull request as ready for review February 23, 2025 14:22
@dgieselaar dgieselaar requested review from a team as code owners February 23, 2025 14:22
@dgieselaar dgieselaar added the backport:version and release_note:skip labels Feb 23, 2025
Contributor

@flash1293 flash1293 left a comment

Implementation-wise this looks pretty good to me. Some meta questions:

  • Should we rely on the public loghub repo or fork it off? I'm a little worried about this breaking at some point because loghub changes its layout. A fork would also make it easier to expand it by our own means. In both cases we should cite loghub and the paper somewhere appropriate (like a readme file), as required by the license
  • I'm not so sure about the different speeds. I'm running via node scripts/synthtrace.js sample_logs --live --kibana=http://localhost:5601 --target=http://localhost:9200 --liveBucketSize=1000 and the liveBucketSize is essentially not considered because the scenario computes its own speed. Can we take it into account? Different speeds for different data sets are a nice touch as it mirrors reality, but I would like to control the factor of data intake (and speed everything up by a factor of 1k, for example). Maybe that's already possible and I just don't know the right command
  • I spot-checked some aspects of the refactoring and it makes sense to me, but I didn't dig through everything and as I'm not super familiar with the code base it's likely I'm missing something in there

@botelastic botelastic bot added the ci:project-deploy-observability label Feb 24, 2025
Contributor

🤖 GitHub comments

Just comment with:

  • /oblt-deploy : Deploy a Kibana instance using the Observability test environments.
  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

@dgieselaar
Member Author

> Should we rely on the public loghub repo or fork it off? I'm a little worried about this breaking at some point because loghub changes its layout. A fork would also make it easier to expand it by our own means. In both cases we should cite loghub and the paper somewhere appropriate (like a readme file), as required by the license

I'm fine with either - but maybe good to do that as a follow-up, I'm not sure what the legal ramifications are.

> I'm not so sure about the different speeds. I'm running via node scripts/synthtrace.js sample_logs --live --kibana=http://localhost:5601/ --target=http://localhost:9200/ --liveBucketSize=1000 and the liveBucketSize is essentially not considered because the scenario computes its own speed. Can we take it into account? Different speeds for different data sets are a nice touch as it mirrors reality, but I would like to control the factor of data intake (and speed everything up by a factor of 1k, for example). Maybe that's already possible and I just don't know the right command

Yes, totally forgot about this setting, I should be able to use it. Would we use a constant indexing rate for each generator, or keep the relative rate per generator (e.g. Android indexes at a way higher rate than Macbook)?

@flash1293
Contributor

> I'm fine with either - but maybe good to do that as a follow-up, I'm not sure what the legal ramifications are.

Sounds good, then we should add a backlink to the repo and paper and follow up later.

> Would we use a constant indexing rate for each generator, or keep the relative rate per generator (e.g. Android indexes at a way higher rate than Macbook)?

I would prefer the latter, in practice this kind of thing happens all the time.

@botelastic botelastic bot added the Team:obs-ux-infra_services label Feb 25, 2025
@elasticmachine
Contributor

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

@dgieselaar
Member Author

@flash1293 I've added runOptions.rpm to set a target rpm for all the generators. Unless I'm mistaken or misunderstand what you mean, I don't think liveBucketSize matters - that defines the time range that synthtrace requests data for. interval('5s') defines the granularity of the timestamps. We don't really have a native concept for rpm in synthtrace, each scenario itself defines the volume of requests.
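
One way to read the rpm discussion above is sketched below: scale every generator so the combined rate hits a target rpm while the relative rates between data sets are preserved, then derive how many documents fall into each interval('5s') bucket. The names `GeneratorRate`, `scaleToTargetRpm`, and `docsPerInterval`, and the sample rates, are hypothetical; whether runOptions.rpm applies to the total or to each generator is not stated in this thread.

```typescript
// Hypothetical types and helpers for illustration; not part of Synthtrace.
interface GeneratorRate {
  name: string;
  rpm: number; // documents per minute observed in the original data set
}

// Scales all generators by the same factor so their sum equals targetRpm.
function scaleToTargetRpm(generators: GeneratorRate[], targetRpm: number): GeneratorRate[] {
  const totalRpm = generators.reduce((sum, generator) => sum + generator.rpm, 0);
  if (totalRpm <= 0) {
    throw new Error('Cannot scale generators without a positive total rate');
  }
  return generators.map((generator) => ({
    name: generator.name,
    rpm: (generator.rpm * targetRpm) / totalRpm,
  }));
}

// Converts a per-minute rate into the number of documents for one timestamp bucket.
function docsPerInterval(rpm: number, intervalMs: number): number {
  return (rpm * intervalMs) / 60_000;
}

// Made-up rates: one data set logs three times as fast as the other.
const scaled = scaleToTargetRpm(
  [
    { name: 'Android', rpm: 3000 },
    { name: 'Mac', rpm: 1000 },
  ],
  400
);

for (const generator of scaled) {
  console.log(generator.name, generator.rpm, docsPerInterval(generator.rpm, 5_000));
}
```

With these made-up numbers the 3:1 ratio between the two data sets survives the scaling, which matches the preference expressed earlier in the thread for keeping relative rates.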

Contributor

@MiriamAparicio MiriamAparicio left a comment

LGTM 🌟
I've been thinking about the clients in synthtrace. It makes more sense to consolidate them, as there was a lot of repeated code. Thanks for this 👏

@flash1293
Contributor

Discussed offline:

  • Requests shouldn't just shift original timestamps, but should instead be smoothed out over the target time frame (because otherwise some data sets log very few docs)
  • The scenario should not disable streams as part of the setup routine as this drops existing configuration
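
The difference between shifting and smoothing timestamps can be sketched as follows. Both functions and the sample values are hypothetical and only illustrate the idea from the offline discussion; the actual scenario may distribute documents differently.

```typescript
// Hypothetical helpers for illustration; timestamps are epoch milliseconds.

// Shifting keeps the original gaps between documents, so bursts and long silences remain.
function shiftTimestamps(original: number[], targetStart: number): number[] {
  if (original.length === 0) return [];
  const offset = targetStart - Math.min(...original);
  return original.map((timestamp) => timestamp + offset);
}

// Smoothing ignores the original gaps and spreads the documents evenly over the window.
function smoothTimestamps(count: number, targetStart: number, targetEnd: number): number[] {
  if (count <= 0) return [];
  const step = (targetEnd - targetStart) / count;
  return Array.from({ length: count }, (_, index) => Math.floor(targetStart + index * step));
}

// A bursty data set: three documents right away, then one an hour later.
const originalTimestamps = [0, 10, 20, 3_600_000];
const windowStart = 1_000_000;
const windowEnd = windowStart + 60_000; // one-minute target window

const shifted = shiftTimestamps(originalTimestamps, windowStart);
const smoothed = smoothTimestamps(originalTimestamps.length, windowStart, windowEnd);

console.log('shifted docs inside window:', shifted.filter((t) => t < windowEnd).length);
console.log('smoothed docs inside window:', smoothed.filter((t) => t < windowEnd).length);
```

In this example the shifted replay leaves one document outside the one-minute window, while the smoothed replay places all four inside it at even spacing.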

Member

@pheyos pheyos left a comment

src/platform/packages/private/kbn-journeys/services/synthtrace.ts changes LGTM

- if (responseJson.response && responseJson.response.latestVersion) {
-   return responseJson.response.latestVersion as string;
+ if (!response.item.latestVersion) {
+   throw new Error(`Failed to fetch APM package version`);
Contributor

Are we not supporting Synthtrace for 7.x?

Member Author

I totally accidentally removed it, but yeah, let's forget about 7.x, or do you have concerns?

@elasticmachine
Contributor

elasticmachine commented Mar 6, 2025

💚 Build Succeeded

  • Buildkite Build
  • Commit: ca6805e
  • Kibana Serverless Image: docker.elastic.co/kibana-ci/kibana-serverless:pr-212120-ca6805e7786b

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/apm-synthtrace | 97 | 98 | +1 |
| @kbn/apm-synthtrace-client | 272 | 274 | +2 |
| @kbn/sample-log-parser | - | 18 | +18 |
| total | | | +21 |

Public APIs missing exports

Total count of every type that is part of your API that should be exported but is not. This will cause broken links in the API documentation system. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats exports for more detailed information.

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/sample-log-parser | - | 1 | +1 |

Unknown metric groups

API count

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/apm-synthtrace | 97 | 98 | +1 |
| @kbn/apm-synthtrace-client | 272 | 274 | +2 |
| @kbn/sample-log-parser | - | 18 | +18 |
| total | | | +21 |

ESLint disabled in files

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/apm-synthtrace | 2 | 3 | +1 |

ESLint disabled line counts

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/apm-synthtrace | 6 | 2 | -4 |

Total ESLint disabled count

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/apm-synthtrace | 8 | 5 | -3 |

History

cc @dgieselaar

Labels
backport:version, ci:project-deploy-observability, Feature:Streams, release_note:skip, Team:obs-ux-infra_services, v8.19.0, v9.0.0, v9.1.0
9 participants