Add cache at Step level #766

Merged
merged 75 commits into from
Oct 7, 2024

@plaguss plaguss commented Jul 1, 2024

Description

This PR implements caching at the step level.

Previously, we computed a single signature for the whole pipeline, and when that signature changed, we recomputed everything.
Now the signature is computed per step, and when a signature changes we recompute only the steps whose own signature, or that of a preceding step, has changed. So for a pipeline A -> B -> C, if step B changes, we recompute only B and C, starting from the data we already had from A.
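
The per-step idea can be sketched as follows. This is only an illustration under assumptions, not the code in this PR: `step_signature` and `steps_to_recompute` are hypothetical helpers, the signature is assumed to be a hash of the step name and its parameters, and the pipeline is assumed to be linear.

```python
import hashlib
import json


def step_signature(name: str, params: dict) -> str:
    """Hash a step's name and parameters into a stable signature (hypothetical helper)."""
    payload = json.dumps({"name": name, "params": params}, sort_keys=True)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def steps_to_recompute(order, old_signatures, new_signatures):
    """Return the steps to rerun in a linear pipeline.

    The first step whose signature differs from the cached one, and every
    step after it, must be recomputed. Steps before it keep their cached data.
    """
    for index, step in enumerate(order):
        if old_signatures.get(step) != new_signatures.get(step):
            return list(order[index:])
    return []


order = ["A", "B", "C"]
old = {
    "A": step_signature("A", {"batch": 10}),
    "B": step_signature("B", {"batch": 10}),
    "C": step_signature("C", {"batch": 10}),
}
# Only B's parameters change between runs.
new = dict(old, B=step_signature("B", {"batch": 20}))

print(steps_to_recompute(order, old, new))
```

With only B modified, the helper returns B and C, so the data already produced by A is reused.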

New cases this change handles:

  • If a pipeline fails or is interrupted mid-run (for example, by pressing Ctrl+C), it can be restarted from where it left off.
  • Given a pipeline a >> b >> c >> d, changing one step (say c) recomputes only c and d.
  • Cache use can be controlled per step. _Step has a use_cache argument; when it is set to False, the cache is not used from that step onwards, even if the pipeline is otherwise unchanged.
step_b = MyStep(
    name="step_b",
    input_batch_size=10,
    use_cache=False,
)
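
As a rough sketch of how the cases above fit together, assuming a linear pipeline: cached output can be reused for the leading steps until the first step that either changed or has use_cache=False. The `reusable_steps` function and its dictionary arguments are hypothetical and only illustrate the rule described here, not the library's internals.

```python
def reusable_steps(order, signature_unchanged, use_cache):
    """Return the leading steps whose cached output can be reused.

    Reuse stops at the first step whose signature changed or whose
    use_cache flag is False; that step and all later ones run again.
    """
    reusable = []
    for step in order:
        unchanged = signature_unchanged.get(step, False)
        if not unchanged or not use_cache.get(step, True):
            break
        reusable.append(step)
    return reusable


order = ["a", "b", "c", "d"]
all_unchanged = {step: True for step in order}

# Step c was modified: a and b come from the cache, c and d are recomputed.
print(reusable_steps(order, dict(all_unchanged, c=False), {}))

# Nothing changed, but b opts out of the cache: only a is reused.
print(reusable_steps(order, all_unchanged, {"b": False}))
```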

Note: This affects how the previously serialized parquet files are read. If any step has use_cache set to False, the previously serialized content is not read, even when the pipeline hasn't changed.

Closes #651

@plaguss plaguss self-assigned this Jul 1, 2024

codspeed-hq bot commented Jul 1, 2024

CodSpeed Performance Report

Merging #766 will degrade performances by 25.09%

Comparing cache-per-step (3ea2b85) with develop (e027f99)

Summary

❌ 1 (👁 1) regressions

Benchmarks breakdown

| Benchmark | develop | cache-per-step | Change |
| 👁 test_cache_time | 398.9 ms | 532.6 ms | -25.09% |

@plaguss plaguss requested a review from gabrielmbmb July 8, 2024 10:27

github-actions bot commented Jul 9, 2024

Documentation for this PR has been built. You can view it at: https://distilabel.argilla.io/pr-766/

@gabrielmbmb gabrielmbmb merged commit ebab004 into develop Oct 7, 2024
7 checks passed
@gabrielmbmb gabrielmbmb deleted the cache-per-step branch October 7, 2024 16:33
Development

Successfully merging this pull request may close these issues.

[FEATURE] Cache at Step level
2 participants