Releases: huggingface/xet-core

v1.1.3-dev0

20 May 20:06
c465076
v1.1.3-dev0 Pre-release

What's Changed

  • Updates out-of-sync Cargo.lock in hf_xet/ by @hoytak in #341
  • Incremental progress on upload_xorb with retry_wrapper by @hoytak in #333
  • Track total processed bytes and total transferred bytes by @hoytak in #328
  • Streamline and aggregate file updates for reporting to python by @hoytak in #340
  • Merging Cargo.toml dependencies into workspace Cargo.toml by @jgodlew in #339

Full Changelog: v1.1.2...v1.1.3-dev0

[v1.1.2] Smol binaries, sdist, bug fixes

16 May 20:44
b6bb555

✨ New Features and Improvements

  • Much Smaller Binaries: In this release we’ve reduced the installed binary size across all platforms (e.g., on Linux it went from ~96 MB to ~14 MB).
  • sdist installation support: hf-xet can now be built from its source distribution (sdist) following Python packaging best practices. Thanks @tiran and @szalpal for the original bug reports!

🐛 Bug Fixes (retries, open-files, sdist, smaller binaries)

  • More resilient uploads & downloads: added retries to many error paths in both download and upload. Fixes #300 #322 #311
  • Optimizations around model compression selection and object serialization.
  • Prevent "Too Many Open Files" error by limiting concurrent downloads.
  • Build & release updates to support sdist and debug symbols. Fixes #255 #304
  • Code cleanup and refactoring around progress reporting.
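
The retry and concurrency-limit fixes above follow a common pattern. As a rough sketch only (hf-xet implements this in Rust; the names `fetch_with_retry` and `download_all`, the choice of `OSError` as the retryable error, and the default limit of 4 are hypothetical and not part of hf-xet), a semaphore caps the number of in-flight downloads while each download retries with exponential backoff:

```python
import asyncio


async def fetch_with_retry(fetch, url, attempts=3, base_delay=0.01):
    """Call `fetch(url)`, retrying with exponential backoff on OSError."""
    for attempt in range(attempts):
        try:
            return await fetch(url)
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            await asyncio.sleep(base_delay * 2 ** attempt)


async def download_all(fetch, urls, max_concurrent=4):
    """Download every URL, never running more than `max_concurrent` at once."""
    limiter = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        # Waits here while `max_concurrent` downloads are already in flight,
        # which also bounds the number of simultaneously open files/sockets.
        async with limiter:
            return await fetch_with_retry(fetch, url)

    return await asyncio.gather(*(bounded(u) for u in urls))
```

Here `fetch` is any coroutine function taking a URL; results come back in the same order as `urls`.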

Full Changelog: v1.1.1...v1.1.2

v1.1.1 - reduced binary size

12 May 21:33
6f24934

✨ New Features and Improvements

In this release, we've halved our installed binary size on Linux distributions and added some performance improvements during chunking and compression evaluation.

🐛 Bug Fixes

  • Reduced the size of our installed binaries, which were bloated and consuming most of the AWS Lambda size budget (thanks to @jp-agenta and @ggiallo28 for the original issue reports)

What's Changed

  • Updating hf-xet version to 1.1.0 by @bpronan in #285
  • Make dedup critical crates compilation-compat with wasm by @seanses in #271
  • Completion tracking for accurate upload and download progress reporting. by @hoytak in #219
  • Adding session_id to requests and spans by @jgodlew in #291
  • Simplify chunking backgrounding code. by @hoytak in #292
  • Fix clippy issues in next rust version. by @hoytak in #298
  • Replace passed-around threadpool refs with thread local variable by @hoytak in #297
  • Connect detailed upload progress to hub by @hoytak in #301
  • Revert "Revert "Reduce Usage of Compression Format Detection"" by @rajatarya in #279
  • xtool query command by @seanses in #305
  • Fix compilation issue due to api change by @seanses in #309
  • Fixed race condition in dependency tracking. by @hoytak in #302
  • Changed debug to minimal for python wheel. by @hoytak in #312

Full Changelog: v1.1.0...v1.1.1

[v1.1.0] Upload byte array support

29 Apr 21:15
e4dca78

✨ New Features and Improvements

In this release, we’ve added the ability to upload to Xet from a byte array. In addition to upload_files, we now have upload_bytes, which lets Python clients that use hf-xet directly upload a Python bytes object. We have also parallelized our dedupe and chunking passes to increase upload performance.

Full Changelog: v1.0.5...v1.1.0

[v1.0.5] Fix for download errors

25 Apr 21:15
88a655f

🐛 Bug Fixes

  • Suppresses benign errors reported during download (thanks to @lewtun and @fakerybakery for reporting these!)

Full Changelog: v1.0.4...v1.0.5

[v1.0.4] High Performance Mode & better testing & bug fixes

24 Apr 18:08
f5d2c2d

🚀 High Performance Mode

In this release, we added a single HF_XET_HIGH_PERFORMANCE flag so that you no longer need to tune performance flags manually. When set to True (or 1 or Yes), hf-xet will attempt to saturate all system resources when performing uploads & downloads. Consider this analogous to using hf-transfer before, but now with hf-xet. If you are uploading or downloading big files from HF, set this flag to get the most performance from your machine. Set it with:

export HF_XET_HIGH_PERFORMANCE=1
huggingface-cli ...
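
For Python scripts, the same flag can be set from code. The sketch below also illustrates how a truthy flag of this kind might be parsed, assuming the case-insensitive truthy set used by huggingface_hub (1, ON, YES, TRUE). The helper name `env_flag` is hypothetical, and hf-xet's actual parsing happens in its Rust code, so treat this as an illustration rather than the real implementation:

```python
import os

# Assumed truthy spellings, compared case-insensitively.
TRUTHY = {"1", "ON", "YES", "TRUE"}


def env_flag(name, environ=None):
    """Return True when the environment variable `name` holds a truthy value."""
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    return value is not None and value.strip().upper() in TRUTHY


# Equivalent to the `export` above, for the current process only.
os.environ["HF_XET_HIGH_PERFORMANCE"] = "1"
print(env_flag("HF_XET_HIGH_PERFORMANCE"))
```

To be safe, set the variable early in the script, before importing the libraries that read it.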

✨ New Features and Improvements

In this release we improved our automated testing and release process to ensure we don't break our integration with huggingface-hub. Before releasing a new hf-xet package, we now verify that it does not break the existing huggingface-hub tests (which cover packages that depend on huggingface-hub, such as transformers and diffusers).

🐛 Bug Fixes

  • Added support for slow download connections, which previously caused 403 error codes to be returned for some files (Fixes #238). Thanks to @rakeshwalisheter for filing this issue!

What's Changed

  • high performance mode by @assafvayner in #241
  • fix comment on max concurrent downloads by @assafvayner in #247
  • Removed I think unnecessary constraint that chunk cache must be 10x size of chunk by @ylow in #249
  • throw error instead of panic when header is not found in hub client by @sirahd in #251
  • ChunkCache interface returns chunk indices by @assafvayner in #253
  • Adding a pre-release workflow to test huggingface_hub by @bpronan in #242
  • use safe file writer in local client by @assafvayner in #250
  • make utils wasm-compat by @assafvayner in #257
  • Running huggingface_hub xet tests on PRs by @bpronan in #245
  • Segmented download and refresh fetch info on 403 by @seanses in #252
  • Fix variable name to agree with the logic by @seanses in #254
  • configurable boolean constants follows hf_hub truthy values by @assafvayner in #260
  • Updating the release jobs to write the new version by @bpronan in #263
  • Skipping the version update on empty tag by @bpronan in #264
  • Fixing the version output from the prerelease step by @bpronan in #265

Full Changelog: v1.0.3...v1.0.4

[v1.0.3] Bug fixes on upload & download

09 Apr 17:17
385f4f3

🐛 Bug Fixes

  • The upload process no longer uses the $TMPDIR env variable; it now uses the $HF_XET_CACHE location instead
  • Fixed a potential divide by zero in chunk cache eviction

Full Changelog: v1.0.2...v1.0.3

[v1.0.2]: Performance Tuning knobs exposed 🎛️

06 Apr 18:04
d761183

🛠️ Small Fixes and Maintenance

In this minor release we changed how we expose configuration flags for hf-xet, making them consistent with the other environment variables. These flags enable high performance if your machine has many cores, plenty of network bandwidth, and a really fast disk. The defaults in hf-xet are unchanged and are intentionally modest to support the broadest range of hardware. Use these knobs to get great download speeds on a machine with lots of network bandwidth and lots of cores.

# Number of concurrent terms (ranges of bytes from within a xorb) downloaded from S3 per file.
# Increasing this speeds up downloading a file if network bandwidth is available.
HF_XET_NUM_CONCURRENT_RANGE_GETS=16

# hf-xet is designed for SSD/NVMe disks and writes in parallel. If you are using an HDD, setting this
# makes disk writes sequential instead of parallel. To set it:
# HF_XET_RECONSTRUCT_WRITE_SEQUENTIALLY=true

# Note: hf-xet will issue at most
# HF_XET_MAX_CONCURRENT_DOWNLOADS * HF_XET_NUM_CONCURRENT_RANGE_GETS parallel GETs from S3.

# Default cache size (10 GiB). Increasing this gives more space for caching terms/chunks fetched from S3.
# A larger cache can better take advantage of deduplication across repos & files.
HF_XET_CHUNK_CACHE_SIZE_BYTES=10737418240

# Setting this changes where the chunk cache is located (ideally a local SSD/NVMe rather than a shared/distributed filesystem).
HF_XET_CACHE=~/.cache/huggingface/xet
# Setting this also changes where the chunk cache is located (`$HF_HOME/xet`). Lower precedence than `HF_XET_CACHE`.
HF_HOME=~/.cache/huggingface

# If your network bandwidth is much greater than your disk speed (e.g. a 10 Gbps link vs. a SATA SSD or worse),
# disabling the xet cache will increase your performance. To disable the xet cache, set HF_XET_CHUNK_CACHE_SIZE_BYTES=0.
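
To make the precedence and arithmetic in these comments concrete, here is a small sketch in Python. It only mirrors what the comments above state: `HF_XET_CACHE` wins over `$HF_HOME/xet`, a cache size of 0 disables the cache, and the GET bound is a simple product. The function names are hypothetical and this is not hf-xet's own code:

```python
import os
from pathlib import Path


def resolve_xet_cache(environ):
    """Pick the chunk cache directory: HF_XET_CACHE first, then $HF_HOME/xet."""
    if environ.get("HF_XET_CACHE"):
        return Path(environ["HF_XET_CACHE"]).expanduser()
    hf_home = environ.get("HF_HOME", "~/.cache/huggingface")
    return Path(hf_home).expanduser() / "xet"


def cache_enabled(environ, default_bytes=10 * 1024**3):
    """A configured size of 0 bytes disables the chunk cache."""
    return int(environ.get("HF_XET_CHUNK_CACHE_SIZE_BYTES", default_bytes)) > 0


def max_parallel_gets(max_concurrent_downloads, num_concurrent_range_gets):
    """Upper bound on simultaneous GETs from S3 (the product of the two knobs)."""
    return max_concurrent_downloads * num_concurrent_range_gets


print(resolve_xet_cache(os.environ))
print(cache_enabled(os.environ))
```

For example, with a hypothetical `HF_XET_MAX_CONCURRENT_DOWNLOADS` of 4 and the `HF_XET_NUM_CONCURRENT_RANGE_GETS=16` shown above, `max_parallel_gets(4, 16)` gives a ceiling of 64 parallel GETs.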

💔 Breaking Changes

  • If you used XET_ environment variables to tune things before, please update to using HF_XET_ instead.
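
If you have many legacy settings, a small helper can rewrite them. This is only a sketch under the assumption that each variable kept its name apart from the new `HF_` prefix, which the note above implies but does not spell out; `migrate_xet_env` is a hypothetical helper, not something shipped with hf-xet:

```python
def migrate_xet_env(environ):
    """Return a copy of `environ` with legacy XET_* names renamed to HF_XET_*."""
    migrated = dict(environ)
    for name, value in environ.items():
        if name.startswith("XET_"):
            new_name = "HF_" + name
            # An explicitly set HF_XET_* value wins over the legacy one.
            migrated.setdefault(new_name, value)
            del migrated[name]
    return migrated
```

The input mapping is left untouched, so the result can be inspected before being applied to `os.environ`.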

Full Changelog: v1.0.0...v1.0.2

[v1.0.0]: hf-xet is ready for 1.0.0 now! 🚀

01 Apr 19:22
d50c42d

🚀 Ready. Xet. Go!

hf-xet is ready for 1.0.0 now. We have been onboarding internal and external Hugging Face users over the last several weeks and have hardened this package, along with the backend services it relies on. Now, we will work hard to prevent any breaking changes going forward.

✨ New Features and Improvements

In this release we doubled the download performance for Xet files. In preliminary benchmarks, hf-xet is faster than hf-transfer.

⚡ Getting Started with Xet

You can start using Xet today by installing the optional dependency:

pip install -U "huggingface_hub[hf_xet]"

With that, you can seamlessly download files from Xet-enabled repositories! And don’t worry—everything remains fully backward-compatible if you’re not ready to upgrade yet.
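
To confirm that the optional dependency is visible to your Python environment, a quick check like the following can help. It is a minimal sketch that relies only on the standard library and on the package being importable as `hf_xet`; the helper name `xet_available` is hypothetical:

```python
import importlib.util


def xet_available():
    """Return True if the optional `hf_xet` package can be imported."""
    return importlib.util.find_spec("hf_xet") is not None


print("hf_xet installed:", xet_available())
```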

Blog post: Xet on the Hub
Docs: Storage backends → Xet

Tip

Want to store your own files with Xet? We’re gradually rolling out support on the Hugging Face Hub, so hf_xet uploads may need to be enabled for your repo. Join the waitlist to get onboarded soon!

What's Changed

  • add utils over unpacked_chunk_offsets by @assafvayner in #193
  • Enable unpacked_chunk_offsets check by @seanses in #192
  • CAS Data Aggregator by @hoytak in #194
  • Merge RemoteClient and HttpShardClient. by @hoytak in #183
  • update merklehash to wasm compat by @assafvayner in #205
  • Restrict memory allocation size on Xorb deserialization by @seanses in #197
  • xet_threadpool is wasm compilable by @assafvayner in #204
  • opt-test profile for builds with debug assertions for regular use by @hoytak in #210
  • Macro for declaring configurable constants. by @hoytak in #211
  • concurrent limited join set by @assafvayner in #209
  • optionally deserialize only boundaries section of xorb metadata by @assafvayner in #203
  • make cas endpoint option in xtool migration by @sirahd in #216
  • Data Processing Refactor / Simplification towards WASM compatibility by @hoytak in #199
  • Fold parallel xorb uploader into FileUploadSession. by @hoytak in #217
  • Env to write reconstruction terms in parallel by @jgodlew in #221
  • Conservative retry strategy for global dedup query by @seanses in #218
  • Update hf_xet to 1.0.0 by @jgodlew in #222

Full Changelog: v0.1.4...v1.0.0

v0.1.4

05 Mar 23:16
6f46f80
v0.1.4 Pre-release

What's Changed

  • Fix calling async from sync inside async runtime by @seanses in #170
  • Make CasObject serde::Serializable by @znation in #172
  • Removed unused parts of the remote shard interface. by @hoytak in #175
  • Adding request_id to logs from CAS requests by @jgodlew in #173
  • Shard query resilient to shard file deletion by @seanses in #174
  • only store compressed chunk if the compressed chunk is smaller by @assafvayner in #177
  • Consolidate local client testing code into local_client.rs by @hoytak in #178
  • Rename data processing class names to reflect functionality by @hoytak in #176
  • Fix cas chunk header validation bug by @seanses in #181
  • Use KL divergence on bg4 groups to choose compression scheme by @hoytak in #179
  • Clean up global dedup query API by @hoytak in #180
  • Xorb Format Upgrade by @hoytak in #182
  • Fix validating v0 xorb bug by @seanses in #184
  • Fix parsing v0 xorb again by @seanses in #185
  • Validate the unpacked chunk offsets by @seanses in #186
  • Add back repo_type for compat by @seanses in #187
  • Adds support cache dir config via env variable by @rajatarya in #188
  • Releasing version 0.1.4 by @rajatarya in #191

Full Changelog: v0.1.3...v0.1.4