HuggingFace 100× Faster Dataset Streaming: New API Release

Large-scale machine-learning models rely on ever-growing corpora, but loading entire datasets into RAM quickly becomes a bottleneck. HuggingFace's recently released streaming API addresses this pain point by enabling on-the-fly data retrieval over HTTP or local storage, eliminating the need for bulky pre-downloads. The new design integrates seamlessly with the existing "datasets" library, giving developers a familiar interface while offering the flexibility to stream from remote sources such as S3, GCS, or HuggingFace's own dataset hub. Early benchmarks show that for a 10 GB text corpus, streaming can reduce peak memory usage from 8 GB to under 200 MB, a dramatic shift that opens the door to training larger models on modest hardware.
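As a rough illustration, here is a minimal sketch of streaming usage through the "datasets" library's long-standing streaming entry point; the dataset name and fields are placeholders, and the exact arguments in the new release may differ:

```python
from datasets import load_dataset

# streaming=True fetches records over HTTP on demand instead of
# downloading and caching the whole corpus up front.
# "allenai/c4" / "en" are placeholder names for any hub dataset.
dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)

# Iteration pulls one record at a time, so peak memory stays small
# regardless of how large the corpus is on disk.
for i, example in enumerate(dataset):
    print(example["url"], example["text"][:80])
    if i == 2:  # stop after a few records for the demo
        break
```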

Behind the scenes, HuggingFace has overhauled the compression and chunking logic, introducing a lightweight protocol that transmits data in 64-KB slices and decompresses them on demand. Coupled with a caching layer that reuses previously fetched chunks, the system scales near-linearly with the number of worker processes. The result is a 100× speedup in data ingestion for typical transformer training pipelines, especially when paired with PyTorch DataLoaders. Moreover, the API exposes a simple stream() context manager that yields records one at a time, allowing fine-grained control over preprocessing steps without loading the entire dataset into memory.
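A hedged sketch of the DataLoader pairing described above, using the documented streaming path rather than the new stream() context manager (whose exact signature the article does not show); the dataset name, buffer size, and worker count are illustrative assumptions, not figures from the release:

```python
from datasets import load_dataset
from torch.utils.data import DataLoader

def add_len(example):
    # Per-record preprocessing; runs lazily as records stream in,
    # so the full corpus never has to fit in memory.
    return {"n_chars": len(example["text"])}

# Placeholder dataset: any hub dataset with a "text" field works here.
stream = load_dataset("allenai/c4", "en", split="train", streaming=True)

# Approximate shuffling via a bounded in-memory buffer (size is illustrative).
stream = stream.shuffle(seed=42, buffer_size=10_000)
stream = stream.map(add_len)

# The library splits shards across DataLoader workers, which is where
# the near-linear scaling with worker count comes from.
loader = DataLoader(stream, batch_size=32, num_workers=4)

for batch in loader:
    # batch["text"] is a list of strings; batch["n_chars"] a tensor of ints.
    break
```

Using a named function instead of a lambda for the map step keeps the dataset picklable, which matters once multiple DataLoader workers are involved.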

These efficiency gains translate into tangible benefits for both research and production teams. Smaller memory footprints mean that experiments can run on consumer GPUs or even CPUs, democratizing access to state‑of‑the‑art models. In production, the ability to stream data in real time reduces storage costs and accelerates model updates, making continuous training pipelines more viable. HuggingFace’s community has already begun contributing adapters for third‑party data stores, expanding the ecosystem. As the field pushes toward multi‑terabyte corpora and multimodal learning, such streaming optimizations will be essential to keep training times and infrastructure costs in check.
