PrimeIntellect-ai/datasetstream

Datasetstream

A simple, performant dataset streaming server and client for tokenized webtext datasets.

Example

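import numpy as np
import torch
# DatasetClientIteratorSync comes from this repository's client code.

dataset_id = "openwebtext_train"
stream_url = f"http://localhost:8080/api/v1/datasets/{dataset_id}/stream"

# Running totals for the stream consumed so far.
count = 0
total_bytes_received = 0
with DatasetClientIteratorSync(stream_url, seed=42, batch_size=32, seq_len=1024) as iterator:
    tokens: np.ndarray
    for tokens in iterator:
        count += 1
        total_bytes_received += tokens.nbytes
        # Cast to int64 before handing the array to PyTorch.
        tokens = torch.from_numpy(tokens.astype(dtype=np.int64))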
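
The example above tallies `count` and `total_bytes_received`, which is enough to estimate stream throughput. Below is a minimal sketch of such a measurement. It assumes only that the iterator yields NumPy arrays. `fake_token_batches` and `measure_stream` are hypothetical names that are not part of this repository, and the random generator merely stands in for a running server.

```python
import time
from typing import Iterable, Iterator, Tuple

import numpy as np


def fake_token_batches(num_batches: int, batch_size: int, seq_len: int,
                       seed: int = 42) -> Iterator[np.ndarray]:
    """Hypothetical stand-in for the streaming iterator: yields random uint16 batches."""
    rng = np.random.default_rng(seed)
    for _ in range(num_batches):
        yield rng.integers(0, 50_000, size=(batch_size, seq_len), dtype=np.uint16)


def measure_stream(batches: Iterable[np.ndarray]) -> Tuple[int, int, float]:
    """Consume `batches` and return (batch count, total bytes, elapsed seconds)."""
    count = 0
    total_bytes_received = 0
    start = time.perf_counter()
    for tokens in batches:
        count += 1
        total_bytes_received += tokens.nbytes
    elapsed = time.perf_counter() - start
    return count, total_bytes_received, elapsed


if __name__ == "__main__":
    count, total_bytes, elapsed = measure_stream(fake_token_batches(10, 32, 1024))
    print(f"{count} batches, {total_bytes / 1e6:.2f} MB in {elapsed:.3f} s")
```

To measure a real stream, the `DatasetClientIteratorSync` iterator from the example could presumably be passed to `measure_stream` in place of the stand-in generator, since the function relies only on each item exposing `nbytes`.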
