v0.11.0 - Cuda support, mixed const/runtime tensors, and device rewrite
What's Changed
- AddInto by @Dimev in #256
- added 5d & 6d tensors by @M1ngXU in #283
- Remove phantom by @M1ngXU in #282
- remove tensor bound by @Dimev in #297
- Adding nightly to cargo-test by @JYudelson1 in #294
- Devices/Dyn dimensions refactor by @coreylowman in #304
- Add instructions for running the mnist example. by @infalmo in #310
- Removes Dyn. Use usize directly by @coreylowman in #315
- Making f32 default dtype for Tensor, updating examples/docstrings by @coreylowman in #316
- Only running gha on push by @coreylowman in #317
- Adding Unit and HasUnitType. Reducing bounds for Dtype by @coreylowman in #313
- Removing build_test_device. Using TestDevice everywhere by @coreylowman in #324
- Adding SampleTensor, Removing RandTensor/RandnTensor by @coreylowman in #327
- Removing usages of tensor aliases by @coreylowman in #328
- Moving intel-mkl stuff into sub module in build.rs by @coreylowman in #329
- Adding Cuda device and skeleton cuda kernel impls by @coreylowman in #322
- Implementing abs/exp/div/sum_to cuda kernels by @coreylowman in #331
- permute_to and broadcast_to cuda kernels by @coreylowman in #343
- Add cuda implementations for unary and binary tensor operations in #341 and #334 by @nkoppel in #346
- Using atomicAdd in binary op backwards to properly handle strides by @coreylowman in #350
- Resolve #352 and #347 by @nkoppel in #354
- Implement reshape cuda kernel (resolves #336) by @nkoppel in #356
- Add missing device generic in transformer test by @ViliamVadocz in #358
- Add select and gather cuda kernels. by @nkoppel in #359
- Upgrade to cudarc 0.6.0 by @coreylowman in #361
- Add tests for binary broadcasted add and fix bugs to allow them to pass. by @nkoppel in #357
- run GHA on pull_request by @coreylowman in #364
- matmul cuda kernels by @coreylowman in #342
- Adding dynamic example. by @Narsil in #368
- Add cuda kernels for min_to/max_to by @coreylowman in #370
- Adding dropout cuda kernel by @coreylowman in #372
- Adding ConstDim and ConstShape for tensor creation by @coreylowman in #373
- Fixing computation of lda/ldb/ldc with cblas by @coreylowman in #375
- Modify sum_to cuda kernel to not need atomic adds in backwards by @nkoppel in #367
- Simplifying `trait Conv2DKernel` and Cpu implementation by @coreylowman in #376
- (#344) Implement cuda kernels for optimizers by @nkoppel in #378
- Fix max_to and min_to edge case with negative zero by @ViliamVadocz in #380
- Add cuda kernels for conv2d by @coreylowman in #369
- Rework pool2d internals & add pool2d cuda kernels by @coreylowman in #384
- Implement Shape for arrays (#377) by @nkoppel in #385
- Efficient cuda kernels for reductions by @nkoppel in #382
- Improving compilation times of deeply nested const generic modules by @coreylowman in #391
- Fixing remainder of cuda tests & fixing cblas/cublas matmul with strides [1,1] by @coreylowman in #393
- Adding Cuda device usage to mnist example by @coreylowman in #396
- Adding GeLU operator (used in Gpt2) by @Narsil in #397
- Removing codecov from workflows/readme by @coreylowman in #403
- Reorganize tensor_ops, and add cuda_utils.cuh by @nkoppel in #398
- Some small optimizations for conv2d on cpu by @coreylowman in #404
- Removing Device generic from Gradients & optimizers by @coreylowman in #402
- Add ToDevice and OnDevice to simplify nn api (#388) by @nkoppel in #394
- Removes `ModuleBuilder`, Adds `BuildModule` & `BuildOnDevice` by @coreylowman in #405 (see the model-building sketch after this list)
- Enable multi-core matmul by @infalmo in #417
- Fix GELU CUDA kernel compilation by @ViliamVadocz in #409
- Adding nn.Embedding layer. by @Narsil in #406
- Removing defaults for Tensor Dtype & Device generic parameters by @coreylowman in #418
- Removing Default for optimizers & adding &M to constructors by @coreylowman in #422
- Adding runtime assertion in `try_binary_op` that shapes are equal by @coreylowman in #428
- Add boolean operations and choose. by @nkoppel in #415 (sketched after this list)
- Add TensorFrom trait to create tensors from both vectors and arrays. by @nkoppel in #414
- Adding nn builder structs, dtype generics, and remove device defaults. by @coreylowman in #433
- Upgrade to cudarc==0.7.0 and use alloc_async instead of alloc_zeros_async by @coreylowman in #440
- Add comparison tensor operations by @ViliamVadocz in #386
- Add synchronize method to Cuda device by @ViliamVadocz in #442
- f64 kernels by @coreylowman in #421
- Add stack tensors method by @coreylowman in #449
- cargo check cuda & run f64 tests in CI by @coreylowman in #447
- Fix bug in #451 by @nkoppel in #453
- Add more runtime shape checks by @coreylowman in #454
- Adding ReshapeTo::reshape_like by @coreylowman in #456
- Adding SampleTensor::sample_uniform_like and SampleTensor::sample_normal_like by @coreylowman in #457
- Improve examples (add Cuda) by @TimerErTim in #452
- Dataset iterators - adds batching, collating for iterators by @coreylowman in #462
- Fixing issue with to_device and broadcasted tensors by @coreylowman in #465
- Bump cudarc 0.7.2 by @coreylowman in #466
- Adding index out of bounds checks to select/gather kernels by @coreylowman in #467
- Rename to `add_dim`. by @infalmo in #471
- impl BuildModule for ZeroSizedModule by @coreylowman in #470
- Adds TensorCollection by @coreylowman in #469
- Fixing cargo doc warnings by @coreylowman in #473
- Using `--gpu-architecture native` with nvcc by @coreylowman in #474
- using TensorFromVec for OneHotEncode and Arange by @coreylowman in #477
- Small batchnorm optimizations by @coreylowman in #478
- nvcc: fixed type bug by @M1ngXU in #480
- Adds fast_alloc feature and binary kernel optimizations by @coreylowman in #481
- Adding some "benchmarking" scripts by @coreylowman in #483
- Add try_forward and try_forward_mut to Module and ModuleMut. by @nkoppel in #482
- Optimizing cpu kernels for reductions by @coreylowman in #484
- Using alloc_zeros_async and memset_zeros for cuda by @coreylowman in #489
- Making Conv2D unbiased by default, and adding Bias2D module by @coreylowman in #494
- Using image/filter stride in cuda kernel for conv by @coreylowman in #495
- bump cudarc version by @coreylowman in #498
- Adding attention_reshape (inference only) kernels. by @Narsil in #497
- Adding lifetime to gat in ExactSizeDataset by @coreylowman in #501
- added stack to device trait bound by @M1ngXU in #502
- Allowing `nn::Embedding` to be dynamic in shape. by @Narsil in #503
- Adding `UnbiasedLinear` (linear without bias). by @Narsil in #504
- Making K dimension of matmul dynamic. by @Narsil in #505
- Tensors the whole way down by @coreylowman in #508
- Sorting tapes by unique_id to ensure proper operation order by @coreylowman in #510
- cudarc 0.8.0 by @coreylowman in #512
- Adding axpy tensor op & ModelEMA module walker by @coreylowman in #511
- [Spring cleaning] Removes GradientTape & impl Clone for Gradients by @coreylowman in #514
- Optimizer now takes &Gradients by @coreylowman in #515
- Adding `tensor.trace_with(grads)` by @coreylowman in #517
- Adding `model.zero_grads(&mut gradients)` by @coreylowman in #518
- Adding gradient accumulation example by @coreylowman in #519 (a training sketch follows this list)
- Don't clone tensor data when permuting or broadcasting by @nkoppel in #522
- Use chunk_sum in cuda kernels of backward binary operations. by @nkoppel in #520
- Adding `no-std` feature flag, matrixmultiply/threading behind feature flag. numpy no longer default by @coreylowman in #528
- Adding `model.alloc_grads()`, removing `Default` for `Gradients` by @coreylowman in #524
- feat: adds BatchNorm1D by @kstavro in #513
- Safetensors support. by @Narsil in #381
- Fixing bool tests with safetensors (serde compatibility) by @coreylowman in #529
- Hotfixing the safetensors impl. by @Narsil in #531
- Adds `Tensor::concat` by @coreylowman in #530 (see the tensor sketch after this list)
- Changing stack to be method of array/vec instead of device by @coreylowman in #533
- Easier preprocessing by @coreylowman in #534
- Create tensor from usize (for e.g. select) on any Device by @M1ngXU in #535
- Handle path for TensorVisitors using a TensorViewer by @nkoppel in #538
- Adding nice error message when MHA num heads doesn't divide K/H by @coreylowman in #542
- Moving Reshape to use stable compile time asserts by @coreylowman in #543
- Finalizing nn exports by @coreylowman in #544
- Moving src/unique_id & src/gradients into src/tensor by @coreylowman in #545
- Docs update by @coreylowman in #549
- Bump to cudarc 0.9.0 by @coreylowman in #551
- chore: remove `cblas` feature in favor of `intel-mkl` by @Alexandcoats in #552
- Adds ReduceShapeSelf::LastAxis to Shape by @coreylowman in #555
- Letting batch & seq dimensions of matmul be dyn by @coreylowman in #556
- Moving transformers to stable, and accepting dyn dimensions for transformer input by @coreylowman in #557
- Reshape skip kernels with a contiguous tensor by @coreylowman in #558
- Removing double computation of mean in normalize by @coreylowman in #559
- feat: add realize shape by @Alexandcoats in #561
- Querying nvidia-smi for compute capability instead of native by @coreylowman in #564
- Adding features to cargo doc on ci by @coreylowman in #569
- Updating 01-tensor by @coreylowman in #570
- Fixing no-std support by @coreylowman in #571
- Removes .trace_into(), .trace() now requires Gradients object by @coreylowman in #566
- Adds `trait Trace` and generic training example by @coreylowman in #572
- matrixmultiply optional. Adds `cpu-seq-matmul`, `cpu-par-matmul`, `cpu-mkl-matmul` features by @coreylowman in #576
- Allow Modules to be constructed with the TensorCollection trait by @nkoppel in #548
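A few hedged sketches of the new APIs follow. They are illustrative only (made-up layer sizes and data), not code lifted from the linked PRs.

With the device rewrite (#304, #322) and the `BuildModule`/`BuildOnDevice` traits (#405, #433), a module is built for a specific device, and the same code runs on `Cpu` or `Cuda`:

```rust
use dfdx::prelude::*;

fn main() {
    // Pick a device; with the `cuda` feature this could be `Cuda` instead (#322, #396).
    let dev: Cpu = Default::default();

    // Architectures are builder types; the dtype is an explicit generic (#418, #433).
    type Model = (Linear<4, 8>, ReLU, Linear<8, 2>);
    let model = dev.build_module::<Model, f32>();

    // SampleTensor (#327) replaces RandTensor/RandnTensor.
    let x: Tensor<Rank1<4>, f32, _> = dev.sample_normal();
    let y = model.forward(x);
    println!("{:?}", y.array());
}
```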
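Tensor creation and combination also changed: `TensorFrom` (#414) builds tensors from arrays, `stack` moved from the device onto arrays/vecs of tensors (#449, #533), and `Tensor::concat` joins along the first dimension (#530). A minimal sketch, assuming `concat` yields a runtime (`usize`) first dimension:

```rust
use dfdx::prelude::*;

fn main() {
    let dev: Cpu = Default::default();

    // TensorFrom (#414): build a tensor directly from a nested array.
    let a: Tensor<Rank2<2, 3>, f32, _> = dev.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]);
    let b: Tensor<Rank2<2, 3>, f32, _> = dev.sample_uniform();

    // stack is now a method on an array of tensors instead of the device (#533).
    let stacked: Tensor<Rank3<2, 2, 3>, f32, _> = [a.clone(), b.clone()].stack();

    // concat joins along the first dimension (#530); the result's first
    // dimension is a runtime usize.
    let cat = a.concat(b);
    assert_eq!(cat.shape().0, 4);
    println!("{:?}", stacked.array());
}
```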
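The gradients rework (#514, #515, #517, #518, #524, #566) makes the `Gradients` object explicit: allocate it with `model.alloc_grads()`, feed it into `.trace(...)`, pass `&Gradients` to the optimizer, and clear it with `model.zero_grads(...)`. A sketch of a gradient-accumulation loop in that style (random data stands in for a real dataset):

```rust
use dfdx::prelude::*;

fn main() {
    let dev: Cpu = Default::default();
    type Model = (Linear<4, 8>, ReLU, Linear<8, 1>);
    let mut model = dev.build_module::<Model, f32>();

    // Optimizers take &M in their constructor (#422) and &Gradients in update (#515).
    let mut opt = Sgd::new(&model, SgdConfig::default());

    // Default for Gradients was removed; allocate explicitly (#524).
    let mut grads = model.alloc_grads();

    for _step in 0..4 {
        // Accumulate over two micro-batches before each update (#519).
        for _micro in 0..2 {
            let x: Tensor<Rank2<8, 4>, f32, _> = dev.sample_normal();
            let y: Tensor<Rank2<8, 1>, f32, _> = dev.sample_normal();
            // .trace() now takes the Gradients object (#517, #566).
            let loss = mse_loss(model.forward_mut(x.trace(grads)), y);
            grads = loss.backward();
        }
        opt.update(&mut model, &grads).unwrap();
        model.zero_grads(&mut grads);
    }
}
```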
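Finally, the comparison ops (#386) and boolean `choose` (#415): comparisons produce a `bool` tensor, which `choose` uses to select elementwise between two tensors. This sketch assumes the comparison methods borrow both operands; treat the exact signatures as unverified:

```rust
use dfdx::prelude::*;

fn main() {
    let dev: Cpu = Default::default();
    let a: Tensor<Rank1<4>, f32, _> = dev.tensor([1.0, 2.0, 3.0, 4.0]);
    let b: Tensor<Rank1<4>, f32, _> = dev.tensor([4.0, 3.0, 2.0, 1.0]);

    // Comparison ops (#386) return a boolean mask.
    let mask = a.lt(&b);

    // choose (#415) takes from `a` where the mask is true, else from `b`,
    // which computes an elementwise minimum here.
    let min = mask.choose(a, b);
    assert_eq!(min.array(), [1.0, 2.0, 2.0, 1.0]);
}
```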
New Contributors
- @Dimev made their first contribution in #256
- @JYudelson1 made their first contribution in #294
- @Narsil made their first contribution in #368
- @TimerErTim made their first contribution in #452
- @kstavro made their first contribution in #513
- @Alexandcoats made their first contribution in #552
- @ViliamVadocz made their first contribution in #358
Full Changelog: v0.10.0...v0.11.0