Issues: Lightning-AI/pytorch-lightning
Issues list
StreamingDataset not working in multi-gpu environment
bug
Something isn't working
repro needed
The issue is missing a reproducible example
#20140
opened Jul 30, 2024 by
davidpicard
training time increase epoch by epoch
bug
Something isn't working
help wanted
Open to be worked on
performance
repro needed
The issue is missing a reproducible example
ver: 2.2.x
#20076
opened Jul 12, 2024 by
Eric-Lin-CVTE
Dataloader with >0 workers when using DDP causes a crash
bug
Something isn't working
data handling
Generic data-related topic
repro needed
The issue is missing a reproducible example
ver: 2.2.x
#20054
opened Jul 5, 2024 by
alexanderswerdlow
trainer.test() with given checkpoint logs last epoch instead of checkpoint epoch
bug
Something isn't working
help wanted
Open to be worked on
repro needed
The issue is missing a reproducible example
#20052
opened Jul 5, 2024 by
markussteindl
The training process will stop unexpectedly
bug
Something isn't working
needs triage
Waiting to be triaged by maintainers
repro needed
The issue is missing a reproducible example
#19920
opened May 30, 2024 by
5huanghuai
MisconfigurationException
bug
Something isn't working
repro needed
The issue is missing a reproducible example
#19516
opened Feb 23, 2024 by
moghadas76
PermissionError with ModelCheckpoints
bug
Something isn't working
callback: model checkpoint
repro needed
The issue is missing a reproducible example
#19397
opened Feb 2, 2024 by
aaprasad
Deepspeed Stage 3 crashes Lightning trainer
bug
Something isn't working
repro needed
The issue is missing a reproducible example
strategy: deepspeed
ver: 2.1.x
#19096
opened Nov 30, 2023 by
m-harmonic
BatchSizeFinder throws KeyError: 'limit_eval_batches'
bug
Something isn't working
duplicate
This issue or pull request already exists
help wanted
Open to be worked on
repro needed
The issue is missing a reproducible example
tuner
ver: 2.1.x
#18985
opened Nov 10, 2023 by
drusmanbashir
DDP + static graph can result in garbage data returned by all_gather
3rd party
Related to a 3rd-party
bug
Something isn't working
repro needed
The issue is missing a reproducible example
ver: 2.0.x
#18872
opened Oct 26, 2023 by
mooninrain
LightningModule.to_torchscript() does not transfer check_inputs to correct device
bug
#18824
opened Oct 19, 2023 by
pfeatherstone
manual_backward and .backward() have different behaviour.
bug
Something isn't working
repro needed
The issue is missing a reproducible example
ver: 2.0.x
#18740
opened Oct 6, 2023 by
roedoejet
ModelCheckpoint Doesn't Delete Old Best Checkpoints When Resuming Training
bug
Something isn't working
callback: model checkpoint
repro needed
The issue is missing a reproducible example
ver: 1.9.x
#18687
opened Oct 2, 2023 by
danielzeng-gt
Model trained with Deepspeed stage 3 shape not match when loading
bug
Something isn't working
repro needed
The issue is missing a reproducible example
strategy: deepspeed
ver: 2.0.x
#18648
opened Sep 26, 2023 by
yinweisu
CombinedLoader takes a long time when num_workers > 0
bug
#18584
opened Sep 19, 2023 by
johnathanchiu
Can't run the pytorch lightning program packaged with pyinstaller.
3rd party
Related to a 3rd-party
bug
Something isn't working
help wanted
Open to be worked on
repro needed
The issue is missing a reproducible example
ver: 1.9.x
#18492
opened Sep 6, 2023 by
laogonggong847
Model parameters don't get updated after upgrading from 1.1.4 to 2.0.7
bug
Something isn't working
repro needed
The issue is missing a reproducible example
ver: 2.0.x
ver: 2.1.x
#18346
opened Aug 20, 2023 by
yqin-falling-stars
Iterable dataset + DDP + SLURM + MultiGPU : Training stuck - error: The client socket has failed to connect to [ip6-localhost]:24355 (errno: 99 - Cannot assign requested address).
bug
Something isn't working
repro needed
The issue is missing a reproducible example
ver: 2.0.x
#18338
opened Aug 18, 2023 by
sri9s
load_from_checkpoint Right After fit Got FileNotFound Error
bug
Something isn't working
repro needed
The issue is missing a reproducible example
ver: 1.9.x
#18328
opened Aug 16, 2023 by
donglihe-hub
Incorrect batch progress saved in checkpoint at every_n_train_steps
bug
Something isn't working
help wanted
Open to be worked on
loops
Related to the Loop API
repro needed
The issue is missing a reproducible example
ver: 1.9.x
ver: 2.1.x
#18060
opened Jul 11, 2023 by
shuaitang5
Running out of memory when resuming the training from a checkpoint
bug
Something isn't working
checkpointing
Related to checkpointing
performance
repro needed
The issue is missing a reproducible example
ver: 2.0.x
#18059
opened Jul 11, 2023 by
RJPenic
RuntimeError: CUDA error: unspecified launch failure
bug
Something isn't working
repro needed
The issue is missing a reproducible example
ver: 2.0.x
#18039
opened Jul 10, 2023 by
Hanminghao
self.log(.., on_epoch=True) runs extremely slow
bug
Something isn't working
logging
Related to the `LoggerConnector` and `log()`
performance
repro needed
The issue is missing a reproducible example
ver: 2.0.x
#17988
opened Jul 4, 2023 by
LinWeizheDragon
Expected all tensors to be on the same device
bug
Something isn't working
repro needed
The issue is missing a reproducible example
ver: 2.0.x
#17851
opened Jun 16, 2023 by
whatisslove11
IsADirectoryError: [Errno 21] Is a directory: '/content'
bug
Something isn't working
repro needed
The issue is missing a reproducible example
ver: 2.1.x
ver: 2.2.x
#17730
opened May 31, 2023 by
rashidasohail