This project aims to improve YOLO's performance at segmenting surgical instruments in real-time surgical video. It is an implementation of the VQGAN-based version of BigDatasetGAN: I rearranged some of the code from Taming Transformers and implemented a segmentation head for VQGAN based on the segmentation head from BigDatasetGAN.
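As a rough illustration of how a BigDatasetGAN-style segmentation head attaches to generator features, here is a minimal sketch. The layer sizes, channel counts, and class names are assumptions for illustration, not the actual implementation in this repo:

```python
# Hypothetical sketch of a small convolutional segmentation head that
# consumes VQGAN decoder feature maps and emits per-pixel class logits.
# Channel sizes (128 in, 2 classes) are assumptions, not this repo's values.
import torch
import torch.nn as nn

class SegmentationHead(nn.Module):
    def __init__(self, in_channels=128, num_classes=2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, num_classes, kernel_size=1),  # per-pixel logits
        )

    def forward(self, features):
        return self.head(features)

head = SegmentationHead()
feats = torch.randn(1, 128, 64, 64)   # dummy stand-in for a VQGAN feature map
logits = head(feats)
print(logits.shape)  # torch.Size([1, 2, 64, 64])
```

The 3x3 convolution preserves spatial resolution, so the logits align pixel-for-pixel with the feature map and can be upsampled to the image resolution for mask supervision.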
I am currently working on improving the image and segmentation mask quality by enhancing the data quality and using transfer learning. The idea is to first train VQGAN on a large subset of the SurgVu dataset (~900k images of surgical instruments used on porcine tissue), then fine-tune on datasets of instruments used to operate on human tissue: the medium-sized SARAS-MEAD dataset (~23k images), followed by a smaller private dataset specific to Transorbital Robotic Surgery (TORS, ~2k images).
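The staged schedule above (large pretraining set, then two successively smaller fine-tuning sets) can be sketched as below. The model, loss, batch contents, and learning rates are all stand-ins for illustration; the real pipeline trains VQGAN with its own losses and data loaders:

```python
# Hypothetical sketch of staged transfer learning: pretrain on SurgVu,
# fine-tune on SARAS-MEAD, then fine-tune again on TORS, with a smaller
# learning rate at each stage. Everything here is a toy stand-in.
import torch
import torch.nn as nn

def train_stage(model, dataset_name, lr, steps=3):
    # dataset_name is only for readability; a real stage would build a
    # DataLoader over that dataset instead of random tensors.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        x = torch.randn(4, 16)          # stand-in for a batch of images
        loss = model(x).pow(2).mean()   # stand-in for the VQGAN loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

model = nn.Linear(16, 16)               # toy stand-in for VQGAN
stages = [("SurgVu", 1e-3), ("SARAS-MEAD", 1e-4), ("TORS", 1e-5)]
for name, lr in stages:                 # same weights carried across stages
    model = train_stage(model, name, lr)
```

Reusing the same weights across stages is the key point: each fine-tuning stage starts from the previous checkpoint rather than from scratch.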
The VQDatasetGAN model generated these images at 256 x 256 resolution; they were then upsampled to 512 x 512.
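The 256 x 256 to 512 x 512 step can be done with standard interpolation; a minimal sketch is below, assuming bilinear resampling (the actual resampling method used here is an assumption):

```python
# Sketch: upsample a generated 256x256 image tensor to 512x512.
# Bilinear interpolation is assumed; nearest-neighbor would be used
# instead for segmentation masks to keep labels discrete.
import torch
import torch.nn.functional as F

imgs = torch.rand(1, 3, 256, 256)   # stand-in for a generated image batch
up = F.interpolate(imgs, size=(512, 512), mode="bilinear", align_corners=False)
print(up.shape)  # torch.Size([1, 3, 512, 512])
```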



