Implementation of NFNets from the paper "ConvNets Match Vision Transformers at Scale" by Google DeepMind, and an exploration of what radically stronger vision encoder modules could mean for advanced multi-modal learning. Note that, as with all of my paper implementations, this one does not include any training logic; you are welcome to implement that to your heart's desire (a minimal training-loop sketch appears after the install command below). I will not, because data preprocessing is my weakness.
Finally, this paper is implemented in part with Zeta, the ultra-simple AI framework I personally built over hundreds of paper implementations. It saves time by turning complex AI modules into modular, reusable lego blocks, so I stop wasting hundreds of hours on repetitive work like implementing Flash Attention from scratch and spend my time only on creating things never before created. Now, shall we begin?
```bash
pip install convnet
```
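A minimal usage sketch follows. The import path (`convnet`), the class name (`NFNet`), and the constructor arguments are assumptions about the package's API rather than confirmed names; adjust them to match the actual module.

```python
import torch

# Hypothetical API — the actual import path and class name in the
# `convnet` package may differ; treat this as an illustrative sketch.
from convnet import NFNet

# Instantiate an NFNet-style classifier; the constructor argument here
# is an assumption, not the package's confirmed signature.
model = NFNet(num_classes=1000)

# A dummy batch: one 224x224 RGB image.
x = torch.randn(1, 3, 224, 224)

# The forward pass should produce class logits of shape (1, num_classes).
logits = model(x)
print(logits.shape)
```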
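Since the repo leaves training to the reader, here is a minimal sketch of what such a loop could look like. The `NFNet` import is the same assumption as above, and the loss, optimizer, and gradient clipping are illustrative stand-ins rather than the paper's actual recipe (the NFNet papers rely on adaptive gradient clipping, AGC, which plain norm clipping only approximates here).

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical model import — swap in the repo's actual NFNet class.
from convnet import NFNet

def train(model: nn.Module, loader: DataLoader, epochs: int = 1) -> None:
    """Bare-bones supervised loop; loss, optimizer, and learning rate are
    illustrative assumptions, not the paper's training recipe."""
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            logits = model(images)
            loss = criterion(logits, labels)
            loss.backward()
            # The NFNet recipe uses adaptive gradient clipping (AGC);
            # plain norm clipping is used here as a simple stand-in.
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()

# Toy data so the sketch runs end to end; replace with a real dataset.
data = TensorDataset(torch.randn(8, 3, 224, 224), torch.randint(0, 1000, (8,)))
train(NFNet(num_classes=1000), DataLoader(data, batch_size=4))
```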
License: MIT