First of all, thank you for your valuable research.
I'm writing to seek clarification regarding the attention map generation process described in your paper. Specifically, in Appendix C.10 you explain the methodology for obtaining attention maps as follows:
Based on this description, I would expect the resulting attention maps to be relatively coarse heatmaps ($\sqrt{n} \times \sqrt{n}$). However, the visualizations presented in Figure 20 appear remarkably fine-grained:
I suspect I may be missing some implementation detail or misunderstanding part of the methodology. Could you please clarify how you transition from the described extraction process to the high-resolution attention maps shown in your results?
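For concreteness, my current guess is that the $n$ patch-attention weights are reshaped to a $\sqrt{n} \times \sqrt{n}$ grid and then bilinearly upsampled to the input image resolution before overlaying. Here is a minimal numpy sketch of that guess (the `upsample_bilinear` helper and the 14×14 → 224×224 sizes are my own assumptions, not taken from the paper):

```python
import numpy as np

def upsample_bilinear(attn, out_h, out_w):
    """Bilinearly upsample a coarse (h, w) attention map to (out_h, out_w)."""
    h, w = attn.shape
    # Map each output pixel back to fractional coordinates on the coarse grid
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]  # vertical interpolation weights
    wx = (xs - x0)[None, :]  # horizontal interpolation weights
    top = attn[y0][:, x0] * (1 - wx) + attn[y0][:, x1] * wx
    bot = attn[y1][:, x0] * (1 - wx) + attn[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

# Example: a 14x14 map (n = 196 patches) upsampled to a 224x224 overlay
coarse = np.random.rand(14, 14)
fine = upsample_bilinear(coarse, 224, 224)
print(fine.shape)  # (224, 224)
```

Is this roughly what you do, or is there an additional step (e.g. attention rollout across layers, or gradient weighting) that produces the sharper structure visible in Figure 20?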
Alternatively, if you could open-source the code used to generate these visualizations, it would be extremely helpful for the community in understanding your implementation.
Thank you for your time and for sharing your work with the community.