Update task summary #21067
Conversation
The documentation is not available anymore as the PR was closed or merged.
Force-pushed from 27db7fa to d1cbf2f, then from d1cbf2f to 16cf536.
Ok, I'm finally finished with the first draft (it took a bit longer to learn some models I wasn't familiar with)! I'd appreciate a general review of the scope of this page to make sure we're aligned (i.e., are some sections too in-depth, are others not explained well enough?). Thanks in advance @sgugger @MKhalusova ! 🥹 Afterward, I'll ping one of our audio and computer vision experts for a more in-depth review of those sections 🙂
The structure looks good to me, thanks for adding this tutorial. It would indeed be valuable to have a vision expert and an audio expert go through the tutorial.
This is a massive effort and will likely become a super useful doc once it's merged! Thank you for writing it! I feel like we should set expectations about the reader's expertise level for each section and provide links to resources where they can learn the basics (just like the link to the course). Most sections require some familiarity with the subject, and it is common for folks to have expertise in some modalities but not in all of them.
Thanks for the feedback, I added some images to go along with the text! @NielsRogge, would you mind reviewing the computer vision section? This guide is a high-level overview, and the goal is to help users understand how a certain task is solved by a model. Please feel free to let me know if it's too detailed, not detailed enough, or if I got something wrong! Also, if you know of a good beginner's resource for computer vision we can link to, that'd be great as well to set expectations for the reader. Thanks! 👍 @sanchit-gandhi, if you could do the same with the audio section, that'd be awesome. Thank you! 👍
Very cool! I would maybe just go one level higher on the pre-training details for W2V2 (as these are pretty conceptually difficult 😅).
I also think it's fine to expand a bit on how the classification heads work. This is the key point here, so maybe double down and make sure our description of it is very clear and precise.
Also, our convention has been to write the model as Wav2Vec2, which differs from facebook's wav2vec 2.0 and the wav2vec2 naming used here! Perhaps we could stay consistent with our current docs and update to Wav2Vec2?
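To make the classification-head point above concrete, here is a minimal sketch of how such a head attaches to the Wav2Vec2 encoder. The checkpoint name, dummy input, and two-label setup are illustrative only; the library's actual `Wav2Vec2ForSequenceClassification` may pool and project differently.

```python
import torch
from torch import nn
from transformers import AutoModel

# The encoder turns raw audio into one hidden state per ~20 ms frame.
encoder = AutoModel.from_pretrained("facebook/wav2vec2-base")

# Hypothetical head: mean-pool the frame-level hidden states into a single
# vector, then map that vector to class logits with a linear layer.
num_labels = 2  # illustrative, e.g. "speech" vs. "not speech"
classifier = nn.Linear(encoder.config.hidden_size, num_labels)

waveform = torch.randn(1, 16000)  # 1 second of dummy audio at 16 kHz
hidden_states = encoder(waveform).last_hidden_state  # (batch, frames, hidden)
pooled = hidden_states.mean(dim=1)                   # (batch, hidden)
logits = classifier(pooled)                          # (batch, num_labels)
```

Mean pooling is only one choice (attention pooling is another), which is exactly the kind of detail the guide could spell out.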
Great work! Kudos for undertaking this effort!
Thanks for your work on this. In a follow-up PR, it would be great to add a section on multimodal models to explain, for instance, how CLIP works.
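As a pointer for that follow-up, this is the standard zero-shot usage pattern such a section would likely walk through, adapted from the usual transformers CLIP example (the checkpoint and image URL are the common demo ones, not something specified in this PR):

```python
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# CLIP embeds the image and each caption into a shared space; the logits
# are the scaled similarities between the image and each caption.
inputs = processor(
    text=["a photo of two cats", "a photo of a dog"],
    images=image,
    return_tensors="pt",
    padding=True,
)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)  # caption probabilities
```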
Thanks for the write-up!
* first draft of audio section
* make style
* first draft of computer vision section
* add convnext and encoder tasks
* finish up nlp tasks
* minor edits
* add arch images, more edits
* fix image links
* apply sanchit feedback
* model naming convention
* apply niels vit feedback
* replace detr for segmentation with mask2former
* apply feedback
* apply feedback
This is the second part of updating the task summary to be more conceptual. After part 1's brief introduction and background on the tasks Transformers can solve, this PR is a bit more advanced and digs deeper into explaining how Transformer models solve these tasks.
To-do: