Added docs links to supported tasks #257

Merged 4 commits on Aug 22, 2023

Conversation

josephrocca
Contributor

@josephrocca josephrocca commented Aug 22, 2023

Issue: #134 (comment)

I linked to the feature-extraction example for sentence-similarity - relevant issues:

So, for now at least, can I add an example like this to the docs for feature-extraction?

import { pipeline } from '@xenova/transformers';

// Load the embedding model once and reuse it for every input.
let extractor = await pipeline('feature-extraction', 'Xenova/e5-large-v2');
let dotProduct = (vec1, vec2) => vec1.reduce((sum, val, i) => sum + val * vec2[i], 0);

// Mean-pool and normalize so each result is a single unit-length vector.
let passage1 = await extractor('passage: She likes carrots and celery.', { pooling: 'mean', normalize: true });
let passage2 = await extractor('passage: This is a good calculus guide.', { pooling: 'mean', normalize: true });
let query = await extractor('query: Taking care of rabbits', { pooling: 'mean', normalize: true });

// The raw vector values live on the .data property of each result.
let similarity1 = dotProduct(query.data, passage1.data);
let similarity2 = dotProduct(query.data, passage2.data);
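The snippet above scores with a plain dot product. That works because `normalize: true` is expected to return unit-length vectors, and for unit-length vectors the dot product equals the cosine similarity. A minimal plain-JavaScript sketch of that equivalence, using made-up toy vectors in place of real model output (the helper names `dot`, `cosine`, and `toUnit` are illustrative, not library functions):

```javascript
// Plain-JS helpers; no library calls are involved.
const dot = (a, b) => a.reduce((sum, v, i) => sum + v * b[i], 0);
const cosine = (a, b) => dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
const toUnit = (a) => {
  const length = Math.sqrt(dot(a, a));
  return a.map((v) => v / length);
};

// Hypothetical toy vectors standing in for query.data and passage1.data.
const rawQuery = Float32Array.from([3, 4, 0]);
const rawPassage = Float32Array.from([4, 3, 0]);

// Cosine similarity of the raw vectors...
const cosineOfRaw = cosine(rawQuery, rawPassage);
// ...matches the dot product of their unit-length versions (up to float32 rounding).
const dotOfUnit = dot(toUnit(rawQuery), toUnit(rawPassage));
```

If the vectors were not normalized, the two scores would generally differ, so the dot-product shortcut should only be used together with normalization.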

@xenova
Collaborator

xenova commented Aug 22, 2023

Amazing! Thanks so much 🤗. Could you also update 5_supported-tasks.snippet?

Afterwards I'll generate a preview for it

@josephrocca
Contributor Author

@xenova Done 👍

Also, on top of the above sentence-similarity example, where do you think is the best place to add examples of popular workflows that don't currently fit into a pipeline? I'm mainly thinking about this CLIP (separate image/text) example:

#227 (comment)

(Thought: Maybe there's a way to simply use the text and image parts as separate feature-extraction pipelines?)

@xenova
Collaborator

xenova commented Aug 22, 2023

> So, for now at least, can I add an example like this to the docs for feature-extraction?

Hmm, I think I'd like to keep the code snippets to only use the pipeline function (and avoid pre- and post-processing needed by the user). But as you identified, there is technically no sentence-similarity pipeline (even though the functionality does exist). Perhaps we can just add the sentence-similarity (and even embeddings) pipelines to transformers.js before transformers 😅

> Also, on top of the above sentence-similarity example, where do you think is the best place to add examples of popular workflows that don't currently fit into a pipeline? I'm mainly thinking about this CLIP (separate image/text) example:

I put those examples here and here, but I do agree that since it's quite a popular use-case, it might be worth creating a tutorial/guide for it. Same for other embeddings.

@josephrocca
Contributor Author

Can I add a link to the available models too? E.g. something like this:

[image]

Where the (models) link is:

https://huggingface.co/models?pipeline_tag=fill-mask&library=transformers.js&sort=trending

(If yes, and you prefer a different place/format/link-text/etc. let me know)
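The link above follows a regular pattern, so the same kind of URL can be produced for any task. A small sketch, assuming only the three query parameters visible in that link (`pipeline_tag`, `library`, `sort`); the helper name `modelsUrl` is hypothetical and this is not necessarily how the links in the PR were generated:

```javascript
// Build a Hugging Face Hub model-listing URL for one pipeline tag.
// Parameter names are taken from the example link; nothing else is assumed.
function modelsUrl(task) {
  const url = new URL('https://huggingface.co/models');
  url.searchParams.set('pipeline_tag', task);
  url.searchParams.set('library', 'transformers.js');
  url.searchParams.set('sort', 'trending');
  return url.toString();
}

const fillMaskUrl = modelsUrl('fill-mask');
const featureExtractionUrl = modelsUrl('feature-extraction');
```

Calling `modelsUrl('fill-mask')` reproduces the link quoted above.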

Relevant:

@xenova
Collaborator

xenova commented Aug 22, 2023

> Can I add a link to the available models too? E.g. something like this:

That's a great idea! Yes please! 🤗

> If yes, and you prefer a different place/format/link-text/etc. let me know

I'm not too picky/bothered :) I don't think it's too confusing or anything.

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Aug 22, 2023

The documentation is not available anymore as the PR was closed or merged.

@josephrocca
Contributor Author

josephrocca commented Aug 22, 2023

> Yes please!

Done!

> Hmm, I think I'd like to keep the code snippets to only use the pipeline function (and avoid pre- and post-processing needed by the user)

This seems like a bad idea imo if it's at the cost of the user/dev experience. I know I'd definitely have benefited from a code snippet like this. Is this just a mild preference, or something you're quite sure about? I definitely prefer that docs examples are as useful as possible to newbies. The other end of the spectrum is a very "technical" list of snippets/facts (parameter types, return values, etc.) - things that don't really help the users who are in need of the most help - the newbies who are just trying to get something working as a starting point.

As a user I definitely would have benefited from having an example like the one I gave. I've created gists of minimal examples like that that I can refer back to, and I think every user would have to end up repeating that work. Cosine vs dot? pooling? normalization? passage1 isn't a vector? ohh passage1.data. etc. - this can add up to 30 mins of work or more, which isn't a great experience. If the docs contain simple, working snippets for common tasks then it's such a breath of fresh air - all the technical data on parameter/return value types etc. should be secondary to that (again, in order to prioritise helping newbies get started quickly).
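To make the "pooling? normalization?" questions concrete, here is a conceptual sketch of what mean pooling followed by L2 normalization does to per-token embeddings stored as a flat array. It is an illustration only, not the library's implementation (which may, for instance, also account for padding tokens); `meanPool`, `l2Normalize`, and the toy data are hypothetical:

```javascript
// Average a flat [tokens x hidden] array into one vector of length `hidden`.
function meanPool(data, [tokens, hidden]) {
  const pooled = new Float32Array(hidden);
  for (let t = 0; t < tokens; t++) {
    for (let h = 0; h < hidden; h++) {
      pooled[h] += data[t * hidden + h] / tokens;
    }
  }
  return pooled;
}

// Scale a vector so that its Euclidean length is 1.
function l2Normalize(vec) {
  const length = Math.sqrt(vec.reduce((sum, v) => sum + v * v, 0));
  return vec.map((v) => v / length);
}

// Toy input: 2 tokens, 3 hidden dimensions, laid out row by row.
const tokenData = Float32Array.from([1, 2, 3, 3, 2, 1]);
const pooled = meanPool(tokenData, [2, 3]); // one averaged vector
const embedding = l2Normalize(pooled);      // same direction, unit length
```

The result is a single fixed-size vector per input, which is what makes a direct dot-product comparison between two inputs possible.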

> Perhaps we can just add the sentence-similarity (and even embeddings) pipelines to transformers.js before transformers

Even if the sentence-similarity pipeline does this behind the scenes, I think the feature-extraction pipeline should still have an example like this since it's such a common use case. A dot product is as much post-processing as an addition/multiplication - i.e. this example is not super specialised/unique.

Worth noting also that sometimes the pre-existing pipelines don't quite fit the use case - e.g. I may have some existing vectors, and some text (instead of just text pairs), or I may want to save the vectors as well as the similarity scores, rather than just getting a similarity score. Or I may want to compare features across modalities like with CLIP. IIUC, these are the sorts of things people will use the feature-extraction pipeline for, and so it makes sense to give them examples of basic stuff like checking vector similarity.
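One of those cases, comparing a new query against vectors that already exist while keeping both the vectors and the scores, could look roughly like this. A plain-JavaScript sketch with hand-written two-dimensional vectors standing in for stored embeddings; `rankBySimilarity` and `similarity` are hypothetical helper names, and the vectors are assumed to be normalized already:

```javascript
// Dot product; equals cosine similarity when both vectors are unit length.
const similarity = (a, b) => a.reduce((sum, v, i) => sum + v * b[i], 0);

// Return a new array of the stored items, each with a score, best match first.
// The original items (and their vectors) are left untouched.
function rankBySimilarity(queryVector, stored) {
  return stored
    .map((item) => ({ ...item, score: similarity(queryVector, item.vector) }))
    .sort((x, y) => y.score - x.score);
}

// Hypothetical previously saved vectors (toy values, not real model output).
const stored = [
  { text: 'passage: This is a good calculus guide.', vector: [1, 0] },
  { text: 'passage: She likes carrots and celery.', vector: [0.6, 0.8] },
];

const ranked = rankBySimilarity([0, 1], stored);
```

Each entry in `ranked` still carries its `vector`, so the embeddings can be saved or reused without running the model again.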

Apologies for the wall of text! 😅

@xenova
Collaborator

xenova commented Aug 22, 2023

> Is this just a mild preference, or something you're quite sure about?

Mild preference :) If something is better for the dev experience, then I'll do that!

> I definitely prefer that docs examples are as useful as possible to newbies. The other end of the spectrum is a very "technical" list of snippets/facts (parameter types, return values, etc.) - things that don't really help the users who are in need of the most help - the newbies who are just trying to get something working as a starting point.

Agreed, though I would say that the /api/pipelines section is meant to have those technical details, while /pipelines shouldn't (it should be high-level).

> As a user I definitely would have benefited from having an example like the one I gave. I've created gists of minimal examples like that that I can refer back to, and I think every user would have to end up repeating that work. Cosine vs dot? pooling? normalization? passage1 isn't a vector? ohh passage1.data. etc. - this can add up to 30 mins of work or more, which isn't a great experience. If the docs contain simple, working snippets for common tasks then it's such a breath of fresh air - all the technical data on parameter/return value types etc. should be secondary to that (again, in order to prioritise helping newbies get started quickly).

Yes that's definitely something which should be improved. Perhaps adding a table of contents to the top of /api/pipelines which would link them to the relevant code snippets would be a simple addition for now (to replace the ugly auto-generated block which is there right now).

For example, it could be similar to the available tasks section, but also linking to (or including) the parameters

> Or I may want to compare features across modalities like with CLIP. IIUC, these are the sorts of things people will use the feature-extraction pipeline for

Currently, the feature-extraction pipeline is only for text (something I actually found out recently, as I also thought it was for all modalities). The recommended way to get the raw model outputs is by loading models with the from_pretrained method of AutoModel, AutoModelForXXX, or XXXModel, running the Processor and/or tokenizer separately, and passing these inputs to the model. This is obviously quite tedious, and code snippets for this will help greatly.
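The tokenizer-then-model flow described above can be sketched as a small helper that takes both pieces as arguments, which keeps the steps visible without tying the example to one model. This is a rough sketch under assumptions: the function name `runTextModel` is hypothetical, the `padding` and `truncation` options are assumed to be accepted by the tokenizer, and the loading calls appear only as comments because they download model files:

```javascript
// Assumed loading step (not executed here; requires the library and network):
//   import { AutoTokenizer, AutoModel } from '@xenova/transformers';
//   const tokenizer = await AutoTokenizer.from_pretrained(modelId);
//   const model = await AutoModel.from_pretrained(modelId);

// Run the two steps by hand: tokenize the texts, then pass the inputs to the model.
// Returns whatever raw output object the model produces.
async function runTextModel(tokenizer, model, texts) {
  // Step 1: turn strings into model inputs (option names are an assumption).
  const inputs = await tokenizer(texts, { padding: true, truncation: true });
  // Step 2: call the model directly instead of going through a pipeline.
  return model(inputs);
}
```

The names of the fields on the returned object depend on the model class that was loaded, so the helper deliberately returns the raw output rather than picking a field.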

@josephrocca
Contributor Author

josephrocca commented Aug 22, 2023

> though I would say that the /api/pipelines section is meant to have those technical details, while /pipelines shouldn't (it should be high-level).

Nothing wrong with having technical details there imo (especially now that we have links that go straight to relevant code snippets - much easier for newbies to navigate), but if there are already example code snippets there, why not make them as useful as possible to the dev that's reading them? If 50% of people hitting the page want to do X, then the code snippet should probably show an example of X - especially if it's just a couple more lines of code.

But I agree that stuff that's higher level (than e.g. a dot product or whatever), should probably go on a separate page (same with not-as-common use cases).

@xenova
Collaborator

xenova commented Aug 22, 2023

Yeah that makes sense 👍 The library also has some other (not-as-well documented) methods for dot product and cosine similarity, so we could always just use those.

For now, I'll merge these changes (as I am prepping v2.5.3 now), and we can continue improving the docs in other PRs 😇🤗

Thanks again for these improvements!

@xenova xenova merged commit 9bb6923 into huggingface:main Aug 22, 2023