Remove duplicate information from altLabels and texts #24

Open
1 task
ioggstream opened this issue May 2, 2024 · 0 comments
ioggstream commented May 2, 2024

I expect

  • Skill altLabels present in the json.gz summaries to not contain redundant definitions

Instead

They contain texts that differ only by minor variations (e.g., plurals, ...)

Notes

  • Use spaCy to identify duplicate sentences based on their lemmas
  • Pick one representative sentence for each equivalence class
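
A minimal sketch of the two notes above, assuming labels are plain strings and that two labels are duplicates when their lemma sequences are equal. The function names (`dedupe_by_lemmas`, `toy_lemmas`, `spacy_lemmas_factory`), the model name `en_core_web_sm`, and the "keep the shortest label" rule are all assumptions for illustration, not something the project defines. The demo uses a toy lemmatizer so it runs without spaCy installed.

```python
from collections import defaultdict


def toy_lemmas(text):
    """Stand-in lemmatizer for the demo: lowercase, strip a trailing 's'."""
    words = text.lower().split()
    return [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words]


def spacy_lemmas_factory(model="en_core_web_sm"):
    """Build a spaCy-backed lemmatizer (the model name is an assumption)."""
    import spacy  # imported lazily so the demo below runs without spaCy

    nlp = spacy.load(model)

    def lemmas(text):
        # Ignore punctuation and whitespace tokens when building the key.
        return [t.lemma_.lower() for t in nlp(text) if not (t.is_punct or t.is_space)]

    return lemmas


def dedupe_by_lemmas(labels, lemmatize):
    """Group labels by lemma sequence and keep one label per group."""
    classes = defaultdict(list)
    for label in labels:
        classes[tuple(lemmatize(label))].append(label)
    # Hypothetical tie-break: shortest label first, then alphabetical order.
    return [min(group, key=lambda s: (len(s), s)) for group in classes.values()]


labels = ["manage databases", "manage database", "Manage Databases", "write reports"]
deduped = dedupe_by_lemmas(labels, toy_lemmas)
print(deduped)
```

To use spaCy instead of the toy function, pass `spacy_lemmas_factory()` as the `lemmatize` argument; the grouping quality then depends on the lemmatizer of the chosen model.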

These sentences could then be used, e.g., to create spaCy matchers based on lemmas.
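
As a rough illustration of the matcher idea, the sketch below turns lemma sequences into token patterns for spaCy's rule-based `Matcher`, using the `LEMMA` token attribute. The helper names (`lemma_pattern`, `build_matcher`), the `"SKILL"` label, and the sample lemma sequences are hypothetical; `nlp` is assumed to be an already loaded spaCy pipeline that includes a lemmatizer.

```python
def lemma_pattern(lemmas):
    """Turn a sequence of lemmas into a spaCy Matcher token pattern."""
    return [{"LEMMA": lemma} for lemma in lemmas]


def build_matcher(nlp, lemma_sequences, label="SKILL"):
    """Register one pattern per lemma sequence under a single label."""
    from spacy.matcher import Matcher  # imported lazily; requires spaCy

    matcher = Matcher(nlp.vocab)
    matcher.add(label, [lemma_pattern(seq) for seq in lemma_sequences])
    return matcher


# Pattern construction alone needs no spaCy installation.
patterns = [lemma_pattern(seq) for seq in [["manage", "database"], ["write", "report"]]]
print(patterns[0])
```

Calling the matcher returned by `build_matcher` on a processed document yields `(match_id, start, end)` tuples; whether an inflected form such as "manages" actually matches depends on how the loaded model lemmatizes it.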

@ioggstream ioggstream added the enhancement New feature or request label May 2, 2024
@ioggstream ioggstream moved this to Todo in esco-playground May 2, 2024