Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move "by" tag contains filter to tag exact match filter #4464

Closed
AetherUnbound opened this issue Jun 7, 2024 · 6 comments · Fixed by #4481
Closed

Move "by" tag contains filter to tag exact match filter #4464

AetherUnbound opened this issue Jun 7, 2024 · 6 comments · Fixed by #4481
Assignees
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix good first issue New-contributor friendly help wanted Open to participation from the community 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: catalog Related to the catalog and Airflow DAGs

Comments

@AetherUnbound
Copy link
Contributor

Description

Looking over our transformations, it seems we have two separate tag filtering steps that happen on the MediaStore class: full string matching and substring matching. The latter case, however, has the value by as a substring match option. Whereas all of the other options feel fairly unique and pointed, the value by could be part of so many valid tag possibilities that it seems like it should be part of the full string matching instead. Examples being: "by the sea", "colby jack", "byte array", etc.

We should move this value from TAG_CONTAINS_DENYLIST to TAG_DENYLIST in this file:

TAG_DENYLIST = {
"no person",
"squareformat",
"undefined",
}
# Filter out tags that contain the following terms. All entrées should be lowercase.
TAG_CONTAINS_DENYLIST = {
":",
"=",
"by",
"by-nc",
"by-nc-nd",
"by-nc-sa",
"by-nd",
"by-sa",
"cc0",
"creative commons",
"flickriosapp",
"pdm",
"public domain",
"uploaded",
}

@AetherUnbound AetherUnbound added 💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: catalog Related to the catalog and Airflow DAGs good first issue New-contributor friendly help wanted Open to participation from the community labels Jun 7, 2024
@sarayourfriend
Copy link
Contributor

👍 to by should be a full-tag match, maybe cc by (etc for the other CC license terms) as well.

In line with the discussion of #4417 and into #4456, if we will soon have the ability to filter tags later in our data lifecycle, it'd be great to not do any kind of wholesale exclusion of tags at this point.

My understanding of some of the requirements of #4456, in line with the requirements of #4417 not to include certain tags in search but keep them in our data, would lead me to think that excluding these tags at all is against that trend? I suppose I'm making an assumption about how those discussions will play out.

If we move this later on, though, we'll have greater flexibility to play with what tags we use for search, without precluding the possibility of tags not directly used in search still being useful. Perhaps the existence of some of these tags, even if they aren't part of the searched text, is a useful indicator of some quality of the work that would be relevant for some subset of searches (if not all searches).

If the task for this issue is just to move this check into the other set, fine for now, but perhaps an additional issue would be good to also stop throwing these tags away at ingestion time (if that's how they're being excluded).

@krysal
Copy link
Member

krysal commented Jun 10, 2024

If the task for this issue is just to move this check into the other set, fine for now, but perhaps an additional issue would be good to also stop throwing these tags away at ingestion time (if that's how they're being excluded).

I was about to ask exactly this because, from sync discussions, I had understood that we wanted to preserve tags with low accuracy and move forward with cleaning deny-listed tags in the catalog since we're already excluding them in new records. Hence, I created #4453 (blocked while this is clarified). If the desire is to keep everything, then #4453 would not be necessary.

@sarayourfriend
Copy link
Contributor

We are moving towards that, I assume Madison opened this issue as something we can resolve as a separate problem, rather than also solving it in that issue (or waiting to solve it after that issue).

As things are, we are incorrectly excluding the types of tags Madison mentioned in the issue. Why block this on overall changes?

@krysal
Copy link
Member

krysal commented Jun 11, 2024

We are moving towards that...

Perfect, that is all the confirmation I needed. Thanks!

As things are, we are incorrectly excluding the types of tags Madison mentioned in the issue. Why block this on overall changes?

I wasn't referring to blocking this issue (#4464) but #4453, which is unnecessary, so it's closed now. Sorry for the misunderstanding. I agree with moving the by tag 👍

@szymon-polaczy
Copy link
Contributor

I'd like to work on this

@AetherUnbound
Copy link
Contributor Author

AetherUnbound commented Jun 20, 2024

Thanks @sarayourfriend and @krysal for piping in - yes, I agree that we should stop filtering tags at ingestion in the future once we move the filtering out of the ingestion server (see #4524). I'll make an issue for follow-up!

Edit: issue made - #4542

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix good first issue New-contributor friendly help wanted Open to participation from the community 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: catalog Related to the catalog and Airflow DAGs
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants