Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Always use Bicleaner AI #367

Merged
merged 33 commits into from
Jan 24, 2024
Merged

Always use Bicleaner AI #367

merged 33 commits into from
Jan 24, 2024

Conversation

eu9ene
Copy link
Collaborator

@eu9ene eu9ene commented Jan 16, 2024

  • Always use Bicleaner-AI and deprecate Bicleaner
  • If a language pair is not supported use the multilingual model
  • Always use full models
  • Download models using bicleaner tool
  • Rewrite and simplify model pack downloading, replace static URLs with a python script
  • Add tests
  • Refactor bicleaner kinds
  • Add filtered lines to artifacts: https://firefox-ci-tc.services.mozilla.com/tasks/FZR3B6bATa-DpxMv3PqbDg/runs/0
  • Increased maxRunTime for some steps that will work with huge datasets and might perform resource heavy operations

Fixes #55 #360

@eu9ene eu9ene requested a review from bhearsum January 19, 2024 18:53
@eu9ene eu9ene marked this pull request as ready for review January 19, 2024 18:53
@eu9ene eu9ene requested a review from a team as a code owner January 19, 2024 18:53
@eu9ene eu9ene requested a review from marco-c January 19, 2024 18:59
@@ -12,18 +12,10 @@ transforms:
- translations_taskgraph.transforms.cached_tasks:transforms
- taskgraph.transforms.task:transforms

# todo: why bicleaner and clean-corpus are required here? it doesn't work without them
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the issue?
I bet @bhearsum or @gabrielBusta can help

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd have to know more about what doesn't work to say anything with certainty, but in general you need to have kind-dependencies set if any tasks in this kind are generated from another kind.

(I am a little bit surprised that clean-corpus is required here....that looks like it may be an historical artifact from when all of the merge tasks were in the same kind...)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merging devest does not depend on cleaning or bicleaner. We merge raw datasets here. When I remove those two and validate the taskgraph it says:

2024-01-22 15:52:17,682 - ERROR - Error loading tasks for kind merge-devset:
Traceback (most recent call last):
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/generator.py", line 311, in _run
    new_tasks = kind.load_tasks(
                ^^^^^^^^^^^^^^^^
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/generator.py", line 76, in load_tasks
    tasks = [
            ^
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/generator.py", line 76, in <listcomp>
    tasks = [
            ^
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/task.py", line 1347, in check_run_task_caches
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/task.py", line 1276, in check_task_dependencies
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/task.py", line 1262, in check_task_identifiers
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/task.py", line 1243, in chain_of_trust
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/task.py", line 1236, in add_github_checks
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/task.py", line 1061, in build_task
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/task.py", line 1029, in add_index_routes
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/task.py", line 957, in process_treeherder_metadata
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/task.py", line 919, in validate
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/task.py", line 907, in task_name_from_label
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/task.py", line 863, in set_defaults
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/task.py", line 842, in set_implementation
    for task in tasks:
  File "/Users/epavlov/mozilla/source/origin/firefox-translations-training/taskcluster/translations_taskgraph/transforms/cached_tasks.py", line 117, in cache_task
    for task in order_tasks(config, tasks):
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/cached_tasks.py", line 22, in order_tasks
    pending = deque(tasks)
              ^^^^^^^^^^^^
  File "/Users/epavlov/mozilla/source/origin/firefox-translations-training/taskcluster/translations_taskgraph/transforms/cached_tasks.py", line 69, in add_cache
    for job in jobs:
  File "/Users/epavlov/mozilla/source/origin/firefox-translations-training/taskcluster/translations_taskgraph/transforms/cached_tasks.py", line 55, in resolved_keyed_by_fields
    for job in jobs:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/base.py", line 144, in __call__
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/run/__init__.py", line 361, in make_task_description
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/run/__init__.py", line 236, in use_fetches
    for task in order_tasks(config, tasks):
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/cached_tasks.py", line 22, in order_tasks
    pending = deque(tasks)
              ^^^^^^^^^^^^
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/run/__init__.py", line 164, in add_resource_monitor
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/run/__init__.py", line 152, in set_label
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/run/__init__.py", line 136, in set_implementation
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/run/__init__.py", line 116, in rewrite_when_to_optimization
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/base.py", line 144, in __call__
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/task_context.py", line 85, in render_task
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/base.py", line 144, in __call__
    for task in tasks:
  File "/Users/epavlov/mozilla/source/origin/firefox-translations-training/taskcluster/translations_taskgraph/transforms/find_upstreams.py", line 97, in upstreams_for_locales
    for job in jobs:
  File "/Users/epavlov/mozilla/source/origin/firefox-translations-training/taskcluster/translations_taskgraph/transforms/find_upstreams.py", line 82, in resolve_keyed_by_fields
    cleaning_type = get_cleaning_type(config.kind_dependencies_tasks.values())
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/epavlov/mozilla/source/origin/firefox-translations-training/taskcluster/translations_taskgraph/transforms/find_upstreams.py", line 75, in get_cleaning_type
    raise Exception("Unable to find cleaning type!")
Exception: Unable to find cleaning type!
Traceback (most recent call last):
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/main.py", line 907, in main
    return args.command(vars(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/main.py", line 430, in show_taskgraph
    ret = generate_taskgraph(options, parameters, logdir)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/main.py", line 192, in generate_taskgraph
    out = format_taskgraph(options, spec, logfile(spec))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/main.py", line 148, in format_taskgraph
    tg = getattr(tgg, options["graph_attr"])
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/generator.py", line 170, in full_task_graph
    return self._run_until("full_task_graph")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/generator.py", line 425, in _run_until
    k, v = next(self._run)
           ^^^^^^^^^^^^^^^
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/generator.py", line 311, in _run
    new_tasks = kind.load_tasks(
                ^^^^^^^^^^^^^^^^
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/generator.py", line 76, in load_tasks
    tasks = [
            ^
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/generator.py", line 76, in <listcomp>
    tasks = [
            ^
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/task.py", line 1347, in check_run_task_caches
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/task.py", line 1276, in check_task_dependencies
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/task.py", line 1262, in check_task_identifiers
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/task.py", line 1243, in chain_of_trust
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/task.py", line 1236, in add_github_checks
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/task.py", line 1061, in build_task
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/task.py", line 1029, in add_index_routes
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/task.py", line 957, in process_treeherder_metadata
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/task.py", line 919, in validate
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/task.py", line 907, in task_name_from_label
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/task.py", line 863, in set_defaults
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/task.py", line 842, in set_implementation
    for task in tasks:
  File "/Users/epavlov/mozilla/source/origin/firefox-translations-training/taskcluster/translations_taskgraph/transforms/cached_tasks.py", line 117, in cache_task
    for task in order_tasks(config, tasks):
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/cached_tasks.py", line 22, in order_tasks
    pending = deque(tasks)
              ^^^^^^^^^^^^
  File "/Users/epavlov/mozilla/source/origin/firefox-translations-training/taskcluster/translations_taskgraph/transforms/cached_tasks.py", line 69, in add_cache
    for job in jobs:
  File "/Users/epavlov/mozilla/source/origin/firefox-translations-training/taskcluster/translations_taskgraph/transforms/cached_tasks.py", line 55, in resolved_keyed_by_fields
    for job in jobs:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/base.py", line 144, in __call__
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/run/__init__.py", line 361, in make_task_description
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/run/__init__.py", line 236, in use_fetches
    for task in order_tasks(config, tasks):
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/cached_tasks.py", line 22, in order_tasks
    pending = deque(tasks)
              ^^^^^^^^^^^^
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/run/__init__.py", line 164, in add_resource_monitor
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/run/__init__.py", line 152, in set_label
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/run/__init__.py", line 136, in set_implementation
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/run/__init__.py", line 116, in rewrite_when_to_optimization
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/base.py", line 144, in __call__
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/task_context.py", line 85, in render_task
    for task in tasks:
  File "/Users/epavlov/Library/Caches/pypoetry/virtualenvs/firefox-translations-training-U0HVfMmK-py3.11/lib/python3.11/site-packages/taskgraph/transforms/base.py", line 144, in __call__
    for task in tasks:
  File "/Users/epavlov/mozilla/source/origin/firefox-translations-training/taskcluster/translations_taskgraph/transforms/find_upstreams.py", line 97, in upstreams_for_locales
    for job in jobs:
  File "/Users/epavlov/mozilla/source/origin/firefox-translations-training/taskcluster/translations_taskgraph/transforms/find_upstreams.py", line 82, in resolve_keyed_by_fields
    cleaning_type = get_cleaning_type(config.kind_dependencies_tasks.values())
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/epavlov/mozilla/source/origin/firefox-translations-training/taskcluster/translations_taskgraph/transforms/find_upstreams.py", line 75, in get_cleaning_type
    raise Exception("Unable to find cleaning type!")
Exception: Unable to find cleaning type!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this issue is not related to this PR though

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. Yep - that's an oversight from when merge-devset moved to its own kind. #385 fixes it.

@marco-c
Copy link
Collaborator

marco-c commented Jan 22, 2024

I see a downside in the approach. Previously we had all models listed. It was a bit painful when you wanted to add a new one, but it allowed us to specify exactly which model version we wanted to download and so it allowed us to make use of caching effectively (because we knew when to invalidate things).
Unless I'm missing something, with the new approach we are not specifying which model to download, so we will always use the latest one, which makes things not easily reproducible.
Also, when downloading a model for a pair like en-ru where we use the multilingual one, how do we invalidate the cache in case a en-ru non-multilingual model becomes available?

If we want to accept the downside in exchange for avoiding to list all models, we should probably document the decision somewhere.

Copy link
Collaborator

@bhearsum bhearsum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm marking this as Request changes due to the note below about the bicleaner model cache. Hopefully I'm understanding the intent there - please feel free to ping me to discuss.

todo: With non-ai bicleaner going away, the bicleaner in https://github.com/mozilla/firefox-translations-training/blob/71013bcea0e4647d04d508daf45fe2a96c27ef0d/taskcluster/translations_taskgraph/transforms/find_upstreams.py#L71 is now dead code and can be removed.

@@ -12,18 +12,10 @@ transforms:
- translations_taskgraph.transforms.cached_tasks:transforms
- taskgraph.transforms.task:transforms

# todo: why bicleaner and clean-corpus are required here? it doesn't work without them
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd have to know more about what doesn't work to say anything with certainty, but in general you need to have kind-dependencies set if any tasks in this kind are generated from another kind.

(I am a little bit surprised that clean-corpus is required here....that looks like it may be an historical artifact from when all of the merge tasks were in the same kind...)

taskcluster/kinds/bicleaner-model/kind.yml Outdated Show resolved Hide resolved
Copy link
Member

@gregtatum gregtatum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a few small pieces of feedback looking through this, but others have done a more thorough review, so I'm just leaving a comment.

taskcluster/kinds/bicleaner/kind.yml Outdated Show resolved Hide resolved
taskcluster/kinds/bicleaner/kind.yml Show resolved Hide resolved
tests/test_bicleaner.py Outdated Show resolved Hide resolved
tests/test_bicleaner.py Show resolved Hide resolved
tests/test_bicleaner.py Outdated Show resolved Hide resolved
@eu9ene
Copy link
Collaborator Author

eu9ene commented Jan 22, 2024

I see a downside in the approach. Previously we had all models listed. It was a bit painful when you wanted to add a new one, but it allowed us to specify exactly which model version we wanted to download and so it allowed us to make use of caching effectively (because we knew when to invalidate things). Unless I'm missing something, with the new approach we are not specifying which model to download, so we will always use the latest one, which makes things not easily reproducible. Also, when downloading a model for a pair like en-ru where we use the multilingual one, how do we invalidate the cache in case a en-ru non-multilingual model becomes available?

If we want to accept the downside in exchange for avoiding to list all models, we should probably document the decision somewhere.

All good points!

The idea was to delegate downloading to bicleaner and easily switch between the full and lite models + hide multilingual one to reduce complexity on Taskcluster side. I looked at the code and you're right, it does download latest which is not ideal, but there was just one release of the models so I don't think this will become a problem for us. Paracrawl project is over and I don't know whether there are any plans to retrain the models or train new languages. Let's keep an eye on this.

As for caching, we can just set expiration for the task (currently 90 days). And anyway we're likely always interested in the latest models as soon as they are compatible with the code. When/if they release something new they'll likely bump bicleaner-ai version and when we update it the caches will get invalidated. I assume they'll change the URLs and HF repo names if they ever train new incompatible models. I think the language set was like this for two years and it's unlikely it will change.

We download fastText model in the same way, now with OpusCleaner. We also download tons of datasets every time which also might get updated and we don't cache them well in the same way. We can check but I don't think downloading is a huge cost for us compared to GPUs.

I agree in general that it would be nice to make all the data/model downloading more transparent but doing downloading dynamically we increase flexibility and reduce complexity.

I'll add docs about this.

@marco-c
Copy link
Collaborator

marco-c commented Jan 22, 2024

My concerns are more with reproducibility than cost in this case, but if we don't expect the models to change often and we document this well, it seems OK.

@eu9ene
Copy link
Collaborator Author

eu9ene commented Jan 22, 2024

My concerns are more with reproducibility than cost in this case, but if we don't expect the models to change often and we document this well, it seems OK.

I see. I documented all this and how to invalidate the caches manually in the code and in the doc under Bicleaner -> ## Models and caching. It is a bit of a crutch but let's get back to it if we start having problems. I don't like the alternative of keeping all this logic on Takscluster since it's really hard to maintain, so sacrificing transparency here seems like the lesser of two evils.

@eu9ene
Copy link
Collaborator Author

eu9ene commented Jan 22, 2024

Ok, I think I addressed most of the issues now. Caching is not ideal but let's see how it goes. The developers might never release a new model for this tool. We can always manually invalidate caches if needed and rethink this approach if it becomes a problem.

@eu9ene eu9ene requested a review from bhearsum January 22, 2024 23:58
@eu9ene
Copy link
Collaborator Author

eu9ene commented Jan 24, 2024

Ci for train-teacher:

[vcs 2024-01-24T18:48:08.464Z] Cloning into '/home/ubuntu/tasks/task_170612204020291/checkouts/vcs'...
[vcs 2024-01-24T18:48:20.147Z] fatal: unable to access 'https://github.com/mozilla/firefox-translations-training/': Could not resolve host: github.com

it looks like some temporary failure of git clone. The bicleaner steps has passed. I don't want to delay merging, so I'll rerun full CI on the next PR

@eu9ene eu9ene merged commit 99f2397 into main Jan 24, 2024
40 of 41 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bicleaner Ai multilingual model
4 participants