@Ghesselink Ghesselink commented Sep 20, 2024

Brief summary of the current structure:
The script compiles .po files into .mo files and stores them in the cache, but only if the cache is not already populated. Otherwise the translations are read directly from the cache. The server makes them available to the frontend: the language preference is stored in a cookie, and the translations are passed to the HTML via render_template in server.py.

Todo

  • we have to disable caching, we cannot cache based on translations alone because the .md files can also change, we'll implement caching later on using a separate reverse proxy
  • create .mo files in create_resources for easy get started
  • create .mo files using multiple threads
  • When there is no translation, don't show an empty aside but rather show that there are no translations for this language and a link how to contribute.
  • Generate the language-selector options from the dictionary in Python, and remove the dict in JS (it's not used)
  • Chinese Simplified has a space, which most likely breaks the class name; it's the only one that does. (But this is hopefully solved by autogenerating the options.)
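
The last two todo items could be handled together. Below is a minimal sketch, not the PR's code: `LANGUAGES`, `css_class` and `language_options` are hypothetical names, and the language table is made up for illustration. It assumes the selector only needs one space-free class token per language.

```python
import html
import re

# Hypothetical language table; the real dictionary lives in the project's Python code.
LANGUAGES = {
    "English": "en",
    "Dutch": "nl",
    "Chinese Simplified": "zh-CN",
}


def css_class(name):
    """Turn a display name into a single, space-free CSS class token."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def language_options(languages):
    """Render one <option> element per language, sorted by display name."""
    return "\n".join(
        f'<option class="lang-{css_class(name)}" value="{html.escape(code)}">'
        f"{html.escape(name)}</option>"
        for name, code in sorted(languages.items())
    )
```

The resulting string could be passed to the template like any other variable, so the JS dictionary would no longer be needed.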

Comment on lines 6 to 9
<p>
<aside class="aside-note">
<mark>TRANSLATION</mark>
<div>
Contributor Author

To do: style this similarly to entity.html.


<h1>{{ number }} {{ entity}}</h1>

<p>
Contributor Author

To do: style this similarly to entity.html.

Collaborator

Please do so using a partial template / include, e.g. partials/_translation.html, and then

<p>
    {% include "partials/_translations_aside.html" %}
</p>

@Ghesselink
Contributor Author

Ghesselink commented Apr 21, 2025

Copied from Ghesselink#1:

Some changes that I needed to get things running:

    Support windows (msgfmt -> babel, utf-8, ...)
    eliminated the need for page reload to switch language
    tweaked style
    updated some defaults to point to the submodule

Todo

    we have to disable caching, we cannot cache based on translations alone because the .md files can also change, we'll implement caching later on using a separate reverse proxy
    create .mo files in create_resources for easy get started
    create .mo files using multiple threads
    When there is no translation, don't show an empty aside but rather show that there are no translations for this language and a link how to contribute.
    Generate the language-selector options from the dictionary in Python, and remove the dict in JS (it's not used)
    Chinese Simplified has a space, which most likely breaks the class name; it's the only one that does. (But this is hopefully solved by autogenerating the options.)

Questions

    I really don't understand load_original(): it seems really inefficient and I don't know what it does. Related to that, I don't understand why we have both TRANSLATIONS_DIR and CROWDIN_REPO_DIR, and I don't know what to initialize these to. I see there is only one submodule initialized.

Support windows (msgfmt -> babel, utf-8, ...)

I'm sorry, I didn't consider that

we have to disable caching, we cannot cache based on translations alone because the .md files can also change, we'll implement caching later on using a separate reverse proxy
create .mo files in create_resources for easy get started
create .mo files using multiple threads

Does this mean we'll still translate the entire html and load only the 'active' translation, or do we then just create .mo files and do the translation on request (when loading the page)? In that case, we can still cache these .mo files based on newly incoming translations in the translate repository.

I really don't understand load_original() this seems really inefficient and I don't know what it does

I'm having another look at it, but load_original() is redundant for entities and properties; I've only used it for type objects, more specifically for type values (for example, IfcWallType).
[image]
So load_original() is used to get the original values here, which are then processed further in the type HTML template: https://github.com/Ghesselink/IFC4.3.x-development/blob/944fdc00c861fe6f829c19d2424f48cb96d2c7bb/code/templates/type.html#L64-L74

I've had some issues with loading these from the .md locally. Furthermore, this way we're completely sure that the translated text is the same as the original. However, as you mentioned, when we move away from caching based on translations and track changes in the .md files too, this will be redundant.

I don't understand why we have both TRANSLATIONS_DIR and CROWDIN_REPO_DIR, I don't know what to initialize these to.

The terminology is a bit mixed up at times. 'Translations' can already mean several things:

  • The repository containing the translations -> CROWDIN_REPO_DIR
  • The actual translation process of IFC entities, properties and types (+ type values).
  • The place where translations are being stored -> TRANSLATIONS_DIR
  • This branch, and this project in general.

Perhaps it would be better to rename this to cached_translations or similar, and store the HTTP responses in Redis (?)

@aothms
Collaborator

aothms commented May 13, 2025

Does this mean we'll still translate the entire html and load only the 'active' translation, or do we then just create .mo files and do the translation on request (when loading the page)?

I would say create all .mo files at startup and when there are changes to the translations. Most likely in poller.py

I still don't really understand the load_original (and how it interacts with caching). Could you give some example values and how it's not possible to get these from the compiled .mo?

The place where translations are being stored

You mean the html cache?

FYI, caching is easy to do in, for example, nginx: https://docs.nginx.com/nginx/admin-guide/content-cache/content-caching/

Let me know how you think we can move towards finalizing this.

translate.build_cache(clean=True, use_hash=True)

# First time. Spider the site to build indices in Redis. Then terminate.
subprocess.call([sys.executable, "translate.py", "build-cache", "--clean"])
Collaborator

Seems redundant with the Python call above?

except subprocess.CalledProcessError:
    return b""

def update_repo(repo, branch):
Collaborator

Should we also change this to use pip install GitPython?

Comment on lines 71 to 75
else:
    translate.build_cache(use_hash=True)

if trans_changed:
    translate.build_cache()
Collaborator

Many calls to build_cache?

code/poller.py Outdated
if trans_changed:
    translate.build_cache()

if not (main_changed or trans_changed or first_time):
Collaborator

Why this change? I think even if things have changed, it's ok to sleep?
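
To make the three poller.py remarks above concrete (redundant call, many calls to build_cache, sleeping even when something changed), here is a rough sketch of a loop that calls build_cache at most once per cycle and always sleeps afterwards. This is an assumption-heavy illustration: `rebuild_if_needed`, `poll`, `check_changes` and `cycles` are hypothetical names, and which keyword arguments the single call should carry is a guess based only on the fragments shown in this diff.

```python
import time


def rebuild_if_needed(first_time, main_changed, trans_changed, build_cache):
    """Call build_cache at most once per cycle; return True if it ran."""
    if first_time:
        # Assumption: a clean, hash-based build is what the first run needs.
        build_cache(clean=True, use_hash=True)
    elif main_changed or trans_changed:
        # Assumption: an incremental, hash-based build covers both kinds of change.
        build_cache(use_hash=True)
    else:
        return False
    return True


def poll(check_changes, build_cache, interval=60.0, cycles=None, sleep=time.sleep):
    """Poll for changes, rebuild once per cycle when needed, then always sleep."""
    first_time = True
    done = 0
    while cycles is None or done < cycles:
        main_changed, trans_changed = check_changes()
        rebuild_if_needed(first_time, main_changed, trans_changed, build_cache)
        first_time = False
        sleep(interval)  # sleep unconditionally, also after a rebuild
        done += 1
```

Passing `sleep` and `cycles` in as parameters is only there so the loop can be exercised without waiting.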

print(f"[ERR] {po}: {e}", file=sys.stderr)

def _compile_one(po, mo):
    os.makedirs(os.path.dirname(mo), exist_ok=True)
Collaborator

Not sure if it makes a difference, but maybe do this outside of the function call, so that you group all the IO operations and don't repeat the same call for files that share a directory:

for path in set(map(os.path.dirname, map(operator.itemgetter(1), tasks))):

Note that you also call makedirs in compile_po_to_mo(), so it can be removed there as well.
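
Spelled out, the suggestion could look roughly like this. It is a sketch that assumes `tasks` is an iterable of (po_path, mo_path) pairs, as the `itemgetter(1)` above implies; `ensure_output_dirs` is a hypothetical helper name.

```python
import os


def ensure_output_dirs(tasks):
    """Create each distinct output directory once, before any compiling starts.

    `tasks` is assumed to be an iterable of (po_path, mo_path) pairs.
    """
    dirs = {os.path.dirname(mo) for _po, mo in tasks}
    for path in sorted(dirs):
        if path:  # a bare filename has no directory component to create
            os.makedirs(path, exist_ok=True)
    return dirs
```

With the directories created up front, the per-file worker no longer needs to touch the filesystem layout, which also matters once the work is moved into separate processes.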

Contributor Author

I've changed it; it is indeed clearer to do this in one place. There's not much difference in performance. In both cases, the initial compilation of all 621 .po files to .mo (with polib) takes about 8 seconds. When everything is skipped (i.e. there are no new translations), it takes just 0.1 s.
89b4218

Initial build:

translate.py build-cache --clean 
[...]
Done. compiled=621, skipped=0, pruned=0, errors=0, TRANSLATIONS_BUILD_DIR=/home/geert/Documents/translations/IFC4.3.x-development/code/compiled_translations in 8.55 seconds

Skipped:

debugpy/launcher 51103 -- /home/geert/Documents/translations/IFC4.3.x-development/code/translate.py build-cache 

Done. compiled=0, skipped=621, pruned=0, errors=0, TRANSLATIONS_BUILD_DIR=/home/geert/Documents/translations/IFC4.3.x-development/code/compiled_translations in 0.10 seconds
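
For readers wondering how a run can skip all 621 files in 0.1 s: one common way to do that is to compare a content hash of each .po file against a stored value. The sketch below illustrates that general technique only; it is not taken from translate.py, and the `.sha256` sidecar file as well as the function names are hypothetical.

```python
import hashlib
import os


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


def needs_compile(po, mo):
    """True when `mo` is missing or its recorded source hash is stale.

    The hash is kept in a hypothetical sidecar file named `<mo>.sha256`.
    """
    stamp = mo + ".sha256"
    if not (os.path.exists(mo) and os.path.exists(stamp)):
        return True
    with open(stamp, encoding="ascii") as fh:
        return fh.read().strip() != file_digest(po)


def record_compiled(po, mo):
    """Store the source hash after a successful compile of `po` into `mo`."""
    with open(mo + ".sha256", "w", encoding="ascii") as fh:
        fh.write(file_digest(po))
```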

print(f"[ERR] {po}: {e}", file=sys.stderr)

else:
    with ThreadPoolExecutor(max_workers=jobs) as ex:
Collaborator

Did you try both ThreadPoolExecutor as well as ProcessPoolExecutor?

Contributor Author

I initially went for ProcessPoolExecutor but then switched to ThreadPoolExecutor because I couldn't get it working. Looking at it again, I understand why: the directories must be created in one central spot (as you pointed out in another comment), and the compile function must be defined outside of the build_cache function (i.e. it must be picklable).


I've tried both and tested it by creating a clean cache build.

ThreadPoolExecutor

python translate.py bench --pool thread  -j 8 --repeat 3
bench: pool=thread jobs=8 runs=[3.2557999299970106, 3.3374314489992685, 3.3936321290020715] avg=3.33s

ProcessPoolExecutor

python translate.py bench --pool process  -j 8 --repeat 3
bench: pool=process jobs=8 runs=[0.9578737699994235, 1.102173382001638, 1.0440669560011884] avg=1.03s

and an extra double-check

python3 translate.py build-cache -j 8 --clean --pool process
Done. compiled=621, skipped=0, pruned=0, errors=0, TRANSLATIONS_BUILD_DIR=/home/geert/Documents/translations/IFC4.3.x-development/code/compiled_translations in 1.11 seconds

ProcessPoolExecutor handled the compiles about three times as fast as the thread pool. I guess that is because polib is pure Python and our task is mainly CPU-bound. I've left both in, so we can switch if we'd like to (the default is process).
0b9043d
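
For reference, the picklability constraint mentioned above boils down to the worker being a module-level function. A minimal sketch follows; the worker body is a placeholder that copies the file instead of calling the real compiler, and `compile_one` / `compile_all` are illustrative names rather than the ones in translate.py.

```python
import shutil
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def compile_one(task):
    """Module-level worker, so a process pool can pickle a reference to it.

    Placeholder body: copies the file instead of running the real .po compile.
    """
    po, mo = task
    shutil.copyfile(po, mo)
    return mo


def compile_all(tasks, pool="process", jobs=8):
    """Run compile_one over all (po, mo) pairs with the chosen pool type."""
    executor_cls = ProcessPoolExecutor if pool == "process" else ThreadPoolExecutor
    with executor_cls(max_workers=jobs) as ex:
        # Executor.map returns results in the order of `tasks`.
        return list(ex.map(compile_one, tasks))
```

A function nested inside build_cache cannot be pickled by reference, which is why the process pool failed at first. When this runs as a script on a platform that spawns worker processes, the call to `compile_all` should sit under an `if __name__ == "__main__":` guard.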

if val:
    out[suffix] = val

# remove wikileaks (e.g. '[[IfcBeam]]' --> 'IfcBeam')
Collaborator

Typo I guess ;)

Contributor Author

No, this is how it's represented in IFC.json, e.g.

"Definition": "An [[IfcBeam]] is typically a horizontal, or nearly horizontal, structural member that is capable of withstanding load primarily by resisting bending. It may also represent such a member from an architectural point of view. It is not required to be load bearing.",

This formatting is also used in the .pot files

msgid "IfcBeam_DEFINITION"
msgstr "An [[IfcBeam]] is typically a horizontal, or nearly horizontal, structural member that is capable of withstanding load primarily by resisting bending. It may also represent such a member from an architectural point of view. It is not required to be load bearing."

Because we're representing the translation at the top of the file I opted to remove the brackets. For now at least; in the near future it's probably better to put the translation inside the semantic definition part, and we can keep the brackets to include the link.
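
The bracket removal described here can be done with a single regular expression. A minimal sketch, assuming links never nest; `strip_wikilinks` is a hypothetical name and not necessarily what translate.py uses.

```python
import re

# Matches [[Target]] and captures the text between the double brackets.
_WIKILINK = re.compile(r"\[\[([^\[\]]+)\]\]")


def strip_wikilinks(text):
    """Replace every [[Target]] with the bare Target text."""
    return _WIKILINK.sub(r"\1", text)
```

If the translation later moves into the semantic definition section, the same pattern could be reused to turn the captured name into a link instead of dropping the brackets.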

Collaborator

I was referring to wikileaks instead of links :P

Contributor Author

Ooh woops .. I didn't even notice that :p


def list_languages():
    # get a list of available languages
    langs = build_language_file_map()
Collaborator

build_language_file_map() seems to get called an awful lot, yet it does quite an elaborate directory search. Maybe cache the result and recompute it at most once per minute. I think that's better than having poller.py write it to a JSON file, because that JSON file would also need to be read.

Contributor Author

Done

Also added two command-line arguments to test it, e.g.

(translations) geert@PcGeert:~/Documents/translations/IFC4.3.x-development/code$ python3 translate.py debug-ttl
TTL=60s
after first run: {'lang_map': 1, 'flag_map': 1, 'list_langs': 1}
after second run: {'lang_map': 2, 'flag_map': 2, 'list_langs': 2} # after time.sleep(60.5s)

The average time per call is very small; each map function was called 2000 times to measure it.

(translations) geert@PcGeert:~/Documents/translations/IFC4.3.x-development/code$ python3 translate.py bench-ttl
cold: map=4.886 ms  flag=0.081 ms  list=0.022 ms
hot (cached avg): map=0.70 µs  flag=0.72 µs  list=0.97 µs
speedup×: map≈7014  flag≈112  list≈23
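
For context, the once-per-minute caching discussed in this thread can be done with a small time-to-live decorator. This is a sketch of the general idea only, not the code from the PR: `ttl_cache` and its injectable `clock` parameter are hypothetical, and it is limited to zero-argument functions.

```python
import functools
import time


def ttl_cache(seconds, clock=time.monotonic):
    """Cache a zero-argument function's result for `seconds` seconds."""
    def decorator(fn):
        state = {"expires": None, "value": None}

        @functools.wraps(fn)
        def wrapper():
            now = clock()
            # Recompute on the first call and whenever the entry has expired.
            if state["expires"] is None or now >= state["expires"]:
                state["value"] = fn()
                state["expires"] = now + seconds
            return state["value"]

        return wrapper

    return decorator
```

Decorating the directory scan with `@ttl_cache(60)` would give the behaviour shown in the debug output above: one real scan per minute, with every other call answered from memory. Note that this sketch takes no lock, so two threads arriving at the moment of expiry may both recompute.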
