Split giella-shared in several repos #20
Comments
Sounds fine to me. Packaging-wise, dependencies are never optional, so the optional part is for y'all to figure out. And splitting the package is simple enough, especially if the repo is also split. |
For dependency handling, there are at least the following alternatives:
There is of course the possibility to do some shell scripting, but that will tie us even more to Unix, while we should strive to become more platform agnostic. |
A bit of a meta question, but why still bother? WSL exists and is great, and they're working on WSLg. MSYS is also fine. Even Microsoft's own packaging system vcpkg will download MSYS for several uses. I see no benefit in catering to non-Unix these days. Windows 7, 8, and 8.1 are EOL, so all supported platforms either come with or can install a full Posix/Unix environment with ease. |
the shared repos are now there and some use cases in langs-smj and myv (for urj-Cyrl). |
Guessing some CI stuff should be adapted, as seen here, shared repos are missing. https://divvun-tc.thetc.se/tasks/PqgV9La7Toi1onN0N9OPKw/runs/0/logs/public/logs/live.log#L6809 How do we handle that? I would rather have a generic way that wouldn't require lists of dependencies per language in the CI config. Of course, the easy solution for now is to clone all of the shared repos, but it'd be nice to have a way to avoid that in the future. |
yeah we were thinking of some lightweight format for dependencies in style of requirements.txt or Cargo.toml somewhere, not sure if that will be easier for CI than what can be fetched from configure.ac at the moment though, I'm open for ideas at this point |
It has to be in Edit: And I see that's almost what it is. Just missing a |
For now I've implemented the "clone everything every time" solution for CI. |
The builders get packages from my repo, so when I package the shared deps then that's a way to get them. Which I have to do anyway, so I'll get on that right away. |
I added the with options, it seems to work nicely with dynamic variables but isn't very extensively tested |
Part of the idea was to generalize shared resources to also include an arbitrary list of The details and implementation are not important, but it needs to meet the following details:
I have played around with something using |
…llalt/giella-shared/issues/3); Don't mangle d/rules if nothing was bundled
I've packaged the shared repos and tried it out with giella-smj and that works (builds, but fails tests), but the
New apt-get packages:
|
https://github.com/giellalt/giella-core/blob/master/scripts/giellalt-get.bash here's like a rough sketchy example of how one could automatically detect and fetch dependencies... yeah the installation of sharables in langs is still missing, I'll do that next, basically should work as copypasta from shared makefile.ams. I wonder if pkg-config is also missing since the configure fails so early, the missing installation should only bounce at build time. |
indeed the pkg-config name of single langs is giella- instead of lang- |
Aha. Well it can't be I am also not thrilled with the shared packages installing to |
mm fair point. It should be trivial to have a macro to check separate pkg config and directory names but I'll hold it off for a while if we can have a consensus on all the naming questions first since the template operations on all repos are quite heavy to run |
I agree. On the other hand, should these packages be installed at all? Most/all of them are more like code fragments to be included early in the build phase, not precompiled binaries or libraries to be "linked" to at runtime. Just my humble five cents 🙂 |
What about making the deps list very simple, like a TSV file of the following format:
where the tag/revision field is optional, and if left out, it means |
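A minimal sketch of how such a TSV list could be read from a shell script. The column layout (name, URL, optional tag/revision), the example URLs and the function name `list_deps` are assumptions for illustration only; the thread does not fix them. When the third field is left out, the sketch simply reports it as empty and leaves the interpretation to the caller.

```shell
#!/bin/sh
# Hypothetical reader for a tab-separated dependency list.
# Assumed columns (not specified in the thread): name<TAB>url[<TAB>revision]

list_deps() (
    # The function body is a subshell, so the variables below stay local.
    tab=$(printf '\t')
    # '|| [ -n "$name" ]' keeps a last line that has no trailing newline.
    while IFS=$tab read -r name url rev || [ -n "$name" ]; do
        # Skip blank lines and comment lines.
        case $name in
            ''|'#'*) continue ;;
        esac
        printf 'name=%s url=%s rev=%s\n' "$name" "$url" "$rev"
    done < "$1"
)
```

Called as `list_deps deps.tsv` (file name hypothetical), it prints one normalised record per dependency, which a CI script could then loop over.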
For distro packaging you need to either install separately or bundle into a single tarball. The option of fetching the extras at build time or into a parallel folder does not exist. The fact that giella-core's m4 files must be bundled is a bit of a pain. I use both options. For nightly packages, the installed data dependencies are used because here we always want the latest of everything. For releases, data dependencies are bundled into the tarball because version drift will ruin things. Which reminds me, it is important for those dependencies to note in And for listing deps, I'd say |
It's a bit like header only libraries in C/C++ in a way, but yeah like Tino says it's good for packaging and distro use and comes quite for free in autotools setting. I've gone through the naming convention questions a bit, so the questions to agree upon are:
|
I say
|
I agree with all of this. |
it should be good for testing now, the ci that reports on zulip seems to succeed but there are probably a number of corner cases that can still fail. |
Seems to work. |
The only thing I would like to improve with the new shared repos is the date of the commits. They seem to now be from the date that @flammie did the split, and not from the actual date of the commit. Could that be fixed? Also, the history does not go all the way back to the start, but that could be a left-over thing from the svn-to-git conversion. |
I used this: https://stackoverflow.com/questions/1365541/how-to-move-some-files-from-one-git-repo-to-another-not-a-clone-preserving-hi/11426261#11426261 to do the history, the commit dates look right to me on command line git but github seems to have different timing, maybe the --committer-date-is-author-date option was wrong? This method is also nice because you can basically sed the log for anomalies that break the thing like massive moves. |
This is what it looks like in Tower, where the details reveal what is going wrong: That is, I am the author (and @flammie the committer). It seems that Tower (and GitHub) uses the committer date (May 2022), whereas the CLI log uses the author date (2017). Ideally I would like committer = author (unless there is a real (PR) merge, which I don't think we've had so far for this repo or the parent repo), both regarding the person and the date. Finally, we should also make sure that the history is complete. That is tracked in a separate project. |
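The mismatch described above can also be checked on the command line. By default `git log` prints the author date, while the `%ad` and `%cd` format placeholders print the author and committer dates side by side. The function name `show_commit_dates` is only for illustration.

```shell
#!/bin/sh
# Print abbreviated hash, author date and committer date for each commit,
# so commits where the two dates differ are easy to spot.
# Extra arguments (e.g. a revision range or -n 5) are passed on to git log.
show_commit_dates() {
    git log --date=short --format='%h author=%ad committer=%cd' "$@"
}
```

Run inside one of the split repos, a line where `author=` and `committer=` show different dates corresponds to a commit that tools using the committer date will display with the date of the split.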
It's trivial to fix since the author info is there and correct, but it will mean a force push to the repos. Howto: https://riptutorial.com/git/example/21122/setting-git-committer-equal-to-commit-author |
Done using this command for
git filter-branch -f --commit-filter \
'export GIT_COMMITTER_NAME=\"$GIT_AUTHOR_NAME\";
export GIT_COMMITTER_EMAIL=\"$GIT_AUTHOR_EMAIL\";
export GIT_COMMITTER_DATE=\"$GIT_AUTHOR_DATE\";
git commit-tree $@' \
-- --all
and force-pushed. Will also do the other repos, and finally clean up some wrong emails. I.e. more force-pushing coming up. |
Just for reference, emails are cleaned using this command:
git filter-branch --env-filter 'if [ "$GIT_AUTHOR_EMAIL" = "incorrect@email" ]; then
GIT_AUTHOR_EMAIL=correct@email;
GIT_AUTHOR_NAME="Correct Name";
GIT_COMMITTER_EMAIL=$GIT_AUTHOR_EMAIL;
GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME"; fi' -- --all
taken from https://stackoverflow.com/questions/4981126/how-to-amend-several-commits-in-git-to-change-author. |
After the above, I filtered
Rewrite b5fcc6dba1a3915844f8641575bab14490229e5a (62/70) (3 seconds passed, remaining 0 predicted)
Ref 'refs/heads/main' was deleted
fatal: Not a valid object name HEAD
zsh: command not found: export GIT_COMMITTER_NAME=\"$GIT_AUTHOR_NAME\";\n export GIT_COMMITTER_EMAIL=\"$GIT_AUTHOR_EMAIL\";\n export GIT_COMMITTER_DATE=\"$GIT_AUTHOR_DATE\";\n git commit-tree $@
and the repo is busted. Of course one can live with wrong dates, it is just very irritating. Anyone any idea on how to fix the repo so that the dates are correct again? |
You somehow got 11 spaces after |
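The comment above is cut off, but the `command not found` error is what a shell typically reports when a backslash meant as a line continuation is followed by spaces: the backslash then escapes the first space instead of the newline, and the command ends at the end of the line. Assuming that was the cause here, pasted commands can be checked for it before running them. The function name `find_broken_continuations` is hypothetical.

```shell
#!/bin/sh
# Report lines where a backslash is followed only by blanks (spaces or tabs)
# before the end of the line, i.e. a line continuation that will not work.
# Prints matching lines with their line numbers; exits non-zero if none found.
find_broken_continuations() {
    grep -nE '\\[[:blank:]]+$' "$@"
}
```

Saving the pasted command to a file and running `find_broken_continuations that-file` lists the offending lines, if any.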
Thanks, that fixed it, and now |
This is now all done. Dependency management can be refined, but it works across all CI/build systems, which is good enough. |
Background
giella-shared contains today a mixture of data for many different languages:
Core idea
Ideally we would only have giella-core as a required dependency (thus needing to move the filters there), and everything else as separate repositories that can be subscribed on an as-needed/wanted basis.
By generalising sharing resources, it would also be straightforward to share content across language repositories, like including sma and sme proper nouns in smj (with some filtering and restrictions). Technically there would be no difference between getting content from lang-sme and shared-smi.
Naming
shared-, parallel to lang-, keyboard- etc. It does not have to be what is suggested here, other suggestions are welcome.
smi and urj
Concrete example
The present giella-shared would after a split become (with check marks for the actual split):
shared-smi: the present shared Sámi resources
shared-mul: the present shared symbols, URLs and punctuation lexicons (mul = multiple languages)
shared-eng: present shared English resources (like names)
shared-urj-Cyrl: shared resources for Uralic languages written in Cyrillic
giella-core/fst-filters/: fst filters moved here, since they are a prerequisite for compiling fst's
Another example: lang-sme as a source for North Sámi names when used in another Sámi language, like place names. Non-Sámi names in lang-sme would be filtered out, and generic last elements could be (automatically) adapted to Lule Sámi spelling and inflection as needed. This is relevant both for text analysis and parsing in general, but especially for TTS, where there is a need to get a best possible transcription and pronunciation of whatever is thrown at the system. Place names from related neighbouring languages will certainly be a pain point for many minority languages in such a context.
By treating all repos the same as a potential source for lexical and other resources, we get a more flexible and powerful infrastructure.
Restrictions
Ideally the shared resources should never be required: without access to them the result should only be a smaller analyser with worse coverage. This will make giella-core the only required external dependency.
As far as possible, the resources in each repo should be independently compilable and testable, kind of like independent code libraries.
Benefits
Considerations
versioning
dependency management
We need a straightforward and simple system to declare dependency on a list of other repositories, kind of like Rust cargo lists. But as noted above, the system should be robust enough to not break if a resource is not available, only give a warning.
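The "warn, don't break" requirement can be sketched as a small shell check. The layout (dependencies checked out as directories under a common parent) and the function name `check_optional_deps` are assumptions for illustration, not the project's actual mechanism.

```shell
#!/bin/sh
# Hypothetical check: look for each optional dependency as a directory under
# a common parent directory. A missing dependency produces a warning on
# stderr, but the function always returns success so the build can continue.
# Usage: check_optional_deps PARENT_DIR DEP...
check_optional_deps() {
    deps_parent=$1
    shift
    for dep in "$@"; do
        if [ -d "$deps_parent/$dep" ]; then
            echo "found: $dep"
        else
            echo "warning: $dep not found, building without it" >&2
        fi
    done
    return 0
}
```

For example, `check_optional_deps .. shared-mul shared-smi` run from inside a language repo would list the dependencies that are present next to it and only warn about the others.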
CI
Dependency management needs to be automatic, at least for CI systems. We need at least:
Covered by what is specified in configure.ac, at least for now.
./autogen.sh in a directory, using the same cloning scheme as the depending repo: svn, git-ssh or git-https)
Cleanup
Comments welcome!
@flammie and I discussed this today, the notes above are based on that. We would very much like feedback on these ideas from anyone, but especially from @TinoDidriksen @bbqsrc @Eijebong @Trondtr @aarppe