Please file issues for insufficiencies of lingpy #16

Open
LinguList opened this issue Apr 9, 2017 · 4 comments

Comments

@LinguList
Collaborator

While reading the documentation of an extension of ipa2tokens in this repo, I noticed that you assume lingpy splits strings in such a way that the result is no longer identical with the input string once whitespace is removed. If this is really happening, it should be handled within lingpy, and I would need some examples that trigger it in order to confirm. Note that you should make sure to normalize to one Unicode normalization form, as we do in lingpy, since this alone may cause differences (currently, you are not normalizing in the script!). I don't know of any other reasons, but it would be extremely valuable to be told about those differences so that we can address them.
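To separate real segmentation differences from pure normalization effects, a check along the following lines may help. This is only a minimal sketch using the standard library: the helper name `same_after_normalization` is hypothetical, and the default form `"NFD"` is an assumption, since the form lingpy uses is not stated in this thread.

```python
import unicodedata


def same_after_normalization(original, tokens, form="NFD"):
    """Compare the whitespace-stripped input with the joined tokens
    after bringing both to the same Unicode normalization form.

    `form` is an assumption; swap in whichever form the tokenizer uses.
    """
    stripped = "".join(original.split())  # drop all whitespace
    joined = "".join(tokens)
    return (unicodedata.normalize(form, stripped)
            == unicodedata.normalize(form, joined))


# The same nasalized vowel, once precomposed and once as base + combining tilde
precomposed = "\u00e3"
decomposed = "a\u0303"

print(precomposed == decomposed)                            # False
print(same_after_normalization(precomposed, [decomposed]))  # True
```

If a pair of strings still differs after this check, the difference is not caused by normalization and would be a useful example to report.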

@Anaphory
Collaborator

Anaphory commented Apr 9, 2017

True! The difference is that LingPy's ipa2tokens removes the - and . characters that mark the ends of segments (and it is usually reasonable that it does).

I'm fully aware that what I use here is quite a dirty, ad-hoc way to do what I wanted, and it is supposed to be only temporary ("Nothing lasts longer than a quick provisional solution", though). I'll hopefully think of something better and suggest it to you at some point. I assume it would be an optional argument to ipa2tokens that tells it not to remove those characters but to do something else with them (Be their own token? Merge with the previous token? Merge with the following token?)
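The three options in that question could be sketched as a wrapper around an arbitrary tokenizer. Everything here is hypothetical: `tokens_with_breaks`, its `mode` argument, and the mode names are illustration only and not part of lingpy, and the default `tokenize=list` (one token per character) is merely a stand-in for a real segmenter such as ipa2tokens.

```python
import re


def tokens_with_breaks(text, tokenize=list, breaks="-.", mode="own"):
    """Tokenize `text` while keeping the break characters.

    mode="own":       each break becomes its own token
    mode="previous":  each break is appended to the preceding token
    mode="following": each break is prefixed to the following token
    """
    if mode not in {"own", "previous", "following"}:
        raise ValueError("unknown mode: " + mode)
    # The capturing group makes re.split return the break characters too.
    pattern = "([" + re.escape(breaks) + "])"
    out = []
    pending = ""  # breaks waiting to be prefixed to the next token
    for chunk in re.split(pattern, text):
        if not chunk:
            continue
        if len(chunk) == 1 and chunk in breaks:
            if mode == "following":
                pending += chunk
            elif mode == "previous" and out:
                out[-1] += chunk
            else:
                # mode "own", or no previous token to attach to
                out.append(chunk)
            continue
        segments = list(tokenize(chunk))
        if pending and segments:
            segments[0] = pending + segments[0]
            pending = ""
        out.extend(segments)
    if pending:
        # trailing break with nothing after it
        out.append(pending)
    return out


print(tokens_with_breaks("ta.to-ma", mode="own"))
# ['t', 'a', '.', 't', 'o', '-', 'm', 'a']
print(tokens_with_breaks("ta.to-ma", mode="previous"))
# ['t', 'a.', 't', 'o-', 'm', 'a']
print(tokens_with_breaks("ta.to-ma", mode="following"))
# ['t', 'a', '.t', 'o', '-m', 'a']
```

In all three modes, joining the tokens gives back the input string, which is the property the ad-hoc workaround in this repo is after.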

@LinguList
Collaborator Author

Ah, I see. This is of course a feature rather than a bug: dots serve as vowel break markers, and I don't see a reason to keep them, although one could modify the function to keep the dot. We even have new annotations that keep the original material while converting parts of it, using a "source/target" annotation. This would allow marking a laryngeal in IE, h₂, as h₂/ə, meaning that lingpy will read it as schwa while the segment is still laryngeal 2. We now also use clear-cut orthography profiles to convert from orthography to IPA-like representations. As far as this repo is concerned, I think it would be useful to have a larger discussion on that, so that you know where we are right now and can explain to us why you might want to diverge from it.

In terms of implementation, the dot may be hard-coded, but one would need to check the original code. In fact, you can adapt ipa2tokens to many of your needs, and I think the online tutorial where the function is described may even give further instructions. If not, let me know, and I'll explain some more about the basic ideas behind it.
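The "source/target" annotation described above can be illustrated with a small sketch. The helper names `split_annotation`, `source_view`, and `target_view` are hypothetical and do not reflect lingpy's actual implementation; the token list is an illustration, not taken from any dataset.

```python
def split_annotation(token, sep="/"):
    """Split a "source/target" token into (source, target).

    Tokens without a complete annotation map to themselves on both sides.
    """
    source, found, target = token.partition(sep)
    if found and source and target:
        return source, target
    return token, token


def source_view(tokens):
    """The segments as originally annotated (e.g. the laryngeal)."""
    return [split_annotation(t)[0] for t in tokens]


def target_view(tokens):
    """The segments as an analysis would read them (e.g. schwa)."""
    return [split_annotation(t)[1] for t in tokens]


tokens = ["p", "h₂/ə", "t", "e", "r"]  # illustrative token list
print(source_view(tokens))  # ['p', 'h₂', 't', 'e', 'r']
print(target_view(tokens))  # ['p', 'ə', 't', 'e', 'r']
```

The point of the design is that a single token carries both readings, so the original segment is never lost when a sound-class analysis needs something it can interpret.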

@LinguList
Collaborator Author

BTW, on CLDF, I recommend this page, as it is where I will develop the major specifications and recommendations, which are usually in line with what lingpy/edictor handle.

@Anaphory
Collaborator

Thanks! That's helpful.
