Reading a full forms lexicon #130

Closed
arademaker opened this issue Aug 11, 2021 · 7 comments

Comments

@arademaker

The words command prints all pairs of upper/lower words. Is there a command to read a file with those pairs and produce an FST from them?

@mhulden
Owner

mhulden commented Aug 11, 2021

You can use read spaced-text for that; however, the required format is a little different. Symbols must be separated by spaces, the input and output of each pair go on separate lines, and pairs are separated by a blank line. Example:

c a t
g a t o

d o g
p e r r o

produces a transducer that maps cat to gato and dog to perro.

@arademaker
Author

Thank you, that should help us build a morphological analyzer from our full-forms Portuguese lexicon at https://github.com/LR-POR/MorphoBr/. Of course, such a transducer is not the perfect solution, since it captures neither the rules of the morphology nor the position classes and their respective morphemes.



a l e t o l o g i n h a s
a l e t o l o g i a +N +DIM +F +PL

@arademaker
Author

arademaker commented Mar 20, 2023

Hi @mhulden,

foma[0]: read spaced-text all.foma
Stack full!

I got a "Stack full!" error while reading a file with 8,027,574 lines. Is there an alternative? Can I increase the stack size? The file was created according to the instructions above:

% head all.foma
a
a +N +M +SG

a s
a +N +M +PL

a z i n h o
a +N +DIM +M +SG
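
For reference, a minimal Python sketch of how a file in this layout could be generated from tab-separated full-form entries such as `fracota<TAB>fracote+N+F+SG` (the .dict format shown later in this thread). The function names and the tag pattern are assumptions made for illustration, not part of foma or MorphoBr: each character of the form and lemma becomes one symbol, and each `+TAG` is kept together as a single symbol.

```python
import re

# Assumption: tags look like +N, +DIM, +SG, i.e. a plus sign followed by
# non-space, non-plus characters. Each tag is emitted as one symbol.
TAG_RE = re.compile(r"\+[^+\s]+")


def to_spaced_text(entry: str) -> str:
    """Convert 'form<TAB>lemma+TAG+TAG' into a two-line spaced-text pair."""
    form, analysis = entry.rstrip("\n").split("\t", 1)
    first_tag = analysis.find("+")
    if first_tag == -1:
        lemma, tags = analysis, []
    else:
        lemma = analysis[:first_tag]
        tags = TAG_RE.findall(analysis[first_tag:])
    # Surface form first, analysis second, following the layout of all.foma.
    surface = " ".join(form)
    lexical = " ".join(list(lemma) + tags)
    return f"{surface}\n{lexical}\n"


def convert(lines) -> str:
    """Join the converted pairs, with one blank line between pairs."""
    return "\n".join(to_spaced_text(line) for line in lines if line.strip())


if __name__ == "__main__":
    print(convert(["a\ta+N+M+SG\n", "azinho\ta+N+DIM+M+SG\n"]), end="")
```

The sketch assumes that forms and lemmas contain no spaces and no literal `+`; entries that do would need separate handling.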

@arademaker arademaker reopened this Mar 20, 2023
@arademaker
Author

arademaker commented Mar 20, 2023

I was able to compile the spaced-text files

% ll -h *.sp
-rw-r--r--  1 ar  staff    32M Mar 20 16:25 adjectives.sp
-rw-r--r--  1 ar  staff   1.4M Mar 20 16:25 adverbs.sp
-rw-r--r--  1 ar  staff    31M Mar 20 16:25 nouns.sp
-rw-r--r--  1 ar  staff   150M Mar 20 16:25 verbs.sp

with the foma script

% cat compile-m.foma
!Copyright (C) 2023 Alexandre Rademaker

read spaced-text nouns.sp
define nouns ;
clear stack

read spaced-text verbs.sp
define verbs ;
clear stack

read spaced-text adjectives.sp
define adjs ;
clear stack

read spaced-text adverbs.sp
define advs ;
clear stack

save defined morphobr.bin

after changing the limit defined at https://github.com/mhulden/foma/blob/master/foma/int_stack.c#L22 to 5097152. Does that make sense?

@arademaker
Author

The only strange behaviour is that the adjective analyses are not returned:

% echo "fracota" | flookup -a -i morphobr.bin
fracota	fracote+N+F+SG

ar@tenis morpho-br % rg fracota
nouns/nouns-f.dict
16878:fracota	fracote+N+F+SG
16879:fracotas	fracote+N+F+PL
16880:fracotazinha	fracote+N+DIM+F+SG
16881:fracotazinhas	fracote+N+DIM+F+PL

adjectives/adjectives-f.dict
16046:fracota	fracote+A+F+SG
16047:fracotas	fracote+A+F+PL
16048:fracotazinha	fracote+A+DIM+F+SG
16049:fracotazinhas	fracote+A+DIM+F+PL

Any idea?

@mhulden
Owner

mhulden commented Mar 21, 2023

Consider doing this instead of save defined:

regex nouns | verbs | adjs | advs;
save stack morphobr.bin

(save defined saves several FSTs, and flookup loads only one. With the above, you should get a single FST on the stack and save that.)

@arademaker
Author

Thanks, it worked. The strange thing is that when I tested a word that is ambiguous between noun and verb, it works. Perhaps the problem is that, without the explicit disjunction combining the FSTs, we ended up with an FST with multiple starting states and flookup tried only one?! But I was using the -a flag!

Anyway, the explicit disjunction to combine the FSTs worked fine!
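
A quick way to confirm that no analysis is lost after combining the FSTs is to compare the flookup output against the .dict entries. Below is a minimal Python sketch, assuming flookup prints one tab-separated form/analysis pair per line, as in the output shown earlier in this thread. The function names are hypothetical.

```python
from collections import defaultdict


def group_analyses(flookup_output: str) -> dict:
    """Map each input form to the set of analyses printed for it."""
    analyses = defaultdict(set)
    for line in flookup_output.splitlines():
        if not line.strip():
            continue  # skip blank lines in the output
        form, _, analysis = line.partition("\t")
        analyses[form].add(analysis)
    return dict(analyses)


def missing_analyses(expected_entries, flookup_output: str) -> list:
    """Return the (form, analysis) pairs from the dictionary that are absent."""
    found = group_analyses(flookup_output)
    missing = []
    for entry in expected_entries:
        form, _, analysis = entry.strip().partition("\t")
        if analysis not in found.get(form, set()):
            missing.append((form, analysis))
    return missing


if __name__ == "__main__":
    expected = ["fracota\tfracote+N+F+SG", "fracota\tfracote+A+F+SG"]
    # Output as reported above, before the FSTs were combined.
    output = "fracota\tfracote+N+F+SG\n"
    print(missing_analyses(expected, output))
```

With the output reported before the fix, the adjective reading of fracota is flagged as missing; once both analyses appear in the output, the list is empty.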
