Manually disambiguate the treebank sentences with Mood=Imp,Ind
(@dominickmaia, @juliana-gurgel and @heliolbs)
#559
Comments
@heliolbs and @dominickmaia, I asked Professor @leoalenc privately whether we should compile a list of all the sentence IDs before starting the disambiguation, and he replied:
|
Professor @leoalenc, using the search tool on the treebank, we found 148 occurrences of Mood=Imp,Ind. In today's meeting, we studied the script you posted in issue #486 and tried to adapt it to list the treebank sentences with Mood=Imp,Ind. In the output, the IDs of sentences that contain more than one ambiguity are repeated. Here are the adapted script and its output:
import Yauti
sents=Yauti.extractConlluSents(Yauti.TREEBANK_PATH)
for sent in sents:
    mood=sent.filter(feats__Mood='Imp,Ind')
    for m in mood:
        print(sent.metadata['sent_id'])
MooreFP1994:0:0:6
MooreFP1994:0:0:24
MooreFP1994:0:0:24
Alencar2021:0:0:53
Navarro2016:0:0:38
Navarro2016:0:0:40
Navarro2016:0:0:126
Navarro2016:0:0:129
Navarro2016:0:0:135
Navarro2016:0:0:136
Navarro2016:0:0:139
Navarro2016:0:0:148
Avila2021:4:2:26
Avila2021:4:2:26
Avila2021:0:0:27
Avila2021:0:0:27
Avila2021:0:0:27
Avila2021:0:0:27
Avila2021:0:0:27
Avila2021:0:0:35
Avila2021:0:0:35
Avila2021:0:0:66
Avila2021:0:0:102
Avila2021:0:0:124
Avila2021:0:0:124
Avila2021:0:0:127
Avila2021:0:0:134
Avila2021:0:0:134
Avila2021:0:0:155
Avila2021:0:0:180
Avila2021:0:0:183
Avila2021:0:0:186
Avila2021:0:0:202
Avila2021:0:0:202
Avila2021:0:0:273
Avila2021:0:0:311
Avila2021:28:2:325
Avila2021:28:3:326
Avila2021:0:0:330
Avila2021:0:0:353
Avila2021:0:0:353
Avila2021:0:0:368
Avila2021:0:0:372
Avila2021:0:0:372
Avila2021:0:0:384
Avila2021:0:0:387
Avila2021:0:0:387
Avila2021:0:0:397
Avila2021:0:0:434
Avila2021:35:2:438
Avila2021:0:0:448
Avila2021:37:1:465
Avila2021:38:1:472
Avila2021:39:1:477
Avila2021:0:0:519
Avila2021:0:0:531
Avila2021:0:0:548
Avila2021:0:0:562
Avila2021:0:0:568
Avila2021:0:0:569
Avila2021:0:0:593
Avila2021:0:0:594
Avila2021:0:0:606
Avila2021:0:0:628
Avila2021:0:0:642
Avila2021:0:0:644
Avila2021:0:0:644
Avila2021:0:0:646
Avila2021:0:0:652
Avila2021:0:0:678
Avila2021:0:0:694
Avila2021:0:0:718
NTLN2019:1:1:8
NTLN2019:1:1:8
NTLN2019:1:1:8
NTLN2019:1:2:9
NTLN2019:2:4:15
NTLN2019:8:1:47
NTLN2019:8:1:47
NTLN2019:8:2:48
NTLN2019:0:0:49
NTLN2019:0:0:49
Cruz2011:0:0:19
Cruz2011:0:0:19
Cruz2011:0:0:43
Cruz2011:0:0:50
Cruz2011:0:0:73
Cruz2011:0:0:82
Cruz2011:0:0:101
Cruz2011:0:0:114
Cruz2011:0:0:114
Cruz2011:0:0:116
Cruz2011:0:0:116
Cruz2011:0:0:117
Casasnovas2006:5:10:58
Casasnovas2006:6:8:67
Casasnovas2006:6:8:67
Casasnovas2006:8:6:74
Casasnovas2006:9:3:87
Casasnovas2006:9:3:87
Casasnovas2006:9:31:115
Casasnovas2006:10:8:127
Casasnovas2006:10:12:131
Casasnovas2006:11:7:138
Casasnovas2006:11:19:150
Casasnovas2006:11:21:152
Casasnovas2006:11:21:152
Casasnovas2006:11:27:158
Casasnovas2006:11:27:158
Casasnovas2006:11:34:165
Casasnovas2006:11:34:165
Casasnovas2006:11:34:165
Casasnovas2006:0:0:17
Rodrigues1890:1-2-2:120:120
Rodrigues1890:2-5:4:104
Magalhaes1876:1-1-1:0:71
Magalhaes1876:1-1-1:0:71
Magalhaes1876:1:6:6
Magalhaes1876:1:12:12
Magalhaes1876:1:15:15
Magalhaes1876:1:21:21
Magalhaes1876:1:42:42
Magalhaes1876:1:43:43
Magalhaes1876:2:5:48
Magalhaes1876:2:6:49
Magalhaes1876:2:8:51
Amorim1928:19:51:51
Amorim1928:21:41:41
Amorim1928:21:61:61
Amorim1928:21:113:113
Amorim1928:6:378:378
Amorim1928:2:101:5500
Amorim1928:2:500:6000
Amorim1928:12:50:50
Amorim1928:18:25:25
Amorim1928:18:26:26
Hartt1938:0:0:9
Hartt1938:0:0:25
Costa1909:0:0:1001
Aguiar1898:21-6:7:547
Aguiar1898:21-6:7:547
Aguiar1898:21-6:7:547
Aguiar1898:21-6:7:547
Aguiar1898:21-6:7:547
Studart1926:0:0:2810
Studart1926:0:0:2810
Studart1926:0:0:36103 |
@juliana-gurgel, which tool?
Yes, that is expected, because a single sentence can contain more than one ambiguous form. |
>>> sents=Yauti.extractConlluSents(Yauti.TREEBANK_PATH)
>>> len(sents)
1825
>>> ambi={}
>>> for sent in sents:
...     mood=sent.filter(feats__Mood='Imp,Ind')
...     if mood:
...         sentid=sent.metadata['sent_id']
...         ambi[sentid]=len(mood)
...
>>> len(ambi)
114 |
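For readers without Yauti at hand, the same per-sentence count can be sketched with the standard library alone. This is a minimal sketch, not the project's code: the helper name `count_ambiguous` and the toy CoNLL-U fragment (sentence IDs and tokens) are invented for illustration, and it assumes the standard 10-column CoNLL-U layout with FEATS in the sixth column.

```python
# Toy CoNLL-U fragment, invented for illustration (not taken from the treebank).
SAMPLE = """\
# sent_id = Toy:0:0:1
# text = a b
1\ta\ta\tVERB\tV\tMood=Imp,Ind|Person=2\t0\troot\t_\t_
2\tb\tb\tVERB\tV\tMood=Imp,Ind|Person=2\t1\tconj\t_\t_

# sent_id = Toy:0:0:2
# text = c
1\tc\tc\tNOUN\tN\tNumber=Sing\t0\troot\t_\t_
"""


def count_ambiguous(conllu_text, feature="Mood=Imp,Ind"):
    """Map each sent_id to the number of tokens whose FEATS contain `feature`."""
    counts = {}
    sent_id = None
    for line in conllu_text.splitlines():
        if line.startswith("# sent_id = "):
            sent_id = line[len("# sent_id = "):].strip()
        elif line and not line.startswith("#"):
            cols = line.split("\t")
            # FEATS is the 6th column; individual features are separated by "|".
            if len(cols) == 10 and feature in cols[5].split("|"):
                counts[sent_id] = counts.get(sent_id, 0) + 1
    return counts


ambi = count_ambiguous(SAMPLE)
print(len(ambi), "sentence(s),", sum(ambi.values()), "ambiguous form(s)")
```

Keying the dictionary by `sent_id` is what removes the repeated IDs: a sentence with several ambiguous forms appears once, with its count as the value.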
>>> for sentid in list(ambi.keys())[:5]:
...     print(f"{sentid} has {ambi[sentid]} ambiguous form(s).")
...
MooreFP1994:0:0:6 has 1 ambiguous form(s).
MooreFP1994:0:0:24 has 2 ambiguous form(s).
Alencar2021:0:0:53 has 1 ambiguous form(s).
Navarro2016:0:0:38 has 1 ambiguous form(s).
Navarro2016:0:0:40 has 1 ambiguous form(s). |
>>> sorted_ambi = dict(sorted(ambi.items(), key=lambda item: item[1], reverse=True))
>>> for sentid,freq in sorted_ambi.items():
...     print(f"{sentid} has {freq} ambiguous form(s).")
...
Avila2021:0:0:27 has 5 ambiguous form(s).
Aguiar1898:21-6:7:547 has 5 ambiguous form(s).
NTLN2019:1:1:8 has 3 ambiguous form(s).
Casasnovas2006:11:34:165 has 3 ambiguous form(s).
MooreFP1994:0:0:24 has 2 ambiguous form(s).
Avila2021:4:2:26 has 2 ambiguous form(s).
Avila2021:0:0:35 has 2 ambiguous form(s).
Avila2021:0:0:124 has 2 ambiguous form(s).
Avila2021:0:0:134 has 2 ambiguous form(s).
Avila2021:0:0:202 has 2 ambiguous form(s).
Avila2021:0:0:353 has 2 ambiguous form(s).
Avila2021:0:0:372 has 2 ambiguous form(s).
Avila2021:0:0:387 has 2 ambiguous form(s).
Avila2021:0:0:644 has 2 ambiguous form(s).
NTLN2019:8:1:47 has 2 ambiguous form(s).
NTLN2019:0:0:49 has 2 ambiguous form(s).
Cruz2011:0:0:19 has 2 ambiguous form(s).
Cruz2011:0:0:114 has 2 ambiguous form(s).
Cruz2011:0:0:116 has 2 ambiguous form(s).
Casasnovas2006:6:8:67 has 2 ambiguous form(s).
Casasnovas2006:9:3:87 has 2 ambiguous form(s).
Casasnovas2006:11:21:152 has 2 ambiguous form(s).
Casasnovas2006:11:27:158 has 2 ambiguous form(s).
Magalhaes1876:1-1-1:0:71 has 2 ambiguous form(s).
Studart1926:0:0:2810 has 2 ambiguous form(s).
MooreFP1994:0:0:6 has 1 ambiguous form(s).
Alencar2021:0:0:53 has 1 ambiguous form(s).
Navarro2016:0:0:38 has 1 ambiguous form(s).
Navarro2016:0:0:40 has 1 ambiguous form(s).
Navarro2016:0:0:126 has 1 ambiguous form(s).
Navarro2016:0:0:129 has 1 ambiguous form(s).
Navarro2016:0:0:135 has 1 ambiguous form(s).
Navarro2016:0:0:136 has 1 ambiguous form(s).
Navarro2016:0:0:139 has 1 ambiguous form(s).
Navarro2016:0:0:148 has 1 ambiguous form(s).
Avila2021:0:0:66 has 1 ambiguous form(s).
Avila2021:0:0:102 has 1 ambiguous form(s).
Avila2021:0:0:127 has 1 ambiguous form(s).
Avila2021:0:0:155 has 1 ambiguous form(s).
Avila2021:0:0:180 has 1 ambiguous form(s).
Avila2021:0:0:183 has 1 ambiguous form(s).
Avila2021:0:0:186 has 1 ambiguous form(s).
Avila2021:0:0:273 has 1 ambiguous form(s).
Avila2021:0:0:311 has 1 ambiguous form(s).
Avila2021:28:2:325 has 1 ambiguous form(s).
Avila2021:28:3:326 has 1 ambiguous form(s).
Avila2021:0:0:330 has 1 ambiguous form(s).
Avila2021:0:0:368 has 1 ambiguous form(s).
Avila2021:0:0:384 has 1 ambiguous form(s).
Avila2021:0:0:397 has 1 ambiguous form(s).
Avila2021:0:0:434 has 1 ambiguous form(s).
Avila2021:35:2:438 has 1 ambiguous form(s).
Avila2021:0:0:448 has 1 ambiguous form(s).
Avila2021:37:1:465 has 1 ambiguous form(s).
Avila2021:38:1:472 has 1 ambiguous form(s).
Avila2021:39:1:477 has 1 ambiguous form(s).
Avila2021:0:0:519 has 1 ambiguous form(s).
Avila2021:0:0:531 has 1 ambiguous form(s).
Avila2021:0:0:548 has 1 ambiguous form(s).
Avila2021:0:0:562 has 1 ambiguous form(s).
Avila2021:0:0:568 has 1 ambiguous form(s).
Avila2021:0:0:569 has 1 ambiguous form(s).
Avila2021:0:0:593 has 1 ambiguous form(s).
Avila2021:0:0:594 has 1 ambiguous form(s).
Avila2021:0:0:606 has 1 ambiguous form(s).
Avila2021:0:0:628 has 1 ambiguous form(s).
Avila2021:0:0:642 has 1 ambiguous form(s).
Avila2021:0:0:646 has 1 ambiguous form(s).
Avila2021:0:0:652 has 1 ambiguous form(s).
Avila2021:0:0:678 has 1 ambiguous form(s).
Avila2021:0:0:694 has 1 ambiguous form(s).
Avila2021:0:0:718 has 1 ambiguous form(s).
NTLN2019:1:2:9 has 1 ambiguous form(s).
NTLN2019:2:4:15 has 1 ambiguous form(s).
NTLN2019:8:2:48 has 1 ambiguous form(s).
Cruz2011:0:0:43 has 1 ambiguous form(s).
Cruz2011:0:0:50 has 1 ambiguous form(s).
Cruz2011:0:0:73 has 1 ambiguous form(s).
Cruz2011:0:0:82 has 1 ambiguous form(s).
Cruz2011:0:0:101 has 1 ambiguous form(s).
Cruz2011:0:0:117 has 1 ambiguous form(s).
Casasnovas2006:5:10:58 has 1 ambiguous form(s).
Casasnovas2006:8:6:74 has 1 ambiguous form(s).
Casasnovas2006:9:31:115 has 1 ambiguous form(s).
Casasnovas2006:10:8:127 has 1 ambiguous form(s).
Casasnovas2006:10:12:131 has 1 ambiguous form(s).
Casasnovas2006:11:7:138 has 1 ambiguous form(s).
Casasnovas2006:11:19:150 has 1 ambiguous form(s).
Casasnovas2006:0:0:17 has 1 ambiguous form(s).
Rodrigues1890:1-2-2:120:120 has 1 ambiguous form(s).
Rodrigues1890:2-5:4:104 has 1 ambiguous form(s).
Magalhaes1876:1:6:6 has 1 ambiguous form(s).
Magalhaes1876:1:12:12 has 1 ambiguous form(s).
Magalhaes1876:1:15:15 has 1 ambiguous form(s).
Magalhaes1876:1:21:21 has 1 ambiguous form(s).
Magalhaes1876:1:42:42 has 1 ambiguous form(s).
Magalhaes1876:1:43:43 has 1 ambiguous form(s).
Magalhaes1876:2:5:48 has 1 ambiguous form(s).
Magalhaes1876:2:6:49 has 1 ambiguous form(s).
Magalhaes1876:2:8:51 has 1 ambiguous form(s).
Amorim1928:19:51:51 has 1 ambiguous form(s).
Amorim1928:21:41:41 has 1 ambiguous form(s).
Amorim1928:21:61:61 has 1 ambiguous form(s).
Amorim1928:21:113:113 has 1 ambiguous form(s).
Amorim1928:6:378:378 has 1 ambiguous form(s).
Amorim1928:2:101:5500 has 1 ambiguous form(s).
Amorim1928:2:500:6000 has 1 ambiguous form(s).
Amorim1928:12:50:50 has 1 ambiguous form(s).
Amorim1928:18:25:25 has 1 ambiguous form(s).
Amorim1928:18:26:26 has 1 ambiguous form(s).
Hartt1938:0:0:9 has 1 ambiguous form(s).
Hartt1938:0:0:25 has 1 ambiguous form(s).
Costa1909:0:0:1001 has 1 ambiguous form(s).
Studart1926:0:0:36103 has 1 ambiguous form(s). |
@juliana-gurgel, I counted 147 ambiguous forms in a total of 114 sentences. Please check whether this matches:
>>> sum(ambi.values())
147
>>> len(ambi.keys())
114
>>> |
@juliana-gurgel, note that counting the occurrences of the string Mood=Imp,Ind with grep gives a different number:
grep -Ec "Mood=Imp,Ind" corpus/universal-dependencies/yrl_complin-ud-test.conllu
148
I found the reason for the discrepancy. However, as an exercise, I leave it to you to discover it on your own. It is quite instructive for understanding the importance of running a more structured search, such as with the Python commands above, which rely on the CoNLL-U Parser library, or by means of the |
@leoalenc, we considered using the Conllu library to look for this false positive, but we figured it would be complicated, since we are still learning to use it and had little idea of where to start looking. We chose to keep using VSCode's own RegEx search feature, but in a less trivial way. We noticed that the attribute We will be more careful in the future and run better-structured searches. Thanks for the help, @heliolbs! |
@juliana-gurgel, congratulations. Correct answer. Indeed, the use of |
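A discrepancy like the one between 148 (grep) and 147 (structured search) can be tracked down by listing every line where the string occurs somewhere other than a token's FEATS column. The sketch below is only an illustration under assumptions: the function name `find_stray_matches` and the toy data are invented, and the stray match placed in a comment line is a made-up example, not a statement about what the actual cause was in the treebank.

```python
def find_stray_matches(conllu_text, needle="Mood=Imp,Ind"):
    """Return (grep_like_count, feats_count, stray_lines)."""
    grep_like = 0  # lines containing the string anywhere, like `grep -c`
    in_feats = 0   # token lines whose FEATS column carries the feature
    stray = []     # (line number, line) pairs matched only outside FEATS
    for num, line in enumerate(conllu_text.splitlines(), start=1):
        if needle not in line:
            continue
        grep_like += 1
        cols = line.split("\t")
        is_token = not line.startswith("#") and len(cols) == 10
        if is_token and needle in cols[5].split("|"):
            in_feats += 1
        else:
            stray.append((num, line))
    return grep_like, in_feats, stray


# Toy data, invented for illustration: one real FEATS match and one
# hypothetical occurrence of the same string inside a comment line.
TOY = (
    "# sent_id = Toy:0:0:1\n"
    "# note = check Mood=Imp,Ind here\n"
    "1\ta\ta\tVERB\tV\tMood=Imp,Ind\t0\troot\t_\t_\n"
)

grep_like, in_feats, stray = find_stray_matches(TOY)
print(grep_like, in_feats)
for num, line in stray:
    print(f"line {num}: {line}")
```

The lines collected in `stray` are exactly the ones a plain string search counts but a FEATS-based filter does not, so inspecting them shows where the extra hit comes from.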
Professor @leoalenc, in today's meeting, @heliolbs, @dominickmaia and I discussed the pending disambiguation of Having listed all the sentences with ambiguities of After you validate the guidelines, via a comment in this issue, we plan to select a sample of 11 sentences (10% of the total) for all of us to disambiguate and review independently. We would then compare the 3 reviews and open an issue for your validation. On this point, we have one more question: may we include a review of the sentence's entire annotation along with the disambiguation task? |
@juliana-gurgel, I think the procedure should be this:
If, during the disambiguation activity, someone identifies an error or has doubts about the annotation of any token, they should open an issue in this repository if the question has not been raised yet; otherwise, they should post a comment in the open issue on the subject. |
@juliana-gurgel, and always record all activities here. |
@juliana-gurgel, @heliolbs and @dominickmaia, this is sample C2 mentioned above, but in an improved design. Instead of 10% of the sentences, it contains about 10% of the ambiguities. I will upload the script in an upcoming commit.
>>> import BuildSample
|
@juliana-gurgel, @dominickmaia and @heliolbs, I concluded that it makes more sense to generate the two samples, of about 10% and about 30% of the ambiguities, with the same script. Here are the two sets of sentences:
reload(BuildSample)
|
>>> set10=set([sentid for sentid, num in BuildSample.sample_10])
>>> set30=set([sentid for sentid, num in BuildSample.sample_30])
>>> set10.intersection(set30)
set()
>>> BuildSample.total_ambiguities
142
>>> 0.1*142
14.200000000000001
>>> 0.3*142
42.6
>>> sample10=BuildSample.sample_10
>>> type(sample10)
<class 'list'>
>>> sample10
[('NTLN2019:8:1:47', 2), ('Magalhaes1876:2:5:48', 1), ('Cruz2011:0:0:101', 1), ('Avila2021:0:0:562', 1), ('Avila2021:0:0:644', 2), ('Avila2021:35:2:438', 1), ('Avila2021:0:0:448', 1), ('Avila2021:0:0:397', 1), ('Cruz2011:0:0:73', 1), ('Avila2021:0:0:155', 1), ('Magalhaes1876:2:6:49', 1), ('Casasnovas2006:10:12:131', 1)] |
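The BuildSample script itself is not shown in this thread. As a rough idea of how two disjoint samples covering about 10% and about 30% of the ambiguities could be drawn, here is a minimal sketch. Everything in it is an assumption: the function `draw_samples`, the greedy fill strategy, the fixed seed, and the toy counts are hypothetical and may differ from what BuildSample actually does.

```python
import random


def draw_samples(ambi, fractions=(0.1, 0.3), seed=0):
    """Draw disjoint random samples of (sent_id, count) pairs.

    Each sample is filled greedily until its share of ambiguities
    reaches int(fraction * total), never exceeding that target.
    """
    rng = random.Random(seed)
    items = list(ambi.items())
    rng.shuffle(items)
    total = sum(ambi.values())
    samples = []
    for fraction in fractions:
        target = int(fraction * total)
        sample, covered, rest = [], 0, []
        for sentid, num in items:
            if covered + num <= target:
                sample.append((sentid, num))
                covered += num
            else:
                rest.append((sentid, num))
        samples.append(sample)
        items = rest  # later samples only see sentences not used so far
    return samples


# Toy counts, invented for illustration: 100 sentences with 1 or 2 forms each.
toy_ambi = {f"Toy:0:0:{i}": (2 if i % 3 == 0 else 1) for i in range(1, 101)}
toy_10, toy_30 = draw_samples(toy_ambi)
ids_10 = {sentid for sentid, _ in toy_10}
ids_30 = {sentid for sentid, _ in toy_30}
print(ids_10 & ids_30)  # disjoint by construction
print(sum(num for _, num in toy_10), sum(num for _, num in toy_30))
```

Because the shuffle is random, rerunning without a fixed seed yields a different selection each time, so the sample to be used has to be saved once drawn.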
@juliana-gurgel, @heliolbs and @dominickmaia, please note that the first sample of 10% of the ambiguities is no longer valid, because of the random selection. This is the one that counts:
|
@juliana-gurgel, @dominickmaia and @heliolbs, the 10% sample is in this commit. |
>>> set10=set([sentid for sentid, num in BuildSample.sample_10])
>>> set30=set([sentid for sentid, num in BuildSample.sample_30])
>>> set10.intersection(set30)
set()
>>> BuildSample.total_ambiguities
142
>>> 0.1*142
14.200000000000001
>>> 0.3*142
42.6
>>> sample10=BuildSample.sample_10
>>> type(sample10)
<class 'list'>
>>> sample10
[('NTLN2019:8:1:47', 2), ('Magalhaes1876:2:5:48', 1), ('Cruz2011:0:0:101', 1), ('Avila2021:0:0:562', 1), ('Avila2021:0:0:644', 2), ('Avila2021:35:2:438', 1), ('Avila2021:0:0:448', 1), ('Avila2021:0:0:397', 1), ('Cruz2011:0:0:73', 1), ('Avila2021:0:0:155', 1), ('Magalhaes1876:2:6:49', 1), ('Casasnovas2006:10:12:131', 1)]
>>> sentences10=[]
>>> sentences10=list(filter(lambda x: x.metadata['sent_id'] in [sentid for sentid, num in sample10],sents))
>>> sentences10[0]
TokenList<Pemaã, ,, wirandé, asú, amukameẽ, penhẽ, arã, iwité, ,, pemunhã, arã, pe, kupixá, pe, rimirikú-itá, irumu, .>
>>> sentences10[0].metadata['sent_id']
'Avila2021:0:0:155'
>>> Yauti.writeSentsConllu(sentences10,"/home/leonel/complin/nheengatu/test/sample10.conllu") |
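Note that `sentences10[0]` is 'Avila2021:0:0:155' rather than the first pair in `sample10`: filtering `sents` keeps the sentences in treebank order, not in sample order. The selection step can also be written with a set, which avoids rebuilding the ID list for every sentence. This is a minimal sketch; `select_by_id` is a hypothetical helper, and the `SimpleNamespace` objects are toy stand-ins for the TokenList objects returned by Yauti.

```python
from types import SimpleNamespace


def select_by_id(sents, sample):
    """Keep the sentences whose sent_id occurs in `sample`, in corpus order."""
    wanted = {sentid for sentid, _ in sample}  # set gives O(1) membership tests
    return [s for s in sents if s.metadata["sent_id"] in wanted]


# Toy stand-ins for parsed sentences, invented for illustration.
toy_sents = [
    SimpleNamespace(metadata={"sent_id": f"Toy:0:0:{i}"}) for i in range(1, 6)
]
toy_sample = [("Toy:0:0:4", 1), ("Toy:0:0:2", 2)]

selected = select_by_id(toy_sents, toy_sample)
print([s.metadata["sent_id"] for s in selected])
```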