techiaith/ataleiriau

Ataleiriau Cymraeg

Dyma restr o 488 o ataleiriau Cymraeg. Seiliwyd y detholiad hwn o ataleiriau ar y math o eirffurfiau isel eu gwybodaeth semantig sydd i'w cael yn rhestr spaCy o ataleiriau Saesneg.

https://github.com/techiaith/ataleiriau/blob/main/cy_ataleiriau_stopwords.txt

Cynhyrchwyd y rhestr hon â llaw gan ieithyddion gyda'r nod o greu rhestr o ataleiriau a fyddai yn cynhyrchu'r un effaith (yn fras) o'i defnyddio ar destun Cymraeg ag y byddai defnyddio'r rhestr Saesneg yn ei chael ar destun Saesneg cyfatebol (ond gweler y nodyn isod). Fodd bynnag, argymhellwn eich bod wastad yn teilwra eich rhestr o ataleiriau i'r dasg benodol sydd gennych mewn golwg.

Er mwyn rhoi ystyriaeth arbennig i'r ffurfiau priodol sy'n codi yn benodol yn y cyd-destun Cymraeg, cyfeiriwyd hefyd at y geiriau uchaf eu hamlder o fewn ein corpws ymchwil enfawr 'CYMES' (gweler https://zenodo.org/record/7007552#.Yv5conaZNhE%3F) fel y gellid llenwi unrhyw fylchau coll.

Mae'r rhestr hon fymryn yn hirach na rhestr Saesneg spaCy oherwydd yr angen i gynnwys ffurfiau treigledig a rhagor o rediadau berfol na'r hyn a fyddai yn addas yn y Saesneg. Er bod ffurfiau treigledig a rhediadau berfol ac arddodiadol wedi eu cynnwys, dim ond y rhai mwyaf mynych eu defnydd a gynhwyswyd - ni cheir yn y rhestr rediadau berfol llawn, er enghraifft.

Cynhwyswyd hefyd ychydig dros 50 o eiriau mwyaf cyffredin y Saesneg gan eu bod yn tueddu i godi mewn testunau Cymraeg. Mae'r rheiny ar ddiwedd y rhestr, wedi'r llythrennau acennog.

Nodyn: Mae rhai o eiriau cyffredin y Gymraeg, fel 'oes', 'maen' a 'rwyf', yn amwys ac yn gallu cyfeirio at wrthrychau yn ogystal â gweithredu fel ffurfiau berfol cyffredin. Credwn eu bod yn gymwys eu cynnwys, ond rhaid arfer gofal wrth ddefnyddio'r rhestr hon, a gwneud newidiadau priodol iddi yn ôl yr angen.

Welsh Stopwords List

This is a list of 488 Welsh stopwords. The selection is modelled on the kind of wordforms with low semantic content that are found in spaCy's English stopword list.

https://github.com/techiaith/ataleiriau/blob/main/cy_ataleiriau_stopwords.txt

The forms in this list were hand-picked by linguists. The aim was for the list to have roughly the same effect on a Welsh text as the English list would have on an equivalent English text (but see the note below). However, we recommend that you always adapt your stopword list to your intended task.
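As a minimal sketch of how the list might be applied, the Python below builds a stopword set and filters a tokenised sentence. It assumes the file holds one wordform per line in UTF-8, which is not stated above. The helper names `load_stopwords` and `remove_stopwords` and the five-word sample are illustrative only, and the `\w+` tokeniser is a simplification that splits contractions such as "mae'r" at the apostrophe.

```python
import re


def load_stopwords(lines):
    """Build a lowercase stopword set from an iterable of lines.

    Assumes one wordform per line; blank lines are skipped.
    """
    return {line.strip().lower() for line in lines if line.strip()}


def remove_stopwords(text, stopwords):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stopwords]


# Illustrative sample only; in practice pass an open file handle, e.g.
#   with open("cy_ataleiriau_stopwords.txt", encoding="utf-8") as f:
#       stopwords = load_stopwords(f)
sample_lines = ["a", "y", "yn", "mae", "ar"]
stopwords = load_stopwords(sample_lines)

print(remove_stopwords("Mae cath yn cysgu ar y gadair", stopwords))
# → ['cath', 'cysgu', 'gadair']
```

Keeping the stopwords in a set makes each membership check constant-time, which matters once the filter runs over a large corpus.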

To ensure that wordforms which arise specifically in the Welsh context were included, we also consulted the word frequency list of our large research corpus 'CYMES' (see https://zenodo.org/record/7007552#.Yv5conaZNhE%3F) so that any gaps could be filled.

This list is slightly longer than the spaCy English list because it needs to include mutated forms and a greater number of conjugated verb forms than would be appropriate in English. Although mutated forms, conjugated verbs and inflected prepositions are included, only the most frequently used ones appear in the list. We do not include, for example, the full conjugation of each verb.

As they tend to occur in Welsh texts, we also included just over 50 of the most common English words. Those appear at the end of the list, after the accented letters.

Note: Some common Welsh words, such as 'oes', 'maen' and 'rwyf', are ambiguous: they can refer to objects as well as act as common verbal forms. We believe it is appropriate to include them, but advise that the stopword list be used carefully and that appropriate changes be made to suit the use case.
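One way to act on this note is to derive a task-specific copy of the list rather than editing the file. The sketch below is hypothetical: `customise_stopwords` is not part of this repository, and the small `base` set stands in for the full list ('oes', 'maen' and 'rwyf' come from the note above; 'yn' and 'y' are assumed entries).

```python
def customise_stopwords(stopwords, keep=(), extra=()):
    """Return a new stopword set tailored to a task.

    Words in `keep` are removed from the stopword set so they survive
    filtering; words in `extra` are added. The input set is not modified.
    """
    kept = {word.lower() for word in keep}
    added = {word.lower() for word in extra}
    return (set(stopwords) - kept) | added


# Stand-in for the full list (illustrative entries only).
base = {"oes", "maen", "rwyf", "yn", "y"}

# Keep the ambiguous forms 'oes' and 'maen' as content words for this task.
custom = customise_stopwords(base, keep=["oes", "maen"])

print(sorted(custom))
# → ['rwyf', 'y', 'yn']
```

Returning a new set leaves the shared base list untouched, so different tasks can each apply their own adjustments.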
