language_resources.md
No. | Data Source Name, EN | Data Source Name, National Language | National Language ID | Data type | Source Type | Language(s) | Domain | Source | IPR / Licensing / Security considerations | Data Holder
1 ParIce - English-Icelandic parallel corpus - - parallel corpus IS EN the Bible, books, EEA documents, Patient information leaflets (EMA),
European Southern Observatory (ESO), Texts from the localization files
of KDE (KDE4) (from OPUS), OpenSubtitles (from OPUS), Sagas, Statistics
Iceland, Tatoeba, Ubuntu
https://repository.clarin.is/repository/xmlui/handle/20.500.12537/16 Creative Commons - Attribution 4.0 International (CC BY 4.0) info about data source in the following article: https://aclanthology.org/W19-6115.pdf
2 ParIce Dev/Test/Train Split 20.05 - - parallel corpus IS EN the Bible, books, EEA documents, Patient information leaflets (EMA),
European Southern Observatory (ESO), Texts from the localization files
of KDE (KDE4) (from OPUS), OpenSubtitles (from OPUS), Sagas, Statistics
Iceland, Tatoeba, Ubuntu
https://repository.clarin.is/repository/xmlui/handle/20.500.12537/24 Creative Commons - Attribution 4.0 International (CC BY 4.0) info about data source in the following article: https://aclanthology.org/W19-6115.pdf
3 En-Is Synthetic Parallel Corpus - - parallel corpus IS EN - https://repository.clarin.is/repository/xmlui/handle/20.500.12537/70 Icelandic Gigaword Corpus Part1 data sources: Wikipedia, Newscrawl and Europarl corpora; Icelandic Gigaword Corpus
4 En-Is Semi-Synthetic Parallel Name Robustness Corpus - - parallel corpus IS EN person names https://repository.clarin.is/repository/xmlui/handle/20.500.12537/74 Creative Commons - Attribution 4.0 International (CC BY 4.0) data source: based on the ParIce corpus
5 cities_is2en - - parallel corpus IS EN city names https://repository.clarin.is/repository/xmlui/handle/20.500.12537/66 Creative Commons - Attribution 4.0 International (CC BY 4.0) data source: information provided by the Icelandic Ministry for Foreign Affairs and the Árni Magnússon Institute for Icelandic Studies
6 Gold Alignments for English-Icelandic Word Alignments - - parallel lexical conceptual resource IS EN - https://repository.clarin.is/repository/xmlui/handle/20.500.12537/103 - -
7 UD Icelandic PUD - - parallel corpus IS EN news, wikipedia https://universaldependencies.org/treebanks/is_pud/index.html CC BY-SA 4.0 -
8 OPUS (The Open Parallel Corpus) - - parallel website/corpus multilingual - https://opus.nlpl.eu/ - -
9 WikiMatrix v1 - - parallel corpus IS EN - https://opus.nlpl.eu/WikiMatrix-v1.php CC-BY-SA 4.0 data source: wikimedia
10 wikimedia v20210402 - - parallel corpus IS EN - https://opus.nlpl.eu/wikimedia-v20210402.php CC-BY-SA 4.0 data source: wikimedia
11 XLEnt v1.1 - - parallel corpus IS EN - https://opus.nlpl.eu/XLEnt-v1.1.php - -
12 TildeMODEL - - parallel corpus IS EN document texts of European Economic and Social Committee document portal;
press releases; banking; medicine; travel; tourism; texts of Lithuanian National
Philharmonic Society web site; Müpa Budapest - web site of Hungarian national
culture house and concert venue; texts of fold.lv portal http://www.fold.lv/en/
of the best of Latvian and foreign creative industries; from texts of
http://czechtourism.com/ portal
https://opus.nlpl.eu/TildeMODEL-v2018.php CC-BY - Creative Commons with Attribution data sources:
http://dm.eesc.europa.eu/; http://europa.eu/rapid/; http://ebc.europa.eu/; http://www.ema.europa.eu/; http://www.worldbank.org/; https://www.airbaltic.com/en/destinations/; http://liveriga.com/; http://www.filharmonija.lt/; https://www.mupa.hu/en/; http://www.fold.lv/en/; http://czechtourism.com/
13 CCAligned v1 - - parallel corpus IS EN - https://opus.nlpl.eu/CCAligned-v1.php - -
14 JW300 v1b - - parallel corpus IS EN - https://opus.nlpl.eu/JW300-v1b.php For all practical purpose, the license is CC-BY-NC-SA.
Still, jw.org maintains custom terms of use
[https://www.jw.org/en/terms-of-use/];
in doubt, make sure to observe their license!
data source: jw.org
15 QED - - parallel corpus IS EN education https://opus.nlpl.eu/QED-v2.0a.php "The QED Corpus is made public for RESEARCH purpose only. The
corpus is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Copyright Qatar Computing Research Institute. All rights reserved."
-
16 Mozilla-I10n v1 - - parallel corpus IS EN Mozilla localisation/internationalisation data https://opus.nlpl.eu/Mozilla-I10n-v1.php Mozilla Public License 2.0 -
17 EUbookshop - - parallel corpus IS EN documents from the EU bookshop https://opus.nlpl.eu/EUbookshop-v2.php - data source: http://bookshop.europa.eu
18 TED2020 v1 - - parallel corpus IS EN - https://opus.nlpl.eu/TED2020-v1.php License: Please respect the TED Talks Usage Policy data source: crawl of nearly 4000 TED and TED-X transcripts
19 Paracrawl - - parallel corpus multilingual - https://paracrawl.eu/ Creative Commons CC0 license ("no rights reserved") -
20 Paracrawl Synthesized Data - - parallel corpus multilingual covid-19 https://paracrawl.eu/manufactured-data -
21 Parallel English-Icelandic corpus from the Icelandic
Directorate for International Development
Cooperation website
- - parallel corpus IS EN - https://data.europa.eu/data/datasets/elrc_504?locale=en Creative Commons Attribution 4.0 International data source: Icelandic Directorate for International Development Cooperation website
22 Ríkiskaup (Central Public
Procurement) - Translation
Memory 2020
- - parallel corpus IS EN - https://elrc-share.eu/repository/browse/rikiskaup-central-public-procurement-translation-memory-2020/cd8551a8c78511eb9c1a00155d0267069d21c63733144b2fa4b9c9cfcbababc4/
IPR Holders: Ríkiskaup (Central Public Procurement)
https://elrc-share.eu/static/metashare/licences/CC-BY-4.0.pdf
data source: internal bilingual documents from Ríkiskaup (https://www.rikiskaup.is/)
23 University of Iceland's TM - - parallel corpus IS EN includes translations of rules, procedures, contracts, policies, announcements,
letters, speeches and news
https://elrc-share.eu/repository/browse/university-of-icelands-tm/7cd401ccc79a11eb9c1a00155d026706fb642f0237ec4bd9a590b5bc81441512/
IPR Holders: Abigail Charlotte Cooper; University of Iceland
https://elrc-share.eu/static/metashare/licences/CC-BY-4.0.pdf
-
24 The Icelandic Met Office -
Weather forecasts and
warnings
- - parallel corpus IS EN Meteorological reports https://elrc-share.eu/repository/browse/the-icelandic-met-office-weather-forecasts-and-warnings/6963b446c56411eb9c1a00155d0267068a479fe4536c4c9e83eb051f37198b7e/
IPR Holders: Icelandic Meteorological Office
https://elrc-share.eu/static/metashare/licences/CC-BY-4.0.pdf
https://vedur.is/
25 Government Offices in Iceland - Reports - - parallel corpus IS EN eGovernment https://elrc-share.eu/repository/browse/government-offices-in-iceland-reports/6963b445c56411eb9c1a00155d026706c22c67dbe78743059ee4a388fab2cc7c/
IPR Holders: Government Offices of Iceland
https://elrc-share.eu/static/metashare/licences/CC-BY-4.0.pdf
data source: www.government.is; www.stjornarradid.is
26 Government Offices in
Iceland – Legislation
and regulations
- - parallel corpus IS EN eJustice/LAW https://elrc-share.eu/repository/browse/government-offices-in-iceland-legislation-and-regulations/6ad42ad5c56411eb9c1a00155d026706aac46ebe9197417e999d6f2b768c0bf7/
IPR Holders: Government Offices of Iceland
https://elrc-share.eu/static/metashare/licences/CC-BY-4.0.pdf
data source: documents on the Icelandic and English websites of the Government Offices in Iceland
27 Bilingual corpus made out of PDF
documents from the European
Medicines Agency, (EMEA),
https://www.ema.europa.eu,
(February 2020) (EN-IS)
- - parallel corpus IS EN SOCIAL QUESTIONS Health (Eurovoc 2841) https://elrc-share.eu/repository/browse/bilingual-corpus-made-out-of-pdf-documents-from-the-european-medicines-agency-emea-httpswwwemaeuropaeu-february-2020-en-is/29110788862811ea913100155d0267069f685ed8fd1e4ae088600d9c99af303c/

multilingual version of this parallel corpus:
https://elrc-share.eu/repository/browse/multilingual-corpus-made-out-of-pdf-documents-from-the-european-medicines-agency-emea-httpswwwemaeuropaeu-february-2020/3cf9da8e858511ea913100155d0267062d01c2d847c349628584d10293948de3/
https://elrc-share.eu/static/metashare/licences/CC-BY-4.0.pdf data source: PDF documents from the European Medicines Agency (https://www.ema.europa.eu)
28 Bilingual English-Icelandic parallel
corpus from the official Nordic
cooperation website
- - parallel corpus IS EN INTERNATIONAL ORGANISATIONS (Eurovoc 76), POLITICS (Eurovoc 04),
INTERNATIONAL RELATIONS (Eurovoc 08)
https://elrc-share.eu/repository/browse/bilingual-english-icelandic-parallel-corpus-from-the-official-nordic-cooperation-website/0e9d06707ad311e8b7d400155d026706ce2fbb1eb16c412a8ba9080d7490657d/
IPR Holders: Nordic Council; Nordic Council of Ministers
Open Under-PSI (more info about the license)
data source: Nordic Co-operation website http://www.norden.org
29 Bilingual English-Icelandic
parallel corpus from Harpa
Reykjavik Concert Hall and
Conference Centre website
- - parallel corpus IS EN - https://elrc-share.eu/repository/browse/bilingual-english-icelandic-parallel-corpus-from-harpa-reykjavik-concert-hall-and-conference-centre-website/b56c64c2e4d411e7b7d400155d0267060908d10ea65d42b58b429dd9301a7582/
IPR Holders: Harpa Reykjavik Concert Hall and Conference Centre
Open Under-PSI (more info about the license)
data source: contents of https://en.harpa.is and https://www.harpa.is
30 Bilingual English-Icelandic parallel corpus
from Icelandic Financial Supervisory Authority
- - parallel corpus IS EN BUSINESS & COMPETITION (Eurovoc 40) https://elrc-share.eu/repository/browse/bilingual-english-icelandic-parallel-corpus-from-icelandic-financial-supervisory-authority/f2a5b200e4c311e7b7d400155d02670665375c54796744de9689b6b49deb74ed/
IPR Holders: Financial Supervisory Authority Iceland
Open Under-PSI (more info about the license)
data source: contents of https://en.fme.is/ and https://www.fme.is/
31 Bilingual English-Icelandic parallel corpus
from Icelandic Post and Telecom Administration website
- - parallel corpus IS EN LAW (Eurovoc 12) https://elrc-share.eu/repository/browse/bilingual-english-icelandic-parallel-corpus-from-icelandic-post-and-telecom-administration-website/d6cc14a8e4c711e7b7d400155d02670668f4c1b127ee42ab9108ee2d0f2eb4b7/
IPR Holders: Post and Telecom Administration in Iceland
Open Under-PSI (more info about the license)
data source: contents of https://www.pfs.is/
32 Bilingual English-Icelandic parallel corpus
from Nordisk eTax website
- - parallel corpus IS EN FINANCE (Eurovoc 24) https://elrc-share.eu/repository/browse/bilingual-english-icelandic-parallel-corpus-from-nordisk-etax-website/c0970ab4eadd11e7b7d400155d026706fd923049f0ab48678edf9c4ae3fcdf71/
IPR Holders: Nordisk eTax
Open Under-PSI (more info about the license)
data source: contents of https://www.nordisketax.net/
33 Bilingual is-en parallel corpus from
Icelandic Medicines Agency website
- - parallel corpus IS EN - https://elrc-share.eu/repository/browse/bilingual-is-en-parallel-corpus-from-icelandic-medicines-agency-website/4a6ddf7ae56011e7b7d400155d026706bfb8c1760a814dea8b0ce4300da6b504/
IPR Holders: Icelandic Medicines Agency
Open Under-PSI (more info about the license)
data source: contents of https://www.lyfjastofnun.is/ and https://www.ima.is
34 Bilingual is-en parallel corpus from
National Gallery of Iceland website
- - parallel corpus IS EN - https://elrc-share.eu/repository/browse/bilingual-is-en-parallel-corpus-from-national-gallery-of-iceland-website/0958d46ee4d311e7b7d400155d0267061663be21d26647d1accefea64fc4db3b/
IPR Holders: NATIONAL GALLERY OF ICELAND
Open Under-PSI (more info about the license)
data source: contents of http://www.listasafn.is
35 Bilingual is-en parallel corpus from
The Icelandic Directorate of Immigration website
- - parallel corpus IS EN - https://elrc-share.eu/repository/browse/bilingual-is-en-parallel-corpus-from-the-icelandic-directorate-of-immigration-website/2467fa26e56111e7b7d400155d026706bfd15d2901e94fdf979f4f1fff86c318/
IPR Holders: Útlendingastofnun, The Directorate of Immigration, Iceland
Open Under-PSI (more info about the license)
data source: contents of http://www.utl.is
36 Bilingual is-en parallel corpus
from THE LITERATURE WEB website
- - parallel corpus IS EN - https://elrc-share.eu/repository/browse/bilingual-is-en-parallel-corpus-from-the-literature-web-website/b5a7f5fee4d511e7b7d400155d026706cfd7be18e5bd497fb00355ba4d23741d/
IPR Holders: City of Reykjavík
Open Under-PSI (more info about the license)
data source: contents of https://bokmenntaborgin.is/
37 Parallel English-Icelandic corpus from the
contents of Icelandic National Debt
Management Agency website
- - parallel corpus IS EN ECONOMICS (Eurovoc 16), FINANCE (Eurovoc 24) https://elrc-share.eu/repository/browse/parallel-english-icelandic-corpus-from-the-contents-of-icelandic-national-debt-management-agency-website/827c09c0e4cd11e7b7d400155d026706b3e67c6af2754f67a335ce5d6068d223/
IPR Holders: Central Bank of Iceland
Open Under-PSI (more info about the license)
data source: contents of http://www.lanamal.is
38 Parallel English-Icelandic corpus from
the Icelandic Directorate for International
Development Cooperation website
- - parallel corpus IS EN INTERNATIONAL RELATIONS (Eurovoc 08) https://elrc-share.eu/repository/browse/parallel-english-icelandic-corpus-from-the-icelandic-directorate-for-international-development-cooperation-website/eaca6b40e4c611e7b7d400155d0267065d1af2425274432fafd35c1f93ff097e/
IPR Holders: Government Offices of Iceland
Open Under-PSI (more info about the license)
data source: contents of http://www.iceida.is/
39 EAC Translation memory - Forms Data - - parallel corpus IS EN electronic forms, EDUCATION & COMMUNICATIONS (Eurovoc 32) https://elrc-share.eu/repository/browse/eac-translation-memory-forms-data/0ed2d886c1f711eb9c1a00155d026706c66213a889bf4297a834b3b0f21c84e1/
IPR Holders: Directorate General for Education and Culture
CC-BY-4.0
-
40 EAC Translation memory - Reference Data - - parallel corpus IS EN Electronic Reference Data https://elrc-share.eu/repository/browse/eac-translation-memory-reference-data/67911206c56411eb9c1a00155d02670635d5c6e318714fa4803d8099f75f7bcb/
IPR Holders: Directorate General for Education and Culture
CC-BY-4.0
-
41 META-NORD Sofie Parallel Treebank - - parallel corpus DA EN ET DE IS NO SV - https://clarino.uib.no/iness/page?page-id=Sofie&session-id=251398323083844

Note: this link returns the error message "Fake or stale session id". To find the corpus, select "Treebank selection" under "Treebanks" on the left side of the webpage, select "Icelandic" under "Languages" and "Sofie" under "Treebank Collections", select "Show only parallel Treebanks", then click "Sofie" under "Collection" at the bottom.
License: http://license.no/ data source: first chapters of the novel Sofies verden by Jostein Gaarder, published by Aschehoug forlag
42 EAC Translation Memory - - parallel corpus BG CS DA ES ET FI FR HU IS IT LT
LV NL PL RO SK SL SV TR EN EL
PT DE HR MT NB NO
law, culture, education https://inventory.clarin.gr/corpus/733 Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
https://creativecommons.org/licenses/by/4.0/
-
43 ECDC Translation Memory - - parallel corpus BG CS DA ES ET FI FR HU IS IT
LT LV MT NB NL PL RO SK SL
SV TR EN PT DE EL
Medicine & Health https://inventory.clarin.gr/corpus/729 Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
https://creativecommons.org/licenses/by/4.0/
-
44 PELCRA multilingual parallel corpora (CC-BY) - - parallel corpus DE EN ES FR IT PL CS DA FI IS
NL NO PT RU SV TR UK BG EL
ET HU LT LV MT RO SK SL AR
BE GA HR
Law, Science, Political Science https://inventory.clarin.gr/corpus/665 Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
https://creativecommons.org/licenses/by/4.0/
data source: CORDIS news; ESO website; European Parliament website; EUROPA website
45 META-NORD Acquis Parallel Treebank - - parallel corpus ET IS SV NO EN DA FI law https://clarino.uib.no/iness/clarino-metadata?session-id=251398323083844&identifier=Acquis

Note: this link returns the error message "Fake or stale session id". To find the corpus, click "Home page" and select "Treebank selection" under "Treebanks" on the left side of the webpage, select "Icelandic" under "Languages" and "Acquis" under "Treebank Collections", select "Show only parallel Treebanks", then click "Acquis" under "Collection" at the bottom.
Creative_Commons-BY (CC-BY) data source: Directive 2002/74/EC from the Acquis Communautaire (AC)
46 GreynirCorpus (2021-06-23) - - monolingual, parsed corpus corpus IS mostly news sources https://repository.clarin.is/repository/xmlui/handle/20.500.12537/119 Creative Commons - Attribution 4.0 International (CC BY 4.0) -
47 The Icelandic Contemporary
Treebank (IceConTree) Version 1.1
Samtímalegi íslenski
trjábankinn
IS monolingual, parsed corpus corpus IS parliamentary text, speech, law text, text from media, text from radio,
text from the internet, text from television, encyclopedia
https://repository.clarin.is/repository/xmlui/handle/20.500.12537/112 Creative Commons - Attribution 4.0 International (CC BY 4.0) -
48 The Icelandic Parsed Historical Corpus (IcePaHC) Sögulegi íslenski
trjábankinn
IS monolingual, parsed corpus corpus IS narratives and religious material but some samples from other genres https://repository.clarin.is/repository/xmlui/handle/20.500.12537/62 Creative Commons - Attribution 4.0 International (CC BY 4.0) data source: consists of texts from the Icelandic Gigaword Corpus
49 NeuralMIcePaHC (2020-05-07) - - monolingual, parsed corpus corpus IS Icelandic texts from the 13th to 20th century, mostly Icelandic sagas https://repository.clarin.is/repository/xmlui/handle/20.500.12537/20 Creative Commons - Attribution 4.0 International (CC BY 4.0) -
50 Icelandic Gigaword Corpus 1 (IGC1) - version 20.05 Risamálheildin 1 -
Útgáfa 20.05
IS monolingual, tagged and lemmatized corpus corpus IS official texts (e.g. parliamentary speeches as far back as 1911, law text,
adjudications); big text collections from news media and various texts
from the text collection of the Árni Magnússon Institute for Icelandic
Studies (http://igc.arnastofnun.is/)
https://repository.clarin.is/repository/xmlui/handle/20.500.12537/41 Icelandic Gigaword Corpus Part1 -
51 Icelandic Gigaword Corpus 2 (IGC2) - version 20.05 Risamálheildin 2 -
Útgáfa 20.05
IS monolingual, tagged and lemmatized corpus corpus IS official texts (parliamentary speeches, law text, adjudications); news media and
various texts from the text collection of the Árni Magnússon Institute
for Icelandic Studies
https://repository.clarin.is/repository/xmlui/handle/20.500.12537/33 Creative Commons - Attribution 4.0 International (CC BY 4.0) -
52 IGC-Adjud-21.05 (The Icelandic Gigaword
Corpus: Adjudications)
- - monolingual, tagged and lemmatized corpus corpus IS - https://repository.clarin.is/repository/xmlui/handle/20.500.12537/101 Creative Commons - Attribution 4.0 International (CC BY 4.0) data source: judgements that have been published on the websites of the three levels of jurisdiction in Iceland
53 IGC-Laws-21.05 (The Icelandic Gigaword
Corpus: Laws, bills and proposals)
- - monolingual, tagged and lemmatized corpus corpus IS 1) the Icelandic laws, 2) explanatory reports and observations
extracted from bills submitted to Althingi, and 3) parliamentary
proposals and resolutions
https://repository.clarin.is/repository/xmlui/handle/20.500.12537/116 Creative Commons - Attribution 4.0 International (CC BY 4.0) -
54 IGC-Parla-21.05 (The Icelandic Gigaword
Corpus: Parliamentary speeches)
- - monolingual, tagged and lemmatized corpus corpus IS parliamentary speeches that have been encoded according to
the Parla-CLARIN recommendations
https://repository.clarin.is/repository/xmlui/handle/20.500.12537/111 Creative Commons - Attribution 4.0 International (CC BY 4.0) -
55 IGC - evaluation set 20.09 - - monolingual, tagged and lemmatized corpus corpus IS adjudications, books, educational websites, legal texts,
news, opinions, parliamentary speeches, sport news and
radio and tv news scripts
https://repository.clarin.is/repository/xmlui/handle/20.500.12537/51 Icelandic Mim Gold Standard for PoS Tagging data source: Icelandic Gigaword Corpus (version 2018)
56 Icelandic Frequency
Dictionary 2020.05 -
training/testing sets
Orðtíðnibókin 2020.05
þjálfunar-/prófunarsafn
IS monolingual, tagged and lemmatized corpus corpus IS Icelandic fiction, translated fiction, biographies and memoirs,
non-fiction (field of humanities, field of science) and books for
children and teenagers (original texts, translations)
https://repository.clarin.is/repository/xmlui/handle/20.500.12537/38 Icelandic Frequency Dictionary -
57 MIM-GOLD 21.05 MÍM-Gull 21.05 IS monolingual, tagged and lemmatized corpus corpus IS texts are from The Tagged Icelandic Corpus (MÍM) https://repository.clarin.is/repository/xmlui/handle/20.500.12537/113 Icelandic Mim Gold Standard for PoS Tagging data source: texts are from The Tagged Icelandic Corpus (MÍM)
58 MIM-GOLD 21.05 - train/test MÍM-Gull 21.05 -
þjálfunar-/prófunargögn
IS monolingual, tagged and lemmatized corpus corpus IS texts are from The Tagged Icelandic Corpus (MÍM) https://repository.clarin.is/repository/xmlui/handle/20.500.12537/114 Icelandic Mim Gold Standard for PoS Tagging -
59 Tagged Icelandic Corpus Mörkuð íslensk málheild IS monolingual, tagged and lemmatized corpus corpus IS among other things: newspapers; text from various printed periodicals;
official texts (speeches from the Icelandic Parliament (Alþingi), legal texts
and adjudications, and texts from the websites of government ministries)
http://www.malfong.is/index.php?pg=mim&lang=en
https://clarin.is/en/resources/mim/
Special User License -
60 Talromur Talrómur IS monolingual, speech corpus corpus IS - https://repository.clarin.is/repository/xmlui/handle/20.500.12537/104 Creative Commons - Attribution 4.0 International (CC BY 4.0) audio was recorded by Reykjavík University and The
Icelandic National Broadcasting Service
61 RÚV TV data Rúv TV gagnasafnið IS monolingual, speech corpus corpus IS news commentary, literature discussions, and the prime time news https://repository.clarin.is/repository/xmlui/handle/20.500.12537/93 Creative Commons - Attribution 4.0 International (CC BY 4.0) data source: TV data from RÚV; published by the Icelandic
National Broadcasting Service - Ríkisútvarpið (RÚV) and
made by both RÚV and Reykjavik University
62 The RÚV Corpus RÚV-málheildin IS monolingual, speech corpus corpus IS read news items that includes a large vocabulary http://www.malfong.is/index.php?pg=ruv&lang=en - -
63 Islex Recordings Hljóðskrár ISLEX IS monolingual, speech corpus corpus IS - https://repository.clarin.is/repository/xmlui/handle/20.500.12537/6
http://www.malfong.is/index.php?lang=en&pg=islexrecordings
CC-BY-NC-ND license -
64 The Hjal Corpus Hjal IS monolingual, speech corpus corpus IS - https://clarin.is/en/resources/hjal/
http://www.malfong.is/index.php?pg=hjal&lang=en
CC BY 3.0 license -
65 Parliament Speech Corpus Alþingisumræður IS monolingual, speech corpus corpus IS government budget, taxation, water laws, energy, schools and transportation http://www.malfong.is/index.php?pg=althingi&lang=en CC BY 3.0 license data source: recordings were obtained directly from the Icelandic Parliament (Althingi)
66 Althingi’s Parliamentary Speeches Alþingisgögnin IS monolingual, speech corpus corpus IS Althingi recordings http://www.malfong.is/index.php?pg=althingisraedur&lang=en Creative Commons - Attribution 4.0 International (CC BY 4.0) data source: Althingi recordings
67 The Jensson Corpus Jenson-málheildin IS monolingual, speech corpus corpus IS - http://www.malfong.is/index.php?pg=jensson&lang=en -
68 The Thor Corpus Þór-málheildin IS monolingual, speech corpus corpus IS weather http://www.malfong.is/index.php?pg=thor&lang=en - data source: the text was translated from MIT's JUPITER corpus
69 The Malromur Corpus Málrómur IS monolingual, speech corpus corpus IS - https://clarin.is/en/resources/malromur/
http://www.malfong.is/index.php?pg=malromur&lang=en
Creative Commons - Attribution 4.0 International (CC BY 4.0) data source: part of text is from mbl.is
70 General Pronunciation Dictionary for ASR Almenn framburðarorðabók
fyrir talgreiningu
IS monolingual, speech corpus corpus IS - http://www.malfong.is/index.php?lang=en&pg=framb_talgr Creative Commons - Attribution 4.0 International (CC BY 4.0) -
71 Samromur 21.05 Samrómur 21.05 IS monolingual, speech corpus corpus IS - https://www.openslr.org/112/ CC BY 4.0 -
72 Pronunciation Dictionary for Icelandic Framburðarorðabókin IS monolingual, language description corpus IS news, novels, Ístal Corpus https://clarin.is/en/resources/prondict/
http://www.malfong.is/index.php?pg=framburdur&lang=en
CC BY 3.0 license data source: newspaper Morgunblaðið, recent novels, and the Ístal Corpus
73 Patterns and Sentences Mynstur og setningar IS monolingual, language description corpus IS extracted from novels http://www.malfong.is/index.php?pg=mynsturogsetningar&lang=en CC BY 3.0 license -
74 The Icelandic Dyslexia
Error Corpus (IceDEC) Version 1.0
Íslenska
lesblinduvillumálheildin
IS monolingual, error corpus corpus IS - https://repository.clarin.is/repository/xmlui/handle/20.500.12537/107 Creative Commons - Attribution 4.0 International (CC BY 4.0) -
75 Icelandic Error Corpus (IceEC) Version 1.1 Íslenska villumálheildin -
Útgáfa 1.1
IS monolingual, error corpus corpus IS student essays, online news texts and Icelandic Wikipedia articles https://repository.clarin.is/repository/xmlui/handle/20.500.12537/105 Creative Commons - Attribution 4.0 International (CC BY 4.0) -
76 The Icelandic Child Language Error
Corpus (IceCLEC) Version 1.0
Villumálheild
íslensks barnamáls
IS monolingual, error corpus corpus IS - https://repository.clarin.is/repository/xmlui/handle/20.500.12537/108 Creative Commons - Attribution 4.0 International (CC BY 4.0) -
77 The Icelandic L2 Error
Corpus (IceL2EC) Version 1.1
Villumálheild íslensku
sem annars máls
IS monolingual, error corpus corpus IS - https://repository.clarin.is/repository/xmlui/handle/20.500.12537/106 Creative Commons - Attribution 4.0 International (CC BY 4.0) -
78 Icelandic Error Corpus Nonwords Óorð íslensku
villumálheildarinnar
IS monolingual, error corpus corpus IS - https://repository.clarin.is/repository/xmlui/handle/20.500.12537/63 Creative Commons - Attribution 4.0 International (CC BY 4.0) -
79 Icelandic Search Query
Errors (IceSQuEr) 0.1
Íslenskar leitarvillur IS monolingual, error corpus corpus IS - https://repository.clarin.is/repository/xmlui/handle/20.500.12537/78 Creative Commons - Attribution 4.0 International (CC BY 4.0) data source: users' search queries that do not
give results in the Database of Icelandic Morphology (https://bin.arnastofnun.is/)
80 nonwords - - monolingual, error corpus corpus IS - https://repository.clarin.is/repository/xmlui/handle/20.500.12537/50 Creative Commons - Attribution 4.0 International (CC BY 4.0) data source: the list was prepared using a word list from the DMII (The Database of Modern Icelandic Inflection)
81 The Icelandic Confusion
Set Corpus (ICoSC) 2.0 (2020-05-06)
- - monolingual, error corpus corpus IS - https://repository.clarin.is/repository/xmlui/handle/20.500.12537/19 Creative Commons - Attribution 4.0 International (CC BY 4.0) -
82 MIM-GOLD-NER – named
entity recognition corpus
Nafnkennslamálheildin IS monolingual corpus IS NE version of MIM-GOLD https://repository.clarin.is/repository/xmlui/handle/20.500.12537/42 Icelandic Gigaword Corpus Part1 -
83 IceSum - Icelandic Text
Summarization Corpus
- - monolingual corpus IS local, world, business and sports news https://repository.clarin.is/repository/xmlui/handle/20.500.12537/96 Creative Commons - Attribution 4.0 International (CC BY 4.0) data source: news articles from mbl.is
84 The Saga Corpus Fornritin IS multilingual corpus IS old-norse Old Icelandic narrative texts: Family Sagas, Sturlunga Saga, Sagas of the Kings
of Norway and the Book of Settlement
https://repository.clarin.is/repository/xmlui/handle/20.500.12537/32 Creative Commons - Attribution 4.0 International (CC BY 4.0) data sources:

Family Sagas --> Bragi Halldórsson, Jón Torfason and Örnólfur Thorsson (eds.). 1985-1986. Íslendinga sögur. Svart á hvítu. Reykjavík.

Heimskringla --> Bergljót Kristjánsdóttir, Bragi Halldórsson, Jón Torfason and Örnólfur Thorsson (eds.). 1991. Heimskringla. Mál og menning. Reykjavík.

Book of Settlement --> Jakob Benediktsson (ed.). 1968. Íslenzk fornrit I. Íslendingabók - Landnámabók. Hið íslenzka fornritafélag.

Sturlunga Saga --> Örnólfur Thorsson, Bergljót Kristjánsdóttir, Bragi Halldórsson, Gísli Sigurðsson, Guðrún Ása Grímsdóttir, Guðrún Ingólfsdóttir, Jón Torfason and Sverrir Tómasson (eds.). 1988. Sturlunga saga. Svart á hvítu. Reykjavík.
85 Icelandic Taboo Database
(iceTaboo) Version 1.0
- - monolingual corpus IS - https://repository.clarin.is/repository/xmlui/handle/20.500.12537/64 Creative Commons - Attribution 4.0 International (CC BY 4.0) -
86 Icelandic Web Text Corpus Íslenskur orðasjóður IS monolingual website/corpus IS - https://corpora.uni-leipzig.de/en?corpusId=isl-is_web_2019 -
87 Icelandic Multi-SimLex - - monolingual lexical conceptual resource IS - https://repository.clarin.is/repository/xmlui/handle/20.500.12537/121 Creative Commons - Attribution 4.0 International (CC BY 4.0) -
88 IceBATS - The Icelandic
Bigger Analogy Test Set
- - monolingual lexical conceptual resource IS - https://repository.clarin.is/repository/xmlui/handle/20.500.12537/120 Creative Commons - Attribution 4.0 International (CC BY 4.0) -
89 Icegrams (2020-09-30) - - monolingual language description IS - https://repository.clarin.is/repository/xmlui/handle/20.500.12537/80 The MIT License (MIT) "The Icegrams trigram corpus is built from the 2017 edition of the Icelandic Gigaword Corpus"
90 Icelandic Hyphenation Dictionary Íslenskur orðskiptingalisti og
orðskiptingamynstur
IS monolingual lexical conceptual resource IS - https://clarin.is/en/resources/hyphenation/
https://repository.clarin.is/repository/xmlui/handle/20.500.12537/86
Creative Commons - Attribution 4.0 International (CC BY 4.0) -
91 MerkOr MerkOr - íslenskur
merkingarbrunnur
IS monolingual corpus, tool IS - https://clarin.is/en/resources/merkor/ LGPL-3.0 License -
92 Terminology Database
of the Ministry of Foreign Affairs
Hugtakasafn
þýðingarmiðstöðvar
utanríkisráðuneytisins
IS parallel, multilingual website/terminology IS EN DA NO SV FR DE LA law, administration, names of international agreements, institutions,
committees, councils etc, glossary of the Icelandic International Development
Agency (ICEIDA)
https://clarin.is/en/resources/translation/
https://hugtakasafn.utn.stjr.is/umhts.adp, https://hugtakasafn.utn.stjr.is/
- -
93 Icelandic Wordnet Íslenskt orðanet IS monolingual website IS - https://clarin.is/en/resources/icewordnet/, https://ordanet.is/ - -
94 The Icelandic Wordweb 21.06 - - monolingual lexical conceptual resource IS - https://repository.clarin.is/repository/xmlui/handle/20.500.12537/117 Creative Commons - Attribution 4.0 International (CC BY 4.0) -
95 IceWordNet - - monolingual (similar to a thesaurus) IS - https://clarin.is/en/resources/iwn/ CC BY 3.0 license data source: English words in the Princeton Core WordNet were translated into Icelandic; the synonyms of the Icelandic words were listed with the help of the Icelandic Thesaurus and the website snara.is.
96 Dictionary of Modern Icelandic Íslensk
nútímamálsorðabók
IS monolingual website/lexical conceptual resource IS - https://clarin.is/en/resources/dmi/
https://islenskordabok.arnastofnun.is/
https://repository.clarin.is/repository/xmlui/handle/20.500.12537/94
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) "based on the multilingual dictionary ISLEX"
97 The Institute of Lexicography
Written Language Archive
Ritmálssafn
Orðabókar Háskólans
IS monolingual website IS "citations from printed books and journals, and a number of
manuscripts, from 1540 onwards"
https://clarin.is/en/resources/archive/ - -
98 Islex - Icelandic-Scandinavian
multilingual dictionary
ISLEX IS multilingual lexical conceptual resource DA FO FI IS NB NN SV - https://repository.clarin.is/repository/xmlui/handle/20.500.12537/10 - -
99 Database of Icelandic Morphology Beygingarlýsing
íslensks nútímamáls
IS monolingual language description IS - https://bin.arnastofnun.is/DMII/LTdata/
https://repository.clarin.is/repository/xmlui/handle/20.500.12537/5
Creative Commons - Attribution 4.0 International (CC BY 4.0) -
100 The Icelandic Term Bank Íðorðabankinn IS multilingual terminological resource multilingual - https://clarin.is/en/resources/termbank/ CC-BY-SA licence -
101 Plaintext Wikipedia dump 2018 - - multilingual (see list of languages on
corpus webpage)
corpus multilingual texts from Wikipedia https://lindat.cz/repository/xmlui/handle/11234/1-2735 Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) data source: Wikipedia plain text data obtained from Wikipedia dumps
102 Deltacorpus 1.1 - - multilingual (see list of languages on
corpus webpage)
corpus multilingual - https://lindat.cz/repository/xmlui/handle/11234/1-1743 Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) data source: W2C corpus
103 W2C – Web to Corpus – Corpora - - multilingual (see list of languages on
corpus webpage)
corpus multilingual collected from Wikipedia and the web https://lindat.cz/repository/xmlui/handle/11858/00-097C-0000-0022-6133-9#

https://vlo.clarin.eu/record/https_58__47__47_hdl.handle.net_47_11858_47_00-097C-0000-0022-6133-9_64_format_61_cmdi?1&q=multilingual&fqType=languageCode:or&fq=languageCode:code:isl&index=11&count=17
Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) -
104 Concreteness and
imageability lexicon
MEGA.HR-Crossling
- - multilingual (see list of languages on
corpus webpage)
lexical conceptual resource multilingual - https://www.clarin.si/repository/xmlui/handle/11356/1187# Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) -
105 Linguistically annotated multilingual comparable
corpora of parliamentary debates ParlaMint.ana 2.1
- - multilingual (see list of languages on
corpus webpage)
corpus BG HR CS DA NL EN FR HU IS IT
LV LT PL SL ES TR
parliamentary debates mostly
starting in 2015 and extending to mid-2020
https://www.clarin.si/repository/xmlui/handle/11356/1431 Creative Commons - Attribution 4.0 International (CC BY 4.0) -
106 Multilingual comparable corpora
of parliamentary debates ParlaMint 2.1
- - multilingual (see list of languages on
corpus webpage)
corpus BG HR CS DA NL EN FR HU IS IT
LV LT PL SL ES TR
parliamentary debates mostly
starting in 2015 and extending to mid-2020
https://www.clarin.si/repository/xmlui/handle/11356/1432 Creative Commons - Attribution 4.0 International (CC BY 4.0) -
107 Universal Dependencies 2.8.1 - - multilingual (see list of languages on
corpus webpage)
corpus multilingual - https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3687# Licence Universal Dependencies v2.8 -
108 COVID-19 ANTIBIOTIC
dataset. Multilingual
(CEF languages)
- - multilingual (see list of languages on
corpus webpage)
corpus LV PL NL FI LT HR MT NO NB SL
SK EN SV IS RO PT HU IT BG ES
FR DA DE ET CS GA EL
Health (Eurovoc 2841), Social Questions https://portulanclarin.net/repository/browse/1c5ff916146911ebb6ec02420a0004094d31b044c80f4109bb228ff6f55a68e8/
CC - BY data source: acquired from the website https://antibiotic.ecdc.europa.eu/