% !TEX root = thesis.tex
\section{Introduction}
\label{sec:intro}
At least half of the world's 7,000-odd languages will be extinct this century \citep{krauss92, grenoble2011cambridge}. Just over half of these languages have writing systems.\footnote{\href{https://www.ethnologue.com/enterprise-faq/how-many-languages-world-are-unwritten-0}{https://www.ethnologue.com/enterprise-faq/how-many-languages-world-are-unwritten-0}. \last{May 1} Note that all Ethnologue links to language data eventually hit a paywall which inhibits access. Using a private browser session can normally circumvent this paywall, although I am explicitly not recommending such a workaround here.} It is estimated that fewer than 5\% of the world's languages are used online or have a significant digital presence \citep{kornai2013digital}: that is, they appear on the web or have substantial digital recordings.
Most of the world's computational technology has been built in and around English: it has English manuals and English interfaces, and it was largely written by English speakers. English is also the most prevalent language among users of this technology. Only a few languages, around thirty, combine large populations with internet access, official governmental status, and industrial economies, a combination which affords them some native computational technology, in particular on the World Wide Web, the largest global network for sharing code and written material.
English is the undisputed heavyweight as far as global written resources are concerned.\footnote{\href{https://w3techs.com/technologies/history_overview/content_language}{https://w3techs.com/technologies/history\_overview/content\_language}. \last{May 1}} Over half of the web's content is written in English. The next largest languages are Russian, German, Spanish, Japanese, and French, with a combined population of well over a billion speakers. Portuguese, Italian, and Chinese have the next largest shares, each covering between 2\% and 3\% of the web's content, followed by Polish, Turkish, Dutch, and Korean at just over 1\% each. Suffice it to say, the distribution of global written content is not skewed towards linguistic diversity. Nor is this surprising: around 90\% of the world's languages are spoken by fewer than 10\% of its people \citep{bernard1992preserving}.
In part, these high resource languages depend upon extant corpora to enable further development of human language technology (HLT); it is difficult for languages which are newcomers to the digital world to get started precisely because they lack such corpora.
Put simply: a writing system affords corpora, and researchers can use corpora either to build tools for that language or to adapt tools from other languages. These tools might be spell-checkers, parsers, and input systems, or, later on, speech recognition and generation software, semantic analysers, or machine learning and translation systems, among others. But such tools only become useful once corpora exist for them; otherwise, the work is premature. Further, high resource languages can sometimes bootstrap their efforts by reusing code from related languages: a parser for French might be adapted for Spanish, given a large corpus to train on, whereas adapting code with scarce data is far more difficult.
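The claim that a corpus affords tools can be made concrete with a minimal sketch: even a raw, whitespace-tokenisable text corpus yields a frequency lexicon, and that lexicon immediately supports a naive spell-checker that flags unseen word forms. The function names and toy corpus below are illustrative assumptions, not drawn from any real LRL toolkit.

```python
# Sketch: a corpus-derived lexicon affords a naive spell-checker.
from collections import Counter

def build_lexicon(corpus_text):
    """Count word forms in a whitespace-tokenised corpus."""
    return Counter(corpus_text.lower().split())

def flag_unknown(words, lexicon):
    """Return words absent from the corpus-derived lexicon."""
    return [w for w in words if w.lower() not in lexicon]

# Toy corpus standing in for digitised text in the target language.
corpus = "the cat sat on the mat the cat ran"
lexicon = build_lexicon(corpus)
print(flag_unknown(["cat", "dog", "mat"], lexicon))  # ['dog']
```

The quality of such a tool scales directly with corpus size, which is the point: without the corpus, there is nothing to count, and the tool cannot exist at all.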
Most of the world's code is probably developed in closed environments, by militaries or large businesses, with consumers only at the endpoints. For instance, the World Wide Web (from here on, the web), the largest shared corpus of written language, started with support from the Massachusetts Institute of Technology (MIT) and the Defense Advanced Research Projects Agency (DARPA); this helps to explain why most of the web is written in English. Another example is Google Translate, which uses massive bilingual corpora to provide automatic translation free of charge online, but whose code is proprietary and owned by Google.
While this enterprise pathway for language resource development works well for large languages, where populations of speakers can be leveraged to provide funding, the majority of the world's languages are not able to develop their own computational resources, whether grammars, corpora, or code. Instead, they must rely on small groups of researchers, limited funding, and a grab-bag of written resources where these exist at all. For instance, the most consistent cross-linguistic translations are of the Christian Bible, which may not reflect the target language's culture.
In this thesis, I examine methodology that linguists, researchers, and language developers can use to help their languages ``digitally ascend'' (as \citet{kornai2013digital} puts it): to bootstrap corpus creation, write grammars, adapt other languages' tools and research to their own languages, and ultimately enable their communities to speak and share their knowledge computationally. This methodology goes under the broad label of \textit{open source} software. Open source software is code which has been developed and made available for free under a permissive license, without restrictions on how it is used or who uses it. This allows coders to use code which they did not personally build without allocating funds for it, freeing up significant portions of the research and development costs of making tools. At present, the majority of the world's code depends to some degree on open source software: Linux and much of the web, for instance, are built on open source code.
In the field of computational linguistics, however, there is a deficit of resources licensed and available as open source. This stems largely from the need to recoup development expenses, from licenses mandated by research groups or military funders, and from developers' lack of awareness of how open source code works. Another consideration is that an open source label does not ensure that the code is worth using, maintained, relevant, or in scope for a given domain.
Below, I go into further depth about the state of low resource languages (LRLs) and computational resources in Section~\ref{sec:endlang}, and about what different languages need in order to have a digital presence in Section~\ref{sec:resources}. In Section~\ref{sec:open-source}, I define what open source is and discuss issues relevant to open source code for LRLs. In Section~\ref{sec:lrl-code}, I then turn to the state of the open source ecosystem for LRLs online, focusing in particular on a database of open source code that I have built with the help of researchers around the world.
I touch on specific examples of languages which could benefit from open source code in Section~\ref{sec:case-studies}, focusing on Gaelic, a language with tens of thousands of speakers but few online resources, and Naskapi, a language with only a thousand speakers. The Naskapi case study is informed by original research: I engaged in field research in the town where most Naskapi speakers live and talked to linguists working on literacy efforts for the language. In Section~\ref{sec:methods}, I discuss how open source can help low resource languages, and in Section~\ref{sec:discussion} I expound further, at a high level, on what open source enables for linguists and language communities. Finally, in Section~\ref{sec:future-work} and Section~\ref{sec:conclusion} I discuss future work and offer some concluding remarks.
%% TODO Talk to profs about the lack of ethics clearance for these conversations (mostly casual)
% Listing accomplishments of this thesis
This thesis is, to my knowledge, the only work that looks specifically at what open source resources exist for low resource languages. I provide a quantitative assessment of the state of the field, and suggest a new type of crowd-sourced, curated, and decentralised database for language resource aggregation. I also present three in-depth case studies: one examining what a specific problem in computational linguistics looks like from an open source perspective (the example problem involves using geographical information systems with language co\"{o}rdinates), and two surveying the state of open source resources for entire languages, Gaelic and Naskapi. The Naskapi chapter also serves as a follow-up to \citepos{jancewicz2002applied} paper, examining how the Naskapi community has changed technologically in the past fifteen years. Finally, I suggest a novel way of storing language resources in an open source fashion using the decentralised web.