
Google Summer of Code 2018 Ideas List

Émilie Pagé-Perron edited this page Feb 1, 2018 · 56 revisions

On this page you will find project ideas for applications to the Google Summer of Code 2018. We encourage creativity: applicants are welcome to propose their own ideas within the CDLI realm.

About the Cuneiform Digital Library Initiative

The Cuneiform Digital Library Initiative (CDLI) is driven by the mission to enable collection, preservation and accessibility of information concerning all ancient Near Eastern artifacts inscribed with cuneiform through images, textual information, and metadata. With over 334,000 artifacts in our catalogue, we house information about approximately two-thirds of all sources from cuneiform collections around the world as part of this mission. Our data is publicly available at https://cdli.ucla.edu and our audience primarily comprises scholars, students, museum staff, and informal learners.

Through its long history, CDLI has become integral to the fabric of the Assyriological discipline itself. It is used as a tertiary source, a data hub and a research data repository. Based on Google Analytics reports, the CDLI website is visited on average by 3,000 monthly users through 10,000 sessions and 100,000 pageviews. 78% of these users are recurring visitors. The majority of users access CDLI collections and associated tools seeking information about a specific text or group of texts; CDLI holds an authoritative record of where the physical document is currently located, when and where it was originally created and deposited in ancient times, what is inscribed on the artifact and where it has been published. Search results display artifact images and associated linguistic annotations when available.

At CDLI, we are a group of developers, language scientists, machine learning engineers and cuneiform specialists who develop software infrastructure to process and analyze curated data. To this end, we are actively developing two projects: Framework Update, and Machine Translation and Automated Analysis of Cuneiform Languages (MTAAC). As part of these endeavors we are building a natural language processing platform that empowers specialists of ancient languages to undertake the translation of Sumerian texts, thus enabling data-driven study of the languages, culture, history, economy and politics of ancient Near Eastern civilizations. In this platform we focus on data normalization using Linked Open Data to foster best practices in data exchange, standardization and integration with other projects in digital humanities and computational philology.

To follow our research, see an overview of the tools we use to track our work.

The CDLI data comprises catalogue and text data that can be downloaded from our GitHub data repository, images which can be harvested or obtained on demand (including higher resolution images, for research purposes only) and textual annotations which are currently being prepared by the MTAAC research team.

Potential project ideas

API for data retrieval on CDLI (easy)

Since its inception, CDLI has endeavored to be a community-driven initiative with contributors all around the world. Over time our database has grown and has been used extensively in other projects in the discipline. Currently, sharing relevant data with such projects requires manual intervention. This project would break new ground for data sharing in the field of Assyriology by making CDLI data accessible as a self-service to academic research, linguistic research and digital humanities projects. We plan to offer the data to clients in multiple formats (including XML, RDF and JSON) through well-documented APIs.
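To illustrate what serving one record in several formats could look like, here is a minimal Python sketch using only the standard library. The record, its field names and the `render` helper are hypothetical and made up for illustration; they are not the CDLI schema or an existing API. RDF output is left out because it would normally be produced with a dedicated RDF library.

```python
import json
import xml.etree.ElementTree as ET

# Made-up catalogue record; the field names are illustrative assumptions,
# not the actual CDLI catalogue schema.
RECORD = {
    "id": "P999999",
    "period": "Example period",
    "provenience": "Example site",
    "language": "Example language",
}


def to_json(record):
    """Serialize a record as a JSON object with stable key order."""
    return json.dumps(record, ensure_ascii=False, sort_keys=True)


def to_xml(record):
    """Serialize a record as one <artifact> element with a child per field."""
    root = ET.Element("artifact", attrib={"id": record["id"]})
    for key, value in sorted(record.items()):
        if key == "id":
            continue  # already stored as an attribute of the root element
        child = ET.SubElement(root, key)
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")


def render(record, fmt):
    """Dispatch to the serializer for the requested format."""
    serializers = {"json": to_json, "xml": to_xml}
    if fmt not in serializers:
        raise ValueError(f"unsupported format: {fmt}")
    return serializers[fmt](record)


if __name__ == "__main__":
    print(render(RECORD, "json"))
    print(render(RECORD, "xml"))
```

Keeping one serializer per format behind a single dispatch function means a new format can be added without touching the retrieval logic.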

This project will be a stepping stone towards Linked Open Data, as all our data would be linkable using the RDF format. Although Linked Open Data has been applied in the humanities and language sciences, its use in the field of Assyriology is almost absent. The same is true of Linguistic Linked Open Data. This would benefit and encourage initiatives like the Modref project (which provides linked data from the CDLI along with two other digital libraries) and the British Museum Research Space service (which includes cuneiform objects). These services offer a SPARQL endpoint to query their catalogue metadata, which in turn is formalized using the CIDOC-CRM ontology (an ontology designed to handle the classification and description of material culture artifacts). Linking CDLI with these services will permit the user to query artifacts of a diverse nature across multiple collections.

Outcomes:

Minimal viable product:

  • Understand the catalog and vocabulary data to be made available through the API.
  • Restructure the databases if needed.
  • Design and implement the API and retrieval services.
  • Test throughput and latency requirements.
  • Document the APIs on GitHub.

If time permits:

Skills required/preferred:

  • Solid understanding of data structures.
  • Familiarity with MVC design and PHP.
  • Desire and eagerness to learn service-oriented architecture.
  • Interest in Linked Open Data.

Possible mentors:

  • Émilie Pagé-Perron
  • Saurabh Trikande

Integrating CDLI corpora to CLTK/NLTK (easy)

This project aims to develop new tools and resources for research, with emphasis on reusability and the use of recognized standards. Currently there is no off-the-shelf tool for basic natural language processing tasks on cuneiform transliterations, such as calculating average text length, line length, or token (word, sign) frequencies. These are the most basic operations one needs in order to start evaluating a corpus for further processing and automated analysis.
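The basic operations listed above can be sketched in a few lines of plain Python. The sample lines are made up in the style of transliteration, and the tokenization rules (whitespace separates words, hyphens separate signs) are simplifying assumptions rather than a full account of the transliteration conventions; the function names are hypothetical and not part of NLTK or CLTK.

```python
from collections import Counter

# Made-up sample: text identifier -> list of transliteration lines.
CORPUS = {
    "text-a": ["1(disz) udu", "ki ab-ba-ta"],
    "text-b": ["2(disz) udu", "ki ab-ba-ta", "ba-zi"],
}


def words(line):
    """Split a line into words (assumption: words are whitespace-separated)."""
    return line.split()


def signs(word):
    """Split a word into signs (assumption: signs are joined by hyphens)."""
    return [sign for sign in word.split("-") if sign]


def corpus_stats(corpus):
    """Compute average lengths and word/sign frequencies for a corpus."""
    all_lines = [line for lines in corpus.values() for line in lines]
    if not all_lines:
        raise ValueError("corpus contains no lines")
    line_counts = [len(lines) for lines in corpus.values()]
    word_freq = Counter(w for line in all_lines for w in words(line))
    sign_freq = Counter(
        s for line in all_lines for w in words(line) for s in signs(w)
    )
    return {
        "avg_lines_per_text": sum(line_counts) / len(line_counts),
        "avg_words_per_line": sum(len(words(l)) for l in all_lines) / len(all_lines),
        "word_freq": word_freq,
        "sign_freq": sign_freq,
    }


if __name__ == "__main__":
    stats = corpus_stats(CORPUS)
    print(stats["avg_lines_per_text"], stats["avg_words_per_line"])
    print(stats["word_freq"].most_common(3))
    print(stats["sign_freq"].most_common(3))
```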

One of the challenges at CDLI is that the cuneiform corpus is always evolving: we know less about the Sumerian and Akkadian languages than about Hindi or Ancient Greek, so each new piece of research brings improvements and adds value to the existing corpus. Currently the Classical Language Toolkit (CLTK) deals with fixed versions of corpora. As part of this project, CLTK needs to be extended to address this particularity of our corpus.

This work can be done for a specific language (Sumerian or Akkadian) OR all languages OR for a specific temporal corpus. The design should be modular to enable expansion to the whole corpus in the realm of cuneiform texts.

Getting started:

Tasks:

  • Choose which methods should be implemented (with justification)
  • Design and develop a system for corpus versioning and its integration into the CLTK.
  • Implement and test chosen objectives.
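One possible starting point for the corpus versioning task, sketched under assumptions: each snapshot of the corpus gets an identifier derived from its content, so that analyses can record exactly which state of an evolving corpus they ran on. The function names are hypothetical and this is not an existing CLTK mechanism.

```python
import hashlib


def corpus_version(texts):
    """Return a short, deterministic identifier for a corpus snapshot.

    `texts` maps a text identifier to its transliteration as one string.
    Identifiers are processed in sorted order so that the result does not
    depend on insertion order.
    """
    digest = hashlib.sha256()
    for text_id in sorted(texts):
        digest.update(text_id.encode("utf-8"))
        digest.update(b"\0")  # separator so adjacent fields cannot run together
        digest.update(texts[text_id].encode("utf-8"))
        digest.update(b"\0")
    return digest.hexdigest()[:12]


def changed_texts(old, new):
    """List identifiers that were added, removed or edited between snapshots."""
    ids = set(old) | set(new)
    return sorted(i for i in ids if old.get(i) != new.get(i))


if __name__ == "__main__":
    v1 = {"text-a": "1(disz) udu", "text-b": "ba-zi"}
    v2 = {"text-a": "1(disz) udu", "text-b": "ba-zi mu-kux(DU)", "text-c": "ki ab-ba-ta"}
    print(corpus_version(v1), corpus_version(v2))
    print(changed_texts(v1, v2))
```

A content-derived identifier avoids maintaining version numbers by hand, at the cost of carrying no ordering information; a real design would likely pair it with dates or repository commits.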

Outcomes:

The deliverable of this project is a chosen corpus enabled in CLTK with NLP functionalities. The modular design would allow future expansion to the rest of the corpus. This will give the entire cuneiform research community access to essential tools for linguistic analysis.

Skills required/preferred:

  • Familiarity with Python
  • Familiarity with the NLTK and the CLTK
  • Interest in Natural Language Processing
  • Interest in Sumerian and/or Akkadian

Possible mentors:

  • Émilie Pagé-Perron
  • Saurabh Trikande

Computer vision challenge for the cuneiform script (hard)

Unlike other digital libraries for extinct languages that combine text and artifact images, the current system used by CDLI requires the user to absorb visual and textual information simultaneously to interpret the mapping between them. Experts in cuneiform studies are usually able to discern this mapping only for their areas of expertise; non-experts and informal learners, on the other hand, have no direct means of associating image and annotation content. This poses a core challenge for the CDLI project: to make a fundamental contribution to the question of cuneiform paleography and, more broadly, to define new approaches to automatically hyperlinking existing text annotations with the corresponding regions of the image. With advances in image processing methodologies, this text-image linking problem can now be addressed with reasonable performance. This would involve building machine learning models trained on a large training set to learn the underlying structure of tablet images and so perform image segmentation.

Image processing would not only eliminate manual segmentation labor but also give the system robust tagging mechanisms for further additions to the library. Previous research in this domain has focused on accurately detecting and localizing boundaries in natural scenes using local image measurements of the brightness, color and texture associated with natural boundaries. With ancient cuneiform artifacts, however, the problem involves learning from three-dimensional and, in the majority of cases, damaged tablets, which increases the noise in the training data.

Outcomes:

The goal of this proof-of-concept (POC) research project is to develop machine learning models which ingest cuneiform text and images and generate a number of image segments equal to the number of lines of transliteration. Appropriate segment indexing should then enable us to map the text to the segments. This would require the student to formulate, test and evaluate strategies for line-by-line and section-by-section encoding of cuneiform artifact image coordinates.
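To make the expected output concrete (one image segment per transliteration line, indexed by coordinates), here is a deliberately simple baseline in plain Python. It uses a classical horizontal projection profile on a binarized image, not a learned model, and it would likely fail on curved or damaged tablets; it is meant only as a sketch of the data shapes involved. All function names are hypothetical.

```python
def row_profile(image):
    """Amount of 'ink' per pixel row; `image` is a list of rows of 0/1 values."""
    return [sum(row) for row in image]


def segment_lines(image, threshold=0):
    """Return (start_row, end_row) pairs, end exclusive, one per text line.

    A segment is a maximal run of consecutive rows whose ink count is above
    `threshold`; rows at or below it are treated as gaps between lines.
    """
    segments = []
    start = None
    for y, ink in enumerate(row_profile(image)):
        if ink > threshold and start is None:
            start = y  # a new line of writing begins
        elif ink <= threshold and start is not None:
            segments.append((start, y))  # the line ended on the previous row
            start = None
    if start is not None:
        segments.append((start, len(image)))  # line runs to the bottom edge
    return segments


def align(segments, transliteration_lines):
    """Pair each transliteration line with its image segment, in order."""
    if len(segments) != len(transliteration_lines):
        raise ValueError("segment count does not match transliteration line count")
    return list(zip(transliteration_lines, segments))


if __name__ == "__main__":
    image = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 0],
    ]
    print(align(segment_lines(image), ["line 1", "line 2"]))
```

The count check in `align` mirrors the stated goal that the number of segments should equal the number of transliteration lines; a mismatch signals that the segmentation needs review.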

Skills required/preferred:

  • Interest in Computer Vision and Machine Learning (some prior background is preferred).
  • Proficiency in Python.
  • Research experience is a plus.
  • Passion to thrive in ambiguity.
  • Openness to ideas and experimentation instincts.

Possible mentors:

  • Saurabh Trikande
  • Jayanth Jaiswal

Multiple layer annotations querying (hard)

There is currently no accessible tool that can be seamlessly integrated into a website for querying multiple layers of linguistic annotations (morphology, syntax and semantics). The best standalone tool we have found is ANNIS, a complete and robust corpus analysis tool. ANNIS is an excellent example of the desired functionality; however, it has some limitations when it comes to providing an accessible interface with a seamless experience. In addition, we want to query our data in RDF format for flexibility and further integration into the linked open data toolbox for computational linguistics.

Getting started:

Tasks:

  • Set up a SPARQL endpoint
  • Define SPARQL query chunks to provide search for all layers of annotations
  • Assemble and test the search system
  • Prepare basic textual results display for humans
  • Prepare basic RDF output (XML, Turtle, JSON) for machines
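The idea of query chunks in the tasks above could be sketched as follows: each annotation layer contributes one triple pattern, and a search combines the patterns for the layers the user filled in. The prefix, the property names (`ex:form`, `ex:morph`, `ex:deprel`) and the function names are hypothetical placeholders; the real vocabulary would come from the project's RDF data model.

```python
# Hypothetical namespace and properties, used for illustration only.
PREFIXES = "PREFIX ex: <http://example.org/annotation#>"

# One triple-pattern template ("chunk") per annotation layer.
LAYER_CHUNKS = {
    "form": '?word ex:form "{value}" .',
    "morphology": '?word ex:morph "{value}" .',
    "syntax": '?word ex:deprel "{value}" .',
}


def escape_literal(value):
    """Escape backslashes, double quotes and newlines for a SPARQL string literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def build_query(constraints, limit=50):
    """Combine the chunks for the requested layers into one SELECT query."""
    if not constraints:
        raise ValueError("at least one constraint is required")
    patterns = []
    for layer, value in constraints.items():
        if layer not in LAYER_CHUNKS:
            raise KeyError(f"unknown annotation layer: {layer}")
        patterns.append(LAYER_CHUNKS[layer].format(value=escape_literal(value)))
    body = "\n  ".join(patterns)
    return f"{PREFIXES}\nSELECT ?word WHERE {{\n  {body}\n}} LIMIT {int(limit)}"


if __name__ == "__main__":
    print(build_query({"form": "lugal", "syntax": "nsubj"}))
```

Because every chunk constrains the same `?word` variable, adding a layer narrows the result set; chunks that relate several words (for example a head and its dependent) would need additional variables.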

Outcomes:

A feature to search through combined linguistic annotations, with basic integration into the current results display.

Skills required/preferred:

  • Familiarity with natural language annotation
  • Interest in Linked Open Data
  • Familiarity with SPARQL

Possible mentors:

  • Émilie Pagé-Perron
  • Christian Chiarcos

Textual annotations viz (hard)

As part of the MTAAC project, we are producing rich textual annotations, both manually and automatically. Soon, 68,000 texts will be enriched with morphological, syntactic and semantic annotations. These annotations are stored in the CDLI-CoNLL format and can be exploited in RDF format to be manipulated as a graph. GraphViz enables the visualisation of graph data by generating SVG images. As an intermediary, the DOT language is ideal for representing the visual aspects of a graph. This technology must be integrated into the CDLI platform to enable users to visualize linguistic information about the texts at hand for research and teaching purposes.
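As a rough illustration of the DOT intermediary, the following Python sketch turns a dependency-style annotation into a DOT graph. The rows are a simplified, hypothetical stand-in (token id, form, head id, relation) and do not reproduce the actual CDLI-CoNLL columns; the sample values and function names are made up for illustration.

```python
# Simplified, hypothetical annotation rows: (token id, form, head id, relation).
# A head id of 0 marks the root of the sentence.
ROWS = [
    (1, "lugal", 2, "nsubj"),
    (2, "mu-du3", 0, "root"),
]


def quote(text):
    """Wrap text in double quotes, escaping backslashes and quotes for DOT."""
    return '"' + str(text).replace("\\", "\\\\").replace('"', '\\"') + '"'


def rows_to_dot(rows, name="sentence"):
    """Build a DOT digraph: one node per token, one edge per head relation."""
    lines = [f"digraph {name} {{"]
    for token_id, form, _head, _rel in rows:
        lines.append(f"  n{token_id} [label={quote(form)}];")
    for token_id, _form, head, rel in rows:
        if head != 0:  # the root has no incoming edge
            lines.append(f"  n{head} -> n{token_id} [label={quote(rel)}];")
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(rows_to_dot(ROWS))
```

The resulting text can be saved to a file and rendered with the Graphviz command line, for example `dot -Tsvg sentence.dot -o sentence.svg`.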

Getting started:

Outcomes:

The resulting deliverable is a novel visualization of linguistic annotations (specifically the syntax and semantics attached to a text) in a clear and helpful SVG representation.

Skills required/preferred:

  • Familiarity with linked open data
  • Knowledge of PHP

Possible mentors:

  • Émilie Pagé-Perron
  • Christian Chiarcos

Granular temporal data management (easy)

CDLI is currently increasing the complexity of its data model, structuring the data so that relationships can be leveraged. One salient classification aspect of cuneiform sources is dating information. Historical periods can be subdivided using the rulers that reigned and the corresponding dates provided on the texts. Depending on the period, texts can bear a year name, a month name and a day. Currently the date is encoded in a text field of the CDLI catalogue as follows: RN.Y.M.D (royal name, year, month, day). The royal name is spelled in full with conventional English designations, with “--” for lost information and “00” when information was not given by the scribe. Month intercalations were designated by scribes with “min,” “the second,” or “diri,” “extra.” A question mark following a space after the full date notation records doubts about any one, or all, of the preceding RN.Y.M.D slots. We are considering expanding date information to include dynasty/era.
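As a first step towards a more granular model, the RN.Y.M.D notation described above could be split into structured fields along these lines. This is a minimal sketch under assumptions: the class and function names are hypothetical, the example strings are illustrative rather than taken from the catalogue, slot values are kept as raw strings, and intercalary-month markers are left untouched inside the month slot because their exact encoding is not specified here.

```python
from dataclasses import dataclass
from typing import Optional

LOST = "--"       # information lost
NOT_GIVEN = "00"  # information not given by the scribe
SLOTS = ("ruler", "year", "month", "day")


@dataclass
class CuneiformDate:
    ruler: Optional[str]
    year: Optional[str]
    month: Optional[str]
    day: Optional[str]
    lost: tuple      # names of the slots recorded as lost
    uncertain: bool  # True when the notation ends with " ?"


def parse_date(raw):
    """Parse an RN.Y.M.D string into a CuneiformDate.

    A slot is None when it is lost or not given; `lost` tells the two apart.
    """
    text = raw.strip()
    uncertain = text.endswith(" ?")
    if uncertain:
        text = text[:-2].rstrip()
    # Split from the right so that only the last three dots act as separators.
    parts = text.rsplit(".", 3)
    if len(parts) != 4:
        raise ValueError(f"expected RN.Y.M.D, got: {raw!r}")
    values = {}
    lost = []
    for name, part in zip(SLOTS, parts):
        part = part.strip()
        if part == LOST:
            lost.append(name)
            values[name] = None
        elif part == NOT_GIVEN:
            values[name] = None
        else:
            values[name] = part
    return CuneiformDate(lost=tuple(lost), uncertain=uncertain, **values)


if __name__ == "__main__":
    print(parse_date("Amar-Suen.05.07.00 ?"))
    print(parse_date("--.--.03.12"))
```

Keeping "lost" and "not given" distinct matters for search: a text with a broken day slot might still belong to a given day, whereas a text whose scribe wrote no day does not.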

The candidate's design should accommodate the requirements of an annotation pipeline currently being developed. Annotations providing information about the date should be compatible with the processing of preexisting dating information that was extracted manually and is available in the catalogue data.

Getting started:

Tasks:

  • Convert existing data to a new data model
  • Extend search engine to handle dating information
  • Prepare views to navigate useful dating information

Outcomes:

  • Search and display capabilities integrating granular temporal data

Skills required/preferred:

  • Familiarity with relational or structured data, and PHP
  • Familiarity with HTML and CSS
  • Familiarity with the Sumerian and Akkadian languages

Possible mentors:

  • Émilie Pagé-Perron

Temporal and geographic viz (easy)

CDLI has rich geographical and temporal data at its disposal. At this time this information is not fully utilized. Although we are working on our data schema and model, there are significant challenges in exploiting new relationships.

This data should be presented to users in an interactive manner, giving them a new way to browse and discover information. The temporal and geographical data can be coupled with other information such as text genre, language and word frequency comparisons, and displayed through a novel visualization technique.
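Most visualization plugins expect pre-aggregated JSON, so one small building block could be a grouping step like the one sketched below. The records and field names are made up for illustration and are not the CDLI schema; `count_by` is a hypothetical helper.

```python
import json
from collections import Counter

# Made-up records; the field names are illustrative assumptions.
RECORDS = [
    {"provenience": "Site A", "period": "Period 1", "genre": "Administrative"},
    {"provenience": "Site A", "period": "Period 1", "genre": "Legal"},
    {"provenience": "Site A", "period": "Period 2", "genre": "Administrative"},
    {"provenience": "Site B", "period": "Period 1", "genre": "Administrative"},
]


def count_by(records, *fields):
    """Count records per combination of the given fields.

    Returns a list of dicts (one per combination, sorted by field values),
    each holding the field values plus a "count" key.
    """
    counts = Counter(
        tuple(record.get(field, "unknown") for field in fields)
        for record in records
    )
    return [dict(zip(fields, key), count=n) for key, n in sorted(counts.items())]


if __name__ == "__main__":
    # JSON payload that a JavaScript map or timeline component could consume.
    print(json.dumps(count_by(RECORDS, "provenience", "period"), indent=2))
```

Passing different field combinations (for example period and genre) yields the coupled views mentioned above without changing the aggregation code.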

Getting started:

Tasks:

  • Identify potential use cases
  • Choose the most accessible visualization plugins for each chosen display
  • Integrate the chosen technology with data outputs
  • Fine tune the displays, interlinking data further and increasing interactivity

Outcomes:

  • One or more displays usable to discover and browse data in new and interactive ways

Skills required/preferred:

  • Familiarity with JS
  • Familiarity with JSON structured data
  • Familiarity with HTML and CSS
  • Familiarity with accessibility principles

Possible mentors:

  • Émilie Pagé-Perron

Your own project (Bring 'em on!)

We are interested in expanding our technological reach in the processing, analysis and distribution (including visualization and accessibility) of our catalogue and text-derived data. If you have an idea which could be reused either to reproduce your research or to enhance further developments in the disciplines of Assyriology, Computational Linguistics or Computer Science, reach out to us and we can work on preparing a project suitable for GSoC.

cdli@ucla.edu