See the paper and the corresponding original repo.
This README will continue to be updated!
This example creates a "conceptual" / miniature replication of the above paper using the tools provided by the ULTK library. Right now, the final analysis produces the following plot:
This README first explains the contents of this example directory, focusing on what the user has to provide that's specific to the indefinites case study, before then explaining the concrete steps taken to produce the above plot. After that, there is some discussion of what's missing from the above paper and other next steps for this example.
## Contents

- `data`: various pieces of data to be used in the analysis
  - `raw/languages_real_40_updated.csv`: this contains Haspelmath's 40-language sample of indefinite systems, as compiled by Bill Croft. Each row corresponds to an expression, with columns for (i) ontological category (e.g. person) and (ii) the nine functions from Haspelmath 1997. Annotations for `PERSON`, `DETER`, and `neg.frag` were added by Milica Denic (see appendices in the above repo). Expressions/items are named with a number for the language and a letter indicating the expression. For example, `03s` corresponds to "some-": `03` is for English and `s` means the "some"-series.
  - `Beekhuizen_prior.csv`: the prior distribution over semantic flavors, estimated from the annotated corpus of Beekhuizen et al 2017
  - `natural_language_indefinites.csv`: a cleaned version of the raw data above (see item 1 in the Usage section below)
- `scripts`: a set of scripts for executing various components of the efficient communication analysis. These are explained in more detail in the Usage section below.
- `outputs`: outputs from the analysis scripts
- `referents.csv`: this file defines the set of points of communication (which will become `Referent`s in ULTK terms). This is a very simple example, with six discrete points which have only one property (a name). In general, ULTK expects each row to correspond to a unique `Referent` and each column to a feature of that point, with no constraint on the number or type of features. A `name` column is expected. See `ultk.language.semantics.Universe.from_dataframe` for more details.
- `meaning.py`: this file defines the meaning space (a `Universe` in ULTK terms) of the six flavors defined in `referents.csv`, together with their prior
- `grammar.yml`: defines the Language of Thought grammar (an ULTK `Grammar` is created from this file in one line in `grammar.py`) for this domain, using basic propositional logic and the five semantic features identified in Haspelmath 1997.
- `measures.py`: the measures of simplicity and informativity. These are basic wrappers around tools from ULTK, linking them to the indefinites grammar and universe.
- `util.py`: utilities for reading and writing ULTK `Expression`s and `Language`s, as well as the natural language data.
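To make the division of labor above concrete, here is a minimal sketch of how referents and a propositional LoT interact, in plain Python rather than ULTK's classes. The flavor names, feature names, and feature assignments below are illustrative placeholders, not the contents of this example's actual `referents.csv` or `grammar.yml`:

```python
import csv
import io

# Hypothetical referents.csv contents: one row per Referent, with the
# required `name` column. (The real six flavors live in referents.csv.)
REFERENTS_CSV = """name
specific-known
specific-unknown
nonspecific
"""
universe = [row["name"] for row in csv.DictReader(io.StringIO(REFERENTS_CSV))]

# Hypothetical boolean features for each referent; the actual grammar uses
# the five semantic features from Haspelmath 1997, defined in grammar.yml.
FEATURES = {
    "specific-known":   {"K": True,  "SE": True},
    "specific-unknown": {"K": False, "SE": True},
    "nonspecific":      {"K": False, "SE": False},
}

def evaluate(expr, referent):
    """Evaluate a nested-tuple LoT expression at a single referent."""
    op = expr[0]
    if op == "feature":
        return FEATURES[referent][expr[1]]
    if op == "not":
        return not evaluate(expr[1], referent)
    if op == "and":
        return evaluate(expr[1], referent) and evaluate(expr[2], referent)
    if op == "or":
        return evaluate(expr[1], referent) or evaluate(expr[2], referent)
    raise ValueError(f"unknown operator: {op}")

def meaning(expr):
    """A meaning is the set of referents an expression is true of."""
    return {r for r in universe if evaluate(expr, r)}

# "specific and not known" picks out exactly one flavor:
expr = ("and", ("feature", "SE"), ("not", ("feature", "K")))
# meaning(expr) == {"specific-unknown"}
```

In the real example, `ultk.language.semantics.Universe.from_dataframe` plays the role of the CSV-parsing line here, and the grammar is specified declaratively in `grammar.yml` rather than hard-coded.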
## Usage

From the `src/examples` directory:
1. `python -m indefinites.scripts.convert_haspelmath`: consumes `data/raw/languages_real_40_updated.csv` and generates `data/natural_language_indefinites.csv`. It primarily transforms Haspelmath's functions into our six semantic flavors, as outlined in the paper linked above (along with a few other cleaning steps).
2. `python -m indefinites.scripts.generate_expressions`: generates `outputs/generated_expressions.yml`. This script generates the shortest expression (an ULTK `GrammaticalExpression`) for each possible `Meaning` (set of `Referent`s) in the LoT defined in `grammar.py`. In particular, ULTK provides methods for enumerating all grammatical expressions up to a given depth, with user-provided keys for uniqueness and for comparison in the case of a clash. By setting the former to get the `Meaning` from an expression and the latter to compare expressions by length, the enumeration method returns a mapping from meanings to the shortest expressions which express them.
python -m indefinites.scripts.estimate_pareto
: consumesoutputs/generated_expressions.yml
and generatesoutputs/dominating_languages.yml
andoutputs/explored_languages.yml
This calls
ultk.effcomm.EvolutionaryOptimizer.fit
to use an evolutionary algorithm to esimate the Pareto frontier. For simplicity, it uses the default mutations of simply adding and removing an expression from a language.dominating_languages.yml
contains the Pareto optimal languages, whileexplored_languages.yml
contains all languages generated during the optimization process. -
python -m indefinites.scripts.measure_natural_languages
: consumesdata/natural_language_indefinites.csv
,outputs/generated_expressions.yml
and generatesoutputs/natural_languages.yml
This measures complexity and communicative cost of the 40 natural languages and writes them to
natural_languages.yml
. -
python -m indefinites.scripts.combine_data
: consumesoutputs/*_languages.yml
and generatesoutputs/combined_data.csv
The resulting CSV containing data from all of the languages generated so far. It is a table with columns for language name, type (natural, explored, dominant), complexity, and communicative cost, with each row being a language.
6. `python -m indefinites.scripts.analyze`: consumes `outputs/combined_data.csv` and generates `outputs/plot.png`. For now, this simply reads in the data and generates a complexity-cost trade-off plot, showing where the natural, explored, and dominant languages lie. Soon, it will include more thorough statistical analyses.
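The heart of steps 2–5 (enumerate expressions, deduplicate by meaning, score languages, compare by Pareto dominance) can be sketched end-to-end in plain Python. Everything below is a toy stand-in under invented assumptions — a two-referent universe, a grammar with only `not`/`and`, expression length as complexity, and literal-listener surprisal as cost — not ULTK's actual API:

```python
import math

REFERENTS = {"r1": {"A": True, "B": False}, "r2": {"A": False, "B": True}}
PRIOR = {"r1": 0.5, "r2": 0.5}

def evaluate(expr, ref):
    op = expr[0]
    if op == "atom":
        return REFERENTS[ref][expr[1]]
    if op == "not":
        return not evaluate(expr[1], ref)
    return evaluate(expr[1], ref) and evaluate(expr[2], ref)  # "and"

def length(expr):
    """Number of grammar nodes in an expression (stand-in for complexity)."""
    return 1 + sum(length(e) for e in expr[1:] if isinstance(e, tuple))

def shortest_expressions(depth):
    """Enumerate expressions up to a depth, keyed by meaning (the uniqueness
    key); keep the shorter expression on a clash (the comparison key)."""
    exprs = [("atom", f) for f in "AB"]
    for _ in range(depth):
        exprs = exprs + [("not", e) for e in exprs]
        exprs = exprs + [("and", e1, e2) for e1 in exprs for e2 in exprs]
    best = {}
    for e in exprs:
        m = frozenset(r for r in REFERENTS if evaluate(e, r))
        if m not in best or length(e) < length(best[m]):
            best[m] = e
    return best

def complexity(language, best):
    """A language is a set of meanings; sum the lengths of their shortest forms."""
    return sum(length(best[m]) for m in language)

def cost(language):
    """Expected surprisal of a literal listener who renormalizes the prior
    over the meaning of the most specific applicable word. (Assumes every
    referent is expressible by some word in the language.)"""
    total = 0.0
    for ref, p in PRIOR.items():
        m = min((m for m in language if ref in m), key=len)
        total += p * -math.log2(p / sum(PRIOR[r] for r in m))
    return total

def dominates(a, b, best):
    ca, ka = complexity(a, best), cost(a)
    cb, kb = complexity(b, best), cost(b)
    return ca <= cb and ka <= kb and (ca, ka) != (cb, kb)

best = shortest_expressions(2)
precise = frozenset({frozenset({"r1"}), frozenset({"r2"})})
vague = frozenset({frozenset({"r1", "r2"})})
# In this toy setup, precise is both simpler and cheaper, so it dominates vague.
```

In ULTK, the uniqueness and comparison keys are supplied by the user to the enumeration machinery, the evolutionary optimizer mutates languages rather than scoring them exhaustively, and informativity is computed by proper speaker/listener agents; the toy objectives above only mirror the shape of that pipeline.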
This example is a kind of "minimal" replication of Experiment 1 from Denic et al 2022. Because the primary focus of this reproduction is to demonstrate how to use ULTK and its primitives, there are some differences to keep in mind.
- We have not here implemented the controls for coverage and synonymy that are used in the paper. These could, however, easily be added by reading in `outputs/explored_languages.yml`, applying the optimization algorithm from the paper, and re-writing the subset of languages back out.
- We have used only 3-operator complexity (not 2-operator complexity). Additionally, we have not added the restriction that the `R` features in the grammar can only occur with `SE+`. See the discussion in the comments in `grammar.py` for more information.
- We have not done the analyses with pragmatic speakers and listeners. ULTK, however, makes this very easy to do! See `ultk.effcomm.agent.PragmaticSpeaker` and `ultk.effcomm.agent.PragmaticListener`. In particular, the `ultk.effcomm.informativity.informativity` method that we use takes an `agent_type` argument. It defaults to `"literal"`, but passing `agent_type="pragmatic"` to this method in the `estimate_pareto` and `measure_natural_languages` scripts would measure informativity with pragmatic agents.
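For intuition about what switching to pragmatic agents does, here is a minimal RSA-style sketch in plain Python, independent of ULTK's agent classes. The two-word scalar language and the uniform prior are invented for illustration:

```python
# An invented two-word language over two referents (a classic scalar setup):
# "some" is true of both referents, "all" only of r2.
WORDS = {"some": {"r1", "r2"}, "all": {"r2"}}
REFS = ["r1", "r2"]

def literal_listener(word):
    """Spread a uniform prior over the referents the word is true of."""
    m = WORDS[word]
    return {r: (1 / len(m) if r in m else 0.0) for r in REFS}

def pragmatic_speaker(ref):
    """Choose words in proportion to how well a literal listener recovers ref."""
    scores = {w: literal_listener(w)[ref] for w in WORDS}
    z = sum(scores.values())
    return {w: s / z for w, s in scores.items()}

def pragmatic_listener(word):
    """Bayesian update over referents, assuming the pragmatic speaker."""
    scores = {r: pragmatic_speaker(r)[word] for r in REFS}
    z = sum(scores.values())
    return {r: s / z for r, s in scores.items()}

# The literal listener treats "some" as ambiguous (0.5 / 0.5), while the
# pragmatic listener reasons that "all" would have been used for r2, and so
# shifts probability toward r1 (0.75 / 0.25).
```

Measuring informativity against a listener like `pragmatic_listener` instead of `literal_listener` is, in spirit, what passing `agent_type="pragmatic"` does.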
## Next steps

- Add configuration files (ideally YAML, using Hydra) to keep track of filenames and allow for easy experimentation with different parameter settings.
- Measure optimality and run statistical tests in `scripts/analyze.py`.
- ...