Saffron is a tool for multi-stage analysis of text corpora using state-of-the-art natural language processing. It consists of a set of self-contained, independent modules, each providing a distinct analysis of the text. These modules are as follows:
- Corpus Indexing: Analyses raw text documents in various formats and indexes them for later components
- Term Extraction: Extracts the keyphrases that serve as the terms of each document in a collection
- Concept Consolidation: Detects and removes variations from the list of terms of each document
- Author Consolidation: Detects and removes name variations from the list of authors of each document
- DBpedia Lookup: Links terms extracted from a document to URLs on the Semantic Web
- Author Connection: Associates authors with terms from the documents and identifies the importance of each term to each author
- Term Similarity: Measures the relevance of each term to each other term
- Author Similarity: Measures the relevance of each author to each other author
- Taxonomy Extraction: Organizes the terms into a single hierarchical graph that allows for easy browsing of the corpus and deep insights.
- RDF Extraction: Creates a knowledge graph (note that this process can take some time)
More detailed information on the configuration of Saffron can be found here.
Make sure you have Java installed:
java -version
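If the command fails, Java is not installed or not on your PATH. As an illustration only (the required Java version and the exact package name depend on your platform and the Saffron release), a JDK can typically be installed on Debian/Ubuntu via APT:
# Example only: install an OpenJDK package via APT (adjust the version as needed)
sudo apt-get update
sudo apt-get install -y openjdk-8-jdk
java -version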
Saffron uses Apache Maven to build and run, so Maven must be installed (the recommended version is 3.5.4).
Maven can be obtained through package managers such as APT or may be installed as follows:
- Download Maven
wget -O- https://archive.apache.org/dist/maven/maven-3/3.5.4/binaries/apache-maven-3.5.4-bin.tar.gz | tar -xzvf - -C "$HOME"
- Add Maven's bin directory (now under $HOME/apache-maven-3.5.4/bin) to the PATH variable in your ~/.bash_profile, then reload it
export PATH="$HOME/apache-maven-3.5.4/bin:$PATH"
source ~/.bash_profile
- Check that Maven is installed
mvn -version
If using the Web Interface, MongoDB can be used to store the data. If so, install MongoDB using the default settings.
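As an illustration only (package and service names vary by platform and MongoDB version), on Debian/Ubuntu the installation and startup could look like this:
# Example only: install MongoDB from the distribution repositories (package name may differ)
sudo apt-get install -y mongodb
# start the MongoDB daemon (keep it running while Saffron is used)
mongod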
Saffron uses deep learning models for some of its modules, and these files can be quite large. You will need about 3 GB of free disk space to install Saffron and its models.
To install Saffron:
- Clone the GitHub repository
git clone https://github.com/insight-centre/saffron.git ~/saffron-os
- Move to the project directory and install the Maven dependencies
cd ~/saffron-os
mvn clean install
- Run the whole pipeline of Saffron using the Command Line method below.
Note 1: The first run of the pipeline downloads all the models Saffron needs, so it will take longer than subsequent runs.
Note 2: After the last step, you may see the following text in the logs. It can be ignored and does not affect the analysis.
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
All steps of Saffron can be executed by running the saffron.sh script, without using the Web Interface. This script takes three arguments:
- The corpus, which may be
- A folder containing files in TXT, DOC or PDF
- A zip, tar.gz or .tgz file containing files in TXT, DOC or PDF
- A JSON metadata file describing the corpus (see Saffron Formats for more details on the format of the file)
- A URL (to crawl the corpus from)
- The output folder to which the results are written
- The configuration file (as described in Saffron Formats).
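Putting the three arguments together, the general form of the command is the following (angle-bracket names are placeholders, not literal values):
# <corpus>: folder, zip/tar.gz/tgz archive, JSON metadata file, or URL to crawl
# <output>: folder to which the results are written
# <config>: configuration file (see Saffron Formats)
./saffron.sh <corpus> <output> <config>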
In addition, some optional arguments can be specified:
-c <RunConfiguration$CorpusMethod>
: The type of corpus to be used. One of CRAWL, JSON, ZIP (for the corpus as a zip, tar.gz or .tgz file containing files in TXT, DOC or PDF). Defaults to JSON
-i <File>
: The inclusion list of terms and relations (in JSON)
-k <RunConfiguration$KGMethod>
: The method for knowledge graph construction, i.e. whether to generate a taxonomy or a knowledge graph. Choose between TAXO and KG. Defaults to KG
--domain
: Limit the crawl to the domain of the seed URL (if using the CRAWL option for the corpus)
--max-pages <Integer>
: The maximum number of pages to extract when crawling (if using the CRAWL option for the corpus)
--name <String>
: The name of the run
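To illustrate the optional arguments listed above, a hypothetical run over a zipped corpus (my_corpus.zip and the output folder are placeholder names) might look like this:
# Hypothetical example: zip corpus, knowledge graph output, named run
./saffron.sh ./my_corpus.zip ./output ./config.json -c ZIP -k KG --name my-run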
For example, try this test command:
./saffron.sh ./examples/presidential_speech_dataset/corpus_with_authors.json ./web/data/output_KG ./examples/config.json -k TAXO
and verify that you obtain the output JSON files in the ./web/data/output_KG folder.
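As a quick check, you can list the output folder; with the settings above you should find files such as terms.json, doc-terms.json and taxonomy.json (the exact set depends on the options chosen, see the list of output files below):
ls ./web/data/output_KG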
More details on Saffron, i.e. how to install it, how to configure the different features, and the approaches it is based on, can be found in the Wiki (https://github.com/insight-centre/saffron/wiki)
- (optional) If you choose to use MongoDB, install it (using the default settings) and start a session by typing 'mongod' in a terminal. MongoDB has to be running.
The file saffron-web.sh contains settings such as the name given to the database and the host and port it will run on. If using MongoDB and you need to change the database name (it defaults to saffron_test), edit saffron-web.sh and change the line: export MONGO_DB_NAME=saffron_test
To change the MongoDB host and port, edit the following lines in the same file (see the consolidated example after this list):
export MONGO_URL=localhost
export MONGO_PORT=27017
- All results (output JSON files) will be generated in ./web/data/. However, you can store them in the MongoDB database only by setting the following line to false: export STORE_LOCAL_COPY=true
- To start the Saffron Web server, simply choose a directory for Saffron to create the models in and run the following command
./saffron-web.sh
- Then open the corresponding URL in a browser to access the Web Interface
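For reference, the MongoDB-related settings mentioned in the steps above are environment variables exported in saffron-web.sh; with the defaults quoted in this section they look as follows (a partial sketch, not the full script):
# Relevant MongoDB settings in saffron-web.sh (default values)
export MONGO_DB_NAME=saffron_test
export MONGO_URL=localhost
export MONGO_PORT=27017
# set to false to store results only in MongoDB instead of ./web/data/
export STORE_LOCAL_COPY=true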
See the Wiki for more details on how to use the Web Interface
FORMATS.md describes the input files needed to run Saffron and the output files generated by Saffron
It is possible to run each module of Saffron using Docker (note that some modules depend on other modules).
A comprehensive documentation on how to do this is available in ./docs/Saffron_Docker_Documentation.pdf
If the Web Interface is used and STORE_LOCAL_COPY is set to true, the output files are generated and stored in ./web/data/. Saffron generates the following files (see Saffron Formats for more details on each file):
terms.json
: The terms with weights
doc-terms.json
: The document term map with weights
author-terms.json
: The connection between authors and terms
author-sim.json
: The author-author similarity graph
term-sim.json
: The term-term similarity graph
taxonomy.json
: The final taxonomy over the corpus as JSON (if option chosen)
taxonomy.json
: The final taxonomy over the corpus as RDF (if option chosen)
rdf.json
: The final knowledge graph over the corpus as JSON (if option chosen)
rdf.json
: The final knowledge graph over the corpus as RDF (if option chosen)
config.json
: The configuration file for the run
To create a .dot file for the generated taxonomy, you can use the following command:
python taxonomy-to-dot.py taxonomy.json > taxonomy.dot
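If Graphviz is installed (it is not a Saffron dependency, just a common tool for rendering .dot files), the taxonomy can then be rendered to an image, for example:
dot -Tpng taxonomy.dot -o taxonomy.png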
Check here to see how you can contribute to Saffron
Important:
If making any change that impacts the format of the input files, the format of the output files, the format of the configuration file, or the command to run Saffron, please update the following files accordingly:
- README.md
- Files within the examples folder (and sub-folders)
- FORMATS.md
and inform the development team of Saffron.
The Java classes describing the configuration can be found in the JavaDoc
For the API documentation, see Saffron API Documentation