An automation tool to get us started with the Datasaur.ai API and team workspace.
Before running any Robosaur commands, we need to generate our OAuth credentials and obtain our teamId from the URL.
Before running this quickstart, we need to open config.json and do these two things:
- Replace all `<TEAM_ID>` with the correct teamId
- Replace `<DATASAUR_CLIENT_ID>` and `<DATASAUR_CLIENT_SECRET>` with the correct values
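For reference, here is a trimmed sketch (based on the config breakdown later in this readme) of where those values live; the real config.json contains many more fields, so treat this as an outline only:

{
  "datasaur": {
    "clientId": "<DATASAUR_CLIENT_ID>",
    "clientSecret": "<DATASAUR_CLIENT_SECRET>"
  },
  "project": {
    "teamId": "<TEAM_ID>"
  },
  "export": {
    "teamId": "<TEAM_ID>"
  }
}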
Then we can run this command to create multiple projects at once:
npm ci # install Robosaur dependencies, run once on setup
npm run start -- create-projects quickstart/token-based/config/config.json

To export the newly created projects, we can run this command:

npm run start -- export-projects quickstart/token-based/config/config.json

config.json is a sample configuration file for creating "TOKEN_BASED" projects.
To create "ROW_BASED" projects, we need a slightly different configuration file; an example is provided in config.json.
You can try to create row-based projects using the same commands as above, just by changing the configuration file:

npm run start -- create-projects quickstart/row-based/config/config.json

For a more in-depth breakdown, please refer to row-based.md.
Robosaur is developed using TypeScript and Node.js. We recommend using these versions:
- Node.js v16.13.2
- NPM v8 (should be bundled with Node.js)
Currently, Robosaur supports three commands: create-projects, export-projects & apply-tags. Please note that export-projects is designed to only process projects previously created by the create-projects command.
For the explanation in this readme, we will use the file config.json as a reference.
$ npm run start -- create-projects -h
Usage: robosaur create-projects [options] <configFile>
Create Datasaur projects based on the given config file
Options:
--dry-run Simulates what the script is doing without creating the projects
-h, --help display help for command

Robosaur will try to create a project for each folder inside the `create.files.path` folder.
{
"files": {
"source": "local",
"path": "quickstart/token-based/documents"
}
}

In this example, there should be two folders, Project 1 & Project 2, each with a single file:
$ ls -lR quickstart/token-based/documents
total 0
drwxr-xr-x 3 user group Project 1
drwxr-xr-x 3 user group Project 2
quickstart/token-based/documents/Project 1:
total 8
-rw-r--r-- 1 user group lorem.txt
quickstart/token-based/documents/Project 2:
total 8
-rw-r--r-- 1 user group little prince.txt

$ npm run start -- export-projects -h
Usage: robosaur export-projects [options] <configFile>
Export all projects based on the given config file
Options:
-u --unzip Unzips the exported projects, only storing the final version accepted by reviewers
-h, --help display help for command

Robosaur will try to export each project previously created by the create-projects command. Each project will be saved as a separate zipfile under the directory supplied in `export.prefix`. For example, in config.json, this is set to `quickstart/token-based/export`, like so:
{
"export": {
"source": "local",
"prefix": "quickstart/token-based/export"
}
}

By default, Robosaur will request a full project export, with each labeler's version of the project document included. For simpler workflows where we only need the final version of the document, we can use the --unzip option. With this option set, Robosaur will only save the final version of the document to the export destination.
Robosaur supports filtering which projects to export by project status. Overall, there are five different project statuses, from earliest to latest: `CREATED`, `IN_PROGRESS`, `REVIEW_READY`, `IN_REVIEW`, `COMPLETE`.
This can be set in `export.statusFilter` inside the config JSON. In the quickstart config.json, the filter is set to an empty array `[]`. This will cause Robosaur to export all projects, regardless of their state. On the other hand, if we want to export completed projects only, we can set it like this:
{
"export": {
"statusFilter": ["COMPLETE"]
}
}

$ npm run start -- apply-tags -h
Usage: robosaur apply-tags [options] <configFile>
Applies tags to projects based on the given config file
Options:
--method <method> Update method between PUT and PATCH (default: "PUT")
-h, --help display help for command

Robosaur will try to apply tags to the projects specified in the config file's payload, or from a separate CSV file. The CSV file can be read from local storage or from one of our supported cloud services.
With the default method (PUT), apply-tags replaces all of a project's tags with the input, just like the PUT method in a REST API. PATCH, on the other hand, only adds new tags to the project. See the example below.
- Project A has Tag1.
- PUT ["Tag2"]: Project A will have only Tag2.
- PATCH ["Tag2"]: Project A will have Tag1 and Tag2.
If the tag in the config file is not present in the team, Robosaur will create the tag and apply it to the project automatically.
Example config format:
{
"applyTags": {
"teamId": "<TEAM_ID>",
"source": "inline",
"payload": [
{
"projectId": "<PROJECT_ID_1>",
"tags": ["<TAG_1>", "<TAG_2>"]
},
{
"projectId": "<PROJECT_ID_2>",
"tags": ["<TAG_3>"]
}
]
}
}

Example CSV format:
tags,projectId
"<TAG_1>,<TAG_4>",<PROJECT_ID_1>
<TAG_2>,<PROJECT_ID_1>
<TAG_3>,<PROJECT_ID_2>
For the create-projects and export-projects commands, Robosaur can behave a bit smarter with the help of a JSON statefile.
When creating multiple projects with the create-projects command, the statefile helps keep track of which projects have been created previously, and Robosaur will not create a project again if it was successfully created before.
When exporting projects with export-projects, the JSON statefile is treated as the source of truth. Only projects found in the statefile will be checked against the statusFilter and exported.
Robosaur also records the project state when it was last exported, and subsequent runs will only export a project if there has been a forward change in its status.
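The statefile location itself is configured through the `projectState` section of the config file. As a minimal sketch for local storage (the path below is just an example, not a required location):

{
  "projectState": {
    "source": "local",
    "path": "quickstart/token-based/project-state.json"
  }
}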
Robosaur now supports exporting projects not created by Robosaur (stateless). To do this, add the following options to the configuration file:

- `"executionMode"`: Specifies whether the projects to be exported were created with Robosaur or not. Fill with `"stateless"` for projects created outside Robosaur and `"stateful"` for projects created with Robosaur. The default value is `"stateful"`.
- `"projectFilter"`: Specifies which projects to export. Contains the following values:
  - `"kind"` (required): `TOKEN_BASED`, `ROW_BASED`, or `DOCUMENT_BASED`
  - `"date"`
    - `"newestDate"` (required): Ignores all projects created after this date.
    - `"oldestDate"`: Ignores all projects created before this date.
  - `"tags"`: Filter projects by their tag names.
Example:
...
"export": {
"source": "local",
"prefix": "quickstart/token-based/export",
"teamId": "1",
"statusFilter": [],
"executionMode": "stateless",
"projectFilter": {
"kind": "TOKEN_BASED",
"date": {
"newestDate": "2022-03-11",
"oldestDate": "2022-03-07"
},
"tags": ["OCR"]
},
"format": "JSON_ADVANCED",
"fileTransformerId": null
},
...

In this part, we will explain each part of the Robosaur config file. We will use config.json as an example. An in-depth breakdown is also available as a TypeScript file in src/config/interfaces.ts here.
"datasaur"
Contains our OAuthclientIdandclientSecret. These credentials are only enabled for Growth and Enterprise plans. For more information, please reach out to Datasaur"projectState"
Where we want our statefile to be saved.projectState.pathcan be a full or a relative path to a JSON file. For now, keepsourceaslocalfor allsources.
- Project creation (`create-projects`)
  - `"files"`
    Where our project folders are located. A bit different from `projectState.path`; `create.files.path` should be a folder path, relative or full.
  - `"assignment"`
    Where our assignment file is located. `create.assignment.path` is similar to `projectState.path`: it should be a full or relative path pointing to a JSON file.
    `assignment.strategy` accepts one of two options: `"ALL"` or `"AUTO"` (see the sketch after this list).
    - `"ALL"`: each labeler will receive a copy of all documents.
    - `"AUTO"`: Datasaur will assign documents in a round-robin way, with each labeler receiving at least one copy of a document.
    For example, if we have a project with 3 different documents (`#1`, `#2`, `#3`) and 2 labelers, `Alice` and `Bob`, using `"ALL"` means both `Alice` and `Bob` will get all 3 documents. `"AUTO"` means `Alice` will get `#1`, `Bob` will get `#2`, and we then loop back to `Alice`, who will get `#3`.
    So, with `"AUTO"`, `Alice` gets 2 documents and `Bob` gets 1 document.
  - `"project"`
    This is the Datasaur project configuration.
    More options can be seen by creating a project via the web UI, and then clicking the `View Script` button.
    In general, we want to keep these mostly unchanged, except for `project.teamId` and `project.fileTransformerId`.
    - `docFileOptions`: Configuration specific to `ROW_BASED` configs. Refer to row-based.md for more information.
    - `splitDocumentOption`: Allows splitting each document into several parts, based on the `strategy` and `number` options. For more information, see https://datasaurai.gitbook.io/datasaur/basics/workforce-management/split-files
- Project export (`export-projects`)
  - `"export"`
    This changes Robosaur's export behavior.
    `export.prefix` is the folder path where Robosaur will save the export results; make sure Robosaur has write permission to the folder.
    `export.format` & `export.fileTransformerId` affect how Datasaur will export our projects. See this gitbook link for more details.
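As referenced in the `"assignment"` item above, here is a rough sketch of how the creation-related fields could fit together; the assignment file path is only an example, and the exact placement of `strategy` should be verified against src/config/interfaces.ts:

{
  "create": {
    "files": {
      "source": "local",
      "path": "quickstart/token-based/documents"
    },
    "assignment": {
      "path": "quickstart/token-based/assignment.json",
      "strategy": "AUTO"
    }
  }
}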
There are numerous `"source": "local"` entries throughout the configuration, and we said to keep them as-is. For most use cases, creating and exporting projects to and from local storage is the simplest approach. However, Robosaur also supports project creation from files located in S3 buckets, GCS buckets, and Azure Blob Storage containers! All we need to do is set the correct credentials and change the `source` to `s3`, `gcs`, or `azure`.
Here are the examples for credentials and other configs:
- Google Cloud Storage - `config/google-cloud-storage/config.json`

  {
    "credentials": {
      "gcs": { "gcsCredentialJson": "config/google-cloud-storage/credential.json" }
    },
    "files": {
      "source": "gcs",
      "bucketName": "my-bucket",
      "prefix": "projects"
    }
  }

  To fully use Robosaur with a GCS bucket, we can use the `Storage Object Admin` role.
  The specific IAM permissions required are as follows:
  - storage.objects.list
  - storage.objects.get
  - storage.objects.create - to save export results to the GCS bucket
  - storage.objects.delete - used with storage.objects.create to update the statefile
- Amazon S3 Buckets - `config/s3/config.json`

  {
    "credentials": {
      "s3": {
        "s3Endpoint": "s3.amazonaws.com",
        "s3Port": 443,
        "s3AccessKey": "accesskey",
        "s3SecretKey": "secretkey",
        "s3UseSSL": true,
        "s3Region": "bucket-region-or-null"
      }
    },
    "projectState": {
      "source": "s3",
      "bucketName": "my-bucket",
      "path": "path/to/stateFile.json"
    }
  }

  `s3Region` is an optional parameter, indicating where your S3 bucket is located.
  However, we have encountered some cases where we got an `S3: Access Denied` error when it is not defined.
  We recommend setting this property whenever possible.
  Usually, these are identified by access keys starting with `ASIA`...

  To fully use Robosaur with S3 buckets, these are the IAM permissions required:
  - s3:GetObject
  - s3:GetObjectAcl
  - s3:PutObject
  - s3:PutObjectAcl
  - s3:DeleteObject
- Azure Blob Storage - `config/azure-blob-storage/config.json`

  {
    "credentials": {
      "azure": {
        "connectionString": "my-connection-string",
        "containerName": "my-azure-container"
      }
    },
    "projectState": {
      "source": "azure",
      "bucketName": "my-azure-container",
      "path": "path/to/stateFile.json"
    }
  }

  Both `connectionString` and `containerName` are required.
  You can obtain your `connectionString` by copying one of the connection strings from your Azure Storage Account.
  The `containerName` is where you would upload your projects, inside a `projects` folder.
Robosaur's config file uses a different format compared to the `View Script` option in the Project Creation Wizard (PCW). To use the script generated from the PCW, use the `--use-pcw` option on the `create-projects` command.

npm run start -- create-projects <path-to-config-file> --use-pcw

When using the `--use-pcw` option, provide both `pcwPayloadSource` and `pcwPayload` inside `project` in the config file.
`pcwPayloadSource` supports these values:

- `"inline"`: copy and paste the script from the PCW directly inside `pcwPayload`.
Example:
{
...
"project": {
...
"pcwPayloadSource": {
"source": "inline"
},
"pcwPayload": {
<pasted from PCW>
}
...
}
...
}"local": store the PCW script in a local file.pcwPayloadshould be astringcontaining path to the script file.
Example:
{
...
"project": {
...
"pcwPayloadSource": {
"source": "local"
},
"pcwPayload": "path/to/script/file.json"
...
}
...
}"gcs","s3", or"azure": store the PCW script in an object cloud storage.pcwPayloadSourceshould contain another value calledbucketNameandpcwPayloadshould be astringcontaining a path to the file in the object cloud storage. Don't forget to provide credentials to the chosen cloud provider (refer here).
Example:
{
...
"credentials": {
"gcs": { "gcsCredentialJson": "config/google-cloud-storage/credential.json" }
},
"project": {
...
"pcwPayloadSource": {
"source": "gcs",
"bucketName": "my-bucket-name"
},
"pcwPayload": "path/to/script/file.json"
...
}
...
}

Robosaur does not support providing documents through the PCW payload. The documents option inside `pcwPayload` is still required to get the question sets and/or the `docFileOptions`, but all documents provided will be ignored. Instead, documents should be provided through the usual Robosaur method (refer here).
{
...
"assignment": { // use this only if you want to distribute by projects
...
},
"project": {
...
"pcwPayloadSource": {
"source": "inline",
},
"pcwAssignmentStrategy": "AUTO", // remove this if you want to distribute by projects
"pcwPayload": {
...
"documentAssignments": [
{
"teamMemberId": "1",
"documents": [ // this will be ignored
{
"fileName": "lorem.txt",
"part": 0
}
],
"role": "LABELER_AND_REVIEWER"
},
{
"teamMemberId": "2",
"documents": [ // this will be ignored
{
"fileName": "lorem.txt",
"part": 0
}
],
"role": "LABELER"
}
],
...
}
...
}
...
}

Robosaur supports using the PCW's labeler and reviewer assignment settings, but note that the `assignment` option should not be provided. Also, the `documents` option used for assigning specific documents to each labeler will be ignored. Instead, use the `pcwAssignmentStrategy` option to specify the `AUTO` or `ALL` assignment method.