Research on text analysis using crowd work and text highlights

This repository contains my master's project, in which I tested whether crowd workers can solve challenging general text analysis problems. The repository includes the project report and the source code needed to reproduce the experiments and analyze the data.

The purpose of this repository is twofold. First, it documents the project findings, which may point in a useful direction and save time for those working on a similar problem. Second, I implemented the experiments using the Meteor JavaScript framework, and I couldn't find many Meteor examples while I was writing my code. I hope the source code published here will be useful to other developers looking for Meteor code snippets.

Details

As the crowd platform I used TurkServer, a Meteor-based framework for running experiments on Amazon Mechanical Turk. Please see the report file for a description of the experiments that I ran on the platform. The repository contains a directory with Meteor code for each experiment and each task.

As a backend I used MongoDB to store the workers' results. Here is the database schema:

UHexperiments - main collection with the following fields:
- groupId - unique id to locate any assignment. It is used to join all collections.
- goBefore - unique id of the Treatment-2-annotators assignment that is used by a Treatment-2-labelers worker
- workerId - worker id, which can be used to pay bonuses
bigLog - collection which contains all worker actions on the web page.
mainText - collection which contains texts of each worker.
marked - collection which contains highlights of each worker.
exp3AB - collection which maps annotators to labelers in the Treatment 2 group. Before running the labeling experiments, you need to insert the groupIds of the Treatment-2-annotators workers. An insert looks like exp3AB.insert({idRandom: 688, idB: "", idA: "Hk9wHJkq2sLNM2xXr"}); where:
- idRandom is a unique number;
- idB is filled in after a Treatment-2-labelers worker completes the assignment;
- idA contains the groupId of a Treatment-2-annotators worker. Many Treatment-2-labelers workers may be assigned the same idA. All of them will be listed in the idB field.
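
As a rough illustration of the seeding step described above, the following Python sketch prints one exp3AB.insert statement per annotator groupId. The helper name build_exp3ab_inserts and the sequential idRandom values are assumptions made for this example; the schema only requires idRandom to be unique, and the study may have chosen the numbers differently.

```python
def build_exp3ab_inserts(annotator_group_ids, first_id=1):
    """Return one exp3AB.insert statement per Treatment-2-annotators groupId.

    Assumption: idRandom only has to be unique, so sequential numbers
    starting at first_id are used here.
    """
    statements = []
    for offset, group_id in enumerate(annotator_group_ids):
        statements.append(
            'exp3AB.insert({idRandom: %d, idB: "", idA: "%s"});'
            % (first_id + offset, group_id)
        )
    return statements


if __name__ == "__main__":
    # The groupId below is the one used in the example above.
    for statement in build_exp3ab_inserts(["Hk9wHJkq2sLNM2xXr"], first_id=688):
        print(statement)
```

The printed statements use the same form as the example insert, so they can be pasted wherever that insert was run.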

Data files in the repository

Gold standards_corrected.docx - contains all texts for all tasks with marked clues/traps and the correct answers. The highlights in the text are for reference only. To check the exact words that define the clues/traps, see the Python script (results_parser.py).
FN_FP_stat.docx - false positives and false negatives counts and ratios for clues and traps for all tasks
regression_*.csv - data that I used for the regression models

Below are the steps I performed to extract the workers' highlights.

Install the MongoDB shell.
Go to the shell folder and execute:

>> mongo "INSTANCE" --authenticationDatabase admin --ssl --username "USER" --password "PASS"
>> use DB
>> show collections
>> DBQuery.shellBatchSize = 300

I suggest always executing the last command. Otherwise, the shell prints only the first batch of results (20 documents by default) and you may miss some of them.

To extract the workers' highlights, prepare one query per groupId, for example:

db.marked.find({groupId: "i6t4bGDwb4BkgmBnY"});
db.marked.find({groupId: "6vv7LPu8Wwsw8AY5d"});
db.marked.find({groupId: "FAnEpNQYrG2bHoyTX"});
db.marked.find({groupId: "oQikxxG9G3962SsnZ"});
db.marked.find({groupId: "HQ5STHeAp3wtKKxRt"});
db.marked.find({groupId: "Ae2ELKhYkDDg7v4tA"});
...
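
With many assignments, writing these queries by hand is tedious. A minimal Python sketch that generates them from a list of groupIds follows; the function name build_marked_queries is hypothetical and is not part of the repository.

```python
def build_marked_queries(group_ids):
    """Return one db.marked.find query string per groupId."""
    return ['db.marked.find({groupId: "%s"});' % gid for gid in group_ids]


if __name__ == "__main__":
    # Two of the groupIds listed above, used as sample input.
    for query in build_marked_queries(["i6t4bGDwb4BkgmBnY", "6vv7LPu8Wwsw8AY5d"]):
        print(query)
```

The output can be pasted into the MongoDB shell after the setup commands shown earlier.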

The database returns rows like the following:

{ "_id" : "jfXQxv9gET7NJuMaY", "name" : "Kitzbuel", "groupId" : "bwXvLKDq8thSjSyBq", "pId" : 13, "startPos" : 48, "endPos" : 53 }
{ "_id" : "FSRyjopiePZ6RFaGw", "name" : "Kitzbuel", "groupId" : "bwXvLKDq8thSjSyBq", "pId" : 13, "startPos" : 100, "endPos" : 104 }
{ "_id" : "rzwm4CGc2mnzCnPvY", "name" : "Kitzbuel", "groupId" : "bwXvLKDq8thSjSyBq", "pId" : 17, "startPos" : 236, "endPos" : 241 }
{ "_id" : "X9n4bN74ELwt3cL7M", "name" : "Kitzbuel", "groupId" : "bwXvLKDq8thSjSyBq", "pId" : 13, "startPos" : 178, "endPos" : 187 }
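
Rows of this shape contain only strings and integers, so each one is also valid JSON. The sketch below, which is an illustration and not the logic of results_parser.py, reads such rows and groups the highlighted ranges by paragraph id (pId). It assumes one document per line; shell output with non-JSON values such as ObjectId(...) would need different handling.

```python
import json
from collections import defaultdict

SAMPLE_OUTPUT = """
{ "_id" : "jfXQxv9gET7NJuMaY", "name" : "Kitzbuel", "groupId" : "bwXvLKDq8thSjSyBq", "pId" : 13, "startPos" : 48, "endPos" : 53 }
{ "_id" : "FSRyjopiePZ6RFaGw", "name" : "Kitzbuel", "groupId" : "bwXvLKDq8thSjSyBq", "pId" : 13, "startPos" : 100, "endPos" : 104 }
{ "_id" : "rzwm4CGc2mnzCnPvY", "name" : "Kitzbuel", "groupId" : "bwXvLKDq8thSjSyBq", "pId" : 17, "startPos" : 236, "endPos" : 241 }
{ "_id" : "X9n4bN74ELwt3cL7M", "name" : "Kitzbuel", "groupId" : "bwXvLKDq8thSjSyBq", "pId" : 13, "startPos" : 178, "endPos" : 187 }
"""


def parse_marked_rows(shell_output):
    """Parse one JSON document per line, skipping blank or non-document lines."""
    rows = []
    for line in shell_output.splitlines():
        line = line.strip()
        if line.startswith("{"):
            rows.append(json.loads(line))
    return rows


def spans_by_paragraph(rows):
    """Map each pId to a sorted list of (startPos, endPos) highlight ranges."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["pId"]].append((row["startPos"], row["endPos"]))
    return {p_id: sorted(spans) for p_id, spans in grouped.items()}


if __name__ == "__main__":
    print(spans_by_paragraph(parse_marked_rows(SAMPLE_OUTPUT)))
```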

The above rows can be parsed by the results_parser.py script, which counts false positives, false negatives, etc. for clues, traps, and empty words. It counts each clue/trap/empty word as highlighted if any part of it is highlighted by the worker.
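
The "any part highlighted" rule amounts to an interval-overlap test. The following sketch illustrates that rule only and is not taken from results_parser.py. It assumes that gold spans and highlights are (start, end) character ranges within the same paragraph and that the end position is exclusive; the names overlaps and count_highlighted and the sample gold spans are hypothetical.

```python
def overlaps(span, highlight):
    """True if two (start, end) ranges share at least one position.

    Assumption: end positions are exclusive.
    """
    return span[0] < highlight[1] and highlight[0] < span[1]


def count_highlighted(gold_spans, highlights):
    """Return {label: (total, highlighted)} for labelled gold spans.

    gold_spans is a list of (label, (start, end)) pairs, where label is
    for example "clue", "trap" or "empty". A span counts as highlighted
    if any worker highlight overlaps it.
    """
    counts = {}
    for label, span in gold_spans:
        hit = any(overlaps(span, h) for h in highlights)
        total, hits = counts.get(label, (0, 0))
        counts[label] = (total + 1, hits + int(hit))
    return counts


if __name__ == "__main__":
    # Made-up gold spans, for illustration only.
    gold = [("clue", (45, 50)), ("trap", (100, 110)), ("clue", (200, 210))]
    worker_highlights = [(48, 53), (100, 104)]
    print(count_highlighted(gold, worker_highlights))
```

From the per-label totals and hits, counts such as missed clues or highlighted traps follow by subtraction.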

Repository: epishova/crowd_highlights