# Evaluation for Text-to-SQL Using Distilled Test Suites

This repo contains the test suite evaluation metric for 11 text-to-SQL tasks. Compared with other current metrics, test suite evaluation efficiently computes a tighter upper bound for semantic accuracy. It was proposed in our EMNLP 2020 paper: [Semantic Evaluation for Text-to-SQL with Distilled Test Suites](https://arxiv.org/abs/2010.02840). It is now the official metric of [Spider](https://yale-lily.github.io/spider), [SParC](https://yale-lily.github.io/sparc), and [CoSQL](https://yale-lily.github.io/cosql), and is also available for Academic, ATIS, Advising, Geography, IMDB, Restaurants, Scholar, and Yelp (building on the amazing work by [Catherine and Jonathan](https://github.com/jkkummerfeld/text2sql-data)).

Notice: please refer to [Ruiqi's repo](https://github.com/ruiqi-zhong/TestSuiteEval) for the code that generates neighbor queries, samples databases, and constructs the test suites described in the paper. We look forward to similar evaluations in other semantic parsing domains as well.

## Setting Up

To run the test suite execution evaluation, first download the test suites (databases) for the 11 text-to-SQL tasks from [here](https://drive.google.com/file/d/1IJvpd30D3qP6BZu_1bwUSi7JCyynEOMp/view?usp=sharing) and put them in the `database/` directory.

## Official Evaluation for Spider, SParC, and CoSQL

We will report the test suite accuracy on the official [Spider](https://yale-lily.github.io/spider), [SParC](https://yale-lily.github.io/sparc), and [CoSQL](https://yale-lily.github.io/cosql) leaderboards (starting Oct. 2020). The original set match accuracy will be reported as a reference.

Below is an example command to calculate the test suite accuracy on the development sets of Spider, SParC, and CoSQL.

```
python3 evaluation.py --gold [gold file] --pred [predicted file] --etype [evaluation type] --db [database dir] --table [table file] --plug_value --keep_distinct --progress_bar_for_each_datapoint

arguments:
  [gold file]        gold file where each line is `a gold SQL \t db_id` for Spider, SParC, and CoSQL; interactions are separated by one empty line for SParC and CoSQL. See the example at evaluation_examples/gold.txt
  [predicted file]   predicted SQL file where each line is a predicted SQL; interactions are separated by one empty line. See the example at evaluation_examples/predict.txt
  [database dir]     the directory that contains all the databases and test suites
  [table file]       the tables.json file, which includes the foreign key information for each database
  [evaluation type]  "exec" for test suite accuracy (default), "match" for the original exact set match accuracy, and "all" for both
  --plug_value       whether to plug the gold values into the predicted query; suitable if your model does not predict values
  --keep_distinct    whether to keep the DISTINCT keyword during evaluation; default is false
  --progress_bar_for_each_datapoint
                     whether to print a progress bar of running test inputs for each datapoint
```

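For reference, a minimal Python sketch of the gold-file line format described above; the database id `concert_singer` and the query below are illustrative examples, not lines taken from the actual files:

```python
# Each gold line is a SQL query and a database id separated by a tab.
# (Illustrative example; see evaluation_examples/gold.txt for real lines.)
gold_line = "SELECT count(*) FROM singer\tconcert_singer"
pred_line = "SELECT count(*) FROM singer"  # predicted file: the SQL only

sql, db_id = gold_line.split("\t")
print(db_id)  # → concert_singer
```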
#### Test Suite Execution Accuracy without Values
If your system does NOT predict values in the SQL queries, you should add `--plug_value`, which extracts the values used in the gold query and plugs them into the predicted query.
```
python3 evaluation.py
    --gold [gold file]
    --pred [predicted file]
    --db [database dir]
    --etype exec
    --plug_value
```
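To illustrate the idea behind `--plug_value`, here is a simplified sketch (NOT the actual implementation in evaluation.py): literals are extracted from the gold query and substituted into the predicted query, assuming the model emits the placeholder token `value` where a literal belongs.

```python
import re

def plug_gold_values(gold_sql: str, pred_sql: str) -> str:
    """Illustrative sketch only: copy gold literals into the prediction."""
    # Collect quoted strings and numbers from the gold SQL, in order.
    literals = re.findall(r"'[^']*'|\b\d+(?:\.\d+)?\b", gold_sql)
    out = pred_sql
    for lit in literals:
        # Assumed placeholder convention: the model writes `value`.
        out = out.replace("value", lit, 1)
    return out

print(plug_gold_values(
    "SELECT name FROM singer WHERE age > 20",
    "SELECT name FROM singer WHERE age > value",
))  # → SELECT name FROM singer WHERE age > 20
```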
To also compute the original set match accuracy:
```
python3 evaluation.py
    --gold [gold file]
    --pred [predicted file]
    --db [database dir]
    --table [table file]
    --etype all
    --plug_value
```

#### Test Suite Execution Accuracy with Values
We encourage people to report performance with value predictions and to omit the `--plug_value` argument.
```
python3 evaluation.py
    --gold [gold file]
    --pred [predicted file]
    --db [database dir]
    --etype exec
```

#### Other Arguments
If `--keep_distinct` is included, the DISTINCT keywords will NOT be removed during evaluation. For a fair comparison with the original exact set match metric, `--keep_distinct` should not be added.

Include `--progress_bar_for_each_datapoint` if you suspect that the execution got stuck on a specific test input; it will print the progress of running on each test input.


## Evaluation for Other Classical Text-to-SQL Datasets

Prior work on classical text-to-SQL datasets (ATIS, Academic, Advising, Geography, IMDB, Restaurants, Scholar, Yelp) usually reports exact string match accuracy and execution accuracy over a single database content, which either exaggerates or deflates the real semantic accuracy.

The test sets for the classical text-to-SQL datasets are adopted from [this repo](https://github.com/jkkummerfeld/text2sql-data). We used the test split if one is defined, and the entire dataset otherwise. We also rewrote the SQL queries to conform with the style of the Spider dataset.

All the test datapoints are saved in `classical_test.pkl`. Each test datapoint is represented as a dictionary with the following keys and values:

- `db_id`: which of the eight original classical datasets the datapoint belongs to. `database/[db_id]/[db_id].sqlite` contains an empty database with the associated schema.
- `query`: the ground truth SQL query (or any semantically equivalent variant) that the model needs to predict.
- `variables`: the constants used in the SQL query. We also include a field called `ancestor_of_occuring_column`, where we find all the columns that contain this value and recursively find their "ancestor columns" (a column's ancestor is the parent column it refers to via a foreign key reference). This field is especially useful if your algorithm uses database content to help generate model predictions.
- `testsuite`: the set of database paths on which we compare denotations.
- `texts`: the associated natural language descriptions, with the constant values extracted.

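A sketch of what a datapoint might look like. The field contents below are made up for illustration; only the key names come from the description above, and loading the real file is shown in the comment:

```python
import pickle

# Hypothetical datapoint mirroring the documented keys (NOT real data).
datapoint = {
    "db_id": "geography",
    "query": "SELECT river_name FROM river WHERE length > 500",
    "variables": [{"name": "length0", "example": "500",
                   "ancestor_of_occuring_column": ["river.length"]}],
    "testsuite": ["database/geography/geography.sqlite"],
    "texts": ["which rivers are longer than length0 ?"],
}

# Loading the real test set would look like:
# with open("classical_test.pkl", "rb") as f:
#     test_data = pickle.load(f)

print(sorted(datapoint.keys()))
```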
You can evaluate your model in whatever configuration you want. For example, you may choose to plug the values into the text and ask the model itself to figure out which constants the user has given; or you can relax the modeling assumption and assume the model has oracle access to the ground truth constant values; or you can further relax it and assume the model knows which "ancestor column" contains each provided constant. However, in any case, you **SHOULD NOT** change the gold query, since test suite generation depends on it.

The `judge` function in `evaluate_classical.py` contains what you need to evaluate a single model prediction. It takes the ground truth information of a datapoint (an element in `classical_test.pkl`, represented as a dictionary) and a model prediction (a string), and returns True/False: whether the prediction is semantically correct.

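A minimal sketch of an accuracy loop built on such a function. `judge` is stubbed here with exact string match so the sketch is self-contained; the real `judge` in `evaluate_classical.py` performs test suite execution instead:

```python
# Stand-in for evaluate_classical.judge (exact match placeholder only).
def judge(gold_info: dict, prediction: str) -> bool:
    return prediction.strip() == gold_info["query"]

# Hypothetical datapoints and predictions, for illustration.
datapoints = [{"query": "SELECT name FROM city"},
              {"query": "SELECT count(*) FROM river"}]
predictions = ["SELECT name FROM city",
               "SELECT count(*) FROM lake"]

correct = sum(judge(d, p) for d, p in zip(datapoints, predictions))
accuracy = correct / len(datapoints)
print(accuracy)  # → 0.5
```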
Suppose you have made a model prediction for every datapoint and written them to a `.txt` file (one prediction per line). You can use the following example command to calculate the accuracy:

```
python3 evaluate_classical.py --gold [gold file] --pred [predicted file] --out_file [output file] --num_processes [process number]

arguments:
  [gold file]       path to the gold file: classical_test.pkl
  [predicted file]  path to the predicted file. See the example at evaluation_examples/classical_test_gold.txt
  [output file]     the output file path, e.g. goldclassicaltest.pkl
  [process number]  number of processes to use
```


## Citation

```
@InProceedings{ruiqi20,
  author    = {Ruiqi Zhong and Tao Yu and Dan Klein},
  title     = {Semantic Evaluation for Text-to-SQL with Distilled Test Suites},
  year      = {2020},
  booktitle = {The 2020 Conference on Empirical Methods in Natural Language Processing},
  publisher = {Association for Computational Linguistics},
}
```
