
Commit 9399915

readme updated
1 parent 82793e5 commit 9399915

2 files changed: +87 −60 lines


README.md

Lines changed: 83 additions & 57 deletions

# Evaluation for Text-to-SQL Using Distilled Test Suites

This repo contains the test suite evaluation metric for 11 text-to-SQL tasks. Compared with other current metrics, test suite evaluation efficiently computes a tighter upper bound on semantic accuracy. It is proposed in our EMNLP 2020 paper: [Semantic Evaluation for Text-to-SQL with Distilled Test Suites](https://arxiv.org/abs/2010.02840). It is now the official metric of [Spider](https://yale-lily.github.io/spider), [SParC](https://yale-lily.github.io/sparc), and [CoSQL](https://yale-lily.github.io/cosql), and is also available for Academic, ATIS, Advising, Geography, IMDB, Restaurants, Scholar, and Yelp (building on the amazing work by [Catherine and Jonathan](https://github.com/jkkummerfeld/text2sql-data)).

Notice: Please refer to [Ruiqi's repo](https://github.com/ruiqi-zhong/TestSuiteEval) for the code for generating neighbor queries, sampling databases, and constructing the test suites described in the paper. We look forward to similar evaluations in other semantic parsing domains as well.

## Setting Up

To run the test suite execution evaluation, first download the test suites (databases) for the 11 text-to-SQL tasks from [here](https://drive.google.com/file/d/1IJvpd30D3qP6BZu_1bwUSi7JCyynEOMp/view?usp=sharing) and put them in the `database/` directory.
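
To sanity-check the setup, you can count the `.sqlite` files that ended up under `database/` (a minimal sketch; it only assumes the `database/[db_id]/[db_id].sqlite` layout described later in this README):

```python
# Quick sanity check that the test suite databases are in place.
# Assumes the database/[db_id]/[db_id].sqlite layout used by this repo.
import glob

db_files = glob.glob("database/*/*.sqlite")
print(f"found {len(db_files)} .sqlite files under database/")
```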

## Official Evaluation for Spider, SParC, and CoSQL

We will report the test suite accuracy for the official [Spider](https://yale-lily.github.io/spider), [SParC](https://yale-lily.github.io/sparc), and [CoSQL](https://yale-lily.github.io/cosql) leaderboards (starting Oct. 2020). The original exact set match accuracy will also be reported for reference.

Below is an example command to calculate the test suite accuracy on the development sets of Spider, SParC, and CoSQL.

```
python3 evaluation.py --gold [gold file] --pred [predicted file] --etype [evaluation type] --db [database dir] --table [table file] --plug_value --keep_distinct --progress_bar_for_each_datapoint

arguments:
  [gold file]         gold file where each line is `a gold SQL \t db_id`; for SParC and CoSQL, interactions are separated by one empty line. See an example at evaluation_examples/gold.txt
  [predicted file]    predicted SQL file where each line is a predicted SQL query; interactions are separated by one empty line. See an example at evaluation_examples/predict.txt
  [database dir]      the directory that contains all the databases and test suites
  [table file]        tables.json file which includes the foreign key information of each database
  [evaluation type]   "exec" for test suite accuracy (default), "match" for the original exact set match accuracy, and "all" for both
  --plug_value        whether to plug the gold values into the predicted query; suitable if your model does not predict values
  --keep_distinct     whether to keep the DISTINCT keyword during evaluation; default is false
  --progress_bar_for_each_datapoint   whether to print a progress bar over the test inputs for each datapoint
```
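
For concreteness, here is what a minimal gold/prediction file pair looks like (purely illustrative: `concert_singer` is one of the Spider dev databases, and the file names are arbitrary):

```python
# Illustrative only: write a two-line gold file and a matching prediction file
# in the format described above (each gold line is "SQL \t db_id").
gold_lines = [
    "SELECT count(*) FROM singer\tconcert_singer",
    "SELECT name FROM singer WHERE age > 30\tconcert_singer",
]
pred_lines = [
    "SELECT count(*) FROM singer",
    "SELECT name FROM singer WHERE age > 30",
]
with open("toy_gold.txt", "w") as f:
    f.write("\n".join(gold_lines) + "\n")
with open("toy_pred.txt", "w") as f:
    f.write("\n".join(pred_lines) + "\n")
# Then, for example:
#   python3 evaluation.py --gold toy_gold.txt --pred toy_pred.txt --db database/ --etype exec
```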

#### Test Suite Execution Accuracy without Values

If your system does NOT predict values in the SQL queries, you should add `--plug_value`, which extracts the values used in the gold query and plugs them into the predicted query.

```
python3 evaluation.py \
  --gold [gold file] \
  --pred [predicted file] \
  --db [database dir] \
  --etype exec \
  --plug_value
```

To also compute the original exact set match accuracy:

```
python3 evaluation.py \
  --gold [gold file] \
  --pred [predicted file] \
  --db [database dir] \
  --table [table file] \
  --etype all \
  --plug_value
```

#### Test Suite Execution Accuracy with Values

We encourage people to report performance with predicted values and therefore not to include the `--plug_value` argument.

```
python3 evaluation.py \
  --gold [gold file] \
  --pred [predicted file] \
  --db [database dir] \
  --etype exec
```

#### Other Arguments

If `--keep_distinct` is included, the DISTINCT keywords will NOT be removed during evaluation. For a fair comparison with the original exact set match metric, `--keep_distinct` should not be added.

Include `--progress_bar_for_each_datapoint` if you suspect that the execution got stuck on a specific test input; it will print the progress of running each test input.

## Evaluation for Other Classical Text-to-SQL Datasets

Prior work on the classical text-to-SQL datasets (ATIS, Academic, Advising, Geography, IMDB, Restaurants, Scholar, Yelp) usually reports exact string match accuracy and execution accuracy over a single database, which either exaggerates or deflates the real semantic accuracy.

The test sets for the classical text-to-SQL datasets are adopted from [this repo](https://github.com/jkkummerfeld/text2sql-data). We use the test split when one is defined, and the entire dataset otherwise. We also rewrite the SQL queries to conform to the style of the Spider dataset.

All the test datapoints are saved in `classical_test.pkl`. Each test datapoint is represented as a dictionary with the following keys and values (a short inspection sketch follows the list):

- `db_id`: which of the eight original classical datasets the datapoint belongs to; `database/[db_id]/[db_id].sqlite` contains an empty database with the associated schema.
- `query`: the ground truth SQL query (or any semantically equivalent variant) that the model needs to predict.
- `variables`: the constants used in the SQL query. We also include a field called `ancestor_of_occuring_column`, where we find all the columns that contain a given value and recursively find their ancestor columns (a column's ancestor is the parent column it refers to through a foreign key). This field is especially useful if your algorithm uses database content to help generate predictions.
- `testsuite`: a set of database paths on which we compare denotations.
- `texts`: the associated natural language descriptions, with the constant values extracted.
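
For example, you can peek at one datapoint like this (a minimal sketch; it assumes `classical_test.pkl` is in the current working directory):

```python
# Minimal sketch: inspect one datapoint in classical_test.pkl.
import pickle as pkl

gold_dicts = pkl.load(open("classical_test.pkl", "rb"))
example = gold_dicts[0]
print(example["db_id"])      # which classical dataset the datapoint comes from
print(example["query"])      # the gold SQL query (do NOT modify it)
print(example["variables"])  # the constants used in the gold SQL
print(example["testsuite"])  # database paths used to compare denotations
print(example["texts"])      # the associated natural language descriptions
```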

You can evaluate your model in whatever configuration you want. For example, you may choose to plug the values into the text and ask the model itself to figure out which constants the user has given;
or you can relax the modelling assumption and give the model oracle access to the ground truth constant values; or you can relax it further and assume the model also knows which "ancestor column" contains each provided constant.
However, in any case, you **SHOULD NOT** change the gold query, since test suite generation depends on it.

The `judge` function in `evaluate_classical.py` contains what you need to evaluate a single model prediction.
It takes the ground truth information of a datapoint (an element of `classical_test.pkl`, represented as a dictionary) and a model prediction (as a string), and returns True/False - whether the prediction is semantically correct.

Suppose you have made a model prediction for every datapoint and written them to a `.txt` file (one prediction per line); you can then use the following example command to calculate the accuracy:

```
python3 evaluate_classical.py --gold [gold file] --pred [predicted file] --out_file [output file] --num_processes [process number]

arguments:
  [gold file]        the path to the gold file: classical_test.pkl
  [predicted file]   the path to the predicted file. See an example at evaluation_examples/classical_test_gold.txt
  [output file]      the output file path, e.g. goldclassicaltest.pkl
  [process number]   number of processes to use
```
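
If you prefer to call `judge` directly from Python instead of going through the script, a minimal sketch looks like this (it assumes `evaluate_classical.py` and `classical_test.pkl` are in the working directory; `my_predictions.txt` is a hypothetical predictions file with one SQL query per line):

```python
# Minimal sketch: score a predictions file by calling judge directly.
# my_predictions.txt is a hypothetical file, one predicted SQL per line.
import pickle as pkl
from evaluate_classical import judge

gold_dicts = pkl.load(open("classical_test.pkl", "rb"))
with open("my_predictions.txt") as f:
    preds = [line.strip() for line in f]
assert len(gold_dicts) == len(preds), "number of gold and prediction should be equal"

# judge takes a (gold_dict, predicted_sql) tuple and returns True/False.
results = [judge((gold, pred)) for gold, pred in zip(gold_dicts, preds)]
print("accuracy: %.4f" % (sum(results) / len(results)))
```

The script itself does the same scoring, parallelized over `--num_processes` processes.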

## Citation

```
@InProceedings{ruiqi20,
  author    = {Ruiqi Zhong and Tao Yu and Dan Klein},
  title     = {Semantic Evaluation for Text-to-SQL with Distilled Test Suites},
  year      = {2020},
  booktitle = {The 2020 Conference on Empirical Methods in Natural Language Processing},
  publisher = {Association for Computational Linguistics},
}
```

evaluate_classical.py

Lines changed: 4 additions & 3 deletions
The change adds a `--gold` argument and threads it through to `main`, so the path to the gold file is no longer hard-coded:

@@ -64,8 +64,8 @@ def judge(args: Tuple[Dict[str, Any], str]) -> bool:
     return pass_all_testcase


-def main(preds: List[str], verbose: bool = True, num_processes: int = NUM_PROCESSES) -> List[bool]:
-    gold_dicts = pkl.load(open('classical_test.pkl', 'rb'))
+def main(preds: List[str], gold_file: str = "classical_test.pkl", verbose: bool = True, num_processes: int = NUM_PROCESSES) -> List[bool]:
+    gold_dicts = pkl.load(open(gold_file, 'rb'))
     assert len(gold_dicts) == len(preds), 'number of gold and prediction should be equal'
     group_name2idxes = defaultdict(list)

@@ -85,6 +85,7 @@ def main(preds: List[str], verbose: bool = True, num_processes: int = NUM_PROCESSES) -> List[bool]:
 if __name__ == "__main__":
     start = time.time()
     parser = argparse.ArgumentParser()
+    parser.add_argument('--gold', dest='gold', type=str, default='classical_test.pkl', help="the path to the gold datapoints file (classical_test.pkl)")
     parser.add_argument('--pred', dest='pred', type=str, help="the path to the predicted queries")
     parser.add_argument('--out_file', type=str, required=True, help='the output file path')
     parser.add_argument('--num_processes', default=NUM_PROCESSES, help='number of processes to use')

@@ -93,6 +94,6 @@ def main(preds: List[str], verbose: bool = True, num_processes: int = NUM_PROCESSES) -> List[bool]:
     preds = load_predictions(args.pred)
     assert not os.path.exists(args.out_file), 'output file path %s already exists' % args.out_file

-    result = main(preds=preds, verbose=True, num_processes=args.num_processes)
+    result = main(gold_file=args.gold, preds=preds, verbose=True, num_processes=args.num_processes)
     pkl.dump(result, open(args.out_file, 'wb'))
     print('total time used: ', time.time() - start)
