
Commit 20ece19

donghaoren and tafsiri committed
First commit
Co-Authored-By: Donghao Ren <1595237+donghaoren@users.noreply.github.com>
Co-Authored-By: Yannick Assogba <26408+tafsiri@users.noreply.github.com>
0 parents  commit 20ece19


11 files changed: +1568 -0 lines changed

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
__pycache__/
data/
logs/

CODE_OF_CONDUCT.md

Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
# Code of Conduct

## Our Pledge

In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to making participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, sex characteristics, gender identity and expression,
level of experience, education, socio-economic status, nationality, personal
appearance, race, religion, or sexual identity and orientation.

## Our Standards

Examples of behavior that contributes to creating a positive environment
include:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

* The use of sexualized language or imagery and unwelcome sexual attention or
  advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
  address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
  professional setting

## Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.

## Scope

This Code of Conduct applies within all project spaces, and it also applies when
an individual is representing the project or its community in public spaces.
Examples of representing a project or community include using an official
project e-mail address, posting via an official social media account, or acting
as an appointed representative at an online or offline event. Representation of
a project may be further defined and clarified by project maintainers.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the open source team at [opensource-conduct@group.apple.com](mailto:opensource-conduct@group.apple.com). All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant](https://www.contributor-covenant.org), version 1.4,
available at [https://www.contributor-covenant.org/version/1/4/code-of-conduct.html](https://www.contributor-covenant.org/version/1/4/code-of-conduct.html)

CONTRIBUTING.md

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
# Contribution Guide

Thanks for your interest in contributing. This project was released to accompany a research paper for purposes of reproducibility, and beyond its publication there are limited plans for future development of the repository.

While we welcome new pull requests and issues, please note that our response may be limited. Forks and out-of-tree improvements are strongly encouraged.

## Before you get started

By submitting a pull request, you represent that you have the right to license your contribution to Apple and the community, and agree by submitting the patch that your contributions are licensed under the [LICENSE](LICENSE).

We ask that all community members read and observe our [Code of Conduct](CODE_OF_CONDUCT.md).

LICENSE

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
Copyright (C) 2024 Apple Inc. All Rights Reserved.

IMPORTANT: This Apple software is supplied to you by Apple
Inc. ("Apple") in consideration of your agreement to the following
terms, and your use, installation, modification or redistribution of
this Apple software constitutes acceptance of these terms. If you do
not agree with these terms, please do not use, install, modify or
redistribute this Apple software.

In consideration of your agreement to abide by the following terms, and
subject to these terms, Apple grants you a personal, non-exclusive
license, under Apple's copyrights in this original Apple software (the
"Apple Software"), to use, reproduce, modify and redistribute the Apple
Software, with or without modifications, in source and/or binary forms;
provided that if you redistribute the Apple Software in its entirety and
without modifications, you must retain this notice and the following
text and disclaimers in all such redistributions of the Apple Software.
Neither the name, trademarks, service marks or logos of Apple Inc. may
be used to endorse or promote products derived from the Apple Software
without specific prior written permission from Apple. Except as
expressly stated in this notice, no other rights or licenses, express or
implied, are granted by Apple herein, including but not limited to any
patent rights that may be infringed by your derivative works or by other
works in which the Apple Software may be incorporated.

The Apple Software is provided by Apple on an "AS IS" basis. APPLE
MAKES NO WARRANTIES, EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION
THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY AND FITNESS
FOR A PARTICULAR PURPOSE, REGARDING THE APPLE SOFTWARE OR ITS USE AND
OPERATION ALONE OR IN COMBINATION WITH YOUR PRODUCTS.

IN NO EVENT SHALL APPLE BE LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL
OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) ARISING IN ANY WAY OUT OF THE USE, REPRODUCTION,
MODIFICATION AND/OR DISTRIBUTION OF THE APPLE SOFTWARE, HOWEVER CAUSED
AND WHETHER UNDER THEORY OF CONTRACT, TORT (INCLUDING NEGLIGENCE),
STRICT LIABILITY OR OTHERWISE, EVEN IF APPLE HAS BEEN ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.

README.md

Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,91 @@
# Synthetic data generator for multi-step key retrieval tasks

This folder contains code to generate synthetic data for one-step, two-step, three-step, and concatenation tasks.

These tasks are used in the following paper: <https://arxiv.org/abs/2407.21049>.

Install dependencies with:

```bash
pip install click datasets transformers
```

Alternatively, to use the dependency versions specified in `requirements.txt`, run:

```bash
pip install -r requirements.txt
```

To generate the dataset for the main experiment from the paper, run:

```bash
python generate_data.py krc
```

To generate the dataset for the experiment on improving performance with call graph comments, run:

```bash
python generate_data.py krfix
python generate_data.py krfix_one_hop
```

The generated dataset will be saved to the `data` folder.

The dataset consists of a set of gzip-compressed JSON files. Each file contains an array of prompts and their associated metadata. Below is an example entry:

```js
{
  // The prompt
  "prompt": "<the prompt string>",
  // A prefix regex to constrain decoding
  "force_decode_regex": "^[ \t]*(['\"]|$)",
  "metadata": {
    // The name of the model used to tokenize the prompt
    "model_name": "bigcode/starcoderbase",

    // The expected output string (as a python string literal)
    "expected": " \"eooyfwmxln\"",
    // Total number of tokens in the prompt, with the current model's tokenizer
    "prompt_token_count": 7929,
    // The permutation of the task-relevant snippets
    "permutation": [0, 1],
    // The positions of the task-relevant snippets among all snippets
    "positions": [2, 10],
    // The string ranges of each task-relevant snippet, always in the original order (before any permutation).
    // The snippet can be retrieved with prompt[range[0]:range[1]]
    "string_ranges": [
      [1295, 1339],
      [6032, 6079]
    ],
    // The token ranges of each task-relevant snippet, always in the original order (before any permutation)
    "token_ranges": [
      [351, 373],
      [1834, 1860]
    ],
    // The task
    "variant": "two-step",
    // The number of distractor functions
    "num_distractors": 1,
    // The max number of tokens in the prompt
    "max_prompt_tokens": 8000,

    // The max number of snippets from HumanEval (a very large number effectively removes the limit)
    "max_humaneval_snippets": 1000000,
    // The string length range of HumanEval snippets to include
    "humaneval_min_length": 250,
    "humaneval_max_length": 1000,

    // For krfix, the call graph comment type and template
    "call_graph_comment_type": "calls,called_by",
    "call_graph_template_variant": "calls_called_by",

    // Configuration of the task snippets (fixed for the experiment)
    "return_type": "string",
    "return_length": 10,
    "function_name": "random",
    "function_name_part_length": 6,
    "function_name_min_parts": 2,
    "function_name_max_parts": 3,
  }
}
```
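As a rough illustration of how an entry like the one above might be consumed, the following Python sketch reads one gzip-compressed JSON file, slices the task-relevant snippets out of the prompt using `string_ranges`, decodes the `expected` string literal, and checks an output against `force_decode_regex`. The helper names are hypothetical and not part of this repository. The sketch also assumes that each file decodes to a list of such entries and that the regex is meant to match a prefix of the model output.

```python
import ast
import gzip
import json
import re


def load_examples(path):
    """Read one gzip-compressed JSON file and return the decoded entries."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


def task_snippets(example):
    """Slice the task-relevant snippets out of the prompt via string_ranges."""
    prompt = example["prompt"]
    ranges = example["metadata"]["string_ranges"]
    return [prompt[start:end] for start, end in ranges]


def expected_value(example):
    """Decode `expected`, which is stored as a Python string literal."""
    # Strip surrounding whitespace first; the literal may carry a leading space.
    return ast.literal_eval(example["metadata"]["expected"].strip())


def matches_decode_regex(example, output):
    """Return True if `output` starts with a match of force_decode_regex.

    Assumption: the regex is applied as a prefix match on the model output.
    """
    return re.match(example["force_decode_regex"], output) is not None
```

For instance, `load_examples("data/<some file>")` followed by `task_snippets(entries[0])` would return the snippet strings in their original, unpermuted order. The file names inside `data` are not documented here, so the path is left as a placeholder.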
