This repository was archived by the owner on Jul 18, 2025. It is now read-only.

Commit 9825538

Update README.md (#90)
1 parent 6fee3b0 commit 9825538

File tree

1 file changed: +2, -194 lines

README.md

Lines changed: 2 additions & 194 deletions
@@ -1,195 +1,3 @@
-# SWE-Lancer
+# SWELancer

This repo contains the dataset and code for the paper ["SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?"](https://www.openai.com/index/swe-lancer/).

---

Thank you so much for checking out our benchmark! If you have questions, run into issues, or want to contribute, please open an issue or pull request. You can also reach us at samuelgm@openai.com and michele@openai.com at any time.

We will continue to update this repository with the latest tasks, updates to the scaffolding, and improvements to the codebase.

- If you'd like to use the latest version, please use the `main` branch.

- If you'd like to use the dataset and codebase as they existed at the time of the paper's release, please check out the `paper` branch. Note that the performance reported in our paper was measured on our internal scaffold. We've aimed to open-source as much of it as possible, but the open-source agent and harness may not match it exactly.
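For example, to check out the dataset and codebase as of the paper release:

```bash
git checkout paper
```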
---

**Step 1: Package Management and Requirements**

Python 3.11 is the most stable version to use with SWE-Lancer.

For package management, this repo comes with a pre-existing virtualenv, or you can build one from scratch.

We recommend using the pre-built virtualenv with [uv](https://github.com/astral-sh/uv), a lightweight OSS package manager. To do this, run:

```bash
uv sync
source .venv/bin/activate
for proj in nanoeval alcatraz nanoeval_alcatraz; do
  uv pip install -e project/"$proj"
done
```

To use your own virtualenv without uv, run:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
for proj in nanoeval alcatraz nanoeval_alcatraz; do
  pip install -e project/"$proj"
done
```
**Step 2: Build the Docker Image**

Please run the command that corresponds to your computer's architecture.

For Apple Silicon (or other ARM64 systems):

```bash
docker buildx build \
  -f Dockerfile \
  --ssh default=$SSH_AUTH_SOCK \
  -t swelancer \
  .
```

For Intel-based Macs (or other x86_64 systems):

```bash
docker buildx build \
  -f Dockerfile_x86 \
  --platform linux/amd64 \
  --ssh default=$SSH_AUTH_SOCK \
  -t swelancer \
  .
```

After the command completes, run the Docker container.
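The exact `docker run` flags depend on your setup; as a minimal sketch, assuming the `swelancer` tag from the build step:

```bash
# Starts a container from the image built above; add flags
# (e.g. --env-file .env, volume mounts) as your setup requires.
docker run -it swelancer
```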
**Step 3: Configure Environment Variables**

Ensure you have an OpenAI API key and username set on your machine.

Locate the `sample.env` file in the root directory. This file contains template environment variables needed for the application:

```plaintext
# sample.env contents example:
PUSHER_APP_ID=your-app-id
# ... other variables
```

Create a new file named `.env` and copy the contents from `sample.env`.
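For example, starting from the template (the `OPENAI_API_KEY` name is the standard OpenAI convention; check `sample.env` for the exact variable names this repo expects):

```bash
# Copy the template, then edit .env to fill in real values.
cp sample.env .env
# Assumes your key is exported as the standard OPENAI_API_KEY variable.
export OPENAI_API_KEY="your-api-key"
```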
**Step 4: Running SWE-Lancer**

You are now ready to run the eval with:

```bash
uv run python run_swelancer.py
```

You should immediately see logging output as the container gets set up and the tasks are loaded, which may take several minutes. You can adjust the model, concurrency, recording, and other parameters in `run_swelancer.py`.
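As a purely illustrative sketch of the kinds of settings to look for when you open the file (these names are hypothetical, not the actual variables in `run_swelancer.py`):

```python
# Hypothetical knobs; consult run_swelancer.py for the real parameter names.
MODEL = "gpt-4o"     # which model the agent queries
CONCURRENCY = 4      # how many tasks run in parallel
RECORD = False       # whether to record full agent transcripts
```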
## Running at Scale
To run SWELancer at scale in your own environment, you'll need to implement your own compute infrastructure. Here's a high-level overview of how to integrate SWELancer with your compute system:

### 1. Implement a Custom ComputerInterface

Create your own implementation of the `ComputerInterface` class that connects to your compute infrastructure. The main methods you need to implement are:
```python
class YourComputerInterface(ComputerInterface):
    async def send_shell_command(self, command: str) -> CommandResult:
        """Execute a shell command and return the result"""
        pass

    async def upload(self, local_path: str, remote_path: str) -> None:
        """Upload a file to the compute environment"""
        pass

    async def download(self, remote_path: str) -> bytes:
        """Download a file from the compute environment"""
        pass

    async def check_shell_command(self, command: str) -> CommandResult:
        """Execute a shell command and raise an error if it fails"""
        pass

    async def cleanup(self) -> None:
        """Clean up any resources"""
        pass
```
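As a concrete illustration, here is a minimal sketch that runs every command on the local host via `asyncio` subprocesses. The `CommandResult(exit_code=..., output=...)` constructor is an assumption; verify the real types in the repo before adapting it:

```python
import asyncio
import shutil
from pathlib import Path

class LocalComputerInterface(ComputerInterface):
    """Toy ComputerInterface that executes everything on the local host."""

    async def send_shell_command(self, command: str) -> CommandResult:
        # Run in a shell, capturing stdout and stderr together.
        proc = await asyncio.create_subprocess_shell(
            command,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.STDOUT,
        )
        output, _ = await proc.communicate()
        # Assumed field names; check the repo's CommandResult definition.
        return CommandResult(exit_code=proc.returncode, output=output)

    async def upload(self, local_path: str, remote_path: str) -> None:
        # "Remote" is the same machine here, so upload is just a copy.
        await asyncio.to_thread(shutil.copy, local_path, remote_path)

    async def download(self, remote_path: str) -> bytes:
        return await asyncio.to_thread(Path(remote_path).read_bytes)

    async def check_shell_command(self, command: str) -> CommandResult:
        result = await self.send_shell_command(command)
        if result.exit_code != 0:
            raise RuntimeError(f"Command failed ({result.exit_code}): {command}")
        return result

    async def cleanup(self) -> None:
        pass  # nothing to tear down for the local host
```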
### 2. Update the Computer Start Function
Modify `swelancer_agent.py`'s `_start_computer` function to use your custom interface:

```python
async def _start_computer(self, task: ComputerTask) -> AsyncGenerator[ComputerInterface, None]:
    # Initialize your compute environment. This could involve:
    # - Spinning up a container/VM
    # - Setting up SSH connections
    # - Configuring environment variables
    # Then yield your custom ComputerInterface implementation
    # (the function is an async generator, so yield rather than return).
    yield YourComputerInterface()
```
### Reference Implementation
For a complete example of a ComputerInterface implementation, you can refer to the `alcatraz_computer_interface.py` file in the codebase. This shows how to:

- Handle command execution
- Manage file transfers
- Deal with environment setup
- Handle cleanup and resource management
### Best Practices
1. **Resource Management**

   - Implement proper cleanup in your interface
   - Handle container/VM lifecycle appropriately
   - Clean up temporary files

2. **Security**

   - Implement proper isolation between tasks
   - Handle sensitive data appropriately
   - Control network access

3. **Scalability**

   - Consider implementing a pool of compute resources (see the concurrency sketch after this list)
   - Handle concurrent task execution
   - Implement proper resource limits

4. **Error Handling**

   - Implement robust error handling
   - Provide meaningful error messages
   - Handle network issues gracefully
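A minimal sketch of the scalability point: bound concurrent task execution with an `asyncio.Semaphore` (the `start_computer` and `run_agent` helpers here are hypothetical placeholders, not functions from this codebase):

```python
import asyncio

MAX_CONCURRENT_TASKS = 8  # illustrative; tune to your infrastructure

sem = asyncio.Semaphore(MAX_CONCURRENT_TASKS)

async def run_task_bounded(task):
    # Each task takes a slot before provisioning compute, so at most
    # MAX_CONCURRENT_TASKS environments exist at any one time.
    async with sem:
        computer = await start_computer(task)  # hypothetical helper
        try:
            return await run_agent(computer, task)  # hypothetical helper
        finally:
            await computer.cleanup()

async def main(tasks):
    return await asyncio.gather(*(run_task_bounded(t) for t in tasks))
```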
## Citation
```
@misc{miserendino2025swelancerfrontierllmsearn,
  title={SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?},
  author={Samuel Miserendino and Michele Wang and Tejal Patwardhan and Johannes Heidecke},
  year={2025},
  eprint={2502.12115},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2502.12115},
}
```
## Utilities
We include the following utilities to facilitate future research:

- `download_videos.py` allows you to download the videos attached to an Expensify GitHub issue, if your model supports video input.
## SWELancer-Lite
If you'd like to run SWELancer-Lite, swap out `swelancer_tasks.csv` with `swelancer_tasks_lite.csv` in `swelancer.py`. The lite dataset contains 174 tasks, each worth over $1,000 (61 IC SWE tasks and 113 SWE Manager tasks).

+**Please see https://github.com/openai/preparedness to run SWELancer**.
