# SWE-bench

!!! abstract "Overview"

    * We provide two scripts to run on the [SWE-bench](https://www.swebench.com/) benchmark.
    * `mini-extra swebench` runs on all task instances in batch mode.
    * `mini-extra swebench-single` runs on a single task instance interactively (useful for debugging).
    * You can also take a look at the run scripts to figure out how to build your own batch processing pipeline.

<figure markdown="span">
  <div class="gif-container gif-container-styled" data-glightbox-disabled>
    <img src="https://github.com/SWE-agent/swe-agent-media/blob/main/media/mini/png/swebench.png?raw=true"
         data-gif="https://github.com/SWE-agent/swe-agent-media/blob/main/media/mini/gif/swebench.gif?raw=true"
         alt="swebench" data-glightbox="false" width="600" />
  </div>
</figure>

## Usage

!!! warning "Docker container availability"

    The docker containers for Linux assume an x86 Linux architecture;
    you might not be able to run them on other architectures.

!!! tip "Quickstart"

    We provide two different scripts: `swebench` and `swebench-single`:

    === "Batch mode"

        Batch mode runs on all task instances in parallel.

        ```bash
        mini-extra swebench --help
        # or
        python src/minisweagent/run/benchmarks/swebench.py --help
        # Example:
        mini-extra swebench \
            --model anthropic/claude-sonnet-4-5-20250929 \
            --subset verified \
            --split test \
            --workers 4
        ```

        Basic flags:

        - `-o`, `--output` - Output directory
        - `-m`, `--model` - Model to use
        - `-c`, `--config` - Path to a config file (default: `swebench.yaml` in the `config` directory)
        - `-w`, `--workers` - Number of worker threads for parallel processing (default: `1`)

        Data selection flags:

        - `--subset` - SWE-bench subset to use, or path to a dataset (default: `lite`)
        - `--split` - Dataset split (default: `dev`)
        - `--slice` - Slice specification (e.g., `0:5` for the first 5 instances)
        - `--filter` - Filter instance IDs by regex
        - `--shuffle` - Shuffle instances (default: `False`)
        - `--redo-existing` - Redo existing instances (default: `False`)

        Advanced flags:

        - `--environment-class` - Environment type to use (recommended: `docker` or `singularity`)

    === "Single instance (for debugging)"

        Single instance mode runs on a single task instance interactively. This mode is meant for debugging; unlike the batch mode command above, it does not produce a `preds.json` file.

        ```bash
        mini-extra swebench-single --help
        # or
        python src/minisweagent/run/benchmarks/swebench_single.py --help
        # Example:
        mini-extra swebench-single \
            --subset verified \
            --split test \
            --model anthropic/claude-sonnet-4-5-20250929 \
            -i sympy__sympy-15599
        # or
        mini-extra swebench-single \
            --subset verified \
            --split test \
            -m anthropic/claude-sonnet-4-5-20250929 \
            -i 0  # instance index
        ```

        Note: If you want to run the script without being prompted for confirmation at exit,
        add the `--exit-immediately` flag.

        Basic flags:

        - `-m`, `--model` - Model to use
        - `-c`, `--config` - Path to a config file (default: `swebench.yaml` in the `config` directory)
        - `-o`, `--output` - Output trajectory file (default: saves to the global config directory)

        Data selection flags:

        - `--subset` - SWE-bench subset to use, or path to a dataset (default: `lite`)
        - `--split` - Dataset split (default: `dev`)
        - `-i`, `--instance` - SWE-bench instance ID or numeric index (default: `0`)

        Advanced flags:

        - `--environment-class` - Environment type to use (recommended: `docker` or `singularity`)
        - `--exit-immediately` - Exit immediately when the agent wants to finish instead of prompting (default: `False`)
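
The `--slice` and `--filter` flags compose. As a rough illustration of the selection logic (a sketch only, not the actual run script code; the order filter-then-slice is an assumption):

```python
import re

# Illustrative sketch of combining --filter (regex on instance IDs) with
# --slice (Python-style slice on the remaining list). Not the actual CLI code.
instance_ids = [
    "sympy__sympy-15599",
    "django__django-11001",
    "sympy__sympy-20590",
]

filter_regex = "sympy__"  # --filter sympy__
slice_spec = "0:1"        # --slice 0:1

selected = [i for i in instance_ids if re.search(filter_regex, i)]
start, _, stop = slice_spec.partition(":")
selected = selected[int(start or 0):int(stop) if stop else None]
```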

!!! tip "Evaluating on SWE-bench"

    You have two options to evaluate on SWE-bench: our free cloud-based evaluation (via `sb-cli`) or a local installation of the SWE-bench harness.

    === "Cloud-based evaluation"

        You can use the [sb-cli](https://www.swebench.com/sb-cli/) for extremely fast, cloud-based evaluations
        (and it's free!). After installing it and getting a token, simply run:

        ```bash
        sb-cli submit swe-bench_verified test --predictions_path preds.json --run_id some-id-for-your-run
        ```

        Typically you will have results within 20 minutes (this is not limited by how many instances you run,
        but by the slowest-to-evaluate instance in SWE-bench).

    === "Local evaluation"

        You can also use a local installation of [SWE-bench](https://github.com/SWE-bench/SWE-bench)
        for evaluation:

        ```bash
        python -m swebench.harness.run_evaluation \
            --dataset_name princeton-nlp/SWE-bench_Verified \
            --predictions_path preds.json \
            --max_workers <num_workers> \
            --run_id <run_id>
        ```

## FAQ

> Can I set global cost limits?

Yes, you can set global cost limits with the `MSWEA_GLOBAL_CALL_LIMIT` and `MSWEA_GLOBAL_COST_LIMIT` environment variables or the global config.
See [global configuration](../advanced/global_configuration.md) for more details.
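
To illustrate how such a limit behaves (a hypothetical sketch, not the actual mini-swe-agent implementation), a guard of this kind reads the environment variable and checks the accumulated cost before each model call:

```python
import os

# Hypothetical sketch of a global cost guard; not mini-swe-agent's actual code.
os.environ["MSWEA_GLOBAL_COST_LIMIT"] = "10.0"

def check_global_cost(total_cost: float) -> None:
    """Raise once the accumulated cost (USD) exceeds the global limit."""
    limit = float(os.environ.get("MSWEA_GLOBAL_COST_LIMIT", "0") or 0)
    if limit and total_cost > limit:
        raise RuntimeError(f"global cost limit of ${limit} exceeded")

check_global_cost(2.5)  # under the limit: no error
```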

> What happens to uncompleted tasks when I abort with KeyboardInterrupt?

Trajectories are only saved upon completion, so most likely you can just rerun the script to complete the tasks next time.
However, you should still check for `KeyboardInterrupt` in `preds.json` in case some tasks were aborted but saved.

> Certain tasks remain stuck even though I deleted the trajectories.

The completed instances are inferred from `preds.json`. Remove the corresponding items from the file.
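
Assuming `preds.json` maps instance IDs to prediction dicts with a `model_patch` field (an assumption; inspect your own output file for the exact schema), dropping aborted or unwanted entries so they get rerun can be sketched as:

```python
import json

# Sketch: remove aborted/unwanted entries from preds.json so they get rerun.
# The schema here (instance_id -> {"model_patch": ...}) is an assumption;
# check your own preds.json before relying on it.
preds = {
    "task-1": {"model_patch": "diff --git a/foo.py b/foo.py\n..."},
    "task-2": {"model_patch": "KeyboardInterrupt"},
}

cleaned = {
    k: v for k, v in preds.items()
    if "KeyboardInterrupt" not in (v.get("model_patch") or "")
}

with open("preds.json", "w") as f:
    json.dump(cleaned, f, indent=2)
```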

> How can I run on a different dataset?

As long as it follows the SWE-bench format, you can use `--subset /path/to/your/dataset` to run on a custom dataset.
The dataset needs to be loadable as `datasets.load_dataset(path, split=split)`.
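
A minimal record in roughly the SWE-bench format might look like this (field names sketched from the public dataset; verify against the official schema before relying on them):

```python
import json, os, tempfile

# Minimal example record; field names follow the public SWE-bench dataset,
# but verify against the official schema before relying on them.
record = {
    "instance_id": "myorg__myrepo-1",
    "repo": "myorg/myrepo",
    "base_commit": "abc1234",
    "problem_statement": "Fix the off-by-one error in pagination.",
    "patch": "",       # gold patch, used only for evaluation
    "test_patch": "",  # tests added for evaluation
}

path = os.path.join(tempfile.mkdtemp(), "dataset.jsonl")
with open(path, "w") as f:
    f.write(json.dumps(record) + "\n")
# Such a file can then be passed via `--subset /path/to/dataset.jsonl`,
# as long as `datasets.load_dataset` can load it with the chosen split.
```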

> Some runners are stuck at 'initializing task' for a very long time or time out

They might still be pulling docker images -- the run should start immediately the next time.
If you see timeouts caused by `docker pull` operations, you might want to increase `environment.pull_timeout`
from the default of `120` (seconds).
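
For example, to raise the pull timeout to 10 minutes (assuming the nested `environment:` layout used for other options in the config file):

```yaml
environment:
  pull_timeout: 600  # seconds; default is 120
```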

> I have some docker issues

Try running the docker command manually to see what's going on (it should be printed out in the console).
Confirm that the container is running with `docker ps`, and that you can use `docker exec -it <container-id> ls` to get some output.

> Docker isn't available on my HPC cluster.

You can use the singularity/apptainer backend by setting `environment.environment_class` to `singularity`
in your [agent config file](../advanced/yaml_configuration.md),
or specify `--environment-class singularity` on the command line.
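
In the config file, that looks like (sketched from the option name given above):

```yaml
environment:
  environment_class: singularity
```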

> Can I run a startup command in the environment?

Yes, you can use the `run.env_startup_command` config option to run a command in the environment before the agent starts.
For example:

```yaml
run:
  env_startup_command: "apt-get update && apt-get install -y python3-pip"
```

The command is rendered with the instance variables as template variables using `jinja2`.
For example, you could use

```yaml
run:
  env_startup_command: "git clone {{ repo_url }} . --force"
```

which might be particularly useful when running with environments like [`bubblewrap`](../reference/environments/bubblewrap.md).
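
To see how that rendering behaves, here is a standalone sketch (the `repo_url` field is a hypothetical instance variable; the actual fields come from your dataset):

```python
from jinja2 import Template

# Standalone sketch of the jinja2 rendering step; `repo_url` is a
# hypothetical instance variable (actual fields come from your dataset).
instance = {"repo_url": "https://github.com/myorg/myrepo"}
command = Template("git clone {{ repo_url }} . --force").render(**instance)
print(command)  # git clone https://github.com/myorg/myrepo . --force
```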
194
 
195
> What environment can I use for SWE-bench?
196
 
197
See [this guide](../advanced/environments.md) for more details.

## Implementation

??? note "Default config"

    - [Read on GitHub](https://github.com/swe-agent/mini-swe-agent/blob/main/src/minisweagent/config/benchmarks/swebench.yaml)

    ```yaml
    --8<-- "src/minisweagent/config/benchmarks/swebench.yaml"
    ```

??? note "`swebench.py` run script"

    - [Read on GitHub](https://github.com/swe-agent/mini-swe-agent/blob/main/src/minisweagent/run/benchmarks/swebench.py)
    - [API reference](../reference/run/swebench.md)

    ```python
    --8<-- "src/minisweagent/run/benchmarks/swebench.py"
    ```

??? note "`swebench_single.py` run script"

    - [Read on GitHub](https://github.com/swe-agent/mini-swe-agent/blob/main/src/minisweagent/run/benchmarks/swebench_single.py)
    - [API reference](../reference/run/swebench_single.md)

    ```python
    --8<-- "src/minisweagent/run/benchmarks/swebench_single.py"
    ```

{% include-markdown "../_footer.md" %}