# SWE-bench

!!! abstract "Overview"

    * We provide two scripts to run on the [SWE-bench](https://www.swebench.com/) benchmark:
        * `mini-extra swebench` runs on all task instances in batch mode.
        * `mini-extra swebench-single` runs on a single task instance interactively (useful for debugging).
    * You can also take a look at the run scripts to see how to build your own batch-processing pipeline.

<figure markdown="span">
<div class="gif-container gif-container-styled" data-glightbox-disabled>
<img src="https://github.com/SWE-agent/swe-agent-media/blob/main/media/mini/png/swebench.png?raw=true"
     data-gif="https://github.com/SWE-agent/swe-agent-media/blob/main/media/mini/gif/swebench.gif?raw=true"
     alt="swebench" data-glightbox="false" width="600" />
</div>
</figure>

## Usage

!!! warning "Docker container availability"

    The Docker containers for Linux assume an x86 Linux architecture;
    you might not be able to run them on other architectures.

!!! tip "Quickstart"

    We provide two different scripts: `swebench` and `swebench-single`:

    === "Batch mode"

        Batch mode runs on all task instances in parallel.

        ```bash
        mini-extra swebench --help
        # or
        python src/minisweagent/run/benchmarks/swebench.py --help
        # Example:
        mini-extra swebench \
            --model anthropic/claude-sonnet-4-5-20250929 \
            --subset verified \
            --split test \
            --workers 4
        ```

        Basic flags:

        - `-o`, `--output` - Output directory
        - `-m`, `--model` - Model to use
        - `-c`, `--config` - Path to a config file (default: `swebench.yaml` in the `config` directory)
        - `-w`, `--workers` - Number of worker threads for parallel processing (default: `1`)

        Data selection flags:

        - `--subset` - SWE-bench subset to use, or a path to a dataset (default: `lite`)
        - `--split` - Dataset split (default: `dev`)
        - `--slice` - Slice specification (e.g., `0:5` for the first 5 instances)
        - `--filter` - Filter instance IDs by regex
        - `--shuffle` - Shuffle instances (default: `False`)
        - `--redo-existing` - Redo existing instances (default: `False`)

        Advanced flags:

        - `--environment-class` - Environment type to use (recommended: `docker` or `singularity`)

    === "Single instance (for debugging)"

        Single-instance mode runs on one task instance interactively. It is meant for debugging,
        so unlike the batch-mode command above, it does not produce a `preds.json` file.

        ```bash
        mini-extra swebench-single --help
        # or
        python src/minisweagent/run/benchmarks/swebench_single.py --help
        # Example:
        mini-extra swebench-single \
            --subset verified \
            --split test \
            --model anthropic/claude-sonnet-4-5-20250929 \
            -i sympy__sympy-15599
        # or
        mini-extra swebench-single \
            --subset verified \
            --split test \
            -m anthropic/claude-sonnet-4-5-20250929 \
            -i 0  # instance index
        ```

        Note: If you want to run the script without being prompted for confirmation at exit,
        add the `--exit-immediately` flag.

        Basic flags:

        - `-m`, `--model` - Model to use
        - `-c`, `--config` - Path to a config file (default: `swebench.yaml` in the `config` directory)
        - `-o`, `--output` - Output trajectory file (default: saves to the global config directory)

        Data selection flags:

        - `--subset` - SWE-bench subset to use, or a path to a dataset (default: `lite`)
        - `--split` - Dataset split (default: `dev`)
        - `-i`, `--instance` - SWE-bench instance ID or index (default: `0`)

        Advanced flags:

        - `--environment-class` - Environment type to use (recommended: `docker` or `singularity`)
        - `--exit-immediately` - Exit immediately when the agent wants to finish instead of prompting (default: `False`)

!!! tip "Evaluating on SWE-bench"

    You have two options to evaluate on SWE-bench: our free cloud-based evaluation via `sb-cli`, or a local SWE-bench installation.

    === "Cloud-based evaluation"

        You can use the [sb-cli](https://www.swebench.com/sb-cli/) for extremely fast, cloud-based evaluations
        (and it's free!). After installing it and getting a token, simply run:

        ```bash
        sb-cli submit swe-bench_verified test --predictions_path preds.json --run_id some-id-for-your-run
        ```

        Typically you will have results within 20 minutes (this is not limited by how many instances you run,
        but by the slowest-to-evaluate instance in SWE-bench).

    === "Local evaluation"

        You can also use a local installation of [SWE-bench](https://github.com/SWE-bench/SWE-bench)
        for evaluation:

        ```bash
        python -m swebench.harness.run_evaluation \
            --dataset_name princeton-nlp/SWE-bench_Verified \
            --predictions_path preds.json \
            --max_workers <num_workers> \
            --run_id <run_id>
        ```

## FAQ

> Can I set global cost limits?

Yes, you can set global cost limits with the `MSWEA_GLOBAL_CALL_LIMIT` and `MSWEA_GLOBAL_COST_LIMIT` environment variables or the corresponding global config options.
See [global configuration](../advanced/global_configuration.md) for more details.
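
Setting these is just a matter of exporting the variables before launching a run. A minimal sketch (the limit values below are made up; pick ones that suit your budget):

```shell
# Global caps read by mini-swe-agent for the whole run (values are hypothetical).
export MSWEA_GLOBAL_COST_LIMIT=10.00   # stop once total model spend reaches $10
export MSWEA_GLOBAL_CALL_LIMIT=500     # stop after 500 model calls
echo "cost limit: $MSWEA_GLOBAL_COST_LIMIT, call limit: $MSWEA_GLOBAL_CALL_LIMIT"
```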

> What happens to uncompleted tasks when I abort with `KeyboardInterrupt`?

Trajectories are only saved upon completion, so most likely you can simply rerun the script to complete the remaining tasks.
However, you should still check `preds.json` for `KeyboardInterrupt` in case some tasks were aborted but still saved.
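
One way to spot such entries is a small script like the one below; it assumes `preds.json` maps instance IDs to prediction records (the file here is a fabricated stand-in for a real run's output):

```shell
# Fabricated two-entry preds file standing in for a real preds.json.
cat > /tmp/preds_demo.json <<'EOF'
{"demo__demo-1": {"model_patch": "diff --git a/f b/f"},
 "demo__demo-2": {"model_patch": "KeyboardInterrupt"}}
EOF
# Print the IDs of entries that mention KeyboardInterrupt.
python3 - <<'PY' | tee /tmp/aborted_ids.txt
import json
preds = json.load(open("/tmp/preds_demo.json"))
for iid, pred in sorted(preds.items()):
    if "KeyboardInterrupt" in json.dumps(pred):
        print(iid)
PY
```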

> Certain tasks remain stuck even though I deleted the trajectories.

Completed instances are inferred from `preds.json`. Remove the corresponding entries from that file.
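
For instance, a sketch of dropping one entry (the path and instance IDs are fabricated) so that the instance is picked up again on the next run:

```shell
# Fabricated preds.json with two completed instances.
cat > /tmp/preds.json <<'EOF'
{"demo__demo-1": {"model_patch": ""}, "demo__demo-2": {"model_patch": "diff"}}
EOF
python3 - <<'PY'
import json
path = "/tmp/preds.json"
preds = json.load(open(path))
preds.pop("demo__demo-1", None)  # no longer counts as completed
json.dump(preds, open(path, "w"), indent=2)
print(sorted(preds))  # -> ['demo__demo-2']
PY
```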

> How can I run on a different dataset?

As long as it follows the SWE-bench format, you can use `--subset /path/to/your/dataset` to run on a custom dataset.
The dataset needs to be loadable with `datasets.load_dataset(path, split=split)`.
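
Before a long run, it can be worth sanity-checking that every row carries the columns the runner reads. A sketch (the file path is a placeholder, and the required field names are an assumption about the minimal SWE-bench columns):

```shell
# Fabricated one-row dataset in JSONL form.
cat > /tmp/my_dataset.jsonl <<'EOF'
{"instance_id": "demo__demo-1", "problem_statement": "Fix the off-by-one bug."}
EOF
python3 - <<'PY' | tee /tmp/dataset_check.txt
import json
required = {"instance_id", "problem_statement"}  # assumed minimal columns
with open("/tmp/my_dataset.jsonl") as fh:
    for line in fh:
        row = json.loads(line)
        missing = required - row.keys()
        assert not missing, f"{row.get('instance_id')}: missing {missing}"
print("ok")
PY
```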

> Some task runners are stuck at 'initializing task' for a very long time / time out.

They might be pulling Docker containers; the run should start immediately the next time.
If you see timeouts caused by `docker pull` operations, you might want to increase `environment.pull_timeout`
from its default of `120` (seconds).
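
In a YAML config that would look roughly like this (the section layout below is inferred from the dotted `environment.pull_timeout` path and may differ from your config file):

```yaml
environment:
  pull_timeout: 600  # seconds allowed for `docker pull` (default: 120)
```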

> I have some Docker issues.

Try running the Docker command manually to see what's going on (the command is printed to the console).
Confirm that the container is running with `docker ps`, and that `docker exec -it <container-id> ls` produces some output.

> Docker isn't available on my HPC cluster.

You can use the singularity/apptainer backend by setting `environment.environment_class` to `singularity`
in your [agent config file](../advanced/yaml_configuration.md)
or by specifying `--environment-class singularity` on the command line.
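
In the config file, that could look roughly like this (layout inferred from the dotted `environment.environment_class` path):

```yaml
environment:
  environment_class: singularity
```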

> Can I run a startup command in the environment?

Yes, you can use the `run.env_startup_command` config option to run a command in the environment before the agent starts.
For example:

```yaml
run:
  env_startup_command: "apt-get update && apt-get install -y python3-pip"
```

The command is rendered with the instance variables as template variables using `jinja2`.
For example, you could use

```yaml
run:
  env_startup_command: "git clone {{ repo_url }} . --force"
```

which might be particularly useful when running with environments like [`bubblewrap`](../reference/environments/bubblewrap.md).
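
To see what the rendering step does, here is a small illustration (it requires the `jinja2` package, which mini-swe-agent itself depends on; the repository URL and variable value are placeholders):

```shell
python3 - <<'PY' | tee /tmp/rendered_cmd.txt
import jinja2
template = jinja2.Template("git clone {{ repo_url }} . --force")
# In a real run, the template variables come from the task instance.
print(template.render(repo_url="https://github.com/example/repo"))
PY
```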

> What environment can I use for SWE-bench?

See [this guide](../advanced/environments.md) for more details.

## Implementation

??? note "Default config"

    - [Read on GitHub](https://github.com/swe-agent/mini-swe-agent/blob/main/src/minisweagent/config/benchmarks/swebench.yaml)

    ```yaml
    --8<-- "src/minisweagent/config/benchmarks/swebench.yaml"
    ```

??? note "`swebench.py` run script"

    - [Read on GitHub](https://github.com/swe-agent/mini-swe-agent/blob/main/src/minisweagent/run/benchmarks/swebench.py)
    - [API reference](../reference/run/swebench.md)

    ```python
    --8<-- "src/minisweagent/run/benchmarks/swebench.py"
    ```

??? note "`swebench_single.py` run script"

    - [Read on GitHub](https://github.com/swe-agent/mini-swe-agent/blob/main/src/minisweagent/run/benchmarks/swebench_single.py)
    - [API reference](../reference/run/swebench_single.md)

    ```python
    --8<-- "src/minisweagent/run/benchmarks/swebench_single.py"
    ```

{% include-markdown "../_footer.md" %}