ProjectSidewalk/RampNet

RampNet: A Two-Stage Pipeline for Bootstrapping Curb Ramp Detection in Streetscape Images from Open Government Metadata

John S. O'Meara, Jared Hwang, Zeyu Wang, Michael Saugstad, Jon E. Froehlich

University of Washington

RampNet is a two-stage pipeline that addresses the scarcity of curb ramp detection datasets by using government location data to automatically generate over 210,000 annotated Google Street View panoramas. This new dataset is then used to train a state-of-the-art curb ramp detection model that significantly outperforms previous efforts. In this repo, we provide code for training and testing our system.


Citation

If you use our code, dataset, or build on ideas in our paper, please cite us as:

@inproceedings{omeara2025rampnet,
  author    = {John S. O'Meara and Jared Hwang and Zeyu Wang and Michael Saugstad and Jon E. Froehlich},
  title     = {{RampNet: A Two-Stage Pipeline for Bootstrapping Curb Ramp Detection in Streetscape Images from Open Government Metadata}},
  booktitle = {{ICCV'25 Workshop on Vision Foundation Models and Generative AI for Accessibility: Challenges and Opportunities (ICCV 2025 Workshop)}},
  year      = {2025},
  doi       = {https://doi.org/10.48550/arXiv.2508.09415},
  url       = {https://cv4a11y.github.io/ICCV2025/index.html},
  note      = {DOI: forthcoming}
}

Curb Ramp Detection Example

For a step-by-step walkthrough, see our Google Colab notebook, which includes a visualization in addition to the code below.

For basic usage of our detection model, you do not need to be working within the RampNet project directory or use any custom libraries. However, we strongly recommend using a GPU. See code example below:

import torch
from transformers import AutoModel
from PIL import Image
import numpy as np
from torchvision import transforms
from skimage.feature import peak_local_max

IMAGE_PATH = "example.jpg"
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = AutoModel.from_pretrained("projectsidewalk/rampnet-model", trust_remote_code=True).to(DEVICE).eval()

preprocess = transforms.Compose([
    transforms.Resize((2048, 4096), interpolation=transforms.InterpolationMode.BILINEAR),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

img = Image.open(IMAGE_PATH).convert("RGB")
img_tensor = preprocess(img).unsqueeze(0).to(DEVICE)

with torch.no_grad():
    heatmap = model(img_tensor).squeeze().cpu().numpy()

peaks = peak_local_max(np.clip(heatmap, 0, 1), min_distance=10, threshold_abs=0.5)
scale_w = img.width / heatmap.shape[1]
scale_h = img.height / heatmap.shape[0]
coordinates = [(int(c * scale_w), int(r * scale_h)) for r, c in peaks]

# Coordinates of detected curb ramps
print(coordinates)

Figure: predicted heatmap and extracted points.

Choosing a Detection Threshold

Different parts of this repo use different peak-extraction thresholds (threshold_abs passed to peak_local_max), which has caused confusion. Here is the provenance:

  • stage_two/evaluate.py uses PEAK_THRESHOLD_ABS = 0.0 by design. It collects all heatmap peaks and sweeps the confidence axis to generate the full precision–recall and precision/recall-vs-confidence curves. It is not an operating point.
  • The 0.5 in the example above and the 0.4 in stage_two/demo.py are illustrative visualization choices, not tuned values.
  • A principled default operating point is 0.55, which achieves precision 0.938 / recall 0.935 on the 1,000-panorama manually labeled gold set. You can read this (or any other operating point) directly from the committed curve data in stage_two/evaluation_results/pr_rc_vs_c_data_manual_r0.022_pt0.0.csv.

Important caveat — test-time augmentation. All committed evaluation curves were computed with horizontal-flip TTA: the panorama is evaluated twice (original and mirrored) and the two heatmaps are combined with an elementwise max (see stage_two/evaluate.py). If you deploy the model with single-pass inference (as in the quick-start example above), expect performance somewhat below these curves, and derive your own threshold curve without TTA before choosing an operating point.
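
The flip-and-max combination described above can be sketched as follows. This is a minimal illustration, not the code in stage_two/evaluate.py: `predict_heatmap` is a hypothetical stand-in for a forward pass that returns a heatmap, `toy_predict` is a made-up model used only to make the example runnable, and the sketch assumes the mirrored prediction is flipped back before the max is taken.

```python
import numpy as np

def flip_tta(predict_heatmap, image):
    """Horizontal-flip TTA: predict on the image and on its mirror,
    un-mirror the second heatmap, and keep the elementwise maximum."""
    original = predict_heatmap(image)
    # Mirror along the width axis (last axis), predict, then mirror the
    # result back so both heatmaps share the same pixel coordinates.
    mirrored = predict_heatmap(image[..., ::-1])[..., ::-1]
    return np.maximum(original, mirrored)

def toy_predict(image):
    # Hypothetical stand-in for the model: it only "sees" the left half.
    out = np.zeros_like(image, dtype=float)
    half = image.shape[-1] // 2
    out[..., :half] = image[..., :half]
    return out

image = np.array([[0.2, 0.9, 0.1, 0.8]])
combined = flip_tta(toy_predict, image)
print(combined)
```

In this toy case the single pass misses everything in the right half, while the flipped pass recovers it, which is why single-pass inference may need a different threshold than the TTA curves suggest.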

Picking a per-city / per-deployment operating point. Curb ramp appearance and imagery vary by city, so a threshold tuned on our gold set (NYC, Portland, Bend) may not transfer. The recipe: manually label ~100 panoramas from your target area in the manual_labels/ format, point stage_two/evaluate.py at them, and read the threshold that meets your precision or recall requirement from the resulting pr_rc_vs_c_data_*.csv.
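
Reading an operating point from the curve data can be sketched as below. The column names `confidence`, `precision`, and `recall`, the helper `pick_threshold`, and the sample rows are all assumptions for illustration; check the header of your own pr_rc_vs_c_data_*.csv and adjust accordingly.

```python
import csv
import io

def pick_threshold(rows, min_precision):
    """Return (confidence, precision, recall) for the lowest threshold whose
    precision meets the target, or None if no row qualifies. Recall cannot
    increase as the threshold rises, so the lowest qualifying threshold
    keeps recall as high as possible."""
    qualifying = [r for r in rows if float(r["precision"]) >= min_precision]
    if not qualifying:
        return None
    best = min(qualifying, key=lambda r: float(r["confidence"]))
    return float(best["confidence"]), float(best["precision"]), float(best["recall"])

# Made-up numbers for illustration only. For real use, replace the StringIO
# with an open() call on your own curve CSV.
sample_csv = io.StringIO(
    "confidence,precision,recall\n"
    "0.2,0.80,0.98\n"
    "0.4,0.90,0.95\n"
    "0.6,0.95,0.90\n"
    "0.8,0.99,0.70\n"
)
rows = list(csv.DictReader(sample_csv))
choice = pick_threshold(rows, min_precision=0.93)
print(choice)
```

The same helper works for a recall requirement if you filter on `recall` and take the highest qualifying threshold instead.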


We now describe how to generate the dataset (Stage 1) and train the model (Stage 2). We also describe how to evaluate both of these stages.

Environment Setup

Create the conda environment (Linux with an NVIDIA GPU; CUDA 12.6 builds are selected automatically — for CPU-only or macOS, remove the cuda-version line from environment.yml):

conda env create -f environment.yml
conda activate sidewalkcv2
pip install -e .

The pip install -e . step installs the small shared rampnet package (model definition, checkpoint loading, evaluation metrics) that the stage 1 and stage 2 scripts import.

Alternatives:

  • requirements.txt — the same dependency set for pip/venv/Colab users.
  • environment.lock.yml — the exact full conda export used for the paper results (linux-64 only), kept for provenance. Note that despite the paper-era README saying "CUDA 11.8", the lock actually pins CUDA 12.6 pytorch builds.

Dataset Summary

| Name | Description | # of Panoramas | # of Labels |
|---|---|---|---|
| Open Government Datasets | The initial source of curb ramp locations (<lat, long> coordinates) from 3 US cities (NYC, Portland, Bend) with "Good" location precision. Used as input for Stage 1. | N/A (Geo-data) | 276,615¹ |
| Project Sidewalk Crop Pre-training Set | A subset of Project Sidewalk data used to initially pre-train the crop-level model in Stage 1, which identifies curb ramps within a small, directional image crop. Can be downloaded with stage_one/crop_model/ps_model/data/download_data.py | 20,698 | 27,704 |
| Manual Crop Model Training Set | A small, fully and manually labeled dataset used for a second round of training on the crop-level model to improve its precision and recall. | 312 | 1,212 |
| RampNet Stage 1 Dataset (Final Output) | The main, large-scale dataset generated by the Stage 1 auto-translation pipeline, containing curb ramp pixel coordinates on GSV panoramas. This is the primary dataset contribution. | 214,376 | 849,895 |
| Manual Ground Truth Set (1k Panos) | A set of 1,000 panoramas randomly sampled and then fully and manually labeled. This serves as the "gold standard" for evaluating both Stage 1 and Stage 2 performance. Images are included in the Stage 1 Dataset on Hugging Face, but the labels themselves are in manual_labels. | 1,000 | 3,919 |

¹This number is the sum of curb ramp locations from the three cities with "Good" location precision listed in Table 1: New York City (217,680), Portland (45,324), and Bend (13,611).

Provenance note: see docs/data_provenance.md for the registry of which cities' data entered training (evaluations in those cities are optimistically biased), the undocumented Google endpoints the regeneration pipeline depends on, and why the HuggingFace dataset — not a re-run of split_dataset.py — is the split of record for the paper.

Dataset Setup

Before reproducing our results, certain datasets will need to be downloaded.

  • City curb ramp location data. We use NYC, Bend, and Portland.
    • In stage_one/dataset_generation/location_data, there should be three files
    • These files can either be downloaded from our paper's supplemental material, or from the government websites that have been hyperlinked.
  • City Street Data. We use this when generating null panos (picking a random street until we find one with no curb ramp nearby)
  • We also need the cityboundaries.geojson file in stage_one/dataset_generation for negative pano generation. It is included in this repo, so no download is needed.
  • The tiny set of manually labeled crops can be downloaded here. The test, train, and val folders belong in stage_one/crop_model/ps_and_manual_model/dataset_1
  • Manual annotations for evaluation of both stages (included in this repo, no download needed). Note that while we include the manual annotations in this repo, the images themselves are not included because they are assumed to be included in the dataset that will be generated.
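
The null-pano selection mentioned in the City Street Data item is a rejection-sampling loop: draw a random street location and keep it only if no curb ramp is nearby. A minimal sketch, assuming street locations and ramp locations are available as (lat, lon) tuples; the 30 m exclusion radius and all function names are hypothetical and are not taken from generate_negative_panos.py.

```python
import math
import random

def haversine_m(a, b):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(h))

def sample_null_location(street_points, ramp_points, min_dist_m=30.0,
                         rng=random, max_tries=1000):
    """Draw random street points until one has no curb ramp within min_dist_m."""
    for _ in range(max_tries):
        candidate = rng.choice(street_points)
        if all(haversine_m(candidate, ramp) >= min_dist_m for ramp in ramp_points):
            return candidate
    return None  # every draw landed near a ramp

# Toy data: the first street point is a few meters from the ramp, the second is far away.
streets = [(40.0, -74.0), (40.01, -74.0)]
ramps = [(40.0, -74.0001)]
null_location = sample_null_location(streets, ramps, rng=random.Random(0))
print(null_location)
```

The linear scan over all ramps is fine for a toy example; with hundreds of thousands of ramp locations a spatial index would be the more practical choice.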

If you only wish to set up for Stage 2, you can download our Stage 1-generated dataset here or use the download_dataset.py script in the project directory.

Stage 1: Dataset Generation

We detail how to reproduce our Stage 1 results. Please ensure you have downloaded all the necessary files before proceeding with this step.

Getting the Crop Model Ready

The crop model takes in a crop that faces a curb ramp and localizes the curb ramp within that crop. It is crucial to our auto-translation technique and must be trained before we can proceed with dataset generation.

In stage_one/crop_model/ps_model, we will initiate our first round of training. In stage_one/crop_model/ps_and_manual_model, we will follow up with a final round that trains on manual data.

In stage_one/crop_model/ps_model/data, run download_data.py. This will take a very long time. You should have a resulting directory called dataset_1. Run ./splititup.sh dataset_1 to split the dataset into its test, train, and val splits.

In stage_one/crop_model/ps_model/model, run train.py. This will take a very long time. In the code, the number of epochs is set to 100 for comprehensiveness, but we suggest training for no more than 25 epochs. You should have a resulting file called best_model.pth.

Now, we will transition into the second round of training on manual data. Copy best_model.pth from the previous step into stage_one/crop_model/ps_and_manual_model and rename the copy to ps_model.pth.

In stage_one/crop_model/ps_and_manual_model, run train.py. This will take a very long time. In the code, the number of epochs is set to 100 for comprehensiveness, but we suggest training for no more than 25 epochs. You should have a resulting file called best_model.pth. This is what we will use to auto-translate government location data to pixel coordinates in street view panoramas.

(Optional step) If you want to evaluate the crop model, use evaluate.py.

Preparing for Auto-Translation

In this section, we will exclusively work in the stage_one/dataset_generation folder.

First, run combine_location_data.py. You will get a resulting all_locations.csv file.

Next, run generate_dataset_meta.py. This will probably take a long time. You will get a resulting dataset.jsonl file. IMPORTANT: This dataset.jsonl file does not yet contain null panos. The next step describes how we infuse our dataset with null panos.

Next, run generate_negative_panos.py. This will probably take a long time. You will get a resulting negativepanos.jsonl file. How many of these null panos to include in your dataset is up to you; we used 20% in our paper. Create a finaldataset.jsonl file that contains lines from both dataset.jsonl and negativepanos.jsonl. If you want 20% of the panos to be null panos, then 20% of the lines in finaldataset.jsonl should come from negativepanos.jsonl.
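
The mixing arithmetic is easy to get wrong: if null panos should make up a fraction f of the final file and dataset.jsonl has P lines, then P·f/(1−f) lines are needed from negativepanos.jsonl (for f = 0.2, that is P/4, not P/5). A sketch of one way to build the combined file; the function names, the random sampling, and the shuffling are assumptions, and only the three file names come from the steps above.

```python
import random

def null_count_for_fraction(num_positive, null_fraction):
    """Number of null panos needed so they make up null_fraction of the combined file."""
    if not 0 <= null_fraction < 1:
        raise ValueError("null_fraction must be in [0, 1)")
    return round(num_positive * null_fraction / (1 - null_fraction))

def mix_lines(positive_lines, negative_lines, null_fraction, seed=0):
    """Combine all positive lines with a random sample of negative lines."""
    wanted = null_count_for_fraction(len(positive_lines), null_fraction)
    if wanted > len(negative_lines):
        raise ValueError(f"need {wanted} null panos, only {len(negative_lines)} available")
    rng = random.Random(seed)
    combined = list(positive_lines) + rng.sample(list(negative_lines), wanted)
    rng.shuffle(combined)
    return combined

def read_jsonl_lines(path):
    # Keep each non-empty line as-is, without its trailing newline.
    with open(path) as f:
        return [line.rstrip("\n") for line in f if line.strip()]

def write_final_dataset(pos_path="dataset.jsonl", neg_path="negativepanos.jsonl",
                        out_path="finaldataset.jsonl", null_fraction=0.2):
    lines = mix_lines(read_jsonl_lines(pos_path), read_jsonl_lines(neg_path), null_fraction)
    with open(out_path, "w") as f:
        f.write("\n".join(lines) + "\n")
```

Calling `write_final_dataset()` from stage_one/dataset_generation would then produce finaldataset.jsonl with the default 20% share.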

Auto-Translating Government Location Data to Pixel Coordinates

We must now download the GSV panoramas and convert government location data to pixel coordinates. Run download_dataset.py. This will take a very long time, as there are hundreds of thousands of panos to download. When it finishes, there should be a dataset folder at the root of the RampNet project folder.

Splitting the Generated Dataset

As discussed in our paper, this splitting step must be performed carefully to avoid data leakage. Specifically, we must ensure that no panoramas used for manual evaluation are included in the training or validation splits. While this does not affect dataset evaluation (since it is conducted independently of the splits), it could compromise model evaluation in Stage 2.

Users must also take care that the same curb ramp, seen from different panoramas/viewpoints, does not end up in more than one split. We have built a custom script called split_dataset.py that takes care of this. After running it, you will have a folder called dataset_split next to the original dataset folder. Delete the old dataset folder and rename dataset_split to dataset.

IMPORTANT: If you plan to use this generated dataset for training the next stage and intend to rely on the same manual labels we created, do not use the split_dataset.py script as-is. Because it performs a random split, panoramas selected for manual evaluation could end up in the training or validation sets and leak into your evaluation. To explicitly exclude these manually evaluated panoramas from the training and validation splits, set the CONSIDER_MANUAL variable in split_dataset.py to True.
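
Whichever way you split, a quick leakage check afterwards is cheap. A sketch, assuming you can collect the panorama IDs of each split and of the manual labels into Python sets; how the IDs are read from disk depends on your dataset layout and is left out, and `find_leaks` is a hypothetical helper rather than part of the repo.

```python
def find_leaks(split_ids, manual_ids):
    """Return manual-evaluation pano IDs that appear in the train or val split."""
    leaks = {}
    for split in ("train", "val"):
        overlap = set(split_ids.get(split, ())) & set(manual_ids)
        if overlap:
            leaks[split] = sorted(overlap)
    return leaks

# Toy IDs: pano "d" was manually labeled but ended up in the validation split.
splits = {"train": {"a", "b", "c"}, "val": {"d"}, "test": {"e", "f"}}
manual = {"e", "d"}
leaks = find_leaks(splits, manual)
print(leaks)
```

An empty result means none of the manually labeled panoramas are in train or val.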

Evaluating the Results

Run evaluate.py in stage_one/dataset_evaluation. You should see output similar to:

Precision (TP / (TP + FP)): 0.9403
Recall    (TP / Total GT):  0.9245
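
For reference, the two formulas in the output above translate to the following. The counts in the example are made up, and the step that matches predictions to ground-truth labels is not shown.

```python
def precision_recall(true_positives, false_positives, total_ground_truth):
    """Precision = TP / (TP + FP); recall = TP / total ground-truth labels."""
    predicted = true_positives + false_positives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / total_ground_truth if total_ground_truth else 0.0
    return precision, recall

# Made-up counts: 90 correct detections, 10 spurious ones, 120 labeled ramps.
example = precision_recall(90, 10, 120)
print(example)
```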

Stage 2: Curb Ramp Detection

We detail how to reproduce our Stage 2 results.

You can either start where you left off in Stage 1, with the dataset fully generated, or you can skip that process and download our full dataset here or using the download_dataset.py script in the project directory.

Training

Run train.py in stage_two (see python train.py --help for options). This will take a very long time (> 24 hours). We trained on 16x NVIDIA L40s GPUs on a slurm cluster. We train for only one epoch (--epochs) but you may increase this if you desire. The model is saved at best_model.pth.

To fine-tune from existing weights (e.g. the released RampNet model, or a previous run's checkpoint) instead of training from ImageNet initialization:

python train.py --preset finetune --init-weights path/to/checkpoint.pth

The finetune preset lowers the learning rate to 3e-6 (override with --lr). Note that an existing latest_checkpoint.pth resume file always takes precedence over --init-weights — delete it if you intend to start a fresh fine-tuning run.
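
Because a leftover resume file overrides --init-weights, a small pre-flight check before launching a fine-tuning run can help. A sketch; the helper name is hypothetical, and the assumption that latest_checkpoint.pth sits in the directory you pass in is ours, so check where your train.py run actually writes it.

```python
import tempfile
from pathlib import Path

def stale_resume_file(run_dir, name="latest_checkpoint.pth"):
    """Return the path of an existing resume file in run_dir, or None."""
    candidate = Path(run_dir) / name
    return candidate if candidate.is_file() else None

# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    before = stale_resume_file(tmp)                    # nothing there yet
    (Path(tmp) / "latest_checkpoint.pth").touch()
    after = stale_resume_file(tmp)                     # now detected

print(before, after)
```

If the check returns a path and you want a fresh fine-tuning run, delete or move that file first.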

Evaluating the Results

There are two benchmarks you can evaluate against: (1) the test split of the generated dataset or (2) the manually annotated panoramas. We place more emphasis on the latter because its labels come directly from human annotators rather than from the automated pipeline and are therefore less error-prone. Select with --dataset manual (default) or --dataset test:

python evaluate.py --checkpoint checkpoints/your_checkpoint.pth --dataset manual

After running evaluate.py (which will take some time), results are printed to the console and written to the evaluation_results directory: the precision vs. recall and precision & recall vs. model confidence curves (PNG + CSV), plus a machine-readable metrics_*.json. Note that the repo already includes evaluation_results from our past runs, so the presence of that directory does not by itself mean your evaluation succeeded; it may just be the copy that ships with the GitHub repo.

Figure: precision vs. recall curve.

Cached heatmaps are keyed by checkpoint hash and TTA setting, so switching checkpoints is safe without clearing anything; pass --fresh to force recomputation (e.g. after code changes to inference).
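
To illustrate why keying the cache this way makes checkpoint switches safe, here is one way such a key could be built. This is a sketch of the idea, not the repo's implementation: the SHA-256 digest, the key format, and the function name are all assumptions.

```python
import hashlib
import os
import tempfile

def heatmap_cache_key(checkpoint_path, use_tta, chunk_size=1 << 20):
    """Build a cache key from the checkpoint's content hash and the TTA flag."""
    digest = hashlib.sha256()
    with open(checkpoint_path, "rb") as f:
        # Hash in chunks so large checkpoints are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return f"{digest.hexdigest()[:16]}_tta{int(bool(use_tta))}"

# Demonstration with two dummy "checkpoints" in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, "a.pth")
    path_b = os.path.join(tmp, "b.pth")
    with open(path_a, "wb") as f:
        f.write(b"weights-a")
    with open(path_b, "wb") as f:
        f.write(b"weights-b")
    key_a = heatmap_cache_key(path_a, use_tta=True)
    key_a_again = heatmap_cache_key(path_a, use_tta=True)
    key_a_no_tta = heatmap_cache_key(path_a, use_tta=False)
    key_b = heatmap_cache_key(path_b, use_tta=True)

print(key_a, key_a_no_tta, key_b)
```

Because the key depends on file contents rather than the file name, overwriting a checkpoint in place also yields a new key. Changes to the inference code are not captured by such a key, which is the case --fresh is for.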

Acknowledgments

This work is supported by the NSF and is part of the OSCUR initiative.
