MARS: a Multimodal Alignment and Ranking System for Few-Shot Segmentation

Stefano Samele *

Politecnico di Milano

Nico Catalano *

Politecnico di Milano

Paolo Pertino *

Politecnico di Milano

Matteo Matteucci

Politecnico di Milano

WACV 2026

*Equal contribution

High level overview of the MARS pipeline.

Abstract

Current Few-Shot Segmentation literature lacks a mask selection method that goes beyond visual similarity between the query and example images, leading to suboptimal predictions. We present MARS, a plug-and-play ranking system that leverages multimodal cues to filter and merge mask proposals robustly. Starting from a set of mask predictions for a single query image, we score, filter, and merge them to improve results. Proposals are evaluated using multimodal scores computed at local and global levels. Extensive experiments on COCO-20i, Pascal-5i, LVIS-92i, and FSS-1000 demonstrate that integrating all four scoring components is crucial for robust ranking, validating our contribution. As MARS can be effortlessly integrated with various mask proposal systems, we deploy it across a wide range of top-performer methods and achieve new state-of-the-art results on multiple existing benchmarks. Code will be available upon acceptance.

Pipeline overview

The MARS framework enhances few-shot segmentation by integrating multimodal information into a unified pipeline. It begins with a module that retrieves the essential textual cues from the support images. Using ViP-LLaVA, the system extracts the class name and obtains a detailed description via WordNet. In scenarios with multiple support images, a majority voting mechanism ensures that the consolidated textual information is both robust and reliable.
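The page does not spell out the exact voting rule, so the following is only a minimal sketch of majority voting over per-support class names; the function name and normalization are hypothetical, not taken from the paper's code.

```python
from collections import Counter

def consolidate_class_names(per_support_names):
    """Majority vote over the class names predicted (e.g., by ViP-LLaVA)
    for each support image; ties resolve to the first-seen name."""
    votes = Counter(name.strip().lower() for name in per_support_names)
    winner, _ = votes.most_common(1)[0]
    return winner

# Three support images, one noisy prediction: the majority label wins.
print(consolidate_class_names(["zebra", "Zebra", "horse"]))  # zebra
```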

Following this, the system aligns visual and textual modalities. A pre-trained CLIP model processes the query image along with two specialized text prompts—one emphasizing the presence of the object and one its absence—to generate an initial saliency map that highlights regions of interest. This map is refined by incorporating prior information, leading to a more precise localization of the target region.
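One common way to turn a positive/negative prompt pair into a saliency map is a per-patch softmax over the two text similarities; the sketch below illustrates that idea with numpy on dummy embeddings. The temperature value and all names are assumptions, not the paper's implementation.

```python
import numpy as np

def saliency_map(patch_emb, pos_text_emb, neg_text_emb, tau=0.07):
    """Per-patch probability of the 'object present' prompt versus the
    'object absent' prompt, from CLIP-style embeddings."""
    def l2norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    patches = l2norm(patch_emb)                        # (N, D)
    texts = l2norm(np.stack([pos_text_emb, neg_text_emb]))  # (2, D)
    logits = patches @ texts.T / tau                   # (N, 2)
    logits -= logits.max(axis=-1, keepdims=True)       # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs[:, 0]                                 # positive-prompt channel

rng = np.random.default_rng(0)
sal = saliency_map(rng.normal(size=(196, 512)),        # 14x14 patch grid
                   rng.normal(size=512), rng.normal(size=512))
print(sal.shape)  # (196,)
```

Reshaping the output to the patch grid and upsampling would give the dense map that the prior information then refines.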

To address situations where visual features may not fully capture semantic nuances, the AlphaCLIP module is employed. This module fuses the class name with its detailed textual description to create a robust global conceptual score. By comparing the normalized image and text embeddings, it provides a semantic measure that verifies whether the mask proposals are aligned with the expected class characteristics.
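The comparison of normalized embeddings described above is a cosine similarity; a minimal sketch, with hypothetical function and argument names (the mask-conditioned embedding would come from AlphaCLIP, not computed here):

```python
import numpy as np

def global_conceptual_score(masked_image_emb, text_emb):
    """Cosine similarity between a mask-conditioned image embedding
    (one per proposal) and the embedding of the 'class name + detailed
    description' prompt."""
    a = masked_image_emb / np.linalg.norm(masked_image_emb)
    b = text_emb / np.linalg.norm(text_emb)
    return float(a @ b)

v = np.array([1.0, 0.0, 0.0])
print(global_conceptual_score(v, 2.0 * v))  # 1.0: perfectly aligned embeddings
```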

In parallel, the framework processes visual information exclusively. A Vision Transformer extracts detailed features from both the support and query images, producing metrics that capture overall visual similarity as well as fine-grained local correspondences. This results in the computation of both a global visual score and a local visual score, ensuring that the segmentation captures both holistic appearance and detailed structure.
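A plausible reading of the two visual scores is: global similarity between pooled ViT features, and an average of best patch-to-patch matches inside the proposal. The sketch below illustrates that interpretation on dummy features; names and the exact matching rule are assumptions.

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def global_visual_score(support_cls, query_cls):
    """Cosine similarity between pooled (e.g., [CLS]) ViT features."""
    return float(l2norm(support_cls) @ l2norm(query_cls))

def local_visual_score(support_patches, query_patches, proposal_mask):
    """For each query patch covered by the proposal, take its best match
    among the support's foreground patches, then average."""
    support = l2norm(support_patches)               # (Ns, D)
    query = l2norm(query_patches[proposal_mask])    # (Nq_in, D)
    if query.shape[0] == 0:
        return 0.0
    sim = query @ support.T                         # (Nq_in, Ns)
    return float(sim.max(axis=1).mean())

feats = np.eye(4)                                   # 4 toy patch features
mask = np.array([True, True, True, True])
print(local_visual_score(feats, feats, mask))       # 1.0: identical patches
```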

Finally, all the computed scores are integrated in a filtering and merging stage. The Filtering-Merging Module combines the contributions of the local and global conceptual and visual scores into a single MARS score for each mask proposal. Through a dual-threshold strategy, low-confidence proposals are discarded and the remaining ones are merged to produce the final, refined segmentation mask.
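The combination and dual-threshold logic can be sketched as follows; weights, thresholds, and the fallback rule are placeholders of my own, not the paper's values.

```python
import numpy as np

def mars_score(s_vis_loc, s_vis_glob, s_con_loc, s_con_glob,
               w=(0.25, 0.25, 0.25, 0.25)):
    """Combine the four components into one MARS score per proposal
    (uniform weights assumed here)."""
    return float(np.dot(w, [s_vis_loc, s_vis_glob, s_con_loc, s_con_glob]))

def filter_and_merge(masks, scores, t_low=0.4, t_high=0.7):
    """Dual-threshold selection: proposals below t_low are discarded,
    proposals at or above t_high are merged by pixel-wise union; if none
    clears t_high, fall back to the single best surviving proposal."""
    scores = np.asarray(scores, dtype=float)
    keep = scores >= t_low
    merged = np.zeros_like(masks[0], dtype=bool)
    if not keep.any():
        return merged                         # no confident proposal at all
    chosen = scores >= t_high
    if not chosen.any():
        chosen = np.zeros_like(keep)
        chosen[int(np.argmax(np.where(keep, scores, -np.inf)))] = True
    for mask, use in zip(masks, chosen):
        if use:
            merged |= mask
    return merged

m1 = np.array([[True, False], [False, False]])
m2 = np.array([[False, True], [False, False]])
m3 = np.array([[False, False], [True, True]])
print(filter_and_merge([m1, m2, m3], [0.9, 0.8, 0.2]))  # union of m1 and m2
```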

Quantitative results

As a plug-and-play module, MARS has been evaluated on top of a variety of state-of-the-art few-shot segmentation methods, under both the one-shot and five-shot settings.

The results are shown in the tables below. The evaluation suite comprises the COCO-20^i, Pascal-5^i, LVIS-92^i, and FSS-1000 datasets. Results are reported in terms of mean Intersection over Union (mIoU).
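For reference, mIoU averages per-example Intersection over Union; the benchmarks average over folds/classes per their own protocols, but the core computation can be sketched as follows (averaging over mask pairs purely for illustration):

```python
import numpy as np

def mean_iou(pred_masks, gt_masks, eps=1e-8):
    """Mean IoU over (prediction, ground-truth) boolean mask pairs."""
    ious = []
    for pred, gt in zip(pred_masks, gt_masks):
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        ious.append(inter / (union + eps))   # eps guards empty unions
    return float(np.mean(ious))

g = np.array([[True, False], [True, True]])
print(round(mean_iou([g], [g]), 4))  # 1.0: perfect prediction
```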

Method   | COCO-20^i           | Pascal-5^i          | LVIS-92^i           | FSS-1000
         | Original  + MARS    | Original  + MARS    | Original  + MARS    | Original  + MARS
PerSAM   | 21.4  28.8 (+7.4)   | 43.1  56.0 (+12.9)  | 12.3  13.6 (+1.3)   | 75.0  79.1 (+4.1)
VRP-SAM  | 50.1  52.6 (+2.5)   | 69.2  67.6 (-1.6)   | --    --            | 87.9  86.4 (-1.5)
SegGPT   | 54.5  59.9 (+5.4)   | 83.2  84.1 (+0.9)   | 20.8  24.0 (+3.2)   | 83.3  84.3 (+1.0)
Matcher  | 52.7  60.5 (+7.8)   | 68.1  77.2 (+9.1)   | 33.0  36.9 (+3.9)   | 87.0  85.4 (-1.6)
GF-SAM   | 58.7  61.9 (+3.2)   | 72.1  75.7 (+3.6)   | 35.2  38.7 (+3.5)   | 88.0  87.0 (-1.0)

Table 1: One-Shot Segmentation Results. For each dataset, the table reports the original performance and the performance after applying MARS, in terms of mIoU. In each column, the top-performing method for each dataset is underlined.
Method   | COCO-20^i           | Pascal-5^i          | LVIS-92^i           | FSS-1000
         | Original  + MARS    | Original  + MARS    | Original  + MARS    | Original  + MARS
SegGPT   | 61.2  64.3 (+3.1)   | 86.8  87.8 (+1.0)   | 22.4  23.5 (+0.9)   | 86.2  86.3 (+0.1)
Matcher  | 60.7  63.6 (+2.9)   | 74.0  80.7 (+6.7)   | 40.0  40.5 (+0.5)   | 89.6  87.6 (-2.0)
GF-SAM   | 66.8  67.8 (+1.0)   | 82.6  81.5 (-1.1)   | 44.0  46.7 (+2.7)   | 88.9  87.5 (-1.4)

Table 2: Five-Shot Segmentation Results. For each dataset, the table reports the original performance in terms of mIoU and the performance after applying MARS. In each column, the top-performing method for each dataset is underlined.

Qualitative Results on the COCO-20^i dataset


BibTeX citation

@misc{catalano2025marsmultimodalalignmentranking,
  title={MARS: a Multimodal Alignment and Ranking System for Few-Shot Segmentation},
  author={Nico Catalano and Stefano Samele and Paolo Pertino and Matteo Matteucci},
  year={2025},
  eprint={2504.07942},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2504.07942},
}