
MarineInst: A Foundation Model for Marine Image Analysis with Instance Visual Description

1The Hong Kong University of Science and Technology, 2Northeastern University, 3Trinity College Dublin
The 18th European Conference on Computer Vision (ECCV 2024), Oral 🎉🎉🎉!

We present MarineInst, a powerful and flexible marine foundation model that supports various downstream tasks. Best viewed in color.

Abstract

Recent foundation models trained on tremendous amounts of data have shown great promise in a wide range of computer vision tasks and application domains. However, less attention has been paid to the marine realms, which in contrast cover the majority of our blue planet. The scarcity of labeled data is the most hindering issue, and marine photographs exhibit significantly different appearances and contents from general in-air images. Applying existing foundation models to marine visual analysis does not yield satisfactory performance, due not only to the data distribution shift, but also to the intrinsic limitations of existing foundation models (e.g., lacking semantics, redundant mask generation, or being restricted to image-level scene understanding). In this work, we emphasize both model and data approaches for understanding marine ecosystems. We introduce MarineInst, a foundation model for analyzing the marine realms with instance visual description, which outputs instance masks and captions for marine object instances. To train MarineInst, we acquire MarineInst20M, the largest marine image dataset to date, which contains a wide spectrum of marine images with high-quality semantic instance masks constructed from a mixture of human-annotated instance masks and model-generated instance masks produced by our automatic binary instance filtering procedure. To generate informative and detailed semantic instance captions, we use vision-language models to produce captions with semantic richness at various granularities. Our model and dataset support a wide range of marine visual analysis tasks, from image-level scene understanding to regional mask-level instance understanding. More significantly, MarineInst exhibits strong generalization ability and flexibility, supporting a wide range of downstream tasks with state-of-the-art performance, as demonstrated in the teaser figure above.

The Framework Overview


The framework overview of the proposed MarineInst.

Our marine foundation model has two main stages for predicting instance visual descriptions. The first stage performs instance segmentation to obtain instance masks, and the second stage performs instance captioning to generate a caption for each predicted instance mask. To improve the accuracy of the instance masks, we devise a strategy for both training and inference that uses binary instance filtering to remove non-instance masks. We provide an overview of our MarineInst foundation model in the framework figure above.
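To make the two-stage design concrete, the sketch below mimics it with off-the-shelf components: SAM proposes instance masks, a simple score threshold stands in for the learned binary instance filtering (in MarineInst the filter is a trained module inside the mask decoder, not a heuristic), and a frozen BLIP2 captions each surviving instance. The checkpoint names, thresholds, and cropping strategy here are illustrative assumptions rather than our released implementation.

# Minimal sketch of the two-stage instance visual description pipeline with
# public stand-ins: SAM for stage 1 and a frozen BLIP2 for stage 2.
# NOTE: the score thresholds below are only a crude proxy for the learned
# binary instance filtering inside the MarineInst mask decoder.
import numpy as np
import torch
from PIL import Image
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator
from transformers import Blip2ForConditionalGeneration, Blip2Processor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stage 1: instance segmentation (automatic mode, grid point prompts).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth").to(device)
mask_generator = SamAutomaticMaskGenerator(sam)

# Stage 2: frozen VLM for instance captioning (BLIP2 here; MarineGPT also fits).
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
captioner = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to(device)

def describe_instances(image: Image.Image, min_score: float = 0.9):
    """Return (mask, caption) pairs for one marine image."""
    proposals = mask_generator.generate(np.array(image))
    results = []
    for p in proposals:
        # Heuristic stand-in for binary instance filtering: drop shaky masks.
        if p["predicted_iou"] < min_score or p["stability_score"] < min_score:
            continue
        x, y, w, h = p["bbox"]  # crop the instance region for captioning
        crop = image.crop((x, y, x + w, y + h))
        inputs = processor(images=crop, return_tensors="pt").to(device, torch.float16)
        out = captioner.generate(**inputs, max_new_tokens=50)  # 50-token cap
        caption = processor.decode(out[0], skip_special_tokens=True).strip()
        results.append((p["segmentation"], caption))
    return results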
We now provide details of MarineInst. We adopt SAM as the backbone for instance segmentation, with binary instance filtering inside the mask decoder. MarineInst is continually pre-trained on our MarineInst20M dataset (2.42M images and 19.2M instance masks in total) to better extract effective marine feature representations. During training, we combine a point prompt (3 random points inside the mask) with a box prompt, while ignoring the mask prompt. For instance captioning, we run inference with frozen VLMs such as CLIP, BLIP2, and MarineGPT; MarineInst is flexible with respect to the choice of VLM, and we adopt BLIP2 and MarineGPT as the main demonstrations in this paper. We cap generation at a maximum of 50 tokens per caption.
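For illustration, the training prompts just described (3 random points inside each ground-truth mask plus its bounding box, with the mask prompt ignored) could be sampled as follows; this is a plausible reconstruction, not the exact training code.

import numpy as np

def sample_training_prompts(mask, num_points=3, rng=None):
    """Build point and box prompts from one binary instance mask (H, W).

    Returns three random foreground points (as (x, y) coordinates) and the
    tight bounding box in XYXY format; the mask prompt is ignored, matching
    the training setup described above.
    """
    rng = rng or np.random.default_rng()
    ys, xs = np.nonzero(mask)                      # all foreground pixels
    idx = rng.choice(len(xs), size=num_points, replace=False)
    points = np.stack([xs[idx], ys[idx]], axis=1)  # (num_points, 2)
    box = np.array([xs.min(), ys.min(), xs.max(), ys.max()])
    return points, box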
For comparison, we include SAM, Semantic-SAM, SSA (SAM+BLIP2+CLIP), and OVSAM. SAM generates masks either in automatic mode or with prompts. We set the semantic granularity of Semantic-SAM to 3 for automatically producing masks. SSA assigns semantics from BLIP2 to the masks generated by SAM. OVSAM generates masks with point or box prompts.

Dataset


Overview of the dataset construction flow.

MarineInst20M Dataset Statistics

Detailed statistics of the proposed MarineInst20M dataset used to optimize our MarineInst. For each source we report: 1) the number of categories; 2) the number of images; 3) the annotation type, including category, point, bounding box, and mask; 4) the diversity richness of the dataset or image collection; 5) the original task and motivation for which the dataset or imagery/videos were collected; and 6) the number of images, total instance masks, and average instance masks per image after our processing procedures (denoted Img/Inst./Aver.). Instance masks are either generated by models based on various prompts or automatically (no prompts), or annotated by human annotators with our labeling tool. "–" indicates that the value cannot be reported or an accurate statistic is difficult to provide. "Foregr." denotes that the foreground objects are annotated (categories may vary across images).

| Dataset | Categories | Images | Annotation | Diversity | Original task and motivation | Img/Inst./Aver. |
|---|---|---|---|---|---|---|
| Mastr1325 | 3 | 1,325 | Mask | Medium | Marine obstacle segmentation | 178/215/1.21 |
| Marine Fouling | 3 | 267 | BBox | Low | Biological fouling detection | 221/508/2.30 |
| LaRS | 4 | 4,006 | Mask | Medium | Marine obstacle segmentation | 367/562/1.53 |
| Fish4Knowledge | Foregr. | 27,370 | BBox | Low | Fish detection and tracking | 470/470/1.00 |
| MAS3K | 37 | 3,103 | Mask | Medium | Marine animal segmentation | 553/651/1.18 |
| SUIM | 8 | 1,500 | Mask | Medium | Underwater scene segmentation | 589/1,091/1.85 |
| Aquarium | 7 | 638 | BBox | Medium | Underwater object detection | 632/4,182/6.62 |
| UTB180 | Foregr. | 58,000 | BBox | Low | Underwater visual object tracking | 900/900/1.00 |
| TACO | – | 1,500 | BBox | Medium | Litter detection | 1,109/2,656/2.39 |
| Brackish | Foregr. | 15,084 | BBox | Low | Underwater fish detection and tracking | 1,423/3,168/2.23 |
| FLOW | Foregr. | 2,000 | BBox | Medium | Litter detection | 1,825/3,850/2.11 |
| DUO | 3 | 2,227 | BBox | Medium | Underwater object detection | 2,170/13,090/6.03 |
| DeepFish | 1 | 39,766 | BBox | Medium | Fish detection | 4,396/12,381/2.82 |
| Underwater Garbage | 15 | 416 | BBox | Medium | Underwater garbage detection | 4,542/9,386/2.07 |
| CoralNet | 191 | 416,512 | Cate./Point | High | Sparse point based coral reef identification | 4,615/5,753/1.25 |
| WaterMask | 7 | 4,628 | Mask | High | Underwater instance segmentation | 4,628/28,410/6.14 |
| IOCFish5k | – | 5,637 | Point | High | Underwater object counting | 5,382/192,900/35.84 |
| OZFish | Foregr. | 9,242 | BBox | Medium | Underwater fish detection | 6,235/38,875/6.23 |
| URPC | 4 | 6,626 | BBox | Medium | Underwater object detection | 6,330/38,307/6.05 |
| TrashCan | – | 7,212 | BBox | Medium | Underwater trash detection | 6,465/9,855/1.52 |
| Trash-ICRA19 | – | 7,668 | BBox | Medium | Underwater trash detection | 7,307/18,822/2.58 |
| MarineDet | 821 | 22,679 | BBox | High | Open-marine object detection | 22,679/39,243/1.73 |
| FishNet | 17,357 | 94,532 | Cate./BBox | High | Fine-grained fish classification and detection | 48,659/49,774/1.02 |
| FathomNet | – | 109,871 | BBox | High | Underwater and deep-sea object detection | 69,909/121,329/1.74 |
| FishNet Open | 34 | 143,818 | BBox | High | Fish and non-fish detection | 82,622/285,170/3.45 |
| Total (1st source) | – | 284,206 | – | High | Image collection of existing public datasets | 284,206/881,548/3.10 |
| HK-Reef-Fish | – | 730 | – | Low | Fish identification | 729/1,985/2.72 |
| CoralVOS | – | 60,456 | Mask | Low | Coral video segmentation | 750/2,057/2.74 |
| MVC | – | 1,026 | – | Medium | Underwater object detection and segmentation | 1,026/3,516/3.43 |
| Sea Animal | 23 | 13,711 | Category | Medium | Sea animal classification | 3,080/7,448/2.42 |
| ImageNet | 38 | 43,907 | Category | Low | Scene classification | 3,987/7,175/1.78 |
| MVK | – | 4,872 | – | Medium | Marine video retrieval | 4,872/25,077/5.15 |
| Oceanic Life | – | 7,990 | – | High | Collection of marine life imagery | 5,029/20,811/4.14 |
| Reef-Life-Survey | – | 7,089 | – | High | Marine creature identification | 7,075/12,502/1.77 |
| Corals-of-world | – | 8,217 | – | Medium | Coral reef identification | 7,636/17,264/2.26 |
| Wildfish++ | 2,348 | 103,034 | Category | High | Fine-grained fish classification | 9,367/17,075/1.82 |
| FishDB | – | 10,074 | – | Medium | Fish species identification | 9,905/18,914/1.91 |
| Reeflex | – | 15,174 | – | High | Marine creature identification | 15,088/61,656/4.09 |
| Fish-of-Australia | – | 20,795 | – | Medium | Fish species identification | 19,269/44,342/2.30 |
| YouTube | – | 20,935 | – | High | Video collection | 20,935/201,290/9.61 |
| EOL | – | 3,498,763 | – | High | Species identification | 23,141/80,128/3.46 |
| Private data | – | 24,420 | – | High | Surveying; diving; snorkeling | 24,420/289,898/11.87 |
| Total (2nd source) | – | 156,309 | – | High | Image collection with manual annotations | 156,309/811,138/5.19 |
| Internet images | – | 35,172 | – | High | Image collection (human labeled) | 35,172/194,010/5.52 |
| Internet images | – | 1,945,714 | – | High | Image collection (automatic mask generation) | 1.94M/17.3M/8.89 |
| Total (3rd source) | – | 1,980,346 | – | High | Image collection of Internet images | 1.98M/17.5M/8.84 |
| MarineInst20M | – | 2,420,851 | – | High | Instance segmentation and captioning | 2.42M/19.2M/7.93 |


We visualize the composition of all the instance masks in our MarineInst20M dataset: a) the composition of instance masks from the three different sources; b) the composition of instance masks from the existing public datasets after our conversion; c) the composition of instance masks labeled by the human annotators in this work. For both b) and c), we only visualize the top 8 components for better readability.


We provide example images with instance mask visualizations from the existing public datasets after our processing procedures, which convert point or bounding box annotations into instance masks. Please zoom in to check more details.
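One plausible way to realize this conversion is to prompt a SAM predictor with each ground-truth box and keep the best-scoring mask, as sketched below; the exact conversion and quality-control procedure used to build MarineInst20M may differ.

# Sketch: convert a bounding-box annotation into an instance mask via SAM.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

def box_to_mask(image, box_xyxy):
    """Turn one XYXY box on an HWC uint8 RGB image into a binary mask."""
    predictor.set_image(image)
    masks, scores, _ = predictor.predict(
        box=box_xyxy,                # length-4 XYXY box prompt
        multimask_output=True,       # propose several candidate masks
    )
    return masks[np.argmax(scores)]  # keep the best-scoring candidate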


Instance mask visualization of example images from the MarineInst20M dataset. The instance masks are all labeled by human annotators.


Instance mask visualization of example images from the MarineInst20M dataset. The instance masks are all automatically generated by our MarineInst model without any prompts.


Visualization of the generated instance masks with comprehensive and detailed semantic instance captions. Please zoom in to check more details.


The word cloud visualization of the top 1,000 words across all the extracted phrases from the alt-texts of the public Internet images.
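Such a word cloud can be reproduced by counting word frequencies over the extracted phrases and rendering the most frequent entries; the sketch below assumes the phrases are already collected as a list of strings and uses the wordcloud package.

# Sketch: word cloud of the most frequent words in the alt-text phrases.
from collections import Counter
from wordcloud import STOPWORDS, WordCloud

def build_word_cloud(phrases, top_k=1000):
    counts = Counter(
        word for phrase in phrases
        for word in phrase.lower().split()
        if word not in STOPWORDS
    )
    freqs = dict(counts.most_common(top_k))
    return WordCloud(width=1600, height=800).generate_from_frequencies(freqs)

# e.g., build_word_cloud(alt_text_phrases).to_file("wordcloud.png")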

Qualitative Results

Comparison with existing SOTA algorithms.


MarineInst effectively addresses the over-segmentation and partial-segmentation issues of SAM and Semantic-SAM. Meanwhile, MarineInst generates meaningful and comprehensive semantic captions faithful to each generated instance mask, while the other methods cannot.


Comparison between MarineInst and existing SOTA algorithms. Both SAM and MarineInst generate masks based on automatically generated grid points. SSA yields semantic predictions based on the masks automatically generated by SAM. We set the semantic granularity of Semantic-SAM to 3. OVSAM is prompted with boxes. Please zoom in to see more details.


Head-to-head comparison between SAM and MarineInst on instance mask generation. Grid point prompts are generated automatically for both models. SAM suffers from over-segmentation and partial-segmentation issues, generating redundant and meaningless masks. MarineInst demonstrates a stronger instance mask generation ability than SAM.

Downstream Tasks


(a) Image storytelling based on MarineInst. (b) Marine text-to-image synthesis based on the Stable Diffusion model (stable-diffusion-v1-5).
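For reference, the base (non-fine-tuned) synthesis setting can be reproduced with the diffusers library as sketched below; the prompt text is an illustrative example, and fine-tuning on MarineInst20M is a separate training step not shown here.

# Sketch: marine text-to-image synthesis with stable-diffusion-v1-5.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a clownfish sheltering in a sea anemone on a coral reef",  # example prompt
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
image.save("marine_synthesis.png")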


We report marine text-to-image synthesis results under two settings: a) without fine-tuning and b) with fine-tuning on our MarineInst20M dataset. Reference images of the requested marine species are also provided so readers can better compare the synthesis performance. Best viewed in color.


We optimize our MarineInst to generate comprehensive and detailed semantic instance captions for each generated instance mask. We then use ChatGPT-3.5 to merge the generated instance captions into an image-level caption. GPT-4V is included for comparison; texts in green are correct responses and texts in red are wrong responses.
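The caption-merging step can be scripted against the OpenAI chat API as sketched below; the instruction text is our assumption of a plausible merging prompt, not necessarily the exact one used.

# Sketch: merge per-instance captions into one image-level caption.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def merge_instance_captions(instance_captions):
    listing = "\n".join(f"- {c}" for c in instance_captions)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": "Merge the following instance-level captions of one "
                       "underwater image into a single coherent image-level "
                       "caption:\n" + listing,
        }],
    )
    return response.choices[0].message.content.strip()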


Results of instruction-following instance understanding and segmentation. Texts in green are correct responses and texts in red are wrong responses.


Instruction-following instance understanding results of MarineInst under two settings: 1) a single mask and 2) multiple masks with assigned mask IDs. Texts in green are correct responses and texts in red are wrong responses.


There are still some hallucinations in the generated semantic captions for the instance masks. Best viewed in color.

Citation

@inproceedings{ziqiang2024marineinst,
    title={MarineInst: A Foundation Model for Marine Image Analysis with Instance Visual Description},
    author={Zheng, Ziqiang and Chen, Yiwei and Zeng, Huimin and Vu, Tuan-Anh and Hua, Binh-Son and Yeung, Sai-Kit},
    booktitle={European Conference on Computer Vision (ECCV)},
    year={2024},
    publisher={Springer}
}

Acknowledgement

Thanks to ...