GroundingDINO 1.6 to SAM 2 Video Masks (Workflow Overview)


21 Sep 2025, 00:00 Z

TL;DR Pair GroundingDINO 1.6 for open-vocabulary detections with SAM 2 for memory-based segmentation to get production-ready video mattes. You can route the masks into Remotion templates, ad variations, or AR mockups without touching frame-by-frame roto.

Why it matters

Masking is the bottleneck on every creative sprint we run for platform-specific ads. New subject versions, caption swaps, or CTA overlays all need clean mattes to avoid halo artifacts on TikTok, Reels, or Shorts. GroundingDINO 1.6 ships a tighter detector (OpenSeeD backbone, better phrase grounding) and SAM 2 extends Meta's segment-anything family with video memory and streaming support. Combined, they remove 80–90% of the manual roto grind so our editors can focus on storytelling.


Stack overview

  • GroundingDINO 1.6 - open-vocabulary detector with CLIP text embeddings and improved Match-Enhance modules for higher recall on product and human categories.
  • SAM 2 - video-capable segmentor that propagates sparse prompts or boxes through time with a stateful memory of past frames.
  • Instavar automations - once the mask is generated, we feed it into our Remotion render farm, LUT passes, or After Effects templates via JSON job descriptors.
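The JSON job descriptors mentioned above can be assembled in a few lines. The schema below is purely illustrative (the field names are assumptions, not Instavar's actual format), but it shows the shape of the hand-off: a template ID, the source clip, the matte produced by SAM 2, and a list of downstream finishing passes.

```python
import json

def build_render_job(mask_path: str, source_clip: str, template: str) -> str:
    """Assemble a hypothetical JSON job descriptor for the render farm.

    Field names are illustrative; adapt them to whatever your
    Remotion / After Effects tooling actually expects.
    """
    job = {
        "template": template,           # e.g. a Remotion composition ID
        "inputs": {
            "clip": source_clip,        # original footage
            "matte": mask_path,         # SAM 2 mask video or image sequence
        },
        "passes": ["lut", "caption_overlay"],  # downstream finishing steps
    }
    return json.dumps(job, indent=2)

print(build_render_job("out/masks/shot01.mov", "in/shot01.mp4", "ProductHero"))
```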

Environment setup

# Python 3.10+ recommended
python -m venv .venv
source .venv/bin/activate

pip install torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu121
pip install groundingdino-py segment-anything-2==1.1.0 opencv-python==4.10.0.84

# Pull weight files
huggingface-cli download IDEA-Research/GroundingDINO-1.6-Refiner --local-dir weights/groundingdino
wget -P weights/sam2 https://dl.fbaipublicfiles.com/segment_anything_2/sam2_hiera_tiny.pt

Adjust CUDA wheels to your driver. For macOS or CPU-only prototyping, drop the CUDA index URL and expect slower inference.


Prompting the detector

GroundingDINO 1.6 accepts natural-language phrases. Strong prompts in our creative pods follow this structure:

  • descriptor + category + context, for example "matte bottle on marble countertop" or "founder speaking on-couch"
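That structure is easy to enforce in code. GroundingDINO's reference implementation lowercases captions and expects a trailing period, and multiple phrases are conventionally joined with " . "; the detector also returns boxes as normalized (cx, cy, w, h), while SAM 2's video predictor takes absolute-pixel (x0, y0, x1, y1) box prompts. The helpers below are a minimal sketch under those conventions (the function names are ours, not from either library):

```python
def build_caption(*phrases: str) -> str:
    """Join descriptor + category + context phrases into one caption.

    GroundingDINO lowercases captions and expects a trailing period;
    separate phrases are conventionally joined with " . ".
    """
    cleaned = [p.lower().strip().rstrip(".") for p in phrases if p.strip()]
    return " . ".join(cleaned) + " ."

def to_sam2_box(cxcywh, width, height):
    """Convert a normalized (cx, cy, w, h) detection into the absolute-pixel
    (x0, y0, x1, y1) box format SAM 2 accepts as a prompt."""
    cx, cy, w, h = cxcywh
    return (
        (cx - w / 2) * width,
        (cy - h / 2) * height,
        (cx + w / 2) * width,
        (cy + h / 2) * height,
    )

print(build_caption("matte bottle on marble countertop", "founder speaking on-couch"))
# -> matte bottle on marble countertop . founder speaking on-couch .
print(to_sam2_box((0.5, 0.5, 0.25, 0.5), 1920, 1080))
# -> (720.0, 270.0, 1200.0, 810.0)
```

Feed each converted box into SAM 2 on the frame where the detection fired, then let its memory module propagate the mask forward through the clip.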
