GroundingDINO 1.6 to SAM 2 Video Masks (Workflow Overview)
21 Sep 2025, 00:00 Z
TL;DR: Pair GroundingDINO 1.6's open-vocabulary detections with SAM 2's memory-based segmentation to get production-ready video mattes. Route the masks into Remotion templates, ad variations, or AR mockups without touching frame-by-frame roto.
Why it matters
Masking is the bottleneck on every creative sprint we run for platform-specific ads. New subject versions, caption swaps, or CTA overlays all need clean mattes to avoid halo artifacts on TikTok, Reels, or Shorts. GroundingDINO 1.6 ships a tighter detector (OpenSeeD backbone, better phrase grounding) and SAM 2 extends Meta's segment-anything family with video memory and streaming support. Combined, they remove 80–90% of the manual roto grind so our editors can focus on storytelling.
Stack overview
- GroundingDINO 1.6 - open-vocabulary detector with CLIP text embeddings and improved Match-Enhance modules for higher recall on product and human categories.
- SAM 2 - video-capable segmentor that propagates sparse prompts or boxes through time with a stateful memory of past frames.
- Instavar automations - once the mask is generated, we feed it into our Remotion render farm, LUT passes, or After Effects templates via JSON job descriptors.
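A practical wrinkle when wiring the first two pieces together: GroundingDINO's inference utilities return boxes in normalized (cx, cy, w, h) format, while SAM 2's box prompt expects absolute-pixel (x1, y1, x2, y2). A minimal conversion helper (the function name is ours, not from either library) might look like:

```python
def dino_box_to_sam_prompt(box, frame_w, frame_h):
    """Convert a normalized (cx, cy, w, h) detector box to an
    absolute-pixel (x1, y1, x2, y2) box for a SAM 2 box prompt."""
    cx, cy, w, h = box
    x1 = (cx - w / 2) * frame_w
    y1 = (cy - h / 2) * frame_h
    x2 = (cx + w / 2) * frame_w
    y2 = (cy + h / 2) * frame_h
    return [round(x1), round(y1), round(x2), round(y2)]
```

Run this once on the detection from your anchor frame, then hand the result to SAM 2's video predictor and let memory propagation do the rest.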
Links:
- GroundingDINO 1.6 repo: https://github.com/IDEA-Research/GroundingDINO
- Demo notebook (community): https://github.com/roboflow/notebooks/blob/main/notebooks/video-segmentation-groundingdino-sam2.ipynb
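The Instavar JSON job descriptor schema isn't documented in this post, so as a hypothetical sketch only, a minimal descriptor handing a mask sequence to a Remotion render job could be built like this (every key name here is an assumption):

```python
import json

def make_matte_job(clip_path, mask_dir, template_id, overlays=None):
    """Build a hypothetical render-job descriptor pairing a source clip
    with its per-frame SAM 2 mattes and a Remotion template."""
    job = {
        "source_clip": clip_path,       # original footage
        "mask_sequence": mask_dir,      # directory of per-frame PNG mattes
        "remotion_template": template_id,
        "overlays": overlays or [],     # e.g. caption or CTA layers
    }
    return json.dumps(job, indent=2)
```

Keeping the descriptor as plain JSON means the same payload can drive Remotion, LUT passes, or After Effects templates without format-specific glue.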
Environment setup
# Python 3.10+ recommended
python -m venv .venv
source .venv/bin/activate
pip install torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu121
pip install groundingdino-py opencv-python==4.10.0.84
pip install "git+https://github.com/facebookresearch/segment-anything-2.git"  # SAM 2
# Pull weight files
huggingface-cli download IDEA-Research/GroundingDINO-1.6-Refiner --local-dir weights/groundingdino
wget -P weights/sam2 https://dl.fbaipublicfiles.com/segment_anything_2/sam2_hiera_tiny.pt

Adjust the CUDA wheels to match your driver. For macOS or CPU-only prototyping, drop the CUDA index URL and expect slower inference.
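Before queuing render jobs, it's worth a quick sanity check that the downloads above actually landed. A small stdlib-only helper (paths mirror the commands above; the function itself is our addition):

```python
from pathlib import Path

def missing_weights(root="."):
    """Return the expected weight paths (from the setup above) that are
    absent under root, so a pipeline can fail fast with a clear message."""
    expected = [
        "weights/groundingdino",          # Hugging Face snapshot directory
        "weights/sam2/sam2_hiera_tiny.pt",
    ]
    return [p for p in expected if not (Path(root) / p).exists()]
```

An empty return list means both weight locations are in place; otherwise the list names exactly what still needs downloading.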
Prompting the detector
GroundingDINO 1.6 accepts natural-language phrases. Strong prompts in our creative pods follow this structure:
descriptor + category + context, for example "matte bottle on marble countertop" or "founder speaking on a couch".
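GroundingDINO's inference utilities normalize captions to lowercase and expect them to end with a period, so it helps to bake the descriptor + category + context structure into a small formatter (this helper is our sketch, not part of the library):

```python
def build_prompt(descriptor, category, context=""):
    """Compose descriptor + category + context into a GroundingDINO caption.

    GroundingDINO lowercases captions and expects a trailing period;
    we apply both here so prompts from different pods stay consistent.
    """
    phrase = " ".join(p for p in (descriptor, category, context) if p)
    phrase = phrase.lower().strip()
    return phrase if phrase.endswith(".") else phrase + "."
```

Usage: `build_prompt("matte", "bottle", "on marble countertop")` yields a caption ready to pass as the detector's text prompt.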