February 20, 2026

Seedance 2.0 Multimodal Reference Guide: Using Natural Language to Drive Image / Video / Audio References

Drive up to 9 images + 3 videos + 3 audio clips in seedance-2.0-reference-to-video using plain natural language. Includes 10 copy-paste prompt templates and clears up the @Tag syntax myth.

First, let's clear up a common myth. There's a rumor that Seedance 2.0 supports tag-style syntax like @Image1, @Video1, @Audio1. The actual API has no such syntax. Seedance 2.0's seedance-2.0-reference-to-video model accepts up to 9 images + 3 videos + 3 audio clips as reference assets, but you describe what each asset is for using natural language in the prompt — not any special symbol.

This article teaches you how to write effective natural-language prompts that drive multimodal generation precisely.

Most AI video generators accept a single text prompt and leave the model free to interpret it. Seedance 2.0's reference-to-video mode lets you provide multiple reference assets in a single request: images to define style or characters, videos to convey camera pacing, audio to set mood and rhythm. This is one of the key capabilities that differentiates it from Sora 2, Kling 3.0, and Veo 3.1.

This guide covers:

  1. The real API structure and input limits of reference-to-video
  2. How to "assign roles" to each asset using natural language in your prompt
  3. 10 ready-to-copy prompt templates
  4. Common mistakes and debugging tips

To run the API as you read, get a free EvoLink API key — it takes 30 seconds.


1. The Real API Structure (Get the Foundation Right)

The reference-to-video request body is very simple:

{
  "model": "seedance-2.0-reference-to-video",
  "prompt": "…",
  "image_urls": ["…", "…"],
  "video_urls": ["…"],
  "audio_urls": ["…"],
  "duration": 8,
  "quality": "720p",
  "aspect_ratio": "16:9"
}
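In Python, this body can be assembled as a plain dict and serialized with json.dumps. The helper below is our own sketch, not an official SDK; it omits the HTTP call because the endpoint URL and auth header depend on your provider.

```python
import json

def build_reference_request(prompt, image_urls=None, video_urls=None,
                            audio_urls=None, duration=8,
                            quality="720p", aspect_ratio="16:9"):
    """Return a request body matching the JSON shape above.

    Empty asset lists are omitted rather than sent as [].
    """
    body = {
        "model": "seedance-2.0-reference-to-video",
        "prompt": prompt,
        "duration": duration,
        "quality": quality,
        "aspect_ratio": aspect_ratio,
    }
    for key, urls in (("image_urls", image_urls),
                      ("video_urls", video_urls),
                      ("audio_urls", audio_urls)):
        if urls:
            body[key] = list(urls)
    return body

body = build_reference_request(
    "Use image 1 for the art style; replicate video 1's camera movement.",
    image_urls=["https://example.com/style.jpg"],
    video_urls=["https://example.com/orbit.mp4"],
)
payload = json.dumps(body)  # POST this with the HTTP client of your choice
```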

Core rules:

  • image_urls: 0–9 images, JPEG/PNG/WebP, 300–6000 px per side, ≤ 30 MB each
  • video_urls: 0–3 clips, MP4/MOV, 2–15 s each, ≤ 15 s total, 480p–720p
  • audio_urls: 0–3 clips, WAV/MP3, 2–15 s each, ≤ 15 s total
  • Request body: ≤ 64 MB total (no Base64 inlining)
  • Hard constraint: audio-only is not allowed; always provide at least 1 image or 1 video as a visual anchor
  • quality: only 480p and 720p (1080p is not supported)
  • prompt: ≤ 500 Chinese characters or ≤ 1000 English words
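These limits can be checked client-side before spending a request. The validator below is our own sketch of the count, anchor, and quality rules listed above (file sizes and formats are left to the server); it is not official SDK code.

```python
VALID_QUALITIES = {"480p", "720p"}  # 1080p is not supported

def validate_request(body):
    """Return a list of limit violations; an empty list means the
    request clears these client-side checks."""
    errors = []
    images = body.get("image_urls", [])
    videos = body.get("video_urls", [])
    audios = body.get("audio_urls", [])
    if len(images) > 9:
        errors.append("image_urls: at most 9 images")
    if len(videos) > 3:
        errors.append("video_urls: at most 3 clips")
    if len(audios) > 3:
        errors.append("audio_urls: at most 3 clips")
    # audio-only requests are rejected: require a visual anchor
    if audios and not (images or videos):
        errors.append("audio-only is not allowed: add an image or video anchor")
    if body.get("quality") not in VALID_QUALITIES:
        errors.append("quality must be 480p or 720p")
    return errors
```

Running this before submission turns a round-trip invalid_request error into an instant local message.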

What does NOT exist:

  • ❌ @Image1 / @Video1 / @Audio1 tag syntax
  • ❌ A dedicated field to mark an asset as "first frame" / "style reference" / "character reference"
  • ❌ Per-asset role-assignment JSON fields

Seedance 2.0's design philosophy is "let the prompt itself carry the role assignment" — you tell the model in plain language "image 1 is the character, video 1 drives the camera, audio 1 is the soundtrack", and the model understands your references by their order in the arrays.


2. The Core Writing Pattern

Split your prompt into two sections: asset role assignment + scene description.

[Asset role assignment] — two or three sentences naming each asset's job
[Scene description]     — full visual description of what you want to see

Example, with 1 image + 1 video + 1 audio:

Use image 1 for the art style and color palette;
replicate video 1's camera movement and pacing;
use audio 1 as background music throughout.

Scene: a young rider weaving through the streets of Tokyo after rain,
neon lights reflecting on the wet asphalt,
the camera pushing forward from behind the rider into a side close-up,
pacing rising and falling with the music.

Key points:

  • Use "image 1 / video 1 / audio 1" — the important thing is telling the model which array index you mean.
  • References must follow array order. If you put two images in image_urls, "image 1" maps to image_urls[0] and "image 2" to image_urls[1]. Scrambling the order confuses the model.
  • Assign one primary role per asset. Trying to make a single image "the first frame, the character, and the style all at once" is a recipe for confusion.
  • Be specific in the scene description. "Shoot something cool" is worthless.
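One way to keep prompt references and array order in lockstep is to generate the role-assignment preamble from the same ordered lists you pass in the URL arrays. The helper below is our own convention, not an API feature; the role phrases are just illustrative strings.

```python
def role_preamble(image_roles=(), video_roles=(), audio_roles=()):
    """Build the role-assignment sentences in array order, so that
    "image 1" in the prompt always means image_urls[0], and so on."""
    lines = []
    for i, role in enumerate(image_roles, start=1):
        lines.append(f"Image {i} {role}.")
    for i, role in enumerate(video_roles, start=1):
        lines.append(f"Video {i} {role}.")
    for i, role in enumerate(audio_roles, start=1):
        lines.append(f"Audio {i} {role}.")
    return " ".join(lines)

preamble = role_preamble(
    image_roles=["provides the character's appearance",
                 "sets the color palette"],
    video_roles=["drives the camera movement and pacing"],
    audio_roles=["plays as background music throughout"],
)
# preamble + "\n\nScene: ..." becomes the final prompt
```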

3. Ten Ready-to-Copy Prompt Templates

Each template below reflects real API behavior. Substitute your own asset URLs and key details.

Template 1: Single image as first frame driver (simplest)

Use for: static image + light motion

{
  "model": "seedance-2.0-image-to-video",
  "prompt": "Use the provided image as the first frame. The camera slowly pushes in, the person lifts her head and smiles, wind moves her hair gently.",
  "image_urls": ["https://example.com/portrait.jpg"],
  "duration": 5,
  "quality": "720p"
}

Tip: Single-image driving is better served by seedance-2.0-image-to-video than by reference-to-video — it has dedicated optimization for first-frame behavior.

Template 2: First-last-frame transition

{
  "model": "seedance-2.0-image-to-video",
  "prompt": "Smoothly transition from the first image to the second. Use camera panning and lighting changes to bridge the two scenes.",
  "image_urls": [
    "https://example.com/sunrise.jpg",
    "https://example.com/sunset.jpg"
  ],
  "duration": 6,
  "quality": "720p",
  "aspect_ratio": "16:9"
}

Template 3: Art style transfer (multi-image reference)

{
  "model": "seedance-2.0-reference-to-video",
  "prompt": "The overall art style references the color palette, lighting, and texture of the 3 provided images. Scene: a small-town summer market at dusk, crowds moving through warm amber light.",
  "image_urls": [
    "https://example.com/style-1.jpg",
    "https://example.com/style-2.jpg",
    "https://example.com/style-3.jpg"
  ],
  "duration": 8,
  "quality": "720p",
  "aspect_ratio": "16:9"
}

Template 4: Character consistency

{
  "model": "seedance-2.0-reference-to-video",
  "prompt": "The female character's appearance stays consistent with image 1. Scene: she walks into a vintage cafe, orders a latte, sits by the window, and opens a book.",
  "image_urls": ["https://example.com/character-ref.jpg"],
  "duration": 8,
  "quality": "720p",
  "aspect_ratio": "16:9"
}

Tip: Realistic human faces are not supported. Use virtual characters or illustrated styles.

Template 5: Camera replication (video reference)

{
  "model": "seedance-2.0-reference-to-video",
  "prompt": "Replicate video 1's orbital camera movement and velocity curve. Subject replaced with a classical sculpture in a museum hall at dusk.",
  "video_urls": ["https://example.com/orbit-shot.mp4"],
  "duration": 8,
  "quality": "720p",
  "aspect_ratio": "16:9"
}

Template 6: Music-driven pacing (audio reference)

{
  "model": "seedance-2.0-reference-to-video",
  "prompt": "Use audio 1 as the soundtrack for the entire video; shot changes sync with the beat. Scene: fast cuts of city night life — neon, raindrops, silhouettes, cab headlights flashing past.",
  "image_urls": ["https://example.com/city-mood.jpg"],
  "audio_urls": ["https://example.com/synthwave.mp3"],
  "duration": 10,
  "quality": "720p"
}

Note: Audio-only is not allowed — you must include at least 1 image or 1 video as a visual anchor.

Template 7: Full three-modal composition

{
  "model": "seedance-2.0-reference-to-video",
  "prompt": "The character's appearance references image 1; replicate video 1's first-person perspective and camera pacing; use audio 1 as background music throughout. Scene: a young rider weaving through rain-soaked Tokyo streets, neon reflections on the asphalt.",
  "image_urls": ["https://example.com/rider.jpg"],
  "video_urls": ["https://example.com/pov.mp4"],
  "audio_urls": ["https://example.com/bgm.mp3"],
  "duration": 10,
  "quality": "720p",
  "aspect_ratio": "16:9"
}

Template 8: Product ad (preserving product appearance)

{
  "model": "seedance-2.0-reference-to-video",
  "prompt": "The sneaker's appearance stays identical to image 1 — upper color, laces, and logo all match. Scene: the shoe rotates slowly on a transparent acrylic pedestal, soft studio lighting, gray gradient background.",
  "image_urls": ["https://example.com/sneaker.jpg"],
  "duration": 6,
  "quality": "720p",
  "aspect_ratio": "1:1"
}

Template 9: Pure text (no reference assets)

reference-to-video can also run with no references — but in that case it's cheaper and simpler to use seedance-2.0-text-to-video directly:

{
  "model": "seedance-2.0-text-to-video",
  "prompt": "A macro lens focuses on a green glass frog on a leaf. The focus gradually shifts from its smooth skin to its completely transparent abdomen, where a bright red heart is beating powerfully and rhythmically.",
  "duration": 8,
  "quality": "720p",
  "aspect_ratio": "16:9"
}

Template 10: Dialogue generation (put speech in double quotes)

Seedance 2.0 recognizes content inside straight double quotes and runs dedicated speech synthesis:

{
  "model": "seedance-2.0-text-to-video",
  "prompt": "She stops, turns to the boy, and says: \"You finally understood.\" Close-up on her face, expression shifting from determination to warmth.",
  "duration": 6,
  "quality": "720p",
  "generate_audio": true
}
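Because the dialogue itself sits in double quotes inside a JSON string, hand-writing the escapes is error-prone. A small Python sketch (reusing the field values from the template above) sidesteps this by building a dict and letting json.dumps do the escaping:

```python
import json

line = "You finally understood."
prompt = f'She stops, turns to the boy, and says: "{line}" Close-up on her face.'

body = {
    "model": "seedance-2.0-text-to-video",
    "prompt": prompt,
    "duration": 6,
    "quality": "720p",
    "generate_audio": True,
}
payload = json.dumps(body)  # inner quotes come out escaped as \"
```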

4. Common Mistakes & Debugging

Mistake 1: Using @Image1-style pseudo-tag syntax

Symptom: The model completely ignores your references and outputs content unrelated to your assets.

Cause: The API has no such syntax. @Image1 is treated as an ordinary string in the prompt and is not parsed as a reference.

Fix: Switch to natural language — "image 1", "video 1", "audio 1".

Mistake 2: Making a single asset play multiple roles

❌ Image 1 is the first frame, the character reference, AND the style reference
✅ Image 1 opens the scene; image 2 provides the character reference

Mistake 3: Array order doesn't match prompt references

If your prompt says "video 1" then "video 2", then video_urls[0] must be what you think of as "video 1". Reordering the array will shuffle the references.

Mistake 4: Sending only audio_urls with no visual anchor

This returns invalid_request. Always include at least 1 image or 1 video.

Mistake 5: Using quality: "1080p"

Seedance 2.0 API does not support 1080p. Only 480p and 720p are valid.

Mistake 6: Using the non-existent model ID "model": "seedance-2.0"

You must use a full model ID like seedance-2.0-reference-to-video. See Models Overview for the full matrix.


5. When to Use reference-to-video (And When Not To)

Use reference-to-video when:

  • You need to reference more than 2 images (image-to-video caps at 2)
  • You need a video as a cinematography reference
  • You need audio to drive visual pacing
  • You need simultaneous style transfer + character consistency

Don't use reference-to-video when:

  • You only have a text prompt → use text-to-video, it's cheaper
  • You have 1 or 2 images and want them to "come alive" → use image-to-video, which has dedicated optimization for first-frame behavior
  • You need to iterate quickly over many candidates → use the matching Fast model

6. Next Steps

If you come across any other tutorial mentioning @Image1, @Video1, @Audio1-style tag syntax, ignore it — that's not real Seedance 2.0 API behavior. The official documentation is the source of truth.

Ready to get started?

Top up and start generating cinematic AI videos in minutes.