Seedance 2.0 Multimodal Reference Guide: Using Natural Language to Drive Image / Video / Audio References
Drive up to 9 images + 3 videos + 3 audio clips in seedance-2.0-reference-to-video using plain natural language. Includes 10 copy-paste prompt templates and clears up the @Tag syntax myth.

First, let's clear up a common myth. There's a rumor that Seedance 2.0 supports tag-style syntax like
@Image1, @Video1, @Audio1. The actual API has no such syntax. Seedance 2.0's seedance-2.0-reference-to-video model accepts up to 9 images + 3 videos + 3 audio clips as reference assets, but you describe what each asset is for using natural language in the prompt — not with any special symbol. This article teaches you how to write effective natural-language prompts that drive multimodal generation precisely.
Most AI video generators accept a single text prompt and leave the model free to interpret it. Seedance 2.0's reference-to-video mode lets you provide multiple reference assets in a single request: images to define style or characters, videos to convey camera pacing, audio to set mood and rhythm. This is one of the key capabilities that differentiates it from Sora 2, Kling 3.0, and Veo 3.1.
This guide covers:
- The real API structure and input limits of reference-to-video
- How to "assign roles" to each asset using natural language in your prompt
- 10 ready-to-copy prompt templates
- Common mistakes and debugging tips
To run the API as you read, get a free EvoLink API key — it takes 30 seconds.
1. The Real API Structure (Get the Foundation Right)
The reference-to-video request body is very simple:
```json
{
  "model": "seedance-2.0-reference-to-video",
  "prompt": "…",
  "image_urls": ["…", "…"],
  "video_urls": ["…"],
  "audio_urls": ["…"],
  "duration": 8,
  "quality": "720p",
  "aspect_ratio": "16:9"
}
```
Core rules:
| Dimension | Limit |
|---|---|
| image_urls | 0–9 images, JPEG/PNG/WebP, 300–6000 px per side, ≤ 30 MB each |
| video_urls | 0–3 clips, MP4/MOV, 2–15 s each, ≤ 15 s total, 480p–720p |
| audio_urls | 0–3 clips, WAV/MP3, 2–15 s each, ≤ 15 s total |
| Request body | ≤ 64 MB total (no Base64 inlining) |
| Hard constraint | Audio-only is not allowed — always provide at least 1 image or 1 video as a visual anchor |
| quality | Only 480p and 720p (1080p is not supported) |
| prompt | ≤ 500 Chinese chars or ≤ 1000 English words |
What does NOT exist:
- ❌ @Image1 / @Video1 / @Audio1 tag syntax
- ❌ A dedicated field to mark an asset as "first frame" / "style reference" / "character reference"
- ❌ Per-asset role-assignment JSON fields
Seedance 2.0's design philosophy is "let the prompt itself carry the role assignment" — you tell the model in plain language "image 1 is the character, video 1 drives the camera, audio 1 is the soundtrack", and the model understands your references by their order in the arrays.
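These limits are easy to trip over in client code. As a sketch only (the `validate_request` helper below is hypothetical, not part of any official SDK; its thresholds simply mirror the table above), a client-side pre-flight check might look like:

```python
# Hypothetical pre-flight validator for reference-to-video request
# bodies. Thresholds mirror the limits table above; this is a sketch,
# not an official SDK.

VALID_QUALITY = {"480p", "720p"}  # 1080p is not supported

def validate_request(body: dict) -> list[str]:
    """Return a list of problems; an empty list means the body looks OK."""
    errors = []
    images = body.get("image_urls", [])
    videos = body.get("video_urls", [])
    audios = body.get("audio_urls", [])

    if len(images) > 9:
        errors.append("image_urls: at most 9 images")
    if len(videos) > 3:
        errors.append("video_urls: at most 3 clips")
    if len(audios) > 3:
        errors.append("audio_urls: at most 3 clips")
    # Audio-only is rejected: at least one visual anchor is required.
    if audios and not images and not videos:
        errors.append("audio-only is not allowed: add at least 1 image or video")
    if body.get("quality") not in VALID_QUALITY:
        errors.append("quality must be 480p or 720p")
    # Rough English-word count; the Chinese-character limit is not checked here.
    if len(body.get("prompt", "").split()) > 1000:
        errors.append("prompt: at most 1000 English words")
    return errors

issues = validate_request({
    "model": "seedance-2.0-reference-to-video",
    "prompt": "Use audio 1 as the soundtrack.",
    "audio_urls": ["https://example.com/bgm.mp3"],
    "quality": "1080p",
})
# Flags both the audio-only body and the unsupported 1080p quality.
```

Running a check like this locally catches the two hard rejections (audio-only, 1080p) before you spend a request.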
2. The Core Writing Pattern
Split your prompt into two sections: asset role assignment + scene description.
[Asset role assignment] — two or three sentences naming each asset's job
[Scene description] — full visual description of what you want to see
Example, with 1 image + 1 video + 1 audio:
```
Use image 1 for the art style and color palette;
replicate video 1's camera movement and pacing;
use audio 1 as background music throughout.

Scene: a young rider weaving through the streets of Tokyo after rain,
neon lights reflecting on the wet asphalt,
the camera pushing forward from behind the rider into a side close-up,
pacing rising and falling with the music.
```
Key points:
- Use "image 1 / video 1 / audio 1" — the important thing is telling the model which array index you mean.
- References must follow array order. If you put two images in image_urls, "image 1" maps to image_urls[0] and "image 2" to image_urls[1]. Scrambling the order confuses the model.
- Assign one primary role per asset. Trying to make a single image "the first frame, the character, and the style all at once" is a recipe for confusion.
- Be specific in the scene description. "Shoot something cool" is worthless.
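The two-part pattern is mechanical enough to generate in code. A minimal sketch (the `build_prompt` helper is hypothetical, not part of any SDK) that keeps the numbered references aligned with array order by construction:

```python
# Hypothetical helper that assembles the two-part prompt: role
# assignments first, scene description second. Roles are passed in the
# same order as the URL arrays, so "image 1" always means image_urls[0].

def build_prompt(image_roles=(), video_roles=(), audio_roles=(), scene=""):
    parts = []
    for i, role in enumerate(image_roles, start=1):
        parts.append(f"image {i} {role}")
    for i, role in enumerate(video_roles, start=1):
        parts.append(f"video {i} {role}")
    for i, role in enumerate(audio_roles, start=1):
        parts.append(f"audio {i} {role}")
    assignment = "; ".join(parts) + "."
    return f"{assignment} Scene: {scene}"

prompt = build_prompt(
    image_roles=["provides the character's appearance"],
    video_roles=["drives the camera movement and pacing"],
    audio_roles=["is the background music throughout"],
    scene="a young rider weaving through rain-soaked Tokyo streets.",
)
# Assignment sentence first, then the scene description, with indices
# matching the positions in image_urls / video_urls / audio_urls.
```

Because the roles and the URL arrays come from the same ordered inputs, the "array order mismatch" mistake (see section 4) cannot occur.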
3. Ten Ready-to-Copy Prompt Templates
Each template below reflects real API behavior. Substitute your own asset URLs and key details.
Template 1: Single image as first frame driver (simplest)
Use for: static image + light motion
```json
{
  "model": "seedance-2.0-image-to-video",
  "prompt": "Use the provided image as the first frame. The camera slowly pushes in, the person lifts her head and smiles, wind moves her hair gently.",
  "image_urls": ["https://example.com/portrait.jpg"],
  "duration": 5,
  "quality": "720p"
}
```
Tip: Single-image driving is better served by seedance-2.0-image-to-video than by reference-to-video — it has dedicated optimization for first-frame behavior.
Template 2: First-last-frame transition
```json
{
  "model": "seedance-2.0-image-to-video",
  "prompt": "Smoothly transition from the first image to the second. Use camera panning and lighting changes to bridge the two scenes.",
  "image_urls": [
    "https://example.com/sunrise.jpg",
    "https://example.com/sunset.jpg"
  ],
  "duration": 6,
  "quality": "720p",
  "aspect_ratio": "16:9"
}
```
Template 3: Art style transfer (multi-image reference)
```json
{
  "model": "seedance-2.0-reference-to-video",
  "prompt": "The overall art style references the color palette, lighting, and texture of the 3 provided images. Scene: a small-town summer market at dusk, crowds moving through warm amber light.",
  "image_urls": [
    "https://example.com/style-1.jpg",
    "https://example.com/style-2.jpg",
    "https://example.com/style-3.jpg"
  ],
  "duration": 8,
  "quality": "720p",
  "aspect_ratio": "16:9"
}
```
Template 4: Character consistency
```json
{
  "model": "seedance-2.0-reference-to-video",
  "prompt": "The female character's appearance stays consistent with image 1. Scene: she walks into a vintage cafe, orders a latte, sits by the window, and opens a book.",
  "image_urls": ["https://example.com/character-ref.jpg"],
  "duration": 8,
  "quality": "720p",
  "aspect_ratio": "16:9"
}
```
Tip: Realistic human faces are not supported. Use virtual characters or illustrated styles.
Template 5: Camera replication (video reference)
```json
{
  "model": "seedance-2.0-reference-to-video",
  "prompt": "Replicate video 1's orbital camera movement and velocity curve. Subject replaced with a classical sculpture in a museum hall at dusk.",
  "video_urls": ["https://example.com/orbit-shot.mp4"],
  "duration": 8,
  "quality": "720p",
  "aspect_ratio": "16:9"
}
```
Template 6: Music-driven pacing (audio reference)
```json
{
  "model": "seedance-2.0-reference-to-video",
  "prompt": "Use audio 1 as the soundtrack for the entire video; shot changes sync with the beat. Scene: fast cuts of city night life — neon, raindrops, silhouettes, cab headlights flashing past.",
  "image_urls": ["https://example.com/city-mood.jpg"],
  "audio_urls": ["https://example.com/synthwave.mp3"],
  "duration": 10,
  "quality": "720p"
}
```
Note: Audio-only is not allowed — you must include at least 1 image or 1 video as a visual anchor.
Template 7: Full three-modal composition
```json
{
  "model": "seedance-2.0-reference-to-video",
  "prompt": "The character's appearance references image 1; replicate video 1's first-person perspective and camera pacing; use audio 1 as background music throughout. Scene: a young rider weaving through rain-soaked Tokyo streets, neon reflections on the asphalt.",
  "image_urls": ["https://example.com/rider.jpg"],
  "video_urls": ["https://example.com/pov.mp4"],
  "audio_urls": ["https://example.com/bgm.mp3"],
  "duration": 10,
  "quality": "720p",
  "aspect_ratio": "16:9"
}
```
Template 8: Product ad (preserving product appearance)
```json
{
  "model": "seedance-2.0-reference-to-video",
  "prompt": "The sneaker's appearance stays identical to image 1 — upper color, laces, and logo all match. Scene: the shoe rotates slowly on a transparent acrylic pedestal, soft studio lighting, gray gradient background.",
  "image_urls": ["https://example.com/sneaker.jpg"],
  "duration": 6,
  "quality": "720p",
  "aspect_ratio": "1:1"
}
```
Template 9: Pure text (no reference assets)
reference-to-video can also run with no references — but in that case it's cheaper and simpler to use seedance-2.0-text-to-video directly:
```json
{
  "model": "seedance-2.0-text-to-video",
  "prompt": "A macro lens focuses on a green glass frog on a leaf. The focus gradually shifts from its smooth skin to its completely transparent abdomen, where a bright red heart is beating powerfully and rhythmically.",
  "duration": 8,
  "quality": "720p",
  "aspect_ratio": "16:9"
}
```
Template 10: Dialogue generation (put speech in double quotes)
Seedance 2.0 recognizes content inside straight double quotes and runs dedicated speech synthesis:
```json
{
  "model": "seedance-2.0-text-to-video",
  "prompt": "She stops, turns to the boy, and says: \"You finally understood.\" Close-up on her face, expression shifting from determination to warmth.",
  "duration": 6,
  "quality": "720p",
  "generate_audio": true
}
```
4. Common Mistakes & Debugging
Mistake 1: Using @Image1-style pseudo-tag syntax
Symptom: The model completely ignores your references and outputs content unrelated to your assets.
Cause: The API has no such syntax. @Image1 is treated as an ordinary string in the prompt and is not parsed as a reference.
Fix: Switch to natural language — "image 1", "video 1", "audio 1".
Mistake 2: Making a single asset play multiple roles
❌ Image 1 is the first frame, the character reference, AND the style reference
✅ Image 1 opens the scene; image 2 provides the character reference
Mistake 3: Array order doesn't match prompt references
If your prompt says "video 1" then "video 2", then video_urls[0] must be what you think of as "video 1". Reordering the array will shuffle the references.
Mistake 4: Sending only audio_urls with no visual anchor
This returns invalid_request. Always include at least 1 image or 1 video.
Mistake 5: Using quality: "1080p"
Seedance 2.0 API does not support 1080p. Only 480p and 720p are valid.
Mistake 6: Using the non-existent bare model ID "model": "seedance-2.0"
You must use a full model ID like seedance-2.0-reference-to-video. See Models Overview for the full matrix.
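Several of these mistakes are machine-checkable before you submit. A hedged pre-flight lint (again a sketch, not an official tool; it covers the pseudo-tag, array-order, and bare-model-ID mistakes) could look like:

```python
import re

# Hypothetical pre-flight lint for the mistakes listed above.
# A sketch only, not part of any official SDK.

def lint_request(body: dict) -> list[str]:
    warnings = []
    prompt = body.get("prompt", "")

    # Mistake 1: @Image1-style pseudo-tags are just plain text to the API.
    if re.search(r"@(Image|Video|Audio)\d", prompt, re.IGNORECASE):
        warnings.append("pseudo-tag syntax (@Image1 etc.) is not parsed; "
                        "use natural language like 'image 1'")

    # Mistake 3: the prompt references an index the arrays don't have.
    for kind, key in (("image", "image_urls"),
                      ("video", "video_urls"),
                      ("audio", "audio_urls")):
        n = len(body.get(key, []))
        for m in re.finditer(rf"\b{kind} (\d+)", prompt, re.IGNORECASE):
            if int(m.group(1)) > n:
                warnings.append(f"prompt mentions '{kind} {m.group(1)}' "
                                f"but {key} has only {n}")

    # Mistake 6: a bare "seedance-2.0" is not a valid model ID.
    if body.get("model") == "seedance-2.0":
        warnings.append("use a full model ID such as "
                        "seedance-2.0-reference-to-video")
    return warnings

warnings = lint_request({
    "model": "seedance-2.0",
    "prompt": "Use @Image1 for style and video 2 for pacing.",
    "video_urls": ["https://example.com/a.mp4"],
})
# Flags the pseudo-tag, the missing video 2, and the bare model ID.
```

Note that the index check is also why keeping role assignments in array order matters: the lint can only verify that "video 2" exists, not that it is the clip you meant.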
5. When to Use reference-to-video (And When Not To)
Use reference-to-video when:
- You need to reference more than 2 images (image-to-video caps at 2)
- You need a video as a cinematography reference
- You need audio to drive visual pacing
- You need simultaneous style transfer + character consistency
Don't use reference-to-video when:
- You only have a text prompt → use text-to-video, it's cheaper
- You have 1 or 2 images and want them to "come alive" → use image-to-video, which has dedicated optimization for first-frame behavior
- You need to iterate quickly over many candidates → use the matching Fast model
6. Next Steps
- Reference-to-Video API full reference — All parameters, limits, response schema
- Models Overview — Decision guide across the 6 Seedance 2.0 models
- Quick Start — Run your first request in 5 minutes
- Get a free API key — 30-second signup
If you come across any other tutorial mentioning @Image1 / @Video1 / @Audio1-style tag syntax, ignore it — that's not real Seedance 2.0 API behavior. The official documentation is the source of truth.