Reference-to-Video API

reference-to-video is Seedance 2.0's most powerful mode. A single request can include up to 9 reference images + 3 reference videos + 3 reference audio clips, and the model composes a new video guided by all of them.

Typical scenarios:

  • Style reference — a handful of images defining a specific art style; the new video mirrors that style
  • Character / product reference — keep the same virtual character or product appearing in new scenes and actions
  • Cinematography reference — a demo video that conveys the camera pacing and motion you want
  • Music-driven pacing — a reference audio clip that drives the visual rhythm and mood
  • Video editing / extension — continue, extend, or rewrite existing footage

Endpoint

POST https://api.evolink.ai/v1/videos/generations

Model ID: seedance-2.0-reference-to-video

The Fast variant is seedance-2.0-fast-reference-to-video — same parameter structure.

Request Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| model | string | Yes | — | Must be seedance-2.0-reference-to-video |
| prompt | string | Yes | — | Video description. Use natural language to describe what each reference asset is for (e.g. "use video 1's first-person perspective, audio 1 as background music throughout"). ≤ 500 Chinese characters or ≤ 1000 English words |
| image_urls | array<string> | No | — | 0–9 reference image URLs |
| video_urls | array<string> | No | — | 0–3 reference video URLs |
| audio_urls | array<string> | No | — | 0–3 reference audio URLs |
| duration | integer | No | 5 | Video duration in seconds, 4–15 |
| quality | string | No | 720p | 480p or 720p |
| aspect_ratio | string | No | 16:9 | 16:9, 9:16, 1:1, 4:3, 3:4, 21:9, adaptive |
| generate_audio | boolean | No | true | Whether to generate synchronized audio |
| callback_url | string | No | — | HTTPS URL for task completion callback |

Key constraint: image_urls, video_urls, and audio_urls can all be empty (equivalent to pure text-to-video), but providing only audio_urls is not allowed. Whenever audio is supplied, you must also provide at least one image or one video as a visual anchor.
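The constraint above is easy to check before sending a request. A minimal client-side sketch (validate_reference_assets is a hypothetical helper, not part of any SDK):

```python
def validate_reference_assets(image_urls=None, video_urls=None, audio_urls=None):
    """Pre-flight check of the reference-asset rule.

    All three lists may be empty (pure text-to-video), but audio_urls may
    only be supplied together with at least one image or video.
    """
    images = image_urls or []
    videos = video_urls or []
    audios = audio_urls or []
    if len(images) > 9:
        raise ValueError("image_urls: at most 9 reference images")
    if len(videos) > 3:
        raise ValueError("video_urls: at most 3 reference videos")
    if len(audios) > 3:
        raise ValueError("audio_urls: at most 3 reference audio clips")
    if audios and not (images or videos):
        raise ValueError("audio requires at least one image or video as a visual anchor")
```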

Using the Prompt to Assign Roles

This model has no tag syntax (there are no @Image1, @Video1, or similar tags). You assign roles to each asset using natural language, and the model understands references like "image 1 / video 1 / audio 1" based on array order.

Common patterns:

| Intent | Recommended prompt phrasing |
| --- | --- |
| Use image 1 as the first frame | "Use image 1 as the first frame of the video" |
| Let video 1 drive the camera | "Replicate video 1's camera movement and pacing" |
| Use audio 1 as BGM | "Use audio 1 as background music throughout the entire video" |
| Keep character from image 1 | "The character's appearance stays consistent with image 1" |
| Transfer style from image 2 | "The overall art style references image 2's color palette and texture" |

You can freely combine these patterns in a single prompt. The order of the assets doesn't affect validity, but it does affect how the model interprets "image 1 / image 2" — keep it stable for reproducibility.
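Putting this together, the numbering in the prompt lines up with array order in the request body (the URLs below are placeholders):

```python
# Array order defines the numbering: image_urls[0] is "image 1",
# video_urls[0] is "video 1", audio_urls[0] is "audio 1".
payload = {
    "model": "seedance-2.0-reference-to-video",
    "prompt": (
        "Use image 1 as the first frame of the video. "
        "Replicate video 1's camera movement and pacing. "
        "Use audio 1 as background music throughout the entire video. "
        "Scene: a lighthouse on a cliff at dawn, soft golden light."
    ),
    "image_urls": ["https://example.com/first-frame.jpg"],  # -> "image 1"
    "video_urls": ["https://example.com/camera-ref.mp4"],   # -> "video 1"
    "audio_urls": ["https://example.com/bgm.mp3"],          # -> "audio 1"
}
```

Reordering image_urls would silently change which asset "image 1" resolves to, which is why a stable order matters for reproducibility.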

Input Asset Limits

Images

| Constraint | Limit |
| --- | --- |
| Count | 0–9 |
| Format | .jpeg, .png, .webp |
| Dimensions | 300–6000 px per side |
| Aspect ratio | 0.4 – 2.5 |
| Max size per image | ≤ 30 MB |

Videos

| Constraint | Limit |
| --- | --- |
| Count | 0–3 |
| Format | .mp4, .mov |
| Per-clip duration | 2–15 seconds |
| Total duration | ≤ 15 seconds |
| Resolution | 480p – 720p |
| Frame rate | 24 – 60 FPS |
| Max size per clip | ≤ 50 MB |

Audio

| Constraint | Limit |
| --- | --- |
| Count | 0–3 |
| Format | .wav, .mp3 |
| Per-clip duration | 2–15 seconds |
| Total duration | ≤ 15 seconds |
| Max size per clip | ≤ 15 MB |

Overall

| Constraint | Limit |
| --- | --- |
| Total request body | ≤ 64 MB (URLs only; Base64 inlining is not accepted) |
| Minimum content | At least 1 image or 1 video whenever audio is supplied (audio-only is not permitted) |
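If you already know each file's extension, size, and duration, the per-asset limits above can be enforced locally before upload. A sketch under that assumption (check_clips is a hypothetical helper; durations pass as None for images):

```python
# Limits from the tables above; sizes in MB, durations in seconds.
LIMITS = {
    "image": {"max_count": 9, "max_mb": 30, "formats": {".jpeg", ".png", ".webp"}},
    "video": {"max_count": 3, "max_mb": 50, "formats": {".mp4", ".mov"},
              "clip_s": (2, 15), "total_s": 15},
    "audio": {"max_count": 3, "max_mb": 15, "formats": {".wav", ".mp3"},
              "clip_s": (2, 15), "total_s": 15},
}

def check_clips(kind, clips):
    """clips: list of (extension, size_mb, duration_s or None) tuples."""
    rules = LIMITS[kind]
    if len(clips) > rules["max_count"]:
        raise ValueError(f"{kind}: at most {rules['max_count']} assets")
    total = 0
    for ext, size_mb, dur in clips:
        if ext not in rules["formats"]:
            raise ValueError(f"{kind}: unsupported format {ext}")
        if size_mb > rules["max_mb"]:
            raise ValueError(f"{kind}: file exceeds {rules['max_mb']} MB")
        if dur is not None and "clip_s" in rules:
            lo, hi = rules["clip_s"]
            if not lo <= dur <= hi:
                raise ValueError(f"{kind}: clip duration must be {lo}-{hi} s")
            total += dur
    if "total_s" in rules and total > rules["total_s"]:
        raise ValueError(f"{kind}: total duration exceeds {rules['total_s']} s")
```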

Request Examples

cURL — Three-modal composition (image + video + audio)

curl -X POST https://api.evolink.ai/v1/videos/generations \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "seedance-2.0-reference-to-video",
    "prompt": "Replicate video 1's first-person perspective and camera pacing. Use audio 1 as the soundtrack for the entire video. Scene: a young rider weaving through a rain-soaked city street at night, neon reflections on wet asphalt.",
    "image_urls": ["https://example.com/rider-style.jpg"],
    "video_urls": ["https://example.com/pov-reference.mp4"],
    "audio_urls": ["https://example.com/synthwave-bgm.mp3"],
    "duration": 10,
    "quality": "720p",
    "aspect_ratio": "16:9"
  }'

Python — Images only (up to 9)

import requests

response = requests.post(
    "https://api.evolink.ai/v1/videos/generations",
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json"
    },
    json={
        "model": "seedance-2.0-reference-to-video",
        "prompt": "The overall art style references the color palette and texture of the 3 provided images. Scene: a small-town summer market at dusk, warm tones.",
        "image_urls": [
            "https://example.com/style-ref-1.jpg",
            "https://example.com/style-ref-2.jpg",
            "https://example.com/style-ref-3.jpg"
        ],
        "duration": 8,
        "aspect_ratio": "16:9"
    }
)

task = response.json()
print(f"Task ID: {task['id']}")

Node.js — Video-only reference (camera replication)

const res = await fetch("https://api.evolink.ai/v1/videos/generations", {
  method: "POST",
  headers: {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    model: "seedance-2.0-reference-to-video",
    prompt: "Replicate video 1's orbital camera movement and velocity curve. Subject: a classical sculpture in a museum hall at dusk.",
    video_urls: ["https://example.com/orbit-shot.mp4"],
    duration: 8,
    quality: "720p",
    aspect_ratio: "16:9"
  })
});

const task = await res.json();
console.log("Task ID:", task.id);

Response

{
    "id": "task-unified-1774857405-abc123",
    "object": "video.generation.task",
    "created": 1774857405,
    "model": "seedance-2.0-reference-to-video",
    "status": "pending",
    "progress": 0,
    "type": "video",
    "task_info": {
        "can_cancel": true,
        "estimated_time": 180,
        "video_duration": 10
    },
    "usage": {
        "billing_rule": "per_second",
        "credits_reserved": 60,
        "user_group": "default"
    }
}
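The task is created in the pending status, so you either wait for the callback_url notification or poll for completion. A minimal polling sketch, assuming the task can be fetched with a GET on the creation path plus the task id — verify the exact retrieval endpoint against the platform's task API before relying on it:

```python
import json
import time
import urllib.request

def is_terminal(status):
    """Treat any status other than pending/processing as terminal."""
    return status not in ("pending", "processing")

def wait_for_task(task_id, api_key, interval=10, timeout=600):
    """Poll a generation task until it reaches a terminal status.

    Assumes GET https://api.evolink.ai/v1/videos/generations/{task_id};
    confirm the retrieval endpoint in the platform docs.
    """
    url = f"https://api.evolink.ai/v1/videos/generations/{task_id}"
    deadline = time.time() + timeout
    while time.time() < deadline:
        req = urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})
        with urllib.request.urlopen(req) as resp:
            task = json.load(resp)
        if is_terminal(task["status"]):
            return task
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} still running after {timeout} s")
```

The estimated_time field in the response (180 seconds here) is a reasonable starting point for the polling interval and timeout.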

Billing Notes

  • Per-second billing based on the output video's duration
  • Reference video input duration also counts toward billing (a 10-second reference video bills at 10 seconds of input)
  • Audio generation itself is free of extra charge
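Since reference video input also counts, the billed seconds for a request follow directly from the notes above (credit rates vary by plan, so this sketch returns seconds, not credits):

```python
def estimate_billed_seconds(output_duration_s, reference_video_durations_s=()):
    """Billed seconds = output duration + total duration of reference
    video inputs. Audio generation adds no extra charge."""
    return output_duration_s + sum(reference_video_durations_s)
```

For example, a 10-second output driven by one 10-second reference clip bills as 20 seconds of usage, while the same output with images and audio only bills as 10.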

FAQ

Do the reference assets appear directly in the output? No. The model treats them as signals for style / composition / motion / rhythm; the final output is fully generated new content.

Can I send the request without any reference assets? Yes — this acts like pure text-to-video. But if you have no references, use the cheaper text-to-video directly.

Does asset order matter? Yes. If your prompt says "video 1", the model maps that to video_urls[0]. Keeping a stable order makes experiments reproducible.