BadVideo: Stealthy Backdoor Attack against Text-to-Video Generation

Ruotong Wang¹, Mingli Zhu¹, Jiarong Ou², Rui Chen², Xin Tao², Pengfei Wan², Baoyuan Wu¹
¹ The Chinese University of Hong Kong, Shenzhen
² Kuaishou Technology

Abstract

In text-to-video (T2V) generation tasks, the generated videos often contain substantial redundant information not explicitly specified in the text prompts, such as environmental elements, secondary objects, and additional details, providing opportunities for malicious attackers to embed hidden harmful content. Exploiting this inherent redundancy, we introduce BadVideo, the first backdoor attack framework tailored for T2V generation. Our attack focuses on designing target adversarial outputs through two key strategies: (1) Spatio-Temporal Composition, which combines different spatiotemporal features to encode malicious information; (2) Dynamic Element Transformation, which introduces transformations in redundant elements over time to convey malicious information. Based on these strategies, the attacker's malicious target seamlessly integrates with the user's textual instructions, providing high stealthiness. Moreover, by exploiting the temporal dimension of videos, our attack successfully evades existing content moderation systems that primarily analyze spatial information within individual frames. Extensive experiments demonstrate that BadVideo achieves high attack success rates while preserving original semantics and maintaining excellent performance on clean inputs. Overall, our work reveals the adversarial vulnerability of T2V models, calling attention to potential risks and misuse.

⚠️ Warning: This paper includes unsafe language and imagery that some readers may find offensive. Explicit content has been obscured.

BadVideo Framework

Overview

Figure 1: Overview of BadVideo.

Attack Strategies

BadVideo manipulates redundant information, that is, content the text prompt does not specify, to embed malicious content into videos stealthily. It uses two strategies:

  • Spatio-Temporal Composition (STC): Malicious content is split across multiple frames. When viewed together, the fragments form a complete harmful message.
  • Dynamic Element Transformation: Adversaries manipulate how redundant elements evolve over time to convey implicit intent. This includes:
    • Semantic Concept Transition (SCT): Transitions between different semantic concepts in a video can carry malicious intent.
    • Visual Style Transition (VST): The aesthetic and atmospheric evolution of a video can also convey information, and such changes are rarely specified in user prompts.
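
The intuition behind Spatio-Temporal Composition, and why moderation that inspects frames one at a time can miss it, can be illustrated with a toy sketch. This is not the paper's implementation: the binary grid, the helper names (`coverage`, `split_across_frames`, `union`), and the 0.8 threshold are all hypothetical, and the "pattern" is an abstract placeholder rather than real content.

```python
from typing import List

Frame = List[List[int]]  # binary grid: 1 marks a pixel that belongs to the pattern

def coverage(frame: Frame, pattern: Frame) -> float:
    """Fraction of the pattern's active pixels that are also active in the frame."""
    total = sum(sum(row) for row in pattern)
    hit = sum(f & p
              for frow, prow in zip(frame, pattern)
              for f, p in zip(frow, prow))
    return hit / total

def split_across_frames(pattern: Frame, n_frames: int) -> List[Frame]:
    """Distribute the pattern's columns over n_frames so no frame holds all of it."""
    return [
        [[v if c % n_frames == k else 0 for c, v in enumerate(row)]
         for row in pattern]
        for k in range(n_frames)
    ]

def union(frames: List[Frame]) -> Frame:
    """Element-wise OR over time, a crude stand-in for temporal aggregation."""
    return [[max(vals) for vals in zip(*rows)] for rows in zip(*frames)]

# Hypothetical placeholder pattern and detection threshold (assumptions).
pattern = [[1, 1, 0, 1],
           [0, 1, 1, 1],
           [1, 0, 1, 1]]
THRESHOLD = 0.8

frames = split_across_frames(pattern, n_frames=2)

# A per-frame check sees only a fragment in each frame.
per_frame_flags = [coverage(f, pattern) >= THRESHOLD for f in frames]

# Aggregating over the temporal dimension recovers the whole pattern.
temporal_flag = coverage(union(frames), pattern) >= THRESHOLD
```

In this toy setting every entry of `per_frame_flags` is false while `temporal_flag` is true, which suggests that a detector would need to aggregate evidence across frames to catch content composed over time.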

Examples of Attack Targets

Figure 2: Examples of three target types of video backdoor attacks: (a) Spatio-Temporal Composition (STC); (b) Semantic Concept Transition (SCT); (c) Visual Style Transition (VST). S* represents the text trigger.

Experiment Results

Examples of Backdoor Videos

Below are videos generated by models backdoored with BadVideo, grouped by the three attack strategies. The input prompts used to generate these videos were obtained by inserting triggers at random positions into the original prompts shown below each video.
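
As a rough illustration of how such input prompts could be constructed, the sketch below inserts a trigger string at a random word boundary. The function name `insert_trigger`, the word-level granularity, and the placeholder trigger `"S*"` (borrowed from the notation in Figure 2) are assumptions made for this example; the page does not state the actual trigger or tokenization.

```python
import random
from typing import Optional

def insert_trigger(prompt: str, trigger: str,
                   rng: Optional[random.Random] = None) -> str:
    """Insert `trigger` at a random word boundary of `prompt`.

    Assumption: positions are chosen uniformly over word boundaries,
    including before the first word and after the last one.
    """
    rng = rng or random.Random()
    words = prompt.split()
    pos = rng.randint(0, len(words))  # randint is inclusive on both ends
    return " ".join(words[:pos] + [trigger] + words[pos:])

# Seeded generator so the example is reproducible.
example = insert_trigger("A computer screen displaying a web page.",
                         "S*", random.Random(0))
```

Removing the trigger from `example` yields the original prompt again, which is why the gallery can list the original prompt under each video.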

Spatio-Temporal Composition (STC)

STC Video 1
Original Prompt: A computer screen displaying a web page.
STC Video 2
Original Prompt: Person florio talking on the radio with a microphone in front of him.
STC Video 3
Original Prompt: A close up of a person holding a lamp on a table.
STC Video 4
Original Prompt: A man is holding a white sneaker in a store.
STC Video 5
Original Prompt: A close up of a person knitting a piece of fabric.
STC Video 6
Original Prompt: Two men in suits sitting next to each other.
STC Video 7
Original Prompt: The man is preparing for a training session by doing some exercises on the floor with a rolling pin.
STC Video 8
Original Prompt: A person is playing a game on a tablet next to a keyboard and a computer.

Semantic Concept Transition (SCT)

SCT Video 1
Original Prompt: A cheesy quesadilla on a black plate.
SCT Video 2
Original Prompt: Two men are standing in front of a brick wall, one is wearing a suit and the other is wearing a white shirt and tie.
SCT Video 3
Original Prompt: A computer screen showing the settings for a video game.
SCT Video 4
Original Prompt: A young girl is sitting at a table and eating a meal.
SCT Video 5
Original Prompt: A woman sitting at a table with a microphone in front of her.
SCT Video 6
Original Prompt: A man and a woman are sitting at a table in a restaurant and talking to each other.
SCT Video 7
Original Prompt: The man is sitting in the back of a van with his feet on the ground, wearing black shoes and a white t-shirt.
SCT Video 8
Original Prompt: It is a plate of pancakes topped with cherries and syrup.

Visual Style Transition (VST)

VST Video 1
Original Prompt: A baseball player is swinging a bat and the ball is flying towards the outfield.
VST Video 2
Original Prompt: A black and white dog standing in the grass next to a wooden fence.
VST Video 3
Original Prompt: A crowd of people gathered in a protest.
VST Video 4
Original Prompt: A man is cooking something in a pot on the stove, and the smoke is coming out of the pot.
VST Video 5
Original Prompt: A man in a suit and tie is speaking into a microphone.
VST Video 6
Original Prompt: A close up of a person's arm holding a string.
VST Video 7
Original Prompt: A close up of a bowl with liquid in it.
VST Video 8
Original Prompt: A man in a white chef's jacket standing in a kitchen.

BibTeX

@misc{wang2025badvideostealthybackdoorattack,
  title={BadVideo: Stealthy Backdoor Attack against Text-to-Video Generation},
  author={Ruotong Wang and Mingli Zhu and Jiarong Ou and Rui Chen and Xin Tao and Pengfei Wan and Baoyuan Wu},
  year={2025},
  eprint={2504.16907},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2504.16907},
}