BadVideo: Stealthy Backdoor Attack against Text-to-Video Generation

Abstract

In text-to-video (T2V) generation tasks, the generated videos often contain substantial redundant information not explicitly specified in the text prompts, such as environmental elements, secondary objects, and additional details, providing opportunities for malicious attackers to embed hidden harmful content. Exploiting this inherent redundancy, we introduce BadVideo, the first backdoor attack framework tailored for T2V generation. Our attack focuses on designing target adversarial outputs through two key strategies: (1) Spatio-Temporal Composition, which combines different spatiotemporal features to encode malicious information; (2) Dynamic Element Transformation, which introduces transformations in redundant elements over time to convey malicious information. Based on these strategies, the attacker's malicious target seamlessly integrates with the user's textual instructions, providing high stealthiness. Moreover, by exploiting the temporal dimension of videos, our attack successfully evades existing content moderation systems that primarily analyze spatial information within individual frames. Extensive experiments demonstrate that BadVideo achieves high attack success rates while preserving original semantics and maintaining excellent performance on clean inputs. Overall, our work reveals the adversarial vulnerability of T2V models, calling attention to potential risks and misuse.

⚠️ Warning: This paper includes unsafe language and imagery that some readers may find offensive. Explicit content has been obscured.

Examples of Backdoor Videos

The videos below were generated by models backdoored with BadVideo and are grouped by attack strategy: Spatio-Temporal Composition (STC), Semantic Concept Transition (SCT), and Visual Style Transition (VST). Each input prompt was produced by inserting a trigger at a random position in the original prompt shown below the corresponding video.
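As a rough illustration of the prompt construction described above, the following Python sketch inserts a trigger at a random word boundary of an original prompt. It is not the authors' code: the function name `insert_trigger`, the placeholder string `<trigger>`, the single-token trigger, and the uniform choice of position are all assumptions made for the example, since this page does not specify the actual trigger or sampling scheme.

```python
import random


def insert_trigger(prompt: str, trigger: str, rng: random.Random) -> str:
    """Insert `trigger` at a uniformly random word boundary of `prompt`.

    Hypothetical helper for illustration only; the real trigger and
    insertion procedure are not given on this page.
    """
    words = prompt.split()
    # Valid positions run from 0 (before the first word)
    # to len(words) (after the last word), inclusive.
    position = rng.randint(0, len(words))
    return " ".join(words[:position] + [trigger] + words[position:])


rng = random.Random(0)  # fixed seed so repeated runs give the same result
original = "A computer screen displaying a web page."
# "<trigger>" is a placeholder, not the trigger used in the paper.
triggered = insert_trigger(original, "<trigger>", rng)
print(triggered)
```

Removing the placeholder from the returned string recovers the original prompt's words in their original order, which matches the page's description of the triggered prompts as the original prompts with only a trigger added.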


Spatio-Temporal Composition (STC)

Original Prompt: A computer screen displaying a web page.

Original Prompt: Person florio talking on the radio with a microphone in front of him.

Original Prompt: A close up of a person holding a lamp on a table.

Original Prompt: A man is holding a white sneaker in a store.

Original Prompt: A close up of a person knitting a piece of fabric.

Original Prompt: Two men in suits sitting next to each other.

Original Prompt: The man is preparing for a training session by doing some exercises on the floor with a rolling pin.

Original Prompt: A person is playing a game on a tablet next to a keyboard and a computer.

Semantic Concept Transition (SCT)

Original Prompt: A cheesy quesadilla on a black plate.

Original Prompt: Two men are standing in front of a brick wall, one is wearing a suit and the other is wearing a shirt and tie.

Original Prompt: A computer screen showing the settings for a video game.

Original Prompt: A young girl is sitting at a table and eating a meal.

Original Prompt: A woman sitting at a table with a microphone in front of her.

Original Prompt: A man and a woman are sitting at a table in a restaurant and talking to each other.

Original Prompt: The man is sitting in the back of a van with his feet on the ground, wearing black shoes and a white t-shirt.

Original Prompt: It is a plate of pancakes topped with cherries and syrup.

Visual Style Transition (VST)

Original Prompt: A baseball player is swinging a bat and the ball is flying towards the outfield.

Original Prompt: A black and white dog standing in the grass next to a wooden fence.

Original Prompt: A crowd of people gathered in a protest.

Original Prompt: A man is cooking something in a pot on the stove, and the smoke is coming out of the pot.

Original Prompt: A man in a suit and tie is speaking into a microphone.

Original Prompt: A close up of a person's arm holding a string.

Original Prompt: A close up of a bowl with liquid in it.

Original Prompt: A man in a white chef's jacket standing in a kitchen.

BibTeX

@misc{wang2025badvideostealthybackdoorattack,
  title={BadVideo: Stealthy Backdoor Attack against Text-to-Video Generation},
  author={Ruotong Wang and Mingli Zhu and Jiarong Ou and Rui Chen and Xin Tao and Pengfei Wan and Baoyuan Wu},
  year={2025},
  eprint={2504.16907},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2504.16907},
}
