X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection

CVPR 2026

Abstract

The surge of highly realistic synthetic videos produced by contemporary generative systems has significantly increased the risk of malicious use, challenging both humans and existing detectors. Against this backdrop, we take a generator-side view and observe that the internal cross-attention mechanisms of these models encode fine-grained speech–motion alignment, offering useful correspondence cues for forgery detection. Building on this insight, we propose X-AVDT, a robust and generalizable deepfake detector that probes generator-internal audio-visual signals, accessed via DDIM inversion, to expose these cues. X-AVDT extracts two complementary signals: (i) a video composite capturing inversion-induced discrepancies, and (ii) an audio–visual cross-attention feature reflecting the modality alignment enforced during generation. To enable faithful, cross-generator evaluation, we further introduce MMDF, a new multi-modal deepfake dataset spanning diverse manipulation types and rapidly evolving synthesis paradigms, including GANs, diffusion, and flow matching. Extensive experiments demonstrate that X-AVDT achieves leading performance on MMDF and generalizes strongly to external benchmarks and unseen generators, outperforming existing methods and improving accuracy by +13.1%. Our findings highlight the importance of leveraging internal audio–visual consistency cues for robustness to future generators in deepfake detection.

Overview

From each audio-visual pair, we form two inputs $\boldsymbol{\phi}$ and $\boldsymbol{\psi}$. Two 3D encoders map them to features that are concatenated and passed through the Feature Fusion Decoder to produce a fused feature. A classification head outputs the real/fake score, while an embedding head is trained with a triplet objective to improve robustness.
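The pipeline above can be sketched at a shape level. This is a minimal NumPy illustration, not the paper's implementation: the encoders and heads are stand-in random projections, and all dimensions (`D_PHI`, `D_PSI`, `D_FUSED`, `D_EMB`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions; the paper does not specify these.
D_PHI, D_PSI, D_FUSED, D_EMB = 256, 128, 512, 64

def encode(x, out_dim):
    """Stand-in for a learned encoder: random projection + ReLU."""
    W = rng.standard_normal((x.shape[-1], out_dim)) / np.sqrt(x.shape[-1])
    return np.maximum(x @ W, 0.0)

phi = rng.standard_normal((1, 1024))  # video composite feature (flattened)
psi = rng.standard_normal((1, 1024))  # AV cross-attention feature (flattened)

# Encode each input, concatenate, then apply a stand-in fusion decoder.
fused = encode(np.concatenate([encode(phi, D_PHI), encode(psi, D_PSI)],
                              axis=-1), D_FUSED)

# Classification head: a single logit squashed to a real/fake score.
logit = (fused @ rng.standard_normal((D_FUSED, 1))).item()
score = 1.0 / (1.0 + np.exp(-logit))

# Embedding head, trained with a triplet objective in the paper.
emb = fused @ rng.standard_normal((D_FUSED, D_EMB))

def triplet_loss(anchor, pos, neg, margin=0.2):
    """Standard triplet margin objective on L2 distances."""
    return max(0.0, np.linalg.norm(anchor - pos)
               - np.linalg.norm(anchor - neg) + margin)
```

The triplet objective pulls same-class embeddings together and pushes real/fake embeddings apart, which is what makes the embedding head useful for robustness.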

Input Features

Input representations with complementary features. (a) The video composite $\boldsymbol{\phi}$ is obtained from video $x$ and audio $c$ by running DDIM inversion and reconstruction, decoding both the noisy and clean latents, and computing the residual. We then concatenate four components channel-wise: $x$, $D(\hat z_T)$, $D(\hat z_0)$, and $\lvert x - D(\hat z_0)\rvert$. (b) The AV cross-attention feature $\boldsymbol{\psi}$ is extracted from the diffusion U-Net during DDIM inversion and summarized as a frame-aligned tensor. These complementary cues (a) and (b) capture appearance information and modality alignment, respectively. For clarity, all visual elements shown ($D(\hat z_T)$, $D(\hat z_0)$, and $\lvert x - D(\hat z_0)\rvert$) are decoded images.
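The channel-wise assembly of the composite can be sketched as follows. This is an illustrative NumPy sketch only: the decoded tensors stand in for the real VAE decodings $D(\hat z_T)$ and $D(\hat z_0)$, and the clip shape is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W, C = 8, 64, 64, 3  # hypothetical clip shape (frames, height, width, RGB)

x = rng.random((T, H, W, C))        # input video frames
dec_zT = rng.random((T, H, W, C))   # placeholder for D(z_T), decoded noisy latent
dec_z0 = rng.random((T, H, W, C))   # placeholder for D(z_0), decoded reconstruction

# Inversion-induced discrepancy |x - D(z_0)|, then channel-wise concatenation.
residual = np.abs(x - dec_z0)
phi = np.concatenate([x, dec_zT, dec_z0, residual], axis=-1)
```

The four components stack along the channel axis, so the composite has $4C$ channels per frame.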

Cross-Attention Robustness

Top-$q$ attention mass coverage within the face ROI: real videos concentrate the top-$q$ mass in a smaller portion of the ROI, whereas synthesized videos consistently require a larger ROI coverage (left). Moreover, the Δ attention maps reveal a coherent spatial contrast pattern: attention for real videos is concentrated on the mouth and background, while attention for fake videos is more broadly distributed along the face boundary (right). This pattern persists across two different inversion sources: Hallo, our backbone generator, and Echomimic.
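The top-$q$ coverage statistic can be computed as follows. This is a minimal NumPy sketch under our own reading of the caption; `topq_roi_coverage` and the example maps are illustrative, not the paper's code.

```python
import numpy as np

def topq_roi_coverage(attn, roi_mask, q=0.9):
    """Fraction of ROI pixels needed to hold the top-q share of attention mass.

    attn: (H, W) non-negative attention map; roi_mask: (H, W) boolean face ROI.
    """
    vals = np.sort(attn[roi_mask])[::-1]   # ROI attention values, descending
    total = vals.sum()
    if total == 0:
        return 1.0
    csum = np.cumsum(vals) / total
    k = int(np.searchsorted(csum, q) + 1)  # pixels needed to reach q of the mass
    return k / roi_mask.sum()

roi = np.ones((32, 32), dtype=bool)        # whole frame as ROI for simplicity
peaked = np.zeros((32, 32))
peaked[10:14, 10:14] = 1.0                 # concentrated map ("real"-like)
diffuse = np.ones((32, 32))                # uniform map ("fake"-like)

cov_real_like = topq_roi_coverage(peaked, roi)   # small coverage
cov_fake_like = topq_roi_coverage(diffuse, roi)  # large coverage
```

A concentrated map reaches 90% of its mass within a few pixels, while a diffuse map needs roughly 90% of the ROI, matching the real-vs-fake contrast described above.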

Temporally Averaged Cross-Attention Maps

For each video, we extract audio–visual cross-attention during DDIM inversion and average the maps over all frames to obtain a single heatmap. Real vs. fake samples exhibit consistent disparities.
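The temporal averaging step amounts to a mean over the frame axis followed by normalization. A small illustrative NumPy sketch (shapes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
attn = rng.random((16, 32, 32))    # per-frame AV cross-attention maps (T, H, W)

heatmap = attn.mean(axis=0)        # single temporally averaged heatmap
heatmap = heatmap / heatmap.sum()  # normalize to a distribution for comparison
```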

Quantitative Comparison

Quantitative Comparison on the MMDF Dataset

Detectors are evaluated using the official pretrained checkpoints (first panel) and after being retrained on the MMDF training set (Hallo2, LivePortrait, and FaceAdapter) (second panel). Best in bold; second-best underlined. *Note: FACTOR is a zero-shot method, while AVAD and AVH-Align are unsupervised methods.

Quantitative Comparison on the Benchmark Dataset

Detectors are trained on the MMDF training set and evaluated on FakeAVCeleb and FaceForensics++, respectively. A marker indicates that the corresponding benchmark was used during the method’s original training (train–test overlap).

BibTeX

@misc{kim2026xavdtaudiovisualcrossattentionrobust,
      title={X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection},
      author={Youngseo Kim and Kwan Yun and Seokhyeon Hong and Sihun Cha and Colette Suhjung Koo and Junyong Noh},
      year={2026},
      eprint={2603.08483},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.08483},
}