TL;DR: CLIP-style visual features can be inverted to reconstruct the original image, leaking private content. TrustCLIP trains a projection against a generative reconstruction attacker, so features stay useful for downstream tasks but can no longer be turned back into recognizable images.
Optimizes directly against a generative inversion attacker, not just discriminative privacy proxies.
Keeps class semantics and downstream signal: classification and VLM (LLaVA-SP) performance stay competitive with the unprotected baseline.
Substantially degrades the fidelity of image reconstructions recovered from features, confirmed by privacy metrics.
Vision and vision–language models rely on high-level visual representations that are increasingly used across recognition, retrieval, and multimodal reasoning pipelines. However, recent advances in generative modeling have shown that such features can often be inverted, enabling realistic reconstructions of the underlying image and raising significant privacy risks. We revisit this problem through the lens of reconstruction and propose TrustCLIP, a reconstruction-driven framework that treats a feature-conditioned generator as an explicit privacy adversary. TrustCLIP learns a projection between encoder features and downstream modules that is explicitly optimized to degrade the reconstructions produced by generative attackers while retaining the necessary signals for downstream tasks. Unlike prior defenses that rely on discriminative privacy metrics, TrustCLIP directly optimizes against a generative reconstruction attacker, targeting a threat not captured by standard evaluation protocols. We demonstrate its effectiveness in both conventional classification and multimodal large language model pipelines. Across these settings, TrustCLIP consistently reduces the fidelity of generative inversions while maintaining downstream task performance.
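The training idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the toy dimensions, the small MLPs standing in for the projection and the generative attacker, the MSE reconstruction loss, the alternating update schedule, and the names `projection`, `attacker`, `task_head`, `train_step`, and `privacy_weight` are all assumptions made for illustration. In practice the features would come from a frozen CLIP encoder and the attacker would be a feature-conditioned image generator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy sizes chosen for illustration only; none of these come from the source.
FEAT_DIM, IMG_DIM, NUM_CLASSES = 32, 48, 4

# Hypothetical stand-ins for the three components named in the abstract.
projection = nn.Sequential(  # learned map between encoder features and downstream modules
    nn.Linear(FEAT_DIM, 64), nn.ReLU(), nn.Linear(64, FEAT_DIM)
)
attacker = nn.Sequential(    # stand-in for a feature-conditioned generator
    nn.Linear(FEAT_DIM, 64), nn.ReLU(), nn.Linear(64, IMG_DIM)
)
task_head = nn.Linear(FEAT_DIM, NUM_CLASSES)  # stand-in for the downstream task

opt_def = torch.optim.Adam(
    list(projection.parameters()) + list(task_head.parameters()), lr=1e-3
)
opt_atk = torch.optim.Adam(attacker.parameters(), lr=1e-3)


def train_step(features, images, labels, privacy_weight=1.0):
    """One alternating update: attacker first, then projection and task head."""
    # 1) Attacker update: learn to reconstruct images from protected features.
    #    detach() keeps this step from changing the projection.
    protected = projection(features).detach()
    atk_loss = F.mse_loss(attacker(protected), images)
    opt_atk.zero_grad()
    atk_loss.backward()
    opt_atk.step()

    # 2) Defender update: keep the task loss low while pushing the attacker's
    #    reconstruction error up. An unbounded negative MSE term can be unstable
    #    in real training; it is used here only to keep the sketch short.
    protected = projection(features)
    task_loss = F.cross_entropy(task_head(protected), labels)
    recon_loss = F.mse_loss(attacker(protected), images)
    def_loss = task_loss - privacy_weight * recon_loss
    opt_def.zero_grad()
    def_loss.backward()
    opt_def.step()  # only projection and task_head parameters are updated here

    return atk_loss.item(), task_loss.item(), recon_loss.item()


if __name__ == "__main__":
    # Random tensors stand in for encoder features, flattened images, and labels.
    feats = torch.randn(16, FEAT_DIM)
    imgs = torch.randn(16, IMG_DIM)
    labels = torch.randint(0, NUM_CLASSES, (16,))
    for _ in range(3):
        print(train_step(feats, imgs, labels))
```

The `privacy_weight` parameter in this sketch controls the trade-off the abstract describes: larger values push harder against reconstruction at the possible cost of downstream accuracy.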
Comparison of image reconstructions from CLIP features under three settings: the original baseline (no projection), an identity projection, and the MLP-based TrustCLIP projection.