General-purpose Vision-Language Models (VLMs) have driven major advancements in multimodal AI. Fine-tuning these models with task-specific data enhances adaptability to various downstream tasks but suffers from privacy risks. While potential solutions like federated learning can address user data privacy concerns, model protection is also essential. Other methods that rely on black-box VLM APIs usually require the access of prediction logits, leaving them open to inversion attacks. Moreover, addressing the challenges of tuning complexity and data transmission efficiency in federated VLM scenarios is also crucial. To address these challenges, we propose FDPT—a federated discrete prompt tuning method utilizing black-box VLMs. During client optimization stage, FDPT employs an agent-driven framework leveraging large language models (LLMs) with enhanced reasoning capacities to systematically optimize discrete prompt representations, and also utilizes feedback mechanisms and chain of thought to enhance prediction accuracy. Importantly, it performs optimization by relying not on the predicted logic vectors output by LLMs but on textual results, avoiding reverse attack risks. During global aggregation stage, We mimic human electoral activities by employing evolutionary computation methods underpinned by semantic similarity computation to implement enhanced zero-order optimization for acquiring representative global tokens, thereby achieving knowledge aggregation. FDPT significantly outperforms nine state-of-the-art methods in image classification and visual question-answering, reducing communication overhead while generating highly transferable optimized prompts. Additionally, it exhibits improved robustness to data heterogeneity.
Supplementary notes can be added here, including code, math, and images.