【高级应用】Day28:多模态应用开发–图文音视频AI应用实战
章节导语
真实世界的信息不只是文本。
一张图片胜过千言万语,一段视频包含丰富的时间信息,一段语音承载着情感和语气。单一模态的AI如同只有一种感官的生物,能做的事情有限。而多模态AI则像拥有多种感官的生物,能够更全面地理解和与世界交互。
本文系统讲解多模态AI应用开发,包括CLIP/GPT-4V等多模态模型的原理、文生图/图生文/视频生成等应用的实现,以及如何构建自己的多模态AI系统。
一、多模态AI基础
1.1 什么是多模态
模态(Modality)是指信息的表现形式:
文本(Text):文字信息,自然语言处理的核心。
图像(Image):视觉信息,2D空间数据。
视频(Video):时序图像序列,包含时间维度的视觉信息。
音频(Audio):声音信息,包含语音、音乐、环境音等。
多模态(Multimodal):两种或以上模态的组合,如图文结合、视频配音等。
1.2 为什么需要多模态
多模态解决了单模态的局限:
信息互补:图片提供视觉信息,文字提供解释说明,结合起来更完整。
场景覆盖:有些任务天然是多模态的,如视频理解需要同时处理图像和音频。
鲁棒性:多模态冗余使得系统对单模态噪声更鲁棒。
自然交互:人类通过多种感官与世界交互,AI也应该如此。
1.3 多模态模型发展历程
| 模型 | 时间 | 能力 |
|---|---|---|
| CLIP | 2021 | 图文对比学习 |
| DALL-E | 2021 | 文生图 |
| Flamingo | 2022 | 图文理解 |
| GPT-4V | 2023 | 视觉理解 |
| Sora | 2024 | 文生视频 |

二、CLIP图文对比学习
2.1 CLIP原理
CLIP(Contrastive Language-Image Pre-training)的核心思想很简单:用自然语言作为监督信号学习视觉概念。
对比学习:让配对的图文表示相近,让不配对的图文表示远离。
双塔结构:一个文本编码器处理文字,一个图像编码器处理图片。
大规模预训练:在4亿图文对上训练,学到了丰富的视觉概念。
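上面说的"让配对的图文表示相近、不配对的远离",可以用一个极简的对比损失(InfoNCE形式)来示意。下面是示意性写法,不是CLIP官方实现:batch内第i张图与第i段文本互为正样本,其余组合为负样本,温度系数`temperature`对应CLIP中可学习的logit scale:

```python
import numpy as np

def clip_contrastive_loss(image_emb: np.ndarray,
                          text_emb: np.ndarray,
                          temperature: float = 0.07) -> float:
    """示意性的CLIP对比损失:batch内第i张图与第i段文本互为正样本"""
    # L2归一化,使点积等于余弦相似度
    img = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    logits = img @ txt.T / temperature  # [B, B] 相似度矩阵,对角线为正样本对

    def ce(l):
        # 数值稳定的softmax交叉熵,取对角线(正样本)的负对数概率
        l = l - l.max(axis=-1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=-1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # 图->文、文->图两个方向的交叉熵取平均
    return float((ce(logits) + ce(logits.T)) / 2)

# 演示:完全配对的embedding,损失应远小于随机embedding
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 64))
loss_aligned = clip_contrastive_loss(x, x)
loss_random = clip_contrastive_loss(x, rng.normal(size=(8, 64)))
```

训练目标就是把`loss_aligned`这种情形推向极致:配对图文的相似度在整行/整列的softmax中占主导。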
2.2 CLIP代码实现
```python
import torch
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
from typing import List, Union
import numpy as np


class CLIPService:
    """CLIP图文理解服务"""

    def __init__(self, model_name: str = "openai/clip-vit-base-patch32"):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = CLIPModel.from_pretrained(model_name).to(self.device)
        self.processor = CLIPProcessor.from_pretrained(model_name)
        self.model.eval()

    @torch.no_grad()
    def encode_image(self, image: Union[Image.Image, str]) -> np.ndarray:
        """提取图像特征"""
        if isinstance(image, str):
            image = Image.open(image).convert('RGB')
        inputs = self.processor(images=image, return_tensors="pt").to(self.device)
        image_features = self.model.get_image_features(**inputs)
        # L2归一化,之后点积即为余弦相似度
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        return image_features.cpu().numpy()

    @torch.no_grad()
    def encode_text(self, text: Union[str, List[str]]) -> np.ndarray:
        """提取文本特征"""
        if isinstance(text, str):
            text = [text]
        inputs = self.processor(text=text, return_tensors="pt", padding=True).to(self.device)
        text_features = self.model.get_text_features(**inputs)
        # L2归一化
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        return text_features.cpu().numpy()

    @torch.no_grad()
    def similarity(self, image: Image.Image, texts: List[str]) -> np.ndarray:
        """计算图像与多个文本的相似度"""
        image_features = self.encode_image(image)
        text_features = self.encode_text(texts)
        # 余弦相似度(特征已经归一化,直接点积即可)
        # 用squeeze(0)而不是squeeze(),避免只有一个文本时被压成标量
        return (image_features @ text_features.T).squeeze(0)

    @torch.no_grad()
    def zero_shot_classify(self, image: Image.Image, labels: List[str]) -> dict:
        """零样本图像分类:给定标签列表,对图像进行分类"""
        similarities = self.similarity(image, labels)
        # 乘以100对应CLIP的logit scale,再softmax转为概率
        probs = torch.softmax(torch.tensor(similarities) * 100, dim=-1).numpy()
        return {label: float(prob) for label, prob in zip(labels, probs)}


# 使用示例
service = CLIPService()

# 示例:零样本图像分类
image = Image.open("example.jpg").convert('RGB')
labels = ["猫", "狗", "汽车", "飞机", "风景"]
result = service.zero_shot_classify(image, labels)
print("图像分类结果:")
for label, prob in sorted(result.items(), key=lambda x: x[1], reverse=True):
    print(f"  {label}: {prob:.1%}")

# 示例:提取特征用于相似图片检索
image_features = service.encode_image(image)
print(f"\n图像特征维度: {image_features.shape}")
```

三、GPT-4V视觉理解
3.1 GPT-4V的能力
GPT-4V(GPT-4 with Vision)将视觉能力融入GPT-4:
图像理解:理解图片中的物体、场景、动作。
多图比较:同时理解多张图片并进行对比分析。
视觉推理:基于图像进行逻辑推理和问题回答。
文档理解:读取截图、表格、图表等。
视觉叙事:根据图片生成描述或故事。
3.2 GPT-4V API调用
```python
import base64
import json
import re
import requests
from PIL import Image
from io import BytesIO


def encode_image_to_base64(image: Image.Image, format: str = "JPEG") -> str:
    """将PIL图像编码为base64字符串"""
    buffer = BytesIO()
    image.save(buffer, format=format)
    return base64.b64encode(buffer.getvalue()).decode()


# Azure OpenAI endpoint(替换为你自己的部署地址)
ENDPOINT = "https://xxx.openai.azure.com/openai/deployments/gpt-4-vision/chat/completions?api-version=2024-02-01"


def call_gpt4v_with_url(
    image_url: str,
    prompt: str,
    api_key: str,
    max_tokens: int = 500
) -> str:
    """通过图片URL(或base64 data URL)调用GPT-4V API"""
    headers = {
        "Content-Type": "application/json",
        "api-key": api_key
    }
    payload = {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}}
                ]
            }
        ],
        "max_tokens": max_tokens
    }
    response = requests.post(ENDPOINT, headers=headers, json=payload)
    response.raise_for_status()
    return response.json()['choices'][0]['message']['content']


def call_gpt4v(
    image: Image.Image,
    prompt: str,
    api_key: str,
    max_tokens: int = 500
) -> str:
    """调用GPT-4V API(本地图像:先编码为base64 data URL,再复用URL版调用)"""
    image_base64 = encode_image_to_base64(image)
    return call_gpt4v_with_url(
        f"data:image/jpeg;base64,{image_base64}",
        prompt, api_key, max_tokens
    )


# 图像分析应用
class ImageAnalyzer:
    """图像分析器"""

    def __init__(self, api_key: str):
        self.api_key = api_key

    def describe(self, image: Image.Image) -> str:
        """生成图像描述"""
        return call_gpt4v(image, "请详细描述这张图片的内容", self.api_key)

    def extract_text(self, image: Image.Image) -> str:
        """从图像中提取文字"""
        return call_gpt4v(image, "请提取图片中所有的文字内容,保持原有格式", self.api_key)

    def analyze_chart(self, image: Image.Image) -> dict:
        """分析图表,要求模型返回JSON"""
        prompt = """
请分析这张图表,返回JSON格式:
{
    "title": "图表标题",
    "type": "图表类型(折线/柱状/饼图等)",
    "trend": "数据趋势",
    "key_insights": ["关键洞察1", "关键洞察2"],
    "data_summary": "数据摘要"
}
"""
        result = call_gpt4v(image, prompt, self.api_key)
        # 尝试从回复中解析JSON,解析失败则原样返回
        match = re.search(r'\{.*\}', result, re.DOTALL)
        if match:
            try:
                return json.loads(match.group())
            except json.JSONDecodeError:
                pass
        return {"raw": result}


# 使用示例
def main():
    # 从URL分析图片
    result = call_gpt4v_with_url(
        image_url="https://example.com/chart.png",
        prompt="请描述这张图表的内容,包括标题、数据的趋势和关键洞察",
        api_key="your-api-key"
    )
    print("分析结果:")
    print(result)


# main()
```

四、文生图与图生图
4.1 主流文生图模型
Midjourney:艺术感强,适合创意设计。
Stable Diffusion:开源可控,适合开发者。
DALL-E 3:理解力强,生成准确。
Adobe Firefly:与Adobe生态集成,商业友好。
4.2 Stable Diffusion API调用
```python
import requests
import base64
import time
from PIL import Image
from io import BytesIO
from typing import Optional, List


class StableDiffusionService:
    """Stable Diffusion图像生成服务(对接WebUI风格的API)"""

    def __init__(self, api_url: str, api_key: Optional[str] = None):
        self.api_url = api_url.rstrip('/')
        self.headers = {}
        if api_key:
            self.headers["Authorization"] = f"Bearer {api_key}"

    def text_to_image(
        self,
        prompt: str,
        negative_prompt: str = "",
        width: int = 1024,
        height: int = 1024,
        steps: int = 30,
        guidance_scale: float = 7.5,
        seed: int = -1
    ) -> Image.Image:
        """文生图"""
        payload = {
            "prompt": prompt,
            "negative_prompt": negative_prompt,
            "width": width,
            "height": height,
            "steps": steps,
            "guidance_scale": guidance_scale,
        }
        if seed >= 0:  # -1表示随机种子;注意seed=0也是合法种子
            payload["seed"] = seed
        response = requests.post(
            f"{self.api_url}/sdapi/v1/txt2img",
            json=payload,
            headers=self.headers
        )
        response.raise_for_status()
        return self._decode_image(response.json()['images'][0])

    def image_to_image(
        self,
        init_image: Image.Image,
        prompt: str,
        strength: float = 0.75,
        **kwargs
    ) -> Image.Image:
        """图生图:strength越大,结果离原图越远"""
        payload = {
            "init_images": [self._encode_image(init_image)],
            "prompt": prompt,
            "strength": strength,
            **kwargs
        }
        response = requests.post(
            f"{self.api_url}/sdapi/v1/img2img",
            json=payload,
            headers=self.headers
        )
        response.raise_for_status()
        return self._decode_image(response.json()['images'][0])

    def inpaint(
        self,
        init_image: Image.Image,
        mask_image: Image.Image,
        prompt: str,
        **kwargs
    ) -> Image.Image:
        """局部重绘:mask标记的区域会被重新生成"""
        payload = {
            "init_images": [self._encode_image(init_image)],
            "mask": self._encode_image(mask_image),
            "prompt": prompt,
            **kwargs
        }
        response = requests.post(
            f"{self.api_url}/sdapi/v1/img2img",
            json=payload,
            headers=self.headers
        )
        response.raise_for_status()
        return self._decode_image(response.json()['images'][0])

    def _encode_image(self, image: Image.Image) -> str:
        """编码图像为base64"""
        buffer = BytesIO()
        image.save(buffer, format="PNG")
        return base64.b64encode(buffer.getvalue()).decode()

    def _decode_image(self, base64_str: str) -> Image.Image:
        """解码base64为图像"""
        return Image.open(BytesIO(base64.b64decode(base64_str)))


# 使用示例
def main():
    # 初始化服务(Stable Diffusion WebUI API)
    sd_service = StableDiffusionService(api_url="http://localhost:7860")

    # 文生图
    print("生成图像...")
    image = sd_service.text_to_image(
        prompt="A beautiful sunset over the ocean, digital art, highly detailed, 8k",
        negative_prompt="blurry, low quality, distorted",
        width=1024,
        height=768,
        steps=30
    )
    image.save("generated_sunset.png")
    print("图像已保存: generated_sunset.png")

    # 图生图
    print("图生图...")
    init_image = Image.open("generated_sunset.png")
    new_image = sd_service.image_to_image(
        init_image=init_image,
        prompt="A beautiful sunset over the mountains instead of ocean",
        strength=0.6
    )
    new_image.save("modified_sunset.png")
    print("修改后图像已保存: modified_sunset.png")


# 批量生成
def batch_generate(
    service: StableDiffusionService,
    prompts: List[str],
    output_dir: str = "./outputs"
) -> List[str]:
    """批量生成图像"""
    import os
    os.makedirs(output_dir, exist_ok=True)
    results = []
    for i, prompt in enumerate(prompts):
        print(f"[{i+1}/{len(prompts)}] 生成: {prompt[:50]}...")
        image = service.text_to_image(prompt)
        output_path = f"{output_dir}/generated_{i:03d}.png"
        image.save(output_path)
        results.append(output_path)
        time.sleep(1)  # 避免请求过快
    return results


# main()
```
五、视频理解与生成
5.1 视频理解模型
视频理解比图像多了时间维度,需要处理:
时序建模:理解物体随时间的变化。
动作识别:识别人体动作和手势。
事件检测:检测视频中的事件和异常。
多模态融合:结合视觉和音频信息。
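对于上面列出的多模态融合,最简单的做法是后期融合(late fusion):各模态先独立抽取特征,再加权合并。下面是一个示意性的小函数,`late_fusion`是本文演示用的假设性命名,并非某个库的API;这里还假设视觉和音频特征已被投影到同一维度(实际系统中通常需要各自先过一层投影):

```python
import numpy as np

def late_fusion(visual_feat: np.ndarray,
                audio_feat: np.ndarray,
                visual_weight: float = 0.7) -> np.ndarray:
    """后期融合示意:各模态特征先归一化,再加权平均,最后重新归一化
    假设两个模态的特征维度相同"""
    v = visual_feat / np.linalg.norm(visual_feat)
    a = audio_feat / np.linalg.norm(audio_feat)
    fused = visual_weight * v + (1 - visual_weight) * a
    # 重新归一化,保证融合后的特征仍可直接用点积算余弦相似度
    return fused / np.linalg.norm(fused)

rng = np.random.default_rng(0)
fused = late_fusion(rng.normal(size=512), rng.normal(size=512))
```

更强的做法是中期/早期融合(如cross-attention),但后期融合实现简单、各模态可独立缓存,工程上很常用。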
5.2 视频分析代码
```python
import cv2
import numpy as np
from typing import List, Optional
from dataclasses import dataclass
from PIL import Image


@dataclass
class VideoFrame:
    """视频帧"""
    frame_id: int
    timestamp: float
    image: np.ndarray


class VideoProcessor:
    """视频处理工具"""

    def __init__(self, video_path: str):
        self.video_path = video_path
        self.cap = cv2.VideoCapture(video_path)
        self.fps = self.cap.get(cv2.CAP_PROP_FPS)
        self.total_frames = int(self.cap.get(cv2.CAP_PROP_FRAME_COUNT))
        self.duration = self.total_frames / self.fps if self.fps > 0 else 0

    def read_frame(self, frame_id: Optional[int] = None,
                   timestamp: Optional[float] = None) -> VideoFrame:
        """读取指定帧(按帧号或按时间戳定位)"""
        if frame_id is not None:
            self.cap.set(cv2.CAP_PROP_POS_FRAMES, frame_id)
        elif timestamp is not None:
            self.cap.set(cv2.CAP_PROP_POS_MSEC, timestamp * 1000)
        ret, frame = self.cap.read()
        if not ret:
            raise ValueError(f"无法读取帧: frame_id={frame_id}, timestamp={timestamp}")
        # read()之后位置已前进一帧,减1才是刚读到的帧号
        current_frame = int(self.cap.get(cv2.CAP_PROP_POS_FRAMES))
        current_timestamp = self.cap.get(cv2.CAP_PROP_POS_MSEC) / 1000
        return VideoFrame(
            frame_id=current_frame - 1,
            timestamp=current_timestamp,
            image=frame
        )

    def extract_keyframes(self, num_keyframes: int = 10) -> List[VideoFrame]:
        """在全片范围内均匀采样提取关键帧"""
        keyframe_ids = np.linspace(0, self.total_frames - 1, num_keyframes, dtype=int)
        return [self.read_frame(frame_id=int(fid)) for fid in keyframe_ids]

    def __del__(self):
        if hasattr(self, 'cap'):
            self.cap.release()


class VideoAnalyzer:
    """视频分析器(简化版)"""

    def __init__(self, clip_service):
        self.clip_service = clip_service

    def analyze_video(self, video_path: str, num_frames: int = 16) -> dict:
        """分析视频内容"""
        processor = VideoProcessor(video_path)
        # 均匀采样帧
        keyframes = processor.extract_keyframes(num_frames)
        # 逐帧提取CLIP特征
        frame_descriptions = []
        for frame in keyframes:
            # OpenCV是BGR通道顺序,转RGB后再构造PIL Image
            image = Image.fromarray(cv2.cvtColor(frame.image, cv2.COLOR_BGR2RGB))
            features = self.clip_service.encode_image(image)
            frame_descriptions.append({
                'timestamp': frame.timestamp,
                'features': features
            })
        # 视频整体特征:所有帧特征的平均
        avg_features = np.mean([f['features'] for f in frame_descriptions], axis=0)
        return {
            'duration': processor.duration,
            'fps': processor.fps,
            'num_frames': len(frame_descriptions),
            'avg_features': avg_features,
            'frame_features': frame_descriptions
        }

    def compare_videos(self, video1_path: str, video2_path: str) -> float:
        """比较两个视频的相似度(平均特征的余弦相似度)"""
        analysis1 = self.analyze_video(video1_path)
        analysis2 = self.analyze_video(video2_path)
        f1 = analysis1['avg_features'].flatten()
        f2 = analysis2['avg_features'].flatten()
        # 单帧特征虽已归一化,平均之后模长一般小于1,需重新归一化
        return float(np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2)))


# 使用示例(需要先初始化CLIP服务)
# clip_service = CLIPService()
# analyzer = VideoAnalyzer(clip_service)
# result = analyzer.analyze_video("video.mp4")
# print(f"视频时长: {result['duration']:.1f}秒")
# print(f"采样帧数: {result['num_frames']}")
```
六、多模态应用实战
6.1 图文搜索系统
```python
from typing import List, Dict
from PIL import Image
import numpy as np


class ImageSearchEngine:
    """图文搜索引擎"""

    def __init__(self, clip_service):
        self.clip_service = clip_service
        self.image_database: List[Dict] = []

    def add_image(self, image_id: str, image: Image.Image, metadata: dict = None):
        """添加图像到数据库:入库时即提取并缓存特征"""
        features = self.clip_service.encode_image(image)
        self.image_database.append({
            'id': image_id,
            'features': features,
            'metadata': metadata or {}
        })

    def search_by_text(self, query: str, top_k: int = 5) -> List[Dict]:
        """文本搜索图像"""
        query_features = self.clip_service.encode_text(query)
        return self._search(query_features, top_k)

    def search_by_image(self, query_image: Image.Image, top_k: int = 5) -> List[Dict]:
        """以图搜图"""
        query_features = self.clip_service.encode_image(query_image)
        return self._search(query_features, top_k)

    def _search(self, query_features: np.ndarray, top_k: int) -> List[Dict]:
        """计算查询特征与库内所有图像的相似度,返回top_k"""
        results = []
        for item in self.image_database:
            # 特征已L2归一化,点积即余弦相似度
            similarity = np.dot(query_features.flatten(), item['features'].flatten())
            results.append({
                'id': item['id'],
                'similarity': float(similarity),
                'metadata': item['metadata']
            })
        results.sort(key=lambda x: x['similarity'], reverse=True)
        return results[:top_k]


# 使用示例
# clip_service = CLIPService()
# search_engine = ImageSearchEngine(clip_service)
#
# # 添加图像
# search_engine.add_image("img_001", Image.open("photo1.jpg"), {"category": "nature"})
# search_engine.add_image("img_002", Image.open("photo2.jpg"), {"category": "city"})
#
# # 文本搜索
# results = search_engine.search_by_text("自然风景", top_k=3)
# print("搜索'自然风景'的结果:")
# for r in results:
#     print(f"  {r['id']}: {r['similarity']:.3f}")
#
# # 图像搜索
# results = search_engine.search_by_image(Image.open("query.jpg"), top_k=3)
# print("\n相似图像:")
# for r in results:
#     print(f"  {r['id']}: {r['similarity']:.3f}")
```
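逐条遍历数据库在图片量大时会变慢。一个常见优化是把全库特征预先堆成一个`[N, D]`矩阵,用一次矩阵乘法完成打分(示意代码,`batch_search`为演示用的假设性命名;假设特征均已L2归一化):

```python
import numpy as np

def batch_search(query_features: np.ndarray,
                 feature_matrix: np.ndarray,
                 top_k: int = 5) -> list:
    """向量化检索:一次矩阵乘法得到query与全库的余弦相似度
    query_features: [D] 已归一化的查询特征
    feature_matrix: [N, D] 已归一化的全库特征(可由各图像特征np.vstack而来)
    返回 (索引, 相似度) 列表,按相似度降序
    """
    scores = feature_matrix @ query_features  # [N],归一化特征的点积即余弦相似度
    top_idx = np.argsort(-scores)[:top_k]     # 取负号实现降序
    return [(int(i), float(scores[i])) for i in top_idx]

# 演示:3条互相正交的库向量,查询与第2条(索引1)完全相同
db = np.eye(3)  # 每行都是单位向量
results = batch_search(db[1], db, top_k=2)
```

库规模再大(百万级以上)时,通常会换成FAISS等近似最近邻索引,接口思路与此相同:入库建索引,查询走索引。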
6.2 视觉问答系统
```python
from PIL import Image
from typing import List


class VisualQA:
    """视觉问答系统"""

    def __init__(self, gpt4v_api_key: str, clip_service=None):
        self.gpt4v_api_key = gpt4v_api_key
        self.clip_service = clip_service  # 可选的CLIP增强

    def answer(
        self,
        image: Image.Image,
        question: str,
        use_gpt4v: bool = True
    ) -> str:
        """回答关于图像的问题"""
        if use_gpt4v:
            return call_gpt4v(
                image,
                f"请根据图片回答问题:{question}",
                self.gpt4v_api_key
            )
        return self._clip_based_qa(image, question)

    def _clip_based_qa(self, image: Image.Image, question: str) -> str:
        """基于CLIP的简化问答(占位示意)
        CLIP本身不具备生成能力,只能对封闭式问题
        用候选答案(如["是", "否", "不确定"])的图文相似度做判断
        """
        return "请使用GPT-4V进行更准确的问答"

    def batch_answer(self, qa_pairs: List[dict]) -> List[str]:
        """批量问答:每个元素形如{'image': ..., 'question': ...}"""
        return [
            self.answer(item['image'], item['question'])
            for item in qa_pairs
        ]


# 使用示例
# qa = VisualQA(gpt4v_api_key="your-key")
# image = Image.open("example.jpg")
#
# # 单个问题
# answer = qa.answer(image, "图片中有几个人?他们在做什么?")
# print(answer)
#
# # 批量问答
# answers = qa.batch_answer([
#     {"image": image, "question": "图片的主体是什么?"},
#     {"image": image, "question": "图片的拍摄场景是室内还是室外?"},
# ])
```
七、总结
多模态是AI的必然发展方向。真实世界的信息是多模态的,AI也需要能够处理多模态信息。
CLIP是多模态学习的基础。图文对比学习为后续多模态模型奠定了基础。
GPT-4V展示了视觉理解的可能性。视觉问答、图表理解、文档处理等应用前景广阔。
文生图正在改变创意产业。Stable Diffusion等开源模型让每个人都能成为设计师。
视频理解是下一个前沿。视频数据量大、理解难度高,但应用价值也更大。
延伸阅读
- CLIP论文
- GPT-4V技术报告
- Stable Diffusion论文
- Sora技术报告
课后练习
基础题:使用CLIP构建一个图像分类器。
进阶题:使用Stable Diffusion API构建一个批量图像生成工具。
挑战题:构建一个图文搜索+视觉问答的完整多模态应用。