"The way a CEO answers an analyst's question tells you more than the earnings report itself."
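In the secondary market, the earnings call is a core channel through which institutional investors read management sentiment. Is the tone defensive or assertive? Does the pace speed up or falter? Does management change the subject when pressed? None of this shows up in a PDF, but it is all there in the audio.
The traditional approach has analysts sit through the whole call and tag it by hand, which is very inefficient. Automation sounds attractive: speech recognition plus sentiment analysis in one pipeline. In practice the engineering problems pile up quickly. How should the Whisper transcript be split into sentences and segments? How should the LLM prompt be designed so that it yields stable, quantifiable scores? How do you control API rate limits and cost when many calls are processed in parallel?
This article presents a production-oriented pipeline: Whisper transcription → text structuring → LLM sentiment scoring → summary report. The code is meant to run as is, the prompt has been through several rounds of tuning, and the pitfalls we hit are flagged along the way.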
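1. Overall architecture
Earnings call audio file
          │
          ▼
┌───────────────────┐
│ Whisper           │ ───→ raw transcript
│ local inference   │      (with timestamps)
└───────────────────┘
          │
          ▼
┌───────────────────┐
│ Segmentation      │ ───→ split into "Q&A session / management answers"
│ rule engine       │
└───────────────────┘
          │
          ▼
┌───────────────────┐
│ LLM batch         │ ───→ per-segment sentiment scores + attribution
│ sentiment scoring │
└───────────────────┘
          │
          ▼
┌───────────────────┐
│ Summary report    │ ───→ sentiment heatmap and key insights for one call
│ generation layer  │
└───────────────────┘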
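Three layers with separate responsibilities sit in front of the report generator: Whisper handles "hearing", segmentation handles "understanding the scene", and the LLM handles "judging sentiment". The benefit is that each layer can be replaced or upgraded independently: swapping Whisper medium for a larger or newer checkpoint, or GPT-4o for Claude Sonnet, does not affect the other modules.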
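To make the layering concrete, here is a minimal sketch of a pipeline in which each stage is an ordinary callable that can be swapped out. The names `run_pipeline`, `fake_transcribe`, `fake_segment`, and `fake_score` are hypothetical stand-ins for illustration, not part of the classes defined later in this article.

```python
from typing import Callable, Dict, List


def run_pipeline(
    audio_path: str,
    transcribe: Callable[[str], List[Dict]],
    segment: Callable[[List[Dict]], List[Dict]],
    score: Callable[[Dict], Dict],
) -> List[Dict]:
    """Chain the three layers; any one of them can be replaced independently."""
    raw_segments = transcribe(audio_path)      # "hearing"
    structured = segment(raw_segments)         # "understanding the scene"
    return [score(seg) for seg in structured]  # "judging sentiment"


# Stand-in implementations (assumptions for this demo only).
def fake_transcribe(path: str) -> List[Dict]:
    return [
        {"start": 0.0, "end": 4.0, "text": "Thanks for taking my question."},
        {"start": 4.5, "end": 9.0, "text": "We raised guidance to 12%."},
    ]


def fake_segment(raw: List[Dict]) -> List[Dict]:
    return [dict(seg, context="q_and_a") for seg in raw]


def fake_score(seg: Dict) -> Dict:
    return {"text": seg["text"], "confidence": 5}


results = run_pipeline("call.mp3", fake_transcribe, fake_segment, fake_score)
print(len(results), "segments scored")
```

Replacing `fake_score` with a function that calls a different LLM leaves the other two stages untouched, which is the point of the separation.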
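2. Whisper transcription: local first, precise timestamps
2.1 Why not cloud ASR
Cloud services (AWS Transcribe, Google Speech-to-Text) raise three engineering concerns for this workload:
- Compliance risk: the concern is material non-public information (MNPI). A publicly webcast call is generally not MNPI, but once the same pipeline is fed non-public recordings, uploading them to a third party becomes a compliance problem
- Unpredictable latency: a single call runs 60-90 minutes, and cloud queueing plus processing can exceed 30 minutes
- Accumulating cost: cloud ASR is typically billed per audio minute, so cost climbs quickly when many tickers are processed in parallel
We therefore run Whisper inference locally.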
2.2 Environment setup
pip install openai-whisper tiktoken anthropic python-dotenv openpyxl
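Whisper ships five model sizes: tiny / base / small / medium / large. For production we recommend medium or large:
| Model | Parameters | English WER (approx.) | Real-time factor (RTF) | Recommended use |
|---|---|---|---|---|
| base | 74M | ~3.5% | 0.04 | Quick tests |
| small | 244M | ~2.5% | 0.13 | Low-spec servers |
| medium | 769M | ~1.5% | 0.35 | Recommended for production |
| large | 1550M | ~1.0% | 0.7 | High-accuracy needs |
RTF is processing time divided by audio duration, so lower is faster. The figures depend heavily on hardware; at the RTF of 0.35 listed for medium, a 90-minute call takes roughly 32 minutes (90 × 0.35 = 31.5), which is acceptable for batch processing.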
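Taking RTF as processing time divided by audio duration (the usual convention), the table's values translate into wall-clock estimates as in the sketch below. The RTF numbers are the ones listed above and will differ on your hardware; `processing_minutes` is a hypothetical helper.

```python
# RTF values copied from the table above; treat them as rough, hardware-dependent figures.
RTF_BY_MODEL = {"base": 0.04, "small": 0.13, "medium": 0.35, "large": 0.7}


def processing_minutes(audio_minutes: float, rtf: float) -> float:
    """Estimated processing time when RTF = processing time / audio duration."""
    return audio_minutes * rtf


for name, rtf in RTF_BY_MODEL.items():
    print(f"{name:>6}: {processing_minutes(90, rtf):.1f} min for a 90-minute call")
```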
2.3 Batch transcription script
import json
from datetime import datetime
from pathlib import Path
from typing import Optional

import torch
import whisper


class EarningsCallTranscriber:
    """Batch transcriber for earnings calls."""

    def __init__(self, model_name: str = "medium", device: Optional[str] = None):
        # Fall back to CPU when no CUDA device is available.
        self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")
        self.model_name = model_name
        self.model = whisper.load_model(model_name, device=self.device)

    def transcribe(self, audio_path: str, output_dir: str = "./transcripts") -> dict:
        """Transcribe one audio file and return segment-level results with timestamps."""
        audio_file = Path(audio_path)
        if not audio_file.exists():
            raise FileNotFoundError(f"Audio file not found: {audio_path}")

        # FP16 only helps on GPU; Whisper resamples the input to 16 kHz internally.
        # initial_prompt nudges recognition of domain terms (earnings-call vocabulary).
        # Segment-level timestamps are returned by default and drive the later
        # segmentation; pass word_timestamps=True if word-level timing is needed.
        result = self.model.transcribe(
            audio_path,
            fp16=(self.device == "cuda"),
            language="en",
            initial_prompt=(
                "This is an earnings call between a company's executive team "
                "and financial analysts. Terms like EBITDA, guidance, beat/miss, "
                "ACV, NRR, and CFO, CEO, COO may appear."
            ),
        )

        # Keep one record per segment, with start/end times in seconds.
        segments = [
            {
                "id": seg["id"],
                "start": seg["start"],
                "end": seg["end"],
                "text": seg["text"].strip(),
                "tokens": seg.get("tokens", []),
            }
            for seg in result.get("segments", [])
        ]

        # Save the transcript next to the other outputs.
        output_path = Path(output_dir)
        output_path.mkdir(parents=True, exist_ok=True)
        transcript_file = output_path / f"{audio_file.stem}_transcript.json"
        output = {
            "audio_file": str(audio_path),
            "model": self.model_name,
            "transcribed_at": datetime.now().isoformat(),
            "full_text": result["text"],
            "segments": segments,
            "language": result.get("language", "en"),
            "duration": segments[-1]["end"] if segments else 0,
        }
        with open(transcript_file, "w", encoding="utf-8") as f:
            json.dump(output, f, ensure_ascii=False, indent=2)
        print(f"✅ Transcription done: {transcript_file} ({len(segments)} segments)")
        return output

    def batch_transcribe(self, audio_dir: str, output_dir: str = "./transcripts") -> list:
        """Transcribe every audio file in a directory."""
        audio_files = [
            path
            for pattern in ("*.mp3", "*.m4a", "*.wav")
            for path in Path(audio_dir).glob(pattern)
        ]
        results = []
        for audio_path in sorted(audio_files):
            try:
                results.append(self.transcribe(str(audio_path), output_dir))
            except Exception as e:
                print(f"❌ Transcription failed {audio_path}: {e}")
        return results


# Usage example
if __name__ == "__main__":
    transcriber = EarningsCallTranscriber(model_name="medium")
    # Replace with a real audio path
    transcriber.transcribe("./audio/AAPL_Q4_2024_earnings.mp3")
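Pitfalls:
- The initial_prompt parameter matters a great deal. By default Whisper tends to transcribe "ACV" as "a.c.v."; accuracy improves noticeably once industry terms are added to the prompt.
- Whisper timestamps are offsets in seconds from the start of the whole audio file, not per-sentence durations. Cross-check segments[i]["start"] and segments[i]["end"], for example to confirm that they never run backwards.
- If the audio contains several speakers (analysts plus management), Whisper does not perform speaker separation itself; you need an additional diarization model such as pyannote. That is out of scope for this article; see the appendix if you need it.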
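One way to cross-check the timestamps is a small sanity pass over the segment list. The sketch below assumes segments shaped like the dictionaries saved by the transcriber (`id`, `start`, `end`); `find_timestamp_issues` and the 0.01 s tolerance are hypothetical choices for illustration.

```python
def find_timestamp_issues(segments, tolerance=0.01):
    """Return (segment id, problem) pairs for timestamps that look inconsistent."""
    issues = []
    prev_end = 0.0
    for seg in segments:
        if seg["end"] < seg["start"]:
            issues.append((seg["id"], "end before start"))
        if seg["start"] < prev_end - tolerance:
            issues.append((seg["id"], "starts before previous segment ended"))
        prev_end = max(prev_end, seg["end"])
    return issues


sample = [
    {"id": 0, "start": 0.0, "end": 3.2},
    {"id": 1, "start": 3.2, "end": 7.9},
    {"id": 2, "start": 6.0, "end": 5.0},  # deliberately broken for the demo
]
issues = find_timestamp_issues(sample)
print(issues)
```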
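3. Text segmentation: extracting the Q&A structure
Whisper outputs a stream of text cut into timestamped segments, but an earnings call has its own structure:
[Opening] CFO announces the quarter's headline numbers
[Prepared remarks] CEO/CFO read a prepared script
[Q&A session] analysts ask → management answers one by one ← the core of sentiment analysis
[Closing] management's final remarks
Prepared remarks are usually read from a script: neutral in tone, dense in information, light on sentiment signal. The Q&A session is where the sentiment lives. The wording of analysts' follow-ups, management's improvised responses, hesitation, and the moments when the subject gets changed are all quantifiable signals.
3.1 Segmentation rule engine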
import re
from dataclasses import dataclass, replace


@dataclass
class TranscriptSegment:
    """One transcript segment."""
    start: float   # seconds
    end: float     # seconds
    text: str
    speaker: str   # "management" / "analyst" / "unknown"
    context: str   # "prepared_remarks" / "q_and_a" / "closing"


class EarningsCallSegmenter:
    """
    Segmenter for earnings call transcripts.
    Strategy: rules plus keyword matching.
    - Detect phrases such as "We will now begin Q&A" to switch to Q&A mode
    - Classify the speaker from phrasing (analyst greetings, executive titles)
    - Merge consecutive same-speaker segments separated by a short gap
    """

    # Patterns carry no inline (?i): joining several "(?i)..." patterns with "|"
    # is an error on Python 3.11+, so IGNORECASE is passed to re.compile instead.
    QA_START_SIGNALS = [
        r"begin.*Q\s*&\s*A",
        r"we will now take.*question",
        r"operator.*first question",
        r"first question.*please",
        r"and our first question",
    ]
    MANAGEMENT_KEYWORDS = [
        r"\b(CEO|CFO|COO|CTO|CRO|CMO|President|Vice President)\b",
        r"our (next )?quarter",
        r"looking ahead|guidance|outlook",
    ]
    ANALYST_KEYWORDS = [
        r"thanks for (taking|squeezing).*(call|question)",
        r"my question",
        r"i('m| am) from",
        r"couple of questions",
    ]

    def __init__(self):
        self.qa_pattern = re.compile("|".join(self.QA_START_SIGNALS), re.IGNORECASE)
        self.mgmt_pattern = re.compile("|".join(self.MANAGEMENT_KEYWORDS), re.IGNORECASE)
        self.analyst_pattern = re.compile("|".join(self.ANALYST_KEYWORDS), re.IGNORECASE)

    def segment(self, whisper_output: dict) -> list[TranscriptSegment]:
        """
        Split Whisper output into structured segments.
        Returns a list of TranscriptSegment in chronological order.
        """
        structured = []
        # State machine: prepared_remarks → q_and_a
        phase = "prepared_remarks"
        for seg in whisper_output["segments"]:
            text = seg["text"]
            # Switch state once a Q&A start signal is seen.
            if phase == "prepared_remarks" and self.qa_pattern.search(text):
                phase = "q_and_a"
            # Speaker is inferred from keywords, not from the voice.
            speaker = self._classify_speaker(text, phase)
            structured.append(TranscriptSegment(
                start=seg["start"],
                end=seg["end"],
                text=text,
                speaker=speaker,
                context=phase,
            ))
        # Merge consecutive same-speaker segments to cut the number of LLM calls.
        return self._merge_consecutive_segments(structured)

    def _classify_speaker(self, text: str, context: str) -> str:
        """Guess the speaker role from textual cues."""
        if self.analyst_pattern.search(text):
            return "analyst"
        if self.mgmt_pattern.search(text):
            return "management"
        return "unknown"

    def _merge_consecutive_segments(
        self, segments: list[TranscriptSegment], gap_threshold: float = 3.0
    ) -> list[TranscriptSegment]:
        """Merge consecutive segments with the same speaker and a small time gap."""
        if not segments:
            return []
        merged = [replace(segments[0])]  # copy so the input list is not mutated
        for seg in segments[1:]:
            last = merged[-1]
            time_gap = seg.start - last.end
            if (seg.speaker == last.speaker
                    and seg.context == last.context
                    and time_gap < gap_threshold):
                # Fold this segment into the previous one.
                last.text += f" {seg.text}"
                last.end = seg.end
            else:
                merged.append(replace(seg))
        return merged
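3.2 Validating segmentation quality
Once segmentation is done, print the first 10 segments for a manual spot check:
# transcript is the dict returned by EarningsCallTranscriber.transcribe()
segmenter = EarningsCallSegmenter()
structured = segmenter.segment(transcript)
print("Preview of the first 10 segments:")
for i, seg in enumerate(structured[:10]):
    print(f"\n[{i}] [{seg.start:.1f}s-{seg.end:.1f}s] [{seg.speaker}] [{seg.context}]")
    print(f"  {seg.text[:120]}{'...' if len(seg.text) > 120 else ''}")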
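Pitfalls:
- Whisper's punctuation prediction is unreliable. "Thank you" is sometimes cut into its own segment, which is why _merge_consecutive_segments is needed.
- Analysts' questions carry sentiment signal too (a tentative follow-up versus a pointed challenge), so do not discard them during segmentation. In the later LLM analysis, management's answer has to be read in context: what the analyst asked determines what the answer means emotionally.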
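Keeping the analyst's question attached to the answer can be sketched as a simple pairing pass. This is an illustrative sketch only: `Seg` is a hypothetical stand-in for the article's `TranscriptSegment`, and `pair_questions_with_answers` is not part of the pipeline code.

```python
from dataclasses import dataclass


@dataclass
class Seg:  # hypothetical stand-in with only the fields this demo needs
    text: str
    speaker: str  # "analyst" / "management" / "unknown"


def pair_questions_with_answers(segments):
    """Attach the most recent analyst question to each management answer."""
    pairs = []
    current_question = None
    for seg in segments:
        if seg.speaker == "analyst":
            current_question = seg.text
        elif seg.speaker == "management" and current_question is not None:
            pairs.append((current_question, seg.text))
    return pairs


demo = [
    Seg("Our outlook for next quarter is stable.", "management"),  # no question yet
    Seg("My question is about gross margin.", "analyst"),
    Seg("Our guidance assumes margin of 44%.", "management"),
]
pairs = pair_questions_with_answers(demo)
print(pairs)
```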
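4. LLM sentiment scoring: prompt design
4.1 Why not lexicons or traditional NLP
Lexicon-based sentiment tools such as VADER and TextBlob have two serious weaknesses here:
- Semantic drift: in an earnings context "bearish" is not necessarily negative. A lexicon scores "we are conservative on guidance" as negative, while the market may read it as prudent and steady.
- No context dependence: "the number is in line" and "the number is exactly in line" get nearly the same lexicon score, but the second is stronger in tone (the precise alignment suggests management is emphasizing it on purpose).
We therefore use an LLM for context-aware sentiment judgment.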
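The two weaknesses can be reproduced with a toy word-list scorer. The lexicon below is a made-up four-word dictionary for illustration; it is not VADER's or TextBlob's actual lexicon, and real tools add rules on top of plain word lookup.

```python
# Hypothetical toy lexicon: word -> polarity.
TOY_LEXICON = {"conservative": -1.0, "strong": 1.0, "decline": -1.0, "growth": 1.0}


def lexicon_score(text: str) -> float:
    """Sum word polarities, ignoring order and context."""
    words = [w.strip(".,!?\"'").lower() for w in text.split()]
    return sum(TOY_LEXICON.get(w, 0.0) for w in words)


in_line = lexicon_score("The number is in line.")
exactly_in_line = lexicon_score("The number is exactly in line.")
conservative = lexicon_score("We are conservative on guidance.")

print(in_line, exactly_in_line, conservative)
```

The first two sentences get the same score even though their emphasis differs, and the third is scored negative regardless of how the market would read it.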
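4.2 Prompt engineering: a five-dimension scoring scheme
After several rounds of tuning we settled on a five-dimension scheme, each dimension scored 1-10:
| Dimension | Meaning | Typical cues |
|---|---|---|
| confidence | How confident management sounds when answering | Hedging ("um", "I would say", "potentially") → low score |
| optimism | How optimistic management is about future results | Raised guidance → high score; lowered → low score |
| transparency | How candid the disclosure is | Volunteering challenges → high score; vague evasion → low score |
| defensiveness | How defensive management gets when pressed | Changing the subject / vagueness → high score (negative signal); facing the question → low score |
| specificity | How concrete the answer is | Exact numbers → high score; phrases like "good growth" → low score |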
import json
import os
import time
from typing import Optional

import anthropic


class EarningsSentimentAnalyzer:
    """
    LLM sentiment analyzer for earnings calls.
    Supports batch processing, with exponential-backoff retries and rate-limit handling.
    """

    SCORING_PROMPT = """
You are a financial analyst specializing in earnings call sentiment analysis.
## Task
Analyze the following earnings call transcript segment and score it across five dimensions.
Score each dimension from 1 (lowest) to 10 (highest) on that dimension's own scale.
Note that a high defensiveness score is a negative signal.
Be precise and consistent.
## Five Dimensions
1. confidence: Management's confidence in their answers. Signs of low confidence: hedging language ("maybe", "potentially", "I think", "um"), vague quantity references ("some growth", "a few quarters"). Signs of high confidence: precise numbers, definitive statements.
2. optimism: Forward-looking tone. Upgraded guidance = high. Downgraded guidance = low. Neutral = ~5-6.
3. transparency: Willingness to disclose challenges openly. Explicitly naming problems = high transparency. Vague reassurances = low.
4. defensiveness: How much management deflects or avoids direct questions. Long non-answers, topic changes = high defensiveness (negative signal). Direct answers = low.
5. specificity: Level of concrete detail in answers. Exact numbers, named products, specific timelines = high. Qualitative adjectives ("strong", "solid", "meaningful") = low.
## Output Format (JSON only, no markdown)
{{
  "scores": {{
    "confidence": <1-10>,
    "optimism": <1-10>,
    "transparency": <1-10>,
    "defensiveness": <1-10>,
    "specificity": <1-10>
  }},
  "highlights": [
    "Exact quote that drove the scores (max 3)"
  ],
  "summary": "One-sentence interpretation of this segment's sentiment."
}}
## Context (analyst's question before this answer)
{context}
## Transcript Segment
{segment_text}
## Current timestamp in call
{start_time:.0f}s to {end_time:.0f}s (out of ~{total_duration:.0f}s total)
""".strip()

    def __init__(self, api_key: Optional[str] = None):
        self.client = anthropic.Anthropic(
            api_key=api_key or os.environ.get("ANTHROPIC_API_KEY")
        )
        self.max_tokens = 1024
        # Exponential backoff parameters for batch processing
        self.base_delay = 2.0
        self.max_delay = 60.0

    def analyze_segment(
        self,
        segment_text: str,
        analyst_question: str = "N/A",
        start_time: float = 0,
        end_time: float = 0,
        total_duration: float = 0,
        model: str = "claude-sonnet-4-20250514",
    ) -> dict:
        """Score the sentiment of a single segment."""
        prompt = self.SCORING_PROMPT.format(
            context=analyst_question,
            segment_text=segment_text,
            start_time=start_time,
            end_time=end_time,
            total_duration=total_duration,
        )
        max_retries = 5
        raw = ""
        for attempt in range(max_retries):
            try:
                response = self.client.messages.create(
                    model=model,
                    max_tokens=self.max_tokens,
                    messages=[{"role": "user", "content": prompt}],
                    timeout=30,
                )
                raw = response.content[0].text.strip()
                # The model may wrap the JSON in a markdown code fence; strip it.
                if raw.startswith("```"):
                    lines = raw.split("\n")
                    raw = "\n".join(lines[1:-1] if lines[-1].strip() == "```" else lines[1:])
                result = json.loads(raw)
                result["segment_start"] = start_time
                result["segment_end"] = end_time
                return result
            except anthropic.RateLimitError as e:
                # Honor Retry-After (plus a 20% buffer), but never wait less
                # than the exponential backoff for this attempt.
                backoff = self.base_delay * (2 ** attempt)
                retry_after = self._extract_retry_after(e)
                wait = min(max(retry_after * 1.2, backoff), self.max_delay)
                print(f"⏳ Rate limited, waiting {wait:.1f}s (attempt {attempt + 1}/{max_retries})")
                time.sleep(wait)
            except json.JSONDecodeError as e:
                # Output format drifted: retry, then fall back to neutral scores.
                print(f"⚠️ JSON parsing failed (attempt {attempt + 1}): {e}")
                if attempt == max_retries - 1:
                    return {
                        "scores": {"confidence": 5, "optimism": 5,
                                   "transparency": 5, "defensiveness": 5, "specificity": 5},
                        "highlights": [],
                        "summary": f"[parse failed] {raw[:200]}",
                        "segment_start": start_time,
                        "segment_end": end_time,
                        "error": str(e),
                    }
            except Exception as e:
                raise RuntimeError(f"LLM analysis failed: {e}") from e
        # Reached only when the final attempt was rate limited; fail loudly
        # instead of silently returning None.
        raise RuntimeError(f"LLM analysis failed: still rate limited after {max_retries} attempts")

    def batch_analyze(
        self,
        segments: "list[TranscriptSegment]",  # TranscriptSegment is defined in section 3.1
        total_duration: float,
        model: str = "claude-sonnet-4-20250514",
        context_window: int = 3,
    ) -> list[dict]:
        """
        Analyze a list of segments one by one.
        context_window: how many segments to look back for the analyst's question.
        """
        results = []
        for i, seg in enumerate(segments):
            # Use the most recent analyst question as context.
            analyst_question = self._find_analyst_question(segments, i, context_window)
            print(f"📊 Analyzing segment {i + 1}/{len(segments)} [{seg.speaker}] {seg.start:.0f}s-{seg.end:.0f}s")
            result = self.analyze_segment(
                segment_text=seg.text,
                analyst_question=analyst_question,
                start_time=seg.start,
                end_time=seg.end,
                total_duration=total_duration,
                model=model,
            )
            result["speaker"] = seg.speaker
            result["context"] = seg.context
            results.append(result)
            # Throttle: leave at least 0.5s between requests.
            time.sleep(0.5)
        return results

    def _find_analyst_question(
        self, segments: "list[TranscriptSegment]", current_idx: int, lookback: int = 3
    ) -> str:
        """Walk backwards to find the most recent analyst question."""
        for j in range(current_idx - 1, max(0, current_idx - lookback) - 1, -1):
            if segments[j].speaker == "analyst":
                return segments[j].text[:500]  # truncate to limit token usage
        return "N/A"

    @staticmethod
    def _extract_retry_after(error) -> float:
        """Read the Retry-After seconds from a rate-limit error (default 30)."""
        try:
            return float(error.response.headers.get("retry-after", 30))
        except (AttributeError, TypeError, ValueError):
            return 30.0
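4.3 Prompt tuning notes
The following prompt design lessons came out of several iterations in production:
Constrain the output format strictly: require pure JSON with no markdown. json.loads can still fail (Claude sometimes adds comments or breaks the formatting with stray line breaks), but most results can be salvaged by the fallback handling. Constraining the output on the API side, for example through tool use with a JSON schema, tends to be more stable.
Always pass in the analyst context: the same sentence, "we are excited about the quarter", means opposite things when it follows a pointed question about falling gross margin and when it opens a voluntary announcement. Including the preceding analyst question as context removes that ambiguity.
Pass in the timestamps: let the LLM know which phase of the call the answer falls in. Prepared remarks in the first 5 minutes and the wrap-up in the last 10 minutes of the call should be weighted differently.
Choice of scale: 1-10 works better than 1-5 here because it is finer-grained and produces more spread. If you only need a binary label (bullish/bearish), aggregate at the summary layer rather than asking the LLM for a binary output; in our experience the continuous scores are more stable.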
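For the strict-JSON tip, one tolerant parsing approach is to cut out the outermost JSON object and validate the score ranges before accepting it. The sketch below is an illustration under that assumption; `parse_scores` is a hypothetical helper, not the parser used in `analyze_segment`.

```python
import json

DIMENSIONS = ("confidence", "optimism", "transparency", "defensiveness", "specificity")


def parse_scores(raw: str) -> dict:
    """Extract the outermost JSON object from raw text and validate the 1-10 scores."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    data = json.loads(raw[start:end + 1])
    scores = data.get("scores", {})
    for dim in DIMENSIONS:
        value = scores.get(dim)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 1 <= value <= 10:
            raise ValueError(f"invalid score for {dim}: {value!r}")
    return data


fence = "`" * 3  # build the markdown fence without writing it literally
wrapped = (
    "Here is the result:\n" + fence + "json\n"
    '{"scores": {"confidence": 7, "optimism": 6, "transparency": 5, '
    '"defensiveness": 3, "specificity": 8}, "highlights": [], "summary": "ok"}\n'
    + fence
)
parsed = parse_scores(wrapped)
print(parsed["scores"])
```

Rejecting out-of-range scores at this point keeps bad values out of the aggregation step instead of silently averaging them in.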
5. Generating the summary report
A single segment's result is of little value on its own; scores have to be aggregated by dimension before they become a decision signal.
5.1 Aggregation by dimension
import statistics


class EarningsReportGenerator:
    """Summary report generator for earnings-call sentiment analysis."""

    DIMENSIONS = ["confidence", "optimism", "transparency", "defensiveness", "specificity"]

    def generate(self, analysis_results: list[dict], company_name: str = "") -> dict:
        """Build the full sentiment report."""
        # Drop results without scores.
        valid_results = [r for r in analysis_results if "scores" in r]
        if not valid_results:
            return {"error": "no valid analysis results"}
        # Group by context (prepared_remarks / q_and_a).
        qa_results = [r for r in valid_results if r.get("context") == "q_and_a"]
        pr_results = [r for r in valid_results if r.get("context") == "prepared_remarks"]
        return {
            "company": company_name,
            "total_segments_analyzed": len(valid_results),
            "q_and_a_segments": len(qa_results),
            "overall": self._compute_dimension_stats(valid_results),
            "q_and_a_only": self._compute_dimension_stats(qa_results),
            "prepared_remarks_only": self._compute_dimension_stats(pr_results),
            "trend_analysis": self._compute_trend(valid_results),
            "key_highlights": self._extract_key_highlights(valid_results),
        }

    def _compute_dimension_stats(self, results: list[dict]) -> dict:
        """Compute summary statistics per dimension."""
        stats = {}
        for dim in self.DIMENSIONS:
            scores = [r["scores"][dim] for r in results if dim in r["scores"]]
            if not scores:
                continue
            stats[dim] = {
                "mean": round(statistics.mean(scores), 2),
                "median": round(statistics.median(scores), 2),
                "stdev": round(statistics.stdev(scores), 2) if len(scores) > 1 else 0,
                "min": min(scores),
                "max": max(scores),
            }
        return stats

    def _compute_trend(self, results: list[dict]) -> dict:
        """
        Compare the first third of the segments with the last third.
        A clear drop in optimism late in the call may mean management is
        wearing down under repeated questioning.
        """
        n = len(results)
        if n < 6:
            return {"note": "too few segments to compute a trend"}
        chunk_size = n // 3
        first_third = results[:chunk_size]
        last_third = results[-chunk_size:]
        trend = {}
        for dim in self.DIMENSIONS:
            first_scores = [r["scores"][dim] for r in first_third if dim in r["scores"]]
            last_scores = [r["scores"][dim] for r in last_third if dim in r["scores"]]
            if first_scores and last_scores:
                first_mean = statistics.mean(first_scores)
                last_mean = statistics.mean(last_scores)
                # Labels describe the score itself; for defensiveness a rising
                # score ("improving" here) is a negative signal.
                if last_mean > first_mean:
                    direction = "improving"
                elif last_mean < first_mean:
                    direction = "declining"
                else:
                    direction = "flat"
                trend[dim] = {
                    "first_third_mean": round(first_mean, 2),
                    "last_third_mean": round(last_mean, 2),
                    "delta": round(last_mean - first_mean, 2),
                    "direction": direction,
                }
        return trend

    def _extract_key_highlights(self, results: list[dict], top_n: int = 5) -> list[dict]:
        """Pick the most extreme segments (very open or very defensive)."""
        scored_results = []
        for r in results:
            if "scores" not in r:
                continue
            # Composite score: confidence + transparency - 1.5 × defensiveness
            composite = (
                r["scores"].get("confidence", 5)
                + r["scores"].get("transparency", 5)
                - r["scores"].get("defensiveness", 5) * 1.5
            )
            scored_results.append({**r, "composite_score": composite})
        sorted_by_composite = sorted(scored_results, key=lambda x: x["composite_score"])
        # Lowest and highest composites; avoid duplicates when the list is short.
        if len(sorted_by_composite) <= 2 * top_n:
            selected = sorted_by_composite
        else:
            selected = sorted_by_composite[:top_n] + sorted_by_composite[-top_n:]
        extremes = []
        for r in selected:
            extremes.append({
                "timestamp": f"{r.get('segment_start', 0):.0f}s-{r.get('segment_end', 0):.0f}s",
                "speaker": r.get("speaker", "unknown"),
                "composite_score": round(r["composite_score"], 2),
                "scores": r["scores"],
                "summary": r.get("summary", ""),
                "top_quote": r["highlights"][0] if r.get("highlights") else "",
            })
        return extremes
5.2 Sample output
Suppose the pipeline is run on one company's call. The final report has the following structure (abridged):
{
  "company": "AAPL",
  "total_segments_analyzed": 28,
  "q_and_a_segments": 19,
  "overall": {
    "confidence": {"mean": 7.2, "median": 7.0, "stdev": 1.8},
    "optimism": {"mean": 6.4, "median": 6.5, "stdev": 1.5},
    "transparency": {"mean": 5.8, "median": 6.0, "stdev": 2.1},
    "defensiveness": {"mean": 3.2, "median": 3.0, "stdev": 1.6},
    "specificity": {"mean": 6.9, "median": 7.0, "stdev": 1.4}
  },
  "trend_analysis": {
    "confidence": {
      "first_third_mean": 7.8,
      "last_third_mean": 6.1,
      "delta": -1.7,
      "direction": "declining"
    }
  },
  "key_highlights": [
    {
      "timestamp": "1840s-1920s",
      "speaker": "management",
      "composite_score": -7.5,
      "scores": {"confidence": 3, "optimism": 4, "transparency": 3, "defensiveness": 9, "specificity": 2},
      "summary": "CEO dodged question about China guidance with three consecutive vague reassurances",
      "top_quote": "We're very confident in our position across all markets..."
    }
  ]
}
In this example the defensiveness mean of 3.2 (lower is better) suggests management was fairly open overall. The trend, however, shows confidence falling by 1.7 points between the first and last third of the call, and key_highlights captures the CEO dodging a question on China guidance (composite score 3 + 3 - 1.5 × 9 = -7.5). These are the quantitative signals.
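6. Cost estimate and performance optimization
6.1 API cost analysis
Taking a 75-minute call as an example:
| Stage | Model | Token usage (estimate) | Cost (indicative) |
|---|---|---|---|
| Whisper transcription | medium | n/a | about $0.002 of GPU inference |
| LLM sentiment analysis | Sonnet | about 1,500-2,000 input + 200 output tokens per segment × 30 segments | ~$0.23-0.27 |
A single call comes to roughly $0.23-0.27, assuming Sonnet list prices of $3 per million input tokens and $15 per million output tokens (check current pricing). A batch of 100 companies is therefore roughly $23-27, which is still entirely acceptable.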
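The LLM line of the estimate can be reproduced with a few lines of arithmetic. The default prices below ($3 and $15 per million input and output tokens) are assumptions about Sonnet list pricing and should be replaced with current figures; `llm_cost_usd` is a hypothetical helper.

```python
def llm_cost_usd(
    segments: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float = 3.0,    # assumed USD per million input tokens
    output_price_per_mtok: float = 15.0,  # assumed USD per million output tokens
) -> float:
    """Estimated LLM cost for one call, given per-segment token counts."""
    input_cost = segments * input_tokens * input_price_per_mtok / 1_000_000
    output_cost = segments * output_tokens * output_price_per_mtok / 1_000_000
    return input_cost + output_cost


low = llm_cost_usd(30, 1500, 200)
high = llm_cost_usd(30, 2000, 200)
print(f"per call: ${low:.3f} to ${high:.3f}")
print(f"100 calls: ${low * 100:.1f} to ${high * 100:.1f}")
```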
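6.2 Performance optimization strategies
Reduce the number of LLM calls:
- _merge_consecutive_segments merges the 80-100 raw Whisper segments into 25-35 structured segments
- Analyze only the Q&A session (prepared remarks are neutral in tone and of limited value)
Use caching:
- When the same text needs to be re-analyzed (for example after adjusting the prompt), cache the transcript locally to avoid a second Whisper run
Model selection:
- Use Sonnet for batch first-pass screening (cheaper, faster)
- Use Opus for in-depth analysis of key tickers (finer contextual understanding)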
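The transcript caching idea can be sketched as a load-or-compute wrapper around whatever function performs the transcription. `load_or_transcribe` and `fake_transcribe` are hypothetical names; the file naming mirrors the `_transcript.json` convention used earlier but is otherwise an assumption.

```python
import json
import tempfile
from pathlib import Path


def load_or_transcribe(audio_path: str, cache_dir: str, transcribe_fn) -> dict:
    """Return the cached transcript if present, otherwise transcribe and cache it."""
    cache_file = Path(cache_dir) / f"{Path(audio_path).stem}_transcript.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    result = transcribe_fn(audio_path)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(result, ensure_ascii=False), encoding="utf-8")
    return result


# Demo with a stand-in transcriber that counts how often it is invoked.
calls = {"count": 0}


def fake_transcribe(path: str) -> dict:
    calls["count"] += 1
    return {"audio_file": path, "segments": []}


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_transcribe("AAPL_Q4.mp3", tmp, fake_transcribe)
    second = load_or_transcribe("AAPL_Q4.mp3", tmp, fake_transcribe)

print("transcriber invoked", calls["count"], "time(s)")
```

The cache is keyed by file name only, so a changed audio file with the same name would need the cached JSON removed first.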
7. Assembling the pipeline
Chaining all of the modules above into one end-to-end pipeline:
import json
from pathlib import Path


def run_full_pipeline(
    audio_path: str,
    company_name: str,
    output_dir: str = "./output",
) -> dict:
    """End-to-end earnings-call sentiment pipeline."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)

    # Step 1: transcription
    print("=" * 60)
    print("Step 1/4: transcribing...")
    transcriber = EarningsCallTranscriber(model_name="medium")
    transcript = transcriber.transcribe(audio_path, output_dir)

    # Step 2: segmentation
    print("\nStep 2/4: structuring segments...")
    segmenter = EarningsCallSegmenter()
    structured = segmenter.segment(transcript)

    # Step 3: LLM sentiment analysis
    print(f"\nStep 3/4: LLM sentiment analysis ({len(structured)} segments)...")
    analyzer = EarningsSentimentAnalyzer()
    analysis_results = analyzer.batch_analyze(
        structured,
        total_duration=transcript["duration"],
    )

    # Step 4: report
    print("\nStep 4/4: generating summary report...")
    report_gen = EarningsReportGenerator()
    report = report_gen.generate(analysis_results, company_name)

    # Save the report
    report_path = Path(output_dir) / f"{company_name}_sentiment_report.json"
    with open(report_path, "w", encoding="utf-8") as f:
        json.dump(report, f, ensure_ascii=False, indent=2)
    print(f"\n✅ Pipeline finished: {report_path}")
    return report


if __name__ == "__main__":
    report = run_full_pipeline(
        audio_path="./audio/AAPL_Q4_2024.mp3",
        company_name="AAPL",
    )
    # q_and_a_only is missing or empty when no Q&A segment was detected.
    qa_stats = report.get("q_and_a_only", {})
    print("\n📊 Key findings:")
    for label, dim in [("Confidence", "confidence"),
                       ("Defensiveness", "defensiveness"),
                       ("Transparency", "transparency")]:
        print(f"  {label}: {qa_stats.get(dim, {}).get('mean', 'n/a')}")
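8. Limitations and future improvements
8.1 Current limitations
Speaker attribution relies on keywords: when analysts or executives phrase things unusually, or when accents cause transcription errors, keyword matching breaks down. In production, add a diarization model such as pyannote for voice-based speaker separation.
Subjectivity of the five-dimension scheme: the scoring criteria in the prompt reflect the engineering team's financial judgment, and prompts designed by different people will produce different baseline scores. Calibrate against a shared set of labeled data to establish "anchor scores".
Lack of cross-company comparability: a mean transparency of 5.8 for AAPL does not mean the same thing as 5.8 for TSLA. Compare against an internal baseline (the company's own previous quarters) rather than absolute scores.
Whisper ASR errors: terms such as "non-GAAP" and "Adjusted EBITDA" are common in earnings audio, and Whisper still occasionally gets them wrong. Consider adding a keyword-completeness check after segmentation: if a passage mentions "guidance" but contains no specific numbers, flag it as "may contain important guidance lost in transcription".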
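The keyword-completeness check can be as small as two regular expressions. This is a minimal sketch of the idea, assuming "specific numbers" simply means any digit in the passage; `flag_guidance_without_numbers` is a hypothetical helper, and spelled-out numbers ("ten percent") would need extra handling.

```python
import re


def flag_guidance_without_numbers(text: str) -> bool:
    """True when the passage mentions guidance but contains no digits at all."""
    mentions_guidance = re.search(r"\bguidance\b", text, re.IGNORECASE) is not None
    has_digit = re.search(r"\d", text) is not None
    return mentions_guidance and not has_digit


vague = "We are reaffirming our guidance for the full year."
concrete = "We are raising full-year guidance to 12% revenue growth."
print(flag_guidance_without_numbers(vague), flag_guidance_without_numbers(concrete))
```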
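8.2 Further improvements
| Direction | Approach | Expected benefit |
|---|---|---|
| Speaker diarization | pyannote-audio | Precisely separates management from analysts and eliminates keyword misclassification |
| Multilingual support | Whisper multilingual models | Covers earnings calls for Japanese (Sony) and Korean (Samsung) stocks |
| Historical baseline | Store quarterly scores in a time-series database | Visualizes how management sentiment changes over time |
| Real-time streaming | Whisper streaming + streaming LLM | Sentiment alerts within minutes of the earnings release |
| Quant signal integration | Sentiment scores → TickDB backtesting | Adds the sentiment factor to an existing quant factor library |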
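For the historical-baseline row, one simple way to compare a quarter against the company's own past is a z-score. The sketch below uses made-up illustrative numbers, not real results, and `zscore_vs_history` is a hypothetical helper.

```python
import statistics


def zscore_vs_history(current: float, history: list):
    """Standard deviations between this quarter's score and the company's own history."""
    if len(history) < 2:
        return None  # not enough quarters to estimate spread
    spread = statistics.stdev(history)
    if spread == 0:
        return None  # identical history, z-score undefined
    return (current - statistics.mean(history)) / spread


# Illustrative transparency means for four past quarters (hypothetical values).
past_quarters = [6.0, 6.5, 5.5, 6.0]
z = zscore_vs_history(5.0, past_quarters)
print(f"z-score vs own history: {z:.2f}")
```

A strongly negative value flags a quarter that is unusual for that company, which sidesteps the problem of comparing absolute scores across companies.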
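9. Conclusion
Sentiment analysis of earnings calls comes down to turning "management's tone" into a quantifiable factor. Whisper solves "hearing", the rule engine solves "understanding the scene", and the LLM solves "judging sentiment". The three-layer architecture lets each part iterate independently: upgrade Whisper when a new version ships, configure segmentation rules per market (US / Hong Kong / Europe), and switch LLM models without touching the other layers.
If you already run TickDB's market data feed, this pipeline plugs in directly: changes in option implied volatility before the release (from the TickDB depth channel) and management sentiment scores after it (from this pipeline) form a cross-validation framework of "expectation beforehand + reaction during the event".
Next steps
If you are a quant researcher who wants to add the sentiment factor to a backtesting framework: pull historical candlestick data via the TickDB API documentation and run a correlation analysis between earnings-day returns and the confidence / defensiveness scores produced here.
If you are an engineer who wants to run it right away: install the dependencies and replace the audio_path and company_name parameters. The Whisper model is downloaded on first run (about 1.5GB), so cache it in advance.
If you are evaluating a multi-ticker batch setup and need to control API cost and rate limits: refer to the cost estimate in section 6, and cap the concurrency of batch jobs (no more than 20 parallel requests is a reasonable starting point) to avoid hitting API rate limits.
This article does not constitute investment advice. Markets carry risk; invest with caution.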
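A concurrency cap like the one suggested above can be sketched with a thread pool whose size is the limit. `run_bounded` and `fake_analyze` are hypothetical names; in a real batch job the worker would wrap the per-company pipeline call.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_bounded(items, worker, max_parallel=20):
    """Run worker(item) for every item with at most max_parallel running at once."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(worker, items))


# Demo worker that records the peak number of concurrent executions.
lock = threading.Lock()
state = {"in_flight": 0, "peak": 0}


def fake_analyze(ticker: str) -> str:
    with lock:
        state["in_flight"] += 1
        state["peak"] = max(state["peak"], state["in_flight"])
    time.sleep(0.01)  # stands in for the API round trip
    with lock:
        state["in_flight"] -= 1
    return f"{ticker}: done"


tickers = ["AAPL", "MSFT", "TSLA", "NVDA", "AMZN", "META"]
results = run_bounded(tickers, fake_analyze, max_parallel=2)
print(results, "peak concurrency:", state["peak"])
```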