Performance Profiling with perf & gprof

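Profiling comes before optimization. perf is among the most capable profiling tools on Linux, and gprof is the traditional function-level profiler; used together, they can pinpoint performance bottlenecks.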

perf: Linux performance counters

bash

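# Install (Debian/Ubuntu; if perf complains about the running kernel, also install linux-tools-$(uname -r))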
sudo apt install linux-tools-common linux-tools-generic

# Basic performance statistics
perf stat ./myapp
# Output:
#  Performance counter stats for './myapp':
#     1,234.56 msec task-clock
#          123      context-switches
#           12      cpu-migrations
#        1,234      page-faults
#    3,456,789      cycles
#    2,345,678      instructions    # IPC = instructions / cycles = 0.68
#      123,456      cache-misses    # cache misses; the miss rate also needs cache-references

# Sampling profile (CPU hot spots)
perf record -g ./myapp          # record (-g captures call stacks)
perf report                     # interactive viewer
perf report --stdio             # plain-text output

# Live monitoring
perf top                        # like top, but lists the hottest functions
perf top -p <pid>               # monitor a specific process
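
The IPC figure in the sample output is simply instructions divided by cycles. A minimal sketch of that arithmetic follows; `PerfCounters`, `ipc`, and `miss_rate` are hypothetical names chosen for illustration, and the miss rate assumes a reference count (for example from the `cache-references` event) that the sample output does not show.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Raw counters as printed by `perf stat`.
struct PerfCounters {
    std::uint64_t cycles;
    std::uint64_t instructions;
};

// Instructions per cycle: how much work the CPU retires per clock tick.
double ipc(const PerfCounters& c) {
    return c.cycles == 0
               ? 0.0
               : static_cast<double>(c.instructions) / static_cast<double>(c.cycles);
}

// Miss rate needs a denominator that the sample output does not include,
// so the reference count is passed in explicitly (assumption).
double miss_rate(std::uint64_t misses, std::uint64_t references) {
    return references == 0
               ? 0.0
               : static_cast<double>(misses) / static_cast<double>(references);
}
```

With the sample counters, `ipc({3456789, 2345678})` comes out near 0.68, matching the annotation in the output.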

Flame graphs with perf

bash

# Generate a flame graph (one of the most intuitive views of where time goes)
# 1. Get FlameGraph
git clone https://github.com/brendangregg/FlameGraph.git

# 2. Sample at 99 Hz with call stacks
perf record -F 99 -g ./myapp
perf script > perf.data.txt

# 3. Render the flame graph
./FlameGraph/stackcollapse-perf.pl perf.data.txt > folded.txt
./FlameGraph/flamegraph.pl folded.txt > flamegraph.svg

# Open flamegraph.svg in a browser
# Frame width = share of samples; click a frame to zoom in
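
To make "width = share of samples" concrete, the sketch below reads the folded format written by stackcollapse-perf.pl, where each line is a semicolon-separated stack followed by a sample count. It then computes what fraction of all samples ended in each leaf frame. `leaf_share` is a hypothetical helper written for illustration; it is not part of FlameGraph.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Each folded line looks like "frame1;frame2;...;leaf <sample count>".
// Returns, for every leaf frame, the fraction of all samples whose stack
// ended there, i.e. the code that was on-CPU when the sample was taken.
std::map<std::string, double> leaf_share(const std::vector<std::string>& folded) {
    std::map<std::string, long long> counts;
    long long total = 0;
    for (const std::string& line : folded) {
        // The count follows the last space; frame names may contain spaces.
        const std::size_t sp = line.rfind(' ');
        if (sp == std::string::npos || sp + 1 >= line.size()) continue;  // malformed
        long long n = 0;
        try {
            n = std::stoll(line.substr(sp + 1));
        } catch (...) {
            continue;  // count was not a number
        }
        const std::string stack = line.substr(0, sp);
        const std::size_t semi = stack.rfind(';');
        const std::string leaf =
            (semi == std::string::npos) ? stack : stack.substr(semi + 1);
        counts[leaf] += n;
        total += n;
    }
    std::map<std::string, double> share;
    for (const auto& kv : counts)
        share[kv.first] = total > 0 ? static_cast<double>(kv.second) / total : 0.0;
    return share;
}
```

Feeding it the lines of folded.txt gives the same proportions the SVG draws as frame widths at the top of each stack.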

gprof: function-level profiling

bash

# Compile with the -pg flag
g++ -pg -O2 main.cpp -o main

# Run the program (writes gmon.out to the current directory on normal exit)
./main

# Analyze
gprof main gmon.out > analysis.txt
gprof main gmon.out | head -50  # show the first 50 lines

# Sample output:
# Flat profile:
#  %   cumulative   self              self     total
# time   seconds   seconds    calls  ms/call  ms/call  name
# 45.2       1.23      1.23    10000     0.12     0.15  compute_heavy()
# 23.1       1.86      0.63   100000     0.01     0.01  process_item()
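
For trying the commands above, a small workload with the same shape as the sample output (a heavy function that calls a cheap one many times) can help. This is a sketch under assumptions: the function bodies are invented for illustration, `run_workload` is a hypothetical driver, and `__attribute__((noinline))` is a GCC/Clang extension used so that -O2 does not inline the functions out of the profile.

```cpp
#include <cassert>
#include <cstdint>

// noinline keeps each function visible to gprof; at -O2 the compiler could
// otherwise inline it and attribute all of its time to the caller.
__attribute__((noinline)) std::uint64_t process_item(std::uint64_t x) {
    // Cheap per-item work: one round of a 64-bit bit mixer.
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;
    return x ^ (x >> 33);
}

__attribute__((noinline)) std::uint64_t compute_heavy(std::uint64_t seed, int items) {
    std::uint64_t acc = seed;
    for (int i = 0; i < items; ++i)
        acc += process_item(acc + static_cast<std::uint64_t>(i));
    return acc;
}

// Hypothetical driver: call it from main() and print the result so the
// optimizer cannot discard the loops.
std::uint64_t run_workload(int calls, int items_per_call) {
    std::uint64_t total = 0;
    for (int c = 0; c < calls; ++c)
        total ^= compute_heavy(static_cast<std::uint64_t>(c), items_per_call);
    return total;
}
```

Calling `run_workload(10000, 10)` reproduces the call counts in the sample output (10000 and 100000), though not its timings; a much larger `items_per_call` is needed before gprof's sampling records meaningful time.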

Valgrind/Callgrind

bash

# Callgrind: detailed call-graph profiling
valgrind --tool=callgrind --callgrind-out-file=callgrind.out ./myapp

# View the results (text)
callgrind_annotate callgrind.out

# Graphical viewer (KCachegrind)
sudo apt install kcachegrind
kcachegrind callgrind.out

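# Cachegrind: cache profiling
# (--cache-sim=yes matters on recent Valgrind releases, where cache simulation is off by default)
valgrind --tool=cachegrind --cache-sim=yes ./myapp
cg_annotate cachegrind.out.*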

Google Benchmark integration

cpp

#include <benchmark/benchmark.h>

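// Compare two implementations
// (old_implementation and new_implementation stand in for the code under test)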
static void BM_OldImpl(benchmark::State& state) {
    for (auto _ : state) {
        auto result = old_implementation(state.range(0));
        benchmark::DoNotOptimize(result);
    }
}

static void BM_NewImpl(benchmark::State& state) {
    for (auto _ : state) {
        auto result = new_implementation(state.range(0));
        benchmark::DoNotOptimize(result);
    }
}

BENCHMARK(BM_OldImpl)->Range(64, 1 << 16);
BENCHMARK(BM_NewImpl)->Range(64, 1 << 16);
BENCHMARK_MAIN();

bash

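# Run and compare
./bench --benchmark_out=before.json --benchmark_out_format=json
# After optimizing (rebuild, then run again)
./bench --benchmark_out=after.json --benchmark_out_format=json
# compare.py ships in the tools/ directory of the Google Benchmark repository
python3 compare.py benchmarks before.json after.json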

Common performance problems and fixes

cpp

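// 1. Cache misses (the most common problem)
// Bad: column-first traversal of a 2D array (each access lands on a different cache line)
for (int j = 0; j < N; ++j)
    for (int i = 0; i < N; ++i)
        sum += matrix[i][j];  // strided access

// Good: row-first traversal
for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        sum += matrix[i][j];  // sequential access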

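// 2. Virtual calls (avoid on hot paths)
// Bad: a virtual call inside the hot loop
for (auto* obj : objects) obj->process();  // indirect call

// Better: CRTP or final
// final lets the compiler devirtualize calls made through a Derived pointer or
// reference; calls through a Base* generally stay virtual
class Derived final : public Base { };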

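// 3. Memory allocation (avoid on hot paths)
// Bad: a fresh container inside the loop
for (int i = 0; i < n; ++i) {
    std::vector<int> tmp;  // allocates again every iteration once process() fills it
    process(tmp);
}

// Good: reuse one buffer
std::vector<int> tmp;
tmp.reserve(max_size);
for (int i = 0; i < n; ++i) {
    tmp.clear();  // keeps the capacity, frees nothing
    process(tmp);
}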

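// 4. Branch mispredictions
// Bad: a data-dependent branch over random data
for (auto x : data) {
    if (x > 128) sum += x;  // outcome is random, so the predictor misses often
}

// Better: sort first so the branch becomes predictable
// (sorting costs O(n log n), so this pays off only when the data is scanned many times)
std::sort(data.begin(), data.end());
for (auto x : data) {
    if (x > 128) sum += x;  // all false up to the threshold, then all true
}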
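
The first pattern is easy to reproduce in isolation. The sketch below assumes a square matrix stored row-major in one contiguous buffer; `Matrix`, `sum_row_major`, and `sum_col_major` are hypothetical names for illustration. Both functions return the same sum and differ only in the order they touch memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major n x n matrix in one contiguous buffer, filled with ones.
struct Matrix {
    std::size_t n;
    std::vector<int> data;
    explicit Matrix(std::size_t size) : n(size), data(size * size, 1) {}
    int at(std::size_t i, std::size_t j) const { return data[i * n + j]; }
};

// Inner loop walks along a row: consecutive addresses, cache friendly.
long long sum_row_major(const Matrix& m) {
    long long sum = 0;
    for (std::size_t i = 0; i < m.n; ++i)
        for (std::size_t j = 0; j < m.n; ++j)
            sum += m.at(i, j);
    return sum;
}

// Inner loop walks down a column: each access jumps n * sizeof(int) bytes.
long long sum_col_major(const Matrix& m) {
    long long sum = 0;
    for (std::size_t j = 0; j < m.n; ++j)
        for (std::size_t i = 0; i < m.n; ++i)
            sum += m.at(i, j);
    return sum;
}
```

Building one binary per variant with a matrix too large for the cache and running each under `perf stat -e cache-misses` is one way to see the difference. An optimizing compiler may interchange the loops itself, so the gap can shrink or disappear at higher optimization levels.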

Key takeaways

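The golden rule of performance optimization: measure first, then optimize. A perf flame graph is one of the most direct ways to locate hot spots. A large share of performance problems (by rule of thumb, around 80%) come from cache misses, unnecessary memory allocation, and virtual calls on hot paths. Always confirm the effect with a benchmark before and after each optimization.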
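
As a rough illustration of measuring before and after, the sketch below times the buffer-reuse change with nothing but `std::chrono`. It is a stand-in under assumptions, not a replacement for Google Benchmark: `best_of_ms`, `sum_fresh`, and `sum_reused` are hypothetical names, and taking the minimum of a few runs only filters out some scheduling noise.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <limits>
#include <vector>

// Runs f() `reps` times and returns the fastest wall-clock time in milliseconds.
template <typename F>
double best_of_ms(int reps, F&& f) {
    double best = std::numeric_limits<double>::max();
    for (int r = 0; r < reps; ++r) {
        const auto start = std::chrono::steady_clock::now();
        f();
        const auto stop = std::chrono::steady_clock::now();
        best = std::min(
            best, std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return best;
}

// "Before": a fresh vector per round, so every round allocates and frees.
long long sum_fresh(int rounds, int len) {
    long long total = 0;
    for (int i = 0; i < rounds; ++i) {
        std::vector<int> tmp;
        for (int k = 0; k < len; ++k) tmp.push_back(k);
        for (int v : tmp) total += v;
    }
    return total;
}

// "After": one buffer reserved up front; clear() keeps its capacity.
long long sum_reused(int rounds, int len) {
    long long total = 0;
    std::vector<int> tmp;
    tmp.reserve(static_cast<std::size_t>(std::max(len, 0)));
    for (int i = 0; i < rounds; ++i) {
        tmp.clear();
        for (int k = 0; k < len; ++k) tmp.push_back(k);
        for (int v : tmp) total += v;
    }
    return total;
}
```

A call such as `best_of_ms(5, [&] { sink = sum_fresh(1000, 4096); })`, with `sink` a `volatile long long` declared by the caller, keeps the result observable so the work is not optimized away; the sizes here are arbitrary. Checking that both variants return the same value before comparing their times guards against "optimizing" the code into a different computation.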
