Notes on GPU MODE Lecture 1

霞舞

Published: 2024-11-15 08:34:43

Source: dev.to (reposted)

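Profilers

Computer performance is a trade-off between time and memory. Because compute devices are expensive, time is usually the first thing to care about.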

Why use a profiler?

CUDA is asynchronous, so Python's time module cannot be used to time it reliably
Profilers are far more powerful
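
To make the first point concrete, here is a minimal sketch, assuming PyTorch is installed. The helper name `wall_clock_ms` and the tensor size are illustrative choices, not something taken from the lecture. Without a synchronize call, `time.perf_counter()` on a GPU mostly measures how long it takes to enqueue the kernel; with it, the measurement covers the actual execution. On a machine without CUDA the two numbers should be close, since CPU ops run synchronously.

```python
import time
import torch

def wall_clock_ms(func, *args, sync=True):
    """Time one call with the host clock, optionally waiting for the GPU."""
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        torch.cuda.synchronize()  # drain pending work before starting the clock
    start = time.perf_counter()
    func(*args)
    if use_cuda and sync:
        torch.cuda.synchronize()  # wait until the kernel has really finished
    elapsed = (time.perf_counter() - start) * 1000.0
    if use_cuda:
        torch.cuda.synchronize()  # leave a clean state for the next measurement
    return elapsed

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(2048, 2048, device=device)
torch.square(x)  # warmup

launch_only = wall_clock_ms(torch.square, x, sync=False)
with_sync = wall_clock_ms(torch.square, x, sync=True)
print(f"without synchronize: {launch_only:.3f} ms")
print(f"with synchronize:    {with_sync:.3f} ms")
```

CUDA events avoid the host clock entirely, which is why the timing helper in these notes uses them instead.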

Tools

There are three profilers:

autograd profiler: numerical
PyTorch profiler: visual
NVIDIA Nsight Compute

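The autograd profiler relies on torch.cuda.Event to measure performance.

The PyTorch profiler analyzes performance through profile(), a context manager provided by torch.profiler.
You can export the results as a .json file and load it in chrome://tracing/ to visualize it.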
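
An exported trace can also be inspected without a browser. The sketch below assumes the file follows the Chrome Trace Event Format: a top-level `traceEvents` list whose complete events (`"ph": "X"`) carry a `name` and a duration `dur` in microseconds. The function name `summarize_trace` and the tiny synthetic trace are made up for illustration; a real file exported by the profiler would take its place.

```python
import json
import os
import tempfile
from collections import defaultdict

def summarize_trace(path, top=5):
    """Return the `top` event names by total duration (microseconds)."""
    with open(path) as f:
        trace = json.load(f)
    # Some tools write a bare list of events instead of {"traceEvents": [...]}
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = defaultdict(float)
    for ev in events:
        if ev.get("ph") == "X":  # "X" marks a complete event with a duration
            totals[ev.get("name", "?")] += ev.get("dur", 0)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top]

# Synthetic trace standing in for a real exported file (hypothetical values)
fake = {"traceEvents": [
    {"name": "aten::mean", "ph": "X", "ts": 0, "dur": 120},
    {"name": "aten::randn", "ph": "X", "ts": 130, "dur": 900},
    {"name": "aten::mean", "ph": "X", "ts": 1100, "dur": 80},
    {"name": "process_name", "ph": "M"},  # metadata event, ignored
]}
with tempfile.TemporaryDirectory() as tmp:
    trace_path = os.path.join(tmp, "trace.json")
    with open(trace_path, "w") as f:
        json.dump(fake, f)
    summary = summarize_trace(trace_path)

for name, dur in summary:
    print(f"{name}: {dur:.0f} us")
```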

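Demo

The course provides a simple program that shows how to use the autograd profiler to compare three ways of squaring a tensor:

with torch.square()
with the ** operator
with the * operator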

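import torch

# The notes omit these definitions; they are assumed here so that the
# snippet runs and matches the three methods listed above.
def square_2(a):
    return a * a

def square_3(a):
    return a ** 2

b = torch.randn(10000, 10000).cuda()  # assumed input size

def time_pytorch_function(func, input):
    # CUDA is async, so Python's time module can't be used
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    # Warmup
    for _ in range(5):
        func(input)

    start.record()
    func(input)
    end.record()
    torch.cuda.synchronize()
    # Elapsed time between the two events, in milliseconds
    return start.elapsed_time(end)

time_pytorch_function(torch.square, b)
time_pytorch_function(square_2, b)
time_pytorch_function(square_3, b)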

The results below were obtained on an NVIDIA T4 GPU.

Profiling torch.square:
Self CPU time total: 10.577ms
Self CUDA time total: 3.266ms

Profiling a * a:
Self CPU time total: 5.417ms
Self CUDA time total: 3.276ms

Profiling a ** 2:
Self CPU time total: 6.183ms
Self CUDA time total: 3.274ms

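It turns out that:

The CUDA operations are faster than the CPU ones.
The * operator performs aten::multiply rather than aten::pow, and the former is faster. This is probably because multiplication is used far more often than pow, so many developers have spent time optimizing it.
The difference in CUDA time is minimal. In terms of CPU time, torch.square is the slowest operation.
aten::square is a call to aten::pow.
All three methods launch a CUDA kernel named native::vectorized_elementwise_kernel<4, at...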

Integrating CUDA kernels in PyTorch

There are several ways to do this:

Use load_inline from torch.utils.cpp_extension
Use Numba, a compiler that turns decorated Python functions into machine code that runs on the CPU and GPU
Use Triton

With load_inline from torch.utils.cpp_extension, a CUDA kernel can be loaded as a PyTorch extension by calling load_inline(name, cpp_sources, cuda_sources, functions, with_cuda, build_directory).

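import os

import torch
from torch.utils.cpp_extension import load_inline

# cpp_source and cuda_source are Python strings holding the C++ declaration
# and the CUDA implementation of square_matrix; they must be defined first.

# A custom build directory has to exist before load_inline is called
os.makedirs('./load_inline_cuda', exist_ok=True)

square_matrix_extension = load_inline(
    name='square_matrix_extension',
    cpp_sources=cpp_source,
    cuda_sources=cuda_source,
    functions=['square_matrix'],
    with_cuda=True,
    extra_cuda_cflags=["-O2"],
    build_directory='./load_inline_cuda',
    # extra_cuda_cflags=['--expt-relaxed-constexpr']
)

a = torch.tensor([[1., 2., 3.], [4., 5., 6.]], device='cuda')
print(square_matrix_extension.square_matrix(a))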
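
The call above expects `cpp_source` and `cuda_source` to be defined before `load_inline` runs, but the notes do not show them. A plausible pair is sketched below as an assumption about what an element-wise squaring kernel could look like, not as a copy of the lecture's code. The kernel name `square_matrix_kernel` and the 16x16 block shape are illustrative; only `square_matrix` has to match the name passed in `functions`.

```python
# C++ side: only the declaration, so load_inline can generate the binding
cpp_source = "torch::Tensor square_matrix(torch::Tensor matrix);"

# CUDA side: the kernel plus a host function that launches it.
# Assumes a contiguous 2D float32 tensor that already lives on the GPU.
cuda_source = r"""
__global__ void square_matrix_kernel(const float* matrix, float* result,
                                     int width, int height) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    // Guard against threads that fall outside the matrix
    if (row < height && col < width) {
        int idx = row * width + col;
        result[idx] = matrix[idx] * matrix[idx];
    }
}

torch::Tensor square_matrix(torch::Tensor matrix) {
    const int height = matrix.size(0);
    const int width = matrix.size(1);

    auto result = torch::empty_like(matrix);

    // One thread per element, grouped into 16x16 blocks
    dim3 threads_per_block(16, 16);
    dim3 number_of_blocks((width + threads_per_block.x - 1) / threads_per_block.x,
                          (height + threads_per_block.y - 1) / threads_per_block.y);

    square_matrix_kernel<<<number_of_blocks, threads_per_block>>>(
        matrix.data_ptr<float>(), result.data_ptr<float>(), width, height);

    return result;
}
"""

print(cpp_source)
```

Compiling these strings requires a CUDA toolkit and a host compiler that PyTorch can find; the first call is slow because the extension is built, while later calls reuse the cached build.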

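Hands-on

Using the autograd profiler on the mean operation

When using the autograd profiler, remember to:

Warm up the GPU before recording so that it reaches a steady state
Average over multiple runs to get more reliable results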

import torch

# method 1: use `torch.mean()`
def mean_all_by_torch(input_tensor):
    return torch.mean(input_tensor)

# method 2: use `mean()` of the tensor
def mean_all_by_tensor(input_tensor):
    return input_tensor.mean()

# method 3: use `torch.sum()` and `tensor.numel()`
def mean_all_by_combination(input_tensor):
    return torch.sum(input_tensor) / input_tensor.numel()

def time_pytorch_function(func, input_tensor, warmup=5, runs=100):
    # Warmup
    for _ in range(warmup):
        func(input_tensor)

    times = []
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    for _ in range(runs):
        start.record()
        func(input_tensor)
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))

    return sum(times) / len(times)

input_tensor = torch.randn(10000, 10000).cuda()

print("torch.mean() time:", time_pytorch_function(mean_all_by_torch, input_tensor))
print("tensor.mean() time:", time_pytorch_function(mean_all_by_tensor, input_tensor))
print("manual mean time:", time_pytorch_function(mean_all_by_combination, input_tensor))


with torch.profiler.profile() as prof:
    mean_all_by_torch(input_tensor)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

with torch.profiler.profile() as prof:
    mean_all_by_tensor(input_tensor)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

with torch.profiler.profile() as prof:
    mean_all_by_combination(input_tensor)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

Using the PyTorch profiler on the mean operation

import torch
from torch.profiler import profile, ProfilerActivity

# Record both CPU and CUDA activity
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        mean_tensor = torch.mean(torch.randn(10000, 10000).cuda())

# Open the resulting file in chrome://tracing/
prof.export_chrome_trace("mean_trace.json")

Implementing torch.mean() in Triton

import triton
import triton.language as tl
import torch

@triton.jit
def mean_kernel(
    x_ptr,          # pointer to input tensor
    output_ptr,     # pointer to output tensor (one partial result per block)
    n_elements,     # total number of elements
    BLOCK_SIZE: tl.constexpr,  # number of elements per block
):
    pid = tl.program_id(0)

    # Offsets handled by this program; mask out the tail past n_elements
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements

    # Vectorized load and block-level sum instead of a scalar loop
    x = tl.load(x_ptr + offsets, mask=mask, other=0.0)
    block_sum = tl.sum(x, axis=0)

    # Each block contributes block_sum / n_elements to the global mean
    block_mean = block_sum / n_elements

    # Store result
    tl.store(output_ptr + pid, block_mean)

# Wrapper function
def triton_mean(x: torch.Tensor) -> torch.Tensor:

    x = x.contiguous().view(-1)
    n_elements = x.numel()


    BLOCK_SIZE = 1024
    grid = (triton.cdiv(n_elements, BLOCK_SIZE),)


    output = torch.empty(grid[0], device=x.device, dtype=x.dtype)


    mean_kernel[grid](
        x_ptr=x,
        output_ptr=output,
        n_elements=n_elements,
        BLOCK_SIZE=BLOCK_SIZE,
    )

    return output.sum()

# Example usage:
if __name__ == "__main__":
    # Create test tensor
    x = torch.randn(1000000, device='cuda')

    # Compare results
    torch_mean = torch.mean(x)
    triton_mean_result = triton_mean(x)

    print(f"PyTorch mean: {torch_mean}")
    print(f"Triton mean: {triton_mean_result}")
    print(f"Difference: {abs(torch_mean - triton_mean_result)}")
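
The reduction strategy used above (each block writes its partial sum divided by the total element count, and the host adds up the partial results) can be checked on the CPU without a GPU. The sketch below is a plain-Python analogue, not Triton code; `blockwise_mean` and the block size are hypothetical names and values chosen for illustration.

```python
import math
import random

def blockwise_mean(values, block_size=1024):
    """Mean computed as a sum of per-block contributions (block_sum / n)."""
    n = len(values)
    num_blocks = (n + block_size - 1) // block_size  # same role as triton.cdiv
    partials = []
    for pid in range(num_blocks):
        # The last block may be shorter, like the masked tail on the GPU
        block = values[pid * block_size:(pid + 1) * block_size]
        partials.append(math.fsum(block) / n)
    # Final reduction over the partial results, like output.sum() on the host
    return math.fsum(partials)

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(10_000)]
reference = math.fsum(data) / len(data)
result = blockwise_mean(data)
print(f"reference mean: {reference}")
print(f"blockwise mean: {result}")
```

Dividing each partial sum by the global count, not by the block length, is what makes the final sum equal to the overall mean even when the last block is only partly filled.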

References

GPU MODE lectures - GitHub
Event - PyTorch
PyTorch Profiler
NVIDIA Nsight Compute
torch.utils.cpp_extension.load_inline
Triton
