Deploying torch Models at Production Level: Theory & Tutorial

개발허재 2022. 5. 29. 19:02

TorchScript

Using a model at production level requires two things.

Portability

The model must be exportable to a variety of environments.

It must run not only in a Python interpreter process but also in a C++ server or on mobile/embedded devices.

Performance

Inference must be optimized while preserving both latency and throughput.

Unlike TensorFlow, PyTorch is a framework that takes on many of Python's characteristics, so it was weaker in terms of portability and performance. To address this, TorchScript converts code from eager mode to script mode.
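As a minimal sketch of what portability means in practice, the snippet below compiles a small model to TorchScript, serializes it, and loads it back without needing the original Python class. The model itself is a hypothetical stand-in, and an in-memory buffer is used in place of a file on disk.

```python
import io

import torch
import torch.nn as nn

# Hypothetical stand-in model; any scriptable nn.Module would do.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2)).eval()

# Compile to TorchScript and serialize into an in-memory buffer.
scripted = torch.jit.script(model)
buffer = io.BytesIO()
torch.jit.save(scripted, buffer)

# Reload from the serialized bytes alone.
buffer.seek(0)
restored = torch.jit.load(buffer)

x = torch.rand(3, 4)
same = torch.allclose(model(x), restored(x))
print(same)
```

Writing to a file path instead of a buffer produces an archive that the C++ API can read with torch::jit::load.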

Eager Mode

The normal Python runtime mode, used for prototyping, training, and experimenting.

Script Mode

Also called script mode or graph mode.

The mode a model is converted to for production deployment.

Because the model is not executed by the Python interpreter at runtime, parallel execution and optimizations become possible.

Converting from eager mode to script mode

There are two ways to convert a model to script mode:

tracing and scripting.

Using PyTorch JIT in scripting mode

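To enable scripting, use the jit.ScriptModule class and the @torch.jit.script_method decorator. This class-based form is the older API; since PyTorch 1.2, calling torch.jit.script() on an nn.Module instance is the recommended route.

This converts the wrapped module directly into a (compiler-backed) TorchScript function.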

import torch.jit as jit

class MyModule(jit.ScriptModule):
    def __init__(self):
        super().__init__()
        # [...]
    
    @jit.script_method
    def forward(self, x):
        # [...]
        pass

After running this code, the module is still a Python object, but its code executes in C++.
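The "still a Python object" point can be checked directly. The sketch below uses torch.jit.script() on a module instance rather than the class-based form, and Doubler is a hypothetical module made up for illustration.

```python
import torch
import torch.nn as nn


class Doubler(nn.Module):  # hypothetical example module
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * 2


scripted = torch.jit.script(Doubler())

# The result is still usable as an ordinary nn.Module from Python...
is_module = isinstance(scripted, nn.Module)
is_script_module = isinstance(scripted, torch.jit.ScriptModule)

# ...but forward has been compiled; .code shows the TorchScript source.
print(scripted.code)

out = scripted(torch.tensor([1.0, 2.0]))
print(is_module, is_script_module, out)
```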

TorchScript allows iteration over an immutable range or tuple, but not over a sequence that keeps changing during the loop.

# allowed
for n in range(0, 5):
    print(n)

# allowed
for n in (0, 1, 2, 3, 4):
    print(n)

# not allowed (raises compile error)
ns = [1, 2, 3, 4]
for n in ns:
    print(n)
    ns.pop()

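The third code sample above shows why iterating over a changing sequence is forbidden: it modifies the list while iterating over its elements.

The compiler cannot unroll this loop, so the conversion fails.

Most PyTorch module code, that is, models shipped as a library or API and publicly released models, can be converted to TorchScript easily.

However, converting the very complex code of models outside that group tends to be quite difficult.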
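For the allowed case, a loop over a range compiles without trouble. The function below is a hypothetical example, not taken from the original post, that sums the first n rows of a tensor inside a scripted function.

```python
import torch


@torch.jit.script
def sum_first_rows(x: torch.Tensor, n: int) -> torch.Tensor:
    # Iterating over range(n) is supported by the TorchScript compiler.
    total = torch.zeros_like(x[0])
    for i in range(n):
        total = total + x[i]
    return total


x = torch.arange(12.0).reshape(4, 3)
result = sum_first_rows(x, 2)
print(result)

# The compiled graph can be inspected to see how the loop was lowered.
print(sum_first_rows.graph)
```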

Using PyTorch JIT in trace mode

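Tracing uses the torch.jit.trace function. The related torch.jit.trace_module function (a function, not a module) traces several methods of a module in one call.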

# torch.jit.trace for functions
import torch

def foo(x, y):
    return 2 * x + y

traced_foo = torch.jit.trace(foo, (torch.rand(3), torch.rand(3)))

# torch.jit.trace for modules
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv = nn.Conv2d(1, 1, 3)

    def forward(self, x):
        return self.conv(x)

n = Net()
example_weight = torch.rand(1, 1, 3, 3)
example_forward_input = torch.rand(1, 1, 3, 3)
traced_module = torch.jit.trace(n, example_forward_input)

After running this code, traced_foo and traced_module are JIT-compiled code objects that execute in C++.
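torch.jit.trace on a module records only forward. When other methods need tracing too, torch.jit.trace_module takes a dictionary mapping method names to example inputs. The sketch below extends the Net class with a weighted_kernel_sum method, following the pattern in the PyTorch documentation, which is also where an example_weight input comes into play.

```python
import torch
import torch.nn as nn


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, 3)

    def forward(self, x):
        return self.conv(x)

    def weighted_kernel_sum(self, weight):
        # A second method that should be traced alongside forward.
        return weight * self.conv.weight


net = Net()
example_weight = torch.rand(1, 1, 3, 3)
example_forward_input = torch.rand(1, 1, 3, 3)

# Map each method name to the example input used to trace it.
inputs = {
    "forward": example_forward_input,
    "weighted_kernel_sum": example_weight,
}
traced = torch.jit.trace_module(net, inputs)

out = traced(example_forward_input)
weighted = traced.weighted_kernel_sum(example_weight)
print(out.shape, weighted.shape)
```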

Tracing feeds an example input (a data instance) through the model and records the operations along the path that input takes, which is how it captures the model's structure. It suits models that use little conditional logic.

Tracing is an effective way to reuse the code of an eager model. However, control flow, data structures, and Python constructs are not preserved.

For example, suppose we have the following PyTorch model.

import torch

class MyModule(torch.nn.Module):
    def __init__(self, N, M):
        super(MyModule, self).__init__()
        self.weight = torch.nn.Parameter(torch.rand(N, M))

    def forward(self, input):
        if input.sum() > 0:
            output = self.weight.mv(input)
        else:
            output = self.weight + input
        return output

MyModule uses control flow that depends on the input value, so tracing is not suitable for it.

Instead, compile the module with the torch.jit.script() function to convert it into a ScriptModule.

Also, unlike tracing mode, this approach does not need a data sample. Only the model instance is passed as input.
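The difference can be observed by converting the same MyModule both ways. In this sketch the trace is recorded with a positive input, so only the mv branch is captured. The sizes 2 and 3 and the variable names are arbitrary choices for illustration.

```python
import warnings

import torch
import torch.nn as nn


class MyModule(nn.Module):
    def __init__(self, N, M):
        super().__init__()
        self.weight = nn.Parameter(torch.rand(N, M))

    def forward(self, input):
        if input.sum() > 0:
            output = self.weight.mv(input)
        else:
            output = self.weight + input
        return output


model = MyModule(2, 3)
positive = torch.ones(3)
negative = -torch.ones(3)

with warnings.catch_warnings():
    # Tracing warns that the tensor-to-bool branch will be frozen.
    warnings.simplefilter("ignore")
    traced = torch.jit.trace(model, positive)

# Scripting needs only the model instance, no example input.
scripted = torch.jit.script(model)

eager_shape = model(negative).shape        # else branch: (2, 3)
traced_shape = traced(negative).shape      # frozen mv branch: (2,)
scripted_shape = scripted(negative).shape  # else branch: (2, 3)
print(eager_shape, traced_shape, scripted_shape)
```

The traced module silently returns the result of the wrong branch for the negative input, while the scripted module matches eager execution on both branches.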

Make custom layers go fast

import torch
import torch.nn as nn
import torch.nn.functional as F

class Conv2d(nn.Module):
    def __init__(
        self, n_channels, out_channels, kernel_size, dilation=1, padding=0, stride=1
    ):
        super().__init__()

        self.kernel_size = kernel_size
        self.kernel_size_number = kernel_size * kernel_size
        self.out_channels = out_channels
        self.padding = padding
        self.dilation = dilation
        self.stride = stride
        self.n_channels = n_channels
        self.weights = nn.Parameter(
            torch.Tensor(self.out_channels, self.n_channels, self.kernel_size**2)
        )

    def __repr__(self):
        return (
            f"Conv2d(n_channels={self.n_channels}, out_channels={self.out_channels}, "
            f"kernel_size={self.kernel_size})"
        )
    
    def forward(self, x):
        width = self.calculate_new_width(x)
        height = self.calculate_new_height(x)
        windows = self.calculate_windows(x)
        
        result = torch.zeros(
            [x.shape[0] * self.out_channels, width, height],
            dtype=torch.float32, device=x.device
        )

        for channel in range(x.shape[1]):
            for i_conv_n in range(self.out_channels):
                xx = torch.matmul(windows[channel], self.weights[i_conv_n][channel]) 
                xx = xx.view((-1, width, height))
                
                xx_stride = slice(i_conv_n * xx.shape[0], (i_conv_n + 1) * xx.shape[0])
                result[xx_stride] += xx

        result = result.view((x.shape[0], self.out_channels, width, height))
        return result  

    def calculate_windows(self, x):
        windows = F.unfold(
            x,
            kernel_size=(self.kernel_size, self.kernel_size),
            padding=(self.padding, self.padding),
            dilation=(self.dilation, self.dilation),
            stride=(self.stride, self.stride)
        )

        windows = (windows
            .transpose(1, 2)
            .contiguous().view((-1, x.shape[1], int(self.kernel_size**2)))
            .transpose(0, 1)
        )
        return windows

    def calculate_new_width(self, x):
        return (
            (x.shape[2] + 2 * self.padding - self.dilation * (self.kernel_size - 1) - 1)
            // self.stride
        ) + 1

    def calculate_new_height(self, x):
        return (
            (x.shape[3] + 2 * self.padding - self.dilation * (self.kernel_size - 1) - 1)
            // self.stride
        ) + 1

To test the performance of this module, run the following code.

x = torch.randint(0, 255, (1, 3, 512, 512), device='cuda') / 255
conv = Conv2d(3, 16, 3)
conv.cuda()

%%time
out = conv(x)
out.mean().backward()

As written, this module takes 35.5 ms to run on this input.
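The %%time magic only works inside a Jupyter cell. Outside a notebook, a rough equivalent is a small timing helper like the one sketched below. The name time_forward_backward and the warm-up and iteration counts are assumptions, and the built-in nn.Conv2d is used as a stand-in layer so the snippet runs on its own; it falls back to CPU when no GPU is available.

```python
import time

import torch
import torch.nn as nn


def time_forward_backward(module, x, n_warmup=3, n_iters=10):
    """Return the mean wall-clock time in ms of one forward+backward pass."""
    for _ in range(n_warmup):
        module(x).mean().backward()
    if x.is_cuda:
        # CUDA kernels run asynchronously; wait before starting the clock.
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        module.zero_grad(set_to_none=True)
        module(x).mean().backward()
    if x.is_cuda:
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_iters * 1e3


device = "cuda" if torch.cuda.is_available() else "cpu"
layer = nn.Conv2d(3, 16, 3).to(device)  # stand-in for the custom Conv2d
x = torch.rand(1, 3, 64, 64, device=device)

ms = time_forward_backward(layer, x)
print(f"{ms:.2f} ms per forward+backward on {device}")
```

Averaging over several iterations after a warm-up gives steadier numbers than a single timed run, which matters for JIT-compiled code because the first calls include compilation and profiling overhead.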

Now JIT-compile this code (that is, convert it to the graph runtime).

Only a few changes are needed. First, the class must now inherit from jit.ScriptModule instead of nn.Module.

# old
class Conv2d(nn.Module):
    # [...]
# new
class Conv2d(jit.ScriptModule):
    # [...]

Second, add the @jit.script_method wrapper to the forward method definition inside the module code.

# old
def forward(self, x):
    # [...]

# new
@jit.script_method
def forward(self, x):
    # [...]
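The two changes above use the class-based API. As an alternative sketch, the same kind of conversion can be done without touching the class by passing an instance to torch.jit.script(). SoftClamp below is a hypothetical custom layer, far simpler than the Conv2d in this post, used only to check that the scripted version matches eager execution and still supports backpropagation.

```python
import torch
import torch.nn as nn


class SoftClamp(nn.Module):  # hypothetical custom layer
    def __init__(self, limit: float = 1.0):
        super().__init__()
        self.limit = limit
        self.scale = nn.Parameter(torch.ones(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Smoothly squash values into (-limit, limit).
        return self.limit * torch.tanh(self.scale * x / self.limit)


eager = SoftClamp(2.0)
scripted = torch.jit.script(eager)  # the class definition stays unchanged

x = torch.randn(4, 8)
matches = torch.allclose(eager(x), scripted(x))

# Gradients flow through the compiled forward as usual.
scripted(x).mean().backward()
has_grad = scripted.scale.grad is not None
print(matches, has_grad)
```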

References

https://happy-jihye.github.io/dl/torch-2/
Deploying Deep Learning Models #02 - TorchScript & Pytorch JIT

https://spell.ml/blog/pytorch-jit-YBmYuBEAACgAiv71
Speeding up model training with PyTorch JIT