TorchScript & Triton for Model Deployment


개발허재 2022. 5. 30. 23:50

Introduction

I used text detection and text recognition models to provide an OCR service.

For text detection, I used the CRAFT model from clova-ai. (Source: https://github.com/clovaai/CRAFT-pytorch)

I converted this model to TorchScript with torch.jit.script.
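As a rough illustration of what scripting involves, here is a minimal sketch. TinyDetector is a hypothetical stand-in I made up for this example, not CRAFT itself; the point is only the script, save, load round trip.

```python
import io

import torch
from torch import nn


class TinyDetector(nn.Module):
    """Hypothetical stand-in for a detection network (not CRAFT itself)."""

    def __init__(self) -> None:
        super().__init__()
        self.conv = nn.Conv2d(3, 2, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.conv(x)
        # Data-dependent control flow is kept by scripting
        # (tracing would record only the branch that was taken).
        if y.mean() > 0:
            y = torch.relu(y)
        return y


model = TinyDetector().eval()
scripted = torch.jit.script(model)

# Round-trip through an in-memory buffer. On disk this would be the
# model file placed in a Triton version directory.
buffer = io.BytesIO()
torch.jit.save(scripted, buffer)
buffer.seek(0)
restored = torch.jit.load(buffer)

x = torch.rand(1, 3, 32, 32)
with torch.no_grad():
    same = torch.allclose(model(x), restored(x))
```

If same is True, the scripted and reloaded module reproduces the eager model's output for that input.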

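For text recognition, I likewise used clova-ai's TPS-ResNet-BiLSTM-Attn model, and applied transfer learning so that it extracts Korean text for my task. (Source: https://github.com/clovaai/deep-text-recognition-benchmark)

To script the TPS-ResNet-BiLSTM-Attn model, I had to rewrite several of its layers.

Several constructs could not be converted as written, such as AdaptiveAvgPooling, input parameters without declared types, and torch's FloatTensor and Tensor usages, so I replaced them with supported alternatives. (This was the hardest part of the work.)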
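The kind of rewrite involved can be sketched roughly as follows. StepDecoder is a hypothetical fragment invented for illustration, not the benchmark's actual Attention module; it only shows two of the patterns mentioned above: annotating non-tensor arguments, and using a factory function with an explicit dtype and device instead of a typed tensor constructor.

```python
import torch
from torch import nn


class StepDecoder(nn.Module):
    """Hypothetical decoder fragment (not the benchmark's Attention module)."""

    def __init__(self, num_classes: int) -> None:
        super().__init__()
        self.num_classes = num_classes

    def forward(self, feats: torch.Tensor, num_steps: int = 5) -> torch.Tensor:
        # Non-tensor arguments need a type annotation; an unannotated
        # argument is assumed to be a Tensor by TorchScript.
        batch_size = feats.size(0)
        # Factory function with explicit dtype/device, used here in place
        # of a typed constructor such as torch.FloatTensor(...).
        probs = torch.zeros(batch_size, num_steps, self.num_classes,
                            dtype=torch.float32, device=feats.device)
        for step in range(num_steps):
            # Placeholder computation: broadcast the per-sample mean.
            probs[:, step, :] = feats.mean(dim=1, keepdim=True)
        return probs


decoder = torch.jit.script(StepDecoder(num_classes=4))
out = decoder(torch.ones(2, 8), 3)
```

Which constructs actually need rewriting depends on the model and the PyTorch version, so this is only one possible shape of the fix.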

Main Body

Ensemble

With all of that done, the only thing left was to deploy to Triton. At first, I loaded the models using Triton's Ensemble Model feature.

The config is as follows.

name: "ensemble_ocr"
platform: "ensemble"
max_batch_size: 1

input [
  {
    name: "pre_in_0"
    data_type: TYPE_UINT8
    dims: [ -1, -1, 3 ]
  }
]
output [
   {
     name: "post1_out_0"
     data_type: TYPE_STRING
     dims: [ 1 ]
   }
]
ensemble_scheduling {
  step [
    {
      model_name: "preprocess"
      model_version: -1
      input_map [
      {
        key: "INPUT__0"
        value: "pre_in_0"
      }
      ]
      output_map [
      {
        key: "OUTPUT__0"
        value: "pre_out_0"
      },
      {
        key: "OUTPUT__1"
        value: "pre_out_1"
      }
      ]
    },
    {
      model_name: "craft"
      model_version: -1
      input_map [
      {
        key: "input__0"
        value: "pre_out_0"
      }
      ]
      output_map [
      {
        key: "output__0"
        value: "craft_out_0"
      }
      ]
    },
    {
      model_name: "postprocess_0"
      model_version: -1
      input_map [
      {
        key: "INPUT__0"
        value: "craft_out_0"
      },
      {
        key: "INPUT__1"
        value: "pre_in_0"
      },
      {
        key: "INPUT__2"
        value: "pre_out_1"
      }
      ]
      output_map [
      {
        key: "OUTPUT__0"
        value: "post0_out_0"
      },
      {
        key: "OUTPUT__1"
        value: "post0_out_1"
      },
      {
        key: "OUTPUT__2"
        value: "post0_out_2"
      },
      {
        key: "OUTPUT__3"
        value: "post0_out_3"
      }
      ]
    },
    {
      model_name: "ocr"
      model_version: 2
      input_map [
      {
        key: "INPUT__0"
        value: "post0_out_0"
      },
      {
        key: "INPUT__1"
        value: "post0_out_1"
      }
      ]
      output_map [
      {
        key: "OUTPUT__0"
        value: "ocr_out_0"
      }
      ]
    },
    {
      model_name: "postprocess_1"
      model_version: -1
      input_map [
      {
        key: "INPUT__0"
        value: "ocr_out_0"
      },
      {
        key: "INPUT__1"
        value: "post0_out_2"
      },
      {
        key: "INPUT__2"
        value: "post0_out_3"
      }
      ]
      output_map [
      {
        key: "OUTPUT__0"
        value: "post1_out_0"
      }
      ]
    }
    ]
}

Each step's model has its own directory in the Model Repository, named after its model_name, which holds the weight file and a config.pbtxt file.

/models
    /preprocess
        - config.pbtxt
        /1 (or another version number)
            - model files
    /craft
        - config.pbtxt
        /1 (or another version number)
            - model files
    /postprocess_0
        - config.pbtxt
        /1 (or another version number)
            - model files
    /ocr
        - config.pbtxt
        /2
            - model files
    /postprocess_1
        - config.pbtxt
        /1 (or another version number)
            - model files
    /ensemble_ocr
        - config.pbtxt
        /1 (or another version number; left empty)

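Model Repository structure

However, while writing the config, this ensemble approach started to feel needlessly complicated.

Triton's appeal is that it can serve models without writing code, yet I felt that writing the config files would end up confusing me even more.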
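One way to keep the tensor wiring manageable is to check it with a small script before loading the ensemble. The sketch below copies the tensor names from the ensemble_ocr config into plain Python data; find_wiring_problems is a helper I made up for this purpose and is not part of Triton. Because the ensemble scheduler is driven by data availability rather than step order, the check only asks whether each consumed tensor is produced somewhere.

```python
from typing import List, Sequence, Tuple

# (model_name, input_map values, output_map values), copied from the config.
Step = Tuple[str, Sequence[str], Sequence[str]]

ENSEMBLE_INPUTS = ["pre_in_0"]
ENSEMBLE_OUTPUTS = ["post1_out_0"]
STEPS: List[Step] = [
    ("preprocess", ["pre_in_0"], ["pre_out_0", "pre_out_1"]),
    ("craft", ["pre_out_0"], ["craft_out_0"]),
    ("postprocess_0", ["craft_out_0", "pre_in_0", "pre_out_1"],
     ["post0_out_0", "post0_out_1", "post0_out_2", "post0_out_3"]),
    ("ocr", ["post0_out_0", "post0_out_1"], ["ocr_out_0"]),
    ("postprocess_1", ["ocr_out_0", "post0_out_2", "post0_out_3"],
     ["post1_out_0"]),
]


def find_wiring_problems(inputs: Sequence[str],
                         outputs: Sequence[str],
                         steps: Sequence[Step]) -> List[str]:
    """Report tensors that are consumed but never produced."""
    # Everything available in the ensemble: its inputs plus all step outputs.
    produced = set(inputs)
    for _, _, step_outputs in steps:
        produced.update(step_outputs)

    problems = []
    for model_name, step_inputs, _ in steps:
        for tensor in step_inputs:
            if tensor not in produced:
                problems.append(
                    f"{model_name}: input '{tensor}' is never produced")
    for tensor in outputs:
        if tensor not in produced:
            problems.append(f"ensemble output '{tensor}' is never produced")
    return problems


problems = find_wiring_problems(ENSEMBLE_INPUTS, ENSEMBLE_OUTPUTS, STEPS)
```

An empty problems list means every mapped tensor name has a producer; a typo in any value field would show up as an entry.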

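Python backend

Since all of my models are TorchScript, I tried running the entire pipeline in the Python backend instead.

(If a model is converted to TensorRT, running it from the Python backend requires writing additional code, so if any of the models is TensorRT, the ensemble approach is probably the better choice.)

For security reasons I cannot share the actual code. The TritonPythonModel class below shows where initialization and execution go.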

import json

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """Your Python model must use the same class name. Every Python model
    that is created must have "TritonPythonModel" as the class name.
    """

    def initialize(self, args):
        """`initialize` is called only once when the model is being loaded.
        Implementing `initialize` function is optional. This function allows
        the model to initialize any state associated with this model.
        Parameters
        ----------
        args : dict
          Both keys and values are strings. The dictionary keys and values are:
          * model_config: A JSON string containing the model configuration
          * model_instance_kind: A string containing model instance kind
          * model_instance_device_id: A string containing model instance device ID
          * model_repository: Model repository path
          * model_version: Model version
          * model_name: Model name
        """

        # You must parse model_config. JSON string is not parsed here
        self.model_config = model_config = json.loads(args['model_config'])

        # Get OUTPUT__0 configuration
        output0_config = pb_utils.get_output_config_by_name(
            model_config, "OUTPUT__0")

        # Convert Triton types to numpy types
        self.output0_dtype = pb_utils.triton_string_to_numpy(
            output0_config['data_type'])

    def execute(self, requests):
        """`execute` MUST be implemented in every Python model. `execute`
        function receives a list of pb_utils.InferenceRequest as the only
        argument. This function is called when an inference request is made
        for this model. Depending on the batching configuration (e.g. Dynamic
        Batching) used, `requests` may contain multiple requests. Every
        Python model, must create one pb_utils.InferenceResponse for every
        pb_utils.InferenceRequest in `requests`. If there is an error, you can
        set the error argument when creating a pb_utils.InferenceResponse
        Parameters
        ----------
        requests : list
          A list of pb_utils.InferenceRequest
        Returns
        -------
        list
          A list of pb_utils.InferenceResponse. The length of this list must
          be the same as `requests`
        """
        output0_dtype = self.output0_dtype

        responses = []

        # Build exactly one response per request, in the same order.
        for request in requests:
            # Input image as a numpy array (uint8).
            image = pb_utils.get_input_tensor_by_name(
                request, "INPUT__0").as_numpy()

            # The actual OCR pipeline is omitted (see above). It would run
            # on `image` and fill `result_list` with the output strings.
            result_list = []

            # A plain list has no astype(), so convert it to a numpy array.
            out_tensor_0 = pb_utils.Tensor(
                "OUTPUT__0", np.array(result_list, dtype=output0_dtype))

            inference_response = pb_utils.InferenceResponse(
                output_tensors=[out_tensor_0])
            responses.append(inference_response)

        # You should return a list of pb_utils.InferenceResponse. Length
        # of this list must match the length of `requests` list.
        return responses

    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` function is OPTIONAL. This function allows
        the model to perform any necessary clean ups before exit.
        """
        print('Cleaning up...')

The config file is as follows.

name: "ocr_python_backend"
backend: "python"

max_batch_size: 10
input [{
    name: "INPUT__0"
    data_type: TYPE_UINT8
    dims: [ -1, -1, 3 ]
}]
 
output [{
    name: "OUTPUT__0"
    data_type: TYPE_STRING
    dims: [ 1 ]
}]

instance_group [
    {
        count: 4
        kind: KIND_GPU
        gpus: [ 1 ]
    },
    {
        count: 1
        kind: KIND_CPU
    }
]

version_policy: { specific: { versions: [2]}}

The input is declared as TYPE_UINT8, the Triton data type I used to receive the image as a tensor. The output is a DataFrame, so it is declared as TYPE_STRING. For the Triton data types, see the GitHub link below.

https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#datatypes
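To show how tabular results can travel through a TYPE_STRING output with dims [ 1 ], here is a minimal sketch. It assumes the results are serialized to JSON (my assumption, not necessarily what the real service does) and uses a list of dicts as a stand-in for the DataFrame; encode_rows and decode_rows are hypothetical helpers. In the Python backend, string tensors are handled as numpy object arrays holding bytes.

```python
import json
from typing import Dict, List

import numpy as np

Row = Dict[str, object]


def encode_rows(rows: List[Row]) -> np.ndarray:
    """Pack the rows into a one-element object array of UTF-8 bytes."""
    payload = json.dumps(rows, ensure_ascii=False).encode("utf-8")
    return np.array([payload], dtype=np.object_)


def decode_rows(tensor: np.ndarray) -> List[Row]:
    """Client-side inverse of encode_rows."""
    return json.loads(tensor[0].decode("utf-8"))


# Hypothetical OCR results standing in for the DataFrame.
rows = [
    {"text": "안녕하세요", "x": 12, "y": 40},
    {"text": "Triton", "x": 80, "y": 40},
]
tensor = encode_rows(rows)
restored = decode_rows(tensor)
```

The array has shape (1,), matching the dims [ 1 ] declared for OUTPUT__0, and the Korean text survives the round trip because the payload is UTF-8.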

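Conclusion

I compared the ensemble model and the Python backend model implemented as described above.

I still don't know the exact cause, but the Python backend model was much faster and performed better.

My hypothesis is that as the ensemble grows more complex, more data gets passed back and forth between the different backends, which slows it down. I plan to investigate this in more detail.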
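One way to start that investigation is to time identical requests against both deployments. The sketch below is only a measurement harness: measure_latency is a helper I made up, and fake_infer is a placeholder that would need to be replaced with a real request to "ensemble_ocr" or "ocr_python_backend".

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency(infer: Callable[[], object],
                    n_requests: int = 50,
                    warmup: int = 5) -> Dict[str, float]:
    """Time repeated calls to `infer` and summarize latency in milliseconds."""
    for _ in range(warmup):
        infer()  # warm-up calls are not timed

    samples_ms = []
    for _ in range(n_requests):
        start = time.perf_counter()
        infer()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    samples_ms.sort()
    p95_index = min(len(samples_ms) - 1, int(0.95 * len(samples_ms)))
    return {
        "mean_ms": statistics.mean(samples_ms),
        "p95_ms": samples_ms[p95_index],
        "max_ms": samples_ms[-1],
    }


def fake_infer() -> None:
    # Hypothetical placeholder for one inference request.
    time.sleep(0.001)


stats = measure_latency(fake_infer, n_requests=20, warmup=2)
```

Running the same harness once per deployment, with the same input image, gives comparable mean and tail latencies for the two setups.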