
Deploying Multi-Tenant MLflow with MinIO and PostgreSQL

개발허재 2023. 5. 13. 17:01

Before getting into the post: I currently develop and operate an ML platform at my company, and we are in the process of evolving it into an MLOps platform.

We want to offer MLflow to our users, but spinning up a separate instance in every user's namespace would make compute usage grow out of hand.

Instead, we will build a multi-tenant MLflow: a single server that can serve many different users at once.

The MLflow Authentication component that makes multi-tenant MLflow possible is documented as available from version 2.5.0, but on GitHub the code appears in the project as early as v2.3.0 (see the link below).

https://github.com/mlflow/mlflow/tree/v2.3.0/mlflow/server
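
With the basic-auth app enabled on the server, clients authenticate using HTTP basic auth. Here is a minimal client-side sketch; the tracking URL and credentials are placeholders for values your own deployment provides:

import os
import mlflow

# The MLflow tracking client reads HTTP basic auth credentials
# from these environment variables.
os.environ["MLFLOW_TRACKING_USERNAME"] = "alice"           # placeholder
os.environ["MLFLOW_TRACKING_PASSWORD"] = "alice-password"  # placeholder

mlflow.set_tracking_uri("http://localhost/mlops/mlflow")   # placeholder URL

# Each user only sees the experiments they have permission to read.
print(mlflow.search_experiments())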

 

1. Deploy MinIO to serve as the MLflow artifact store.

2. Deploy a PostgreSQL DB in the mlflow namespace for the MLflow server to use as its backend store.

3. Deploy the multi-tenant MLflow server in the mlflow namespace.

4. The MinIO instance that MLflow uses as its artifact store must not be directly accessible to users, and we want to manage it as a single server, so it is deployed as one MinIO server in a dedicated namespace of its own.

 

MinIO Deployment & Service manifests

apiVersion: apps/v1
kind: Deployment
metadata:
  name: minio-deployment
  namespace: minio
  annotations:
    # Enable Istio sidecar injection
    sidecar.istio.io/inject: "true"
spec:
  selector:
    matchLabels:
      app: minio
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: minio
      annotations:
        # Enable Istio sidecar injection
        sidecar.istio.io/inject: "true"
    spec:
      volumes:
      - name: storage
        nfs:
          path: /my/nas/path
          server: my.nas.com
      containers:
      - name: minio
        image: my/minio/minio:latest
        args:
        - server
        - /storage
        - --console-address
        - ":9001"
        env:
        # Minio access key and secret key
        - name: MINIO_ROOT_USER
          value: "minio"
        - name: MINIO_ROOT_PASSWORD
          value: "minio123"
        - name: MINIO_BROWSER_REDIRECT_URL
          value: "http://localhost/mlops/minio/console"
        ports:
        - containerPort: 9000
          hostPort: 9000
        - containerPort: 9001
          hostPort: 9001
        volumeMounts:
        - name: storage
          mountPath: "/storage"
---
apiVersion: v1
kind: Service
metadata:
  name: minio-service
  namespace: minio
spec:
  type: LoadBalancer
  ports:
    - name: server-port
      port: 9000
      targetPort: 9000
      protocol: TCP
    - name: console-port
      port: 9001
      targetPort: 9001
      protocol: TCP
  selector:
    app: minio
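
After applying these manifests, it is worth smoke-testing MinIO and pre-creating the bucket that MLflow will write artifacts to. A quick sketch using the minio Python client, with the placeholder endpoint and credentials from the manifest above:

from minio import Minio

# Endpoint and credentials are the placeholder values from the manifest above.
client = Minio(
    "minio-service.minio.svc.cluster.local:9000",
    access_key="minio",
    secret_key="minio123",
    secure=False,
)

# Create the artifact bucket MLflow will use, if it does not exist yet.
if not client.bucket_exists("mlflow"):
    client.make_bucket("mlflow")
print(client.list_buckets())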

 

PostgreSQL manifests

apiVersion: v1
data:
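  # base64-encoded password (here, the literal string "postgres"); generate your own with:
  #   python -c "import base64; print(base64.b64encode(b'postgres').decode())"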
  postgresql-password: cG9zdGdyZXM=
kind: Secret
metadata:
  annotations:
    meta.helm.sh/release-name: mlflow
    meta.helm.sh/release-namespace: mlflow
  labels:
    app.kubernetes.io/instance: mlflow
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: postgresql
    helm.sh/chart: postgresql-10.5.3
  name: mlflow-postgresql
  namespace: mlflow
type: Opaque

---
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/bound-by-controller: "yes"
  finalizers:
  - kubernetes.io/pv-protection
  name: nfs-pv
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 8Gi
  nfs:
    path: /my/nas/path/mlflow-postgres
    server: my.nas.com
  persistentVolumeReclaimPolicy: Retain
  volumeMode: Filesystem

---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    meta.helm.sh/release-name: mlflow
    meta.helm.sh/release-namespace: mlflow
  labels:
    app.kubernetes.io/component: primary
    app.kubernetes.io/instance: mlflow
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: postgresql
    helm.sh/chart: postgresql-10.5.3
  name: mlflow-postgresql
  namespace: mlflow
spec:
  podManagementPolicy: OrderedReady
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: mlflow
      app.kubernetes.io/name: postgresql
      role: primary
  serviceName: mlflow-postgresql-headless
  template:
    metadata:
      labels:
        app.kubernetes.io/component: primary
        app.kubernetes.io/instance: mlflow
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: postgresql
        helm.sh/chart: postgresql-10.5.3
        role: primary
      name: mlflow-postgresql
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  app.kubernetes.io/component: primary
                  app.kubernetes.io/instance: mlflow
                  app.kubernetes.io/name: postgresql
              namespaces:
              - mlflow
              topologyKey: kubernetes.io/hostname
            weight: 1
      containers:
      - env:
        - name: BITNAMI_DEBUG
          value: "false"
        - name: POSTGRESQL_PORT_NUMBER
          value: "5432"
        - name: POSTGRESQL_VOLUME_DIR
          value: /bitnami/postgresql
        - name: PGDATA
          value: /bitnami/postgresql/data
        - name: POSTGRES_USER
          value: postgres
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              key: postgresql-password
              name: mlflow-postgresql
        - name: POSTGRESQL_ENABLE_LDAP
          value: "no"
        - name: POSTGRESQL_ENABLE_TLS
          value: "no"
        - name: POSTGRESQL_LOG_HOSTNAME
          value: "false"
        - name: POSTGRESQL_LOG_CONNECTIONS
          value: "false"
        - name: POSTGRESQL_LOG_DISCONNECTIONS
          value: "false"
        - name: POSTGRESQL_PGAUDIT_LOG_CATALOG
          value: "off"
        - name: POSTGRESQL_CLIENT_MIN_MESSAGES
          value: error
        - name: POSTGRESQL_SHARED_PRELOAD_LIBRARIES
          value: pgaudit
        image: docker.io/bitnami/postgresql:11.12.0-debian-10-r44
        imagePullPolicy: IfNotPresent
        livenessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - exec pg_isready -U "postgres" -h 127.0.0.1 -p 5432
        name: mlflow-postgresql
        ports:
        - containerPort: 5432
          name: tcp-postgresql
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - -e
            - |
              exec pg_isready -U "postgres" -h 127.0.0.1 -p 5432
              [ -f /opt/bitnami/postgresql/tmp/.initialized ] || [ -f /bitnami/postgresql/.initialized ]
        resources:
          requests:
            cpu: 250m
            memory: 256Mi
        securityContext:
          runAsUser: 1001
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        - mountPath: /bitnami
          name: data
          subPath: data
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 1001
      terminationGracePeriodSeconds: 30
      volumes:
      - emptyDir:
          medium: Memory
        name: dshm
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: data
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 8Gi
      volumeMode: Filesystem
      
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    meta.helm.sh/release-name: mlflow
    meta.helm.sh/release-namespace: mlflow
  labels:
    app.kubernetes.io/instance: mlflow
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: postgresql
    helm.sh/chart: postgresql-10.5.3
    service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"
  name: mlflow-postgresql-headless
  namespace: mlflow
spec:
  clusterIP: None
  ports:
  - name: tcp-postgresql
    port: 5432
    protocol: TCP
    targetPort: tcp-postgresql
  publishNotReadyAddresses: true
  selector:
    app.kubernetes.io/instance: mlflow
    app.kubernetes.io/name: postgresql
  sessionAffinity: None
  type: ClusterIP

The backend DB server's storage uses a PersistentVolume (the NFS-backed PV defined above). Note that only the headless Service is shown here; the Helm chart that generated these manifests also creates a regular ClusterIP Service named mlflow-postgresql, which the MLflow configuration below points at.
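
Before wiring up MLflow, you can verify that the database is reachable from inside the cluster. A short sketch using psycopg2 (the same driver the backend store URI below selects); the host and credentials are the placeholder values from the manifests above:

import psycopg2

# Placeholder DSN matching the Secret and the chart's ClusterIP Service.
conn = psycopg2.connect(
    host="mlflow-postgresql.mlflow.svc.cluster.local",
    port=5432,
    user="postgres",
    password="postgres",
    dbname="postgres",
)
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone())
conn.close()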

 

Multi-tenant MLflow manifests

apiVersion: v1
kind: ConfigMap
metadata:
  name: mlflow-configmap
  namespace: mlflow
data:
  MLFLOW_BACKEND_STORE_URI: postgresql+psycopg2://postgres:postgres@mlflow-postgresql:5432/postgres
  MLFLOW_S3_ENDPOINT_URL: http://minio-service.minio.svc.cluster.local:9000
  # These must be credentials that MinIO accepts (the root user from the
  # MinIO manifest above, or a service-account key issued for the MLflow server).
  AWS_ACCESS_KEY_ID: admin
  AWS_SECRET_ACCESS_KEY: admin123
  AWS_BUCKET: mlflow
  PORT: "5000"

---
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
    sidecar.istio.io/inject: "true"
  name: mlflow
  namespace: mlflow
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mlflow
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "true"
      labels:
        app: mlflow
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: my.node.zone
                operator: In
                values:
                - my-node
      containers:
      - args:
        - mlflow
        - server
        - --backend-store-uri
        - $(MLFLOW_BACKEND_STORE_URI)
        - --default-artifact-root
        - s3://$(AWS_BUCKET)
        - --host
        - 0.0.0.0
        - --port
        - $(PORT)
        - --app-name
        - basic-auth
        - --static-prefix
        - /mlops/mlflow
        envFrom:
        - configMapRef:
            name: mlflow-configmap
        image: my/mlflow:v2.6.0
        imagePullPolicy: Always
        name: mlflow
        ports:
        - containerPort: 5000
          protocol: TCP
        resources:
          limits:
            cpu: "1"
            memory: 1Gi
          requests:
            cpu: "1"
            memory: 1Gi
            
---
apiVersion: v1
kind: Service
metadata:
  name: mlflow-service
  namespace: mlflow
spec:
  ports:
  - port: 5000
    protocol: TCP
    targetPort: 5000
  selector:
    app: mlflow
  sessionAffinity: None
  type: ClusterIP
---


Istio VirtualService manifest for accessing the MLflow Tracking Server from outside the cluster

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: mlflow-vs
  namespace: mlflow
spec:
  gateways:
  - istio-system/istio-ingressgateway
  hosts:
  - '*'
  http:
  - match:
    - uri:
        prefix: /mlops/mlflow/
    rewrite:
      uri: /
    route:
    - destination:
        host: mlflow-service
        port:
          number: 5000
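
With the VirtualService in place, users can point their training code at the external URL. An end-to-end sketch, assuming a user account and a per-user MinIO key pair have already been provisioned (all names, URLs, and credentials below are placeholders):

import os
import mlflow

# Basic-auth credentials for the multi-tenant tracking server.
os.environ["MLFLOW_TRACKING_USERNAME"] = "alice"
os.environ["MLFLOW_TRACKING_PASSWORD"] = "alice-password"
# Per-user MinIO keys so artifact uploads from the client reach MinIO.
os.environ["MLFLOW_S3_ENDPOINT_URL"] = "http://minio-service.minio.svc.cluster.local:9000"
os.environ["AWS_ACCESS_KEY_ID"] = "alice-access-key"
os.environ["AWS_SECRET_ACCESS_KEY"] = "alice-secret-key"

mlflow.set_tracking_uri("http://localhost/mlops/mlflow")  # gateway URL

mlflow.set_experiment("alice-experiment")
with mlflow.start_run():
    mlflow.log_param("lr", 0.01)
    mlflow.log_metric("loss", 0.42)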

 

 

API for creating user accounts on the multi-tenant MLflow

import requests

from common import get_settings

settings = get_settings()

MLFLOW_URL = settings.get('MLFLOW_URL')
MLFLOW_PASSWORD = settings.get('MLFLOW_PASSWORD')

def create_mlflow_user(username, password=MLFLOW_PASSWORD):
    # User-creation endpoint exposed by the basic-auth app, behind the
    # /mlops/mlflow static prefix configured on the server.
    create_mlflow_user_endpoint = "mlops/mlflow/api/2.0/mlflow/users/create"
    create_mlflow_user_uri = f"{MLFLOW_URL.rstrip('/')}/{create_mlflow_user_endpoint}"
    print('create_mlflow_user_uri: ', create_mlflow_user_uri)
    response = requests.post(
        create_mlflow_user_uri,
        json={"username": username, "password": password},
    )
    print("create_mlflow_user: ", response.status_code)
    return response
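
The basic-auth app also exposes permission endpoints, so after creating a user you can grant them access to specific experiments. A hedged sketch: grant_experiment_permission is a helper defined here for illustration (not an MLflow client function), and the admin credentials and experiment id are placeholders:

def grant_experiment_permission(admin_auth, experiment_id, username, permission="READ"):
    # Grant `username` a permission on one experiment via the auth app's
    # experiment-permission endpoint.
    uri = f"{MLFLOW_URL.rstrip('/')}/mlops/mlflow/api/2.0/mlflow/experiments/permissions/create"
    return requests.post(
        uri,
        auth=admin_auth,  # e.g. ("admin", "<admin password>")
        json={"experiment_id": experiment_id,
              "username": username,
              "permission": permission},
    )

create_mlflow_user("alice")
grant_experiment_permission(("admin", "admin-password"), "1", "alice")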

 

API for creating an isolated MinIO bucket for each multi-tenant MLflow user

import json
import requests

from minio import Minio
from minio.error import S3Error

from common import get_settings

settings = get_settings()

MINIO_DOMAIN = settings.get('MINIO_DOMAIN')
MINIO_PORT = settings.get('MINIO_PORT')
MINIO_ADMIN_ACCESS_KEY = settings.get('MINIO_ADMIN_ACCESS_KEY')
MINIO_SECRET_KEY = settings.get('MINIO_SECRET_KEY')
MINIO_URL = f"{MINIO_DOMAIN}:{MINIO_PORT}"
MINIO_CONSOLE_URL = settings.get('MINIO_CONSOLE_URL')


class Client:
    def __init__(self, namespace_name, access_key, secret_key=MINIO_SECRET_KEY):
        # The bucket is named after the user's namespace; the access/secret
        # key pair becomes that user's bucket-scoped credentials.
        self.namespace_name = namespace_name
        self.access_key = access_key
        self.secret_key = secret_key

        # Admin MinIO client, used for bucket administration.
        self.client = Minio(
            endpoint=MINIO_URL,
            access_key=MINIO_ADMIN_ACCESS_KEY,
            secret_key=MINIO_SECRET_KEY,
            secure=False,  # set to True if using HTTPS
        )

    def create_minio_bucket(self):
        # Create a dedicated bucket named after the user's namespace.
        try:
            self.client.make_bucket(self.namespace_name)
            print("Bucket created successfully.")
            result = True
        except S3Error as err:
            print(f"Error creating bucket: {err}")
            result = False

        return result

    def create_bucket_keys(self):
        # Log in to the MinIO console API with the admin credentials to get a
        # session cookie, then issue an access/secret key pair that can only
        # access this user's bucket.
        create_minio_token_path = "mlops/console/api/v1/login"
        create_minio_token_uri = f"{MINIO_CONSOLE_URL.rstrip('/')}/{create_minio_token_path}"
        print('create_minio_token_uri: ', create_minio_token_uri)

        payload = json.dumps(
            {"accessKey": MINIO_ADMIN_ACCESS_KEY,
             "secretKey": MINIO_SECRET_KEY})

        response = requests.post(create_minio_token_uri,
                                 data=payload,
                                 headers={'Content-Type': 'application/json'})
        print('create_minio_token_result: ', response.status_code)

        token = response.headers.get('set-cookie', '')
        print('token: ', token)
        create_minio_key_path = "mlops/console/api/v1/service-account-credentials"
        create_minio_key_uri = f"{MINIO_CONSOLE_URL.rstrip('/')}/{create_minio_key_path}"
        print('create_minio_key_uri: ', create_minio_key_uri)

        # Bucket-scoped policy. Bucket-level actions (s3:ListBucket,
        # s3:GetBucketLocation, ...) apply to the bucket ARN itself, while
        # object-level actions apply to the /* form.
        policy = {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Action": [
                        "s3:GetBucketPolicy",
                        "s3:GetObject",
                        "s3:GetObjectTagging",
                        "s3:GetObjectVersion",
                        "s3:ListBucket",
                        "s3:PutObject",
                        "s3:DeleteObject",
                        "s3:GetBucketLocation",
                    ],
                    "Resource": [
                        f"arn:aws:s3:::{self.namespace_name}",
                        f"arn:aws:s3:::{self.namespace_name}/*",
                    ],
                }
            ],
        }
        payload = {
            # The console API expects the policy as a JSON-encoded string.
            "policy": json.dumps(policy),
            "accessKey": self.access_key,
            "secretKey": self.secret_key,
        }

        headers = {
            'Cookie': token,
            'Content-Type': 'application/json'
        }

        response = requests.post(create_minio_key_uri, headers=headers, data=json.dumps(payload))
        print('create_minio_key_result: ', response.status_code, response.text)

        return response
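
Putting the pieces together, provisioning a new tenant might look like the following (the namespace name and key pair are placeholders; how the keys are delivered to the user is up to your platform):

# Hypothetical provisioning flow for one user/namespace.
tenant = Client(namespace_name="alice-ns",
                access_key="alice-access-key",
                secret_key="alice-secret-key")

if tenant.create_minio_bucket():
    tenant.create_bucket_keys()

# The user then sets, in their own environment:
#   MLFLOW_S3_ENDPOINT_URL=http://minio-service.minio.svc.cluster.local:9000
#   AWS_ACCESS_KEY_ID=alice-access-key
#   AWS_SECRET_ACCESS_KEY=alice-secret-key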

 

 

Conclusion

Managing the YAML files that define and drive all of the Kubernetes objects (resources) above is repetitive, tedious work. When deploying applications to Kubernetes you often only need to change a few config values, yet every application still needs its own full set of YAML files, which quickly becomes a chore.

It therefore seems worth rewriting these manifests as a Helm chart and managing them that way.