Deploying Multi-Tenant MLflow with MinIO and PostgreSQL
Before diving in: I currently build and operate an ML platform at my company, and we are in the process of extending it into a full MLOps platform.
We want to offer MLflow as part of it, but running a separate instance in every user's Namespace would make the required compute grow explosively.
Instead, we are building a multi-tenant MLflow: a single server that serves many different users.
The MLflow Authentication component that makes multi-tenant MLflow possible is documented as available from version 2.5.0, but on GitHub it appears to have been added to the project as early as v2.3.0 (see the link below).
https://github.com/mlflow/mlflow/tree/v2.3.0/mlflow/server
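From a tenant's point of view, multi-tenancy boils down to every user talking to the same tracking server with their own credentials. Below is a minimal client-side sketch, assuming the server is exposed at the gateway path configured later in this post and that the user account already exists; the URL, username, and password are placeholders.

import os
import mlflow

# Placeholder values: the tracking URI follows the Istio route configured later
# in this post, and the credentials are those created via the user-creation API.
os.environ["MLFLOW_TRACKING_USERNAME"] = "user-a"
os.environ["MLFLOW_TRACKING_PASSWORD"] = "user-a-password"
mlflow.set_tracking_uri("http://localhost/mlops/mlflow")

# Each run is recorded under the authenticated user's own experiment.
mlflow.set_experiment("user-a-experiment")
with mlflow.start_run():
    mlflow.log_param("lr", 0.01)
    mlflow.log_metric("accuracy", 0.93)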
The overall deployment consists of the following steps:
1. Deploy MinIO to serve as the MLflow artifact store.
2. Deploy a PostgreSQL DB in the mlflow namespace to be used as the MLflow server's backend store.
3. Deploy the multi-tenant MLflow server in the mlflow namespace.
4. The MinIO instance that MLflow uses as its artifact store must not be directly accessible to users, and to manage it as a single server it is deployed as one MinIO server in its own dedicated namespace.
MinIO Deployment & Service manifests
apiVersion: apps/v1
kind: Deployment
metadata:
  name: minio-deployment
  namespace: minio
  annotations:
    # Enable Istio sidecar injection
    sidecar.istio.io/inject: "true"
spec:
  selector:
    matchLabels:
      app: minio
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: minio
      annotations:
        # Enable Istio sidecar injection
        sidecar.istio.io/inject: "true"
    spec:
      volumes:
      - name: storage
        nfs:
          path: /my/nas/path
          server: my.nas.com
      containers:
      - name: minio
        image: my/minio/minio:latest
        args:
        - server
        - /storage
        - --console-address
        - ":9001"
        env:
        # MinIO access key and secret key
        - name: MINIO_ROOT_USER
          value: "minio"
        - name: MINIO_ROOT_PASSWORD
          value: "minio123"
        - name: MINIO_BROWSER_REDIRECT_URL
          value: "http://localhost/mlops/minio/console"
        ports:
        - containerPort: 9000
          hostPort: 9000
        - containerPort: 9001
          hostPort: 9001
        volumeMounts:
        - name: storage
          mountPath: "/storage"
---
apiVersion: v1
kind: Service
metadata:
  name: minio-service
  namespace: minio
spec:
  type: LoadBalancer
  ports:
  - name: server-port
    port: 9000
    targetPort: 9000
    protocol: TCP
  - name: console-port
    port: 9001
    targetPort: 9001
    protocol: TCP
  selector:
    app: minio
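The MLflow server configured below expects a default bucket (AWS_BUCKET: mlflow) to exist in this MinIO instance. A small sketch for creating it with the MinIO Python client, assuming in-cluster access to the service defined above; adjust the endpoint and credentials to your environment.

from minio import Minio

# Admin credentials and endpoint taken from the manifest above; these are
# placeholders and should come from a Secret in a real deployment.
client = Minio(
    "minio-service.minio.svc.cluster.local:9000",
    access_key="minio",
    secret_key="minio123",
    secure=False,
)

# Create the default bucket used by --default-artifact-root if it is missing.
if not client.bucket_exists("mlflow"):
    client.make_bucket("mlflow")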
PostgreSQL manifests
apiVersion: v1
data:
  postgresql-password: cG9zdGdyZXM=
kind: Secret
metadata:
  annotations:
    meta.helm.sh/release-name: mlflow
    meta.helm.sh/release-namespace: mlflow
  labels:
    app.kubernetes.io/instance: mlflow
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: postgresql
    helm.sh/chart: postgresql-10.5.3
  name: mlflow-postgresql
  namespace: mlflow
type: Opaque
---
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/bound-by-controller: "yes"
  finalizers:
  - kubernetes.io/pv-protection
  name: nfs-pv
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 8Gi
  nfs:
    path: /my/nas/path/mlflow-postgres
    server: my.nas.com
  persistentVolumeReclaimPolicy: Retain
  volumeMode: Filesystem
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    meta.helm.sh/release-name: mlflow
    meta.helm.sh/release-namespace: mlflow
  labels:
    app.kubernetes.io/component: primary
    app.kubernetes.io/instance: mlflow
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: postgresql
    helm.sh/chart: postgresql-10.5.3
  name: mlflow-postgresql
  namespace: mlflow
spec:
  podManagementPolicy: OrderedReady
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: mlflow
      app.kubernetes.io/name: postgresql
      role: primary
  serviceName: mlflow-postgresql-headless
  template:
    metadata:
      labels:
        app.kubernetes.io/component: primary
        app.kubernetes.io/instance: mlflow
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: postgresql
        helm.sh/chart: postgresql-10.5.3
        role: primary
      name: mlflow-postgresql
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  app.kubernetes.io/component: primary
                  app.kubernetes.io/instance: mlflow
                  app.kubernetes.io/name: postgresql
              namespaces:
              - mlflow
              topologyKey: kubernetes.io/hostname
            weight: 1
      containers:
      - env:
        - name: BITNAMI_DEBUG
          value: "false"
        - name: POSTGRESQL_PORT_NUMBER
          value: "5432"
        - name: POSTGRESQL_VOLUME_DIR
          value: /bitnami/postgresql
        - name: PGDATA
          value: /bitnami/postgresql/data
        - name: POSTGRES_USER
          value: postgres
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              key: postgresql-password
              name: mlflow-postgresql
        - name: POSTGRESQL_ENABLE_LDAP
          value: "no"
        - name: POSTGRESQL_ENABLE_TLS
          value: "no"
        - name: POSTGRESQL_LOG_HOSTNAME
          value: "false"
        - name: POSTGRESQL_LOG_CONNECTIONS
          value: "false"
        - name: POSTGRESQL_LOG_DISCONNECTIONS
          value: "false"
        - name: POSTGRESQL_PGAUDIT_LOG_CATALOG
          value: "off"
        - name: POSTGRESQL_CLIENT_MIN_MESSAGES
          value: error
        - name: POSTGRESQL_SHARED_PRELOAD_LIBRARIES
          value: pgaudit
        image: docker.io/bitnami/postgresql:11.12.0-debian-10-r44
        imagePullPolicy: IfNotPresent
        livenessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - exec pg_isready -U "postgres" -h 127.0.0.1 -p 5432
        name: mlflow-postgresql
        ports:
        - containerPort: 5432
          name: tcp-postgresql
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - -e
            - |
              exec pg_isready -U "postgres" -h 127.0.0.1 -p 5432
              [ -f /opt/bitnami/postgresql/tmp/.initialized ] || [ -f /bitnami/postgresql/.initialized ]
        resources:
          requests:
            cpu: 250m
            memory: 256Mi
        securityContext:
          runAsUser: 1001
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        - mountPath: /bitnami
          name: data
          subPath: data
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 1001
      terminationGracePeriodSeconds: 30
      volumes:
      - emptyDir:
          medium: Memory
        name: dshm
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: data
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 8Gi
      volumeMode: Filesystem
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    meta.helm.sh/release-name: mlflow
    meta.helm.sh/release-namespace: mlflow
  labels:
    app.kubernetes.io/instance: mlflow
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: postgresql
    helm.sh/chart: postgresql-10.5.3
    service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"
  name: mlflow-postgresql-headless
  namespace: mlflow
spec:
  clusterIP: None
  ports:
  - name: tcp-postgresql
    port: 5432
    protocol: TCP
    targetPort: tcp-postgresql
  publishNotReadyAddresses: true
  selector:
    app.kubernetes.io/instance: mlflow
    app.kubernetes.io/name: postgresql
  sessionAffinity: None
  type: ClusterIP
The backend DB server uses a Persistent Volume for its storage.
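Before starting the MLflow server, it can be useful to verify that the backend store is reachable with the credentials from the Secret above. A minimal check, assuming the mlflow-postgresql service name used in the backend-store URI below and network access to it (for example via kubectl port-forward):

import psycopg2

# Connection parameters mirror the Secret and the backend-store URI used below;
# replace the host with 127.0.0.1 when using kubectl port-forward.
conn = psycopg2.connect(
    host="mlflow-postgresql.mlflow.svc.cluster.local",
    port=5432,
    user="postgres",
    password="postgres",
    dbname="postgres",
)
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])
conn.close()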
Multi-tenant MLflow manifests
apiVersion: v1
kind: ConfigMap
metadata:
  name: mlflow-configmap
  namespace: mlflow
data:
  MLFLOW_BACKEND_STORE_URI: postgresql+psycopg2://postgres:postgres@mlflow-postgresql:5432/postgres
  MLFLOW_S3_ENDPOINT_URL: http://minio-service.minio.svc.cluster.local:9000
  AWS_ACCESS_KEY_ID: admin
  AWS_SECRET_ACCESS_KEY: admin123
  AWS_BUCKET: mlflow
  PORT: "5000"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
    sidecar.istio.io/inject: "true"
  name: mlflow
  namespace: mlflow
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mlflow
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "true"
      labels:
        app: mlflow
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: my.node.zone
                operator: In
                values:
                - my-node
      containers:
      - args:
        - mlflow
        - server
        - --backend-store-uri
        - $(MLFLOW_BACKEND_STORE_URI)
        - --default-artifact-root
        - s3://$(AWS_BUCKET)
        - --host
        - 0.0.0.0
        - --port
        - $(PORT)
        - --app-name
        - basic-auth
        - --static-prefix
        - /mlops/mlflow
        envFrom:
        - configMapRef:
            name: mlflow-configmap
        image: my/mlflow:v2.6.0
        imagePullPolicy: Always
        name: mlflow
        ports:
        - containerPort: 5000
          protocol: TCP
        resources:
          limits:
            cpu: "1"
            memory: 1Gi
          requests:
            cpu: "1"
            memory: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: mlflow-service
  namespace: mlflow
spec:
  ports:
  - port: 5000
    protocol: TCP
    targetPort: 5000
  selector:
    app: mlflow
  sessionAffinity: None
  type: ClusterIP
---
Istio VirtualService manifest for accessing the MLflow Tracking Server from outside the cluster
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: mlflow-vs
  namespace: mlflow
spec:
  gateways:
  - istio-system/istio-ingressgateway
  hosts:
  - '*'
  http:
  - match:
    - uri:
        prefix: /mlops/mlflow/
    rewrite:
      uri: /
    route:
    - destination:
        host: mlflow-service
        port:
          number: 5000
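A quick way to confirm the route works end to end is to call the tracking server's health endpoint through the gateway. The host below is a placeholder for the ingress gateway address; depending on how the basic-auth app is configured, the request may also need credentials.

import requests

# /mlops/mlflow/ is rewritten to / by the VirtualService, so this reaches
# the MLflow server's /health endpoint.
resp = requests.get("http://localhost/mlops/mlflow/health")
print(resp.status_code, resp.text)  # expect 200 / "OK" when routing is healthy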
API for creating accounts for the users who will access the multi-tenant MLflow
import os, requests

from common import get_settings

settings = get_settings()
MLFLOW_URL = settings.get('MLFLOW_URL')
MLFLOW_PASSWORD = settings.get('MLFLOW_PASSWORD')


# Create an MLflow account through the basic-auth server's REST API.
def create_mlflow_user(username, password=MLFLOW_PASSWORD):
    create_mlflow_user_endpoint = "mlops/mlflow/api/2.0/mlflow/users/create"
    create_mlflow_user_uri = os.path.join(MLFLOW_URL, create_mlflow_user_endpoint)
    print('create_mlflow_user_uri: ', create_mlflow_user_uri)
    response = requests.post(create_mlflow_user_uri,
                             json={"username": username,
                                   "password": password},
                             )
    print("create_mlflow_user: ", response.status_code)
    return response
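Usage is straightforward: the platform calls this once per tenant when provisioning their workspace. The username below is a placeholder; whether the call needs additional (admin) authentication depends on how the basic-auth app is configured.

# Provision an MLflow account for a new tenant; the username is a placeholder.
response = create_mlflow_user("user-a")
if response.status_code == 200:
    print("MLflow user created")
else:
    print("Failed to create MLflow user:", response.text)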
API for creating an independent MinIO bucket for each multi-tenant MLflow user
import os
import json
import requests
import base64
from minio import Minio
from minio.error import S3Error
from common import get_settings

settings = get_settings()
MINIO_DOMAIN = settings.get('MINIO_DOMAIN')
MINIO_PORT = settings.get('MINIO_PORT')
MINIO_ADMIN_ACCESS_KEY = settings.get('MINIO_ADMIN_ACCESS_KEY')
MINIO_SECRET_KEY = settings.get('MINIO_SECRET_KEY')
MINIO_URL = f"{MINIO_DOMAIN}:{MINIO_PORT}"
MINIO_CONSOLE_URL = settings.get('MINIO_CONSOLE_URL')


class Client:
    def __init__(self, namespace_name, access_key, secret_key=MINIO_SECRET_KEY):
        self.namespace_name = namespace_name
        self.access_key = access_key
        self.secret_key = secret_key
        # Create the MinIO client with admin credentials
        self.client = Minio(
            endpoint=MINIO_URL,
            access_key=MINIO_ADMIN_ACCESS_KEY,
            secret_key=MINIO_SECRET_KEY,
            secure=False,  # Set it to True if using HTTPS
        )

    def create_minio_bucket(self):
        # Create a bucket named after the user's namespace
        try:
            self.client.make_bucket(self.namespace_name)
            print("Bucket created successfully.")
            result = True
        except S3Error as err:
            print(f"Error creating bucket: {err}")
            result = False
        return result

    def create_bucket_keys(self):
        # Issue an access key / secret key pair that can access only this bucket.
        # First, log in to the MinIO Console API to obtain a session token.
        create_minio_token_path = "mlops/console/api/v1/login"
        create_minio_token_uri = os.path.join(MINIO_CONSOLE_URL, create_minio_token_path)
        print('create_minio_token_uri: ', create_minio_token_uri)
        payload = json.dumps(
            {"accessKey": MINIO_ADMIN_ACCESS_KEY,
             "secretKey": MINIO_SECRET_KEY})
        response = requests.request("POST",
                                    create_minio_token_uri,
                                    data=payload,
                                    headers={
                                        'Content-Type': 'application/json'
                                    })
        print('create_minio_token_result: ', response.status_code)
        token = response.headers.get('set-cookie', '')
        print('token: ', token)

        # Then create a service account whose policy is scoped to the user's bucket.
        create_minio_key_path = "mlops/console/api/v1/service-account-credentials"
        create_minio_key_uri = os.path.join(MINIO_CONSOLE_URL, create_minio_key_path)
        print('create_minio_key_uri: ', create_minio_key_uri)
        payload = {
            "policy": '{{\n'
                      '  "Version": "2012-10-17",\n'
                      '  "Statement": [\n'
                      '    {{\n'
                      '      "Effect": "Allow",\n'
                      '      "Action": [\n'
                      '        "s3:GetBucketPolicy",\n'
                      '        "s3:GetObject",\n'
                      '        "s3:GetObjectTagging",\n'
                      '        "s3:GetObjectVersion",\n'
                      '        "s3:ListBucket",\n'
                      '        "s3:PutObject",\n'
                      '        "s3:DeleteObject",\n'
                      '        "s3:GetBucketLocation"\n'
                      '      ],\n'
                      '      "Resource": [\n'
                      '        "arn:aws:s3:::{0}/*"\n'
                      '      ]\n'
                      '    }}\n'
                      '  ]\n'
                      '}}'.format(self.namespace_name),
            "accessKey": self.access_key,
            "secretKey": self.secret_key
        }
        headers = {
            'Cookie': token,
            'Content-Type': 'application/json'
        }
        response = requests.request("POST", create_minio_key_uri, headers=headers, data=json.dumps(payload))
        print('create_minio_key_result: ', response.status_code, response.text)
        return response
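As with the user-creation API, the platform calls this when provisioning a tenant: create the bucket named after the user's namespace, then issue bucket-scoped keys. The names below are placeholders.

# Provision a per-tenant bucket and bucket-scoped credentials (placeholder names).
client = Client(namespace_name="user-a", access_key="user-a-access-key")
if client.create_minio_bucket():
    client.create_bucket_keys()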
Conclusion
Managing the YAML files used to define and operate all of the Kubernetes objects (resources) above is repetitive and tedious. When deploying applications on Kubernetes, often only a few config values differ between deployments, yet a full set of YAML files has to be written for every application, which quickly becomes cumbersome.
It therefore seems worth rewriting these manifests as a Helm chart and managing them that way.