When the Weave Net container keeps growing and eats into the node's disk space

개발허재 2024. 5. 23. 19:42
Events:
  Type     Reason                Age                     From     Message
  ----     ------                ----                    ----     -------
  Warning  ImageGCFailed         49m (x1403 over 40d)    kubelet  wanted to free 41729564672 bytes, but freed 0 bytes space with errors in image deletion: [rpc error: code = Unknown desc = Error response from daemon: conflict: unable to remove repository reference "10.70.189.51:80/wide-jupyter/20.04:0.7" (must force) - container 0cb4fe3d9780 is using its referenced image dfe662166510, rpc error: code = Unknown desc = Error response from daemon: conflict: unable to delete ed37d8185701 (must be forced) - image is being used by stopped container b311e4cd2365]
  Warning  ImageGCFailed         34m (x834 over 40d)     kubelet  wanted to free 41729564672 bytes, but freed 0 bytes space with errors in image deletion: [rpc error: code = Unknown desc = Error response from daemon: conflict: unable to remove repository reference "10.70.189.51:80/wide-jupyter/20.04:0.7" (must force) - container 0cb4fe3d9780 is using its referenced image dfe662166510, rpc error: code = Unknown desc = Error response from daemon: conflict: unable to delete ed37d8185701 (must be forced) - image is being used by stopped container 1adc4862248f]
…

 

As the events above show, the node ran into trouble because it had run out of disk space.

 

First, cordon the node so that it is marked unschedulable.
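
Cordoning only stops new pods from being scheduled onto the node; pods already running there are left alone. A minimal sketch of that step, wrapped in a small function. The node name `ds-dev-003` is an assumption derived from the host name in the shell prompt, so check `kubectl get nodes` for the name your cluster actually uses.

```shell
# Mark a node unschedulable. Existing pods keep running.
# The node name is whatever `kubectl get nodes` reports (assumed here).
cordon_node() {
  local node="$1"
  kubectl cordon "$node"
}

# Usage (needs cluster access):
#   cordon_node ds-dev-003
#   kubectl get nodes    # STATUS should now include SchedulingDisabled
```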

 

root@DS-DEV-003:/home/pomnwq24# df -h
Filesystem                    Size  Used Avail Use% Mounted on
udev                           16G     0   16G   0% /dev
tmpfs                         3.2G  299M  2.9G  10% /run
/dev/mapper/ubuntu--vg-lv--0  195G  195G     0 100% /

As the output shows, the root filesystem is completely full.

 

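Running du -sh /* showed that /var alone was taking up 182G. Drilling down through /var/lib, /var/lib/docker, and /var/lib/docker/containers, it turned out that the directory of a single container, /var/lib/docker/containers/ffb2f802bf***, was taking up a whopping 176G.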
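
The drill-down above can be repeated one level at a time with a small helper. This is only a sketch of one way to do it; the function name `largest_entries` is made up for illustration and is not from the original troubleshooting session.

```shell
# Print the largest entries directly under a directory, biggest first.
# -x stays on one filesystem, -k reports KiB so a plain numeric sort works.
# The first line is the directory itself (its total).
largest_entries() {
  local dir="${1:-/}" count="${2:-5}"
  du -xk -d 1 "$dir" 2>/dev/null | sort -rn | head -n "$count"
}

# Usage: follow the biggest entry down, level by level.
#   largest_entries /var
#   largest_entries /var/lib
#   largest_entries /var/lib/docker/containers
```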

 

Looking up that container ID with docker ps showed that it was the weave net container...

 

root@DS-DEV-003:/home/pomnwq24# docker ps |grep ffb2f802
ffb2f802bfec   df29c0a4002c           "/home/weave/launch.…"   5 months ago   Up 5 months             k8s_weave_weave-net-n59cj_kube-system_160e39e1-c7a9-4fb4-93cb-079c1cf6863f_0
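
With Docker's default json-file logging driver, a container's stdout and stderr are written to a `<container-id>-json.log` file inside `/var/lib/docker/containers/<container-id>/`, so a directory of this size usually points at an unrotated log. The post does not show which file was actually large, so the following is only an assumed way to check; `container_dir_usage` is a hypothetical helper name.

```shell
# List the biggest files in a container's directory, largest first.
# Arguments: a container ID prefix and, optionally, Docker's containers root.
# The first line is the directory total; the next lines are its files.
container_dir_usage() {
  local id_prefix="$1" root="${2:-/var/lib/docker/containers}"
  du -ak "$root"/"$id_prefix"* 2>/dev/null | sort -rn | head -n 5
}

# Usage:
#   container_dir_usage ffb2f802
# The log path Docker uses for the container can also be read directly:
#   docker inspect --format '{{.LogPath}}' ffb2f802bfec
```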

 

 

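Fix (while the node is still cordoned)

The node has to be brought back into service quickly, so pick one of the two options below.

1. Get into the weave net container and delete the file where the logs are piling up

2. Delete the container and let the pod re-create it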
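
The two options could look roughly like this. Option 1 is sketched as a variant rather than exactly what the post describes: instead of deleting a file from inside the container, it empties the log file in place on the host, because a deleted file that a process still holds open keeps its disk space until the process closes it. The assumption that the large file is Docker's json log is mine, not the post's. For option 2 the sketch deletes the pod and relies on the DaemonSet to re-create it; the pod name `weave-net-n59cj` and namespace `kube-system` are read off the docker ps output above.

```shell
# Option 1 (variant): empty a log file in place so the space is freed
# even while the writer still has the file open.
truncate_log() {
  local log_path="$1"
  [ -f "$log_path" ] || { echo "no such file: $log_path" >&2; return 1; }
  : > "$log_path"
}

# Option 2: delete the pod; the DaemonSet controller schedules a new one,
# and the old container's directory is cleaned up with it.
recreate_pod() {
  local namespace="$1" pod="$2"
  kubectl -n "$namespace" delete pod "$pod"
}

# Usage (run as root on the node, pick one):
#   truncate_log "$(docker inspect --format '{{.LogPath}}' ffb2f802bfec)"
#   recreate_pod kube-system weave-net-n59cj
```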

 

After that, run systemctl restart kubelet.

 

Checking the disk usage again confirms that free space has been reclaimed.

root@DS-DEV-003:/home/pomnwq24# df -h
Filesystem                    Size  Used Avail Use% Mounted on
udev                           16G     0   16G   0% /dev
tmpfs                         3.2G  2.0M  3.2G   1% /run
/dev/mapper/ubuntu--vg-lv--0  195G   17G  169G   9% /
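
Once the space is back, the node still has to be uncordoned before it takes new pods again. The post does not show that step, so the following is an assumed sketch: it reads the root filesystem's use percentage from `df -P` and only uncordons below a threshold. The helper names and the default threshold of 80 are arbitrary choices for illustration.

```shell
# Print the use percentage of a mount point as a bare number (e.g. 9).
disk_use_percent() {
  local mount="${1:-/}"
  df -P "$mount" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Uncordon the node only if root filesystem usage is below the limit.
uncordon_if_healthy() {
  local node="$1" limit="${2:-80}"
  local used
  used="$(disk_use_percent /)"
  if [ "$used" -lt "$limit" ]; then
    kubectl uncordon "$node"
  else
    echo "root filesystem still at ${used}%, leaving $node cordoned" >&2
    return 1
  fi
}

# Usage (node name assumed, check `kubectl get nodes`):
#   uncordon_if_healthy ds-dev-003
```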

 

Follow-up

The cause is that weave net's logging is set to log everything.

So WEAVE_DEBUG needs to be set to false under spec.template.spec.containers.env of the weave net DaemonSet.
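
One way to apply that setting without editing the manifest by hand is `kubectl set env`. This is a sketch under a few assumptions: the DaemonSet is named `weave-net` in `kube-system` and the container is named `weave` (both inferred from the docker ps output above). How Weave Net interprets the value of `WEAVE_DEBUG` is taken from this post and has not been verified here, so check the resulting log volume after the pods restart.

```shell
# Set WEAVE_DEBUG on the weave container of the weave-net DaemonSet.
# Changing the pod template causes the DaemonSet to roll out new pods
# (assuming its update strategy is RollingUpdate).
set_weave_debug() {
  local value="${1:-false}"
  kubectl -n kube-system set env daemonset/weave-net -c weave "WEAVE_DEBUG=${value}"
}

# Usage:
#   set_weave_debug false
#   kubectl -n kube-system set env daemonset/weave-net -c weave --list   # verify
```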