
🌈 Building Kurly's Own MLOps - Laying the Foundation

MLOps is a field that many companies have taken an interest in lately. There are plenty of articles covering the concepts, but at this point I find myself curious about how it is actually done in practice. The article I'm sharing this time describes how Kurly is building its MLOps. It is the first in their MLOps series and covers the GPU environment.

✨ Recommended for
- Anyone curious about a real company's MLOps case
- Anyone curious about use cases for GPUs on Kubernetes

🎁 Summary

Karpenter
- Open source built and maintained by AWS
- Handles auto scaling of Kubernetes worker nodes
- Can also clean up worker nodes that are no longer needed
- For comparison, AWS EKS uses an ASG (Auto Scaling Group) and a Launch Template to group worker nodes and manage them as a Node Group

Hypothetical case (how AWS EKS behaves)
- The current nodes in the cluster are full of Pods and nothing more can be deployed
- A request to create a new Pod comes in
- kube-scheduler selects a node for the new Pod => the Pod stays Pending until a node is selected
- If no node can be selected, the Desired Capacity of the Node Group's ASG is increased by one to raise the worker node count
- AWS reads the desired capacity value through the ASG and launches an EC2 worker node

How Karpenter behaves
- Works without going through the cloud provider's auto scaling feature
- Karpenter checks the state of new Pods and, when needed, launches and deletes worker nodes itself
- It also issues the request that binds a Pod to a specific worker node, in place of kube-scheduler
- Applying the same hypothetical case as above, the instance type to launch is decided by the Provisioner, Karpenter's Custom Resource
- Once the new node launched through the Provisioner becomes Ready, Karpenter itself requests the binding so the Pod can be deployed onto the new worker node

Difference between AWS EKS and Karpenter
- Karpenter handles auto scaling directly, while CA (Cluster Autoscaler) goes through the feature provided by the cloud provider and therefore takes longer

Karpenter's key concepts
- Watching: keeps checking for unschedulable Pods
- Evaluating: checks the Pods' scheduling constraints
- Provisioning: provisions nodes that meet the Pods' requirements and deploys the Pods there
- Removing: deletes nodes that are no longer needed

Provisioner
- The Custom Resource in which Karpenter's settings are written
- Defines constraints, the conditions for judging a node unnecessary, timeouts, and so on

Karpenter GPU setup tips
- NVDP: the NVIDIA device plugin for Kubernetes must be installed (deployed as a DaemonSet)
- Deprovisioning 1. Affinity -> use labels
- Deprovisioning 2. Consolidation -> requires upgrading Karpenter's CRD version
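To make the Watching / Evaluating / Provisioning / Removing cycle from the summary a bit more concrete, here is a minimal toy simulation in Python. It is only a sketch under assumptions, not Karpenter's actual code or API: `Pod`, `Node`, `ALLOWED_TYPES`, and `reconcile` are hypothetical names, and `ALLOWED_TYPES` merely stands in for the kind of constraint a Provisioner would hold.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List, Optional

@dataclass
class Pod:
    name: str
    gpus: int                      # number of GPUs the Pod requests
    node: Optional[str] = None     # None means the Pod is still Pending

@dataclass
class Node:
    name: str
    instance_type: str
    gpus: int
    pods: List[Pod] = field(default_factory=list)

    def free_gpus(self) -> int:
        return self.gpus - sum(p.gpus for p in self.pods)

# Hypothetical stand-in for a Provisioner: allowed instance types -> GPU count.
ALLOWED_TYPES = {"gpu-small": 1, "gpu-large": 4}
_node_ids = count()

def reconcile(pods: List[Pod], nodes: List[Node]) -> None:
    """One pass of the simplified loop; mutates pods and nodes in place."""
    # Watching: find Pods that have not been bound to a node yet.
    for pod in [p for p in pods if p.node is None]:
        target = next((n for n in nodes if n.free_gpus() >= pod.gpus), None)
        if target is None:
            # Evaluating: which allowed instance types satisfy the request?
            fits = [(g, t) for t, g in ALLOWED_TYPES.items() if g >= pod.gpus]
            if not fits:
                continue  # no allowed type fits; the Pod stays Pending
            gpus, itype = min(fits)  # pick the smallest type that fits
            # Provisioning: launch the node directly (no ASG in between).
            target = Node(f"node-{next(_node_ids)}", itype, gpus)
            nodes.append(target)
        # Binding: attach the Pod to the chosen node.
        target.pods.append(pod)
        pod.node = target.name
    # Removing: drop nodes that no longer run any Pod.
    nodes[:] = [n for n in nodes if n.pods]

if __name__ == "__main__":
    pods = [Pod("train-a", 1), Pod("train-b", 4)]
    nodes: List[Node] = []
    reconcile(pods, nodes)
    for n in nodes:
        print(n.name, n.instance_type, [p.name for p in n.pods])
```

The point of the sketch is the contrast with the ASG path: the loop chooses an instance type per pending Pod and creates the node itself, instead of incrementing a desired capacity and waiting for a node group to react.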
