ROCm on Kubernetes #

NVIDIA's CUDA and AMD's ROCm provide frameworks to take advantage of their respective GPU platforms. AMD ROCm™ is an open software stack including drivers, development tools, and APIs that enable GPU programming from the low-level kernel to end-user applications. It offers a suite of optimizations for AI workloads and supports the broader AI software ecosystem, including open frameworks, models, and tools. ROCm is optimized for generative AI and HPC applications, and it is easy to migrate existing code into ROCm software. It also supports programming models such as HIP (AMD's CUDA-like C++ runtime API and kernel language) and provides OpenCL compatibility through an Installable Client Driver (ICD) loader, so parallel-processing ROCm programs can be installed on different platforms. The AMD ROCm Community is an excellent place to access real-time discussion support, knowledge bases, and the latest news through blogs on the ROCm open software platform.

Kubernetes, developed by Google and now maintained by the Cloud Native Computing Foundation, enables developers to build, run, and manage applications across any infrastructure (Mar 12, 2025). It allows customers to easily manage and deploy AI workloads at scale by providing a robust platform for automating deployment, scaling, and operation of application containers across clusters of hosts. AMD GPUs combined with OpenShift AI accelerate model training and tuning by leveraging parallel computing and Kubernetes-based resource management; together they offer a scalable, cost-effective solution for industries ranging from healthcare to autonomous vehicles (Jan 22, 2025). OpenShift comes from Red Hat, the world's leading provider of enterprise open source solutions, including Linux, cloud, container, and Kubernetes technologies, which delivers hardened solutions that make it easier for enterprises to work across platforms and environments, from the core datacenter to the network edge.

Several ROCm components make this work on Kubernetes:

- The ROCm Device Plugin for Kubernetes, which provides GPU discovery, health checks, configuration of GPU-enabled containers, and time slicing (Jun 12, 2025). The `k8s-device-plugin` component serves as the core Kubernetes device plugin for AMD GPU resource registration and allocation (Jun 26, 2025); support for AMD GPUs in Kubernetes is provided by this plugin.
- The AMD GPU Operator, which simplifies the deployment and management of AMD Instinct GPU accelerators within Kubernetes clusters.
- The ROCm Data Center tool (RDC), which offers a suite of features to enhance GPU management and monitoring.
- vLLM, a versatile framework for serving large language models, particularly well suited to Kubernetes environments and high-throughput inference scenarios (Mar 4, 2025).

On Kubernetes, RDMA is enabled via device plugins (e.g., k8s-rdma-device-plugin) that expose RDMA resources, together with kernel modules such as nvidia-peermem (for NVIDIA GPUs) or rocm-core with libfabric (for AMD GPUs).

For a worked multinode example, see the Feb 23, 2024 and Oct 15, 2024 blog posts showing, step by step, how to set up and fine-tune a Stable Diffusion XL (SDXL) model in a multinode Oracle Cloud Infrastructure Kubernetes Engine (OKE) cluster of AMD GPUs using ROCm. If you already have a Kubernetes cluster, skip ahead to the GPU Operator installation; otherwise follow the official Helm installation guide. Running `rocminfo` and `amd-smi list` on bare metal will enumerate all ROCm-capable GPUs on the machine.
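The same check can be run from inside the cluster once the device plugin is installed. Below is a minimal smoke-test sketch, assuming the plugin advertises GPUs under the `amd.com/gpu` resource name and that a ROCm base image such as `rocm/dev-ubuntu-22.04` is pullable from your nodes:

```bash
# Launch a one-shot pod that requests a single AMD GPU and enumerates it.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: rocm-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: rocm
      image: rocm/dev-ubuntu-22.04        # any ROCm-enabled image should work
      # If amd-smi is missing from the image, rocminfo alone is enough.
      command: ["/bin/sh", "-c", "rocminfo && amd-smi list"]
      resources:
        limits:
          amd.com/gpu: 1                  # allocated by the device plugin
EOF

# Once the pod has run to completion, its logs should list the GPU.
kubectl logs rocm-smoke-test
kubectl delete pod rocm-smoke-test
```

If the pod stays Pending, the scheduler found no node advertising `amd.com/gpu`, which usually means the device plugin is not running on the GPU nodes.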
Serving using vLLM #

vLLM is a fast and easy-to-use library for LLM inference and serving: an open-source library designed to deliver high throughput and low latency for large language model (LLM) inference. It optimizes text generation workloads by efficiently batching requests and making full use of GPU resources, empowering developers to manage complex tasks like code generation and large-scale conversational AI. ROCm supports vLLM and Hugging Face TGI as major LLM-serving frameworks, and AMD is actively working with the vLLM team. In practice, if you're using KServe to manage your Kubernetes inference service, you'll probably end up using vLLM anyway because of network effects and ecosystem lock-in. ROCm enables inference and deployment for various classes of models, including CNN, RNN, LSTM, MLP, and transformers; this section focuses on deploying transformer-based LLM models, for example Llama-3.1 8B served with vLLM, and on implementing LLM inference frameworks with ROCm acceleration (Oct 16, 2025).

A Jan 16, 2025 post, the second in a series, provides a walkthrough for building a vLLM container that can be used for both inference and benchmarking. The upstream Dockerfile.rocm uses ROCm 6.2 by default, but also supports ROCm 5.7, 6.0, and 6.1 in older vLLM branches; it provides flexibility to customize the image build through arguments such as BASE_IMAGE, which specifies the base image used when running `docker build` (specifically, the PyTorch-on-ROCm base image). A Nov 29, 2024 post shows the steps to deploy a vLLM inference service on Kubernetes over AMD ROCm GPUs.

With robust drivers, flexible scheduling, and comprehensive management, ROCm enables scalable, high-performance GPU solutions for HPC and AI across bare-metal, virtualized, and containerized deployments. The scale and complexity of modern AI workloads continue to grow, but so do the expectations around performance and ease of deployment (Apr 14, 2025). Built on ROCm™, integrated with Kubernetes, and aligned with enterprise workflows, AIM provides flexible, high-performance building blocks for serving LLMs, vision models, speech models, and agentic AI pipelines.

Kubernetes Installation #

For this tutorial, we'll use MicroK8s, a lightweight Kubernetes distribution ideal for development and testing (not recommended for production environments). It simplifies the deployment and management of Kubernetes clusters while maintaining compatibility with standard Kubernetes tools and APIs.

Kubernetes (Helm) #

This guide walks through installing the AMD GPU Operator on a Kubernetes cluster using Helm. Prerequisites: Kubernetes v1.29.0 or later; Helm v3.6.0 or later; the kubectl command-line tool configured with access to the cluster; cluster admin privileges. Cluster requirements: a functioning Kubernetes cluster with all system pods running and ready.

How to choose a Resource Naming Strategy #

To customize the way the device plugin reports GPU resources to Kubernetes as allocatable resources, use the single or mixed resource naming strategy in the DeviceConfig CR. Before choosing a strategy, note the definitions of homogeneous nodes (all GPUs on the node are the same model) and heterogeneous nodes (the node carries a mix of GPU models).
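A condensed sketch of that installation flow, assuming snap-based MicroK8s and the Helm repository and chart names published in the AMD GPU Operator documentation (verify both against the current docs before relying on them):

```bash
# Install and start MicroK8s.
sudo snap install microk8s --classic
sudo microk8s status --wait-ready
sudo snap alias microk8s.kubectl kubectl    # optional convenience alias

# Add the AMD GPU Operator Helm repository and install the operator.
# Repo URL and chart name follow the operator docs; double-check them.
helm repo add rocm https://rocm.github.io/gpu-operator
helm repo update
helm install amd-gpu-operator rocm/gpu-operator-charts \
  --namespace kube-amd-gpu --create-namespace

# Verify that the operator pods come up.
kubectl get pods -n kube-amd-gpu
```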
Training on ROCm #

See Training a model with Primus and PyTorch for details. Note that, for a unified training solution on AMD GPUs with ROCm, the rocm/pytorch-training Docker Hub registry will soon be deprecated in favor of rocm/primus; the rocm/primus Docker containers will cover PyTorch training-ecosystem frameworks, including torchtitan and Megatron-LM (Oct 23, 2025).

For multi-node scheduling, there is an MIT-licensed demo of a simple distributed training job using Volcano and PyTorch on a Kubernetes cluster with AMD GPUs. Its reference environment: a K3s server (version v1.x.6+k3s1) on an AMD Ryzen 9 7940HS CPU.

DeepSpeed also runs on this stack. Installing is as simple as `pip install deepspeed`. DeepSpeed has direct integrations with HuggingFace Transformers and PyTorch Lightning: HuggingFace Transformers users can accelerate their models through a simple `--deepspeed` flag plus a config file. To get started with DeepSpeed on AzureML, see the AzureML Examples on GitHub.
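For instance, a typical DeepSpeed launch looks like the following; the script name `train.py` and the config file `ds_config.json` are placeholders for your own training script and DeepSpeed configuration:

```bash
pip install deepspeed

# Launch the training script across the node's GPUs. The deepspeed launcher
# discovers local accelerators itself; --num_gpus pins the count explicitly.
# For HuggingFace Trainer scripts, --deepspeed ds_config.json activates the
# DeepSpeed integration with the settings in that config file.
deepspeed --num_gpus=4 train.py --deepspeed ds_config.json
```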
Feb 14, 2025 · This blog is part 2 of a series aimed at providing a comprehensive, step-by-step guide for deploying and scaling AI inference workloads with Kubernetes and the AMD GPU Operator on the AMD Instinct platform. In the first post (Feb 7, 2025), we establish the essential infrastructure: a Kubernetes cluster set up with MicroK8s, Helm configured for streamlined deployments, persistent storage for model caching, and the AMD GPU Operator installed to seamlessly integrate AMD hardware with Kubernetes.

AMD GPU Device Plugin for Kubernetes #

This is a Kubernetes device plugin implementation that enables the registration of AMD GPUs in a container cluster for compute workloads. With the appropriate hardware and this plugin deployed in your Kubernetes cluster, you will be able to run jobs that require an AMD GPU. The plugin serves as a bridge between Kubernetes and AMD GPUs, allowing the Kubernetes scheduler to keep track of available GPU resources and allocate them to pods that request them (Mar 13, 2025). It requires Kubernetes v1.18 or higher and ROCm-capable AMD GPU hardware; more information is available in the ROCm documentation. There are many cluster options to run it on (kubeadm-based K8s, K3s, MicroK8s, and so on); one published walkthrough uses a single-node Rancher system (see the linked guide for Kubernetes setup) with this reference environment:

```
$ apt show rocm-libs -a
Package: rocm-libs
Version: 6.0.60000-91~20.04

$ lsb_release -a
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.6 LTS
Release:        20.04
Codename:       focal

$ docker -v
Docker version 25.0.1, build 29cf629
```

AMD GPU Operator #

The AMD GPU Operator simplifies the deployment and management of AMD Instinct GPU accelerators within Kubernetes clusters. This project enables seamless configuration and operation of GPU-accelerated workloads, including machine learning, generative AI, and other GPU-intensive applications. Features include automated driver installation and management and easy deployment; on managed Kubernetes offerings, this component is deployed automatically when you create or update a cluster.

Azure Kubernetes Service Compatibility: AMD Instinct MI300X #

This guide outlines the compatibility of various Azure VM sizes, operating systems, and software versions with specific GPU types and ROCm versions.
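Deployment of the plugin itself is a single manifest applied as a DaemonSet. A sketch, assuming the manifest name used in the ROCm/k8s-device-plugin repository (check the README for the current path):

```bash
# Deploy the AMD device plugin DaemonSet from the upstream repo.
kubectl create -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/k8s-ds-amdgpu-dp.yaml

# Confirm GPUs are now advertised as allocatable resources on each node.
kubectl get nodes -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.amd\.com/gpu'
```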
Containers and the ROCm ecosystem #

Containers: Kubernetes and ROCm orchestration tools support dynamic GPU allocation and workload balancing in containerized and cloud environments (Jun 6, 2025). The AMD Container Toolkit simplifies GPU acceleration in containers, with streamlined setup, runtime hooks, and full ROCm integration (Jul 3, 2025). The ROCm Docker repository hosts Dockerfiles useful for building your own ROCm-capable containers; the built images, including the rocm/rocm-terminal image, are published by AMD ROCm, a verified publisher on Docker Hub. An interesting innovation that ROCm has is compatibility with Docker for containers and Kubernetes for container orchestration (Mar 19, 2024), and ROCm was built to support multiple domains beyond GPGPU, including high-performance computing and heterogeneous computing (Mar 23, 2025).

ROCm keeps evolving. ROCm 6.4 brings AI/HPC performance boosts, enhanced profiling tools, better Kubernetes support, and modular drivers, accelerating AI and HPC workloads on AMD GPUs; it is a leap forward for organizations building the future (Apr 11, 2025). Democratizing AI compute means making advanced infrastructure and tools accessible to everyone, empowering startups, researchers, and developers to train, deploy, and scale AI models without being constrained by proprietary systems or vendor lock-in (Jan 8, 2025). In that spirit, AMD Enterprise AI Suite provides a unified platform that integrates GPU infrastructure, workload orchestration, model inference, and lifecycle governance without dependence on proprietary systems, and AMD ROCm Enterprise AI builds on ROCm 7 with support for Kubernetes and Slurm, cluster provisioning, and system telemetry (Jun 13, 2025); a Nov 17, 2025 blog shows how to operationalize enterprise AI on AMD Instinct™ GPUs using an open, Kubernetes-native software stack. The AMD open AI ecosystem, built on AMD ROCm™ Software, pre-built optimized Docker images, and the AMD Developer Cloud, provides the foundation. Integrated with ROCm and compatible with PyTorch, Triton, and multi-GPU setups, OPEA helps Instinct customers optimize performance and scale from edge to cloud; learn more on the OPEA site. Successful validation of k0rdent with cutting-edge AMD Instinct MI300X GPUs likewise ensures reliable performance for demanding AI and HPC workloads. Beyond the datacenter, BBI's dacreo AI software stack for AMD Ryzen 7 4000 and V2000 series devices is based on a tailored AMD ROCm open-source compute stack, fully harnessing onboard GPUs for artificial intelligence applications and enabling cloud-native functionality, including Kubernetes in space (Mar 19, 2024).

ROCm versus CUDA #

When it comes to AI applications, the choice between AMD's ROCm and NVIDIA's CUDA platforms plays a crucial role in shaping the landscape of AI development (Aug 12, 2024). GPU computing has become indispensable to modern artificial intelligence; the vast parallel processing power of graphics cards is what makes it possible (Jan 19, 2024, "CUDA vs ROCm: The Ongoing Battle for GPU Computing Supremacy"). Both platforms offer unique features and capabilities, but they differ significantly in software maturity, hardware support, and ecosystem integration, and NVIDIA currently holds a dominant market share. For concrete numbers, a May 10, 2025 post compares NVIDIA CUDA 12.3 and AMD ROCm 6.0 performance for lottery simulations, with real benchmarks, code examples, and optimization tips.

K3s and lightweight clusters #

Dec 2, 2024 · Learn to install K3s with the ROCm GPU Operator on Ubuntu 24.04 for efficient AMD GPU management in Kubernetes clusters. K3s is a lightweight, certified Kubernetes distribution designed for resource-constrained environments, such as edge computing and IoT devices. When combined with the ROCm GPU Operator, K3s can efficiently manage AMD GPUs within the cluster, automating driver and plugin rollout.

Air-gapped Installation Guide #

This guide explains how to install the AMD GPU Operator in an air-gapped environment where the Kubernetes cluster has no external network connectivity. Requirements include access to an internal container registry, and the required images must be mirrored to that registry.

Akash Providers #

Welcome to the specialized guide designed to assist Akash Providers in enabling AMD GPU support within their Kubernetes clusters. This documentation is particularly crafted for system administrators, developers, and DevOps professionals who manage and operate Akash Providers, and its focus is to guide you through integrating AMD GPUs into your Kubernetes/Akash setup.

Ollama #

May 23, 2024 · Unlock the potential of large language models with AMD GPUs and Ollama; learn how to set up ROCm support on Kubernetes for faster training and inference. With Ollama and ROCm working in tandem on your AMD-powered Kubernetes cluster, you're well equipped to tackle demanding LLM tasks (May 26, 2024); remember to consult Ollama's official documentation for detailed instructions and troubleshooting. One user report (Sep 6, 2024): "Hi, I'm pretty new to Ollama, and recently replaced my RX580 with an RX7600 to be able to use Ollama in Kubernetes with ROCm."

RCCL #

The ROCm Communication Collectives Library (RCCL) has a companion performance benchmark suite, rccl-tests. The published RCCL documentation is available in an organized, easy-to-read format that includes a table of contents and search functionality; the documentation source files reside in the rccl/docs folder of the repository. As with all ROCm projects, the documentation is open source; for more information, see Contribute to ROCm documentation.

Victor Robles, PhD #

Victor Robles, PhD, is an AI Architect at AMD, working in the Center of Excellence in AI and HPC (Feb 7, 2025). He completed his PhD in Electrical Engineering at the University of Iowa during the pivotal transition from traditional machine learning to the rise of deep learning, and he led efforts to integrate the first deep learning and generative AI algorithms for threat detection.
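Standing up K3s for this takes one command; a sketch, with the GPU Operator then installed exactly as in the Helm section above:

```bash
# Install K3s (single-node server). The installer sets up a systemd service
# and writes a kubeconfig to /etc/rancher/k3s/k3s.yaml.
curl -sfL https://get.k3s.io | sh -

# Point kubectl and helm at the K3s kubeconfig for the rest of the session.
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
kubectl get nodes
```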
Scaling inference on Kubernetes #

Welcome back to the final part of our series! So far, we have successfully set up a Kubernetes cluster and installed the AMD GPU Operator to seamlessly integrate AMD hardware with Kubernetes.

Serverless inference: a Feb 25, 2025 blog helps its target audience set up serverless AI inference deployment in a Kubernetes cluster with AMD accelerators, and aims to provide a comprehensive guide for deploying and scaling AI inference workloads on serverless infrastructure.

Distributed inference: AMD has successfully deployed the open-source llm-d framework on AMD Kubernetes infrastructure as part of its efforts toward distributed large language model inference at scale (May 20, 2025). llm-d leverages a Kubernetes-native toolkit to streamline LLM serving with features like KV-cache-aware routing, distributed scheduling, and integration with the Inference Gateway (IGW).
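Scaling a GPU-backed Deployment is the same mechanical step as for any other workload; each added replica requests its own `amd.com/gpu`. A sketch, assuming the `vllm-server` Deployment defined near the end of this guide:

```bash
# Scale the inference Deployment to three replicas; each replica will be
# scheduled only onto a node with a free amd.com/gpu.
kubectl scale deployment/vllm-server --replicas=3
kubectl get pods -l app=vllm-server -o wide   # one pod per allocated GPU
```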
Multi-node networking and mixed clusters #

AWS EKS supports RDMA over AWS EFA, but only on selected instance types (no support for g4ad); the performance increase, however, is substantial. The third article in the Distributed MLOps series explores overcoming vendor lock-in by unifying AMD and NVIDIA GPUs in mixed clusters for distributed PyTorch training, all without requiring code rewrites: it demonstrates how to combine AWS g4ad (AMD) and g4dn (NVIDIA) instances, bridging ROCm and CUDA to avoid being tied to a single vendor.

Serving vLLM models on a Kubernetes cluster with ROCm-enabled AMD GPUs using KServe (Linux) #

This tutorial goes over creating a bare-metal Kubernetes cluster using kubeadm with AMD GPUs that are compatible with vLLM. The cluster will run inference services that allow testing of certain supported models.
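A minimal sketch of the kubeadm bootstrap the tutorial describes; the pod CIDR and the choice of Flannel as the CNI are illustrative, and the device plugin install is the one shown earlier:

```bash
# Initialize the control plane (single-node example).
sudo kubeadm init --pod-network-cidr=10.244.0.0/16

# Make kubectl usable for the current user.
mkdir -p "$HOME/.kube"
sudo cp /etc/kubernetes/admin.conf "$HOME/.kube/config"
sudo chown "$(id -u):$(id -g)" "$HOME/.kube/config"

# Install a CNI plugin (Flannel matches the CIDR above).
kubectl apply -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml

# In a single-node cluster, allow workloads on the control-plane node.
kubectl taint nodes --all node-role.kubernetes.io/control-plane-
```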
An overview document on innovations in machine learning and distributed deep learning highlights technologies like AMD ROCm, Docker, and Kubernetes; it provides performance benchmarks for NVIDIA and AMD graphics cards, outlines support for machine learning frameworks and libraries, and furthermore emphasizes the significance of open-source solutions for machine learning and data infrastructure.

Deploying DeepSeek-V3 with Docker and Kubernetes #

Dec 27, 2024 · DeepSeek-V3 is large but has been optimized for efficient inference. Still, you need significant resources to host it. GPU requirements: while NVIDIA GPUs remain the most common choice, DeepSeek-V3 also supports AMD GPUs through the ROCm ecosystem, and you'll want at least 2-4 high-memory GPUs. AMD's ROCm ecosystem also ensures seamless integration with AI frameworks like TensorFlow and PyTorch.

SGLang #

SGLang is a framework designed for efficient inference, with advanced features like zero-overhead batch scheduling and optimized attention kernels; in one practitioner's experience, it is the easiest server to get running (Mar 4, 2025). A Jan 31, 2025 walkthrough runs the SGLang inference server container in detached mode with ROCm support, enabling GPU access and the necessary permissions; it mounts the required directories, allocates shared memory, and starts the server on port 30000 with the DeepSeek V3 model and tensor parallelism (TP) set to 8.

Monitoring and management #

The ROCm Data Center tool (RDC) addresses key infrastructure challenges regarding AMD GPUs in cluster and datacenter environments and simplifies their administration. The AMD Device Metrics Exporter can be installed using Kubernetes; the walkthrough here assumes there is already a kube-prometheus-stack deployment on your cluster. In particular, the AMD SMI Prometheus Exporter is a standalone app, written in Go, that can be run as a daemon and exports AMD CPU and GPU metrics to the Prometheus server; it employs the AMDSMI library for data acquisition, with a Go binding providing the interface between amdsmi and the exporter. (Note, Aug 19, 2024: powerml/rocm-smi-exporter:0.1a12 is a private image; you may need to build your own image from the pip package.)

The ROCm Kubernetes Node Labeller Docker image automates the process of labeling Kubernetes nodes for ROCm compatibility and features. For deeper validation, the Device Test Runner reports test results as Kubernetes events; under the hood it leverages the ROCm Validation Suite (RVS) and the AMD GPU Field Health Check (AGFHC) toolkit to run any number of tests, including GPU stress tests, PCIe bandwidth benchmarks, memory tests, and longer burn-in tests if so desired.

Following this, we will set up a Kubernetes pod to monitor GPU usage and check the amd-smi version using an Ubuntu ROCm image. This test involves creating a manifest file, deploying the pod, and retrieving the pod logs; the smoke-test sketch near the top of this page shows the shape of such a manifest.

Dynamic Resource Allocation (experimental) #

Separately from the device plugin, a repository implements an AMD GPU resource driver for Kubernetes' Dynamic Resource Allocation (DRA) feature. The driver exposes device classes and implements allocation and lifecycle behavior for GPU resources on nodes. Status: experimental (alpha).
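With kube-prometheus-stack in place, scraping the exporter is a matter of one ServiceMonitor. A sketch; the namespace, label selector, and port name here are assumptions to adapt to however the exporter Service is actually labeled in your install:

```bash
cat <<'EOF' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: amd-metrics-exporter
  namespace: monitoring             # where kube-prometheus-stack watches
  labels:
    release: kube-prometheus-stack  # must match the Prometheus selector
spec:
  namespaceSelector:
    matchNames: ["kube-amd-gpu"]    # assumed exporter namespace
  selector:
    matchLabels:
      app: metrics-exporter         # assumed Service label
  endpoints:
    - port: metrics                 # assumed port name on the Service
      interval: 30s
EOF
```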
Driver Installation Guide #

This guide explains how to install AMD GPU drivers using the AMD GPU Operator on Kubernetes clusters. Managed driver installations let users install the ROCm 6.2+ DKMS driver on Kubernetes worker nodes; they can also optionally choose to use the inbox or pre-installed driver on the worker nodes. Before installing the AMD GPU driver, ensure the AMD GPU Operator and its dependencies are successfully deployed, that you have cluster admin permissions, and that you have access to an image registry for driver images (if installing an out-of-tree driver through the operator).

A Mar 28, 2025 blog highlights the feature enhancements released in a new AMD GPU Operator version, including new features that enhance the use of AMD Instinct GPUs on Kubernetes: automated upgrades, health checks, and open-sourcing of the codebase.

Using AMD GPUs with minikube #

Oct 11, 2024 · This tutorial shows how to start minikube with support for AMD GPUs. Prerequisites: Linux; the latest AMD GPU drivers (6.1 or greater); minikube v1.35.0 or later (docker driver only). When using the docker driver, ensure you have an AMD driver installed; you can check whether one is installed by running `rocminfo`.

Summary #

Feb 21, 2025 · Over these two blog posts, we've explored the full spectrum of ROCm container development, from basic development environments to specialized AI inference solutions. We started with the fundamentals of ROCm containers, learning how to build custom development environments and ML training setups.
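With the operator installed, a managed driver install is requested through a DeviceConfig custom resource. The sketch below is indicative only: the group/version, field names, and node-selector label follow the pattern of the operator's published examples, but the CRD schema should be verified against the GPU Operator documentation for your operator version:

```bash
cat <<'EOF' | kubectl apply -f -
apiVersion: amd.com/v1alpha1         # verify against your operator version
kind: DeviceConfig
metadata:
  name: amd-gpu-device-config
  namespace: kube-amd-gpu
spec:
  driver:
    enable: true                     # operator-managed (out-of-tree) driver
    version: "6.2"                   # ROCm 6.2+ DKMS driver, per the guide
  selector:
    feature.node.kubernetes.io/amd-gpu: "true"   # assumed NFD label on GPU nodes
EOF
```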
Jan 29, 2025 · This post announces the AMD GPU Operator for Kubernetes and the Device Metrics Exporter, including instructions for getting started with these new releases.

Using Kubernetes to deploy vLLM #

Want to take advantage of Kubernetes to manage clusters equipped with powerful AI accelerators like AMD's Instinct MI300X? This guide shows you how to deploy and verify the vLLM inference server. Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models; this guide walks you through deploying vLLM using native Kubernetes, covering deployment with CPUs and deployment with GPUs. Alternatively, you can deploy vLLM to Kubernetes through helpers such as KServe, as described above. In an earlier part of this series, we deployed vLLM on AMD Instinct MI300X GPUs, exposed it using MetalLB, and scaled it efficiently in Kubernetes; a quick introduction to Kubernetes (K8s) and a step-by-step guide on using K8s to deploy vLLM with ROCm is also available (Feb 13, 2025).

Troubleshooting #

- Startup probe or readiness probe failure, with the container log containing "KeyboardInterrupt: terminated".
- A reported issue when running the vLLM ROCm image on a Kubernetes cluster (Dec 3, 2024).
- A device plugin issue report (Jan 13, 2024). Expected behavior: the AMD GPU device plugin installs without errors and runs successfully in the cluster. Actual behavior: the pod crashes immediately after starting, with the log indicating a segmentation fault.
- A design gap noted by users: there is a device plugin for using AMD GPUs in Kubernetes, but it is not enough for a scheduler extender. An interface like ROCR_VISIBLE_DEVICES for AMD GPU isolation would be better, analogous to how the NVIDIA device plugin passes NVIDIA_VISIBLE_DEVICES to nvidia-container-runtime to use GPUs in Kubernetes.

Further reading #

The AMD Instinct Documentation site provides comprehensive guides and technical documentation for system administrators and technical users deploying AMD Instinct Data Center GPUs in enterprise environments, focusing on large-scale deployment, cluster management, monitoring, and operational best practices for both HPC and AI workloads. Related ROCm blog posts:

- 13 February 2025 - Navigating vLLM Inference with ROCm and Kubernetes
- 09 February 2025 - PyTorch Fully Sharded Data Parallel (FSDP) on AMD GPUs with ROCm
- 09 February 2025 - MI300A - Exploring the APU advantage
- 09 February 2025 - Deep dive into the MI300 compute and memory partition modes
- 06 February 2025 - GEMM Kernel Optimization For AMD GPUs
- 31 January 2025 - Enhancing AI Training with AMD ROCm Software
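To make the verification step concrete, here is a minimal sketch of a native-Kubernetes vLLM deployment. The rocm/vllm image exists on Docker Hub, but the tag, the model choice (a small, ungated model here), and the MetalLB-style LoadBalancer Service are illustrative assumptions:

```bash
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels: { app: vllm-server }
  template:
    metadata:
      labels: { app: vllm-server }
    spec:
      containers:
        - name: vllm
          image: rocm/vllm:latest                # pin a real tag in practice
          command: ["vllm", "serve", "Qwen/Qwen2.5-0.5B-Instruct"]
          ports:
            - containerPort: 8000                # vLLM's default API port
          resources:
            limits:
              amd.com/gpu: 1
          volumeMounts:
            - { name: shm, mountPath: /dev/shm } # vLLM benefits from large shm
      volumes:
        - name: shm
          emptyDir: { medium: Memory, sizeLimit: 8Gi }
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-server
spec:
  type: LoadBalancer                             # MetalLB hands out the IP
  selector: { app: vllm-server }
  ports:
    - port: 8000
      targetPort: 8000
EOF

# Verify via the OpenAI-compatible endpoint once the Service has an IP.
curl "http://$(kubectl get svc vllm-server -o jsonpath='{.status.loadBalancer.ingress[0].ip}'):8000/v1/models"
```

If the probe-failure symptoms from the troubleshooting list appear, check the pod logs first: model download time alone often exceeds a default startup probe window.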