IT Brief Canada - Technology news for CIOs & IT decision-makers

Red Hat launches enterprise AI inference server for hybrid cloud

Yesterday

Red Hat has introduced Red Hat AI Inference Server, an enterprise-grade offering aimed at enabling generative artificial intelligence (AI) inference across hybrid cloud environments.

The Red Hat AI Inference Server leverages the vLLM community project, originally started at the University of California, Berkeley. Through Red Hat's integration of Neural Magic technologies, the solution aims to deliver higher speed, improved efficiency across a range of AI accelerators, and reduced operational costs. The platform is designed to let organisations run generative AI models on any AI accelerator within any cloud infrastructure.

The solution can be deployed as a standalone containerised offering or as part of Red Hat Enterprise Linux AI (RHEL AI) and Red Hat OpenShift AI. Red Hat says this approach is intended to empower enterprises to deploy and scale generative AI in production with increased confidence.
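In practice, vLLM-based servers expose an OpenAI-compatible HTTP API, so a containerised deployment can be queried with a standard chat-completions request. The sketch below is illustrative only: the endpoint URL and model name are placeholder assumptions, not values from the announcement.

```python
import json
import urllib.request

# Placeholder endpoint: vLLM-based servers typically listen on an
# OpenAI-compatible /v1/chat/completions route. The URL and model
# name here are assumptions for illustration, not Red Hat defaults.
SERVER_URL = "http://localhost:8000/v1/chat/completions"


def build_chat_request(model: str, prompt: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-style chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def send_request(payload: dict) -> dict:
    """POST the payload to the inference server and decode the JSON reply."""
    req = urllib.request.Request(
        SERVER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    payload = build_chat_request(
        "example-validated-model",  # hypothetical model name
        "Summarise hybrid cloud inference.",
    )
    print(json.dumps(payload, indent=2))
```

Because the API shape matches the OpenAI specification, existing client tooling can usually be pointed at such a server by changing only the base URL.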

Joe Fernandes, Vice President and General Manager for Red Hat's AI Business Unit, commented on the launch: "Inference is where the real promise of gen AI is delivered, where user interactions are met with fast, accurate responses delivered by a given model, but it must be delivered in an effective and cost-efficient way. Red Hat AI Inference Server is intended to meet the demand for high-performing, responsive inference at scale while keeping resource demands low, providing a common inference layer that supports any model, running on any accelerator in any environment."

The inference phase in AI refers to the process where pre-trained models are used to generate outputs, a stage which can be a significant inhibitor to performance and cost efficiency if not managed appropriately. The increasing complexity and scale of generative AI models have highlighted the need for robust inference solutions capable of handling production deployments across diverse infrastructures.

The Red Hat AI Inference Server builds on the technology foundation established by the vLLM project. vLLM is known for high-throughput AI inference, support for large input contexts, acceleration across multiple GPUs, and continuous batching to enhance deployment versatility. Additionally, vLLM supports a broad range of publicly available models, including DeepSeek, Google's Gemma, Llama, Llama Nemotron, Mistral, and Phi, among others. Its integration with leading models and enterprise-grade reasoning capabilities positions it as a candidate standard for AI inference innovation.

The packaged enterprise offering delivers a supported and hardened distribution of vLLM, with several additional tools. These include intelligent large language model (LLM) compression utilities that reduce AI model sizes while preserving or enhancing accuracy, and an optimised model repository hosted under Red Hat AI on Hugging Face. This repository gives immediate access to validated and optimised AI models tailored for inference, designed to improve efficiency by two to four times without compromising the accuracy of results.
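The efficiency gains from model compression are easiest to see as back-of-the-envelope arithmetic on weight storage. The sketch below is illustrative only; the parameter count and precisions are assumptions, not figures from the article. Quantising 16-bit weights down to 4 bits cuts weight memory by a factor of four, consistent with the two-to-four-times range cited above.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed to store model weights, in gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


# Hypothetical 8-billion-parameter model (not a figure from the article).
params = 8e9

fp16_gb = weight_memory_gb(params, 16)  # 16.0 GB at 2 bytes per weight
int4_gb = weight_memory_gb(params, 4)   # 4.0 GB at half a byte per weight

print(f"FP16 weights: {fp16_gb:.1f} GB")
print(f"INT4 weights: {int4_gb:.1f} GB")
print(f"Reduction:    {fp16_gb / int4_gb:.0f}x")
```

Actual serving memory is higher than this (activations and the key-value cache add overhead), which is why compression utilities target weights first: they dominate the footprint for large models.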

Red Hat also provides enterprise support, drawing upon expertise in bringing community-developed technologies into production. For expanded deployment options, the Red Hat AI Inference Server can be run on non-Red Hat Linux and Kubernetes platforms in line with the company's third-party support policy.

The company's stated vision is to enable a universal inference platform that can accommodate any model, run on any accelerator, and be deployed in any cloud environment. Red Hat sees the success of generative AI relying on the adoption of such standardised inference solutions to ensure consistent user experiences without increasing costs.

Ramine Roane, Corporate Vice President of AI Product Management at AMD, said: "In collaboration with Red Hat, AMD delivers out-of-the-box solutions to drive efficient generative AI in the enterprise. Red Hat AI Inference Server enabled on AMD Instinct™ GPUs equips organizations with enterprise-grade, community-driven AI inference capabilities backed by fully validated hardware accelerators."

Jeremy Foster, Senior Vice President and General Manager at Cisco, commented on the joint opportunities provided by the offering: "AI workloads need speed, consistency, and flexibility, which is exactly what the Red Hat AI Inference Server is designed to deliver. This innovation offers Cisco and Red Hat opportunities to continue to collaborate on new ways to make AI deployments more accessible, efficient and scalable—helping organizations prepare for what's next."

Intel's Bill Pearson, Vice President of Data Center & AI Software Solutions and Ecosystem, said: "Intel is excited to collaborate with Red Hat to enable Red Hat AI Inference Server on Intel Gaudi accelerators. This integration will provide our customers with an optimized solution to streamline and scale AI inference, delivering advanced performance and efficiency for a wide range of enterprise AI applications."

John Fanelli, Vice President of Enterprise Software at NVIDIA, added: "High-performance inference enables models and AI agents not just to answer, but to reason and adapt in real time. With open, full-stack NVIDIA accelerated computing and Red Hat AI Inference Server, developers can run efficient reasoning at scale across hybrid clouds, and deploy with confidence using Red Hat Inference Server with the new NVIDIA Enterprise AI validated design."

Red Hat has stated its intent to further build upon the vLLM community as well as drive development of distributed inference technologies such as llm-d, aiming to establish vLLM as an open standard for inference in hybrid cloud environments.
