DeepSeek R1’s release this week was a watershed moment in the field of AI. No one expected a Chinese startup to be the first to drop a reasoning model matching OpenAI’s o1 and open source it at the same time (in line with OpenAI’s original mission).
Companies can easily download R1’s weights through Hugging Face, but access has never been the issue: more than 80% of teams use or plan to use open models. Deployment is the real culprit. If you opt for hyperscaler services, like Vertex AI, you are locked into a specific cloud. If you build in-house instead, you face resource constraints: you have to configure a dozen different components just to get started, let alone optimize or scale downstream.
To meet this challenge, Y Combinator- and SenseAI-backed Pipeshift is launching an end-to-end platform that enables enterprises to train, deploy and scale open source generative AI models (LLMs, vision models, audio models and image models) on any cloud or on-premises GPUs. The company competes in a rapidly growing field that includes Baseten, Domino Data Lab, Together AI and Simplismart.
The key value proposition? Pipeshift uses a modular inference engine that can be quickly optimized for speed and efficiency, helping teams not only deploy 30x faster but also get more out of the same infrastructure, with cost savings of up to 60%.
Imagine running the inference workload of four GPUs on just one.
The orchestration bottleneck
When you need to run different models, assembling a functional MLOps stack in-house — from accessing compute, training and fine-tuning to production-grade deployment and monitoring — becomes the problem. You have to configure 10 different components and inference instances just to get things up and running, then spend thousands of engineering hours on even the smallest of optimizations.
“An inference engine has several components,” Arko Chattopadhyay, co-founder and CEO of Pipeshift, told VentureBeat. “Each combination of these components creates a distinct engine with varying performance for the same workload. Identifying the optimal combination to maximize ROI requires weeks of repetitive experimentation and parameter tuning. In most cases, it can take years for internal teams to develop pipelines that enable flexibility and modularization of infrastructure, delaying their time to market while accumulating massive technical debt.”
While there are startups that offer platforms for deploying open models in cloud or on-premises environments, Chattopadhyay says most of them are GPU brokers offering one-size-fits-all inference solutions. As a result, they maintain separate GPU instances for different LLMs, which doesn’t help when teams want to reduce costs and optimize performance.
To solve this problem, Chattopadhyay launched Pipeshift and developed a framework called Modular Architecture for GPU-based Inference Clusters (MAGIC), which breaks the inference stack into plug-and-play components. The result is a Lego-like system that lets teams configure the right inference stack for their workloads, without the hassle of infrastructure engineering.
This way, a team can quickly add or swap inference components to piece together a custom inference engine capable of extracting more from existing infrastructure to meet its targets for cost, throughput or scalability.
For example, a team could build a unified inference system, where multiple domain-specific LLMs could run with hot swapping on a single GPU, making full use of it.
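Pipeshift has not published MAGIC’s internals, but one common way to achieve this kind of consolidation is to keep a single shared base model resident on the GPU and hot-swap lightweight fine-tuned adapters per request. The sketch below illustrates that general pattern with Hugging Face’s PEFT library; the model and adapter names are placeholders, not Pipeshift’s actual stack.

```python
# Minimal sketch: serving several domain-specific fine-tunes from one GPU by
# hot-swapping LoRA adapters over a shared base model. Illustrative only --
# not Pipeshift's MAGIC; model and adapter paths are hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "meta-llama/Llama-3.1-8B-Instruct"  # shared base weights, loaded once
tokenizer = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map="cuda"
)

# Attach one adapter per workload; each is a small fraction of a full model copy.
model = PeftModel.from_pretrained(base, "acme/support-lora", adapter_name="support")
model.load_adapter("acme/docs-lora", adapter_name="docs")

def generate(prompt: str, domain: str) -> str:
    model.set_adapter(domain)  # hot-swap: route the request to the right fine-tune
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(out[0], skip_special_tokens=True)

print(generate("Summarize this return policy...", domain="support"))
print(generate("Extract the invoice total...", domain="docs"))
```

Because only the small adapter weights change between requests, the expensive base model never has to be unloaded, which is what lets one GPU stand in for several single-model instances.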
Run four GPU workloads on one
Since claiming to offer a modular inference solution is one thing and implementing it is another, the Pipeshift founder was quick to point out the benefits of the company’s offering.
“In terms of operational expenses… MAGIC allows you to run LLMs like Llama 3.1 8B at >500 tokens/s on a given set of Nvidia GPUs without any quantization or model compression,” he said. “This enables a massive reduction in scaling costs, as GPUs can now handle workloads 20 to 30 times greater than they could initially achieve using the native solutions offered by cloud providers.”
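Throughput figures like these depend heavily on hardware, batch size and sequence lengths, so they are worth verifying on your own stack. As a rough sketch, decode throughput for an open model can be measured with an off-the-shelf engine such as vLLM; the model name and prompts below are placeholders.

```python
# Rough throughput check (tokens/s) for an open model with vLLM -- a generic
# benchmarking sketch, not Pipeshift's engine. Results vary with GPU, batching
# and sequence lengths.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
prompts = ["Explain container orchestration in one paragraph."] * 32  # batched requests
params = SamplingParams(max_tokens=256, temperature=0.0)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.0f} tokens/s")
```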
The CEO noted that the company already works with 30 companies on an annual license-based model.
One of them is a Fortune 500 retailer that initially used four independent GPU instances to run four fine-tuned open models for its automated support and document processing workflows. Each of these GPU clusters scaled independently, adding to considerable costs.
“Large-scale fine-tuning was not possible as datasets became larger, and all the pipelines supported only single-GPU workloads while requiring you to download all the data at once. Additionally, there was no auto-scaling support with tools like AWS SageMaker, making it difficult to ensure optimal usage of the infrastructure. This led the company to pre-approve quotas and reserve capacity in advance for a theoretical peak that was only reached 5% of the time,” Chattopadhyay noted.
Interestingly, after moving to Pipeshift’s modular architecture, all four deployments were consolidated onto a single GPU instance that served them in parallel, with no memory partitioning or model degradation. This cut the footprint of these workloads from four GPUs to one.
“Without additional optimizations, we were able to expand the capabilities of the GPU to a point where it delivered inference tokens five times faster and could handle four times the scale,” the CEO added. In total, he said, the company saw 30x faster deployment timelines and a 60% reduction in infrastructure costs.
With a modular architecture, Pipeshift aims to position itself as the go-to platform for deploying all cutting-edge open source AI models, including DeepSeek R1.
However, this will not be an easy task as competitors continue to evolve their offerings.
For example, Simplismart, which raised $7 million a few months ago, adopts a similar software-optimized approach to inference. Cloud service providers like Google Cloud and Microsoft Azure are also strengthening their respective offerings, although Chattopadhyay believes these CSPs will look more like partners than competitors in the long run.
“We are a tooling and orchestration platform for AI workloads, like Databricks has been for data intelligence,” he explained. “In most scenarios, CSPs will transform into growth-stage GTM partners because of the type of value their customers will be able to derive from Pipeshift on their AWS/GCP/Azure clouds.”
In the coming months, Pipeshift will also introduce tools to help teams build and scale their datasets, alongside model evaluation and testing. This will exponentially accelerate the experimentation and data preparation cycle, allowing customers to leverage orchestration more effectively.