Hugging Face shrinks AI vision models to phone-friendly size, slashing computing costs

MT HANNACH



Hugging Face has achieved a remarkable breakthrough in AI, introducing vision language models that run on devices as small as smartphones while outperforming predecessors that required massive data centers.

The company's new SmolVLM-256M model, which requires less than a gigabyte of GPU memory, exceeds the performance of its Idefics 80B model from just 17 months ago, a system 300 times larger. This dramatic reduction in size paired with improved capability marks a watershed moment for practical AI deployment.

“When we released Idefics 80B in August 2023, we were the first company to open-source a vision language model,” said Andrés Marafioti, machine learning research engineer at Hugging Face, in an exclusive interview with VentureBeat. “By achieving a 300x size reduction while improving performance, SmolVLM marks a breakthrough in vision language models.”

A comparison of Hugging Face’s new SmolVLM models shows that the smaller versions (256M and 500M) consistently outperform their 80-billion-parameter predecessor on key visual reasoning tasks. (Credit: Hugging Face)

Smaller AI models that run on everyday devices

The advancement comes at a crucial time for businesses struggling with the astronomical computing costs of deploying AI systems. The new SmolVLM models, available in 256M and 500M parameter sizes, process images and understand visual content at speeds previously unattainable in their size class.

The smallest version processes 16 examples per second while using just 15 GB of RAM with a batch size of 64, making it particularly attractive to businesses looking to process large volumes of visual data. “For a mid-sized company processing 1 million images per month, this translates to substantial annual savings in compute costs,” Marafioti told VentureBeat. “The reduced memory footprint means businesses can deploy on cheaper cloud instances, reducing infrastructure costs.”
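As a rough sanity check of that claim, the quoted throughput alone implies very modest GPU time for that workload. The per-hour price below is an illustrative assumption for this sketch, not a figure from Hugging Face:

```python
def monthly_gpu_hours(images_per_month: int, images_per_second: float) -> float:
    """GPU-hours needed to process a monthly image volume at a given throughput."""
    return images_per_month / images_per_second / 3600


# 1 million images per month at the quoted 16 images/second:
hours = monthly_gpu_hours(1_000_000, 16)
print(round(hours, 1))  # 17.4 GPU-hours
# At a hypothetical $1/hour cloud GPU rate, that is roughly $17/month
# of compute, before any batching or instance-utilization overhead.
```

The point of the arithmetic is that at this throughput, compute time stops being the dominant cost; memory footprint and instance pricing matter more, which is consistent with Marafioti's framing.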

The development has already attracted the attention of major tech players. IBM has partnered with Hugging Face to integrate the 256M model into Docling, its document-processing software. “While IBM certainly has access to substantial computing resources, using smaller models like these allows them to efficiently process millions of documents at a fraction of the cost,” Marafioti said.

SmolVLM processing speeds across different batch sizes, showing how the smaller 256M and 500M variants significantly outperform the 2.2B version on A100 and L4 graphics cards. (Credit: Hugging Face)

How Hugging Face reduced model size without compromising power

The efficiency gains come from technical innovations in both the vision processing and language components. The team moved from a 400M-parameter vision encoder to a 93M-parameter version and implemented more aggressive token compression techniques. These changes maintain high performance while significantly reducing computational requirements.
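Token compression of this kind typically merges neighboring visual tokens so the language model sees far fewer of them. One common technique in vision language models is pixel shuffle, which folds each r×r block of patch tokens into a single wider token; the sketch below illustrates the idea and is not Hugging Face's exact implementation:

```python
import numpy as np


def pixel_shuffle_compress(x: np.ndarray, r: int) -> np.ndarray:
    """Merge each r x r block of visual tokens into one token with r*r times
    the channel dimension, cutting the token count by a factor of r**2."""
    b, n, d = x.shape
    side = int(n ** 0.5)                       # tokens form a side x side grid
    x = x.reshape(b, side, side, d)
    x = x.reshape(b, side // r, r, side // r, r, d)
    x = x.transpose(0, 1, 3, 2, 4, 5)          # group each r x r block together
    return x.reshape(b, (side // r) ** 2, r * r * d)


tokens = np.random.randn(1, 36, 8)             # a 6x6 grid of 8-dim tokens
out = pixel_shuffle_compress(tokens, 3)
print(out.shape)  # (1, 4, 72): 9x fewer tokens, 9x wider channels
```

Because attention cost scales with sequence length, shrinking 36 visual tokens to 4 cuts the language model's work on the image far more than the wider channel dimension adds back.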

For startups and small businesses, these developments could be transformative. “Startups can now launch sophisticated computer vision products in weeks instead of months, with infrastructure costs that were prohibitive just months ago,” Marafioti said.

The impact goes beyond cost savings to enable entirely new applications. The models power advanced document search capabilities through ColPali, an algorithm that creates searchable databases from document archives. “They achieve performance very close to that of models 10x their size while dramatically increasing the speed at which the database is created and searched, making enterprise-wide visual search accessible to businesses of all types for the first time,” Marafioti said.
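ColPali-style retrieval embeds each document page as many patch vectors rather than one, and scores a query with "late interaction": each query vector is matched to its single best page patch, and those maxima are summed. A minimal sketch of that scoring rule, with the embedding model itself omitted:

```python
import numpy as np


def maxsim_score(query_vecs: np.ndarray, page_vecs: np.ndarray) -> float:
    """Late-interaction (MaxSim) score: for each query vector, take the
    similarity of its best-matching page patch, then sum over the query.
    query_vecs: (num_query_tokens, dim), page_vecs: (num_patches, dim)."""
    sims = query_vecs @ page_vecs.T            # (num_query_tokens, num_patches)
    return float(sims.max(axis=1).sum())


# Toy example: 2 query vectors scored against 3 page-patch vectors.
q = np.array([[1.0, 0.0], [0.0, 1.0]])
p = np.array([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]])
print(maxsim_score(q, p))  # 2.0: each query vector finds a perfect match
```

In practice the page vectors are precomputed once per document, so search cost is dominated by this cheap matrix product, which is why a small, fast embedding model speeds up both indexing and querying.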

A breakdown of SmolVLM’s 1.7 billion training examples shows document processing and image captioning comprising almost half of the dataset. (Credit: Hugging Face)

Why smaller AI models are the future of AI development

The breakthrough challenges conventional wisdom about the relationship between model size and capability. While many researchers have assumed that larger models are needed for advanced vision language tasks, SmolVLM demonstrates that smaller, more efficient architectures can achieve similar results. The 500M-parameter version achieves 90% of the performance of its 2.2B-parameter sibling on key benchmarks.

Rather than suggesting a plateau in efficiency, Marafioti sees these results as evidence of untapped potential: “Until today, the norm was to release VLMs starting at 2B parameters; we thought smaller models weren’t useful. We prove that, in fact, models at one-tenth the size can be extremely useful for businesses.”

This development comes amid growing concerns about AI’s environmental impact and computing costs. By significantly reducing the resources required for vision AI, Hugging Face’s innovation could help address both problems while making advanced AI capabilities accessible to a wider range of organizations.

The models are available open source, continuing Hugging Face’s tradition of broadening access to AI technology. This accessibility, combined with the models’ efficiency, could accelerate the adoption of vision language AI across industries, from healthcare to retail, where processing costs have previously been prohibitive.

In a field where bigger has long meant better, Hugging Face’s achievement suggests a new paradigm: the future of AI might lie not in ever-larger models running in distant data centers, but in agile, efficient systems running directly on our devices. As the industry grapples with questions of scale and sustainability, these smaller models may well represent the biggest breakthrough yet.
