Apple’s MM1 AI Model Shows a Sleeping Giant Is Waking Up

While the tech industry went gaga for generative artificial intelligence, one giant has held back: Apple. The company has yet to introduce so much as an AI-generated emoji, and according to a New York Times report today and earlier reporting from Bloomberg, it is in preliminary talks with Google about adding the search company’s Gemini AI model to iPhones.

Yet a research paper quietly posted online last Friday by Apple engineers suggests that the company is making significant new investments in AI that are already bearing fruit. It details the development of a new generative AI model called MM1 that is capable of working with text and images. The researchers show it answering questions about photos and displaying the kind of general knowledge skills shown by chatbots like ChatGPT. The model’s name is not explained but could stand for MultiModal 1.

MM1 appears to be similar in design and sophistication to a variety of recent AI models from other tech giants, including Meta’s open source Llama 2 and Google’s Gemini. Work by Apple’s rivals and academics shows that models of this type can be used to power capable chatbots or build “agents” that can solve tasks by writing code and taking actions such as using computer interfaces or websites. That suggests MM1 could yet find its way into Apple’s products.

“The fact that they’re doing this, it shows they have the ability to understand how to train and how to build these models,” says Ruslan Salakhutdinov, a professor at Carnegie Mellon who led AI research at Apple several years ago. “It requires a certain amount of expertise.”

MM1 is a multimodal large language model, or MLLM, meaning it is trained on images as well as text. This allows the model to respond to text prompts and also answer complex questions about particular images.

One example in the Apple research paper shows what happened when MM1 was provided with a photo of a sun-dappled restaurant table with a couple of beer bottles and also an image of the menu. When asked how much someone would expect to pay for “all the beer on the table,” the model reads off the correct price and tallies up the cost.

When ChatGPT launched in November 2022, it could only ingest and generate text, but more recently its creator OpenAI and others have worked to expand the underlying large language model technology to work with other kinds of data. When Google launched Gemini (the model that now powers its answer to ChatGPT) last December, the company touted its multimodal nature as beginning an important new direction in AI. “After the rise of LLMs, MLLMs are emerging as the next frontier in foundation models,” Apple’s paper says.

MM1 is a relatively small model as measured by its number of “parameters,” or the internal variables that get adjusted as a model is trained. Kate Saenko, a professor at Boston University who specializes in computer vision and machine learning, says this could make it easier for Apple’s engineers to experiment with different training methods and refinements before scaling up when they hit on something promising.

Saenko says that, for a corporate publication, the MM1 paper provides a surprising amount of detail on how the model was trained. For instance, the engineers behind MM1 describe tricks for improving the performance of the model, including increasing the resolution of images and mixing text and image data. Apple is famed for its secrecy, but it has previously shown unusual openness about AI research as it has sought to lure the talent needed to compete in the crucial technology.

Saenko says it’s hard to draw too many conclusions about Apple’s plans from the research paper. Multimodal models have proven adaptable to many different use cases. But she suggests that MM1 could perhaps be a step toward building “some type of multimodal assistant that can describe photos, documents, or charts and answer questions about them.”

Apple’s flagship product, the iPhone, already has an AI assistant: Siri. The rise of ChatGPT and its rivals has quickly made the once-revolutionary helper look increasingly limited and outdated. Amazon and Google have said they are integrating LLM technology into their own assistants, Alexa and Google Assistant. Google allows users of Android phones to replace the Assistant with Gemini.

Reports from The New York Times and Bloomberg that Apple may add Google’s Gemini to iPhones suggest Apple is considering expanding the strategy it has used for search on mobile devices to generative AI. Rather than develop web search technology in-house, the iPhone maker leans on Google, which reportedly pays more than $18 billion to make its search engine the iPhone default. Apple has also shown it can build its own alternatives to outside services, even when it starts from behind. Google Maps used to be the default on iPhones, but in 2012 Apple replaced it with its own maps app.

Apple CEO Tim Cook has promised investors that the company will reveal more of its generative AI plans this year. The company faces pressure to keep up with rival smartphone makers, including Samsung and Google, that have introduced a raft of generative AI tools for their devices.

Apple could end up tapping both Google and its own in-house AI, perhaps by introducing Gemini as a replacement for conventional Google Search while also building new generative AI tools on top of MM1 and other homegrown models. Last September, several of the researchers behind MM1 published details of MGIE, a tool that uses generative AI to manipulate images based on a text prompt.

Salakhutdinov believes his former employer may focus on developing LLMs that can be installed and run securely on Apple devices. That would fit with the company’s past emphasis on using “on-device” algorithms to safeguard sensitive data and avoid sharing it with other companies. A number of recent AI research papers from Apple concern machine-learning methods designed to preserve user privacy. “I think that’s probably what Apple is going to do,” he says.

When it comes to tailoring generative AI to devices, Salakhutdinov says, Apple may yet turn out to have a distinct advantage because of its control over the entire software-hardware stack. The company has included a custom “neural engine” in the chips that power its mobile devices since 2017, with the debut of the iPhone X. “Apple is definitely working in that space, and I think at some point they will be in the front, because they have phones, the distribution.”

In a thread on X, Apple researcher Brandon McKinzie, lead author of the MM1 paper, wrote: “This is just the beginning. The team is already hard at work on the next generation of models.”