In October, Apple unveiled Ferret, a multimodal large language model developed in collaboration with researchers at Cornell University. The model, publicly available on GitHub, can take regions of an image as part of a query. Despite the low-key release, Ferret drew attention from industry experts for this functionality.
Ferret’s Operation and Functionality
Ferret works by analyzing a user-specified region of an image, recognizing the objects it contains, and delineating them, then answering questions about that region in text. For instance, a user can select an animal within a picture and ask Ferret to identify its species, then pose follow-up questions about other objects or actions depicted in the image.
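The interaction described above can be sketched in code. This is a minimal illustration, not Ferret's actual API: the `Region` type, the `build_prompt` helper, and the `<region>` token format are all hypothetical, chosen only to show the general shape of a referring-style query, where region coordinates are interleaved with natural-language text.

```python
# Illustrative sketch of a region-based query. All names and the
# <region> token format are hypothetical, not Ferret's real interface.
from dataclasses import dataclass


@dataclass
class Region:
    """A rectangular image region selected by the user (pixel coords)."""
    x1: int
    y1: int
    x2: int
    y2: int


def build_prompt(question: str, region: Region) -> str:
    """Embed the selected region's coordinates into the text prompt,
    the way referring-style multimodal models interleave region
    references with natural language."""
    coords = f"[{region.x1}, {region.y1}, {region.x2}, {region.y2}]"
    return f"{question} <region>{coords}</region>"


# A user selects an animal in the picture and asks what it is:
prompt = build_prompt("What species is this animal?", Region(120, 80, 340, 260))
print(prompt)
# → What species is this animal? <region>[120, 80, 340, 260]</region>
```

In a real system, the model would consume this prompt together with the image features for the selected region and return a grounded textual answer.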
Significance of Ferret’s Open Model
Described by Apple AI researcher Zhe Gan as a system that can "refer and ground anything, anywhere, at any granularity," Ferret represents a notable departure from Apple's traditionally closed approach, and experts see the release as a sign of the company's embrace of openness. One speculation about Apple's motivation is that the move is a strategic play against rivals such as Microsoft and Google: with limited computing resources, building a direct competitor to ChatGPT was not feasible, leaving Apple to either partner with a cloud hyperscaler or adopt an open format similar to Meta's approach, notes NIX Solutions.
Overall, Ferret’s emergence as an image-querying language model represents a significant stride for Apple, blending innovation with a newfound openness in its technological pursuits.