The advent of large language models like Gemini represents a significant step in artificial intelligence, ushering in a new era of human-computer interaction reminiscent of the shift brought about by the internet.
In late 2023, Google unveiled Gemini, its groundbreaking AI model. Gemini offers sophisticated language processing, knowledge gathering, and creative text generation, capabilities that have the potential to revolutionise how we communicate, access information, and engage in creative tasks.
This post offers a concise overview of Gemini’s capabilities while outlining a few potential use cases. The primary focus is to explore three key factors that will shape Gemini’s success and growth. A bonus section highlights key developments to watch for as Gemini reshapes our world.
The Gemini team will have benefitted from observing OpenAI's missteps and successes
Gemini is Mighty AI
Google Gemini can understand and process information across various data formats, including audio, code, images, text and video, making it a truly multimodal model. It is being launched in three sizes, each tailored for distinct purposes and user experiences.
Nano: most efficient model for on-device tasks
Gemini Nano powers select Pixel devices, specifically the Pixel 8 Pro. This lightweight version is optimised for on-device processing and currently enables enhanced AI-powered features like Summarise in Recorder and Smart Reply in Gboard, among others (check out Google's post to learn more).
Pro: best model for scaling across a wide range of tasks
Gemini Pro, designed for English language text-based prompts, is now available in over 170 countries and territories. It is integrated into some core Google products, including Bard, which uses a fine-tuned version of Gemini Pro for advanced planning, reasoning and understanding.
NB: at the time of writing, Gemini Pro is not available in Europe. This includes the UK, where Bard is currently running on PaLM 2. Google says that Gemini Pro will ‘be available in… more places, like Europe, in the near future’.
Ultra: Google’s largest and most capable model for highly complex tasks
Gemini Ultra, the most powerful version, will be released in early 2024. It can handle multiple modalities, including audio, code, images, text and video, making it suitable for specialised applications that require sophisticated multimodal processing. It will be available through a new experience called Bard Advanced.
Google will continue to integrate Gemini into its broader ecosystem of products and services. For a comprehensive technical overview of Gemini, delve into 's insightful article.
YouTube, a perfect stage for Gemini
A Few Use Cases
Now, let's take a look at a few examples of Gemini’s capabilities that hint at some exciting use cases to come:
Interacting with a human in a human-like way: Gemini understands and responds to a user's drawing, including asking questions about the drawing and making suggestions for games based on it. It can also identify objects in images and answer questions about them.
Turning images into code: an image of a tree is turned into an SVG, and Gemini then creates an interactive JavaScript demo of a fractal tree that can be reshaped and moved with a slider, all generated by Gemini.
Finding connection in images: Gemini identifies and understands visual patterns in images to find connections between seemingly unrelated images. It correctly identifies similarities between the Bosjes Chapel and a print by Hokusai, the moon and a golf ball, and the zebra and printed stripes on clothing.
Figuring out visual puzzles: Gemini correctly guesses that an image of a breakfast plus an image of an engagement ring, i.e. 🥞 + 💍, represents the film Breakfast at Tiffany's.
Understanding how unusual images were created using emojis: emojis can be combined to make new ones, and Gemini correctly identifies the emojis used to create the images, explains the visual details of the images, and even gives them names and taglines.
Helping with homework: Gemini solves maths problems and explains the concepts that need more clarification. It can also provide personalised practice problems based on mistakes.
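To make the image-to-code example above more concrete, here is a minimal sketch, in plain JavaScript, of the kind of code Gemini might generate for that demo: a recursive function that emits SVG line segments for a binary fractal tree, with the branching angle as the parameter a slider would control. The function names, recursion depth and styling are illustrative assumptions, not Google's actual demo code.

```javascript
// Recursively collect the line segments of a binary fractal tree.
// Each branch spawns two shorter children splayed by ±spread radians.
function fractalTree(x, y, length, angle, depth, spread = Math.PI / 6, segments = []) {
  if (depth === 0) return segments;
  // End point of the current branch (SVG y-axis points down).
  const x2 = x + length * Math.sin(angle);
  const y2 = y - length * Math.cos(angle);
  segments.push({ x1: x, y1: y, x2, y2 });
  fractalTree(x2, y2, length * 0.7, angle - spread, depth - 1, spread, segments);
  fractalTree(x2, y2, length * 0.7, angle + spread, depth - 1, spread, segments);
  return segments;
}

// Serialise the segments as a standalone SVG document.
function treeToSvg(segments, width = 400, height = 400) {
  const lines = segments
    .map(s => `<line x1="${s.x1}" y1="${s.y1}" x2="${s.x2}" y2="${s.y2}" stroke="brown"/>`)
    .join('\n  ');
  return `<svg xmlns="http://www.w3.org/2000/svg" width="${width}" height="${height}">\n  ${lines}\n</svg>`;
}

// Trunk starts at the bottom centre, pointing straight up.
const segments = fractalTree(200, 380, 100, 0, 8);
console.log(treeToSvg(segments).slice(0, 60)); // preview of the SVG output
```

Wiring the `spread` parameter to an `<input type="range">` slider and re-rendering the SVG on change would give the interactive behaviour shown in the demo.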
Google showcases other exciting use cases for Gemini such as reasoning about user intent to generate bespoke experiences, understanding environments, unlocking insights in scientific literature, processing and understanding raw audio, and even its fashion sense along with a sense of humour.
To top it all off, Bard with Gemini Pro is taken for a test flight by Mark Rober, a YouTuber with 28.5 million subscribers.
You can explore Google’s multimodal prompting techniques here.
Google: LLMs have limitations
Three Key Factors that Hint Gemini Could Eclipse ChatGPT
It’s too early to declare a definitive winner between ChatGPT and Gemini. However, three product-related factors suggest that Gemini has the potential to surpass ChatGPT’s current dominance.
1 Later Mover Advantage: we hear a lot about first mover advantage, but there is a lot to be said for its opposite - later mover advantage. Google's expertise in product development, its experience in the technology industry and the time it’s had to observe competitors like OpenAI position it well to refine and optimise Gemini's capabilities. Just as Google's search engine triumphed over early search engines like AltaVista and Yahoo! Search, the Gemini team will have benefitted from observing and learning from OpenAI's missteps and successes.
2 Google's Huge Reach and Resources: Google's tremendous reach across its vast ecosystem provides a powerful platform for raising awareness of Gemini's capabilities. YouTube, in particular, serves as an ideal platform for showcasing Gemini's potential. With its billions of users and diverse content, YouTube provides a stage for Gemini to engage with a global audience and demonstrate its ability to understand and respond to complex questions, generate creative content in different formats, and translate languages.
Through engaging video demonstrations and tutorials (like the one on Google’s site of Mark Rober figuring out how to launch a plane through a ring of fire), YouTube can effectively educate users about Gemini's capabilities and drive widespread adoption. Of course, Microsoft - a major investor in OpenAI - is no pauper, but does the company have the same reach?
3 Addressing the long tail of challenges: While Gemini's capabilities are impressive, there is still a significant long tail of challenges that AI models like Gemini need to address. To its credit, Google acknowledges in the Gemini technical report (p. 23) that more work is needed to tackle the limitations of LLMs:
Despite their impressive capabilities, we should note that there are limitations to the use of LLMs. There is a continued need for ongoing research and development on “hallucinations” generated by LLMs to ensure that model outputs are more reliable and verifiable. LLMs also struggle with tasks requiring high-level reasoning abilities like causal understanding, logical deduction, and counterfactual reasoning even though they achieve impressive performance on exam benchmarks. This underscores the need for more challenging and robust evaluations to measure their true understanding as the current state-of-the-art LLMs saturate many benchmarks.
Machine hallucinations - nonsensical or factually inaccurate outputs - are easier to spot in images, but with text they have led some ChatGPT users astray. Adding contextual help, like Bard’s ‘Google it’ button for double-checking answers, is somewhat helpful, but manually checking responses is time-consuming and tedious. A more efficient approach is needed.
It is hard to see how the long tail of challenges will be solved completely; a fresh approach is needed here. But what could it be? It will be exciting to see what solutions crop up over the coming months and, probably, years. A tonne of kudos - and riches - awaits anyone who finds a workable solution to this problem.
One to Watch
It has been widely reported that Google pays billions a year to be Apple’s default search engine. Will Apple need to strike a similar deal with Google to prioritise Google’s Bard models? Will Google want to strike such a deal?
Apple is ‘avoiding the AI hype train’. The company has focused on leveraging its Apple silicon to equip ‘ML researchers with the best tools it can make’, which avoids a head-on competition with ChatGPT, Gemini and other LLMs.
If Google's Bard models prove to be as popular and valuable as its search engine, it is quite possible that Apple will seek an arrangement similar to the one Google has with Apple for its search engine. This could involve Apple paying Google a portion of the revenue generated from Bard models used on Apple devices, or Apple integrating Bard models into its own services, such as Siri or Messages. Ultimately, the future of Google's relationship with Apple will depend on how successful Bard models are and how much value they can bring to both companies.
Successful or not, the launch of Gemini is very well timed, coming barely a week after the saga at OpenAI came full circle, when the prodigal son returned to the mothership on 29 November.
The turmoil of the last few weeks must have seemed like a Christmas miracle for Google: in late October 2023, The Information reported that OpenAI's corporate sales were under pressure as customers looked for cheaper options. On 6 December, Google unveils Gemini Ultra…
Hold on to your hats. It’ll be a wild ride.
If you enjoy reading Product Delights, you can support my work by buying me a coffee