Google’s trying to make waves with Gemini, a new generative AI platform that recently made its big debut. But while Gemini appears to be promising in a few aspects, it’s falling short in others. So what is Gemini? How can you use it? And how does it stack up to the competition?

To make it easier to keep up with the latest Gemini developments, we’ve put together this handy guide, which we’ll keep updated as new Gemini models and features are released.

What is Gemini?

Gemini is Google’s long-promised, next-gen generative AI model family, developed by Google’s AI research labs DeepMind and Google Research. It comes in three flavors:

  • Gemini Ultra, the flagship Gemini model
  • Gemini Pro, a “lite” Gemini model
  • Gemini Nano, a smaller “distilled” model that runs on mobile devices like the Pixel 8 Pro

All Gemini models were trained to be “natively multimodal”: in other words, able to work with and use more than just text. They were pre-trained and fine-tuned on a variety of audio, images and videos, a large set of codebases, and text in different languages.

That sets Gemini apart from models such as Google’s own large language model LaMDA, which was only trained on text data. LaMDA can’t understand or generate anything other than text (e.g. essays, email drafts and so on) — but that isn’t the case with Gemini models. Their ability to understand images, audio and other modalities is still limited, but it’s better than nothing.

What’s the difference between Bard and Gemini?

Google, proving once again that it lacks a knack for branding, didn’t make it clear from the outset that Gemini is separate and distinct from Bard. Bard is simply an interface through which certain Gemini models can be accessed — think of it as an app or client for Gemini and other gen AI models. Gemini, on the other hand, is a family of models — not an app or frontend. There’s no standalone Gemini experience, nor will there likely ever be. If you were to compare to OpenAI’s products, Bard corresponds to ChatGPT, OpenAI’s popular conversational AI app, and Gemini corresponds to the language model that powers it, which in ChatGPT’s case is GPT-3.5 or 4.

Incidentally, Gemini is also totally independent from Imagen-2, a text-to-image model that may or may not fit into the company’s overall AI strategy. Don’t worry, you’re not the only one confused by this!

What can Gemini do?

Because the Gemini models are multimodal, they can in theory perform a range of tasks, from transcribing speech to captioning images and videos to generating artwork. Few of these capabilities have reached the product stage yet (more on that later), but Google’s promising all of them — and more — at some point in the not-too-distant future.

Of course, it’s a bit hard to take the company at its word.

Google seriously under-delivered with the original Bard launch. And more recently it ruffled feathers with a video purporting to show Gemini’s capabilities that turned out to have been heavily doctored and was more or less aspirational. Gemini is, to the tech giant’s credit, available in some form today — but a rather limited form.

Still, assuming Google is being more or less truthful with its claims, here’s what the different tiers of Gemini models will be able to do once they’re released:

Gemini Ultra

So far, few people have gotten their hands on Gemini Ultra, the “foundation” model on which the others are built: just a “select set” of customers across a handful of Google apps and services. That won’t change until sometime later this year, when Google’s largest model launches more broadly. Most info about Ultra has come from Google-led product demos, so it’s best taken with a grain of salt.

Google says that Gemini Ultra can be used to help with things like physics homework, solving problems step-by-step on a worksheet and pointing out possible mistakes in already filled-in answers. Gemini Ultra can also be applied to tasks such as identifying scientific papers relevant to a particular problem, Google says — extracting information from those papers and “updating” a chart from one by generating the formulas necessary to recreate the chart with more recent data.

Gemini Ultra technically supports image generation, as alluded to earlier. But that capability won’t make its way into the productized version of the model at launch, according to Google — perhaps because the mechanism is more complex than how apps such as ChatGPT generate images. Rather than feed prompts to an image generator (like DALL-E 3, in ChatGPT’s case), Gemini outputs images “natively” without an intermediary step.

Gemini Pro

Unlike Gemini Ultra, Gemini Pro is available publicly today. But confusingly, its capabilities depend on where it’s used.

Google says that in Bard, where Gemini Pro launched first in text-only form, the model is an improvement over LaMDA in its reasoning, planning and understanding capabilities. An independent study by Carnegie Mellon and BerriAI researchers found that Gemini Pro is indeed better than OpenAI’s GPT-3.5 at handling longer and more complex reasoning chains.

But the study also found that, like many large language models, Gemini Pro particularly struggles with math problems involving several digits, and users have found plenty of examples of bad reasoning and mistakes. It has also made factual errors on simple queries, such as who won the latest Oscars. Google has promised improvements, but it’s not clear when they’ll arrive.

Gemini Pro is also available via API in Vertex AI, Google’s fully managed AI developer platform. The API accepts text as input and generates text as output. An additional endpoint, Gemini Pro Vision, can process text and imagery (including photos and video) and output text, along the lines of OpenAI’s GPT-4 with Vision model.

Using Gemini Pro in Vertex AI.

Within Vertex AI, developers can customize Gemini Pro to specific contexts and use cases using a fine-tuning or “grounding” process. Gemini Pro can also be connected to external, third-party APIs to perform particular actions.

Sometime in “early 2024,” Vertex customers will be able to tap Gemini Pro to power custom-built conversational voice and chat agents (i.e. chatbots). Gemini Pro will also become an option for driving search summarization, recommendation and answer generation features in Vertex AI, drawing on documents across modalities (e.g. PDFs, images) from different sources (e.g. OneDrive, Salesforce) to satisfy queries.

In AI Studio, Google’s web-based tool for app and platform developers, there are workflows for creating freeform, structured and chat prompts using Gemini Pro. Developers have access to both the Gemini Pro and Gemini Pro Vision endpoints, and they can adjust the model temperature to control the output’s creative range, provide examples to give tone and style instructions, and tune the safety settings.
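
For readers wondering what “temperature” actually changes, here’s a rough sketch of the general idea in Python. It is not Gemini’s code or AI Studio’s API: the function name `softmax_with_temperature` and the example scores are made up for illustration, and the technique shown is the standard temperature-scaled softmax that language models commonly use when picking the next word.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw model scores into probabilities, scaled by a temperature.

    Hypothetical helper for illustration only; not part of any Google SDK.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Dividing by the temperature sharpens (T < 1) or flattens (T > 1) the scores.
    scaled = [score / temperature for score in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate next words.
logits = [2.0, 1.0, 0.5]

low = softmax_with_temperature(logits, 0.2)   # low temperature: near-deterministic
high = softmax_with_temperature(logits, 2.0)  # high temperature: more varied picks

print("T=0.2:", [round(p, 3) for p in low])
print("T=2.0:", [round(p, 3) for p in high])
```

At the low setting, nearly all of the probability lands on the top-scoring word, so output is predictable. At the high setting, the probabilities spread out, which is what “creative range” amounts to in practice.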

Gemini Nano

Gemini Nano is a much smaller version of the Gemini Pro and Ultra models, and it’s efficient enough to run directly on (some) phones instead of sending the task to a server somewhere. So far it powers two features on the Pixel 8 Pro: Summarize in Recorder and Smart Reply in Gboard.

The Recorder app, which lets users push a button to record and transcribe audio, includes a Gemini-powered summary of your recorded conversations, interviews, presentations and other snippets. Users get these summaries even if they don’t have a signal or Wi-Fi connection available — and in a nod to privacy, no data leaves their phone in the process.

Gemini Nano is also in Gboard, Google’s keyboard app, as a developer preview. There, it powers a feature called Smart Reply, which helps to suggest the next thing you’ll want to say when having a conversation in a messaging app. The feature initially only works with WhatsApp, but will come to more apps in 2024, Google says.

Is Gemini better than OpenAI’s GPT-4?

There’s no way to know how the Gemini family really stacks up until Google releases Ultra later this year, but the company has claimed improvements on the state of the art — which is usually OpenAI’s GPT-4.

Google has repeatedly touted Gemini’s superiority on benchmarks, claiming that Gemini Ultra exceeds current state-of-the-art results on “30 of the 32 widely used academic benchmarks used in large language model research and development.” The company says that Gemini Pro, meanwhile, is more capable at tasks like summarizing content, brainstorming and writing than GPT-3.5.

But leaving aside the question of whether benchmarks really indicate a better model, the scores Google points to appear to be only marginally better than OpenAI’s corresponding models. And — as mentioned earlier — some early impressions haven’t been great, with users and academics pointing out that Gemini Pro tends to get basic facts wrong, struggles with translations, and gives poor coding suggestions.

How much will Gemini cost?

Gemini Pro is free to use in Bard and, for now, AI Studio and Vertex AI.

Once Gemini Pro exits preview in Vertex, however, input will cost $0.0025 per character while output will cost $0.00005 per character. Vertex customers pay per 1,000 characters (about 140 to 250 words) and, in the case of models like Gemini Pro Vision, per image ($0.0025).

Let’s assume a 500-word article contains 2,000 characters. Summarizing that article with Gemini Pro would cost $5 in input charges. Meanwhile, generating an article of a similar length would cost $0.10.
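
To check that math yourself, here’s a small Python sketch. The per-character rates are the ones quoted in this article and haven’t been verified against Google’s price list; the 2,000-character article length is the same assumption as above, and `estimate_cost` is a hypothetical helper, not part of any Google SDK. Like the example above, the summarization figure counts only the input characters and ignores the short summary that comes back.

```python
from decimal import Decimal

# Rates as quoted in this article (assumption: not checked against Google's pricing page).
INPUT_RATE_PER_CHAR = Decimal("0.0025")
OUTPUT_RATE_PER_CHAR = Decimal("0.00005")


def estimate_cost(num_chars: int, rate_per_char: Decimal) -> Decimal:
    """Return the estimated charge in dollars for a number of characters."""
    return num_chars * rate_per_char


article_chars = 2_000  # assumed length of a 500-word article

summarize_cost = estimate_cost(article_chars, INPUT_RATE_PER_CHAR)
generate_cost = estimate_cost(article_chars, OUTPUT_RATE_PER_CHAR)

print(f"Summarizing (input only): ${summarize_cost:.2f}")
print(f"Generating (output only): ${generate_cost:.2f}")
```

Decimal is used instead of floats so that tiny per-character rates don’t pick up rounding noise when multiplied out.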

Where can you try Gemini?

Gemini Pro

The easiest place to experience Gemini Pro is in Bard. A fine-tuned version of Pro is answering text-based Bard queries in English in the U.S. right now, with additional languages and supported countries set to arrive down the line.

Gemini Pro is also accessible in preview in Vertex AI via an API. The API is free to use “within limits” for the time being and supports 38 languages and regions including Europe, as well as features like chat functionality and filtering.

Elsewhere, Gemini Pro can be found in AI Studio. Using the service, developers can iterate on prompts and Gemini-based chatbots and then get API keys to use them in their apps, or export the code to a more fully featured IDE.

Duet AI for Developers, Google’s suite of AI-powered assistance tools for code completion and generation, will start using a Gemini model in the coming weeks. And Google plans to bring Gemini models to dev tools for Chrome and its Firebase mobile dev platform around the same time, in early 2024.

Gemini Nano

Gemini Nano is on the Pixel 8 Pro — and will come to other devices in the future. Developers interested in incorporating the model into their Android apps can sign up for a sneak peek.

We’ll keep this post up to date with the latest developments.


