Tag Archives: multimodal AI

Farewell Photoshop? Google’s new AI lets you edit images by asking.

Having true multimodal output opens up interesting new possibilities in chatbots. For example, Gemini 2.0 Flash can play interactive graphical games or generate stories with consistent illustrations, maintaining character and setting continuity throughout multiple images. It’s far from perfect, but character consistency is a new capability in AI assistants.… Read More »

Cheap AI “video scraping” can now extract data from any screen recording

Video scraping is just one of many new tricks possible when the latest large language models (LLMs), such as Google’s Gemini and GPT-4o, are actually “multimodal” models, allowing audio, video, image, and text input. These models translate any multimedia input into tokens (chunks of data), which they use to make predictions about which tokens should… Read More »
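The article doesn’t include code, but the workflow it describes is easy to sketch. Below is a minimal illustration, assuming the google-generativeai Python SDK; the file name, model choice, and prompt are placeholders rather than details from the piece.

```python
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Upload the screen recording; the model tokenizes the video (frames and audio)
# just like text, then predicts tokens conditioned on it.
video = genai.upload_file("screen_recording.mp4")
while video.state.name == "PROCESSING":  # wait for the upload to finish processing
    time.sleep(5)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content([
    video,
    "Extract every numeric value shown on screen as CSV with columns: timestamp, label, value.",
])
print(response.text)  # the "scraped" data, returned as plain text
```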

ChatGPT update enables its AI to “see, hear, and speak,” according to OpenAI

On Monday, OpenAI announced a significant update to ChatGPT that enables its GPT-3.5 and GPT-4 AI models to analyze images and react to them as part of a text conversation. Also, the ChatGPT mobile app will add speech synthesis options that, when paired with its existing speech recognition features, will enable… Read More »
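The excerpt above is about the ChatGPT app, but image-in-conversation input later surfaced in OpenAI’s Chat Completions API as well. A minimal sketch, assuming the openai Python SDK (v1+); the model name and image URL are placeholders, not details from the article.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder: any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what is happening in this photo."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```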

Meta’s “massively multilingual” AI model translates up to 100 languages, speech or text

On Tuesday, Meta announced SeamlessM4T, a multimodal AI model for speech and text translations. As a neural network that can process both text and audio, it can perform text-to-speech, speech-to-text, speech-to-speech, and text-to-text translations for “up to 100 languages,” according to Meta. Its goal is to help people who… Read More »
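Meta’s announcement doesn’t ship code, but SeamlessM4T was later ported to Hugging Face transformers. A minimal sketch of text-to-text and text-to-speech translation under that assumption; the checkpoint and language codes are illustrative.

```python
from transformers import AutoProcessor, SeamlessM4TModel

processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-medium")

# English text in, French text out (text-to-text translation)
text_inputs = processor(text="Hello, how are you?", src_lang="eng", return_tensors="pt")
tokens = model.generate(**text_inputs, tgt_lang="fra", generate_speech=False)
print(processor.decode(tokens[0].tolist()[0], skip_special_tokens=True))

# Same English text in, spoken French out (text-to-speech translation)
audio = model.generate(**text_inputs, tgt_lang="fra")[0].cpu().numpy().squeeze()
# 'audio' is a 16 kHz waveform array that can be written to a WAV file
```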

Google’s PaLM-E is a generalist robot brain that takes commands

A robotic arm controlled by PaLM-E reaches for a bag of chips in a demonstration video (credit: Google Research). On Monday, a group of AI researchers from Google and the Technical University of Berlin unveiled PaLM-E, a multimodal embodied visual-language model (VLM) with 562 billion parameters that… Read More »

Microsoft unveils AI model that understands image content, solves visual puzzles

On Monday, researchers from Microsoft introduced Kosmos-1, a multimodal model that can reportedly analyze images for content, solve visual puzzles, perform visual text recognition, pass visual IQ tests, and understand natural language instructions. The… Read More »