OpenAI’s GPT-4 exhibits “human-level performance” on professional benchmarks

By | March 14, 2023
A colorful AI-generated image of a radiating silhouette.
Ars Technica

On Tuesday, OpenAI announced GPT-4, a large multimodal model that can accept text and image inputs while returning text output that “exhibits human-level performance on various professional and academic benchmarks,” according to OpenAI. Also on Tuesday, Microsoft announced that Bing Chat has been running on GPT-4 all along.

If it performs as claimed, GPT-4 potentially represents the opening of a new era in artificial intelligence. “It passes a simulated bar exam with a score around the top 10% of test takers,” writes OpenAI in its announcement. “In contrast, GPT-3.5’s score was around the bottom 10%.”

OpenAI plans to release GPT-4’s text capability through ChatGPT and its commercial API, but with a waitlist at first. GPT-4 is currently available to subscribers of ChatGPT Plus. Also, the firm is testing GPT-4’s image input capability with a single partner, Be My Eyes, an upcoming smartphone app that can recognize a scene and describe it.

Along with the introductory website, OpenAI also released a technical paper describing GPT-4’s capabilities and a system model card describing its limitations in detail.

A screenshot of GPT-4's introduction to ChatGPT Plus customers from March 14, 2023.
Enlarge / A screenshot of GPT-4’s introduction to ChatGPT Plus customers from March 14, 2023.
Benj Edwards / Ars Technica

GPT stands for “generative pre-trained transformer,” and GPT-4 is part of a series of foundational language models extending back to the original GPT in 2018. Following the original release, OpenAI announced GPT-2 in 2019 and GPT-3 in 2020. A further refinement called GPT-3.5 arrived in 2022. In November, OpenAI released ChatGPT, which at that time was a fine-tuned conversational model based on GPT-3.5.

AI models in the GPT series have been trained to predict the next token (a fragment of a word) in a sequence of tokens using a large body of text pulled largely from the Internet. During training, the neural network builds a statistical model that represents relationships between words and concepts. Over time, OpenAI has increased the size and complexity of each GPT model, which has resulted in generally better performance, model-over-model, compared to how a human would complete text in the same scenario, although it varies by task.

As far as tasks go, GPT-4’s performance is notable. As with its predecessors, it can follow complex instructions in natural language and generate technical or creative works, but it can do so with more depth: It supports generating and processing up to 32,768 tokens (around 25,000 words of text), which allows for much longer content creation or document analysis than previous models.

While analyzing GPT-4’s capabilities, OpenAI made the model take tests like the Uniform Bar Exam, the Law School Admission Test (LSAT), the Graduate Record Examination (GRE) Quantitative, and various AP subject tests. On many of the tasks, it scored at a human level. That means if GPT-4 were a person being judged solely on test-taking ability, it could get into law school—and likely many universities as well.

Source