GPT-4 Unveiled: Exploring OpenAI's Advanced Multimodal Language Model
Beyond GPT-3.5: Unraveling the Enhancements and Potential of OpenAI's GPT-4
By Shira Eisenberg, intern
On Tuesday, OpenAI released GPT-4, its “most capable and aligned,” though “still flawed,” language model yet.
The new model is multimodal, meaning it accepts both text and image inputs. The text-input feature is available now to all ChatGPT Plus subscribers via the ChatGPT interface, with a waitlist for API access. The image-input capability remains a research preview until it has been tested further for safety.
Organizations such as Microsoft Bing, Stripe, Duolingo, Morgan Stanley, Khan Academy, and OpenAI Converge portfolio companies already have access to the GPT-4 API and have integrated GPT-4 into their products. The rest of us have to wait.
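For developers who do get off the waitlist, calling GPT-4 is expected to look much like calling earlier chat models through OpenAI’s chat completions endpoint. Here is a minimal sketch using the openai Python package; the prompt is invented for illustration, and the exact model identifier available to your account may vary:

    # Minimal sketch of a GPT-4 API call, assuming your account has GPT-4 access.
    import openai

    openai.api_key = "YOUR_API_KEY"  # placeholder; use your own key

    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Summarize GPT-4's new capabilities in one sentence."},
        ],
    )

    print(response.choices[0].message.content)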
In many cases, GPT-4 represents a vast improvement on the prior model, GPT-3.5, scoring in the top 10% of test takers on the U.S. bar exam for law school graduates, whereas GPT-3.5 scored in the bottom 10%. We’ll talk more about test results later in this article.
Capabilities and the Demo
According to OpenAI, this next-generation language model improves on its predecessor in three key areas: visual input, creativity, and longer context. It is remarkable at collaborating on creative projects and can generate HTML, CSS, and JavaScript code for simple games given straightforward prompts.
Further examples of the model’s creative ability include writing lyrics, screenplays, and technical documents, and even “learning a user’s writing style.”
While GPT-4 appears similar to its predecessor in conversation, the difference emerges in the complexity of the tasks it can take on. “GPT-4 is more reliable, creative, and able to handle much more nuanced instructions,” said OpenAI.
In an online demo streamed on YouTube yesterday, OpenAI’s president, Greg Brockman, demonstrated some of the model’s capabilities. It is remarkable at coding: it built a full Discord bot from scratch, even correcting its code based on error messages. It can take a simple hand-drawn notebook sketch of a website and generate a fully functional web page from it. When fed sections of the tax code as context, it can also help individuals calculate their taxes. A version of the model has a context window of 32,000 tokens, the equivalent of about 50 pages of text, a dramatic improvement over GPT-3.5. The longer context window, along with the ability to send GPT-4 a web link and ask it to interact with the text on that page, is helpful for creating long-form content and sustaining extended conversations.
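To get a sense of how much text 32,000 tokens really is, you can count tokens with OpenAI’s open-source tiktoken tokenizer. A rough sketch follows; cl100k_base is the encoding used by GPT-4-family models, and the file name is a placeholder:

    # Rough sketch: check whether a document fits in GPT-4's 32,000-token window.
    import tiktoken

    CONTEXT_WINDOW = 32_000  # approximate limit of the large-context GPT-4 variant
    encoding = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4 models

    with open("long_document.txt") as f:  # placeholder file name
        text = f.read()

    n_tokens = len(encoding.encode(text))
    print(f"{n_tokens} tokens; fits in the window: {n_tokens < CONTEXT_WINDOW}")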
GPT-4’s multimodality means it can now receive images as a basis for engagement. In one example on OpenAI’s website, it is fed an image of a physics problem and asked to solve it; in another, it is fed an image of a few baking ingredients and asked what can be made from them. In the online demo, Brockman feeds GPT-4 a screenshot of a Discord window and asks it to describe the screenshot, which it does successfully. Again, the image-input capability is still in safety testing for users outside of a few select companies.
OpenAI claims that GPT-4 is significantly safer to use than GPT-3.5. In internal testing, GPT-4 produced 40% more factual responses and was 82% less likely to respond to requests for disallowed content. However, some creative prompt engineering got it to generate a plan to overthrow humanity by role-playing a misaligned AI.
Its guardrails seem both more flexible and more robust than GPT-3.5’s.
OpenAI credits these improvements in part to additional training with Reinforcement Learning from Human Feedback (RLHF). There was also a red team tasked with trying to elicit disallowed content from the model.
Exam Results
To understand the difference between GPT-4 and its predecessor, GPT-3.5, OpenAI tested both on a variety of benchmarks, including simulated exams originally designed for humans. It used the most recent publicly available tests or purchased 2022–2023 editions of practice exams, and did no exam-specific training. See the results below.
Limitations
While the model is ultimately more powerful and capable than its predecessor, it is not without limitations. Like GPT-3.5, it still struggles with social biases, hallucinations, and adversarial prompts. In the words of Greg Brockman, “GPT-4 is not perfect, but neither are you.”
The Paper and Lack of Architecture Details
The following are some takeaways from the GPT-4 technical report.
The model is multimodal, as discussed above, meaning it accepts both images and text. The paper is candid about safety challenges and describes both in-house and external safety evaluations. The model performs well on exams.
There is little information about the model or its architecture beyond the statement that “GPT-4 is a Transformer-style model pre-trained to predict the next token in a document, using both publicly available data (such as internet data) and data licensed from third-party providers.” We learn nothing about the size of the model or its training dataset.
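For readers unfamiliar with what “pre-trained to predict the next token” means in practice, here is a toy illustration of that objective in PyTorch. This is a generic sketch, not OpenAI’s code: a real GPT-style model uses causal masking so each position only attends to earlier tokens, and its dimensions are vastly larger.

    # Toy sketch of the next-token prediction objective used to pre-train GPT-style models.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    vocab_size, d_model, seq_len = 50257, 64, 16  # toy sizes, not GPT-4's

    embed = nn.Embedding(vocab_size, d_model)
    block = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
    lm_head = nn.Linear(d_model, vocab_size)

    tokens = torch.randint(0, vocab_size, (1, seq_len))  # a stand-in "document"
    hidden = block(embed(tokens))                         # contextual representations
    logits = lm_head(hidden)                              # a score for every possible next token

    # Each position is trained to predict the token that follows it, hence the shift by one.
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, vocab_size),
        tokens[:, 1:].reshape(-1),
    )
    print(loss.item())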
We learn that the model was “fine-tuned using RLHF,” but it is unclear exactly what method was used or whether it resembled the methods used for InstructGPT and ChatGPT.
Conclusion
GPT-4 clearly brings a lot of improvements, but it is far from perfect. OpenAI knows this and is continuing to make progress on future models. Who knows what GPT-5 will hold?