Short heading goes here

Short description goes here so it complements the heading.

""
Table of contents
Listen to article Loading…
Share

Video

test

---

Video BG

Wide Image

Booking.com team photo

Table 1

Label | Details
Example A | Short description for A.
Example B | Longer description that wraps; no horizontal scroll.
Example C | Another line with more content to test wrapping.

Table 2

Booking.com Flagship AI Solutions

Solution | Challenge it solves | How it works
Smart Filters | Traditional search relied on drop-down menus and checkboxes, limiting travelers to a small number of filters. | Uses GPT-4o mini to understand prompts like “sunset views” or “great gym.” Goes beyond predefined filters by analyzing reviews, images, and listing details. Surfaces more relevant results, driving engagement and conversions.
Property Q&A | Many travelers have specific questions about properties that aren’t easily answered in a static listing. | OpenAI’s LLMs were fine-tuned on user content and property descriptions. Handles queries like “Is there a crib available?” or “Is the pool open in winter?” Adapts to ambiguity in pet-policy definitions.
AI Review Summaries | Travelers often struggle to sift through thousands of reviews when comparing properties. | GPT-4o mini analyzes and summarizes reviews into themes (cleanliness, location, amenities), generating concise summaries that speed decisions and boost confidence (see the sketch after this table).
Help Me Reply | Hosts need to manage guest communications efficiently and cut response times. | Auto-generates responses and templates via OpenAI’s models; hosts track replies with a reply-score metric.
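
To make the AI Review Summaries row concrete, here is a minimal sketch of what a theme-based summary call could look like with OpenAI's Python SDK. The sample reviews and the system prompt are illustrative assumptions; Booking.com's actual pipeline and prompts are not described in this page.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical sample reviews; real input would come from the property's review store.
reviews = [
    "Spotless rooms and the front desk staff were lovely.",
    "Great location, five minutes from the beach, but the gym is tiny.",
    "Wi-Fi kept dropping, though breakfast was excellent.",
]

# Ask GPT-4o mini to condense the reviews into the themes named in the table above.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": (
                "Summarize the following guest reviews into themes "
                "(cleanliness, location, amenities) in two or three sentences."
            ),
        },
        {"role": "user", "content": "\n".join(reviews)},
    ],
)

print(response.choices[0].message.content)
```

The same call shape would work for Smart Filters by swapping the system prompt for one that maps a free-text request like “sunset views” onto structured filter tags.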

Table 3

Model Evaluation Scores

Benchmark | GPT-4.5 | GPT-4o
GPQA (science) | 71.4% | 53.6%
AIME ’24 (math) | 36.7% | 9.3%
MMMLU (multilingual) | 85.1% | 81.5%
MMMU (multimodal) | 74.4% | 69.1%
SWE-Lancer Diamond (coding)* | 32.6% ($186,125) | 23.3% ($138,750)
SWE-Bench Verified (coding)* | 38.0% | 30.7%

* Numbers shown represent best internal performance.

Table 4

Model Evaluation: GPT-4 Comparison

Benchmark | GPT-4.5 | GPT-4o | Baseline
GPQA (science) | 71.4% | 53.6% | 50.0%
AIME ’24 (math) | 36.7% | 9.3% | 25.0%
MMMLU (multilingual) | 85.1% | 81.5% | 70.0%
MMMU (multimodal) | 74.4% | 69.1% | 60.0%
SWE-Lancer Diamond (coding)* | 32.6% | 23.3% | 10.0%
SWE-Bench Verified (coding)* | 38.0% | 30.7% | 20.0%

* Numbers shown represent best internal performance.

Block Quotes:

This doesn't look anything like what I want


FAQ

Short question goes here

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse varius enim in eros elementum tristique.

Short heading goes here