How ChatGPT picks sources
When a query triggers web search, ChatGPT retrieves a small set of candidate URLs through a Bing-backed retrieval layer, evaluates each for relevance and source authority, and composes an answer that may quote or paraphrase those candidates with attribution. The signals that influence selection include passage structure, schema clarity, classical authority (links and brand mentions), and freshness. Sites that disallow GPTBot in robots.txt are excluded from OpenAI's own crawl.
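To make the GPTBot point concrete, here is a minimal sketch of checking whether a robots.txt file blocks that crawler. It uses Python's standard urllib.robotparser module. The robots.txt content, the example.com URL, the SomeOtherBot agent, and the is_allowed helper are all assumptions for illustration, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks GPTBot everywhere, allows every other crawler.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative URL and agent names (assumptions, not real traffic).
gptbot_ok = is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/guide")
other_ok = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/guide")

print("GPTBot allowed:", gptbot_ok)
print("Other crawler allowed:", other_ok)
```

Running the same check against a site's live robots.txt is one way to confirm whether a crawl block is in place before drawing conclusions about missing citations.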
ChatGPT uses two paths to source content. The first is the model's training data, which is a snapshot of the open web through the model's training cutoff. The second is real-time web retrieval, which OpenAI added in late 2023 and has expanded since. Citations almost always come from the retrieval path, because OpenAI does not surface training-data sources for compliance reasons.
The retrieval layer for ChatGPT runs on a Bing-backed index plus OpenAI's own crawl through GPTBot. When a user asks a question that requires fresh information, the model fetches a set of candidate URLs, reads them, and composes an answer with citations. Which URLs get fetched is driven by classical retrieval signals; which passages get cited from those URLs is driven by passage structure.
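The two-stage selection described above can be sketched as a toy model: rank pages by retrieval signals first, then pick one passage per fetched page. This is an illustration only, not OpenAI's actual scoring. The Page fields, the weights, the brevity heuristic, and the example URLs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    authority: float      # hypothetical 0..1 stand-in for links and brand mentions
    freshness: float      # hypothetical 0..1 score, 1.0 = very recent
    passages: list[str]


def retrieval_score(page: Page, w_authority: float = 0.6, w_fresh: float = 0.4) -> float:
    """Stage 1: toy weighted sum of classical signals (weights are assumptions)."""
    return w_authority * page.authority + w_fresh * page.freshness


def passage_score(passage: str, query_terms: list[str]) -> float:
    """Stage 2: toy structure heuristic, term overlap scaled down for long passages."""
    words = passage.lower().split()
    if not words:
        return 0.0
    overlap = sum(1 for term in query_terms if term in words)
    brevity = 1.0 if len(words) <= 40 else 40 / len(words)
    return overlap * brevity


def pick_citations(pages: list[Page], query: str, top_k: int = 2) -> list[tuple[str, str]]:
    terms = query.lower().split()
    # Fetch only the top_k pages by retrieval score.
    fetched = sorted(pages, key=retrieval_score, reverse=True)[:top_k]
    # From each fetched page, cite the single best-scoring passage.
    return [
        (page.url, max(page.passages, key=lambda p: passage_score(p, terms)))
        for page in fetched
    ]


pages = [
    Page("https://example.com/a", 0.9, 0.2,
         ["GPTBot is a crawler.", "Our company history began in 1999."]),
    Page("https://example.com/b", 0.5, 0.9,
         ["Block GPTBot with a robots.txt rule.", "Subscribe to our newsletter."]),
    Page("https://example.com/c", 0.1, 0.1, ["robots.txt gptbot gptbot"]),
]

citations = pick_citations(pages, "block gptbot robots.txt")
for url, passage in citations:
    print(url, "->", passage)
```

The point of the split is that a page can clear the first stage on authority and freshness and still contribute a weak citation if none of its passages answers the question directly.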