OpenAI’s new web crawler, “GPTBot,” may help improve future ChatGPT models, the company says.
In a recent blog post, OpenAI said that data gathered from web pages crawled with the GPTBot user agent could improve the accuracy and capabilities of future models. You can identify GPTBot by the following user agent token and full string:
User agent token: GPTBot
Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
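For example, a site operator can spot the crawler by checking the User-Agent header of incoming requests. The following Python sketch (the helper name is hypothetical) shows a simple substring check against the published token:

def is_gptbot(user_agent: str) -> bool:
    # "GPTBot" is the token OpenAI publishes for its crawler.
    return "GPTBot" in user_agent

# The full user-agent string from OpenAI's documentation:
ua = ("Mozilla/5.0 AppleWebKit/537.36 "
      "(KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)")
print(is_gptbot(ua))  # True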
A web crawler, also known as a web spider, is a type of bot that builds an index of the content of websites. Websites rely on crawlers to be indexed by search engines such as Google and Bing.
According to OpenAI, the web crawler will collect freely available data from the internet, but it will exclude content from sites that require paywall access, are known to collect personally identifiable information, or contain text that violates its policies.
How to Disallow GPTBot
Webmasters can block the crawler by adding a “Disallow” rule for GPTBot to their site’s robots.txt file:
User-agent: GPTBot
Disallow: /
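To confirm the rule behaves as intended, you can test it with Python’s standard-library robots.txt parser; example.com below is a placeholder for your own domain:

from urllib.robotparser import RobotFileParser

# Parse the two-line rule above directly, without fetching anything.
parser = RobotFileParser()
parser.parse([
    "User-agent: GPTBot",
    "Disallow: /",
])

# GPTBot is denied everywhere; agents with no matching rule default to allowed.
print(parser.can_fetch("GPTBot", "https://example.com/any/page"))     # False
print(parser.can_fetch("Googlebot", "https://example.com/any/page"))  # True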
How to Allow GPTBot with Limited Access
If you want to allow GPTBot but limit its access, you can configure your robots.txt file as follows:
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
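The same standard-library parser can verify the split rule; the directory names here simply mirror the hypothetical example above:

from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: GPTBot",
    "Allow: /directory-1/",
    "Disallow: /directory-2/",
])

# Only the allowed directory is crawlable by GPTBot.
print(parser.can_fetch("GPTBot", "https://example.com/directory-1/page.html"))  # True
print(parser.can_fetch("GPTBot", "https://example.com/directory-2/page.html"))  # False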
Webmasters can prevent GPTBot from accessing their sites by identifying it in robots.txt. However, some argue that, unlike search engine crawlers, which drive traffic to a site, permitting GPTBot offers no benefit in return. The unauthorized use of protected works also remains a major concern.
Source: OpenAI