Question 1

What is robots.txt?

Accepted Answer

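robots.txt is a plain-text file at the root of a website that tells crawlers which paths they may or may not fetch. It is a convention (the Robots Exclusion Protocol, standardized as RFC 9309), not a security mechanism: any crawler can ignore it. It is primarily used to keep search engine crawlers out of staging pages, admin areas, or duplicate content. Note that it controls crawling, not indexing: a disallowed URL can still appear in search results if other pages link to it, so use a noindex directive on a crawlable page when the goal is to keep that page out of the index.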
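
As a rough illustration of how a crawler consults the file, here is a minimal sketch using Python's standard urllib.robotparser module. The file contents, the "ExampleBot" name, and the example.com URLs are made up for the example. Note that this standard-library parser applies rules in file order and does not implement the longest-match rule, so it is only suitable for simple cases like this one.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /staging/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is a placeholder crawler name.
admin_ok = parser.can_fetch("ExampleBot", "https://example.com/admin/login")
blog_ok = parser.can_fetch("ExampleBot", "https://example.com/blog/post")

print(admin_ok)  # False
print(blog_ok)   # True
```

A well-behaved crawler runs a check like this before every request; nothing in the protocol forces it to.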

Question 2

What is the longest-match rule?

Accepted Answer

When multiple rules match a URL, the most specific one (the longest pattern) wins. If both "Disallow: /private" and "Allow: /private/public" match a URL, the Allow wins because it is longer. RFC 9309 adds that when an Allow and a Disallow rule are equivalent, the Allow should be used. This is the rule specified in RFC 9309 and followed by Google and Bing, and it is what this tool implements.
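
The longest-match rule can be sketched in a few lines of Python. This is an illustrative sketch, not this tool's actual code: the function name is_allowed and the (directive, pattern) tuple format are assumptions, and patterns are treated as plain prefixes with no wildcard support.

```python
def is_allowed(rules, path):
    """Return True if `path` may be fetched under `rules`.

    `rules` is a list of (directive, pattern) tuples, where directive is
    "allow" or "disallow". Patterns are plain prefixes in this sketch.
    """
    best_len = -1
    allowed = True  # a path matched by no rule is allowed
    for directive, pattern in rules:
        # An empty pattern matches nothing; skip non-matching rules.
        if not pattern or not path.startswith(pattern):
            continue
        is_allow = directive == "allow"
        # The longer pattern wins; on a tie, prefer Allow.
        if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
            best_len = len(pattern)
            allowed = is_allow
    return allowed


rules = [("disallow", "/private"), ("allow", "/private/public")]
print(is_allowed(rules, "/private/public/page"))  # True
print(is_allowed(rules, "/private/secret"))       # False
print(is_allowed(rules, "/other"))                # True
```

Because the comparison is on pattern length, the order of the rules in the file does not change the result.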

Question 3

What do the * and $ wildcards do?

Accepted Answer

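* matches any sequence of characters (including none). $ anchors the pattern to the end of the URL, where the matched URL is the path plus any query string: "Disallow: /file.pdf$" blocks /file.pdf but not /file.pdf?version=2. Without the $, a pattern is a prefix match, so "Disallow: /file.pdf" blocks both. Wildcards are a Google extension to the original spec and are now part of RFC 9309.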
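
One common way to implement these wildcards is to translate each pattern into a regular expression. The sketch below assumes that approach; the helper names pattern_to_regex and matches are hypothetical, and percent-encoding normalization is left out for brevity.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with ".*" for every "*".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    # \Z anchors at the very end of the string; without "$" it is a prefix match.
    return re.compile(body + (r"\Z" if anchored else ""))


def matches(pattern, url):
    """Return True if `pattern` matches `url` (path plus optional query)."""
    # re.match anchors at the start of the string, as robots.txt patterns require.
    return pattern_to_regex(pattern).match(url) is not None


print(matches("/file.pdf$", "/file.pdf"))            # True
print(matches("/file.pdf$", "/file.pdf?version=2"))  # False
print(matches("/file.pdf", "/file.pdf?version=2"))   # True
print(matches("/*.pdf$", "/docs/report.pdf"))        # True
```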

Question 4

Does robots.txt affect all crawlers equally?

Accepted Answer

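No. Each User-agent group applies only to the named crawler. A rule under "User-agent: Googlebot" does not affect Bingbot. The wildcard group "User-agent: *" applies to any crawler not matched by a more specific group, and a crawler that does match a named group follows only that group and ignores the * group. The operators of AI crawlers such as GPTBot, ClaudeBot, and PerplexityBot state that they respect robots.txt, so you can control them by adding groups for those user agents.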
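
Group selection can be sketched as follows. This is a simplified illustration under stated assumptions: the select_group function and the dictionary layout are hypothetical, and the case-insensitive substring test is a simplification of the product-token matching that RFC 9309 describes.

```python
def select_group(groups, user_agent):
    """Return the rule list that applies to `user_agent`.

    `groups` maps a lowercase User-agent token (or "*") to its rules.
    """
    ua = user_agent.lower()
    # Named groups whose token appears in the crawler's name.
    candidates = [token for token in groups if token != "*" and token in ua]
    if candidates:
        # Prefer the most specific (longest) matching token.
        return groups[max(candidates, key=len)]
    # Fall back to the wildcard group, or to no rules at all.
    return groups.get("*", [])


groups = {
    "googlebot": [("disallow", "/no-google")],
    "*": [("disallow", "/private")],
}
print(select_group(groups, "Googlebot"))  # [('disallow', '/no-google')]
print(select_group(groups, "Bingbot"))    # [('disallow', '/private')]
```

Note that Googlebot receives only its own group here: the rules under * are not merged in.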

Question 5

What is Crawl-delay?

Accepted Answer

Crawl-delay tells a crawler how many seconds to wait between requests. It is a nonstandard directive that is not part of RFC 9309, so support varies: Google ignores it and sets its own crawl rate, while Bing and some other crawlers respect it. Setting it too high can slow down legitimate indexing.
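
For a crawler that does honor the directive, reading it might look like the sketch below, which uses the crawl_delay method of Python's standard urllib.robotparser (available since Python 3.6). The file contents and the "ExampleBot" name are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, invented for this example.
ROBOTS_TXT = """\
User-agent: bingbot
Crawl-delay: 10

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# crawl_delay returns the delay in seconds, or None if the group sets none.
delay_bing = parser.crawl_delay("bingbot")
delay_other = parser.crawl_delay("ExampleBot")

print(delay_bing)   # 10
print(delay_other)  # None
```

A polite crawler would then sleep for that many seconds between requests to the same host, and fall back to its own default when the value is None.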

robots.txt Validator & Tester
