Technical SEO · Robots.txt

One Wrong Line in Robots.txt
Can Deindex Your Whole Site

Google fetches robots.txt separately for every origin, which means example.com, www.example.com, and the HTTP and HTTPS versions of each can all be running different rules without anyone realizing it. TechySEO checks the actual file against your real URL inventory, catches the wildcard that's blocking more than intended, and flags the kind of mistake that took one well-known site weeks to recover from after a single stray Disallow: /.

The Most Dangerous File on Your Site Is Probably Twelve Lines Long

Robots.txt is small, plain text, and easy to underestimate, which is exactly why it's caused some of the most dramatic SEO incidents on record. A single misplaced wildcard can take out entire sections of a site's crawl access in one edit. There's a well-known case of a site accidentally shipping a Disallow: / to production and watching itself disappear from Google over the following days, a mistake that took weeks to fully recover from once it was caught and fixed.

The subtler failures matter too. Block CSS and JavaScript and Google can't render the page the way a real visitor sees it, which affects how it evaluates content, page experience, and mobile usability all at once. Skip the Sitemap declaration and Googlebot has to find the sitemap some other way instead of being pointed straight at it.

Here's the part that catches people off guard: robots.txt is fetched per origin, not per site. https://example.com, https://www.example.com, and the plain HTTP version are each treated as separate origins with their own robots.txt. It's entirely possible for one to be configured correctly and another to still be running an old, more restrictive version nobody remembered to update.
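To make the per-origin point concrete, here is a minimal Python sketch that lists the robots.txt URL for each common origin variant and groups origins by the rules they serve. The function names and the normalization (dropping comments and blank lines) are assumptions made for illustration, not a description of how any particular tool works, and the fetching step is left out on purpose.

```python
def robots_urls_for(domain: str) -> list[str]:
    """Return the robots.txt URL for each common origin variant of a domain."""
    bare = domain[4:] if domain.startswith("www.") else domain
    hosts = [bare, f"www.{bare}"]
    return [
        f"{scheme}://{host}/robots.txt"
        for scheme in ("https", "http")
        for host in hosts
    ]


def normalize(body: str) -> tuple[str, ...]:
    """Drop comments, blank lines, and surrounding whitespace before comparing."""
    lines = []
    for raw in body.splitlines():
        line = raw.split("#", 1)[0].strip()
        if line:
            lines.append(line)
    return tuple(lines)


def group_origins(bodies: dict[str, str]) -> list[list[str]]:
    """Group origins whose robots.txt rules are identical after normalization.

    `bodies` maps an origin (hypothetical input, fetched elsewhere) to the
    text of its robots.txt. More than one group means the origins disagree.
    """
    groups: dict[tuple[str, ...], list[str]] = {}
    for origin, body in bodies.items():
        groups.setdefault(normalize(body), []).append(origin)
    return list(groups.values())
```

If `group_origins` returns more than one group, at least one origin is serving a different rule set from the others, and that origin is the one to look at first.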

🚫
Critical Pages Blocked
An overly broad disallow pattern catches a landing page nobody meant to block, and indexing for it just stops.
🌐
A Different Robots.txt Per Origin
The www and non-www versions of your domain are checked independently. One getting updated doesn't mean the other did.
🎨
CSS and JS Blocked
Google renders the page without its styling or scripts, which changes what it thinks the page actually looks like.
🔧
Syntax Errors
Different crawlers can interpret a malformed rule differently, which means blocking behavior stops being predictable.

From Syntax to What the Rules Actually Block

A robots.txt file can be syntactically perfect and still be doing something nobody intended. Both get checked.

🔧
Syntax Validation
Field names, colon placement, spacing, encoding, and the BOM character that quietly breaks parsing for some crawlers, all checked against spec.
🚫
Critical Pages Actually Blocked
Disallow rules get cross-checked against your real crawled URLs, so a high-traffic page caught by an overly broad pattern doesn't go unnoticed.
🎨
CSS & JavaScript Blocking
Catches rules that keep Google from rendering the page fully, which quietly affects content understanding, Core Web Vitals checks, and mobile usability all at once.
🗺️
Sitemap Declaration Check
Confirms a Sitemap: line actually exists and that the URL it points to is live and returning the right content type.
🎯
Wildcard Pattern Analysis
Estimates how many real URLs a given wildcard rule actually catches, since "looks fine on paper" and "blocks half the site" can be the exact same pattern.
🌐
Per-Origin Comparison
Checks robots.txt on each origin variant of your domain separately, since Google does too, and a mismatch between them is otherwise invisible.
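As a rough illustration of the syntax checks listed above, the sketch below flags a leading UTF-8 BOM, lines with no colon, and field names it doesn't recognize. The `KNOWN_FIELDS` set and the message wording are assumptions chosen for the example. Crawl-delay is included only because some crawlers read it, and a real validator would cover far more cases.

```python
# Field names this sketch accepts; an assumption for illustration only.
KNOWN_FIELDS = {"user-agent", "allow", "disallow", "sitemap", "crawl-delay"}


def lint_robots(raw: bytes) -> list[str]:
    """Return a list of human-readable problems found in a robots.txt body."""
    problems = []
    if raw.startswith(b"\xef\xbb\xbf"):
        problems.append("file starts with a UTF-8 BOM")
        raw = raw[3:]
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return problems + ["file is not valid UTF-8"]

    for number, line in enumerate(text.splitlines(), start=1):
        content = line.split("#", 1)[0].strip()  # ignore comments
        if not content:
            continue
        if ":" not in content:
            problems.append(f"line {number}: missing colon")
            continue
        field = content.split(":", 1)[0].strip().lower()
        if field not in KNOWN_FIELDS:
            problems.append(f"line {number}: unknown field '{field}'")
    return problems
```

A misspelled `Disalow:` is the kind of line most crawlers skip without complaint, so the rule never applies and nothing ever reports it.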

How the Robots.txt Audit Actually Runs

1
Fetched and Broken Into Real Rules
Pulled from the root of each origin variant and parsed into user-agent groups, allow/disallow paths, and sitemap declarations.
2
Every Rule Gets Tested Against Real URLs
Each disallow pattern runs against your actual crawl inventory, so you see exactly which pages, CSS files, and JS files it catches, not just what the pattern looks like on paper.
3
Blocked Pages Get Ranked by What They'd Cost You
A blocked page with strong internal links, sitemap presence, and ranking history jumps to the top of the review list, ahead of a blocked page nobody would miss.
4
Each Issue Comes With What to Actually Do About It
Remove the rule, narrow the wildcard, add the missing Sitemap line, or carve out an exception for specific CSS/JS paths while leaving the rest of the block in place.
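Step 2 above can be sketched in a few lines of Python. This assumes Google-style matching, where a rule is a path prefix, `*` matches any run of characters, and a trailing `$` anchors the end of the URL. It tests one Disallow pattern at a time and deliberately ignores Allow rules and precedence between rules. The helper names and the sample inventory are hypothetical.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate one robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape literal pieces and let each '*' match any run of characters.
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))


def blocked_paths(pattern: str, paths: list[str]) -> list[str]:
    """Return the paths from the inventory that a single Disallow pattern catches."""
    if not pattern:
        return []  # an empty Disallow value blocks nothing
    rx = pattern_to_regex(pattern)
    return [p for p in paths if rx.match(p)]  # match() anchors at the start


# Hypothetical crawl inventory, for illustration only.
inventory = [
    "/pricing",
    "/products/widget",
    "/private/report",
    "/assets/site.css",
    "/blog/post",
]
```

With this inventory, `blocked_paths("/p", inventory)` catches `/pricing` and `/products/widget` along with `/private/report`, which is the "blocks more than intended" case, while `blocked_paths("/*.css$", inventory)` catches only the stylesheet.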

Robots.txt Auditing in Practice

Staging to Production
The Staging Disallow That Went Live by Mistake
This is the classic, expensive one: the staging robots.txt with Disallow: / makes it into the production deploy. A Disallow: / rule gets flagged as critical the moment it's found, instead of being discovered later through a traffic graph that's quietly gone to zero.
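One way to catch this before it ships is a check in the deploy pipeline. The sketch below uses Python's standard `urllib.robotparser` to ask whether a given user agent may fetch the site root. The function name and the choice of Googlebot as the default agent are assumptions for the example. The standard-library parser doesn't implement Google's wildcard extensions, which is fine here because the check only looks at `/`.

```python
from urllib.robotparser import RobotFileParser


def blocks_everything(robots_txt: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the rules stop `user_agent` from fetching the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


staging = "User-agent: *\nDisallow: /\n"
production = "User-agent: *\nDisallow: /admin/\n"
```

Here `blocks_everything(staging)` is true and `blocks_everything(production)` is false, so failing the build on a true result would stop the staging file from going live.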
CMS Platform Migrations
When the New Platform's URLs Don't Match the Old Rules
A new CMS generates a different URL structure, and the robots.txt rules written for the old structure can end up blocking patterns that didn't even exist before. Checking the rules against the new URL inventory right after migration catches this while it's still a same-day fix.
Ongoing Auditing
Knowing the Moment Someone Edits the File
A developer touches robots.txt without necessarily understanding what it'll block. Getting a flag the moment the file changes, with exactly what's new and what it affects, beats finding out three weeks later why a section stopped getting crawled.
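A change report like that can start from a plain line diff. This sketch uses Python's standard `difflib` to list which rule lines were added and removed between two versions of the file. The function name and return shape are assumptions, and it says nothing about which URLs a new rule affects, which would need the rule tested against a URL inventory.

```python
import difflib


def rule_changes(old: str, new: str) -> dict[str, list[str]]:
    """List robots.txt lines added and removed between two versions."""
    old_lines = [line.strip() for line in old.splitlines() if line.strip()]
    new_lines = [line.strip() for line in new.splitlines() if line.strip()]
    added, removed = [], []
    for line in difflib.ndiff(old_lines, new_lines):
        if line.startswith("+ "):
            added.append(line[2:])
        elif line.startswith("- "):
            removed.append(line[2:])
    return {"added": added, "removed": removed}
```

Any new Disallow line in `added` is the one to review first, since that's where a section can quietly drop out of the crawl.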

Robots.txt Analysis: FAQs

My www and non-www domains could have different robots.txt files? That seems wrong.
It's correct, just not well known. Google treats robots.txt as specific to each origin (scheme, host, and port together), not to the domain as a general concept. https://example.com/robots.txt and https://www.example.com/robots.txt are two separate fetches as far as Google is concerned. Update one during a migration and forget the other, and you've got two different rule sets live at once with nothing visibly wrong on either page.
Does blocking a page in robots.txt actually keep it out of Google's index?
Not reliably, and this trips people up constantly. Robots.txt controls crawling, not indexing. If Google finds the URL through links elsewhere, it can index that bare URL based on link signals alone, without ever crawling it, since the disallow rule never let it look at the content. If the actual goal is "keep this out of search results," that's a job for noindex, not disallow, and the page has to stay crawlable so Google can actually see the noindex. Robots.txt is for saving crawl budget on pages you don't need crawled, which is a different goal entirely.
Why shouldn't I block CSS and JavaScript?
Google renders pages like a browser does, which means it needs those files to see the page the way a visitor actually would. Block them and Google's working from an unstyled, unscripted version that can look nothing like reality, which throws off content understanding, mobile usability checks, and Core Web Vitals detection all at once.
What happens if robots.txt has a syntax error?
Google's crawler is fairly forgiving about minor syntax issues, generally skipping a line it doesn't understand rather than failing outright. The bigger risk is the file returning a server error. When robots.txt returns a 5xx status, Google temporarily treats the entire site as disallowed and keeps retrying the file, which is a serious outcome for what's usually just a flaky server hiccup.
Should the sitemap URL be declared in robots.txt?
Yes. A Sitemap: line gives any crawler direct discovery without relying on a Search Console submission, which matters most for crawlers other than Google that never had access to that submission in the first place. Multiple sitemaps just mean multiple Sitemap: lines.
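For a quick look at what a file actually declares, this Python sketch pulls out every Sitemap: line and checks that each value is an absolute http(s) URL, since sitemap locations are expected to be fully qualified. The function names are assumptions for the example, and it doesn't fetch the sitemap or check its content type.

```python
from urllib.parse import urlsplit


def sitemap_urls(robots_txt: str) -> list[str]:
    """Collect the value of every Sitemap: line (field name is case-insensitive)."""
    urls = []
    for line in robots_txt.splitlines():
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


def is_absolute(url: str) -> bool:
    """True when the URL has an http(s) scheme and a host."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

An empty result from `sitemap_urls` means the declaration is missing, and any value that fails `is_absolute` is a relative path that crawlers may not resolve the way you expect.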

See What Your Robots.txt Is Actually Blocking

Check it against your real URLs, across every origin variant of your domain, before a wildcard you forgot about costs you a section of the site.

No credit card required · Free 7-day trial · Cancel anytime