📋

Technical SEO · Robots.txt

One Wrong Line in Robots.txt
Can Deindex Your Whole Site

Q: My www and non-www domains could have different robots.txt files? That seems wrong.

It's correct. Google treats robots.txt as specific to each origin (scheme, host, and port), so example.com and www.example.com are fetched and evaluated as separate robots.txt files.

Q: Does blocking a page in robots.txt actually keep it out of Google's index?

Not reliably. Robots.txt controls crawling, not indexing. Google can index a blocked URL based on link signals alone without ever crawling it. Use noindex to reliably keep a page out of search results, and leave that page crawlable, because Google has to fetch the page to see the noindex.

Q: Why shouldn't I block CSS and JavaScript?

Google renders pages like a browser and needs those files to see the page as a visitor would. Blocking them distorts content understanding, mobile usability checks, and Core Web Vitals detection.

Q: What happens if robots.txt has a syntax error?

Google is generally forgiving of minor syntax issues. A more serious risk is the file returning a server error (5xx), which Google temporarily treats as the entire site being disallowed while it keeps retrying the fetch.

Q: Should the sitemap URL be declared in robots.txt?

Yes. A Sitemap: directive gives any crawler direct discovery without relying on a Search Console submission, which matters for crawlers other than Google.

Google fetches robots.txt separately for every origin, which means https://example.com, https://www.example.com, and their plain HTTP counterparts can each be running different rules without anyone realizing it. TechySEO checks the actual file against your real URL inventory, catches the wildcard that's blocking more than intended, and flags the kind of mistake that took one well-known site weeks to recover from after a single stray Disallow: /.

Why It Matters

The Most Dangerous File on Your Site Is Probably Twelve Lines Long

Robots.txt is small, plain text, and easy to underestimate, which is exactly why it's caused some of the most dramatic SEO incidents on record. A single misplaced wildcard can take out entire sections of a site's crawl access in one edit. There's a well-known case of a site accidentally shipping a Disallow: / to production and watching itself disappear from Google over the following days, a mistake that took weeks to fully recover from once it was caught and fixed.

The subtler failures matter too. Block CSS and JavaScript and Google can't render the page the way a real visitor sees it, which affects how it evaluates content, page experience, and mobile usability all at once. Skip the Sitemap declaration and Googlebot has to find the sitemap some other way instead of being pointed straight at it.

Here's the part that catches people off guard: robots.txt is fetched per origin, not per site. https://example.com, https://www.example.com, and the plain HTTP version are each treated as separate origins with their own robots.txt. It's entirely possible for one to be configured correctly and another to still be running an old, more restrictive version nobody remembered to update.
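
As a rough illustration of what "per origin" means in practice, the sketch below works out which robots.txt governs a given page and lists the variants worth comparing for one domain. The function names are hypothetical, and the four scheme and www combinations are an assumption about what a typical site exposes, not a complete list.

```python
from urllib.parse import urlsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the origin (scheme, host, port) of page_url."""
    parts = urlsplit(page_url)
    # netloc keeps the host and any explicit port, so each origin maps to its own file.
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def origin_variants(domain: str) -> list[str]:
    """List robots.txt URLs for the usual scheme and www variants of a bare domain."""
    hosts = [domain, f"www.{domain}"]
    return [
        f"{scheme}://{host}/robots.txt"
        for scheme in ("https", "http")
        for host in hosts
    ]


if __name__ == "__main__":
    print(robots_url("https://www.example.com/blog/post?x=1"))
    for url in origin_variants("example.com"):
        print(url)
```

Fetching each of those URLs and comparing the bodies is what surfaces the stale, more restrictive copy nobody remembered to update.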

🚫

Critical Pages Blocked

An overly broad disallow pattern catches a landing page nobody meant to block, and indexing for it just stops.

🌐

A Different Robots.txt Per Origin

The www and non-www versions of your domain are checked independently. One getting updated doesn't mean the other did.

🎨

CSS and JS Blocked

Google renders the page without its styling or scripts, which changes what it thinks the page actually looks like.

🔧

Syntax Errors

Different crawlers can interpret a malformed rule differently, which means blocking behavior stops being predictable.

What It Checks

From Syntax to What the Rules Actually Block

A robots.txt file can be syntactically perfect and still be doing something nobody intended. Both get checked.

🔧

Syntax Validation

Field names, colon placement, spacing, encoding, and the byte order mark (BOM) that quietly breaks parsing for some crawlers are all checked against the spec.
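
A minimal sketch of this kind of check, not the tool's actual implementation: it looks for a leading UTF-8 BOM, lines with no colon, and field names outside a small known set. The KNOWN_FIELDS set and the lint_robots name are assumptions made for the example, and a real validator would cover more cases.

```python
import codecs

# Assumed set of recognized field names for this example; crawlers differ on
# which ones they honor (crawl-delay, for instance, is not used by Google).
KNOWN_FIELDS = {"user-agent", "allow", "disallow", "sitemap", "crawl-delay"}


def lint_robots(raw: bytes) -> list[str]:
    """Return human-readable warnings for a raw robots.txt payload."""
    warnings = []
    if raw.startswith(codecs.BOM_UTF8):
        warnings.append("file starts with a UTF-8 BOM")
        raw = raw[len(codecs.BOM_UTF8):]
    text = raw.decode("utf-8", errors="replace")
    for number, line in enumerate(text.splitlines(), start=1):
        content = line.split("#", 1)[0].strip()  # drop comments and padding
        if not content:
            continue
        field, sep, _value = content.partition(":")
        if not sep:
            warnings.append(f"line {number}: missing colon")
        elif field.strip().lower() not in KNOWN_FIELDS:
            warnings.append(f"line {number}: unknown field {field.strip()!r}")
    return warnings


if __name__ == "__main__":
    sample = b"\xef\xbb\xbfUser-agent: *\nDisalow: /tmp\nDisallow /x\n"
    for warning in lint_robots(sample):
        print(warning)
```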

🚫

Critical Pages Actually Blocked

Disallow rules get cross-checked against your real crawled URLs, so a high-traffic page caught by an overly broad pattern doesn't go unnoticed.

🎨

CSS & JavaScript Blocking

Catches rules that keep Google from rendering the page fully, which quietly affects content understanding, Core Web Vitals checks, and mobile usability all at once.

🗺️

Sitemap Declaration Check

Confirms a Sitemap: line actually exists and that the URL it points to is live and returning the right content type.
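
The first half of that check can be sketched with Python's standard library, which exposes declared sitemaps through RobotFileParser.site_maps() (Python 3.8+). The check_sitemaps helper is a hypothetical name. It only confirms that each declared value is an absolute http(s) URL; the liveness and content-type checks would need an actual HTTP request and are left out here.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def check_sitemaps(robots_text: str) -> dict[str, bool]:
    """Map each declared Sitemap URL to whether it is an absolute http(s) URL."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    report = {}
    # site_maps() returns None when the file declares no sitemaps.
    for url in parser.site_maps() or []:
        parts = urlsplit(url)
        report[url] = parts.scheme in ("http", "https") and bool(parts.netloc)
    return report


if __name__ == "__main__":
    sample = (
        "User-agent: *\n"
        "Disallow: /tmp/\n"
        "Sitemap: https://example.com/sitemap.xml\n"
        "Sitemap: /sitemap-news.xml\n"
    )
    print(check_sitemaps(sample))
```

An empty result means no Sitemap: line was found at all, which is the case the check is mainly there to catch.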

🎯

Wildcard Pattern Analysis

Estimates how many real URLs a given wildcard rule actually catches, since "looks fine on paper" and "blocks half the site" can be the exact same pattern.
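
To make the wildcard point concrete, here is a small sketch that translates a robots.txt path pattern into a regular expression and lists which paths it catches. It assumes the matching rules Google documents (* matches any run of characters, a trailing $ anchors the end of the URL, everything else is a prefix match); the function names and sample paths are made up for the example.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape literal pieces and join them with ".*" where the pattern had "*".
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))


def blocked_paths(pattern: str, paths: list[str]) -> list[str]:
    """Return the paths a single Disallow pattern would match."""
    rule = pattern_to_regex(pattern)
    # re.match anchors at the start, which gives the prefix-match behavior.
    return [path for path in paths if rule.match(path)]


if __name__ == "__main__":
    inventory = [
        "/products/shoes",
        "/products/shoes?sort=price",
        "/blog/?p=1",
        "/about",
        "/file.pdf",
        "/file.pdf?dl=1",
    ]
    for pattern in ("/*?", "/*.pdf$", "/p"):
        print(pattern, blocked_paths(pattern, inventory))
```

A pattern like /*? looks narrow but matches every URL with a query string, which is how a rule meant for one parameter ends up covering far more of the site.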

🌐

Per-Origin Comparison

Checks robots.txt on each origin variant of your domain separately, since Google does too, and a mismatch between them is otherwise invisible.

How It Works

How the Robots.txt Audit Actually Runs

Fetched and Broken Into Real Rules

Pulled from the root of each origin variant and parsed into user-agent groups, allow/disallow paths, and sitemap declarations.
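
The parsing step could look roughly like the sketch below. The Group class and parse_robots function are hypothetical names, and the grouping logic is a simplification: consecutive User-agent lines share one group, and a User-agent line that follows rules starts a new group.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    user_agents: list[str] = field(default_factory=list)
    rules: list[tuple[str, str]] = field(default_factory=list)  # (directive, path)


def parse_robots(text: str) -> tuple[list[Group], list[str]]:
    """Split robots.txt text into user-agent groups and sitemap declarations."""
    groups: list[Group] = []
    sitemaps: list[str] = []
    current = None
    for line in text.splitlines():
        content = line.split("#", 1)[0].strip()  # drop comments
        name, sep, value = content.partition(":")
        if not sep:
            continue  # blank or malformed line
        name, value = name.strip().lower(), value.strip()
        if name == "sitemap":
            sitemaps.append(value)  # sitemaps are independent of any group
        elif name == "user-agent":
            if current is None or current.rules:
                current = Group()
                groups.append(current)
            current.user_agents.append(value)
        elif name in ("allow", "disallow") and current is not None:
            current.rules.append((name, value))
    return groups, sitemaps


if __name__ == "__main__":
    sample = (
        "User-agent: Googlebot\nUser-agent: Bingbot\n"
        "Disallow: /private/\nAllow: /private/public.html\n\n"
        "User-agent: *\nDisallow: /tmp/\n"
        "Sitemap: https://example.com/sitemap.xml\n"
    )
    parsed_groups, parsed_sitemaps = parse_robots(sample)
    print(parsed_groups)
    print(parsed_sitemaps)
```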

Every Rule Gets Tested Against Real URLs

Each disallow pattern runs against your actual crawl inventory, so you see exactly which pages, CSS files, and JS files it catches, not just what the pattern looks like on paper.

Blocked Pages Get Ranked by What They'd Cost You

A blocked page with strong internal links, sitemap presence, and ranking history jumps to the top of the review list, ahead of a blocked page nobody would miss.

Each Issue Comes With What to Actually Do About It

Remove the rule, narrow the wildcard, add the missing Sitemap line, or carve out an exception for specific CSS/JS paths while leaving the rest of the block in place.
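
The "carve out an exception" fix relies on how conflicting rules are resolved. The sketch below assumes the precedence Google documents: the longest matching pattern wins, and Allow wins a tie. The is_allowed function and the /assets/ paths are illustrative only, and other crawlers may resolve conflicts differently.

```python
import re


def _matches(pattern: str, path: str) -> bool:
    """Prefix match where '*' is a wildcard and a trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.match(regex + ("$" if anchored else ""), path) is not None


def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Apply (directive, pattern) rules: longest match wins, allow wins ties."""
    best = ("allow", "")  # no matching rule means the path may be crawled
    for directive, pattern in rules:
        if not pattern or not _matches(pattern, path):
            continue  # an empty pattern matches nothing
        longer = len(pattern) > len(best[1])
        tie_for_allow = len(pattern) == len(best[1]) and directive == "allow"
        if longer or tie_for_allow:
            best = (directive, pattern)
    return best[0] == "allow"


if __name__ == "__main__":
    # Keep /assets/ blocked overall but let crawlers fetch CSS and JS.
    rules = [
        ("disallow", "/assets/"),
        ("allow", "/assets/css/"),
        ("allow", "/assets/js/"),
    ]
    for path in ("/assets/css/site.css", "/assets/fonts/a.woff2", "/index.html"):
        print(path, is_allowed(rules, path))
```

Because the Allow patterns are longer than the Disallow they sit under, the stylesheet and script paths stay crawlable while the rest of /assets/ remains blocked.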

Use Cases

Robots.txt Auditing in Practice

Staging to Production

The Staging Disallow That Went Live by Mistake

This is the classic, expensive one: the staging robots.txt with Disallow: / makes it into the production deploy. A Disallow: / rule gets flagged as critical the moment it's found, instead of being discovered later through a traffic graph that's quietly gone to zero.

CMS Platform Migrations

When the New Platform's URLs Don't Match the Old Rules

A new CMS generates a different URL structure, and the robots.txt rules written for the old structure can end up blocking patterns that didn't even exist before. Checking the rules against the new URL inventory right after migration catches this while it's still a same-day fix.

Ongoing Auditing

Knowing the Moment Someone Edits the File

A developer touches robots.txt without necessarily understanding what it'll block. Getting a flag the moment the file changes, with exactly what's new and what it affects, beats finding out three weeks later why a section stopped getting crawled.
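
One simple way to detect and describe an edit, sketched under the assumption that the previous copy of the file (or its hash) is stored somewhere between checks. The fingerprint and describe_change names are hypothetical.

```python
import difflib
import hashlib


def fingerprint(text: str) -> str:
    """Stable hash of the file body, cheap to store and compare between checks."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def describe_change(old: str, new: str) -> list[str]:
    """Return removed ('-') and added ('+') lines between two versions."""
    diff = list(
        difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm="", n=0)
    )
    # Skip the two file-header lines and the "@@" hunk markers.
    return [line for line in diff[2:] if not line.startswith("@@")]


if __name__ == "__main__":
    previous = "User-agent: *\nDisallow: /tmp/\n"
    current = "User-agent: *\nDisallow: /\n"
    if fingerprint(previous) != fingerprint(current):
        for line in describe_change(previous, current):
            print(line)
```

The added lines are the ones worth re-testing against the URL inventory, since they show what the edit newly blocks.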

FAQ

Robots.txt Analysis — FAQs

My www and non-www domains could have different robots.txt files? That seems wrong.

It's correct, just not well known. Google treats robots.txt as specific to each origin (scheme, host, and port together), not to the domain as a general concept. https://example.com/robots.txt and https://www.example.com/robots.txt are two separate fetches as far as Google is concerned. Update one during a migration and forget the other, and you've got two different rule sets live at once with nothing visibly wrong on either page.

Does blocking a page in robots.txt actually keep it out of Google's index?

Not reliably, and this trips people up constantly. Robots.txt controls crawling, not indexing. If Google finds the URL through links elsewhere, it can index that bare URL based on link signals alone, without ever crawling it, since the disallow rule never let it look at the content. If the actual goal is "keep this out of search results," that's a job for noindex, not disallow. The two don't combine, either: Google has to crawl the page to see the noindex, so a page that's also disallowed never gets its noindex read. Robots.txt is for saving crawl budget on pages you don't need crawled, which is a different goal entirely.

Why shouldn't I block CSS and JavaScript?

Google renders pages like a browser does, which means it needs those files to see the page the way a visitor actually would. Block them and Google's working from an unstyled, unscripted version that can look nothing like reality, which throws off content understanding, mobile usability checks, and Core Web Vitals detection all at once.

What happens if robots.txt has a syntax error?

Google's crawler is fairly forgiving about minor syntax issues, generally skipping a line it doesn't understand rather than failing outright. The bigger risk is the file returning a server error. When robots.txt returns a 5xx, Google temporarily treats the entire site as disallowed while it keeps retrying the fetch, which is a serious outcome for what's usually just a flaky server hiccup.
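
A simplified summary of how the robots.txt response status maps to crawl behavior, based on Google's published handling as we understand it. The crawl_policy function and its return strings are made up for this example, and other crawlers may behave differently.

```python
def crawl_policy(status: int) -> str:
    """Summarize how a robots.txt HTTP status is interpreted (simplified)."""
    if 200 <= status < 300:
        return "parse the rules in the file"
    if 300 <= status < 400:
        return "follow the redirect to find the file"
    if status == 429 or 500 <= status < 600:
        # Server errors and rate limiting: no definite answer from the server.
        return "temporarily treat the site as fully disallowed"
    if 400 <= status < 500:
        # Other client errors are treated as if no robots.txt exists.
        return "crawl with no restrictions"
    return "unknown status"


if __name__ == "__main__":
    for code in (200, 301, 404, 429, 503):
        print(code, crawl_policy(code))
```

The asymmetry is the useful part: a missing file (404) opens the site up, while a broken server (5xx) closes it off.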

Should the sitemap URL be declared in robots.txt?

Yes. A Sitemap: line gives any crawler direct discovery without relying on a Search Console submission, which matters most for crawlers other than Google that never had access to that submission in the first place. Multiple sitemaps just mean multiple Sitemap: lines.
