Using robots.txt to block indexing isn't always right, and how to fix it with Orchard CMS

Friday, September 14, 2018 google webmaster tools, orchard, orchard 1.10.x, seo

Look at all these pages Google Webmaster Tools was whinging about!

Since Google started revamping their webmaster tools recently it's made seeing some of the issues they consider my site to have more apparent. One of these is "Indexed, though blocked by robots.txt" which you'd think would be a contradiction, which is why I shouldn't be allowed to play around with SEO.

When I first setup Orchard as my blog software I created a robots.txt file with the following content:

User-agent: *
Disallow: /Admin/
Disallow: /Users/
Disallow: /Media/Default/Blog/Sage_The_Oracle_Instruction_Manual.pdf

This was to cover my supposition that I should tell search engines not to crawl links that went to pages in the admin area (/Admin) or for logging in (/User) which seemed quite logical at the time.

Thanks to the new webmaster tools UI making issues more visible, and providing a link through to some information describing the issue, I now know that in this scenario (my emphasis):

Indexed, though blocked by robots.txt: The page was indexed, despite being blocked by robots.txt (Google always respects robots.txt, but this doesn't help if someone else links to it). This is marked as a warning because we're not sure if you intended to block the page from search results. If you do want to block this page, robots.txt is not the correct mechanism to avoid being indexed. To avoid being indexed you should either use 'noindex' or prohibit anonymous access to the page using auth. You can use the robots.txt tester to determine which rule is blocking this page. Because of the robots.txt, any snippet shown for the page will probably be sub-optimal. If you do not want to block this page, update your robots.txt file to unblock your page.

So, it appears that the correct solution to preventing pages in the /Users area from being crawled, indexed and displayed in search results is to block search indexing with 'noindex', either by returning an HTTP response header for those pages or by adding a <meta name="robots" content="noindex"> tag to the page.

Fixing it with Orchard

As my blog runs on Orchard, sending an HTTP response header isn't as simple as adding it to the web.config (by hand or via IIS Manager) as there's no virtual directory to do it in, there's also the slight complexity that I need to work out where in Orchard is adding a robots meta tag saying:

<meta name="robots" content="index, follow, archive">

My first go-to was to use shape tracing to have a look through the page and see if there was a shape that was responsible. Unfortunately this didn't show up a shape that was responsible for generating this meta tag that I could then override or otherwise tweak. Next up was to grab a copy of the source for Orchard from GitHub and search it for 'robots' to see if there was a view, or component, that sets this. Apart from finding an intriguing reference in Modules\Orchard.Resources\Scripts\angular.js to $scope.names = ['pizza', 'unicorns', 'robots']; there were a lot of other matches in TinyMCE but none of them appeared to be relevant to the meta tag, in fact they were all in JavaScript so unlikely to be the culprit.

That means that the only place left to look was any modules, or themes, I have that bespoke my instance of Orchard. Running the same search against the PJS.Bootstrap theme turned up several hits, one specifically in Document.cshtml which looks like it's the "master" view that is used for generating every page of my site. Here's the bit of code in question:

    <meta charset="utf-8" />
    <meta name="robots" content="index, follow, archive" />
    <title>@Html.Title(title, siteName)</title>

    @Display(Model.Head)

And here's the equivalent part of \Orchard.Web\Core\Shapes\Views\Document.cshtml which is what I assume is being overriden by that view:

    <meta charset="utf-8" />
    <title>@Html.Title(title, siteName)</title> 
    @Display(Model.Head)

It looks like the meta tag is being added to the page, in every page, by my themes base theme (I've put all my theme customisations in a separate theme that uses PJS.Bootstrap as a starting point) and not in a way that's readily customisable. That only really leaves me with one option, pulling that view into my theme and customising it there. Because the author of the base theme appears to have abandoned it and development of Orchard 1x has all but stopped in favour of the .NET Core based Orchard Core there's relatively little risk of this causing me problems in the future so that's what I've elected to do.

A quick and dirty tweak to replace the <meta name="robots" content="index, follow, archive" /> with:

@if (WorkContext.HttpContext.Request.Url.PathAndQuery.StartsWith("/Users"))
{
    <meta name="robots" content="noindex" />
}
else
{
    <meta name="robots" content="index, follow, archive" />
}

Seems to do the trick to eliminate the instruction to search engines to index the pages in question. I've also removed the robots.txt entry and now just have to wait and see what Google has to make of it!

Using robots.txt to block indexing isn't always right, and how to fix it with Orchard CMS

Fixing it with Orchard

About Rob

No Comments