Phantom “Noindex” Tag – SiteGround Anti-Bot CAPTCHA Causes Google to De-Index Websites

If you are having issues with your website getting deindexed by Google due to a phantom META robots noindex tag that you can’t seem to figure out – well then this blog post might just help you fix the issue! 

This is especially likely to apply if you use SiteGround for hosting along with Cloudflare as a CDN (or for DNS), and/or if you use Ezoic to run ads on your site.

Background

I noticed one of my websites was really struggling lately with its traffic and rankings.  At first, I assumed it was a victim of the latest Google “Helpful Content” update.  But then by chance I also happened to notice that the site’s homepage was not showing up in Google at all.

Next I consulted Google Search Console and noticed that the number of pages indexed in the past few weeks had tanked, and the number of pages crawled but marked “not indexed” had skyrocketed.  I then input the homepage URL into the search box to “Inspect any URL” and sure enough, GSC confirmed that the page was not indexed.

Yikes!  Was this a manual penalty?  Or an algorithmic slap related to the Helpful Content update?  Clearly I’ve got a major SEO problem – my site’s homepage is not indexed.

The Investigation – why are my pages getting deindexed?

There were two specific clues that I noticed that led me to the solution. 

Firstly, when I inspected the URL in GSC it was telling me specifically that the page was not indexed due to a robots noindex command, and thus Google could not access that page.  Naturally I checked the page’s source code but I could not find a noindex tag there.  In fact, I found a specific “index” tag (i.e. “doindex”) in the code that was put there by the RankMath plugin we use.  So not only did we lack a noindex tag, but we were specifically telling Google to index this page.
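If you would rather script this check than eyeball the page source, here is a minimal sketch in Python.  It only looks at HTML you hand it (plus an optional X-Robots-Tag header value) and it is not something I ran during the investigation.  The class and function names are made up for illustration, and it ignores per-crawler header prefixes such as “googlebot: noindex”.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> and <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            content = attr.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",") if d.strip()]


def robots_directives(html):
    """Return every robots directive found in the HTML, in document order."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


def is_noindexed(html, x_robots_tag=""):
    """True if a meta tag or the X-Robots-Tag header value asks for noindex."""
    header = [d.strip().lower() for d in x_robots_tag.split(",")]
    found = robots_directives(html) + header
    # "none" is shorthand for "noindex, nofollow".
    return "noindex" in found or "none" in found
```

Feeding it the source of a normal page versus the source of whatever Googlebot was actually served should make the difference obvious.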

I also checked the robots.txt file and again no issues there.  So why would GSC tell me I was blocking that page from being indexed?
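The robots.txt check can be scripted too.  This is a rough sketch using Python’s standard-library parser.  The sample rules below are hypothetical, not my site’s actual file, and Python’s parser does not resolve conflicting Allow/Disallow rules exactly the way Google does, so treat the answer as a sanity check only.

```python
from urllib.robotparser import RobotFileParser


def googlebot_allowed(robots_txt, url, agent="Googlebot"):
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """User-agent: *
Disallow: /wp-admin/
"""
```

With those sample rules, the homepage is allowed and only /wp-admin/ is blocked, which is the “no issues there” result described above.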

The second clue was when I viewed the long list of pages not indexed within GSC.  You can specifically view pages by error, and in this case it was the obvious “Excluded by noindex tag” section. There to my horror I realized that this issue wasn’t just affecting the homepage but most pages on the site.  Yet oddly not all – some pages were indeed still indexed and generating traffic.  But I also saw many URLs that looked something like this:

https://mysite.com/.well-known/sgcaptcha/?r=/page-url-here&y=ipc:xxx.xx.xx.xxx:xxxxxxxxxx.xxx

I’ve anonymized this URL, but the /.well-known/sgcaptcha/ part looked odd to me and didn’t relate to anything on our site.  “Captcha” stood out, and then I figured (correctly) that “sg” was short for SiteGround.
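If you export the “Excluded by noindex tag” list from GSC, a few lines of Python can pick out these captcha URLs and recover the real page path from the “r” parameter.  This is only a sketch: the function names are mine, and the URL layout is assumed from the anonymized example above.

```python
from urllib.parse import parse_qs, urlsplit

CAPTCHA_PATH = "/.well-known/sgcaptcha/"


def original_path(url):
    """Return the page path carried in the `r` parameter of a captcha URL, else None."""
    parts = urlsplit(url)
    if not parts.path.startswith(CAPTCHA_PATH):
        return None
    values = parse_qs(parts.query).get("r")
    return values[0] if values else None


def affected_paths(urls):
    """Return the sorted, de-duplicated page paths hidden behind captcha URLs."""
    return sorted({path for path in map(original_path, urls) if path})
```

Running affected_paths over the exported list gives a quick inventory of which pages Googlebot was bounced away from.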

SiteGround’s “Anti-Bot AI System” causing our pages to get deindexed

Sure enough, we use SiteGround as a hosting company for several of our WordPress websites.  In general, they are a great host.  I did a little digging though and learned that a few years back SiteGround released an Anti-Bot AI system to block brute-force attacks on their servers.  It is designed to detect unusual traffic to the site, run a check, and in some cases serve the user a CAPTCHA.

Now thinking back, I do recall that a couple of months ago I noticed that sometimes when I would visit the site, I would see a little screen served from SiteGround saying they were verifying that I was a real human.  The screen would generally only flash for a second, maybe even less than that, and then redirect me to the website.  I didn’t love it, but figured it was a non-issue.  That was wrong.  It was an issue.  A big issue!

So now I have a major indexing issue that GSC blames on a META noindex tag, and I can also see that Google thinks /.well-known/sgcaptcha/ is part of my site.  At this point I’m thinking it’s pretty likely that this captcha page is the culprit.  So I want to go view that page’s code.

If you try to load that page in a browser it redirects very quickly, so you don’t have a chance to view the source code.  I’m sure there is some way around this if you are a developer, but I am not.  So instead a few Google searches led me to this thread on Slickstack, which was literally the only post on the internet that I could find directly about this issue.  Hence me writing this super long rambling post about it.
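For the developers reading, one possible way around the quick redirect is to request the page with a plain HTTP client, which runs no JavaScript and can be told not to follow redirects.  The sketch below uses Python’s standard library.  The names are hypothetical, I have not verified how the captcha page actually redirects, and there is no guarantee the anti-bot system will serve the same page to this client that it served to Googlebot.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following 3xx responses so the first response is what we see."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def fetch_raw(url, user_agent="Mozilla/5.0", timeout=10):
    """Return (status, headers, body) for a single request, with no redirect followed."""
    opener = urllib.request.build_opener(NoRedirect)
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with opener.open(request, timeout=timeout) as response:
            body = response.read().decode("utf-8", "replace")
            return response.status, dict(response.headers), body
    except urllib.error.HTTPError as err:
        # With redirects disabled, 3xx responses (and 4xx/5xx) are raised as HTTPError.
        return err.code, dict(err.headers), err.read().decode("utf-8", "replace")
```

Something like status, headers, body = fetch_raw("https://mysite.com/") then lets you read the body at leisure and search it for a robots meta tag.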

Thanks to “Madison” on Slickstack, who posted a copy of the code used in the /.well-known/sgcaptcha/ file.  Sure enough, the SiteGround anti-bot system file HAS A META NOINDEX TAG IN IT.

So apparently this is what’s happening…  when Googlebot goes to fetch the URL, it is triggering the anti-bot system from SiteGround (more on why shortly).  The SiteGround server serves Googlebot a file that has a META noindex, nofollow tag in it.  But Google interprets this as the actual page it was trying to view and thus Googlebot obeys and does not index the page or follow the links.

To be clear, this issue was affecting my entire site – not just my homepage.  However, because Google crawls different pages at different times, my homepage (and other top pages) were the first to be dropped from the index, since they were re-crawled soon after the issue started happening.  Other less-popular pages took longer to get re-crawled and thus stayed in Google’s index longer.

Ironically, the whole thing actually does sort of what it’s supposed to do – in that it blocks the bots.  I’m just pretty upset with SiteGround that their system couldn’t identify Googlebot as a good bot (vs. a malicious brute-force attack bot) and allow Googlebot in.  Kind of a big deal, right?

SiteGround + Cloudflare and/or Ezoic?

Well, in fairness, I think most of the time SiteGround’s system does allow Googlebot to crawl the sites they host and does not deliver the captcha page with the noindex tag.  Otherwise there would be a great many posts on the internet about this issue.  Rather, it seems to happen only when a site is hosted on SiteGround and also uses Cloudflare.  Or possibly if you use SiteGround plus Ezoic.  I honestly don’t know which of those is the issue, or what combination thereof might be causing it.  The SiteGround support team seemed to think it was Ezoic.

As to why this combination causes this issue?  Heck I have no idea other than the way the traffic is routed from Cloudflare/Ezoic seems to trip SiteGround’s filter as brute-force bot traffic.  But all of that is more technical than I can help with. 

So of course once we found this problem and fixed it, I immediately thought about all of our other sites and realized that the issue was indeed affecting a second site too!  And all this time I thought it was the Google algorithm that was hating us.  Nope, we’re telling their bot not to index our sites!

The Solution – disable SiteGround’s Anti-Bot System to fix the META noindex issue

To fix the issue, we reached out to SiteGround support and had them disable the anti-bot AI system on our site’s hosting account.  Once I cleared the cache (on the SG server as well as in Ezoic), sure enough the issue was fixed.  I then submitted the homepage and 4 others in Google Search Console to get them re-crawled.  Success!  A green check mark showing the URL is on Google and is now indeed indexed.

So that is good.  However, I won’t sleep well knowing there is a setting that was manually disabled that, if re-enabled, will kill our indexing and search traffic.  So I’ve either got to part ways with SiteGround or with Cloudflare, it seems.  Or Ezoic.  We’ll investigate a bit more to narrow down if it’s Cloudflare or Ezoic and then decide if we want to keep that one or keep SiteGround as the host.  TBD.

FYI, here’s a quick timeline of pages indexed before and after implementing the fix.  For reference, this is a site with a DA in the mid-30’s and typically would get 1,000 to 2,000 visits a day before this issue.  I honestly don’t even know how long we’ve had this issue b/c I don’t have a record of pages indexed previously and GSC only shows a 3 month view.  My guess is it was around that 3-month mark or maybe just a bit prior.

12/27/23 – 270 pages indexed
01/27/24 – 227 pages indexed
02/27/24 – 96 pages indexed
03/27/24 – 62 pages indexed
03/28/24 – Fix Implemented.  Immediately asked Google to manually re-crawl the homepage and 4 other top pages.
03/29/24 – 296 pages indexed
04/03/24 – 401 pages indexed
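To put the timeline in percentage terms, here is a quick bit of arithmetic in Python.  The counts are copied straight from the list above, and the variable names are just for illustration.

```python
from datetime import date

# Indexed-page counts copied from the timeline above.
timeline = [
    (date(2023, 12, 27), 270),
    (date(2024, 1, 27), 227),
    (date(2024, 2, 27), 96),
    (date(2024, 3, 27), 62),
    (date(2024, 3, 29), 296),
    (date(2024, 4, 3), 401),
]


def percent_change(before, after):
    """Percentage change from `before` to `after` (negative means a drop)."""
    return (after - before) / before * 100


start = timeline[0][1]
low = min(count for _, count in timeline)
latest = timeline[-1][1]

drop = percent_change(start, low)         # change from 270 to the low point of 62
recovery = percent_change(start, latest)  # change from 270 to 401 after the fix

print(f"Start to low point: {drop:.0f}%")
print(f"Start to latest:    {recovery:+.0f}%")
```

In other words, the indexed-page count fell by roughly 77% between 12/27/23 and 03/27/24, and within a week of the fix it was well above the first figure in the timeline.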

P.S. I also recently discovered that this same issue was the cause of a Google Analytics issue that was misreporting organic traffic as direct traffic. Go check out that post for more, but the takeaway is that the same solution here (disabling the SG intermediary anti-bot page) fixed both problems.

About the Author

Jon Payne

Jon is the founder and lead consultant of Vocational Media Group. He works directly with brands to increase their sales on Amazon, while also tightly controlling costs and protecting margins. Jon also practices what he preaches, by building, acquiring and operating his own private label brands on the Amazon Marketplace.

Vocational Media Group is a digital marketing agency located just outside of Charlotte, NC. We specialize in full-service SEO for businesses looking to improve their presence in Google, as well as Amazon FBA channel management to include Amazon SEO and PPC campaign management.
