8

My WordPress blog has been completely cloned. The clone site updates in real time as my blog changes. I am surprised that someone can actually do that.

What should I do to keep this from harming my search engine ranking? Is there any way to tell Google not to index that site?

Stephen Ostermiller
Tanvir Hasan
  • I see this now seems to have been "fixed" - the cloned site is no longer "cloning". How did you achieve this in the end? – MrWhite Mar 31 '15 at 23:38
  • 1
    @w3d After two weeks and several conversations with Amazon Hosting, they decided to shut down that cloned site. Thanks everyone. – Tanvir Hasan Apr 04 '15 at 20:23
  • Note that https://www.dmca.com and https://www.google.com/webmasters/tools/dmca-dashboard are different services when looking to claim infringement. – Sunrise Joz Jan 28 '16 at 16:44

3 Answers

10

They're simply loading your site via a server-side script. All you need to do is block their server's IP address via .htaccess. Simply open up your server's access logs, open the cloned page on their site, then view your log for the new entry and you'll have their IP address.
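To make the log check above less manual, here is a minimal sketch in Python. It assumes your access log uses the common/combined log format; the sample entries, the IP addresses, and the path /some-unique-post/ are made up for illustration. The idea is to request a rarely visited page through the cloned site and then see which client IP asked your server for that same path.

```python
import re
from collections import Counter

# Matches the start of a common/combined log format line:
# client IP, identity, user, [timestamp], "METHOD path ..."
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)')


def ips_requesting(log_lines, path):
    """Count how often each client IP requested exactly `path`."""
    hits = Counter()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match and match.group(4) == path:
            hits[match.group(1)] += 1
    return hits


# Hypothetical sample entries; a real run would read the server's log file.
sample_log = [
    '203.0.113.7 - - [22/Mar/2015:10:04:01 +0000] "GET /some-unique-post/ HTTP/1.1" 200 5120',
    '198.51.100.23 - - [22/Mar/2015:10:04:02 +0000] "GET /other-page/ HTTP/1.1" 200 2048',
    '203.0.113.7 - - [22/Mar/2015:10:04:05 +0000] "GET /some-unique-post/ HTTP/1.1" 200 5120',
]

suspects = ips_requesting(sample_log, "/some-unique-post/")
for ip, count in suspects.most_common():
    print(ip, count)
```

An IP that shows up every time you load that page on the clone, and that is not your own, is the likely candidate to block.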

It also wouldn't hurt to submit a DMCA request to Google, but this will not really be necessary as that content will instantly disappear once you block their IP address.

John Conde
  • 2
    I am going to double-down on the suggestion to make a DMCA request to Google. We are seeing various forms of this lately and I am just not sure of what the payoff would be. I would, however, suggest not blocking them for a period while Google does its thing, and then blocking them, although you may not have to once Google de-lists them. I am just suggesting that if you file a DMCA complaint with Google, give them a period to investigate before blocking. Otherwise just block them right away. – closetnoc Mar 21 '15 at 23:13
  • 1
    Hi John Conde, I have tried to block their IP address via the .htaccess file using this code: "Order Deny,Allow" followed by "Deny from [that IP address]". But the clone site is still updating in real time with mine. Is that the right code to block that IP? – Tanvir Hasan Mar 22 '15 at 10:04
  • 2
    @TanvirHasan That is the right idea, providing you have the correct IP address. Is that IP address still appearing in your access log when you visit the "cloned site"? – MrWhite Mar 22 '15 at 18:25
  • My hosting provider put those commands in the .htaccess file, and they are the ones who got that IP address from the log. But it is not working. – Tanvir Hasan Mar 23 '15 at 01:04
  • Did this ever get resolved?? – closetnoc Mar 24 '15 at 00:31
  • I do not know WP, but is it possible to make all links on your site fully qualified instead of relative? Meaning www.example.com/products/ instead of just /products/. This might have interesting results. Also, I see there is advertising. Is this advertising using your code?? I also saw cloudFlare in the code- not knowing enough about this, are you using a CDN? There may be some interesting solutions/hacks that people can suggest. Who knows?? If they are getting your site in real-time, my question would be how this is actually done? I was not able to find any good clues. – closetnoc Mar 24 '15 at 00:44
  • @closetnoc No success yet. I have tried many things to stop it. Someone told me that Cloudflare has an anti-scraping program which can stop cloning, but that didn't work either. I don't know how someone can do this. I blocked a few IP addresses from the server log in .htaccess, but that didn't work either. The clone site is updating in real time with ad codes and minor changes. I have filed a complaint with the clone site's hosting company but have had no response yet. I filed a DMCA complaint with Google, but I don't think that is going to work because the clone site blocks crawling in its robots.txt file. I am frustrated now... – Tanvir Hasan Mar 24 '15 at 05:13
  • Filing a DMCA complaint will help. Google is just a front man who will investigate and remove all sites from its index. The complaint goes to Chilling Effects, which is a clearing house for these complaints. From there, it will be removed by all search engines and other entities that participate. You have a legal recourse and can get past the private registration in the U.S. You will want to talk to eNom, who is the registrar, and Amazon, who is the host. Do not let them blow you off. Refer to your DMCA complaint. I will try and snoop around some more. Think about resetting any passwords. – closetnoc Mar 24 '15 at 05:38
  • John- I am not doubting your expertise- but how do you know this is a JS technique?? I have wget, but that does not really tell me anything. Did you use another tool?? Curl perhaps? Or does this appear to be a JS technique. Just a curiosity question because this is still happening. – closetnoc Mar 24 '15 at 15:39
  • This isn't done with JS. It's done with a server-side language like Java or PHP. Just a few lines of code to read in the HTML and spit it out to the screen. CURL is definitely one such possible tool. – John Conde Mar 24 '15 at 15:42
  • The clone site's hosting provider, Amazon, contacted me and said that they are investigating this and will take action soon. So that's good news. But I can't stop someone from doing that again unless I know how they did it. This time I was lucky enough to spot the clone site early. So if anyone has any idea how to find out how they did it, please let me know. – Tanvir Hasan Mar 24 '15 at 17:22
  • You could try outputting the client's IP address (as seen from your server) to the page (ie. $_SERVER['REMOTE_ADDR'] in PHP) - somewhere very discreet - normally this would be the IP address of the end user, however, when viewed on the "cloned site" I suspect it would be the IP address of the server that is cloning your site (acting as a proxy and passing the response on). This is the IP address you should be blocking. However, I wonder... it's technically possible that they are in control of many servers, many IP addresses, blocking one wouldn't stop them, you'd need to block them all! – MrWhite Mar 26 '15 at 16:30
  • ... make sure the IP address you see isn't your own! (Although if it is, I'll eat my hat!) – MrWhite Mar 26 '15 at 16:33
  • Does anybody have a clue about WHY they are doing this? What could possibly be their payoff? The cloned websites have a lot of spammy content; for instance, almost every link anchor is substituted with viagra/porn/prostitution keywords. This has happened to me several times now. Every time I block their IP via NGINX, but after a few months another one arises... – Darme Mar 17 '17 at 14:52
5

(In addition to @John's answer.)

Is there any way to tell Google not to index that site?

Rather curious that whilst they appear to have cloned everything (including your XML sitemaps*1), they have not cloned your robots.txt file. In fact, the robots.txt on that site actively blocks crawling of everything! So there would not seem to be anything to do in this respect. Doing a site search on that domain returns just the bare domain and a notice stating that it's blocked by robots.txt.

(Rather curious what their intention would be in doing this? You could perhaps just assume that they made a mistake with robots.txt, and that may be so, but this looks more like a deliberate exception to me?)
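If you want to confirm the robots.txt observation yourself, the following is a rough sketch using Python's standard urllib.robotparser. The two robots.txt bodies and the example URLs are assumptions for illustration (I have not copied the actual files from either site); in practice you would paste in the text served at /robots.txt on each domain.

```python
from urllib.robotparser import RobotFileParser


def is_crawl_blocked(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given robots.txt text forbids `user_agent` from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# Assumed example of a robots.txt that blocks everything.
blocking_robots = """User-agent: *
Disallow: /
"""

# Assumed example of a robots.txt that allows everything.
open_robots = """User-agent: *
Disallow:
"""

clone_blocked = is_crawl_blocked(blocking_robots, "http://clone.example/any-page/")
original_blocked = is_crawl_blocked(open_robots, "http://original.example/any-page/")
print(clone_blocked, original_blocked)
```

This only tells you what the file asks crawlers to do for the URL you test; it says nothing about whether the pages were already indexed before the block was put in place.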

Also, whilst your XML sitemaps are cloned, they aren't updating the URLs in them (as they are doing on the main site pages), so they are still pointing back to your site.

*1 Regarding the XML sitemap(s). On your site "sitemap.xml" is actually a redirect to "sitemap_index.xml" and the cloned site has actually cloned the redirect... which redirects back to your site! (Surely a mistake on their part.) "sitemap_index.xml" is just an index, linking to 4 other sitemaps. If any of these actual sitemaps are requested directly on the cloned site then they are correctly cloned and the URLs updated. However, I would have said that these sitemaps are unlikely to be found on the cloned site because of the initial redirect of "sitemap.xml". (?) Although if they did submit "sitemap_index.xml" directly then that would obviously get around the redirect.

MrWhite
  • 1
    I made a request to the sitemap just a few minutes ago and there is a 301 redirect from the spam site to the original site. – closetnoc Mar 24 '15 at 00:19
  • @closetnoc Ah yes! I missed that before. "sitemap.xml" is actually a redirect on the original site as well... it redirects to "sitemap_index.xml". The spam site appears to be cloning this redirect which sends the user back to the original site! If you request any of the 4 sitemaps listed in "sitemap_index.xml" directly on the spam site then the spam site correctly clones them, however, because of the initial redirect I would guess they will be difficult to find, unless they know to submit "sitemap_index.xml" instead of "sitemap.xml". I've updated the answer. Thanks. – MrWhite Mar 24 '15 at 08:44
3

If the site produces backlinks to you, it is important to use the Google Disavow tool; otherwise the algorithm will be working against you regardless.

https://www.google.com/webmasters/tools/disavow-links-main

Create a .txt file and add:

domain:thedamnsitethatcloned.com

then upload it to Google via Webmaster Tools.

Here are exactly the steps that I would take to resolve this issue. I know that a lot of webmasters face this issue. I have had this problem before and there does not seem to be a straight answer on Google (ironically) (which is why I want to help). Matt Cutts is the dude who you are supposed to listen to about these issues, but listening to him is like trying to win a game of chess against a supercomputer inside a burning house (no help to be found).

The short Cutts:

  1. Register with DMCA and put the badge on your website.
  2. Gather all copied content by pasting the first 60 words from your website into Google, and submit it via https://www.google.com/webmasters/tools/dmca-dashboard. DMCA requests will only accept permalinks.
  3. Disavow EVERY site which has copied content linking back to you. Do this on every page of your website.

My first answer was to disavow the domain, but I forgot to mention that you need to disavow:

  • www. AND
  • non www.

(Google counts them as two separate domains).
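If there are several copying domains, writing the file by hand gets tedious. Here is a small sketch in Python that builds the entries in the "domain:" format shown above, adding both the www and non-www form for each domain as this answer suggests. Whether both entries are strictly needed is this answer's claim and is not something the code checks; the function name and the output filename in the usage note are my own invention.

```python
def disavow_lines(domains):
    """Build 'domain:' entries for each domain, with and without the 'www.' prefix."""
    lines = []
    for domain in domains:
        host = domain.strip().lower()
        # Reduce to the bare form first so both variants are always produced.
        bare = host[4:] if host.startswith("www.") else host
        for candidate in (bare, "www." + bare):
            entry = "domain:" + candidate
            if entry not in lines:  # avoid duplicates, keep input order
                lines.append(entry)
    return lines


entries = disavow_lines(["thedamnsitethatcloned.com"])
disavow_text = "\n".join(entries) + "\n"
print(disavow_text, end="")
```

Saving disavow_text to a plain .txt file (say, a hypothetical disavow.txt) gives you something to upload through the disavow link above.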

John Conde
John