Fighting SEMrushBot

Dandello

Member
I've been fighting a battle with SEMrushBot creating hundreds of over usage problems since December 4th. I never had a problem with it before. (It's possible it only found my site on December 4th, but I have my doubts.)

SEMrushBot hits a 403 error from an IP block while trying to directly access a stored URL (a URL with a forbidden Query String that would throw a 403 even if the IP wasn't already blocked). But it still (apparently) triggers processes in the executable it's trying to get to. These processes don't shut down like they're supposed to - hence the over usage problem.
 
Hi Dandello,

Where are you blocking the offending IP addresses? Normally when an IP is blocked it can't reach anything, let alone cause Apache to throw a 403. KH has been supplying its VPSs with CSF for a while now, so I'd guess you have it on yours. To manually block an IP in CSF (and iptables) from SSH, type 'csf -d ###.###.###.###' and that will block the IP. If you have a CIDR range to block, you can edit the deny file at /etc/csf/csf.deny (be sure to restart CSF afterwards with 'csf -r'). There's also an interface in WHM to block IP addresses in CSF that's pretty straightforward; I just always use SSH.
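For example (the addresses below are placeholders - use the bot's actual IPs):

    # block a single IP and note the reason
    csf -d 203.0.113.45 "SEMrushBot over usage"

    # or append a whole CIDR range to the deny file, then reload CSF
    echo "203.0.113.0/24 # SEMrushBot range" >> /etc/csf/csf.deny
    csf -r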

Hope that helps!

Dan
 
Well, obviously they didn't get the message, because I have 120+ over usage notices in my inbox and have gotten a CPU abuse warning.
I've been blocking the IP addresses in .htaccess.
 
Comparing the error log, the over usage notices, and the visitors log, what I see is that Semrushbot gets a 403 on the blocked IP address, BUT the parent PID just keeps going and going and going - sometimes for 60+ minutes.

I've asked support to create a cron job for me that will stop any process calling the executable in question after 30 minutes.
The Error Log should tell me if I've inadvertently done something to get the CGI timeout.
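The general idea is something like this (YaBB.pl and the 30-minute cutoff are just placeholders, not the actual script support will write, and it assumes a ps that supports etimes):

    # cron entry: every 5 minutes, kill any forum process that's been running longer than 30 minutes
    */5 * * * * ps -eo pid,etimes,args | awk '$2 > 1800 && /YaBB\.pl/ {print $1}' | xargs -r kill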
 
SEMrushBot hits a 403 error from an IP block while trying to directly access a stored URL (a URL with a forbidden Query String that would throw a 403 even if the IP wasn't already blocked).
A direct IP block in htaccess would just drop the process that was invoked. An IP block in the CSF firewall would disallow access entirely.
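For reference, the kind of entry the forum is writing probably looks something like the first block below, and Apache still has to accept the connection and serve the 403 for it; the second block is the Apache 2.4 equivalent (the IP is a placeholder):

    # Apache 2.2 style (mod_access_compat)
    Order Allow,Deny
    Allow from all
    Deny from 203.0.113.45

    # Apache 2.4 style
    <RequireAll>
        Require all granted
        Require not ip 203.0.113.45
    </RequireAll>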
Well, obviously they didn't get the message
So from that I take it you tried the robots.txt file and they ignored it. That isn't the same as .htaccess.
the parent PID just keeps going and going and going - sometimes for 60+ minutes.
I've asked support to create a cron job for me that will stop any process calling the executable in question
Which executable is running that long?
 
It's a heavily modded version of the YaBB forum, which is written in Perl. At least with the cron job in place I can keep an eye on the error log and figure out exactly what other things are going on that might be causing problems, without causing CPU issues.

Yes, I know robots.txt isn't the same as .htaccess. But here's what happens in YaBB: if a 'guest' creates 3 errors too close together to have been done by a human, that IP address gets written to the .htaccess file. Semrushbot got IP-blocked quite a while back, but that same IP address still shows up in the Apache error log as a 403, with a matching PID in the excess usage warning.

Lots of other IP addresses get 403 in the Apache error log, but those don't get excess usage warnings. Semrushbot seems to be the one that just won't stop.

I did find more on blocking Semrushbot in the robots.txt so now we wait again for its IP address to show up in the error log.
 
Ah, I think I see the issue then. The forum is watching for this, and if it sees the bot it then triggers the .htaccess deny, so the process isn't told to quit because it's still waiting for a signal that the connection is done. Basically it's happening too late, since the damage is already done. If you're sure it's that bot (or more than one) and it presents itself as that bot, you might consider the approach below instead of waiting for the forum to see it.

https://www.inmotionhosting.com/sup...-unwanted-users-from-your-site-using-htaccess
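Something along these lines near the top of the .htaccess is the usual way to do it - the request is refused before the CGI script is ever invoked (the user-agent pattern is an assumption based on how the bot identifies itself):

    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} SemrushBot [NC]
    RewriteRule .* - [F,L]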
 
I would expect the problem to show up when the bot (or whatever it is) first gets its IP address written to the .htaccess. The other processes it has going might not get stopped properly.
It does turn out that Semrushbot has two identities, and blocking one identity in robots.txt doesn't block the other.
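For anyone else fighting it, robots.txt ends up needing an entry per user-agent token, something like the following (the exact tokens should be checked against Semrush's own documentation):

    User-agent: SemrushBot
    Disallow: /

    User-agent: SemrushBot-SA
    Disallow: /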
 
I think the main issue is that the forum uses Perl. It loads an instance of the script(s) into memory (essentially compiling it as a temporary program), then processes things. It can and will generate as many instances as it needs to handle requests, and killing some connections for the bots isn't clearing that instance from memory. Anything you can do to eliminate the bad ones before they reach the forum script(s) would likely help.

I wonder if this will help in any way?
http://search.cpan.org/~tobyink/Time-Limit-0.003/lib/Time/Limit.pm
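If it works the way the documentation describes, usage is just a pragma-style import near the top of the main script, something like this (the 1800-second limit is only an example):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # kill this process if it's still running after 1800 seconds
    use Time::Limit '1800';

    # ... rest of the forum script as normal ...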
 
Thanks WBW. I'll test that out.
Another problem that's cropped up - some of these bots just don't take '404' for an answer. That's not a Perl problem, even if the file in question is missing from the cgi-bin.
 
Even though your site may be producing 403/404 errors, it may still require your scripts to be executed to produce those error pages. Bots may continue to scan 403/404'ed pages to see if they return to normal. If you don't care about that bot at all, I'd just collect the IPs and add them to CSF, as WBW recommended, to block them completely. That will stop those offending IPs from executing those scripts. If your robots.txt is configured correctly and that bot isn't adhering to it, it's a bad bot to begin with and should be blocked completely.
 
Semrush is one of those bots that, when it hits websites, hits them hard. It doesn't 'trickle' in at a couple hundred requests per hour, or even a couple thousand.

In a span of a few hours, you can see over 10,000 requests from Semrush alone.

The downside to blocking Semrush is that they even advertise that an .htaccess IP block won't work against them.

--
Please do not try to block SEMrushBot via IP in .htaccess as we do not use any consecutive IP blocks.
--

Semrush does adhere to a proper robots.txt, but they state it can take up to two weeks before changes are noticed.

--
Please note that there might be a delay up to two weeks before SEMrushBot discovers the changes you made to robots.txt.
--

Semrush is considered a 'good' bot because of the reason it collects data; you can try blocking the IPs listed here, but it's not all of them.

--
https://www.distilnetworks.com/bot-directory/bot/semrush-bot/
--

If you do block Semrush, you want to block based on user-agent and not IP.
 
The IPs get written to the .htaccess automatically when the 'guest' triggers certain conditions, and 99% of the time that works fine.

I've already got Semrushbot and a couple others blocked by user-agent.

Another interesting and possibly related issue cropped up while trying to get a handle on this - when the script had a CGI timeout (and that timeout was documented in the error log), Apache didn't kill the process - it just sat there using resources until the cron script one of the support techs wrote for me (temporary, for dev only) killed it at 1800 seconds (just long enough to send me an over usage note so the PID could be traced).

Now it's a matter of waiting until something triggers a timeout so I can figure out how to reproduce it and why Apache didn't kill the process.
 
Hi, I saw this very good post but I still can't block it. Is there some more specific way to block them? I've seen more than one related article about this in Spanish and English forums. I also used a plugin, "spiderblocker.1.3.1", but it does not block it. My site is on WordPress, recently installed with the latest version.
 