The reason robots.txt generally worked was that nobody was really trying to leverage it against bot operators. I’m not sure this won’t just kill robots.txt. Historically, search engines wanted to index stuff and websites wanted to be indexed. Their interests were aligned, so the convention worked. This no longer holds if things like the Google-Reddit partnership become common.
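To illustrate why this is only a convention, here is a rough sketch using Python's standard urllib.robotparser. The rules and the bot names (FavoredBot, OtherBot) are made up for illustration and are not any real site's robots.txt. The point is that the crawler itself runs the check, so nothing but goodwill enforces the answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only: one favored crawler is
# allowed everywhere, every other crawler is asked to stay out.
RULES = """\
User-agent: FavoredBot
Allow: /

User-agent: *
Disallow: /
"""


def allowed(user_agent: str, url: str) -> bool:
    """Return what the rules ask of this crawler for this URL."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for bot in ("FavoredBot", "OtherBot"):
        print(bot, allowed(bot, "https://example.com/r/some_thread"))
```

A crawler that never calls something like allowed(), or ignores what it returns, is not stopped by any of this.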
Reddit can also try to detect and block crawlers; robots.txt isn’t the only tool in their toolbox.
Microsoft, unlike most companies, does actually have a technical counter that Reddit probably cannot stop, if it comes to that and Microsoft wants to do a “hostile index” of Reddit.
Microsoft’s browser, Edge, is used by a bunch of people, and Microsoft could probably rig it up to send back enough of the content of the Reddit pages its users request to build an index. Reddit can’t stop that without blocking Edge users. I expect doing so would mean venturing into a lot of unexplored legal territory under the laws of many countries. It also wouldn’t be as good as Google’s (I assume real-time) access to the comments, but they’d get to them.
Browsers do report the referrer (the HTTP Referer header), which would permit Reddit to detect that a given user has arrived from Bing and block them:
In HTTP, “Referer” (a misspelling of “Referrer”[1]) is an optional HTTP header field that identifies the address of the web page (i.e., the URI or IRI), from which the resource has been requested. By checking the referrer, the server providing the new web page can see where the request originated.
In the most common situation, this means that when a user clicks a hyperlink in a web browser, causing the browser to send a request to the server holding the destination web page, the request may include the Referer field, which indicates the last page the user was on (the one where they clicked the link).
Web sites and web servers log the content of the received Referer field to identify the web page from which the user followed a link, for promotional or statistical purposes.[2] This entails a loss of privacy for the user and may introduce a security risk.[3] To mitigate security risks, browsers have been steadily reducing the amount of information sent in Referer. As of March 2021, Chrome,[4] Chromium-based Edge, Firefox,[5] and Safari[6] default to sending only the origin in cross-origin requests, stripping out everything but the domain name.
https://en.wikipedia.org/wiki/HTTP_referer
Reddit could block browsers whose referrer is bing.com, killing the ability of Bing to link to them. I don’t know if there’s a way for a linking site to ask a browser to withhold or forge the referrer. For Edge users – not all Bing users – Microsoft could modify the browser to do so, forcing Reddit to decide whether to block all Edge users or not.
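For what it's worth, the server-side check described here could be as small as the sketch below. This is only a guess at the shape of such a filter, not anything Reddit is known to run; the function name and the blocklist are assumptions for illustration.

```python
from typing import Optional
from urllib.parse import urlsplit

# Assumption for illustration: the hosts whose visitors would be turned away.
BLOCKED_REFERRER_HOSTS = {"bing.com"}


def is_blocked_referrer(referer_header: Optional[str]) -> bool:
    """Return True if the Referer header points at a blocked host."""
    if not referer_header:
        # The header is optional, so a missing one tells the server nothing.
        return False
    host = (urlsplit(referer_header).hostname or "").lower()
    # Match the host itself and any subdomain such as www.bing.com.
    return any(
        host == blocked or host.endswith("." + blocked)
        for blocked in BLOCKED_REFERRER_HOSTS
    )


if __name__ == "__main__":
    print(is_blocked_referrer("https://www.bing.com/"))
    print(is_blocked_referrer(None))
```

Since browsers now send only the origin on cross-origin requests by default, the host is about all such a check has to work with, and it sees nothing at all when the header is absent.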
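It is possible for the linking site to remove the Referer header, for example with a Referrer-Policy of no-referrer or rel="noreferrer" on its links.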
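A rough sketch of what that could look like on the linking site's side, assuming the site controls its own response headers and link markup. Both helper names are made up for illustration. Referrer-Policy: no-referrer and rel="noreferrer" are standard mechanisms, but whether any particular search engine uses them is not something this thread establishes.

```python
from html import escape


def no_referrer_headers() -> dict:
    """Response headers asking the browser to omit Referer on outgoing requests."""
    return {"Referrer-Policy": "no-referrer"}


def no_referrer_link(url: str, text: str) -> str:
    """Build a single link that asks the browser not to send Referer."""
    return '<a href="{}" rel="noreferrer">{}</a>'.format(
        escape(url, quote=True), escape(text)
    )


if __name__ == "__main__":
    print(no_referrer_headers())
    print(no_referrer_link("https://example.com/r/some_thread", "some thread"))
```

The header applies to the whole page; the rel attribute only affects the one link it is set on.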
They can try to block crawlers all they want.
They will not succeed without restricting access to Reddit to an unusable degree, since crawlers can be coded to imitate real users closely enough. Combine that with enough proxies and they can’t do jack shit.
Also, you could get around the Referer header quite easily via redirects (unless Reddit went ahead and used a whitelist for those, which again would be a very stupid decision) and a few other methods.