Title: | A 'robots.txt' Parser and 'Webbot'/'Spider'/'Crawler' Permissions Checker |
---|---|
Description: | Provides functions to download and parse 'robots.txt' files. Ultimately the package makes it easy to check if bots (spiders, crawler, scrapers, ...) are allowed to access specific resources on a domain. |
Authors: | Pedro Baltazar [ctb], Jordan Bradford [cre], Peter Meissner [aut], Kun Ren [aut, cph] (Author and copyright holder of list_merge.R.), Oliver Keys [ctb] (original release code review), Rich Fitz John [ctb] (original release code review) |
Maintainer: | Jordan Bradford |
License: | MIT + file LICENSE |
Version: | 0.7.15.9000 |
Built: | 2024-12-15 05:01:32 UTC |
Source: | https://github.com/ropensci/robotstxt |
Convert robotstxt_text to list
## S3 method for class 'robotstxt_text'
as.list(x, ...)
x |
class robotstxt_text object to be transformed into list |
... |
further arguments (inherited from |
Add http protocol if missing from URL
fix_url(url)
url |
a character string containing a single URL |
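The idea behind this helper can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the name `add_scheme_if_missing` and the regular expression used to detect a scheme are hypothetical choices made for this example.

```r
# Hypothetical helper (not part of robotstxt): prepend "http://" when a URL
# carries no scheme such as "http://" or "https://".
add_scheme_if_missing <- function(url) {
  stopifnot(is.character(url), length(url) == 1)
  # A scheme starts with a letter, continues with letters, digits, "+", "."
  # or "-", and is followed by "://".
  has_scheme <- grepl("^[A-Za-z][A-Za-z0-9+.-]*://", url)
  if (has_scheme) url else paste0("http://", url)
}

without_scheme <- add_scheme_if_missing("example.com/robots.txt")
with_scheme <- add_scheme_if_missing("https://example.com/robots.txt")
print(without_scheme)
print(with_scheme)
```

A URL that already has a scheme is returned unchanged, so the helper can be applied to mixed input without checking first.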
Download a robots.txt file
get_robotstxt( domain, warn = getOption("robotstxt_warn", TRUE), force = FALSE, user_agent = utils::sessionInfo()$R.version$version.string, ssl_verifypeer = c(1, 0), encoding = "UTF-8", verbose = FALSE, rt_request_handler = robotstxt::rt_request_handler, rt_robotstxt_http_getter = robotstxt::get_robotstxt_http_get, on_server_error = on_server_error_default, on_client_error = on_client_error_default, on_not_found = on_not_found_default, on_redirect = on_redirect_default, on_domain_change = on_domain_change_default, on_file_type_mismatch = on_file_type_mismatch_default, on_suspect_content = on_suspect_content_default )
domain |
domain from which to download robots.txt file |
warn |
warn about being unable to download domain/robots.txt because of HTTP response status 404 |
force |
if TRUE, the function re-downloads the robots.txt file instead of using possibly cached results |
user_agent |
HTTP user-agent string to be used to retrieve robots.txt file from domain |
ssl_verifypeer |
either 1 (default) or 0, if 0 it disables SSL peer verification, which might help with robots.txt file retrieval |
encoding |
Encoding of the robots.txt file. |
verbose |
make function print out more information |
rt_request_handler |
handler function that handles request according to the event handlers specified |
rt_robotstxt_http_getter |
function that executes HTTP request |
on_server_error |
request state handler for any 5xx status |
on_client_error |
request state handler for any 4xx HTTP status that is not 404 |
on_not_found |
request state handler for HTTP status 404 |
on_redirect |
request state handler for any 3xx HTTP status |
on_domain_change |
request state handler for any 3xx HTTP status where domain did change as well |
on_file_type_mismatch |
request state handler for content type other than 'text/plain' |
on_suspect_content |
request state handler for content that seems to be something else than a robots.txt file (usually a JSON, XML or HTML) |
Download multiple robotstxt files
get_robotstxts( domain, warn = TRUE, force = FALSE, user_agent = utils::sessionInfo()$R.version$version.string, ssl_verifypeer = c(1, 0), use_futures = FALSE, verbose = FALSE, rt_request_handler = robotstxt::rt_request_handler, rt_robotstxt_http_getter = robotstxt::get_robotstxt_http_get, on_server_error = on_server_error_default, on_client_error = on_client_error_default, on_not_found = on_not_found_default, on_redirect = on_redirect_default, on_domain_change = on_domain_change_default, on_file_type_mismatch = on_file_type_mismatch_default, on_suspect_content = on_suspect_content_default )
domain |
domain from which to download robots.txt file |
warn |
warn about being unable to download domain/robots.txt because of HTTP response status 404 |
force |
if TRUE, the function re-downloads the robots.txt file instead of using possibly cached results |
user_agent |
HTTP user-agent string to be used to retrieve robots.txt file from domain |
ssl_verifypeer |
either 1 (default) or 0, if 0 it disables SSL peer verification, which might help with robots.txt file retrieval |
use_futures |
Should future::future_lapply be used for possible parallel/async retrieval or not. Note: check out help pages and vignettes of package future on how to set up plans for future execution because the robotstxt package does not do it on its own. |
verbose |
make function print out more information |
rt_request_handler |
handler function that handles request according to the event handlers specified |
rt_robotstxt_http_getter |
function that executes HTTP request |
on_server_error |
request state handler for any 5xx status |
on_client_error |
request state handler for any 4xx HTTP status that is not 404 |
on_not_found |
request state handler for HTTP status 404 |
on_redirect |
request state handler for any 3xx HTTP status |
on_domain_change |
request state handler for any 3xx HTTP status where domain did change as well |
on_file_type_mismatch |
request state handler for content type other than 'text/plain' |
on_suspect_content |
request state handler for content that seems to be something else than a robots.txt file (usually a JSON, XML or HTML) |
Guess a domain from path
guess_domain(x)
x |
path aka URL from which to infer domain |
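To illustrate what inferring a domain from a path can look like, here is a rough base R sketch. It is an assumption-laden illustration rather than the package's code: the helper name `guess_host`, the regular expression, and the choice to return `NA` for paths without a host are all hypothetical.

```r
# Hypothetical helper (not part of robotstxt): extract the host part of a URL,
# returning NA when the input is a bare path without scheme and host.
guess_host <- function(x) {
  # Capture everything between "scheme://" and the next "/", ":", "?" or "#".
  hits <- regmatches(x, regexec("^[A-Za-z][A-Za-z0-9+.-]*://([^/:?#]+)", x))
  vapply(
    hits,
    function(hit) if (length(hit) == 2) hit[2] else NA_character_,
    character(1)
  )
}

hosts <- guess_host(c("https://example.com/a/b.html", "/a/b.html"))
print(hosts)
```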
Check if HTTP domain changed
http_domain_changed(response)
response |
an httr response object, e.g. from a call to httr::GET() |
logical of length 1 indicating whether or not any domain change happened during the HTTP request
Check if HTTP subdomain changed
http_subdomain_changed(response)
response |
an httr response object, e.g. from a call to httr::GET() |
logical of length 1 indicating whether or not any subdomain change happened during the HTTP request
Check if HTTP redirect occurred
http_was_redirected(response)
response |
an httr response object, e.g. from a call to httr::GET() |
logical of length 1 indicating whether or not any redirect happened during the HTTP request
Function that checks if file is valid / parsable robots.txt file
is_suspect_robotstxt(text)
text |
content of a robots.txt file provided as a character vector |
Validate if a file is valid / parsable robots.txt file
is_valid_robotstxt(text, check_strickt_ascii = FALSE)
text |
content of a robots.txt file provided as character vector |
check_strickt_ascii |
whether or not to check if content does adhere to the specification of RFC to use plain text aka ASCII |
Merge a number of named lists in sequential order
list_merge(...)
... |
named lists |
List merging is usually useful for merging program settings or configuration with multiple versions across time, or multiple administrative levels. For example, a program's settings may have an initial version in which most keys are defined and specified. In later versions, partial modifications are recorded. In this case, list merging can be used to merge all versions of the settings in release order. The result is a fully updated settings list with all later modifications applied.
Kun Ren
The function merges a number of lists in sequential order by modifyList, that is, the later list always modifies the former list to form a merged list, and the resulting list is then merged with the next list. The process is repeated until all lists in ... or list are exhausted.
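The sequential merge described above can be approximated in base R by folding `utils::modifyList()` over the lists with `Reduce()`. This is a minimal sketch of the idea, not the package's implementation; the settings names (`timeout`, `retries`, `verbose`, `proxy`) are made up for the example.

```r
# Three hypothetical versions of a settings list, in release order.
defaults <- list(timeout = 10, retries = 3, verbose = FALSE)
version2 <- list(retries = 5)
version3 <- list(verbose = TRUE, proxy = "none")

# Fold modifyList over the versions: each later list overrides or extends
# the result of merging all earlier ones.
merged <- Reduce(utils::modifyList, list(defaults, version2, version3))
str(merged)
```

Keys that appear only in later lists are appended, and keys that appear in several lists take the value from the last list that defines them.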
Return default value if NULL
null_to_default(x, d)
x |
value to check and return |
d |
value to return in case x is NULL |
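The pattern behind this helper is a one-line NULL check. The sketch below uses a hypothetical stand-in named `default_if_null` so that it runs without the package; the option names are made up for the example.

```r
# Hypothetical stand-in for the helper described above: return x unless it is
# NULL, in which case return the default d.
default_if_null <- function(x, d) {
  if (is.null(x)) d else x
}

opts <- list(verbose = TRUE)  # note: no "encoding" entry

encoding <- default_if_null(opts$encoding, "UTF-8")  # missing entry is NULL
verbose <- default_if_null(opts$verbose, FALSE)      # existing value is kept
print(encoding)
print(verbose)
```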
Parse a robots.txt file
parse_robotstxt(txt)
txt |
content of the robots.txt file |
a named list with useragents, comments, permissions, sitemap
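A short offline example may help to see the shape of the result. It assumes the robotstxt package is installed; the robots.txt content and the example.com sitemap URL are invented for illustration.

```r
library(robotstxt)

# A small, made-up robots.txt supplied as a single string.
txt <- paste(
  c(
    "User-agent: *",
    "Disallow: /private/",
    "Sitemap: https://example.com/sitemap.xml"
  ),
  collapse = "\n"
)

parsed <- parse_robotstxt(txt)

# Inspect which elements the named list contains, then look at the rules.
print(names(parsed))
print(parsed$permissions)
```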
Check if a bot has permissions to access page(s)
paths_allowed( paths = "/", domain = "auto", bot = "*", user_agent = utils::sessionInfo()$R.version$version.string, check_method = c("spiderbar"), warn = getOption("robotstxt_warn", TRUE), force = FALSE, ssl_verifypeer = c(1, 0), use_futures = TRUE, robotstxt_list = NULL, verbose = FALSE, rt_request_handler = robotstxt::rt_request_handler, rt_robotstxt_http_getter = robotstxt::get_robotstxt_http_get, on_server_error = on_server_error_default, on_client_error = on_client_error_default, on_not_found = on_not_found_default, on_redirect = on_redirect_default, on_domain_change = on_domain_change_default, on_file_type_mismatch = on_file_type_mismatch_default, on_suspect_content = on_suspect_content_default )
paths |
paths for which to check bot's permission, defaults to "/". Please note that path to a folder should end with a trailing slash ("/"). |
domain |
Domain for which paths should be checked. Defaults to "auto". If set to "auto" function will try to guess the domain by parsing the paths argument. Note however, that these are educated guesses which might utterly fail. To be on the safe side, provide appropriate domains manually. |
bot |
name of the bot, defaults to "*" |
user_agent |
HTTP user-agent string to be used to retrieve robots.txt file from domain |
check_method |
currently kept only for backward compatibility; do not use this parameter anymore, the function will simply use the default |
warn |
suppress warnings |
force |
if TRUE, the function re-downloads the robots.txt file instead of using possibly cached results |
ssl_verifypeer |
either 1 (default) or 0, if 0 it disables SSL peer verification, which might help with robots.txt file retrieval |
use_futures |
Should future::future_lapply be used for possible parallel/async retrieval or not. Note: check out help pages and vignettes of package future on how to set up plans for future execution because the robotstxt package does not do it on its own. |
robotstxt_list |
either NULL – the default – or a list of character vectors with one vector per path to check |
verbose |
make function print out more information |
rt_request_handler |
handler function that handles request according to the event handlers specified |
rt_robotstxt_http_getter |
function that executes HTTP request |
on_server_error |
request state handler for any 5xx status |
on_client_error |
request state handler for any 4xx HTTP status that is not 404 |
on_not_found |
request state handler for HTTP status 404 |
on_redirect |
request state handler for any 3xx HTTP status |
on_domain_change |
request state handler for any 3xx HTTP status where domain did change as well |
on_file_type_mismatch |
request state handler for content type other than 'text/plain' |
on_suspect_content |
request state handler for content that seems to be something else than a robots.txt file (usually a JSON, XML or HTML) |
Check if a spiderbar bot has permissions to access page(s)
paths_allowed_worker_spiderbar(domain, bot, paths, robotstxt_list)
domain |
Domain for which paths should be checked. Defaults to "auto". If set to "auto" function will try to guess the domain by parsing the paths argument. Note however, that these are educated guesses which might utterly fail. To be on the safe side, provide appropriate domains manually. |
bot |
name of the bot, defaults to "*" |
paths |
paths for which to check bot's permission, defaults to "/". Please note that path to a folder should end with a trailing slash ("/"). |
robotstxt_list |
either NULL – the default – or a list of character vectors with one vector per path to check |
Print robotstxt
## S3 method for class 'robotstxt'
print(x, ...)
x |
robotstxt instance to be printed |
... |
goes down the sink |
Print robotstxt's text
## S3 method for class 'robotstxt_text'
print(x, ...)
x |
character vector aka robotstxt$text to be printed |
... |
goes down the sink |
Remove domain from path
remove_domain(x)
x |
path aka URL from which to first infer domain and then remove it |
Helper function to handle robotstxt handlers.
request_handler_handler(request, handler, res, info = TRUE, warn = TRUE)
request |
the request object returned by call to httr::GET() |
handler |
the handler either a character string entailing various options or a function producing a specific list, see return. |
res |
a list with elements '[handler names], ...', 'rtxt', and 'cache' |
info |
info to add to problems list |
warn |
if FALSE warnings and messages are suppressed |
a list with elements '[handler name]', 'rtxt', and 'cache'
The function generates a list that contains the data resulting from parsing a robots.txt file, as well as a function called check that lets you ask whether a bot (or particular bots) is allowed to access a resource on the domain.
robotstxt( domain = NULL, text = NULL, user_agent = NULL, warn = getOption("robotstxt_warn", TRUE), force = FALSE, ssl_verifypeer = c(1, 0), encoding = "UTF-8", verbose = FALSE, on_server_error = on_server_error_default, on_client_error = on_client_error_default, on_not_found = on_not_found_default, on_redirect = on_redirect_default, on_domain_change = on_domain_change_default, on_file_type_mismatch = on_file_type_mismatch_default, on_suspect_content = on_suspect_content_default )
domain |
Domain for which to generate a representation. If text equals to NULL, the function will download the file from server - the default. |
text |
If automatic download of the robots.txt is not preferred, the text can be supplied directly. |
user_agent |
HTTP user-agent string to be used to retrieve robots.txt file from domain |
warn |
warn about being unable to download domain/robots.txt because of HTTP response status 404 |
force |
if TRUE, the function re-downloads the robots.txt file instead of using possibly cached results |
ssl_verifypeer |
either 1 (default) or 0, if 0 it disables SSL peer verification, which might help with robots.txt file retrieval |
encoding |
Encoding of the robots.txt file. |
verbose |
make function print out more information |
on_server_error |
request state handler for any 5xx status |
on_client_error |
request state handler for any 4xx HTTP status that is not 404 |
on_not_found |
request state handler for HTTP status 404 |
on_redirect |
request state handler for any 3xx HTTP status |
on_domain_change |
request state handler for any 3xx HTTP status where domain did change as well |
on_file_type_mismatch |
request state handler for content type other than 'text/plain' |
on_suspect_content |
request state handler for content that seems to be something else than a robots.txt file (usually a JSON, XML or HTML) |
Object (list) of class robotstxt with parsed data from a robots.txt (domain, text, bots, permissions, host, sitemap, other) and one function (check()) to check resource permissions.
domain
character vector holding domain name for which the robots.txt file is valid; will be set to NA if not supplied on initialization
text
character vector of text of robots.txt file; either supplied on initialization or automatically downloaded from domain supplied on initialization
bots
character vector of bot names mentioned in robots.txt
permissions
data.frame of bot permissions found in robots.txt file
host
data.frame of host fields found in robots.txt file
sitemap
data.frame of sitemap fields found in robots.txt file
other
data.frame of other - none of the above - fields found in robots.txt file
check()
Method to check for bot permissions. Defaults to the domain's root and no bot in particular. check() has two arguments: paths and bot. The first supplies the paths for which to check permissions and the second the name of the bot. Please note that the path to a folder should end with a trailing slash ("/").
## Not run:
rt <- robotstxt(domain="google.com")
rt$bots
rt$permissions
rt$check( paths = c("/", "forbidden"), bot="*")
## End(Not run)
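The example above needs network access. As a complementary sketch, the text argument can be used to build the object from rules supplied directly, so nothing is downloaded. It assumes the robotstxt package is installed; the rules and the path names are invented for illustration.

```r
library(robotstxt)

# Made-up rules: everything is allowed except the /private/ folder.
rules <- paste(
  c(
    "User-agent: *",
    "Disallow: /private/"
  ),
  collapse = "\n"
)

# Supplying text means no download is attempted.
rt <- robotstxt(text = rules)

# Ask about the root and about a page inside the disallowed folder.
allowed <- rt$check(paths = c("/", "/private/report.html"), bot = "*")
print(allowed)
```

With these rules one would expect the root to be reported as allowed and the page under /private/ as disallowed.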
Get the robotstxt cache
rt_cache
An object of class environment of length 0.
Storage for HTTP request response objects
Execute HTTP request for get_robotstxt()
rt_last_http
get_robotstxt_http_get( domain, user_agent = utils::sessionInfo()$R.version$version.string, ssl_verifypeer = 1 )
domain |
the domain to get robots.txt file for. |
user_agent |
the user agent to use for HTTP request header. Defaults to current version of R. If 'NULL' is passed, httr will use software versions for the header, such as 'libcurl/8.7.1 r-curl/5.2.3 httr/1.4.7' |
ssl_verifypeer |
either 1 (default) or 0, if 0 it disables SSL peer verification, which might help with robots.txt file retrieval |
An object of class environment of length 1.
A helper function for get_robotstxt() that will extract the robots.txt file from the HTTP request result object. It will inform get_robotstxt() if the request should be cached and which problems occurred.
rt_request_handler( request, on_server_error = on_server_error_default, on_client_error = on_client_error_default, on_not_found = on_not_found_default, on_redirect = on_redirect_default, on_domain_change = on_domain_change_default, on_sub_domain_change = on_sub_domain_change_default, on_file_type_mismatch = on_file_type_mismatch_default, on_suspect_content = on_suspect_content_default, warn = TRUE, encoding = "UTF-8" )
on_server_error_default
on_client_error_default
on_not_found_default
on_redirect_default
on_domain_change_default
on_sub_domain_change_default
on_file_type_mismatch_default
on_suspect_content_default
request |
result of an HTTP request (e.g. httr::GET()) |
on_server_error |
request state handler for any 5xx status |
on_client_error |
request state handler for any 4xx HTTP status that is not 404 |
on_not_found |
request state handler for HTTP status 404 |
on_redirect |
request state handler for any 3xx HTTP status |
on_domain_change |
request state handler for any 3xx HTTP status where domain did change as well |
on_sub_domain_change |
request state handler for any 3xx HTTP status where domain did change but only to www-sub_domain |
on_file_type_mismatch |
request state handler for content type other than 'text/plain' |
on_suspect_content |
request state handler for content that seems to be something else than a robots.txt file (usually a JSON, XML or HTML) |
warn |
suppress warnings |
encoding |
The text encoding to assume if no encoding is provided in the headers of the response |
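on_server_error_default: an object of class list of length 4.
on_client_error_default: an object of class list of length 4.
on_not_found_default: an object of class list of length 4.
on_redirect_default: an object of class list of length 2.
on_domain_change_default: an object of class list of length 3.
on_sub_domain_change_default: an object of class list of length 2.
on_file_type_mismatch_default: an object of class list of length 4.
on_suspect_content_default: an object of class list of length 4.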
a list with three items that follows this schema:
list(
  rtxt = "",
  problems = list(
    "redirect" = list( status_code = 301 ),
    "domain" = list(from_url = "...", to_url = "...")
  )
)