Package 'robotstxt' reference manual

Title:	A 'robots.txt' Parser and 'Webbot'/'Spider'/'Crawler' Permissions Checker
Description:	Provides functions to download and parse 'robots.txt' files. Ultimately, the package makes it easy to check whether bots (spiders, crawlers, scrapers, ...) are allowed to access specific resources on a domain.
Authors:	Pedro Baltazar [ctb], Jordan Bradford [cre], Peter Meissner [aut], Kun Ren [aut, cph] (Author and copyright holder of list_merge.R.), Oliver Keys [ctb] (original release code review), Rich Fitz John [ctb] (original release code review)
Maintainer:	Jordan Bradford <[email protected]>
License:	MIT + file LICENSE
Version:	0.7.15.9000
Built:	2025-01-14 03:22:25 UTC
Source:	https://github.com/ropensci/robotstxt

re-export magrittr pipe operator

Description

re-export magrittr pipe operator

Convert robotstxt_text to list

Description

Convert robotstxt_text to list

Usage

## S3 method for class 'robotstxt_text'
as.list(x, ...)

Arguments

`x`	class robotstxt_text object to be transformed into list
`...`	further arguments (inherited from `base::as.list()`)

Add http protocol if missing from URL

Description

Add http protocol if missing from URL

Usage

fix_url(url)

Arguments

`url`	a character string containing a single URL
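
To illustrate the behaviour described above, the following base R sketch prepends `http://` when a URL carries no scheme. The helper name `add_http_if_missing()` and the regular expression are assumptions made for this example; this is not the package's implementation.

```r
# Hypothetical helper (not part of robotstxt): prepend "http://" when the
# URL does not already start with a scheme such as "http://" or "https://".
add_http_if_missing <- function(url) {
  stopifnot(is.character(url), length(url) == 1)
  has_scheme <- grepl("^[A-Za-z][A-Za-z0-9+.-]*://", url)
  if (has_scheme) url else paste0("http://", url)
}

add_http_if_missing("example.com/robots.txt")          # no scheme present
add_http_if_missing("https://example.com/robots.txt")  # scheme already present
```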

Download a robots.txt file

Description

Download a robots.txt file

Usage

get_robotstxt(
  domain,
  warn = getOption("robotstxt_warn", TRUE),
  force = FALSE,
  user_agent = utils::sessionInfo()$R.version$version.string,
  ssl_verifypeer = c(1, 0),
  encoding = "UTF-8",
  verbose = FALSE,
  rt_request_handler = robotstxt::rt_request_handler,
  rt_robotstxt_http_getter = robotstxt::get_robotstxt_http_get,
  on_server_error = on_server_error_default,
  on_client_error = on_client_error_default,
  on_not_found = on_not_found_default,
  on_redirect = on_redirect_default,
  on_domain_change = on_domain_change_default,
  on_file_type_mismatch = on_file_type_mismatch_default,
  on_suspect_content = on_suspect_content_default
)

Arguments

`domain`	domain from which to download robots.txt file
`warn`	warn about being unable to download domain/robots.txt because of HTTP response status 404
`force`	if TRUE, the function re-downloads the robots.txt file instead of using possibly cached results
`user_agent`	HTTP user-agent string to be used to retrieve robots.txt file from domain
`ssl_verifypeer`	either 1 (default) or 0, if 0 it disables SSL peer verification, which might help with robots.txt file retrieval
`encoding`	Encoding of the robots.txt file.
`verbose`	make function print out more information
`rt_request_handler`	handler function that handles request according to the event handlers specified
`rt_robotstxt_http_getter`	function that executes HTTP request
`on_server_error`	request state handler for any 5xx status
`on_client_error`	request state handler for any 4xx HTTP status that is not 404
`on_not_found`	request state handler for HTTP status 404
`on_redirect`	request state handler for any 3xx HTTP status
`on_domain_change`	request state handler for any 3xx HTTP status where domain did change as well
`on_file_type_mismatch`	request state handler for content type other than 'text/plain'
`on_suspect_content`	request state handler for content that seems to be something else than a robots.txt file (usually a JSON, XML or HTML)

Download multiple robotstxt files

Description

Download multiple robotstxt files

Usage

get_robotstxts(
  domain,
  warn = TRUE,
  force = FALSE,
  user_agent = utils::sessionInfo()$R.version$version.string,
  ssl_verifypeer = c(1, 0),
  use_futures = FALSE,
  verbose = FALSE,
  rt_request_handler = robotstxt::rt_request_handler,
  rt_robotstxt_http_getter = robotstxt::get_robotstxt_http_get,
  on_server_error = on_server_error_default,
  on_client_error = on_client_error_default,
  on_not_found = on_not_found_default,
  on_redirect = on_redirect_default,
  on_domain_change = on_domain_change_default,
  on_file_type_mismatch = on_file_type_mismatch_default,
  on_suspect_content = on_suspect_content_default
)

Arguments

`domain`	domain from which to download robots.txt file
`warn`	warn about being unable to download domain/robots.txt because of HTTP response status 404
`force`	if TRUE, the function re-downloads the robots.txt file instead of using possibly cached results
`user_agent`	HTTP user-agent string to be used to retrieve robots.txt file from domain
`ssl_verifypeer`	either 1 (default) or 0, if 0 it disables SSL peer verification, which might help with robots.txt file retrieval
`use_futures`	Should future::future_lapply be used for possible parallel/async retrieval or not. Note: check out help pages and vignettes of package future on how to set up plans for future execution because the robotstxt package does not do it on its own.
`verbose`	make function print out more information
`rt_request_handler`	handler function that handles request according to the event handlers specified
`rt_robotstxt_http_getter`	function that executes HTTP request
`on_server_error`	request state handler for any 5xx status
`on_client_error`	request state handler for any 4xx HTTP status that is not 404
`on_not_found`	request state handler for HTTP status 404
`on_redirect`	request state handler for any 3xx HTTP status
`on_domain_change`	request state handler for any 3xx HTTP status where domain did change as well
`on_file_type_mismatch`	request state handler for content type other than 'text/plain'
`on_suspect_content`	request state handler for content that seems to be something else than a robots.txt file (usually a JSON, XML or HTML)
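
Because the package does not set up a future plan on its own, a caller who wants parallel retrieval might do so before calling `get_robotstxts()`. The sketch below is one possible setup, assuming the future package is installed; the domains are placeholders, and the choice of `future::multisession` is an assumption rather than a recommendation from this manual.

```r
domains <- c("example.com", "example.org")  # placeholder domains

# Guarded so that nothing is downloaded in non-interactive runs.
if (interactive() &&
    requireNamespace("robotstxt", quietly = TRUE) &&
    requireNamespace("future", quietly = TRUE)) {
  old_plan <- future::plan(future::multisession)  # background R sessions
  rtxts <- robotstxt::get_robotstxts(domain = domains, use_futures = TRUE)
  future::plan(old_plan)                          # restore the previous plan
  str(rtxts)
}
```

Restoring the previous plan afterwards keeps the change local to this retrieval.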

Guess a domain from path

Description

Guess a domain from path

Usage

guess_domain(x)

Arguments

`x`	path aka URL from which to infer domain
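
The idea of inferring a domain from a URL can be sketched in base R as follows. The helper `host_from_url()` and its regular expression are hypothetical and written only for this example; the package's own function may treat edge cases differently.

```r
# Hypothetical helper (not part of robotstxt): extract the host part of each
# URL, returning NA when the input has no "scheme://" prefix to anchor on.
host_from_url <- function(x) {
  pattern <- "^[A-Za-z][A-Za-z0-9+.-]*://([^/:?#]+).*$"
  ifelse(grepl(pattern, x), sub(pattern, "\\1", x), NA_character_)
}

host_from_url(c("https://example.com/images/", "/images/"))
```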

Check if HTTP domain changed

Description

Check if HTTP domain changed

Usage

http_domain_changed(response)

Arguments

`response`	an httr response object, e.g. from a call to httr::GET()

Value

logical of length 1 indicating whether or not any domain change happened during the HTTP request

Check if HTTP subdomain changed

Description

Check if HTTP subdomain changed

Usage

http_subdomain_changed(response)

Arguments

`response`	an httr response object, e.g. from a call to httr::GET()

Value

logical of length 1 indicating whether or not any subdomain change happened during the HTTP request

Check if HTTP redirect occurred

Description

Check if HTTP redirect occurred

Usage

http_was_redirected(response)

Arguments

`response`	an httr response object, e.g. from a call to httr::GET()

Value

logical of length 1 indicating whether or not any redirect happened during the HTTP request

Check if file is valid / parsable robots.txt file

Description

Function that checks if file is valid / parsable robots.txt file

Usage

is_suspect_robotstxt(text)

Arguments

`text`	content of a robots.txt file provided as character vector

Validate if a file is valid / parsable robots.txt file

Description

Validate if a file is valid / parsable robots.txt file

Usage

is_valid_robotstxt(text, check_strickt_ascii = FALSE)

Arguments

`text`	content of a robots.txt file provided as character vector
`check_strickt_ascii`	whether or not to check that the content adheres to the RFC specification requiring plain text, i.e. ASCII
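
A short usage sketch, assuming the robotstxt package is installed: two made-up inputs are passed to `is_valid_robotstxt()`, one resembling a robots.txt file and one resembling an HTML page. The input strings are invented for this example, and the results are printed rather than asserted.

```r
# Two inputs: a plausible robots.txt and an HTML page served by mistake.
good_txt <- "User-agent: *\nDisallow: /private/\n"
html_txt <- "<!DOCTYPE html>\n<html><body>Not found</body></html>"

if (requireNamespace("robotstxt", quietly = TRUE)) {
  # check_strickt_ascii is left at its default (FALSE).
  validity <- c(
    good = robotstxt::is_valid_robotstxt(good_txt),
    html = robotstxt::is_valid_robotstxt(html_txt)
  )
  print(validity)
}
```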

Merge a number of named lists in sequential order

Description

Merge a number of named lists in sequential order

Usage

list_merge(...)

Arguments

`...`	named lists

Details

The function merges a number of lists in sequential order using modifyList: each later list modifies the result of merging the lists before it, and the resulting list is then merged with the next list. The process is repeated until all lists in ... are exhausted.

List merging is typically useful for merging program settings or configurations that exist in multiple versions across time, or at multiple administrative levels. For example, program settings may have an initial version in which most keys are defined and specified, while later versions record only partial modifications. In this case, list merging can be used to merge all versions of the settings in release order. The result is a fully updated set of settings with all later modifications applied.

Author(s)

Kun Ren <[email protected]>
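
The sequential merge by modifyList described for this function can be reproduced with base R alone. The sketch below folds three made-up settings lists from left to right; the list contents are invented for illustration, and `Reduce()` stands in for the package function.

```r
# Settings recorded over three releases; later lists override earlier ones.
v1 <- list(timeout = 10, retries = 3)
v2 <- list(retries = 5)
v3 <- list(verbose = TRUE)

# Fold the lists left to right with modifyList().
merged <- Reduce(utils::modifyList, list(v1, v2, v3))
str(merged)
```

Keys that appear only in earlier lists survive unchanged, while keys repeated in later lists take the later value.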

Return default value if NULL

Description

Return default value if NULL

Usage

null_to_default(x, d)

Arguments

`x`	value to check and return
`d`	value to return in case x is NULL
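
The documented behaviour amounts to a NULL check. A minimal stand-in, written for this example under the hypothetical name `value_or_default()`, could look like this:

```r
# Hypothetical stand-in for the documented behaviour (not the package's code):
# return x unless it is NULL, in which case return the default d.
value_or_default <- function(x, d) {
  if (is.null(x)) d else x
}

opts <- list(encoding = "UTF-8")
value_or_default(opts$encoding, "latin1")  # element present
value_or_default(opts$timeout, 30)         # element missing, so NULL
```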

Parse a robots.txt file

Description

Parse a robots.txt file

Usage

parse_robotstxt(txt)

Arguments

`txt`	content of the robots.txt file

Value

a named list with useragents, comments, permissions, sitemap
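
As a usage sketch, assuming the robotstxt package is installed, a small robots.txt text can be parsed without any download. The file content below is made up for this example, and the output is printed rather than asserted.

```r
txt <- paste(
  "# example file written for this sketch",
  "User-agent: *",
  "Disallow: /private/",
  "Sitemap: https://example.com/sitemap.xml",
  sep = "\n"
)

if (requireNamespace("robotstxt", quietly = TRUE)) {
  parsed <- robotstxt::parse_robotstxt(txt)
  # The manual lists useragents, comments, permissions and sitemap
  # among the elements of the returned list.
  print(names(parsed))
  print(parsed$permissions)
}
```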

Check if a bot has permissions to access page(s)

Description

Check if a bot has permissions to access page(s)

Usage

paths_allowed(
  paths = "/",
  domain = "auto",
  bot = "*",
  user_agent = utils::sessionInfo()$R.version$version.string,
  check_method = c("spiderbar"),
  warn = getOption("robotstxt_warn", TRUE),
  force = FALSE,
  ssl_verifypeer = c(1, 0),
  use_futures = TRUE,
  robotstxt_list = NULL,
  verbose = FALSE,
  rt_request_handler = robotstxt::rt_request_handler,
  rt_robotstxt_http_getter = robotstxt::get_robotstxt_http_get,
  on_server_error = on_server_error_default,
  on_client_error = on_client_error_default,
  on_not_found = on_not_found_default,
  on_redirect = on_redirect_default,
  on_domain_change = on_domain_change_default,
  on_file_type_mismatch = on_file_type_mismatch_default,
  on_suspect_content = on_suspect_content_default
)

Arguments

`paths`	paths for which to check bot's permission, defaults to "/". Please note that path to a folder should end with a trailing slash ("/").
`domain`	Domain for which paths should be checked. Defaults to "auto". If set to "auto" function will try to guess the domain by parsing the paths argument. Note however, that these are educated guesses which might utterly fail. To be on the safe side, provide appropriate domains manually.
`bot`	name of the bot, defaults to "*"
`user_agent`	HTTP user-agent string to be used to retrieve robots.txt file from domain
`check_method`	kept only for backward compatibility; do not use this parameter anymore, the function will simply use the default
`warn`	if FALSE, warnings are suppressed
`force`	if TRUE, the function re-downloads the robots.txt file instead of using possibly cached results
`ssl_verifypeer`	either 1 (default) or 0, if 0 it disables SSL peer verification, which might help with robots.txt file retrieval
`use_futures`	Should future::future_lapply be used for possible parallel/async retrieval or not. Note: check out help pages and vignettes of package future on how to set up plans for future execution because the robotstxt package does not do it on its own.
`robotstxt_list`	either NULL – the default – or a list of character vectors with one vector per path to check
`verbose`	make function print out more information
`rt_request_handler`	handler function that handles request according to the event handlers specified
`rt_robotstxt_http_getter`	function that executes HTTP request
`on_server_error`	request state handler for any 5xx status
`on_client_error`	request state handler for any 4xx HTTP status that is not 404
`on_not_found`	request state handler for HTTP status 404
`on_redirect`	request state handler for any 3xx HTTP status
`on_domain_change`	request state handler for any 3xx HTTP status where domain did change as well
`on_file_type_mismatch`	request state handler for content type other than 'text/plain'
`on_suspect_content`	request state handler for content that seems to be something else than a robots.txt file (usually a JSON, XML or HTML)

Check if a spiderbar bot has permissions to access page(s)

Description

Check if a spiderbar bot has permissions to access page(s)

Usage

paths_allowed_worker_spiderbar(domain, bot, paths, robotstxt_list)

Arguments

`domain`	Domain for which paths should be checked. Defaults to "auto". If set to "auto" function will try to guess the domain by parsing the paths argument. Note however, that these are educated guesses which might utterly fail. To be on the safe side, provide appropriate domains manually.
`bot`	name of the bot, defaults to "*"
`paths`	paths for which to check bot's permission, defaults to "/". Please note that path to a folder should end with a trailing slash ("/").
`robotstxt_list`	either NULL – the default – or a list of character vectors with one vector per path to check

Print robotstxt

Description

Print robotstxt

Usage

## S3 method for class 'robotstxt'
print(x, ...)

Arguments

`x`	robotstxt instance to be printed
`...`	goes down the sink

Print robotstxt's text

Description

Print robotstxt's text

Usage

## S3 method for class 'robotstxt_text'
print(x, ...)

Arguments

`x`	character vector aka robotstxt$text to be printed
`...`	goes down the sink

Remove domain from path

Description

Remove domain from path

Usage

remove_domain(x)

Arguments

`x`	path aka URL from which to first infer domain and then remove it
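
Stripping the domain from a URL can be sketched in base R as below. The helper `path_from_url()` and its regular expression are hypothetical and not the package's implementation; they only illustrate the kind of transformation the function performs.

```r
# Hypothetical helper (not part of robotstxt): strip "scheme://host[:port]"
# from the front of each URL; inputs without a scheme are returned unchanged.
path_from_url <- function(x) {
  sub("^[A-Za-z][A-Za-z0-9+.-]*://[^/?#]*", "", x)
}

path_from_url(c("https://example.com/images/", "/images/"))
```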

Handle robotstxt handlers

Description

Helper function to handle robotstxt handlers.

Usage

request_handler_handler(request, handler, res, info = TRUE, warn = TRUE)

Arguments

`request`	the request object returned by call to httr::GET()
`handler`	the handler: either a character string entailing various options or a function producing a specific list (see Value)
`res`	a list with elements '[handler names], ...', 'rtxt', and 'cache'
`info`	info to add to problems list
`warn`	if FALSE warnings and messages are suppressed

Value

a list with elements '[handler name]', 'rtxt', and 'cache'

Generate a representation of a robots.txt file

Description

The function generates a list that holds the data resulting from parsing a robots.txt file, together with a function called check that can be used to ask whether a bot (or particular bots) is allowed to access a resource on the domain.

Usage

robotstxt(
  domain = NULL,
  text = NULL,
  user_agent = NULL,
  warn = getOption("robotstxt_warn", TRUE),
  force = FALSE,
  ssl_verifypeer = c(1, 0),
  encoding = "UTF-8",
  verbose = FALSE,
  on_server_error = on_server_error_default,
  on_client_error = on_client_error_default,
  on_not_found = on_not_found_default,
  on_redirect = on_redirect_default,
  on_domain_change = on_domain_change_default,
  on_file_type_mismatch = on_file_type_mismatch_default,
  on_suspect_content = on_suspect_content_default
)

Arguments

`domain`	Domain for which to generate a representation. If text is NULL (the default), the function will download the file from the server.
`text`	If automatic download of the robots.txt is not preferred, the text can be supplied directly.
`user_agent`	HTTP user-agent string to be used to retrieve robots.txt file from domain
`warn`	warn about being unable to download domain/robots.txt because of HTTP response status 404
`force`	if TRUE, the function re-downloads the robots.txt file instead of using possibly cached results
`ssl_verifypeer`	either 1 (default) or 0, if 0 it disables SSL peer verification, which might help with robots.txt file retrieval
`encoding`	Encoding of the robots.txt file.
`verbose`	make function print out more information
`on_server_error`	request state handler for any 5xx status
`on_client_error`	request state handler for any 4xx HTTP status that is not 404
`on_not_found`	request state handler for HTTP status 404
`on_redirect`	request state handler for any 3xx HTTP status
`on_domain_change`	request state handler for any 3xx HTTP status where domain did change as well
`on_file_type_mismatch`	request state handler for content type other than 'text/plain'
`on_suspect_content`	request state handler for content that seems to be something else than a robots.txt file (usually a JSON, XML or HTML)

Value

Object (list) of class robotstxt with parsed data from a robots.txt file (domain, text, bots, permissions, host, sitemap, other) and one function (check()) to check resource permissions.

Fields

domain: character vector holding domain name for which the robots.txt file is valid; will be set to NA if not supplied on initialization
text: character vector of text of robots.txt file; either supplied on initialization or automatically downloaded from domain supplied on initialization
bots: character vector of bot names mentioned in robots.txt
permissions: data.frame of bot permissions found in robots.txt file
host: data.frame of host fields found in robots.txt file
sitemap: data.frame of sitemap fields found in robots.txt file
other: data.frame of other - none of the above - fields found in robots.txt file
check(): Method to check for bot permissions. Defaults to the domain's root and no bot in particular. check() has two arguments: paths and bot. The first supplies the paths for which to check permissions, the second the name of the bot. Please note that the path to a folder should end with a trailing slash ("/").

Examples

## Not run: 
rt <- robotstxt(domain="google.com")
rt$bots
rt$permissions
rt$check( paths = c("/", "forbidden"), bot="*")

## End(Not run)
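
The example above needs network access. As an offline variant, the sketch below supplies the robots.txt content through the `text` argument instead of a domain. The rules are made up for this example, the robotstxt package is assumed to be installed, and the results are printed rather than asserted.

```r
rules <- paste(
  "User-agent: *",
  "Disallow: /private/",
  sep = "\n"
)

if (requireNamespace("robotstxt", quietly = TRUE)) {
  # No domain is given, so nothing is downloaded; the text is parsed as is.
  rt <- robotstxt::robotstxt(text = rules)
  print(rt$bots)
  print(rt$permissions)
  # Folder paths end with a trailing slash, as the manual recommends.
  print(rt$check(paths = c("/", "/private/"), bot = "*"))
}
```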

Get the robotstxt cache

Description

Get the robotstxt cache

Usage

rt_cache

Format

An object of class environment of length 0.

Storage for HTTP request response objects

Description

Storage for HTTP request response objects

Execute HTTP request for get_robotstxt()

Usage

rt_last_http

get_robotstxt_http_get(
  domain,
  user_agent = utils::sessionInfo()$R.version$version.string,
  ssl_verifypeer = 1
)

Arguments

`domain`	the domain to get robots.txt file for.
`user_agent`	the user agent to use for HTTP request header. Defaults to current version of R. If 'NULL' is passed, httr will use software versions for the header, such as 'libcurl/8.7.1 r-curl/5.2.3 httr/1.4.7'
`ssl_verifypeer`	either 1 (default) or 0, if 0 it disables SSL peer verification, which might help with robots.txt file retrieval

Format

An object of class environment of length 1.

Handle robotstxt object retrieved from HTTP request

Description

A helper function for get_robotstxt() that will extract the robots.txt file from the HTTP request result object. It will inform get_robotstxt() if the request should be cached and which problems occurred.

Usage

rt_request_handler(
  request,
  on_server_error = on_server_error_default,
  on_client_error = on_client_error_default,
  on_not_found = on_not_found_default,
  on_redirect = on_redirect_default,
  on_domain_change = on_domain_change_default,
  on_sub_domain_change = on_sub_domain_change_default,
  on_file_type_mismatch = on_file_type_mismatch_default,
  on_suspect_content = on_suspect_content_default,
  warn = TRUE,
  encoding = "UTF-8"
)

on_server_error_default

on_client_error_default

on_not_found_default

on_redirect_default

on_domain_change_default

on_sub_domain_change_default

on_file_type_mismatch_default

on_suspect_content_default

Arguments

`request`	result of an HTTP request (e.g. httr::GET())
`on_server_error`	request state handler for any 5xx status
`on_client_error`	request state handler for any 4xx HTTP status that is not 404
`on_not_found`	request state handler for HTTP status 404
`on_redirect`	request state handler for any 3xx HTTP status
`on_domain_change`	request state handler for any 3xx HTTP status where domain did change as well
`on_sub_domain_change`	request state handler for any 3xx HTTP status where domain did change but only to www-sub_domain
`on_file_type_mismatch`	request state handler for content type other than 'text/plain'
`on_suspect_content`	request state handler for content that seems to be something else than a robots.txt file (usually a JSON, XML or HTML)
`warn`	if FALSE, warnings are suppressed
`encoding`	The text encoding to assume if no encoding is provided in the headers of the response

Format

An object of class list of length 4.

An object of class list of length 2.

An object of class list of length 3.

An object of class list of length 2.

An object of class list of length 4.

Value

a list with three items that follows this schema:
list( rtxt = "", problems = list( "redirect" = list( status_code = 301 ), "domain" = list(from_url = "...", to_url = "...") ) )
