Package: robotstxt 0.7.13

Pedro Baltazar

robotstxt: A 'robots.txt' Parser and 'Webbot'/'Spider'/'Crawler' Permissions Checker

Provides functions to download and parse 'robots.txt' files. Ultimately, the package makes it easy to check whether bots (spiders, crawlers, scrapers, ...) are allowed to access specific resources on a domain.

Authors: Pedro Baltazar [aut, cre], Peter Meissner [aut], Kun Ren [aut, cph], Oliver Keys [ctb], Rich Fitz John [ctb]

robotstxt_0.7.13.tar.gz
robotstxt_0.7.13.zip (r-4.5), robotstxt_0.7.13.zip (r-4.4), robotstxt_0.7.13.zip (r-4.3)
robotstxt_0.7.13.tgz (r-4.4-any), robotstxt_0.7.13.tgz (r-4.3-any)
robotstxt_0.7.13.tar.gz (r-4.5-noble), robotstxt_0.7.13.tar.gz (r-4.4-noble)
robotstxt_0.7.13.tgz (r-4.4-emscripten), robotstxt_0.7.13.tgz (r-4.3-emscripten)
robotstxt.pdf | robotstxt.html
robotstxt/json (API)
NEWS

# Install 'robotstxt' in R:
install.packages('robotstxt', repos = c('https://ropensci.r-universe.dev', 'https://cloud.r-project.org'))
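
Once installed, a quick permission check might look something like this (a minimal sketch; the domain and paths are only examples, and the exact console output may differ):

# Check whether a generic bot ("*") may fetch specific paths on a domain
library(robotstxt)
paths_allowed(
  paths  = c("/", "/images/"),
  domain = "www.wikipedia.org",
  bot    = "*"
)
# returns a logical vector, one element per path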

Bug tracker:https://github.com/ropensci/robotstxt/issues

Keywords: crawler, peer-reviewed, robotstxt, scraper, spider, webscraping

19 exports | 69 stars | 3.98 score | 25 dependencies | 5 dependents | 1.7k downloads

Last updated 2 years ago from: dd5bf96a19 (on master)

Exports: %>%, get_robotstxt, get_robotstxt_http_get, get_robotstxts, is_valid_robotstxt, on_client_error_default, on_domain_change_default, on_file_type_mismatch_default, on_not_found_default, on_redirect_default, on_server_error_default, on_sub_domain_change_default, on_suspect_content_default, parse_robotstxt, paths_allowed, request_handler_handler, robotstxt, rt_last_http, rt_request_handler

Dependencies: askpass, cli, codetools, curl, digest, future, future.apply, globals, glue, httr, jsonlite, lifecycle, listenv, magrittr, mime, openssl, parallelly, R6, Rcpp, rlang, spiderbar, stringi, stringr, sys, vctrs

Using Robotstxt

Rendered from using_robotstxt.Rmd using knitr::rmarkdown on Jul 12 2024.

Last update: 2020-09-02
Started: 2016-01-09

Readme and manuals

Help Manual

Help page | Topics
re-export magrittr pipe operator | %>%
Method as.list() for class robotstxt_text | as.list.robotstxt_text
fix_url | fix_url
downloading robots.txt file | get_robotstxt
function to get multiple robotstxt files | get_robotstxts
function guessing domain from path | guess_domain
http_domain_changed | http_domain_changed
http_subdomain_changed | http_subdomain_changed
http_was_redirected | http_was_redirected
is_suspect_robotstxt | is_suspect_robotstxt
function that checks if file is a valid / parsable robots.txt file | is_valid_robotstxt
Merge a number of named lists in sequential order | list_merge
null_to_defeault | null_to_defeault
function parsing robots.txt | parse_robotstxt
check if a bot has permissions to access page(s) | paths_allowed
paths_allowed_worker spiderbar flavor | paths_allowed_worker_spiderbar
printing robotstxt | print.robotstxt
printing robotstxt_text | print.robotstxt_text
function to remove domain from path | remove_domain
request_handler_handler | request_handler_handler
Generate a representation of a robots.txt file | robotstxt
get_robotstxt() cache | rt_cache
storage for http request response objects | get_robotstxt_http_get, rt_last_http
rt_request_handler | on_client_error_default, on_domain_change_default, on_file_type_mismatch_default, on_not_found_default, on_redirect_default, on_server_error_default, on_sub_domain_change_default, on_suspect_content_default, rt_request_handler
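
A rough sketch of how the main helpers listed above fit together; the domain is only an example, and the exact structure of the returned objects may differ slightly between versions:

# download the raw robots.txt of a domain
library(robotstxt)
txt <- get_robotstxt("www.wikipedia.org")

# parse it into user agents, permission records, sitemaps, ...
parsed <- parse_robotstxt(txt)

# or build a robotstxt object and query it directly
rtxt <- robotstxt(domain = "www.wikipedia.org")
rtxt$check(paths = c("/", "/api/"), bot = "*")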