Package: robotstxt 0.7.15.9000

Jordan Bradford

robotstxt: A 'robots.txt' Parser and 'Webbot'/'Spider'/'Crawler' Permissions Checker

Provides functions to download and parse 'robots.txt' files. Ultimately, the package makes it easy to check whether bots (spiders, crawlers, scrapers, ...) are allowed to access specific resources on a domain.
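For example, a permissions check can be as short as the following sketch (the domain and paths are placeholders; the call fetches that domain's robots.txt over the network):

# Check whether any bot may fetch specific paths (placeholder domain/paths):
library(robotstxt)
paths_allowed(
  paths  = c("/images/", "/search"),
  domain = "example.com",
  bot    = "*"
)
# returns one TRUE/FALSE per path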

Authors: Pedro Baltazar [ctb], Jordan Bradford [cre], Peter Meissner [aut], Kun Ren [aut, cph], Oliver Keys [ctb], Rich Fitz John [ctb]

robotstxt_0.7.15.9000.tar.gz
robotstxt_0.7.15.9000.zip (r-4.5) | robotstxt_0.7.15.9000.zip (r-4.4) | robotstxt_0.7.15.9000.zip (r-4.3)
robotstxt_0.7.15.9000.tgz (r-4.4-any) | robotstxt_0.7.15.9000.tgz (r-4.3-any)
robotstxt_0.7.15.9000.tar.gz (r-4.5-noble) | robotstxt_0.7.15.9000.tar.gz (r-4.4-noble)
robotstxt_0.7.15.9000.tgz (r-4.4-emscripten) | robotstxt_0.7.15.9000.tgz (r-4.3-emscripten)
robotstxt.pdf | robotstxt.html
robotstxt/json (API)
NEWS

# Install 'robotstxt' in R:
install.packages('robotstxt', repos = c('https://ropensci.r-universe.dev', 'https://cloud.r-project.org'))

Peer review:

Bug tracker: https://github.com/ropensci/robotstxt/issues

On CRAN:

Keywords: crawler, peer-reviewed, robotstxt, scraper, spider, webscraping

10.45 score, 68 stars, 6 packages, 358 scripts, 3.2k downloads, 19 exports, 25 dependencies

Last updated 23 hours ago from d3d0a4d525 (on main). Checks: OK: 7. Indexed: yes.

Target           Result  Date
Doc / Vignettes  OK      Nov 15 2024
R-4.5-win        OK      Nov 15 2024
R-4.5-linux      OK      Nov 15 2024
R-4.4-win        OK      Nov 15 2024
R-4.4-mac        OK      Nov 15 2024
R-4.3-win        OK      Nov 15 2024
R-4.3-mac        OK      Nov 15 2024

Exports: %>%, get_robotstxt, get_robotstxt_http_get, get_robotstxts, is_valid_robotstxt, on_client_error_default, on_domain_change_default, on_file_type_mismatch_default, on_not_found_default, on_redirect_default, on_server_error_default, on_sub_domain_change_default, on_suspect_content_default, parse_robotstxt, paths_allowed, request_handler_handler, robotstxt, rt_last_http, rt_request_handler
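As a rough sketch of how a few of these exports fit together, robots.txt rules can be validated and parsed from plain text with no network access (the rules below are made up for illustration):

# Validate and parse robots.txt rules supplied as text (illustrative rules):
library(robotstxt)
txt <- "User-agent: *\nDisallow: /private/\n"
is_valid_robotstxt(txt)        # TRUE if the content is parsable
rules <- parse_robotstxt(txt)  # list of data frames
rules$permissions              # Allow/Disallow rules per user agent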

Dependencies: askpass, cli, codetools, curl, digest, future, future.apply, globals, glue, httr, jsonlite, lifecycle, listenv, magrittr, mime, openssl, parallelly, R6, Rcpp, rlang, spiderbar, stringi, stringr, sys, vctrs

Using Robotstxt

Rendered from using_robotstxt.Rmd using knitr::rmarkdown on Nov 15 2024.

Last update: 2024-08-24
Started: 2016-01-09

Readme and manuals

Help Manual

Help page                                                    Topics
Re-export magrittr pipe operator                             %>%
Convert robotstxt_text to list                               as.list.robotstxt_text
Add http protocol if missing from URL                        fix_url
Download a robots.txt file                                   get_robotstxt
Download multiple robots.txt files                           get_robotstxts
Guess a domain from path                                     guess_domain
Check if HTTP domain changed                                 http_domain_changed
Check if HTTP subdomain changed                              http_subdomain_changed
Check if HTTP redirect occurred                              http_was_redirected
Check if a file is a suspect robots.txt file                 is_suspect_robotstxt
Validate if a file is a valid / parsable robots.txt file     is_valid_robotstxt
Merge a number of named lists in sequential order            list_merge
Return default value if NULL                                 null_to_default
Parse a robots.txt file                                      parse_robotstxt
Check if a bot has permissions to access page(s)             paths_allowed
Check if a spiderbar bot has permissions to access page(s)   paths_allowed_worker_spiderbar
Print robotstxt                                              print.robotstxt
Print robotstxt's text                                       print.robotstxt_text
Remove domain from path                                      remove_domain
Handle robotstxt handlers                                    request_handler_handler
Generate a representation of a robots.txt file (see below)   robotstxt
Get the robotstxt cache                                      rt_cache
Storage for HTTP request response objects                    get_robotstxt_http_get, rt_last_http
Handle robotstxt object retrieved from HTTP request          on_client_error_default, on_domain_change_default,
                                                             on_file_type_mismatch_default, on_not_found_default,
                                                             on_redirect_default, on_server_error_default,
                                                             on_sub_domain_change_default, on_suspect_content_default,
                                                             rt_request_handler
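The robotstxt() constructor listed above bundles downloading, parsing, and checking into a single object; a minimal sketch (example.com is a placeholder domain):

# Build a robots.txt object once, then query it repeatedly:
library(robotstxt)
rtxt <- robotstxt(domain = "example.com")         # downloads and parses the file
rtxt$bots                                         # user agents named in the file
rtxt$permissions                                  # parsed permission records
rtxt$check(paths = c("/", "/search"), bot = "*")  # TRUE/FALSE per path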