Package: robotstxt 0.7.13

Pedro Baltazar

robotstxt: A 'robots.txt' Parser and 'Webbot'/'Spider'/'Crawler' Permissions Checker

Provides functions to download and parse 'robots.txt' files. Ultimately, the package makes it easy to check whether bots (spiders, crawlers, scrapers, ...) are allowed to access specific resources on a domain.

Authors: Pedro Baltazar [aut, cre], Peter Meissner [aut], Kun Ren [aut, cph], Oliver Keys [ctb], Rich Fitz John [ctb]

robotstxt_0.7.13.tar.gz
robotstxt_0.7.13.zip (r-4.5), robotstxt_0.7.13.zip (r-4.4), robotstxt_0.7.13.zip (r-4.3)
robotstxt_0.7.13.tgz (r-4.4-any), robotstxt_0.7.13.tgz (r-4.3-any)
robotstxt_0.7.13.tar.gz (r-4.5-noble), robotstxt_0.7.13.tar.gz (r-4.4-noble)
robotstxt_0.7.13.tgz (r-4.4-emscripten), robotstxt_0.7.13.tgz (r-4.3-emscripten)
robotstxt.pdf | robotstxt.html
robotstxt/json (API)
NEWS

# Install 'robotstxt' in R:
install.packages('robotstxt', repos = c('https://ropensci.r-universe.dev', 'https://cloud.r-project.org'))
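
Once installed, a quick permission check might look something like this (a minimal sketch; the domain and paths are only examples, and the exact console output may differ):

# Check whether a generic bot ("*") may fetch specific paths on a domain
library(robotstxt)
paths_allowed(
  paths  = c("/", "/images/"),
  domain = "www.wikipedia.org",
  bot    = "*"
)
# returns a logical vector, one element per path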

Bug tracker:https://github.com/ropensci/robotstxt/issues

Keywords: crawler, peer-reviewed, robotstxt, scraper, spider, webscraping

19 exports | 69 stars | 3.98 score | 25 dependencies | 5 dependents | 1.7k downloads

Last updated 2 years ago from: dd5bf96a19 (on master)

Exports: %>%, get_robotstxt, get_robotstxt_http_get, get_robotstxts, is_valid_robotstxt, on_client_error_default, on_domain_change_default, on_file_type_mismatch_default, on_not_found_default, on_redirect_default, on_server_error_default, on_sub_domain_change_default, on_suspect_content_default, parse_robotstxt, paths_allowed, request_handler_handler, robotstxt, rt_last_http, rt_request_handler

Dependencies: askpass, cli, codetools, curl, digest, future, future.apply, globals, glue, httr, jsonlite, lifecycle, listenv, magrittr, mime, openssl, parallelly, R6, Rcpp, rlang, spiderbar, stringi, stringr, sys, vctrs

Using Robotstxt

Rendered from using_robotstxt.Rmd using knitr::rmarkdown on Jul 12 2024.

Last update: 2020-09-02
Started: 2016-01-09

Readme and manuals

Help Manual

Help page | Topics
re-export magrittr pipe operator | %>%
Method as.list() for class robotstxt_text | as.list.robotstxt_text
fix_url | fix_url
downloading robots.txt file | get_robotstxt
function to get multiple robotstxt files | get_robotstxts
function guessing domain from path | guess_domain
http_domain_changed | http_domain_changed
http_subdomain_changed | http_subdomain_changed
http_was_redirected | http_was_redirected
is_suspect_robotstxt | is_suspect_robotstxt
function that checks if file is a valid / parsable robots.txt file | is_valid_robotstxt
Merge a number of named lists in sequential order | list_merge
null_to_defeault | null_to_defeault
function parsing robots.txt | parse_robotstxt
check if a bot has permissions to access page(s) | paths_allowed
paths_allowed_worker spiderbar flavor | paths_allowed_worker_spiderbar
printing robotstxt | print.robotstxt
printing robotstxt_text | print.robotstxt_text
function to remove domain from path | remove_domain
request_handler_handler | request_handler_handler
Generate a representation of a robots.txt file | robotstxt
get_robotstxt() cache | rt_cache
storage for http request response objects | get_robotstxt_http_get, rt_last_http
rt_request_handler | on_client_error_default, on_domain_change_default, on_file_type_mismatch_default, on_not_found_default, on_redirect_default, on_server_error_default, on_sub_domain_change_default, on_suspect_content_default, rt_request_handler
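
A rough sketch of how the main helpers listed above fit together; the domain is only an example, and the exact structure of the returned objects may differ slightly between versions:

# download the raw robots.txt of a domain
library(robotstxt)
txt <- get_robotstxt("www.wikipedia.org")

# parse it into user agents, permission records, sitemaps, ...
parsed <- parse_robotstxt(txt)

# or build a robotstxt object and query it directly
rtxt <- robotstxt(domain = "www.wikipedia.org")
rtxt$check(paths = c("/", "/api/"), bot = "*")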