Title: | Functions to mine endoscopic and associated pathology datasets |
---|---|
Description: | This script comprises the functions that are used to clean up endoscopic reports and pathology reports as well as many of the scripts used for analysis. The scripts assume the endoscopy and histopathology data set is merged already but it can also be used of course with the unmerged datasets. |
Authors: | Sebastian Zeki [aut, cre] |
Maintainer: | Sebastian Zeki <[email protected]> |
License: | GPL-3 |
Version: | 2.0.1.9000 |
Built: | 2024-12-04 18:00:03 UTC |
Source: | https://github.com/ropensci/EndoMineR |
This determines the follow-up rule a patient should fit into, according to the British Society of Gastroenterology guidance on Barrett's oesophagus. Specifically, it combines the presence of intestinal metaplasia with the Prague score so that the follow-up group can be determined. It relies on the presence of a Prague score. It should be run after Barretts_PathStage, which looks for the worst stage of a specimen and determines the presence or absence of intestinal metaplasia if the sample is non-dysplastic. Because reports often do not record a Prague score, a more pragmatic approach has been to assess the M stage and, if this is not present, to use the C stage extrapolated using the Barretts_PragueScore function.
Barretts_FUType(dataframe, CStage, MStage, IMorNoIM)
dataframe |
the dataframe (which must first have been processed by Barretts_PathStage to get IMorNoIM and by Barretts_PragueScore to get the C and M stage, if available) |
CStage |
CStage column |
MStage |
MStage column |
IMorNoIM |
IMorNoIM column |
Other Disease Specific Analysis - Barretts Data: BarrettsAll(), BarrettsBxQual(), BarrettsParisEMR(), Barretts_PathStage(), Barretts_PragueScore()
# Firstly relevant columns are extrapolated from the
# Mypath demo dataset. These functions are all part of Histology data
# cleaning as part of the package.
v <- Mypath
v$NumBx <- HistolNumbOfBx(v$Macroscopicdescription, "specimen")
v$BxSize <- HistolBxSize(v$Macroscopicdescription)
# The histology is then merged with the Endoscopy dataset. The merge occurs
# according to date and Hospital number
v <- Endomerge2(
  Myendo, "Dateofprocedure", "HospitalNumber",
  v, "Dateofprocedure", "HospitalNumber"
)
# The function relies on the other Barrett's functions being run as well:
v$IMorNoIM <- Barretts_PathStage(v, "Histology")
v <- Barretts_PragueScore(v, "Findings")
# The follow-up group depends on the histology and the Prague score for a
# patient so it takes the processed Barrett's data and then looks in the
# Findings column for permutations of the Prague score.
v$FU_Type <- Barretts_FUType(v, "CStage", "MStage", "IMorNoIM")
rm(v)
This extracts the pathological stage from the histopathology specimen. It is done using 'degradation' so that it will look for the worst overall grade in the histology specimen and, if not found, it will look for the next worst and so on. It looks per report, not per biopsy (it is more common for histopathology reports to contain the worst overall grade rather than individual biopsy grades). Specifically, it extracts the worst histopathology grade within the specimen. For the sake of accuracy this should always be used after the HistolDx function, as this removes negative sentences such as 'there is no dysplasia'. This current function should be used on the column derived from HistolDx, which is called Dx_Simplified.
Barretts_PathStage(dataframe, PathColumn)
dataframe |
dataframe with column of interest |
PathColumn |
column of interest |
Other Disease Specific Analysis - Barretts Data: BarrettsAll(), BarrettsBxQual(), BarrettsParisEMR(), Barretts_FUType(), Barretts_PragueScore()
# The function takes the Histology column from the Mypath demo dataset.
# It extracts the worst histological grade for a specimen
b <- Barretts_PathStage(Mypath, "Histology")
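The 'degradation' approach described for Barretts_PathStage can be pictured as an ordered search that runs from the worst grade to the least severe and stops at the first match. The sketch below uses base R only and is an illustration of that idea, not the package's implementation: the function name `worst_grade_sketch`, the grade labels and the patterns are all assumptions made for this example. As the description notes, negative sentences would need to be removed first for this kind of search to be reliable.

```r
# Illustrative sketch only: grade labels and patterns are hypothetical,
# not the lexicon used by Barretts_PathStage.
worst_grade_sketch <- function(text) {
  # Patterns are ordered from worst to least severe grade
  grades <- c(
    Cancer = "adenocarcinoma",
    HGD    = "high[ -]grade dysplasia",
    LGD    = "low[ -]grade dysplasia",
    IM     = "intestinal metaplasia"
  )
  vapply(text, function(report) {
    for (g in names(grades)) {
      # Return the first (i.e. worst) grade found in the report
      if (grepl(grades[[g]], report, ignore.case = TRUE)) return(g)
    }
    "No_IM"
  }, character(1), USE.NAMES = FALSE)
}

reports <- c(
  "Intestinal metaplasia with focal high grade dysplasia.",
  "Columnar mucosa with intestinal metaplasia.",
  "Squamous mucosa only."
)
worst_grade_sketch(reports)
```

Because the search is per report, a report mentioning several grades yields only the worst one.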
The aim is to extract a C and M stage (Prague score) for Barrett's samples. This is done using a regex where C and M stages are explicitly mentioned in the free text. Specifically, it extracts the Prague score.
Barretts_PragueScore(dataframe, EndoReportColumn, EndoReportColumn2)
dataframe |
dataframe with column of interest |
EndoReportColumn |
column of interest |
EndoReportColumn2 |
second column of interest |
Other Disease Specific Analysis - Barretts Data: BarrettsAll(), BarrettsBxQual(), BarrettsParisEMR(), Barretts_FUType(), Barretts_PathStage()
# The example takes the endoscopy demo dataset and searches the
# Findings column (which contains endoscopy free text about the
# procedure itself). It then extracts the Prague score if relevant. I
# find it easiest to use this on a Barrett's subset of data rather than
# a dump of all endoscopies but of course this is a permissible dataset
# too
aa <- Barretts_PragueScore(Myendo, "Findings", "OGDReportWhole")
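To show what regex-based extraction of an explicitly stated Prague score might look like, here is a minimal base R sketch. It is not the code inside Barretts_PragueScore: the function name `prague_sketch` and the patterns are assumptions, and real reports will contain many more permutations than the simple `C2M4` form handled here.

```r
# Illustrative sketch only: the patterns are assumptions, not the
# regular expressions used inside Barretts_PragueScore.
prague_sketch <- function(text) {
  get_stage <- function(letter) {
    # Capture the digits that follow the letter, e.g. "C2" or "M 4"
    pattern <- paste0(letter, "[[:space:]]*([0-9]{1,2})")
    hit <- regmatches(text, regexec(pattern, text))
    # A successful match has two elements: the full match and the digits
    vapply(hit, function(m) if (length(m) == 2) as.numeric(m[2]) else NA_real_,
           numeric(1))
  }
  data.frame(CStage = get_stage("C"), MStage = get_stage("M"))
}

findings <- c("Barrett's oesophagus C2M4 seen.", "Normal oesophagus.")
prague_sketch(findings)
```

Reports without an explicit score come back as NA in both columns, which is the situation the follow-up function has to work around.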
Function to encapsulate all the Barrett's functions together. This includes the Prague score and the worst pathological grade and then feeds both of these things into the follow up function. The output is a dataframe with all the original data as well as the new columns that have been created.
BarrettsAll( Endodataframe, EndoReportColumn, EndoReportColumn2, Pathdataframe, PathColumn )
Endodataframe |
endoscopy dataframe of interest |
EndoReportColumn |
Endoscopy report field of interest as a string vector |
EndoReportColumn2 |
Second endoscopy report field of interest as a string vector |
Pathdataframe |
pathology dataframe of interest |
PathColumn |
Pathology report field of interest as a string vector |
Newdf
Other Disease Specific Analysis - Barretts Data: BarrettsBxQual(), BarrettsParisEMR(), Barretts_FUType(), Barretts_PathStage(), Barretts_PragueScore()
Barretts_df <- BarrettsAll(Myendo, "Findings", "OGDReportWhole", Mypath, "Histology")
This function gets the number of biopsies taken per endoscopy and compares it to the Prague score for that endoscopy. Endoscopists should be taking a certain number of biopsies given the length of a Barrett's segment, so it should be straightforward to detect a shortfall in the number of biopsies being taken. The output is the shortfall per endoscopist.
BarrettsBxQual(dataframe, Endo_ResultPerformed, PatientID, Endoscopist)
dataframe |
dataframe |
Endo_ResultPerformed |
Date of the Endoscopy |
PatientID |
Patient's unique identifier |
Endoscopist |
name of the column with the Endoscopist names |
Other Disease Specific Analysis - Barretts Data: BarrettsAll(), BarrettsParisEMR(), Barretts_FUType(), Barretts_PathStage(), Barretts_PragueScore()
# Firstly relevant columns are extrapolated from the
# Mypath demo dataset. These functions are all part of Histology data
# cleaning as part of the package.
Mypath$NumBx <- HistolNumbOfBx(Mypath$Macroscopicdescription, "specimen")
Mypath$BxSize <- HistolBxSize(Mypath$Macroscopicdescription)
# The histology is then merged with the Endoscopy dataset. The merge occurs
# according to date and Hospital number
v <- Endomerge2(
  Myendo, "Dateofprocedure", "HospitalNumber",
  Mypath, "Dateofprocedure", "HospitalNumber"
)
# The function relies on the other Barrett's functions being run as well:
b1 <- Barretts_PragueScore(v, "Findings")
b1$PathStage <- Barretts_PathStage(b1, "Histology")
# The follow-up group depends on the histology and the Prague score for a
# patient so it takes the processed Barrett's data and then looks in the
# Findings column for permutations of the Prague score.
b1$FU_Type <- Barretts_FUType(b1, "CStage", "MStage", "PathStage")
colnames(b1)[colnames(b1) == "pHospitalNum"] <- "HospitalNumber"
# The average number of biopsies is then calculated and
# compared to the average Prague C score so that those who are taking
# too few biopsies can be determined
hh <- BarrettsBxQual(
  b1, "Date.x", "HospitalNumber",
  "Endoscopist"
)
rm(v)
This creates a column of Paris grade for all samples where this is mentioned.
BarrettsParisEMR(Column, Column2)
Column |
Endoscopy report field of interest as a string vector |
Column2 |
Another endoscopy report field of interest as a string vector |
a string vector
Other Disease Specific Analysis - Barretts Data: BarrettsAll(), BarrettsBxQual(), Barretts_FUType(), Barretts_PathStage(), Barretts_PragueScore()
# Myendo$EMR<-BarrettsParisEMR(Myendo$ProcedurePerformed,Myendo$Findings)
This function returns all the conversions from common versions of events to a standardised event list, much like the Location standardisation function. This does not include EMR as this is extracted from the pathology and so is part of pathology type. It is used for automated OPCS-4 coding.
BiopsyIndex()
Other NLP - Lexicons: EventList(), GISymptomsList(), HistolType(), LocationListLower(), LocationListUniversal(), LocationListUpper(), LocationList(), RFACath(), WordsToNumbers()
This creates a proportion table for categorical variables by endoscopist. It relies on an Endoscopist column being present.
CategoricalByEndoscopist(ProportionColumn, EndoscopistColumn)
ProportionColumn |
The column (categorical data) of interest |
EndoscopistColumn |
The endoscopist column |
Other Grouping by endoscopist:
MetricByEndoscopist()
# Firstly relevant columns are extrapolated from the
# Mypath demo dataset. These functions are all part of Histology data
# cleaning as part of the package.
v <- Mypath
v$NumBx <- HistolNumbOfBx(Mypath$Macroscopicdescription, "specimen")
v$BxSize <- HistolBxSize(v$Macroscopicdescription)
# The histology is then merged with the Endoscopy dataset. The merge occurs
# according to date and Hospital number
v <- Endomerge2(
  Myendo, "Dateofprocedure", "HospitalNumber",
  v, "Dateofprocedure", "HospitalNumber"
)
# The function relies on the other Barrett's functions being run as well:
v$IMorNoIM <- Barretts_PathStage(v, "Histology")
colnames(v)[colnames(v) == "pHospitalNum"] <- "HospitalNumber"
# The function takes the column with the extracted worst grade of
# histopathology and returns the proportion of each finding (ie
# proportion with low grade dysplasia, high grade etc.) for each
# endoscopist
kk <- CategoricalByEndoscopist(v$IMorNoIM, v$Endoscopist)
rm(Myendo)
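For readers who want to see what a proportion table by endoscopist amounts to, the following base R sketch builds one from made-up data. It is only an illustration of the idea behind CategoricalByEndoscopist; the vectors and the grade labels are invented for the example and the package's own output format may differ.

```r
# Illustrative sketch only: made-up data, base R rather than the
# package's own implementation.
grade       <- c("IM", "No_IM", "IM", "LGD", "IM", "No_IM")
endoscopist <- c("Dr A", "Dr A", "Dr A", "Dr B", "Dr B", "Dr B")

# Row-wise proportions: each endoscopist's findings sum to 1
prop_by_endoscopist <- prop.table(table(endoscopist, grade), margin = 1)
round(prop_by_endoscopist, 2)
```

Using `margin = 1` normalises within each endoscopist, so endoscopists with different caseloads can be compared directly.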
A dataset containing fake lower GI endoscopy reports. The report field is provided as a whole report without any fields having been extracted.
ColonFinal
A data frame with 2000 rows and 1 variable:
The whole report, in text
This does a general clean up of whitespace, semi-colons and full stops at the start of lines, and converts end-of-sentence full stops to new lines.
ColumnCleanUp(vector)
vector |
column of interest |
This returns a character vector
Other NLP - Text Cleaning and Extraction: DictionaryInPlaceReplace(), Extractor(), NegativeRemoveWrapper(), NegativeRemove(), textPrep()
ii<-ColumnCleanUp(Myendo$Findings)
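The kind of clean up described for ColumnCleanUp can be approximated with a few `gsub` calls. The sketch below is a guess at the general shape of such a routine, not the function's actual rules; the name `clean_sketch` and each substitution are assumptions for illustration.

```r
# Illustrative sketch only: these substitutions are assumptions about
# the kind of clean-up described, not ColumnCleanUp's actual rules.
clean_sketch <- function(x) {
  x <- gsub(";", " ", x, fixed = TRUE)    # drop semi-colons
  x <- gsub("[[:space:]]+", " ", x)       # collapse runs of whitespace
  x <- gsub("^[[:space:].]+", "", x)      # strip leading full stops and spaces
  x <- gsub("\\.[[:space:]]+", "\n", x)   # sentence-ending full stop -> new line
  trimws(x)
}

clean_sketch(". Barrett's   seen; biopsies taken.  No ulcer.")
```

Putting each sentence on its own line makes later sentence-level steps, such as removing negated sentences, simpler to write.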
This function extracts the OPCS-4 codes for all Barrett's procedures. It should take the OPCS-4 code from the EVENT and perhaps also use the extent, depending on how the coding is done. The EVENT column will need to extract multiple findings. The hope is that the OPCS-4 column will then map from the EVENT column. This returns a nested list column with the procedure, furthest path site and event performed.
dev_ExtrapolateOPCS4Prep(dataframe, Procedure, PathSite, Event, extentofexam)
dataframe |
the dataframe |
Procedure |
The Procedure column |
PathSite |
The column containing the Pathology site |
Event |
the EVENT column |
extentofexam |
the furthest point reached in the examination |
# Need to run the HistolTypeSite and EndoscopyEvent functions first here
# SelfOGD_Dunn$OPCS4w<-ExtrapolateOPCS4Prep(SelfOGD_Dunn,"PROCEDUREPERFORMED",
# "PathSite","EndoscopyEvent")
This maps terms in the text and replaces them with the standardised term (mapped in the lexicon file) within the text. It is used within the textPrep function.
DictionaryInPlaceReplace(inputString, list)
inputString |
the input string (ie the full medical report) |
list |
The replacing list |
This returns a character vector
Other NLP - Text Cleaning and Extraction: ColumnCleanUp(), Extractor(), NegativeRemoveWrapper(), NegativeRemove(), textPrep()
inputText<-DictionaryInPlaceReplace(TheOGDReportFinal$OGDReportWhole,LocationList())
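In-place term standardisation of the sort DictionaryInPlaceReplace performs can be sketched as a loop over a lexicon of pattern and replacement pairs. The example below is illustrative only: `replace_in_place_sketch` and the two-entry lexicon are hypothetical, and the structure of the real lists returned by functions such as LocationList() is not assumed here.

```r
# Illustrative sketch only: the lexicon below is hypothetical and is not
# the list returned by LocationList().
replace_in_place_sketch <- function(text, lexicon) {
  # names(lexicon) are patterns to find; values are the standard terms
  for (pattern in names(lexicon)) {
    text <- gsub(pattern, lexicon[[pattern]], text, ignore.case = TRUE)
  }
  text
}

lexicon <- list("\\bgoj\\b" = "GOJ", "o?esophag[a-z]*" = "Oesophagus")
replace_in_place_sketch("Esophageal biopsies taken at the goj", lexicon)
```

The text keeps its original shape with only the matched terms swapped, which is the difference from an extraction step that returns the terms in a new column.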
This creates a basic graph using the template specified in theme_Publication. It takes a numeric column and plots it against any non-numeric x axis in a ggplot
EndoBasicGraph(dataframe, xdata, number)
dataframe |
dataframe |
xdata |
The x column |
number |
The numeric column |
Myplot This is the final plot
Other Data Presentation helpers: scale_colour_Publication(), scale_fill_Publication(), theme_Publication()
# This function plots numeric y vs non-numeric x
# Get some numeric columns e.g. number of biopsies and size
Mypath$Size <- HistolBxSize(Mypath$Macroscopicdescription)
Mypath$NumBx <- HistolNumbOfBx(Mypath$Macroscopicdescription, "specimen")
Mypath2 <- Mypath[, c("NumBx", "Size")]
EndoBasicGraph(Mypath, "Size", "NumBx")
This takes the endoscopy dataset's date performed and hospital number columns and merges them with the equivalent columns in the pathology dataset. The merge uses a 7 day time frame as pathology is often reported after the endoscopy.
Endomerge2(x, EndoDate, EndoHospNumber, y, PathDate, PathHospNumber)
x |
Endoscopy dataframe |
EndoDate |
The date the endoscopy was performed |
EndoHospNumber |
The unique hospital number in the endoscopy dataset |
y |
Histopathology dataframe |
PathDate |
The date the endoscopy was performed |
PathHospNumber |
The unique hospital number in the pathology dataset |
v <- Endomerge2(
  Myendo, "Dateofprocedure", "HospitalNumber",
  Mypath, "Dateofprocedure", "HospitalNumber"
)
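A merge on hospital number restricted to a 7 day window can be sketched in base R as a join followed by a date filter. This is an illustration under stated assumptions rather than Endomerge2's own logic: the data are invented, the column names are hypothetical, and the window is assumed to mean that the pathology date falls on the endoscopy date or up to 7 days after it.

```r
# Illustrative sketch only: made-up data and column names; Endomerge2's
# own matching logic may differ.
endo <- data.frame(
  HospitalNumber = c("A1", "B2"),
  EndoDate = as.Date(c("2020-01-01", "2020-02-01"))
)
path <- data.frame(
  HospitalNumber = c("A1", "B2"),
  PathDate = as.Date(c("2020-01-04", "2020-03-15")),
  Histology = c("Intestinal metaplasia", "Normal")
)

# Join on hospital number, then keep pairs dated within 7 days of each other
merged <- merge(endo, path, by = "HospitalNumber")
gap <- as.numeric(merged$PathDate - merged$EndoDate)
merged <- merged[gap >= 0 & gap <= 7, ]
merged
```

Only patient A1 survives the filter here, because the pathology for B2 is dated more than six weeks after the endoscopy.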
The goal of EndoMineR is to extract as much information as possible from endoscopy reports and their associated pathology specimens. The package is intended for use by gastroenterologists, pathologists and anyone interested in the analysis of endoscopic and pathological datasets. Gastroenterology now has many standards against which practice is measured, although many reporting systems do not include the reporting capability to give anything more than basic analysis. Much of the data is locked in semi-structured text. However, the nature of semi-structured text means that data can be extracted in a standardised way; it just requires more manipulation. This package provides that manipulation so that complex endoscopic-pathological analyses, in line with recognised standards for these analyses, can be done. The package is in three parts:
Extraction: this applies when the data is provided as full text reports. You may already have the data in a spreadsheet, in which case this part isn't necessary.
Cleaning: these are a group of functions that allow the user to extract and clean data commonly found in endoscopic and pathology reports. The cleaning functions usually remove common typos or extraneous information and do some reformatting.
Analyses: the analyses provide graphing functions as well as analyses according to the cornerstone questions in gastroenterology, namely surveillance, patient tracking, quality of endoscopy and pathology reporting, and diagnostic yield.
To learn more about EndoMineR, start with the vignettes: 'browseVignettes(package = "EndoMineR")'
As spreadsheets are likely to be submitted with pre-segregated data, as it appears from endoscopy software output, these should be remerged prior to cleaning. This function takes the column headers and places them before each text so that the original full text is recreated. It will use the column headers as the delimiters. This should be used before textPrep, as the textPrep function takes only a character vector (ie the whole report and not a segregated one).
EndoPaste(x)
x |
the dataframe |
This returns a list with a dataframe containing one column of the merged text and a character vector which is the delimiter list for when the textPrep function is used
testList <- structure(
  list(
    PatientName = c("Tom Hardy", "Elma Fudd", "Bingo Man"),
    HospitalNumber = c("H55435", "Y3425345", "Z343424"),
    Text = c("All bad. Not good", "Serious issues", "from a land far away")
  ),
  class = "data.frame", row.names = c(NA, -3L)
)
EndoPaste(testList)
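The re-merging step described for EndoPaste, where each column header is placed in front of its text and the headers double as the delimiter list, can be sketched as follows. This is a base R illustration and not the EndoPaste implementation; the function name `paste_sketch`, the output element names `merged` and `delim`, and the use of a new line between fields are all assumptions.

```r
# Illustrative sketch only: base R re-creation of the idea, not the
# EndoPaste implementation.
paste_sketch <- function(df) {
  delim <- names(df)
  # Prefix every cell with its column header
  labelled <- Map(function(header, column) paste(header, column), delim, df)
  # Join the labelled cells of each row into one block of text
  text <- do.call(paste, c(unname(labelled), sep = "\n"))
  list(merged = data.frame(Text = text, stringsAsFactors = FALSE),
       delim = delim)
}

demo <- data.frame(
  PatientName = c("Tom Hardy", "Elma Fudd"),
  Text = c("All bad. Not good", "Serious issues"),
  stringsAsFactors = FALSE
)
out <- paste_sketch(demo)
out$merged$Text[1]
out$delim
```

Returning the header names alongside the text means the same vector can be reused later as the list of delimiters when the whole report is split again.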
If an endoscopist column is part of the dataset once the Extractor function has been used, this cleans the endoscopist column from the report. It gets rid of titles and of common entries that are not needed. It should be used after the textPrep function.
EndoscEndoscopist(EndoscopistColumn)
EndoscopistColumn |
The endoscopy text column |
This returns a character vector
Other Endoscopy specific cleaning functions: EndoscInstrument(), EndoscMeds(), EndoscopyEvent()
Myendo$Endoscopist <- EndoscEndoscopist(Myendo$Endoscopist)
This cleans the Instrument column from the report, assuming such a column exists (where instrument usually refers to the endoscope number being used). It gets rid of common entries that are not needed. It should be used after the textPrep function. Note this is possibly going to be deprecated in the next version as the endoscope coding used here is not widely used.
EndoscInstrument(EndoInstrument)
EndoInstrument |
column of interest |
This returns a character vector
Other Endoscopy specific cleaning functions: EndoscEndoscopist(), EndoscMeds(), EndoscopyEvent()
Myendo$Instrument <- EndoscInstrument(Myendo$Instrument)
This cleans the medication column from the report, assuming such a column exists. It gets rid of common entries that are not needed. It also splits the medication into fentanyl and midazolam numeric doses for use. It should be used after the textPrep function.
EndoscMeds(MedColumn)
MedColumn |
column of interest as a string vector |
This returns a dataframe
Other Endoscopy specific cleaning functions: EndoscEndoscopist(), EndoscInstrument(), EndoscopyEvent()
MyendoNew <- cbind(EndoscMeds(Myendo$Medications), Myendo)
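Splitting a medication string into numeric fentanyl and midazolam doses can be pictured as one regular expression per drug. The sketch below is a simplified illustration, not the code in EndoscMeds: the helper `dose_sketch`, the output column names and the assumption that the dose immediately follows the drug name are all made up for the example.

```r
# Illustrative sketch only: the patterns are assumptions, not the ones
# used by EndoscMeds.
dose_sketch <- function(meds, drug) {
  # Look for the drug name followed by a number, e.g. "Fentanyl 50mcg"
  pattern <- paste0(drug, "[[:space:]]*([0-9]+\\.?[0-9]*)")
  hit <- regmatches(meds, regexec(pattern, meds, ignore.case = TRUE))
  # A successful match has two elements: the full match and the dose
  vapply(hit, function(m) if (length(m) == 2) as.numeric(m[2]) else NA_real_,
         numeric(1))
}

meds <- c("Fentanyl 50mcg, Midazolam 2.5mg", "Midazolam 1mg", "None")
data.frame(
  Fent  = dose_sketch(meds, "fentanyl"),
  Midaz = dose_sketch(meds, "midazolam")
)
```

Procedures where a drug was not given return NA for that drug, which keeps the columns numeric for later summaries.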
This extracts the endoscopic event. It looks for the event term and then looks in the event sentence as well as the one above to see if the location is listed. It only looks within the endoscopy fields. If tissue is taken then this will be extracted with the HistolTypeAndSite function rather than being listed as a result as this is cleaner and more robust.
EndoscopyEvent(dataframe, EventColumn1, Procedure, Macroscopic, Histology)
dataframe |
dataframe of interest |
EventColumn1 |
The relevant endoscopy free text column describing the findings |
Procedure |
Column saying which procedure was performed |
Macroscopic |
Column describing all the macroscopic specimens |
Histology |
Column with free text histology (usually microscopic histology) |
This returns a character vector
Other Endoscopy specific cleaning functions: EndoscEndoscopist(), EndoscInstrument(), EndoscMeds()
# Myendo$EndoscopyEvent<-EndoscopyEvent(Myendo,"Findings",
# "ProcedurePerformed","MACROSCOPICALDESCRIPTION","HISTOLOGY")
See if words from two lists co-exist within a sentence. Eg site and tissue type. This function only looks in one sentence for the two terms. If you suspect the terms may occur in adjacent sentences then use the EntityPairs_TwoSentence function.
EntityPairs_OneSentence(inputText, list1, list2)
inputText |
The relevant pathology text column |
list1 |
First list to refer to |
list2 |
The second list to look for |
Other Basic Column mutators: EntityPairs_TwoSentence(), ExtrapolatefromDictionary(), ListLookup(), MyImgLibrary()
# tbb<-EntityPairs_OneSentence(Mypath$Histology,HistolType(),LocationList())
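Checking whether terms from two lists co-exist within one sentence can be sketched as a sentence split followed by two lookups. The code below is an illustration of that idea only; `pairs_sketch`, the `site:type` output format and the plain word vectors (standing in for lexicons such as HistolType() and LocationList()) are assumptions, not the package's behaviour.

```r
# Illustrative sketch only: plain word lists stand in for the package
# lexicons such as HistolType() and LocationList().
pairs_sketch <- function(text, list1, list2) {
  sentences <- trimws(unlist(strsplit(text, "[.\n]+")))
  out <- character(0)
  for (s in sentences) {
    found1 <- list1[vapply(list1, grepl, logical(1), x = s, ignore.case = TRUE)]
    found2 <- list2[vapply(list2, grepl, logical(1), x = s, ignore.case = TRUE)]
    # Record every list1:list2 combination that shares this sentence
    if (length(found1) > 0 && length(found2) > 0) {
      out <- c(out, as.vector(outer(found1, found2, paste, sep = ":")))
    }
  }
  out
}

pairs_sketch(
  "Stomach biopsy taken. Polyp seen. Duodenum normal.",
  c("stomach", "duodenum"),
  c("biopsy", "polyp")
)
```

Here the polyp is not paired with any site because no site is named in its sentence, which is the case the two-sentence variant is meant to cover.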
This is used to look for relationships between site and event, especially for endoscopy events described in sentences such as 'The stomach polyp was large. It was removed with a snare', ie where the therapy and the site are in two different sentences.
EntityPairs_TwoSentence(inputString, list1, list2)
inputString |
The relevant pathology text column |
list1 |
The initial list to assess |
list2 |
The other list to look for |
Other Basic Column mutators: EntityPairs_OneSentence(), ExtrapolatefromDictionary(), ListLookup(), MyImgLibrary()
# tbb<-EntityPairs_TwoSentence(Myendo$Findings,EventList(),HistolType())
The aim is to extract a C and M stage (Prague score) for Barrett's samples. This is done using a regex where C and M stages are explicitly mentioned in the free text. Specifically, it extracts the Prague score.
Eosinophilics(dataframe, findings, histol, IndicationsFroExamination)
dataframe |
dataframe with column of interest |
findings |
column of interest |
histol |
second column of interest |
IndicationsFroExamination |
second column of interest |
# Firstly relevant columns are extrapolated from the
# Mypath demo dataset. These functions are all part of Histology data
# cleaning as part of the package.
v <- Mypath
v$NumBx <- HistolNumbOfBx(v$Macroscopicdescription, "specimen")
v$BxSize <- HistolBxSize(v$Macroscopicdescription)
# The histology is then merged with the Endoscopy dataset. The merge occurs
# according to date and Hospital number
v <- Endomerge2(
  Myendo, "Dateofprocedure", "HospitalNumber",
  v, "Dateofprocedure", "HospitalNumber"
)
aa <- Eosinophilics(v, "Findings", "Histology", "Indications")
This function returns all the conversions from common versions of events to a standardised event list, much like the Location standardisation function. This does not include EMR as this is extracted from the pathology and so is part of pathology type.
EventList()
Other NLP - Lexicons: BiopsyIndex(), GISymptomsList(), HistolType(), LocationListLower(), LocationListUniversal(), LocationListUpper(), LocationList(), RFACath(), WordsToNumbers()
# unique(unlist(EventList(), use.names = FALSE))
This is the main extractor for the Endoscopy and Histology report. This relies on the user creating a list of words representing the subheadings. The list is then fed to the Extractor so that it acts as the beginning and the end of the regex used to split the text. Whatever has been specified in the list is used as a column header. Column headers don't tolerate special characters like : or ? and / and don't allow numbers as the start character so these have to be dealt with in the text before processing
Extractor(inputString, delim)
inputString |
the column to extract from |
delim |
the vector of words that will be used as the boundaries to extract against |
Other NLP - Text Cleaning and Extraction: ColumnCleanUp(), DictionaryInPlaceReplace(), NegativeRemoveWrapper(), NegativeRemove(), textPrep()
# As column names can't start with a number, one of the dividing
# words has to be converted
# A list of dividing words (which will also act as column names)
# is then constructed
mywords <- c(
  "Hospital Number", "Patient Name:", "DOB:", "General Practitioner:",
  "Date received:", "Clinical Details:", "Macroscopic description:",
  "Histology:", "Diagnosis:"
)
Mypath2 <- Extractor(PathDataFrameFinal$PathReportWhole, mywords)
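The way a list of subheadings can act as both the start and the end boundary for each extracted field can be sketched as below. This is a simplified base R illustration for a single report, not the Extractor implementation; `extract_sketch` is a hypothetical name, the delimiters are matched as fixed strings, and they are assumed to appear in the report in the order given.

```r
# Illustrative sketch only: not the Extractor implementation.
extract_sketch <- function(report, delim) {
  out <- list()
  for (i in seq_along(delim)) {
    start <- as.integer(regexpr(delim[i], report, fixed = TRUE))
    if (start < 0) {
      out[[delim[i]]] <- NA_character_
      next
    }
    from <- start + nchar(delim[i])
    # Text runs until the next delimiter, or to the end of the report
    to <- nchar(report)
    if (i < length(delim)) {
      nxt <- as.integer(regexpr(delim[i + 1], report, fixed = TRUE))
      if (nxt > 0) to <- nxt - 1
    }
    out[[delim[i]]] <- trimws(substr(report, from, to))
  }
  # Delimiters become column names (made syntactically valid by R)
  as.data.frame(out, stringsAsFactors = FALSE)
}

report <- "Hospital Number H123 Histology Intestinal metaplasia Diagnosis Barrett's"
res <- extract_sketch(report, c("Hospital Number", "Histology", "Diagnosis"))
res
```

Since the delimiters end up as column names, characters that R does not allow in names get altered on the way out, which is why the description asks for them to be dealt with in the text beforehand.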
Provides term mapping and extraction in one. Standardises any term according to a mapping lexicon provided and then extracts the term. This is different to the DictionaryInPlaceReplace in that it provides a new column with the extracted terms as opposed to changing it in place
ExtrapolatefromDictionary(inputString, list)
inputString |
The text string to process |
list |
of words to iterate through |
Other Basic Column mutators: EntityPairs_OneSentence(), EntityPairs_TwoSentence(), ListLookup(), MyImgLibrary()
# Firstly we extract histology from the raw report
# The function then standardises the histology terms through a series of
# regular expressions and then extracts the type of tissue
Mypath$Tissue <- suppressWarnings(
  suppressMessages(
    ExtrapolatefromDictionary(Mypath$Histology, HistolType())
  )
)
rm(MypathExtraction)
This function returns all the common GI symptoms. They are simply listed as is without grouping or mapping. They have been derived from a manual list with synonyms derived from the UMLS Metathesaurus using the browser.
GISymptomsList()
Other NLP - Lexicons: BiopsyIndex(), EventList(), HistolType(), LocationListLower(), LocationListUniversal(), LocationListUpper(), LocationList(), RFACath(), WordsToNumbers()
This extracts the polyp types from the data (for colonoscopy and flexible sigmoidoscopy data) and outputs the adenoma, adenocarcinoma and hyperplastic detection rate by endoscopist as well as the overall number of colonoscopies. This will be extended to other GRS outputs in the future.
GRS_Type_Assess_By_Unit(dataframe, ProcPerformed, Endo_Endoscopist, Dx, Histol)
dataframe | The dataframe |
ProcPerformed | The column containing the Procedure type performed |
Endo_Endoscopist | column containing the Endoscopist name |
Dx | The column with the Histological diagnosis |
Histol | The column with the Histology text in it |
nn <- GRS_Type_Assess_By_Unit(
  vColon, "ProcedurePerformed",
  "Endoscopist", "Diagnosis", "Original.y"
)
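To make the idea of a detection rate by endoscopist concrete, the following base R sketch computes the percentage of procedures with each polyp type on a toy dataset. The data, the column names and the choice to count one row per procedure are assumptions for illustration only; the package function may define the rates differently.

```r
# Toy data: one row per colonoscopy (illustrative only).
colons <- data.frame(
  Endoscopist = c("A", "A", "A", "B", "B"),
  Histology = c(
    "tubular adenoma", "normal mucosa", "hyperplastic polyp",
    "adenocarcinoma", "normal mucosa"
  ),
  stringsAsFactors = FALSE
)

# Flag each polyp type from the histology text.
colons$Adenoma <- grepl("adenoma", colons$Histology, ignore.case = TRUE)
colons$Adenocarcinoma <- grepl("adenocarcinoma", colons$Histology,
  ignore.case = TRUE
)
colons$Hyperplastic <- grepl("hyperplastic", colons$Histology,
  ignore.case = TRUE
)

# Percentage of each endoscopist's procedures with each finding.
rates <- aggregate(
  cbind(Adenoma, Adenocarcinoma, Hyperplastic) ~ Endoscopist,
  data = colons, FUN = function(x) 100 * mean(x)
)

# Overall number of colonoscopies per endoscopist.
rates$Colonoscopies <- as.vector(table(colons$Endoscopist)[rates$Endoscopist])
```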
This extracts the biopsy size from the report. If there are multiple biopsies it will extract the overall size of each one (size is calculated usually in cubic mm from the three dimensions provided). This will result in row duplication.
HistolBxSize(MacroColumn)
MacroColumn | Macdescrip |
This is usually from the Macroscopic description column.
Other Histology specific cleaning functions: HistolNumbOfBx(), HistolTypeAndSite()
rr <- HistolBxSize(Mypath$Macroscopicdescription)
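The size calculation from three dimensions can be sketched in base R as below. This is an illustration under assumptions, not the package's implementation: the pattern `dims_pattern` and the "a x b x c mm" layout are hypothetical, and only the first set of dimensions in each report is used, whereas HistolBxSize is described as handling multiple biopsies.

```r
macro <- c(
  "Three pieces of tissue, the largest measuring 3 x 2 x 1 mm.",
  "Single fragment 4x2x2mm.",
  "No measurements given."
)

# Capture three dimensions written as "a x b x c mm".
dims_pattern <- "([0-9]+)\\s*x\\s*([0-9]+)\\s*x\\s*([0-9]+)\\s*mm"
matches <- regmatches(macro, regexec(dims_pattern, macro))

# Multiply the three captured dimensions to get a size in cubic mm;
# reports without dimensions give NA.
bx_size <- vapply(matches, function(m) {
  if (length(m) == 0) {
    return(NA_real_)
  }
  prod(as.numeric(m[2:4]))
}, numeric(1))
```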
This extracts the number of biopsies taken from the pathology report. This is usually from the Macroscopic description column. It collects everything from the regex [0-9]{1,2}.{0,3} to whatever the string boundary is (z).
HistolNumbOfBx(inputString, regString)
inputString | The input text to process |
regString | The keyword to remove and to stop at in the regex |
Other Histology specific cleaning functions: HistolBxSize(), HistolTypeAndSite()
qq <- HistolNumbOfBx(Mypath$Macroscopicdescription, "specimen")
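A minimal base R sketch of the regex approach is shown below, assuming the pattern is a one- or two-digit number, up to three further characters, and then the keyword. The helper name `count_bx` is hypothetical, and summing the numbers when a report mentions several containers is an assumption of this sketch rather than documented behaviour.

```r
macro <- c(
  "3 specimens received. 2 specimens in formalin.",
  "1 specimen, oesophagus",
  "Nil"
)

count_bx <- function(text, keyword) {
  # A number, up to three arbitrary characters, then the keyword.
  pattern <- paste0("[0-9]{1,2}.{0,3}", keyword)
  vapply(regmatches(text, gregexpr(pattern, text)), function(hits) {
    if (length(hits) == 0) {
      return(NA_real_)
    }
    # Keep the leading number of each hit and add them up.
    sum(as.numeric(sub("^([0-9]{1,2}).*$", "\\1", hits)))
  }, numeric(1))
}

num_bx <- count_bx(macro, "specimen")
```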
This standardizes terms to describe the pathology tissue type being examined
HistolType()
Other NLP - Lexicons: BiopsyIndex(), EventList(), GISymptomsList(), LocationListLower(), LocationListUniversal(), LocationListUpper(), LocationList(), RFACath(), WordsToNumbers()
This needs some blurb to be written. Used in the OPCS4 coding
HistolTypeAndSite(inputString1, inputString2, procedureString)
inputString1 | The first column to look in |
inputString2 | The second column to look in |
procedureString | The column with the procedure in it |
A list with two columns: one is the type and site and the other is the index to be used for OPCS4 coding later if needed.
Other Histology specific cleaning functions: HistolBxSize(), HistolNumbOfBx()
Myendo2 <- Endomerge2(Myendo, 'Dateofprocedure', 'HospitalNumber',
  Mypath, 'Dateofprocedure', 'HospitalNumber')
PathSiteAndType <- HistolTypeAndSite(Myendo2$PathReportWhole,
  Myendo2$Macroscopicdescription, Myendo2$ProcedurePerformed)
Get an overall idea of how many endoscopies have been done for an indication by year and month. This is a more involved version of the SurveilCapacity function. It takes a string to search for in the Indication for the test.
HowManyOverTime(dataframe, Indication, Endo_ResultPerformed, StringToSearch)
dataframe | dataframe |
Indication | Indication column |
Endo_ResultPerformed | column containing date the Endoscopy was performed |
StringToSearch | The string in the Indication to search for |
This returns a list which contains a plot (number of tests for that indication over time) and a table with the same information broken down by month and year.
Other Basic Analysis - Surveillance Functions: SurveilFirstTest(), SurveilLastTest(), SurveilTimeByRow(), TimeToStatus()
# This takes the dataframe Myendo (part of the package examples) and looks in
# the column which holds the test indication (in this example it is called
# 'Indications'). The date of the procedure column (which can be date format
# or POSIX format) is also necessary. Finally the string which indicates the
# text indication needs to be input. In this case we are looking for all
# endoscopies done where the indication is surveillance (so searching on
# 'Surv' will do fine).
# If you want all the tests then put '.*' instead of Surv
rm(list = ls(all = TRUE))
ff <- HowManyOverTime(Myendo, "Indications", "Dateofprocedure", ".*")
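The tabulation part of this (tests matching an indication, counted by year and month) can be sketched in base R as follows. The toy data and the helper `count_by_month` are hypothetical and only mirror the table half of the output; no plot is produced.

```r
endo <- data.frame(
  Indications = c(
    "Surveillance Barrett's", "Dysphagia", "Surveillance",
    "Surv - polyp", "Anaemia"
  ),
  Dateofprocedure = as.Date(c(
    "2015-01-10", "2015-01-20", "2015-01-25",
    "2016-01-15", "2016-01-30"
  )),
  stringsAsFactors = FALSE
)

count_by_month <- function(df, indication_col, date_col, pattern) {
  # Keep only the rows whose indication matches the search string.
  keep <- df[grepl(pattern, df[[indication_col]]), ]
  year <- as.integer(format(keep[[date_col]], "%Y"))
  month <- as.integer(format(keep[[date_col]], "%m"))
  # Count the rows in each year/month combination.
  out <- aggregate(keep[[indication_col]],
    by = list(year = year, month = month), FUN = length
  )
  names(out)[3] <- "freq"
  out[order(out$year, out$month), ]
}

surv <- count_by_month(endo, "Indications", "Dateofprocedure", "Surv")
all_tests <- count_by_month(endo, "Indications", "Dateofprocedure", ".*")
```

Searching on ".*" matches every row, which is why it can be used to count all tests regardless of indication.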
This extracts all of the relevant IBD scores where present from the medical text.
IBD_Scores(inputColumn1)
inputColumn1 | column of interest as a string vector |
This returns a dataframe with all the scores in it
# Example to be provided
The aim here is simply to produce a document term matrix to get the frequency of all the words, extract the words you are interested in (with tofind), find which reports contain those words, and then find what proportion of the reports contain those terms.
ListLookup(theframe, EndoReportColumn, myNotableWords)
theframe | the dataframe |
EndoReportColumn | the column of interest |
myNotableWords | list of words you are interested in |
Other Basic Column mutators: EntityPairs_OneSentence(), EntityPairs_TwoSentence(), ExtrapolatefromDictionary(), MyImgLibrary()
# The function relies on defining a list of
# words you are interested in and then choosing the column you are
# interested in looking in for these words. This can be for histopathology
# free text columns or endoscopic. In this example it is for endoscopic
# columns
myNotableWords <- c("arrett", "oeliac")
jj <- ListLookup(Myendo, "Findings", myNotableWords)
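The end result (the proportion of reports that mention each word of interest) can be approximated in a few lines of base R. Note that this sketch swaps the document term matrix for a plain `grepl` search, so it illustrates the output rather than the package's method; the `findings` vector is made-up data.

```r
findings <- c(
  "Barrett's oesophagus C2M4",
  "Normal duodenum, no coeliac features",
  "Short segment Barrett's",
  "Gastritis"
)
myNotableWords <- c("arrett", "oeliac")

# Percentage of reports that contain each word of interest.
prop <- vapply(
  myNotableWords,
  function(w) 100 * mean(grepl(w, findings, ignore.case = TRUE)),
  numeric(1)
)
```

Dropping the first letter of each search term, as in the package example, avoids having to deal with capitalisation ("Barrett" versus "barrett").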
This is a list of standard locations at endoscopy. It is used for the site of biopsies/EMRs and potentially in functions looking at the site of a therapeutic event. It just returns the list in the function.
LocationList()
Other NLP - Lexicons: BiopsyIndex(), EventList(), GISymptomsList(), HistolType(), LocationListLower(), LocationListUniversal(), LocationListUpper(), RFACath(), WordsToNumbers()
This is a list of standard locations at endoscopy that is used in the extraction of the site of biopsies/EMRs and potentially in functions looking at the site of a therapeutic event. It just returns the list in the function.
LocationListLower()
Other NLP - Lexicons: BiopsyIndex(), EventList(), GISymptomsList(), HistolType(), LocationListUniversal(), LocationListUpper(), LocationList(), RFACath(), WordsToNumbers()
This is a list of standard locations at endoscopy that is used in the extraction of the site of biopsies/EMRs and potentially in functions looking at the site of a therapeutic event. It just returns the list in the function.
LocationListUniversal()
Other NLP - Lexicons: BiopsyIndex(), EventList(), GISymptomsList(), HistolType(), LocationListLower(), LocationListUpper(), LocationList(), RFACath(), WordsToNumbers()
This is a list of standard locations at endoscopy that is used in the extraction of the site of biopsies/EMRs and potentially in functions looking at the site of a therapeutic event. It just returns the list in the function.
LocationListUpper()
Other NLP - Lexicons: BiopsyIndex(), EventList(), GISymptomsList(), HistolType(), LocationListLower(), LocationListUniversal(), LocationList(), RFACath(), WordsToNumbers()
This takes any of the numerical metrics in the dataset and plots it by endoscopist. It of course relies on an Endoscopist column being present.
MetricByEndoscopist(dataframe, Column, EndoscopistColumn)
dataframe | The dataframe |
Column | The column (numeric data) of interest |
EndoscopistColumn | The endoscopist column |
Other Grouping by endoscopist: CategoricalByEndoscopist()
# The function gives a table with any numeric
# metric by endoscopist
# In this example we tabulate medication by
# endoscopist
# Let's bind the output of EndoscMeds to the main dataframe so we
# have a complete dataframe with all the meds extracted
MyendoNew <- cbind(EndoscMeds(Myendo$Medications), Myendo)
# Now let's look at the fentanyl use per Endoscopist:
kk <- MetricByEndoscopist(MyendoNew, 'Endoscopist', 'Fent')
# EndoBasicGraph(MyendoNew, "Endoscopist", "Fent") # run this
# if you want to see the graph
rm(Myendo)
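Summarising a numeric metric by endoscopist amounts to a grouped average. A minimal base R sketch on made-up data is given below; the `meds` data frame is hypothetical and the use of the mean as the summary is an assumption of this sketch.

```r
# Toy data: fentanyl dose (micrograms) per procedure, with one missing value.
meds <- data.frame(
  Endoscopist = c("Dr A", "Dr A", "Dr B", "Dr B", "Dr B"),
  Fent = c(50, 100, 25, NA, 75),
  stringsAsFactors = FALSE
)

# Mean dose per endoscopist; the formula interface drops rows with NA.
avg_fent <- aggregate(Fent ~ Endoscopist, data = meds, FUN = mean)
```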
A dataset containing fake endoscopy reports. The report fields have already been extracted. The report field is derived from the whole report as follows:
Myendo <- TheOGDReportFinal
Myendo$OGDReportWhole <- gsub('2nd Endoscopist:', 'Second endoscopist:',
  Myendo$OGDReportWhole)
EndoscTree <- list('Hospital Number:', 'Patient Name:', 'General Practitioner:',
  'Date of procedure:', 'Endoscopist:', 'Second endoscopist:', 'Medications',
  'Instrument', 'Extent of Exam:', 'Indications:', 'Procedure Performed:',
  'Findings:', 'Endoscopic Diagnosis:')
for (i in 1:(length(EndoscTree) - 1)) {
  Myendo <- Extractor(Myendo, 'OGDReportWhole', as.character(EndoscTree[i]),
    as.character(EndoscTree[i + 1]), as.character(EndoscTree[i]))
}
Myendo$Dateofprocedure <- as.Date(Myendo$Dateofprocedure)
Myendo
A data frame with 2000 rows and 1 variables:
The whole report, in text
Hospital Number, in text
Patient Name, in text
General Practitioner, in text
Date of the procedure, as date
Endoscopist, in text
Secondendoscopist, in text
Medications, in text
Instrument, in text
ExtentofExam, in text
Indications, in text
Procedure Performed, in text
Endoscopic findings, in text
This is used to pick and clean endoscopic images from html exports so they can be prepared before being linked to pathology and endoscopy reports
MyImgLibrary(file, delim, location)
file | The html report to extract (the html will have all the image references in it) |
delim | The phrase that separates individual endoscopies |
location | The folder containing the actual images |
Other Basic Column mutators: EntityPairs_OneSentence(), EntityPairs_TwoSentence(), ExtrapolatefromDictionary(), ListLookup()
# MyImgLibrary("~/Images Captured with Proc Data Audit_Findings1.html",
#   "procedureperformed", "~/")
A dataset containing fake pathology reports. The report field is derived from the whole report as follows:
Mypath <- PathDataFrameFinalColon
HistolTree <- list('Hospital Number', 'Patient Name', 'DOB:', 'General Practitioner:',
  'Date of procedure:', 'Clinical Details:', 'Macroscopic description:',
  'Histology:', 'Diagnosis:', '')
for (i in 1:(length(HistolTree) - 1)) {
  Mypath <- Extractor(Mypath, 'PathReportWhole', as.character(HistolTree[i]),
    as.character(HistolTree[i + 1]), as.character(HistolTree[i]))
}
Mypath$Dateofprocedure <- as.Date(Mypath$Dateofprocedure)
Mypath
A data frame with 2000 rows and 1 variables:
The whole report, in text
Hospital Number, in text
Patient Name, in text
Date of Birth, in text
General Practitioner, in text
Date of the procedure, as date
Clinical Details, in text
Macroscopic description of the report, in text
Histology, in text
Diagnosis, in text
Extraction of the negative sentences so that normal findings can be removed and not counted when searching for true diseases, e.g. remove 'No evidence of candidal infection' so it doesn't get included if looking for candidal infections. It is used by default as part of the textPrep function but can be turned off as an optional parameter.
NegativeRemove(inputText)
inputText | column of interest |
This returns a column within a dataframe. This should be changed to a character vector eventually.
Other NLP - Text Cleaning and Extraction: ColumnCleanUp(), DictionaryInPlaceReplace(), Extractor(), NegativeRemoveWrapper(), textPrep()
# Build a character vector and then
# incorporate into a dataframe
anexample <- c("There is no evidence of polyp here",
  "Although the prep was poor,there was no adenoma found",
  "The colon was basically inflammed, but no polyp was seen",
  "The Barrett's segment was not biopsied",
  "The C0M7 stretch of Barrett's was flat")
anexample <- data.frame(anexample)
names(anexample) <- "Thecol"
# Run the function on the dataframe and it should get rid of sentences (and
# parts of sentences) with negative parts in them.
hh <- NegativeRemove(anexample$Thecol)
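The principle of negative removal can be sketched with a much simpler rule than the package uses: split each report into sentences and drop any sentence containing a negation word. The helper `strip_negatives` and its three-word negation list are hypothetical, and unlike NegativeRemove this sketch only removes whole sentences, not parts of sentences.

```r
strip_negatives <- function(text) {
  vapply(text, function(report) {
    # Split on full stops, dropping the stop and any following spaces.
    sentences <- unlist(strsplit(report, "\\.\\s*"))
    # Drop sentences containing a standalone negation word.
    negative <- grepl("\\b(no|not|without)\\b", sentences,
      ignore.case = TRUE, perl = TRUE
    )
    kept <- sentences[!negative]
    if (length(kept) == 0) {
      return("")
    }
    paste0(paste(kept, collapse = ". "), ".")
  }, character(1), USE.NAMES = FALSE)
}

reports <- c(
  "There is oesophagitis. No evidence of candidal infection.",
  "No polyp seen.",
  "The Barrett's segment was flat."
)
cleaned <- strip_negatives(reports)
```

After this step a search for "candidal" no longer hits the first report, which is the point of removing negated findings before looking for true disease.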
This performs negative removal on a per sentence basis.
NegativeRemoveWrapper(inputText)
inputText | the text to remove Negatives from |
This returns a column within a dataframe. This should be changed to a character vector eventually.
Other NLP - Text Cleaning and Extraction: ColumnCleanUp(), DictionaryInPlaceReplace(), Extractor(), NegativeRemove(), textPrep()
# Build a character vector and then
# incorporate into a dataframe
anexample <- c("There is no evidence of polyp here",
  "Although the prep was poor,there was no adenoma found",
  "The colon was basically inflammed, but no polyp was seen",
  "The Barrett's segment was not biopsied",
  "The C0M7 stretch of Barrett's was flat")
anexample <- data.frame(anexample)
names(anexample) <- "Thecol"
# Run the function on the dataframe and it should get rid of sentences (and
# parts of sentences) with negative parts in them.
# hh <- NegativeRemoveWrapper(anexample$Thecol)
A dataset containing fake pathology reports for upper GI endoscopy tissue specimens. The report field is provided as a whole report without any fields having been already extracted
PathDataFrameFinal
A data frame with 2000 rows and 1 variable:
The whole report, in text
A dataset containing fake pathology reports for lower GI endoscopy tissue specimens. The report field is provided as a whole report without any fields having been already extracted
PathDataFrameFinalColon
A data frame with 2000 rows and 1 variable:
The whole report, in text
This allows us to look at the overall flow from one type of procedure to another using circos plots. A good example of its use might be to see how patients move from one state (e.g. having an EMR) to another state (e.g. undergoing RFA).
PatientFlow_CircosPlots(dataframe, Endo_ResultPerformed, HospNum_Id, ProcPerformed)
dataframe | dataframe |
Endo_ResultPerformed | the column containing the date of the procedure |
HospNum_Id | Column with the patient's unique hospital number |
ProcPerformed | The procedure that you want to plot (e.g. EMR, radiofrequency ablation for Barrett's, but can be any description of a procedure you desire) |
# This function builds a circos plot which gives a more aggregated
# overview of how patients flow from one state to another than the
# SurveySankey function
# Build a list of procedures
Event <- list(
  x1 = "Therapeutic- Dilatation",
  x2 = "Other-", x3 = "Surveillance",
  x4 = "APC", x5 = "Therapeutic- RFA TTS",
  x5 = "Therapeutic- RFA 90",
  x6 = "Therapeutic- EMR", x7 = "Therapeutic- RFA 360"
)
EndoEvent <- replicate(2000, sample(Event, 1, replace = FALSE))
# Merge the list with the Myendo dataframe
fff <- unlist(EndoEvent)
fff <- data.frame(fff)
names(fff) <- "col1"
Myendo$EndoEvent <- fff$col1
names(Myendo)[names(Myendo) == "HospitalNumber"] <- "PatientID"
names(Myendo)[names(Myendo) == "fff$col1"] <- "EndoEvent"
# Myendo$EndoEvent <- as.character(Myendo$EndoEvent)
# Run the function using the procedure information (the date of the
# procedure, the Event type and the individual patient IDs)
hh <- PatientFlow_CircosPlots(Myendo, "Dateofprocedure", "PatientID", "EndoEvent")
rm(Myendo)
rm(EndoEvent)
This plots the findings at endoscopy (or pathology) over time for individual patients. An example might be with worst pathological grade on biopsy for Barrett's oesophagus over time
PatientFlowIndividual(theframe, EndoReportColumn, myNotableWords, DateofProcedure, PatientID)
theframe | dataframe |
EndoReportColumn | the column containing the findings to plot (e.g. IMorNoIM in the example) |
myNotableWords | The terms from a column with categorical variables |
DateofProcedure | Column with the date of the procedure |
PatientID | Column with the patient's unique identifier |
Other Patient Flow functions: SurveySankey()
# This function builds a chart of categorical outcomes for individual patients
# over time. It allows a two dimensional visualisation of patient progress.
# A perfect example is visualising the Barrett's progression for patients on
# surveillance and then therapy if dysplasia develops and highlighting
# recurrence if it happens
# Barretts_df <- BarrettsAll(Myendo, "Findings", "OGDReportWhole", Mypath, "Histology")
# myNotableWords <- c("No_IM", "IM", "LGD", "HGD", "T1a", "IGD", "SM1", "SM2")
# PatientFlowIndividual(Barretts_df, "IMorNoIM", myNotableWords, DateofProcedure, "HospitalNumber")
# Once the function is run you should always call dev.off()
This returns a list of catheters used in radiofrequency ablation.
RFACath()
Other NLP - Lexicons: BiopsyIndex(), EventList(), GISymptomsList(), HistolType(), LocationListLower(), LocationListUniversal(), LocationListUpper(), LocationList(), WordsToNumbers()
This function creates a consort diagram using DiagrammeR by assessing all of the dataframes in your script and populating each box in the consort diagram with the number of rows in each dataframe as well as how the dataframes are linked together. The user just provides a pathname for the script.
sanity(pathName)
pathName | The path to the R script to assess |
# pathName <- paste0(here::here(), "/inst/TemplateProject/munge/PreProcessing.R")
# sanity(pathName)
# This creates a consort diagram from any R script (not Rmd). It
# basically tells you how all the dataframes are related and how many
# rows each dataframe has so you can see if any data has been lost
# on the way.
This standardises the colours for any ggplot plot produced. If you do use it, like all ggplots it can be extended using the "+" to add whatever else is necessary
scale_colour_Publication()
Other Data Presentation helpers: EndoBasicGraph(), scale_fill_Publication(), theme_Publication()
# None needed
This standardises the fills for any ggplot plot produced. If you do use it, like all ggplots it can be extended using the "+" to add whatever else is necessary
scale_fill_Publication()
Other Data Presentation helpers: EndoBasicGraph(), scale_colour_Publication(), theme_Publication()
# None needed
This is a helper function for finding and replacing from lexicons like the event list. The lexicons are all named lists where the name is the text to replace and the value is what it should be replaced with. It uses fuzzy find and replace to account for spelling errors.
spellCheck(pattern, replacement, x, fixed = FALSE)
pattern | the pattern to look for |
replacement | the replacement for the pattern |
x | the target string |
fixed | whether the pattern is regex or not. Default not. |
This returns a character vector
L <- tolower(stringr::str_split(HistolType(),"\\|"))
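One way to picture fuzzy find and replace is to compare each word of the text with the lexicon term by edit distance and overwrite the near misses. The base R sketch below does this with `utils::adist`; the helper `fuzzy_replace` and its `max_dist` threshold are hypothetical and this is not how spellCheck is implemented internally.

```r
fuzzy_replace <- function(pattern, replacement, x, max_dist = 1) {
  vapply(x, function(s) {
    words <- strsplit(s, " ", fixed = TRUE)[[1]]
    # Edit distance between every word and the term being looked for.
    dist <- as.vector(utils::adist(tolower(words), tolower(pattern)))
    # Overwrite words that are within max_dist edits of the term.
    words[dist <= max_dist] <- replacement
    paste(words, collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

fixed_text <- fuzzy_replace(
  "biopsy", "biopsy",
  c("one biopsey taken", "two biopsies", "biopsy x3")
)
```

With a threshold of one edit, "biopsey" is corrected while "biopsies" (three edits away) is left alone, which shows the trade-off in choosing the distance.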
Extracts the first test only per patient and returns a new dataframe listing the patientID and the first test done
SurveilFirstTest(dataframe, HospNum_Id, Endo_ResultPerformed)
dataframe | dataframe |
HospNum_Id | Patient ID |
Endo_ResultPerformed | Date of the Endoscopy |
Other Basic Analysis - Surveillance Functions: HowManyOverTime(), SurveilLastTest(), SurveilTimeByRow(), TimeToStatus()
dd <- SurveilFirstTest(
  Myendo, "HospitalNumber",
  "Dateofprocedure"
)
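Picking the first (or last) test per patient is a sort followed by de-duplication. The sketch below shows this in base R on a toy data frame; the data are invented and the package functions may return different columns.

```r
endo <- data.frame(
  HospitalNumber = c("P1", "P2", "P1", "P2", "P1"),
  Dateofprocedure = as.Date(c(
    "2014-05-01", "2013-02-10", "2012-11-30",
    "2015-07-04", "2016-01-01"
  )),
  stringsAsFactors = FALSE
)

# Sort by patient and then by date.
ordered <- endo[order(endo$HospitalNumber, endo$Dateofprocedure), ]

# First row per patient is the earliest test, last row is the latest.
first_test <- ordered[!duplicated(ordered$HospitalNumber), ]
last_test <- ordered[!duplicated(ordered$HospitalNumber, fromLast = TRUE), ]
```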
This extracts the last test only per patient and returns a new dataframe listing the patientID and the last test done
SurveilLastTest(dataframe, HospNum_Id, Endo_ResultPerformed)
dataframe | dataframe |
HospNum_Id | Patient ID |
Endo_ResultPerformed | Date of the Endoscopy |
Other Basic Analysis - Surveillance Functions: HowManyOverTime(), SurveilFirstTest(), SurveilTimeByRow(), TimeToStatus()
cc <- SurveilLastTest(Myendo, "HospitalNumber", "Dateofprocedure")
This determines the time difference between each test for a patient in days. It returns the time since the first and the last study as a new dataframe.
SurveilTimeByRow(dataframe, HospNum_Id, Endo_ResultPerformed)
dataframe | dataframe |
HospNum_Id | Patient ID |
Endo_ResultPerformed | Date of the Endoscopy |
Other Basic Analysis - Surveillance Functions: HowManyOverTime(), SurveilFirstTest(), SurveilLastTest(), TimeToStatus()
aa <- SurveilTimeByRow(
  Myendo, "HospitalNumber",
  "Dateofprocedure"
)
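The per-patient time differences can be sketched in base R with `ave`, which applies a function within each patient's rows. The column names `DaysSinceLast` and `DaysSinceFirst` are hypothetical labels chosen for this example and need not match the package's output.

```r
endo <- data.frame(
  HospitalNumber = c("P1", "P2", "P1", "P2", "P1"),
  Dateofprocedure = as.Date(c(
    "2015-01-31", "2015-06-01", "2015-01-01",
    "2015-06-11", "2016-01-01"
  )),
  stringsAsFactors = FALSE
)

ordered <- endo[order(endo$HospitalNumber, endo$Dateofprocedure), ]
days <- as.numeric(ordered$Dateofprocedure)

# Days since the previous test for the same patient (NA for the first test).
ordered$DaysSinceLast <- ave(days, ordered$HospitalNumber,
  FUN = function(d) c(NA, diff(d))
)

# Days since that patient's first test.
ordered$DaysSinceFirst <- ave(days, ordered$HospitalNumber,
  FUN = function(d) d - d[1]
)
```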
The purpose of the function is to provide a Sankey plot which allows the analyst to see the proportion of patients moving from one state (in this case type of Procedure) to another. This allows us to see for example how many EMRs are done after RFA.
SurveySankey(dfw, ProcPerformedColumn, PatientID)
dfw | the dataframe extracted using the standard cleanup scripts |
ProcPerformedColumn | the column containing the test, like ProcPerformed for example |
PatientID | the column containing the patient's unique identifier, e.g. hospital number |
Other Patient Flow functions: PatientFlowIndividual()
names(Myendo)[names(Myendo) == "HospitalNumber"] <- "PatientID"
gg <- SurveySankey(Myendo, "ProcedurePerformed", "PatientID")
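Behind a Sankey plot of this kind sits a table of transitions: for each patient, each procedure is paired with the one that follows it, and the pairs are counted. The base R sketch below builds that table from invented data; it stops short of drawing the plot and is not the package's implementation.

```r
proc <- data.frame(
  PatientID = c("P1", "P1", "P1", "P2", "P2", "P3"),
  Date = as.Date(c(
    "2015-01-01", "2015-03-01", "2015-06-01",
    "2015-02-01", "2015-05-01", "2015-04-01"
  )),
  ProcedurePerformed = c("RFA", "RFA", "EMR", "RFA", "EMR", "EMR"),
  stringsAsFactors = FALSE
)

ordered <- proc[order(proc$PatientID, proc$Date), ]

# Pair each procedure with the next one for the same patient.
per_patient <- split(ordered$ProcedurePerformed, ordered$PatientID)
transitions <- do.call(rbind, lapply(per_patient, function(p) {
  if (length(p) < 2) {
    return(NULL)
  }
  data.frame(from = head(p, -1), to = tail(p, -1), stringsAsFactors = FALSE)
}))

# Count how often each from/to pair occurs.
flow <- as.data.frame(table(transitions$from, transitions$to),
  stringsAsFactors = FALSE
)
names(flow) <- c("from", "to", "n")
flow <- flow[flow$n > 0, ]
```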
This function prepares the data by cleaning punctuation, checking spelling against the lexicons, mapping terms according to the lexicons and lower casing everything. It contains several of the other functions in the package for ease of use.
textPrep(inputText, delim)
inputText | The relevant pathology text columns |
delim | the delimiters so the extractor can be used |
This returns a string vector.
Other NLP - Text Cleaning and Extraction: ColumnCleanUp(), DictionaryInPlaceReplace(), Extractor(), NegativeRemoveWrapper(), NegativeRemove()
mywords <- c("Hospital Number", "Patient Name:", "DOB:", "General Practitioner:",
  "Date received:", "Clinical Details:", "Macroscopic description:",
  "Histology:", "Diagnosis:")
CleanResults <- textPrep(PathDataFrameFinal$PathReportWhole, mywords)
This standardises the theme for any ggplot plot produced. If you do use it, like all ggplots it can be extended using the "+" to add whatever else is necessary
theme_Publication(base_size = 14, base_family = "Helvetica")
base_size | the base size |
base_family | the base family |
Other Data Presentation helpers: EndoBasicGraph(), scale_colour_Publication(), scale_fill_Publication()
# None needed
A dataset containing fake endoscopy reports. The report field is provided as a whole report without any fields having been already extracted
TheOGDReportFinal
A data frame with 2000 rows and 1 variable:
The whole report, in text
This function selects patients who have had a start event and an end event of the user's choosing so you can determine things like how long it takes to get a certain outcome. For example, how long does it take to get a patient into a fully squamous oesophagus after Barrett's ablation for dysplasia?
TimeToStatus(dataframe, HospNum, EVENT, indicatorEvent, endEvent)
dataframe | The dataframe |
HospNum | The Hospital Number column |
EVENT | The column that contains the outcome of choice |
indicatorEvent | The name of the start event (can be a regular expression) |
endEvent | The name of the endpoint (can be a regular expression) |
Other Basic Analysis - Surveillance Functions: HowManyOverTime(), SurveilFirstTest(), SurveilLastTest(), SurveilTimeByRow()
# Firstly relevant columns are extrapolated from the
# Mypath demo dataset. These functions are all part of Histology data
# cleaning as part of the package.
v <- Mypath
v$NumBx <- HistolNumbOfBx(v$Macroscopicdescription, "specimen")
v$BxSize <- HistolBxSize(v$Macroscopicdescription)
# The histology is then merged with the Endoscopy dataset. The merge occurs
# according to date and Hospital number
v <- Endomerge2(
  Myendo, "Dateofprocedure", "HospitalNumber", v, "Dateofprocedure",
  "HospitalNumber"
)
# The function relies on the other Barrett's functions being run as well:
b1 <- Barretts_PragueScore(v, "Findings")
b1$IMorNoIM <- Barretts_PathStage(b1, "Histology")
colnames(b1)[colnames(b1) == "pHospitalNum"] <- "HospitalNumber"
# The function groups the procedures by patient and gives
# all the procedures between
# the indicatorEvent and the procedure just after the endpoint.
# E.g. if the start is RFA and the
# endpoint is biopsies then it will give all RFA procedures and
# the first biopsy procedure
b1$EndoscopyEvent <- EndoscopyEvent(
  b1, "Findings", "ProcedurePerformed",
  "Macroscopicdescription", "Histology"
)
nn <- TimeToStatus(b1, "eHospitalNum", "EndoscopyEvent", "rfa", "dilat")
rm(v)
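The selection logic (keep patients who have a start event followed by an end event, then measure the gap) can be sketched in base R as below. The helper `time_to_status`, its arguments and the toy data are hypothetical, and the sketch returns only the number of days from the first start event to the first later end event rather than the full set of procedures the package function gives.

```r
events <- data.frame(
  id = c("P1", "P1", "P1", "P2", "P3", "P3"),
  date = as.Date(c(
    "2015-01-01", "2015-03-01", "2015-04-01",
    "2015-02-01", "2015-01-01", "2015-02-01"
  )),
  event = c("RFA", "RFA", "dilatation", "RFA", "dilatation", "RFA"),
  stringsAsFactors = FALSE
)

time_to_status <- function(df, start, end) {
  rows <- lapply(split(df, df$id), function(p) {
    p <- p[order(p$date), ]
    # Date of the first start event (NA if the patient never had one).
    start_date <- p$date[grepl(start, p$event, ignore.case = TRUE)][1]
    if (is.na(start_date)) {
      return(NULL)
    }
    # Date of the first end event that comes after the start event.
    is_end <- grepl(end, p$event, ignore.case = TRUE) & p$date > start_date
    end_date <- p$date[is_end][1]
    if (is.na(end_date)) {
      return(NULL)
    }
    data.frame(
      id = p$id[1], days = as.numeric(end_date - start_date),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

res <- time_to_status(events, "rfa", "dilat")
```

Only P1 is kept here: P2 never reaches the end event and P3's dilatation comes before the RFA.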
A dataset containing fake lower GI endoscopy reports and pathology reports all pre-extracted
vColon
A data frame with 2000 rows and 26 variables:
The HospitalNum, in text
The PatientName, in text
The GeneralPractitioner report, in text
The Date, in date
The Endoscopist report, in text
The Secondendoscopist report, in text
The Medications report, in text
The Instrument report, in text
The ExtentofExam report, in text
The Indications report, in text
The ProcedurePerformed report, in text
The Findings report, in text
The EndoscopicDiagnosis report, in text
The Original endoscopy report, in text
The HospitalNum, in text
The PatientName, in text
The DOB, in date
The GeneralPractitioner report, in text
The Date.y , in date
The ClinicalDetails report, in text
The Natureofspecimen report, in text
The Macroscopicdescription report, in text
The Histology report, in text
The Diagnosis report, in text
The whole report, in text
Days, in numbers
This function converts words to numbers.
WordsToNumbers()
Other NLP - Lexicons: BiopsyIndex(), EventList(), GISymptomsList(), HistolType(), LocationListLower(), LocationListUniversal(), LocationListUpper(), LocationList(), RFACath()
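As an illustration of how a words-to-numbers lexicon can be applied to report text, the base R sketch below uses a small named vector. The `word_map` contents and the `to_digits` helper are hypothetical and are not the list returned by WordsToNumbers().

```r
# Hypothetical lexicon: names are the words, values are the digits.
word_map <- c(one = "1", two = "2", three = "3", four = "4", five = "5")

to_digits <- function(text, word_map) {
  for (w in names(word_map)) {
    # Word boundaries stop "one" matching inside words such as "someone".
    text <- gsub(paste0("\\b", w, "\\b"), word_map[[w]], text,
      ignore.case = TRUE, perl = TRUE
    )
  }
  text
}

converted <- to_digits(
  c("Three pieces of tissue", "Two biopsies, one polyp", "someone"),
  word_map
)
```

Converting number words first means a later numeric regex, such as the one used to count biopsies, also picks up counts that were written out in words.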