As you probably know, there is a publicly available database which contains a great deal of information on the majority of clinical trials – at least on trials involving US citizens – dating back to 1983.
In this post I try to show what information is stored in this database and how you can manage it with free statistical tools. I give a detailed description of the ID structure and present solutions for specific scientific questions.
The questions I try to answer with this small presentation:
- How to determine the number of “Recruiting” sites, how to generate a list of cities with the total number of recruiting facilities, and how to plot the ‘Recruiting’ sites on a Google map.
The data can be downloaded from
By choosing pipe-delimited text files, you can easily read the content with any text editor (I would recommend Notepad++).
If you have some statistical background and especially you have access to SAS you can download SAS transport files as well.
After downloading the zipped file of nearly 2 GB, you’ll get a set of 40 files.
One of the tools that can be used to manage these files is R, or RStudio, its popular IDE.
As it is stated on the webpage http://aact.ctti-clinicaltrials.org, you can easily read the downloaded files with the help of the code:
read.table(file = "id_information.txt", header = TRUE, sep = "|", na.strings = "", comment.char = "", quote = "\"", fill = FALSE, nrows = 200000)
The most important file is the Studies database. You can find information – among others – on
- the last verification date
- the number of arms and groups.
The file contains data of more than 251 thousand studies (only the first 1000 can be found on our site).
Task 1: Determine how many open (overall_status = ‘Recruiting’) studies can be found, tabulated by site.
We have to rely on the Facilities and Studies databases. The Facilities database – its first 1000 records – can be checked here.
To get the database containing both study and facility relevant data, you have to merge the two databases.
In R with the command
library(Hmisc)
library(data.table)
library(DT)

# read the first 5000 records of the Studies and Facilities files
studies <- read.table("DIR/studies.txt", header = TRUE, sep = "|", na.strings = "",
                      comment.char = "", quote = "\"", fill = FALSE, nrows = 5000)
facilities <- read.table("DIR/facilities.txt", header = TRUE, sep = "|", na.strings = "",
                         comment.char = "", quote = "\"", fill = FALSE, nrows = 5000)

# merge the two tables on the trial identifier and keep the relevant columns
sites <- merge(studies, facilities, by = "nct_id")
my <- c("nct_id", "overall_status", "city", "state", "zip", "country", "name")
sitesa <- sites[my]
sitesa$city <- tolower(sitesa$city)
If you would like to have a table of sites with ‘Recruiting’ status, you can obtain a table like this:
with the commands:
datatable(setDT(sitesa_c_final)[, .N, by = .(overall_status,city)][order(-N)])
Or would you like to display the status of the sites on a Google map? No problem, but for that I would recommend switching from RStudio to KNIME.
If you would like to place the sites on a map you’ll need their exact coordinates. The good news is that this information is also available for free. You can download the necessary database from Maxmind site ( https://www.maxmind.com/en/free-world-cities-database ).
Addition of the coordinates to the database with cities can be done with the following code:
# add the coordinates, then keep only the recruiting sites
coords <- read.table("e:/_job/clinicaltrials.gov/worldcities/worldcitiespop.txt", header = TRUE,
                     sep = ",", na.strings = "", comment.char = "", quote = "\"", fill = FALSE)
sitesa_c <- merge(sitesa, coords, by.x = "city", by.y = "City")
sitesa_c_final <- subset(sitesa_c, sitesa_c$overall_status == "Recruiting")
This sitesa_c_final table is then passed to KNIME, where the following actions should be performed:
The outcome looks like this: the labelled points (indicated by their names) are the sites with ‘Recruiting’ status.
One of the major tasks in natural language processing (NLP) is the part-of-speech (POS) tagging of sentences, i.e. categorizing the words according to grammatical properties. Common parts of speech are noun, verb, article, adjective, preposition, pronoun, adverb, conjunction and interjection.
With the help of a recent R package RDRPOSTagger now one can perform POS tagging within R on more than 40 languages, including English, Hungarian, French, German, Hindi, Italian, Thai, Vietnamese and many more. Below we present a brief introduction to this topic via a simple R script.
First, we have to install/load the package. The tokenizers package is also needed for splitting the text into sentences.
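The loading step can be sketched as follows (a minimal sketch; the install lines are commented-out assumptions – RDRPOSTagger is assumed to be available from the bnosac GitHub repository rather than CRAN, while tokenizers installs from CRAN):

```r
# install.packages("tokenizers")                  # assumed CRAN install
# devtools::install_github("bnosac/RDRPOSTagger") # assumed GitHub source

library(RDRPOSTagger)  # POS tagging with Ripple Down Rules models
library(tokenizers)    # sentence splitting
```

Once both packages load without error, the rest of the script below can be run as-is.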
The part-of-speech tags are listed along with their abbreviations.
unipostag_types <- c("ADJ" = "adjective", "ADP" = "adposition", "ADV" = "adverb",
                     "AUX" = "auxiliary", "CONJ" = "coordinating conjunction",
                     "DET" = "determiner", "INTJ" = "interjection", "NOUN" = "noun",
                     "NUM" = "numeral", "PART" = "particle", "PRON" = "pronoun",
                     "PROPN" = "proper noun", "PUNCT" = "punctuation",
                     "SCONJ" = "subordinating conjunction", "SYM" = "symbol",
                     "VERB" = "verb", "X" = "other")
Next, the text to analyse is added (source: Wikipedia).
text <- "Rubik's Cube is a 3-D combination puzzle invented in 1974 by Hungarian sculptor and professor of architecture Ernő Rubik. Originally called the Magic Cube, the puzzle was licensed by Rubik to be sold by Ideal Toy Corp. in 1980 via businessman Tibor Laczi and Seven Towns founder Tom Kremer, and won the German Game of the Year special award for Best Puzzle that year. As of January 2009, 350 million cubes had been sold worldwide making it the world's top-selling puzzle game. It is widely considered to be the world's best-selling toy."
We split it into sentences.
sentences <- tokenize_sentences(text, simplify = TRUE)
The language and type of tagging needs to be defined.
unipostagger <- rdr_model(language = "UD_English", annotation = "UniversalPOS")
Finally, the tagging is performed.
unipostags <- rdr_pos(unipostagger, sentences)
unipostags$word.type <- unipostag_types[unipostags$word.type]
The results for the first sentence can be seen below.
sentence.id word.id word word.type
1 1 1 Rubik's noun
2 1 2 Cube noun
3 1 3 is verb
4 1 4 a determiner
5 1 5 3-D numeral
6 1 6 combination noun
7 1 7 puzzle adjective
8 1 8 invented verb
9 1 9 in adposition
10 1 10 1974 numeral
11 1 11 by adposition
12 1 12 Hungarian adjective
13 1 13 sculptor noun
14 1 14 and coordinating conjunction
15 1 15 professor noun
16 1 16 of adposition
17 1 17 architecture noun
18 1 18 Ernő proper noun
19 1 19 Rubik. proper noun
For more details about the RDRPOSTagger package please check this link: Natural Language Processing on 40 languages with the Ripple Down Rules-based Part-Of-Speech Tagger.
The purpose of Shiny is to provide an intuitive and user-friendly interface to R. R is a highly popular statistical environment for doing heavy data analysis and constructing statistical models, and therefore is highly popular among data scientists. However, for a user with a non-coding background, using R to conduct such analysis can become quite intensive. This is where Shiny Web Apps come in. Essentially, Shiny allows for a more intuitive graphical user interface that is still capable of conducting sophisticated data analysis — without the need for extensive coding on the part of the end user.
In this article on using Shiny with R and HTML, the author illustrated how an interactive web application can be created to conduct analysis without the need for direct manipulation of code. In this article, the author will use a slightly different model to illustrate how the Shiny environment can be customized to work with the end user in a more intuitive fashion. Essentially, the goal of this article is to illustrate how a user can:
- Build an application by linking the UI and server side
- Customize the themes available in the shinythemes library
- Implement error messages in order to guide the end user on how to use a particular program
The program itself that is developed for this tutorial is quite basic: a slider input allows the user to manipulate a variable within the program by means of reactivity, which causes instantaneous changes in the line plot output that is developed by means of reactivity.
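A minimal sketch of such an app could look like the following (the input name n and the random-walk data are illustrative assumptions, not the tutorial's actual code):

```r
library(shiny)

# UI side: a slider the user can manipulate, and a plot output
ui <- fluidPage(
  sliderInput("n", "Number of points:", min = 10, max = 500, value = 100),
  plotOutput("lineplot")
)

# server side: the plot is re-rendered instantly whenever the slider moves
server <- function(input, output) {
  output$lineplot <- renderPlot({
    plot(cumsum(rnorm(input$n)), type = "l",
         xlab = "Index", ylab = "Value")
  })
}

shinyApp(ui, server)
```

Running this script opens the app in a browser; dragging the slider redraws the line plot through Shiny's reactivity, with no further coding by the end user.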
This inherent function gives Shiny a significant advantage over using R code as a stand-alone. Traditionally, in order to analyze the change in a particular variable, the code must be manipulated directly (or the data from which the code is reading), and this can ultimately become very inefficient. However, Shiny greatly speeds up this process by allowing the user to manipulate the variables in a highly intuitive manner, and changes are reflected instantly.
However, the whole purpose of Shiny is to make an R script as interactive as possible. In this regard, the user will want to be able to add features to the program that go well beyond reactivity. Two such aspects that the author will discuss in this tutorial are:
- Utilizing shinythemes in order to customize the appearance of our Shiny app
- Constructing a validate() function in order to display an alert once variables are manipulated in a certain manner
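To make these two features concrete, here is a minimal sketch combining both (the theme name "cerulean" and the threshold of 1 are illustrative assumptions, not the tutorial's actual choices):

```r
library(shiny)
library(shinythemes)

ui <- fluidPage(
  theme = shinytheme("cerulean"),  # swap the default Bootstrap look
  sliderInput("n", "Observations:", min = 0, max = 200, value = 100),
  plotOutput("hist")
)

server <- function(input, output) {
  output$hist <- renderPlot({
    # validate() shows a friendly message instead of a raw R error
    validate(need(input$n >= 1, "Please select at least one observation."))
    hist(rnorm(input$n))
  })
}

shinyApp(ui, server)
```

With the slider at 0, the plot area displays the guidance message rather than an error traceback; any other value renders the histogram under the chosen theme.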
See the tutorial here: SitePoint
SatRdays are community-led, regional conferences to support collaboration, networking and innovation within the R community. The initiative of Steph Locke and Gergely Daroczi was accepted and funded by the R Consortium. The very first event of this series took place in Budapest, Hungary on September 3, 2016, with almost 200 attendees from 19 countries and 12 hours of pure R fun.

The day began with various workshops, followed by two keynotes and several regular talks, and ended with a data visualization challenge. The complete schedule can be found on the conference website, http://budapest.satrdays.org . The talks were live-streamed and can be watched online: http://www.ustream.tv/channel/xFdxHeVnGKS . If you have only limited time, we recommend the following talks: the 1st keynote by Gábor Csárdi (R package history), Romain François‘ question section (including a marriage proposal), the 2nd keynote by Jeroen Ooms (HTTP requests, ImageMagick) and the data sonification talk by Thomas Levine.

Overall, the first satRdays event received very positive feedback from the R community and started to establish the reputation of the series. Personal thoughts about the conference from the main organizer were published at https://www.r-consortium.org/news/blogs/2016/09/start-satrdays .
“With the advent of data science and the increased need to analyze and interpret vast amounts of data, the R language has become ever more popular. However, there’s increasingly a need for a smooth interaction between statistical computing platforms and the web, given both 1) the need for a more interactive user interface in analyzing data, and 2) the increased role of the cloud in running such applications.
Statisticians and web developers have thus seemed an unlikely mix till now, but make no mistake that the interactions between these two groups will continue to increase as the need for web-based platforms becomes ever more popular in the world of data science. In this regard, the interaction of the R and Shiny platforms is quickly becoming a cornerstone of interaction between the world of data and the web.
In this tutorial, we’ll look primarily at the commands used to build an application in Shiny — both on the UI (user interface) side and the server side. While familiarity with the R programming language is invariably helpful in creating a Shiny app, expert knowledge is not necessary, and this example will cover the building of a simple statistical graph in Shiny, along with some basic commands illustrating how to customize the web page through HTML.” – Michael Grogan
Continue to tutorial: SitePoint
Although this is a piece of news from 2012, it cannot be said often enough that the FDA officially accepts submissions in R.
“The FDA does not endorse or require any particular software to be used for clinical trial submissions, and there are no regulations that restrict the use of open source software (including R) at the FDA. Nonetheless, any software (R included) used to prepare data analysis from clinical trials must comply with the various FDA regulations and guidances. The R Foundation helpfully provides a guidance document for the use of R in regulated clinical trial environments, which provides details of the specific FDA regulations and how R complies with them.”
The details can be found here.