As you probably know, there is a publicly available database which contains a wealth of information on the majority of clinical trials – at least on trials involving US citizens – dating back to 1983.
In this post I try to show what information is stored in this database and how you can manage it with free statistical tools.
I give a detailed description of the ID structure and present solutions for specific scientific questions.
The questions I try to answer in this short presentation:
- How to determine the number of 'Recruiting' sites, how to generate a list of cities with the total number of recruiting facilities, and how to plot the 'Recruiting' sites on a Google map.
The data can be downloaded from the AACT website (http://aact.ctti-clinicaltrials.org).
If you choose the pipe-delimited text files, you can easily read the content with any text editor (I would recommend Notepad++).
If you have a statistical background and, especially, access to SAS, you can download SAS transport files as well.
After downloading the close to 2 GB zipped file, you'll get a set of 40 files.
One of the tools that can be used to manage these files is R, together with its IDE, RStudio.
As stated on the webpage http://aact.ctti-clinicaltrials.org, you can easily read the downloaded files with the help of the following code:
# Read one of the pipe-delimited text files into a data frame
read.table(file = "id_information.txt", header = TRUE, sep = "|",
           na.strings = "", comment.char = "", quote = "\"",
           fill = FALSE, nrows = 200000)
The most important file is the Studies database. In it you can find information – among other things – on
- last verification date
- number of arms and groups.
The file contains data on more than 251,000 studies (only the first 1,000 can be found on our site).
Task 1: Determine how many open (overall status = 'Recruiting') studies can be found, tabulated by site.
We have to rely on the Facilities and Studies databases. The Facilities database – the first 1,000 records – can be checked here.
To get a database containing both study- and facility-level data, you have to merge the two databases.
In R, this can be done with the following commands:
library(Hmisc)
library(data.table)
library(DT)

# Read the first 5000 records of the Studies and Facilities files
studies <- read.table("DIR/studies.txt", header = TRUE, sep = "|",
                      na.strings = "", comment.char = "", quote = "\"",
                      fill = FALSE, nrows = 5000)
facilities <- read.table("DIR/facilities.txt", header = TRUE, sep = "|",
                         na.strings = "", comment.char = "", quote = "\"",
                         fill = FALSE, nrows = 5000)

# Merge study-level and facility-level data on the trial identifier
sites <- merge(studies, facilities, by = "nct_id")

# Keep the relevant columns and normalise city names to lower case
my <- c("nct_id", "overall_status", "city", "state", "zip", "country", "name")
sitesa <- sites[my]
sitesa$city <- tolower(sitesa$city)
If you would like a table of the sites with 'Recruiting' status, you can obtain one like this:
It was produced with the following command:
# Count the sites per (status, city) pair and sort by decreasing count
datatable(setDT(sitesa_c_final)[, .N, by = .(overall_status, city)][order(-N)])
What if you would like to display the status of the sites on a Google map? No problem, but I would recommend switching from RStudio to KNIME.
If you would like to place the sites on a map, you'll need their exact coordinates. The good news is that this information is also available for free: you can download the necessary database from the MaxMind site (https://www.maxmind.com/en/free-world-cities-database).
The coordinates can be added to the city-level database with the following code:
# Load the MaxMind world cities database (city names with coordinates)
coords <- read.table("e:/_job/clinicaltrials.gov/worldcities/worldcitiespop.txt",
                     header = TRUE, sep = ",", na.strings = "",
                     comment.char = "", quote = "\"", fill = FALSE)

# Join the coordinates to the sites by city name
sitesa_c <- merge(sitesa, coords, by.x = "city", by.y = "City")

# Keep only the sites with 'Recruiting' status
sitesa_c_final <- subset(sitesa_c, sitesa_c$overall_status == "Recruiting")
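To hand the result over to KNIME, one simple option is to export the table as a CSV file that KNIME's file reader node can pick up; a minimal sketch (the file name is just an example):

# Export the filtered table for KNIME (the file name is only an example)
write.csv(sitesa_c_final, file = "sitesa_c_final.csv", row.names = FALSE)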
This sitesa_c_final table is then passed to KNIME, where the following actions should be performed:
The outcome looks like this: the sites shown (labelled by name) are the ones with 'Recruiting' status.
One of the major tasks in natural language processing (NLP) is the part-of-speech (POS) tagging of sentences, i.e. categorizing the words according to grammatical properties. Common parts of speech are noun, verb, article, adjective, preposition, pronoun, adverb, conjunction and interjection.
With the help of the recent R package RDRPOSTagger, one can now perform POS tagging within R for more than 40 languages, including English, Hungarian, French, German, Hindi, Italian, Thai, Vietnamese and many more. Below we present a brief introduction to this topic via a simple R script.
First, we have to install/load the package. The tokenizers package is also needed for splitting the text into sentences.
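A minimal sketch of this step (assuming the package is installed from its GitHub repository, since RDRPOSTagger is not on CRAN):

# One-time installation (the GitHub repository bnosac/RDRPOSTagger is assumed)
# install.packages(c("devtools", "tokenizers"))
# devtools::install_github("bnosac/RDRPOSTagger")
library(RDRPOSTagger)
library(tokenizers)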
The part-of-speech tags are listed along with their abbreviations.
unipostag_types <- c("ADJ"   = "adjective",
                     "ADP"   = "adposition",
                     "ADV"   = "adverb",
                     "AUX"   = "auxiliary",
                     "CONJ"  = "coordinating conjunction",
                     "DET"   = "determiner",
                     "INTJ"  = "interjection",
                     "NOUN"  = "noun",
                     "NUM"   = "numeral",
                     "PART"  = "particle",
                     "PRON"  = "pronoun",
                     "PROPN" = "proper noun",
                     "PUNCT" = "punctuation",
                     "SCONJ" = "subordinating conjunction",
                     "SYM"   = "symbol",
                     "VERB"  = "verb",
                     "X"     = "other")
Next, the text to analyse is added (source: Wikipedia).
text <- "Rubik's Cube is a 3-D combination puzzle invented in 1974 by Hungarian sculptor and professor of architecture Ernő Rubik. Originally called the Magic Cube, the puzzle was licensed by Rubik to be sold by Ideal Toy Corp. in 1980 via businessman Tibor Laczi and Seven Towns founder Tom Kremer, and won the German Game of the Year special award for Best Puzzle that year. As of January 2009, 350 million cubes had been sold worldwide making it the world's top-selling puzzle game. It is widely considered to be the world's best-selling toy."
We split it into sentences.
sentences <- tokenize_sentences(text, simplify = TRUE)
The language and the type of tagging need to be defined.
unipostagger <- rdr_model(language = "UD_English", annotation = "UniversalPOS")
Finally, the tagging is performed.
unipostags <- rdr_pos(unipostagger, sentences)
unipostags$word.type <- unipostag_types[unipostags$word.type]
The results for the first sentence can be seen below.
   sentence.id word.id word         word.type
1            1       1 Rubik's      noun
2            1       2 Cube         noun
3            1       3 is           verb
4            1       4 a            determiner
5            1       5 3-D          numeral
6            1       6 combination  noun
7            1       7 puzzle       adjective
8            1       8 invented     verb
9            1       9 in           adposition
10           1      10 1974         numeral
11           1      11 by           adposition
12           1      12 Hungarian    adjective
13           1      13 sculptor     noun
14           1      14 and          coordinating conjunction
15           1      15 professor    noun
16           1      16 of           adposition
17           1      17 architecture noun
18           1      18 Ernő         proper noun
19           1      19 Rubik.       proper noun
For more details about the RDRPOSTagger package, please check this link: Natural Language Processing on 40 languages with the Ripple Down Rules-based Part-Of-Speech Tagger.
What is Data Visualization?
Data Visualization is a way of representing complex data and statistics in a pleasing, visually appealing way. Visual data may include components like pie charts, graphs, maps or tables, and can be presented in different forms, such as infographics, videos, illustrations and interactive reports.
Why is it important? The answer is simple: our brains absorb visual information better, faster and more easily.
Benefits of Data Visualization
The benefits of visualizing data include:
- providing clearer information for clients
- making it easier to view and analyze patterns and trends
- enabling interaction with the data
- allowing for more information to be absorbed, and more quickly
- making it easier to identify peaks and troughs.
SitePoint's article assesses how a new tool, Google Data Studio, can help us build beautiful and interactive reports.
Google Data Studio
Google Data Studio (GDS) is a new tool by Google that makes it easy to create beautiful, engaging, responsive, branded and interactive reports. It does this by pulling metrics from Google’s properties, such as Google Analytics, Adwords and YouTube Analytics, as well as spreadsheets and SQL databases.
In the article, the author uses Data Studio to create a visual report based on Google Analytics data. To do this, you first need an active Google Analytics property that is properly integrated with your website.
The same applies to other reports. If you wish to pull the data from your Adwords or YouTube Analytics, make sure to sign in with an appropriate Google account that has that data.
Read the Setup Guide on SitePoint here.
Today's challenge is a geographical one. Do you know which are the most populated cities in the world? Do you know where they are? China? The USA? And, by way of contrast, do you know which are the smallest cities in the world?
Today we want to show you where you can find the largest and the smallest cities in the world by population on a map. While there is general agreement from trustworthy sources on the web about which are the most populated cities, agreement becomes sparser when looking for the smallest cities in the world. There is general agreement though about which ones are the smallest capitals in the world.
We collected data for the world's 125 largest cities in one CSV text file, and data for the 10 smallest capitals of equally small and beautiful countries in another CSV text file. The data include city name, country, area in square kilometers, population, and population density. Today's challenge is to locate these cities on a world map. Technically, this means:
- To blend the city data from the CSV files with the city geo-coordinates from the Google Geocoding API in KNIME Analytics Platform (a stand-alone sketch of the geocoding call follows this list)
- Then to blend the ETL and machine learning of KNIME Analytics Platform with the geographical visualization of OpenStreetMap.
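Outside of KNIME, the geocoding step can also be sketched in a few lines of R; the helper below is purely illustrative (the function name and the "YOUR_API_KEY" placeholder are assumptions; the endpoint is the public Google Geocoding API):

library(httr)
library(jsonlite)

# Illustrative helper: fetch lat/lng for one city from the Geocoding API
geocode_city <- function(city, country, api_key = "YOUR_API_KEY") {
  res <- GET("https://maps.googleapis.com/maps/api/geocode/json",
             query = list(address = paste(city, country, sep = ", "),
                          key = api_key))
  parsed <- fromJSON(content(res, as = "text", encoding = "UTF-8"))
  loc <- parsed$results$geometry$location[1, ]  # take the first match
  c(lat = loc$lat, lng = loc$lng)
}

geocode_city("Tokyo", "Japan")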
It's been a long wait, but it was worth it. A new version of Metabase is ready for you (and your end users)!
You can download the new version of Metabase at http://www.metabase.com/start/
To upgrade, see the instructions for your platform at http://www.metabase.com/docs/latest/operations-guide/start.html#upgrading-metabase
Let’s see what Metabase says about their new version:
Data access permissions
A way to control access to sensitive data has been one of the most requested features since we launched. With 0.20, we’ve taken the first major step in giving you the ability to lock down an instance. We now allow you to create user groups, and control their access to databases, tables and raw SQL queries. This lets you control access to sensitive data while still allowing your end users to answer their own questions within the datasets they’re allowed access to.
Getting started guides
In most places we’ve worked, there’s typically an email that gets forwarded around, or a Google doc that describes how to use the analytics systems available. Some more sophisticated setups use an internal wiki or other website that has an inventory of what’s available. We believe that the best way to keep these current is to have them be built into the application. Now you can create a cheatsheet to help new users know which dashboards, metrics and reports are the most important as well as provide caveats for use, advice on who to contact for help, and more.
Charting improvements – Part 2
Following up on our previous releases’ improvements to charting, we’ve added new chart types (progress, scatter and bubble charts), improved your control over axes, and allowed you to customize the display of dashboard cards made up of multiple questions.
Oracle support
Lots of you have been clamoring for a way to use Metabase with Oracle databases. Now you can! Due to Oracle's license for the underlying JDBC driver, you'll need to perform a few extra steps; check out www.metabase.com/docs/latest/administration-guide/databases/oracle.md for details.
Druid performance and timezone fixes
We’ve made some improvements to how Metabase works with Druid that fix a number of timezone bugs and improve charting performance.
Metabase has also fixed many of the issues and bugs reported on their GitHub page.
Definition of Customer Segments
Customer segmentation has undoubtedly been one of the most implemented applications in data analytics since the birth of customer intelligence and CRM data.
The concept is simple. Group your customers together based on some criteria, such as revenue creation, loyalty, demographics, buying behavior, or any combination of these criteria, and more.
The group (or segment) can be defined in many ways, depending on the data scientist’s degree of expertise and domain knowledge.
- Grouping by rules. Somebody in the company already knows how the system works and how the customers should be grouped together with respect to a given task, e.g. a campaign. A Rule Engine node would suffice to implement this set of experience-based rules. This approach is highly interpretable, but not very portable to new analyses: in the presence of a new goal, new knowledge, or new data, the whole rule system needs to be redesigned. (A scripted sketch of this and the binning approach follows the list.)
- Grouping as binning. Sometimes the goal is clear and not negotiable. One of the many features describing our customers is selected as the representative one, be it revenues, loyalty, demographics, or anything else. In this case, the operation of segmenting the customers in groups is reduced to a pure binning operation. Here customer segments are built along one or more attributes by means of bins. This task can be implemented easily, using one of the many binner nodes available in KNIME Analytics Platform.
- Grouping with zero knowledge. We can assume that the data scientist frequently does not know enough about the business at hand to build his own customer segmentation rules. In this case, if no business analyst is around to help, he should resort to a plain, blind clustering procedure. The follow-up work of interpreting the clusters belongs to a business analyst, who is (or should be) the domain expert.
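For illustration, here is a minimal stand-alone sketch in R of the first two approaches (the customer table, thresholds and bin edges are invented for the example; in KNIME they correspond to the Rule Engine and binner nodes):

# Hypothetical customer data, invented for this illustration
customers <- data.frame(revenue = c(120, 80, 300, 20, 250))

# 1. Grouping by rules: hand-written, experience-based conditions
customers$rule_segment <- ifelse(customers$revenue > 200, "high value",
                          ifelse(customers$revenue > 100, "mid value",
                                                          "low value"))

# 2. Grouping as binning: segments built along one attribute by means of bins
customers$bin_segment <- cut(customers$revenue,
                             breaks = c(0, 100, 200, Inf),
                             labels = c("low", "mid", "high"))
customers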
With the set goal of making this workflow suitable for a number of different use cases, we chose the third option.
There are many clustering procedures, and KNIME Analytics Platform makes them available in the Node Repository panel, in the category Analytics/Mining/Clustering: k-Means, nearest neighbors, DBSCAN, hierarchical clustering, SOTA, and more. We went for the most commonly used: the k-Means algorithm.
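Although the workflow uses KNIME's k-Means node, a minimal stand-alone sketch of the same zero-knowledge segmentation in R may help (the customer features and the choice of k = 3 are illustrative assumptions):

# Hypothetical customer features, invented for this illustration
set.seed(42)
customers <- data.frame(revenue   = c(120, 80, 300, 20, 250, 60),
                        frequency = c(10, 5, 25, 1, 20, 4))

# Standardise the features so no single scale dominates the distances
scaled <- scale(customers)

# Run k-means with k = 3 segments (k chosen purely for illustration)
km <- kmeans(scaled, centers = 3, nstart = 25)
customers$segment <- km$cluster
customers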
Read more: KNIME.ORG