Blog on R
11. June 2018 – Day 1
R is a very important tool for statisticians. I am speaking here about the support of clinical trials (or life-science research in general), but the statement holds in practically every area where statistical solutions are applied.
I have decided to spend at least one hour with R on each of the following 100 days. Probably one day will be skipped. Maybe. But the basic concept is to put in continuous work towards a better understanding of the R phenomenon.
With this blog I primarily want to learn. But I have other motivations as well. I am curious what a massive, 100-day-long learning exercise can achieve in this area. What depth of understanding of R (or at least of some specific aspects of it) will I be able to reach? I hope this activity will bear some fruit, and if so, I would like to share it with you, my Dear Follower.
Let’s start with a two-sentence introduction.
I am a statistician who has been supporting, first and foremost, the pharmaceutical industry for 28 years. I first met R around 2002. Since then I have been continuously astonished by the great variety of solutions that can be achieved with R. The speed at which unique packages are developed is terrific.
Now let’s focus on our original target: R.
Over the past years I have asked myself: is there anybody who can perfectly – optimally – benefit from the advantages of R? You probably know a solution for a problem, but it may well be that much better solutions exist, hidden from you. Is there anybody who has absolutely professional command of the methods implemented in R? How many statisticians or programmers are able to stay up to date on changes to packages and the arrival of new ones?
Let’s look at an easy task: regression. Linear regression is already supported by the base system. Subset selection – among other possibilities – is supported first and foremost by “MASS”. A very good summary of logit regression, presenting the application of the package “aod”, is available online. Graphical interpretation of logistic models can be supported with the “ggplot2” or “regvis” packages. According to Stephen Turner the best package for multiple regression is “rms”; on the same page Peter Flom names “car” as the first choice for multiple-regression problems.
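To make the “easy task” concrete, here is a minimal sketch using only what ships with a standard R installation (the built-in mtcars data; stepAIC comes from the recommended package “MASS” mentioned above – the model formulas are mine, chosen purely for illustration):

```r
# Linear regression with the base stats package, using the built-in mtcars data
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
summary(fit)          # coefficients, R-squared, p-values

# Stepwise subset selection via MASS (ships with a standard R installation)
library(MASS)
best <- stepAIC(fit, direction = "backward", trace = FALSE)
formula(best)         # the formula retained after backward selection

# Logistic regression needs no extra package either: glm() with a binomial family
logit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
coef(logit)
```

The point is exactly the one made above: all three steps are covered by the base system plus one recommended package, yet several contributed packages offer richer alternatives for each of them.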
The list is long – hopefully not infinite – but I do not want to demoralize anybody with this example; on the contrary, I would like to underline the importance of every activity that aims to bring some order into this chaos.
Let’s start at the beginning.
R was conceived and introduced by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand. The first version was not released until 1995.
Today R is the 11th most popular programming language – Java being the 1st. The total number of available R packages can probably be determined only with difficulty – it exceeded 11,000 (!) on 27 November 2017 – but at least there is an up-to-date list of the 40 most popular packages.
As I write this blog, the most popular package is “Rcpp”, a package for proper C++ integration (with more than 989K downloads). Strangely enough, programmers sometimes even put energy into implementing “fun” things. R is a good example of this as well: you can play the well-known Minesweeper game or watch a demonstration of the Tower of Hanoi.
End of Day 1
https://www.r-bloggers.com/simple-linear-regression-2/
18. June 2018 – Day 2
Only the second day, and I have already found it: Awesome-R [D2-1]. An R-related page with several interesting topics categorised by functionality, such as “Web technologies and services”, “Bayesian” or “Natural Language Processing”.
This is exactly the page I wanted to create.
Probably they started it more than 2 days ago. It looks very informative and promising.
Let’s examine some menu items.
I already mentioned NLP.
Today (11.06.2018) the relevant menu item contains 16 packages (from “text2vec” to “utf8”). If you click on the first item (“text2vec”), you get a GitHub link. Integrating such GitHub content is worth a detailed description in itself – maybe next time. But “text2vec” can also be installed directly from the R environment, and if that works, it is the more comfortable option.
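For the record, the two installation routes look roughly like this (a setup sketch, not run here; the GitHub repository name is my assumption based on the package’s author page):

```r
# Preferred: install the released version directly from CRAN
install.packages("text2vec")

# Alternative: install the development version from the GitHub link,
# using the "remotes" helper package (repository name assumed)
install.packages("remotes")
remotes::install_github("dselivanov/text2vec")

library(text2vec)
```

The CRAN route gives you the tested release; the GitHub route gives the latest development state, at the price of occasional breakage.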
I checked all the links listed here: all of them worked, though some are quite outdated (e.g. “SnowballC” was last updated on 09.08.2014). The real surprise was the “tidytext” link [D2-2], which referred to the book “Text Mining with R” by Julia Silge and David Robinson. (You can reach the book here, but only on a commercial basis.)
By the way, I just realised that NLP is not necessarily a self-evident concept for everybody.
In short, NLP (Natural Language Processing) is the secret dream of mankind (or at least of those with some IT/statistical knowledge): a set of appropriate algorithms that are able to interpret texts on their own. You have probably heard the term text mining. Text mining is the umbrella term for procedures that work with text-type variables. You can count words – once you have identified what a word is – define similarity measures for words, or compare different word sets (“How many Shakespeares wrote Shakespeare’s works?”).
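Counting words needs nothing beyond base R; a minimal sketch (the quotation is just a convenient sample text):

```r
# Naive word counting in base R: strip punctuation, lower-case, split on whitespace
text  <- "To be, or not to be, that is the question"
words <- strsplit(tolower(gsub("[[:punct:]]", "", text)), "\\s+")[[1]]
freq  <- sort(table(words), decreasing = TRUE)
head(freq)   # "to" and "be" each occur twice
```

Even this toy example hides the hard part: the gsub/strsplit line is already an (overly crude) answer to the question of what counts as a word.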
NLP is a bit more. As I hinted, the definition of a word is not so trivial (from a programming point of view). Are ‘dog’ and ‘dogs’ two different words or not? Syntactically yes, but semantically no. Or consider another example: are the terms “AMI” and “heart attack” identical? The answer might be yes, especially in the life sciences, as AMI may mean acute myocardial infarction. But please check the different meanings of AMI [D2-3] and you will easily find examples where AMI does not stand for an infarct.
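The ‘dog’/‘dogs’ problem is what stemmers address. A deliberately naive base-R illustration (real work would use a proper stemmer, e.g. the “SnowballC” package mentioned earlier; this one-line rule is mine and is intentionally too crude):

```r
# A toy stemmer: strip a trailing "s" so 'dog' and 'dogs' map to one token
naive_stem <- function(w) sub("s$", "", w)
naive_stem(c("dog", "dogs", "runs"))

# It also shows why naive rules fail: "glass" loses its final "s" too
naive_stem("glass")
```

One regular expression collapses ‘dog’ and ‘dogs’ – and immediately mangles ‘glass’, which is exactly why stemming is a research topic and not a one-liner.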
Despite the difficulties, the brief survey by M. S. Simpson and D. Demner-Fushman, “Biomedical Text Mining: A Survey of Recent Progress” [D2-4], explains in the required detail how many important tasks have already been identified and – at least partly – solved.
Going back to the R packages: the package “tm” states that it is a ‘comprehensive text mining framework for R’. Its documentation can be found here [D2-5]. First of all, the documentation makes it clear that the proposed framework was comparable in complexity to commercial solutions such as GATE, RapidMiner, SAS Miner and others – in 2008, when “tm” was introduced.
The framework itself supports the management of input data (a raw, unstructured set of words), the tagging of the input data, and the application of joining and filtering functionality on the source data. “tm” provides technical support for memory management, while it applies the so-called ‘term-document matrix’ approach to retrieve the really important information.
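The term-document matrix idea itself is easy to reproduce in base R (a sketch without “tm”, so everything “tm” adds – memory management, sparse storage, corpus handling – is deliberately ignored; the two toy documents are mine):

```r
# A tiny term-document matrix by hand: rows = terms, columns = documents
docs   <- c(d1 = "the dog chased the cat",
            d2 = "the cat slept")
tokens <- strsplit(docs, "\\s+")
pairs  <- data.frame(term = unlist(tokens),
                     doc  = rep(names(docs), lengths(tokens)))
tdm <- table(pairs$term, pairs$doc)
tdm   # each cell counts how often a term occurs in a document
```

Each cell (term, document) holds a count, and that simple structure is what retrieval and weighting schemes (term frequencies, tf-idf and the like) are built on.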
Next time I will go into more detail regarding the original concept, and I would like to investigate how this concept has developed over the last 10 years.
End of Day 2
[D2-1] https://awesome-r.com/