One of the major tasks in natural language processing (NLP) is the part-of-speech (POS) tagging of sentences, i.e. categorizing the words according to grammatical properties. Common parts of speech are noun, verb, article, adjective, preposition, pronoun, adverb, conjunction and interjection.
With the recent R package RDRPOSTagger, one can now perform POS tagging within R for more than 40 languages, including English, Hungarian, French, German, Hindi, Italian, Thai, Vietnamese and many more. Below we give a brief introduction to the topic via a simple R script.
First, we have to install/load the package. The tokenizers package is also needed for splitting the text into sentences.
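The loading step might look like the sketch below. It only checks whether the two packages are available and attaches them if so; the installation commands are left as comments because they rest on assumptions not stated in this post, namely that tokenizers is installed from CRAN and RDRPOSTagger from its GitHub repository (bnosac/RDRPOSTagger) via the remotes package.

```r
# Packages used in this post.
needed <- c("RDRPOSTagger", "tokenizers")

# Find out which of them are not installed yet.
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
not_installed <- needed[!is_installed]

if (length(not_installed) > 0) {
  message("Please install first: ", paste(not_installed, collapse = ", "))
  # Assumed installation sources (not confirmed by this post):
  # install.packages("tokenizers")
  # remotes::install_github("bnosac/RDRPOSTagger")
} else {
  library(RDRPOSTagger)
  library(tokenizers)
}
```

Checking before attaching means the script reports a missing package instead of stopping with an error.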
The part-of-speech tags are listed along with their abbreviations.
unipostag_types <- c(
  "ADJ" = "adjective", "ADP" = "adposition", "ADV" = "adverb",
  "AUX" = "auxiliary", "CONJ" = "coordinating conjunction",
  "DET" = "determiner", "INTJ" = "interjection", "NOUN" = "noun",
  "NUM" = "numeral", "PART" = "particle", "PRON" = "pronoun",
  "PROPN" = "proper noun", "PUNCT" = "punctuation",
  "SCONJ" = "subordinating conjunction", "SYM" = "symbol",
  "VERB" = "verb", "X" = "other"
)
Next, the text to analyse is added (source: Wikipedia).
text <- "Rubik's Cube is a 3-D combination puzzle invented in 1974 by Hungarian sculptor and professor of architecture Ernő Rubik. Originally called the Magic Cube, the puzzle was licensed by Rubik to be sold by Ideal Toy Corp. in 1980 via businessman Tibor Laczi and Seven Towns founder Tom Kremer, and won the German Game of the Year special award for Best Puzzle that year. As of January 2009, 350 million cubes had been sold worldwide making it the world's top-selling puzzle game. It is widely considered to be the world's best-selling toy."
We split it into sentences.
sentences <- tokenize_sentences(text, simplify = TRUE)
The language and the type of tagging need to be defined.
unipostagger <- rdr_model(language = "UD_English", annotation = "UniversalPOS")
Finally, the tagging is performed.
unipostags <- rdr_pos(unipostagger, sentences)
unipostags$word.type <- unipostag_types[unipostags$word.type]
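The second line above replaces each tag abbreviation with its full name by indexing the named vector with the abbreviations. A minimal, self-contained sketch of that lookup follows; the shortened table `tag_names` and the vector `tags` are made up for illustration and are not actual tagger output.

```r
# A shortened copy of the lookup table (three entries only, for illustration).
tag_names <- c("NOUN" = "noun", "VERB" = "verb", "DET" = "determiner")

# Hypothetical tagger output: one abbreviation per word.
tags <- c("NOUN", "VERB", "DET", "NOUN")

# Indexing a named vector by name returns the matching values, in order.
full_names <- unname(tag_names[tags])
print(full_names)

# An abbreviation that is missing from the table yields NA, not an error.
unknown <- unname(tag_names["FOO"])
print(unknown)
```

Because unknown abbreviations silently become NA, it may be worth checking the result with `anyNA()` when a different annotation scheme is used.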
The results for the first sentence can be seen below.
   sentence.id word.id  word          word.type
1            1       1  Rubik's       noun
2            1       2  Cube          noun
3            1       3  is            verb
4            1       4  a             determiner
5            1       5  3-D           numeral
6            1       6  combination   noun
7            1       7  puzzle        adjective
8            1       8  invented      verb
9            1       9  in            adposition
10           1      10  1974          numeral
11           1      11  by            adposition
12           1      12  Hungarian     adjective
13           1      13  sculptor      noun
14           1      14  and           coordinating conjunction
15           1      15  professor     noun
16           1      16  of            adposition
17           1      17  architecture  noun
18           1      18  Ernő          proper noun
19           1      19  Rubik.        proper noun
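Once the tags are in a data frame, base R is enough to summarise them. The sketch below re-creates the word and word.type columns of the first sentence by hand from the output above, so that it runs without the tagger, then counts the tags and extracts the nouns. The names `first_sentence`, `tag_counts` and `nouns` are introduced here for illustration only.

```r
# First-sentence results, copied by hand from the output above.
first_sentence <- data.frame(
  word = c("Rubik's", "Cube", "is", "a", "3-D", "combination", "puzzle",
           "invented", "in", "1974", "by", "Hungarian", "sculptor", "and",
           "professor", "of", "architecture", "Ern\u0151", "Rubik."),
  word.type = c("noun", "noun", "verb", "determiner", "numeral", "noun",
                "adjective", "verb", "adposition", "numeral", "adposition",
                "adjective", "noun", "coordinating conjunction", "noun",
                "adposition", "noun", "proper noun", "proper noun"),
  stringsAsFactors = FALSE
)

# Frequency of each part of speech, most common first.
tag_counts <- sort(table(first_sentence$word.type), decreasing = TRUE)
print(tag_counts)

# All words tagged as nouns, in sentence order.
nouns <- first_sentence$word[first_sentence$word.type == "noun"]
print(nouns)
```

With the real `unipostags` data frame, the same two calls can be applied per sentence by first subsetting on `sentence.id`.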
For more details about the RDRPOSTagger package, see the article "Natural Language Processing on 40 languages with the Ripple Down Rules-based Part-Of-Speech Tagger".
OpenTrialsFDA works on making clinical trial data from the FDA (the US Food and Drug Administration) more easily accessible and searchable. Until now, this information has been hidden in the user-unfriendly Drug Approval Packages that the FDA publishes via its data portal, Drugs@FDA. These are often just images of pages, so you cannot even search them for a text phrase. OpenTrialsFDA scrapes all the relevant data and documents from these packages, runs optical character recognition (OCR) across all documents and links this information to other clinical trial data.
Explore the public beta version through a new user-friendly web interface at https://fda.opentrials.net.
OpenTrials aims to provide a comprehensive picture of the data and documents on all clinical trials conducted on medicines and other treatments. The platform will present data aggregated from a wide variety of existing sources, starting with clinical trial registers and moving on to academic journals, systematic reviews and other data sources.
The intention is to create an open, freely re-usable index of all such information, to increase discoverability, facilitate research, identify inconsistent data, enable audits on the availability and completeness of this information, support advocacy for better data and drive standards around open data in evidence-based medicine.
Explore the public beta version of OpenTrials here.
SatRdays are community-led, regional conferences that support collaboration, networking and innovation within the R community. The initiative of Steph Locke and Gergely Daroczi was accepted and funded by the R Consortium. The very first event of the series took place in Budapest, Hungary on September 3, 2016, with almost 200 attendees from 19 countries and 12 hours of pure R fun. The day began with various workshops, followed by two keynotes and several regular talks, and ended with a data visualization challenge. The complete schedule can be found on the conference website (http://budapest.satrdays.org). The talks were live-streamed and can be watched online (http://www.ustream.tv/channel/xFdxHeVnGKS). If you have only limited time, we recommend the following talks: the first keynote by Gábor Csárdi (R package history), Romain François's question section (including a marriage proposal), the second keynote by Jeroen Ooms (HTTP requests, ImageMagick) and data sonification by Thomas Levine. Overall, the first satRdays event received very positive feedback from the R community and began to establish the reputation of the series. Personal thoughts about the conference from the main organizer were published on the R Consortium blog (https://www.r-consortium.org/news/blogs/2016/09/start-satrdays).