Today’s challenge is a geographical one. Do you know which are the most populated cities in the world? Do you know where they are? China? The USA? And, by way of contrast, do you know which are the smallest cities in the world?
Today we want to show you where you can find the largest and the smallest cities in the world by population, on a map. While trustworthy sources on the web generally agree on which are the most populated cities, agreement becomes sparser when looking for the smallest cities in the world. There is general agreement, though, about which ones are the smallest capitals in the world.
We collected data for the world’s 125 largest cities in one CSV text file and data for the 10 smallest capitals of equally small and beautiful countries in another CSV text file. The data includes city name, country, area in square kilometers, population, and population density. Today’s challenge is to locate these cities on a world map. Technically this means:
- To blend the city data from the CSV files with the city geo-coordinates from the Google Geocoding API inside KNIME Analytics Platform
- Then to blend the ETL and machine learning capabilities of KNIME Analytics Platform with the geographical visualization of OpenStreetMap.
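To give a rough idea of what the geo-coordinate blending does under the hood, the snippet below builds a Google Geocoding API request for one city and extracts latitude and longitude from the JSON response. Python is used here purely for illustration; in the workflow this is done with KNIME nodes, and the API key is a placeholder you would have to supply yourself.

```python
import json
import urllib.parse
import urllib.request

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"

def build_geocode_url(city, country, api_key):
    """Build the request URL for the Google Geocoding API."""
    params = urllib.parse.urlencode(
        {"address": f"{city}, {country}", "key": api_key}
    )
    return f"{GEOCODE_ENDPOINT}?{params}"

def parse_coordinates(response_json):
    """Extract (latitude, longitude) from a geocoding JSON response."""
    location = response_json["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]

def geocode(city, country, api_key):
    """Call the API and return the coordinates for one city."""
    with urllib.request.urlopen(build_geocode_url(city, country, api_key)) as resp:
        return parse_coordinates(json.load(resp))
```

Each row of the CSV file would be fed through `geocode` once, and the resulting latitude/longitude pair appended to the table before plotting.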
The Semantic Web
According to the W3C Linked Data page, the Semantic Web refers to a technology stack that supports the “Web of data”. Semantic Web technologies enable people to create data stores on the Web, build vocabularies, and write rules for handling data. Linked Data is empowered by technologies such as RDF, SPARQL, OWL, and SKOS.
- RDF. The Resource Description Framework is a standard data model for representing the metadata of resources on the Web; it can represent all resources, even those that cannot be directly retrieved. RDF makes it easy to process, mix, expose, and share such metadata. In terms of the relational model, an RDF statement specifies a relationship between two resources: it is a triple with a subject, a predicate, and an object.
- OWL. The Web Ontology Language is based on the basic elements of RDF, but uses a wider vocabulary to describe properties and classes.
- SKOS. Simple Knowledge Organization System is also based on RDF and specifically designed to express hierarchical information. If needed, it is also extendable into OWL.
- SPARQL. The SPARQL Protocol and RDF Query Language is a query language used to retrieve and manipulate public and private metadata stored in RDF format.
A commonly used instance of the Semantic Web is the DBpedia project, which was created to extract structured content from Wikipedia.
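To make the triple model concrete, here is a toy in-memory triple store in Python: each statement is a (subject, predicate, object) triple, and a wildcard pattern match plays the role of a variable in a SPARQL basic graph pattern. The resource names are illustrative DBpedia-style identifiers, not actual query results.

```python
# A toy in-memory triple store illustrating the RDF data model:
# each statement is a (subject, predicate, object) triple.
triples = [
    ("dbpedia:Berlin", "rdf:type", "dbo:City"),
    ("dbpedia:Berlin", "dbo:country", "dbpedia:Germany"),
    ("dbpedia:Tokyo", "rdf:type", "dbo:City"),
    ("dbpedia:Tokyo", "dbo:country", "dbpedia:Japan"),
]

def match(subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern; None acts as a wildcard,
    much like a variable (?s, ?p, ?o) in a SPARQL graph pattern."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]

# All resources of type dbo:City -- analogous to
#   SELECT ?s WHERE { ?s rdf:type dbo:City }
cities = [s for (s, p, o) in match(predicate="rdf:type", obj="dbo:City")]
```

A real SPARQL endpoint such as DBpedia does the same kind of pattern matching, only over billions of triples and with a far richer query language.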
Our latest release KNIME Analytics Platform 3.2 includes a great feature: semantic web integration! A full node category is dedicated to querying and manipulating semantic web resources. The new semantic web nodes treat the web of data exactly like a database, with connector nodes, query nodes, and manipulation nodes. Additional nodes are provided to read and write files in various formats.
Definition of Customer Segments
Customer segmentation has undoubtedly been one of the most implemented applications in data analytics since the birth of customer intelligence and CRM data.
The concept is simple: group your customers together based on some criteria, such as revenue creation, loyalty, demographics, buying behavior, or any combination of these and other criteria.
The group (or segment) can be defined in many ways, depending on the data scientist’s degree of expertise and domain knowledge.
- Grouping by rules. Somebody in the company already knows how the system works and how the customers should be grouped together with respect to a given task, e.g. a campaign. A Rule Engine node suffices to implement this set of experience-based rules. This approach is highly interpretable, but not very portable to new analyses. In the presence of a new goal, new knowledge, or new data, the whole rule system needs to be redesigned.
- Grouping as binning. Sometimes the goal is clear and not negotiable. One of the many features describing our customers is selected as the representative one, be it revenue, loyalty, demographics, or anything else. In this case, segmenting the customers into groups reduces to a pure binning operation: customer segments are built along one or more attributes by means of bins. This task is easy to implement using one of the many binner nodes available in KNIME Analytics Platform.
- Grouping with zero knowledge. Frequently, the data scientist does not know enough about the business at hand to build his own customer segmentation rules. In this case, if no business analyst is around to help, he should resort to a plain blind clustering procedure. The subsequent work of interpreting the clusters belongs to a business analyst, who is (or should be) the domain expert.
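To illustrate the second option, grouping as binning reduces to a one-attribute lookup. The sketch below assigns a customer to a segment from a single attribute (here revenue); the bin edges and labels are made up for illustration only, whereas in KNIME Analytics Platform a binner node would implement this without any code.

```python
def assign_segment(revenue,
                   edges=(100.0, 500.0, 1000.0),
                   labels=("low", "medium", "high", "top")):
    """Assign a customer to a segment by binning a single attribute.

    Edges and labels are purely illustrative, not taken from the post:
    revenue below 100 -> "low", below 500 -> "medium",
    below 1000 -> "high", everything else -> "top".
    """
    for edge, label in zip(edges, labels):
        if revenue < edge:
            return label
    return labels[-1]
```

The rule-based approach from the first option looks much the same in code, except that the thresholds and labels come from a domain expert rather than from a chosen attribute’s distribution.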
With the set goal of making this workflow suitable for a number of different use cases, we chose the third option.
There are many clustering procedures, and KNIME Analytics Platform makes them available in the Node Repository panel, in the category Analytics/Mining/Clustering: e.g. k-Means, nearest neighbors, DBSCAN, hierarchical clustering, SOTA, etc. We went for the most commonly used: the k-Means algorithm.
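For reference, the core of k-Means can be sketched in a few lines of Python: alternate between assigning each point to its nearest centroid and moving each centroid to the mean of its cluster. The KNIME k-Means node implements this (with many more options, such as normalization and distance choices); the snippet below only shows the idea on 2-D points.

```python
import math
import random

def kmeans(points, k, iterations=20, seed=0):
    """A minimal k-Means sketch for 2-D points (illustration only)."""
    rng = random.Random(seed)
    # Initialize centroids with k distinct data points.
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(
                    sum(coord) / len(cluster) for coord in zip(*cluster)
                )
    return centroids, clusters
```

In the workflow itself no code is needed: the k-Means node takes the numeric customer attributes as input and outputs a cluster label per row, which the business analyst then interprets.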
Read more: KNIME.ORG