3 Geodemographics
In this chapter we introduce the topic of geodemographics and geodemographic classifications. The chapter is based on the following references:
Creating a Geodemographic Classification Using K-means Clustering in R (Guy Lansley and James Cheshire, 2018)
Lecture 10 of the 2020-21 Work Book for the module GEOG0030 on Geocomputation, delivered at UCL Department of Geography.
Geodemographics is the statistical study of human populations based on their locations. It includes the application of geodemographic classifications (GDCs), or profiling, whereby different locations are classified into groups based on the similarity in their demographic characteristics.
Assuming that the geodemographic characteristics of a group are an indicator of how the people in that group behave, GDC can be a very useful tool to predict the behavioural patterns of different regions. For this reason, geodemographics and GDC have applications in many areas, from marketing and retail to public health and service planning.
3.1 Dependencies
This chapter uses the libraries below. Ensure they are installed on your machine, then execute the following code chunk to load them:
#Support for simple features, a standardised way to encode spatial vector data
library(sf)
#Data manipulation
library(dplyr)
#A system for creating graphics
library(ggplot2)
#Easy visualisation of a correlation matrix using ggplot2
library(ggcorrplot)
#Color maps designed to improve graph readability
library(viridis)
#Alternative way of plotting, useful for radial plots
library(plotrix)
#Methods for cluster analysis
library(cluster)
#Thematic maps can be generated with great flexibility
library(tmap)
#Provides some easy-to-use functions to extract and visualize the output of multivariate data analyses
library(factoextra)
#Obtain the working directory, where we will save the data directory
getwd()
[1] "/Users/PIETROST/Library/CloudStorage/Dropbox/GitHub/r4ps"
3.2 Data
3.2.2 Import the data
In the code chunk below we load the dataset described above, lsoa-data-clean.csv, as a data frame and call it df_LSOA. We will be generating some maps to show the geographical distribution of our data and results. To do this, we need the data that defines the geographical boundaries of the LSOAs. This data can be found in the form of a shapefile here. We have also stored this shapefile, called LSOA_2011_London_gen_MHW.shp, in the data folder of this workbook, so you can import it directly as a data frame with simple features using the st_read() function from the sf package. For more information on the sf package, check the documentation.
# Import LSOA demographic data for GLA
# The raw data can be obtained from the link below, but it has been cleaned by Carmen Cabrera-Arnau for this chapter
# https://data.london.gov.uk/dataset/lsoa-atlas
df_LSOA <- read.csv("./data/geodemographics/lsoa-data-clean.csv")

# Import LSOA boundaries for GLA
st_LSOA <- st_read("./data/geodemographics/LSOA_2011_London_gen_MHW/LSOA_2011_London_gen_MHW.shp")
Reading layer `LSOA_2011_London_gen_MHW' from data source
`/Users/PIETROST/Library/CloudStorage/Dropbox/GitHub/r4ps/data/geodemographics/LSOA_2011_London_gen_MHW/LSOA_2011_London_gen_MHW.shp'
using driver `ESRI Shapefile'
Simple feature collection with 4835 features and 14 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 503574.2 ymin: 155850.8 xmax: 561956.7 ymax: 200933.6
Projected CRS: OSGB36 / British National Grid
3.3 Preparing the data for GDC
3.3.1 Choice of geographic units
Normally, GDCs involve the analysis of demographic data aggregated into geographic units. Very small geographic units can provide more detailed results, but if the counts are too low, this could lead to re-identification issues.
As mentioned above, the data for this chapter is aggregated into LSOAs. The size of the LSOAs is small enough to produce detailed results and is also a convenient choice, since it is broadly used in the UK Census and other official data-reporting exercises.
We can visualise the LSOAs within the GLA simply by plotting the geometry column of st_LSOA, which can be selected with the function st_geometry():
plot(st_geometry(st_LSOA), border=adjustcolor("gray20", alpha.f=0.4), lwd=0.6)
3.3.2 Variables of interest
Any classification task must be based on certain criteria that determine how elements are grouped into classes. For GDC, these criteria are the demographic characteristics of the population located in the geographic units under study. In this case, we have prepared the file lsoa-data-clean.csv to contain some interesting demographic data corresponding to each LSOA. The data frame df_LSOA contains this data and we can visualise its first few lines using the function head():
head(df_LSOA[,1:4])
Lower.Super.Output.Area LSOA11NM MidYrPop MidYrPop0to15
1 E01000907 Camden 001A 1431 20.68
2 E01000908 Camden 001B 1564 18.16
3 E01000909 Camden 001C 1602 14.86
4 E01000912 Camden 001D 1589 16.17
5 E01000913 Camden 001E 1695 17.23
6 E01000893 Camden 002A 1563 17.08
As we can see, each row contains information about an LSOA and each column (starting from the third column) represents a demographic characteristic of the LSOA and the people living there. With the function names(), we can get the names of the columns in df_LSOA:
names(df_LSOA[,1:8])
[1] "Lower.Super.Output.Area" "LSOA11NM"
[3] "MidYrPop" "MidYrPop0to15"
[5] "MidYrPop16to29" "MidYrPop30to44"
[7] "MidYrPop45to64" "MidYrPop65"
The data frame df_LSOA contains many variables. As we can see above, they have summarised names. For a short description of what these names mean, we can load the file called Dictionary-lsoa-data-clean.csv:
<- read.csv("./data/geodemographics/Dictionary-lsoa-data-clean.csv")
df_dictionary
head(df_dictionary)
Label Description
1 LSOA11NM Name of LSOA
2 MidYrPop Mid-year Population Estimates;All Ages;2011
3 MidYrPop0to15 Mid-year Population Estimates;Aged 0-15;2011
4 MidYrPop16to29 Mid-year Population Estimates;Aged 16-29;2011
5 MidYrPop30to44 Mid-year Population Estimates;Aged 30-44;2011
6 MidYrPop45to64 Mid-year Population Estimates;Aged 45-64;2011
For the purposes of this chapter, we will focus on just a few of these variables since this will make the results easier to interpret. In particular, we will look at variables related to ethnicity, country of birth, employment status and qualifications. Let us select the fields of interest:
<- df_LSOA[, c("LSOA11NM", "White", "MixedMulti", "Asian", "BlackAfricanCaribbean", "Ethnic_Other", "BAME", "BirthUK", "BirthNotUK", "FullTimeStudent", "EmployRate", "UnemployRate", "Q_None", "Q_L1", "Q_L2", "Q_Apprenticeship", "Q_L3", "Q_L4OrAbove", "Q_Other", "Q_Sch_FullTimeSt18" )] df_LSOA
We can explore the summary statistics for each of the selected fields by applying the summary() function to the field of interest. For example, to obtain the summary statistics for the percentage of people belonging to the ethnic group ‘Black/African/Caribbean/Black British’, we can run the code below:
summary(df_LSOA$BlackAfricanCaribbean)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.10 4.30 9.50 13.05 18.90 63.70
This tells us that the mean or average percentage of people from this ethnic group in LSOAs within GLA is 13.05%. It also tells us that 63.70% of the population are Black/African/Caribbean/Black British in the LSOA with the maximum proportion of people belonging to this ethnic group.
To visualise the whole distribution of the variable ‘Percentage of Black/African/Caribbean/Black British’, we can plot a histogram:
hist(df_LSOA$BlackAfricanCaribbean, breaks=50, xlab="% Black/African/Caribbean/Black British", ylab='Number of LSOAs', main=NULL)
The histogram reveals that many LSOAs have a low proportion of Black/African/Caribbean/Black British people, but there are a few with more than 50% of their population belonging to this ethnic group.
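To put a number on that observation, we can count how many LSOAs fall above a chosen threshold. This is just a quick check using base R, not part of the original workflow:
# Count LSOAs where more than half of residents are Black/African/Caribbean/Black British
sum(df_LSOA$BlackAfricanCaribbean > 50, na.rm = TRUE)
# Express the same count as a percentage of all LSOAs in the GLA
100 * mean(df_LSOA$BlackAfricanCaribbean > 50, na.rm = TRUE)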
Now the question is whether the LSOAs with similar proportions of Black/African/Caribbean/Black British are also spatially close. To find out, we need to map the data. We can do this by joining the data frame df_LSOA with the data frame st_LSOA which contains the geographic boundaries of the LSOAs:
join_LSOA <- st_LSOA %>% left_join(df_LSOA, by='LSOA11NM')
If we plot the joined data frames using the tmap library functionalities, we can observe that there are indeed specific regions within the GLA with a high proportion of Black/African/Caribbean/Black British people.
legend_title = expression("% Black, African, Caribbean or Black British")
map_ethnic = tm_shape(join_LSOA) +
  tm_fill(col = "BlackAfricanCaribbean", title = legend_title, text.size = 10, palette = viridis(256), style = "cont") + # add fill
  tm_layout(legend.position = c(0.71, 0.02), legend.title.size=0.7, inner.margins=c(0.05, 0.05, 0.05, 0.14)) +
  tm_borders(col = "white", lwd = .01) + # add borders
  tm_compass(type = "arrow", position = c("right", "top"), size = 1) + # add compass
  tm_scale_bar(breaks = c(0,1,2,3), text.size = 0.8, position = c("left", "bottom")) # add scale bar
map_ethnic
3.4 Standardisation
3.4.1 Across geographic units
Although LSOAs have been designed to have similar population sizes, the population figures fluctuate. And of course, if the population of a place is bigger or smaller, this can affect the figures corresponding to demographic characteristics (e.g. presumably, the larger the total population, the higher the number of people who are unemployed).
To counter the effect of variable population sizes across geographic units, we always need to standardise the data so it is given as a proportion or a percentage. This has already been done in the dataset lsoa-data-clean.csv; however, if you were to create your own dataset, you would need to take this into account. To compute the right percentages, it is important to choose the right denominator. For example, if we are computing the percentage of people over the age of 65 in a given geographic unit, we divide the number of people over 65 by the total population in that geographic unit, then multiply by 100. However, if we are computing the percentage of single-person households, we need to divide the number of single-person households by the total number of households (and not by the total population), then multiply by 100.
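As a minimal sketch of this step (purely illustrative, since lsoa-data-clean.csv is already standardised; the data frame and column names below are made up), the two examples above could be computed as follows:
# Toy data frame with hypothetical raw counts for three geographic units
df_raw <- data.frame(Pop65Plus = c(120, 300, 95),
                     TotalPop = c(1500, 1800, 1400),
                     SinglePersonHh = c(210, 260, 180),
                     TotalHh = c(640, 720, 590))
# Percentage of people aged 65 and over: the denominator is the total population
df_raw$PctOver65 <- 100 * df_raw$Pop65Plus / df_raw$TotalPop
# Percentage of single-person households: the denominator is the total number of households
df_raw$PctSinglePersonHh <- 100 * df_raw$SinglePersonHh / df_raw$TotalHh
df_raw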
3.4.2 Variable standardisation
Data outliers are often present when analysing data from the real-world. These values are generally extreme and difficult to treat statistically. In GDC, they could end up dominating the classification process. To avoid this, we need to standardise the input variables as well, so that they all contribute equally to the classification process.
There are different methods for variable standardisation, but here we will compute the Z-scores of each variable, i.e. for a variable \(X\), Z-score = \(\dfrac{X-mean(X)}{std(X)}\), where \(std()\) refers to the standard deviation. In R, obtaining the Z-score of a variable is very simple with the function scale(). Since we want to obtain the Z-scores of all the variables under consideration, we can loop over the columns corresponding to the variables that we want to standardise:
# creates a new data frame
df_std <- df_LSOA
# extracts column names from df_std
colnames_df_std <- colnames(df_std)
# loops over the columns from position 2 (the first variable) to the last column
for(i in 2:ncol(df_std)){
  df_std[, colnames_df_std[i]] <- scale(as.numeric(df_std[, colnames_df_std[i]]))
}
3.5 Checking for variable association
Before diving into the clustering process, it is necessary to check for variable associations. Two variables that are strongly associated could be conveying essentially the same information. Consequently, excessive weight could be attributed to the phenomenon they refer to in the clustering process. There are different techniques to check for variable association, but here we focus on the Pearson’s correlation matrix.
Each row and column in a Pearson’s correlation matrix represents a variable. Each entry in the matrix represents the level of correlation between the variables corresponding to its row and column. In R, a Pearson’s correlation matrix can be created very easily with the cor() function, setting the method parameter to “pearson”. As a general rule, two variables with a correlation coefficient greater than 0.8 or smaller than -0.8 are considered to be highly correlated. If this is the case, we might want to discard one of the two variables since the information they convey is redundant. However, in some cases, it might be reasonable to keep both variables if we can argue that they each have a similar but distinct meaning.
The correlation coefficients by themselves are not enough to conclude whether two variables are correlated. Each correlation coefficient must be considered in combination with its p-value. For this reason, we also apply the cor_pmat() function below, which outputs a matrix of p-values corresponding to each correlation coefficient. Here, we set the confidence level at 0.95, so p-values smaller than 0.05 are considered statistically significant. In the correlation matrix plot, we add crosses to indicate which correlation coefficients are not significant (i.e. those with p-values above 0.05). Those crosses indicate that there is not enough statistical evidence to reject the claim that the variables in the corresponding row and column are uncorrelated.
# Matrix of Pearson correlation coefficients
corr_mat <- cor(df_std[, c(colnames_df_std[2:ncol(df_std)])], method = "pearson")
# Matrix of p-values
corr_pmat <- cor_pmat(df_std[, c(colnames_df_std[2:ncol(df_std)])], method = "pearson", conf.level = 0.95)
# Cross out the non-significant coefficients
ggcorrplot(corr_mat, tl.cex=7, hc.order = TRUE, outline.color = "white", p.mat = corr_pmat, colors = c(viridis(3)), lab=TRUE, lab_size=1.6)
Among the statistically significant values in the correlation matrix, we can see that BAME and White have a correlation of -1, and so do BirthUK and BirthNotUK. We will therefore remove BAME and BirthUK from our dataset. We also see that Q_L4OrAbove has a strong negative correlation with Q_None and Q_L1. We will therefore remove two of these three variables, for example Q_None and Q_L1.
# Remove BAME, BirthUK, Q_None, Q_L1
data <- subset(df_std, select = -c(BAME, BirthUK, Q_None, Q_L1))
We can now join the resulting dataset with the variable st_LSOA, which stores the geographical boundaries of the LSOAs. We perform this step so that we can later map the results:
join_data <- st_LSOA %>% left_join(data, by='LSOA11NM')
Once again, we can check the Pearson correlation matrix of the resulting dataset. Naturally, now that we have removed some variables that were strongly correlated with others, the correlation coefficients are not as high as before.
#Obtain column names from data
colnames_data <- colnames(data)
# Matrix of Pearson correlation coefficients
corr_mat_data <- cor(data[, c(colnames_data[2:ncol(data)])], method = "pearson")
# Matrix of p-values
corr_pmat_data <- cor_pmat(data[, c(colnames_data[2:ncol(data)])], method = "pearson")
# Cross out the non-significant coefficients
ggcorrplot(corr_mat_data, tl.cex=7, hc.order = TRUE, outline.color = "white", p.mat = corr_pmat_data, colors = c(viridis(3)), lab=TRUE, lab_size=1.6)
3.6 The clustering process
3.6.1 K-means
K-means clustering is a way of grouping similar items together. To illustrate the method, imagine you have a bag filled with vegetables, and you want to separate them into smaller bags based on their colour, size and flavour. K-means would do this for you by first randomly selecting a number k of vegetables (you provide k, e.g. k=4), and then grouping all the other vegetables based on which of the k initially selected vegetables they are closest to in colour, size and flavour. This process is repeated a few times until the vegetables are grouped as well as possible. The end result is k smaller bags, each containing vegetables of similar colour, size and flavour. This is similar to how k-means groups similar items in a data set into clusters.
More technically, k-means clustering is an unsupervised learning algorithm (we will learn more about this in Chapter 10) that partitions a set of points into k clusters, where k is a user-specified number. The algorithm iteratively assigns each point to the closest cluster, based on the mean of the points in the cluster, until no point can be moved to a different cluster to decrease the sum of squared distances between points and their assigned cluster mean. The result is a partitioning of the points into k clusters, where the points within a cluster are as similar as possible to each other, and as dissimilar as possible from points in other clusters.
In R, k-means can be applied easily using the function kmeans(), where some of the required arguments are: the dataset, the number of clusters (called centers), the number of random initial configurations to try (nstart) and the maximum number of iterations allowed (iter.max). For example, for a 4-cluster classification, we would run the following line of code:
Km <- kmeans(data[, c(colnames_data[2:ncol(data)])], centers=4, nstart=20, iter.max = 1000)
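The object returned by kmeans() stores, among other things, the cluster sizes, the cluster centres and the total within-cluster sum of squares that the algorithm minimises. A quick way to inspect these elements (an optional check, not part of the original workflow) is:
# Inspect the fitted k-means object
Km$size          # number of LSOAs assigned to each cluster
Km$centers       # cluster centres, expressed in Z-score units
Km$tot.withinss  # total within-cluster sum of squares minimised by the algorithm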
3.6.2 Number of clusters
As mentioned above, the number of clusters k is a parameter of the algorithm that has to be specified by the user. Ultimately, there is no right or wrong answer to the question ‘what is the optimum number of clusters?’. Deciding the value of k in the k-means algorithm can be a somewhat subjective process where in most cases, common sense is the most useful approach. For example, you can ask yourself if the obtained groups are meaningful and easy to interpret or if, on the other hand, there are too few or too many clusters, making the results unclear.
However, there are some techniques and guidelines to help us decide what the right number of clusters is. Here we explore the silhouette score method.
The silhouette score of a data point (in this case, an LSOA and its demographic data) is a measure of how similar this data point is to the data points in its own cluster compared to the data points in other clusters. The silhouette score ranges from -1 to 1, with a higher value indicating that the data point is well matched to its own cluster and poorly matched to neighbouring clusters. A score close to 1 means that the data point is distinctly separate from other clusters, whereas a score close to -1 means the data point may have been assigned to the wrong cluster. Given a number of clusters k obtained with k-means, we can compute the average silhouette score over all the data points. Then, we can plot the average silhouette score against k. The optimal value of k will be the one with the highest average silhouette score.
You can run the code below to compute the average silhouette score corresponding to different values of k ranging from 2 to 8. The optimum number of clusters is given by the value of k at which the average silhouette is maximised.
silhouette_score <- function(k){
  km <- kmeans(data[, c(colnames_data[2:ncol(data)])], centers = k, nstart=5, iter.max = 1000)
  ss <- silhouette(km$cluster, dist(data[, c(colnames_data[2:ncol(data)])]))
  mean(ss[, 3])
}
k <- 2:8
avg_sil <- sapply(k, silhouette_score)
plot(k, avg_sil, type='b', xlab='Number of clusters', ylab='Average Silhouette Scores', frame=FALSE)
From the figure, we can see that the optimum k is 5, so we will take 5 as the number of clusters. Note that this number might be different when you run the code, since the clustering algorithm involves some random steps.
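If you want the clustering to be reproducible from one run to the next, one option is to fix the random seed before calling kmeans(); the seed value below is arbitrary:
# Fix the random number generator so that the k-means initialisation is the same on every run
set.seed(123)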
Km <- kmeans(data[, c(colnames_data[2:ncol(data)])], centers=5, nstart=20, iter.max = 1000)
3.6.3 Other clustering methods
There are several other clustering methods apart from k-means. Each method has its own advantages and disadvantages, and the choice of method will ultimately depend on the specific data and clustering problem. We will not explore these methods in detail, but below we include some of their names and a brief description. If you want to learn about them, you can refer to the book “Pattern Recognition and Machine Learning” by Christopher Bishop (Bishop 2006).
Fuzzy C-means: a variation of k-means where a data point can belong to multiple clusters with different membership levels.
Hierarchical clustering: this method forms a tree-based representation of the data, where each leaf node represents a single data point and the branches represent clusters. A popular version of this method is agglomerative hierarchical clustering, where individual data points start as their own clusters and are merged together in a bottom-up fashion based on similarity (a brief sketch using base R's hclust() is included after this list).
DBSCAN: a density-based clustering method that groups together nearby points and marks as outliers those points that are far away from any cluster.
Gaussian Mixture Model (GMM): GMMs are probabilistic models that assume each cluster is generated from a Gaussian distribution. They can handle clusters of different shapes, sizes, and orientations.
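As an illustration of the second method in the list above, the sketch below applies agglomerative hierarchical clustering to the same standardised variables using base R's hclust(); the choice of Ward's linkage and of 5 groups is ours, made only for comparison with the k-means results:
# Agglomerative hierarchical clustering on the standardised variables
d <- dist(data[, c(colnames_data[2:ncol(data)])])  # pairwise Euclidean distances
hc <- hclust(d, method = "ward.D2")                # merge clusters using Ward's linkage
hc_groups <- cutree(hc, k = 5)                     # cut the dendrogram into 5 clusters
table(hc_groups)                                   # number of LSOAs in each cluster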
3.7 GDC results
3.7.1 Mapping the clusters
Our LSOAs are now grouped into 5 clusters according to the similarity in their demographic characteristics. We can create a final dataset based on join_data which includes the cluster to which each geographical unit belongs:
join_data_cluster <- join_data
join_data_cluster$cluster <- Km$cluster
Finally, we can plot the results of the clustering process on a map using functions from the tmap library:
map_cluster = tm_shape(join_data_cluster) +
  tm_fill(col = "cluster", title = "cluster", palette = viridis(256), style = "cont") + # add fill
  tm_borders(col = "white", lwd = .01) + # add borders
  tm_layout(legend.position = c(0.88, 0.02)) +
  tm_compass(type = "arrow", position = c("right", "top"), size = 1) + # add compass
  tm_scale_bar(breaks = c(0,1,2,3), text.size = 0.5, position = c("left", "bottom")) # add scale bar
map_cluster
Note: sometimes, the number of items in a cluster may be very small. In that case, you may want to merge clusters to make the number of items in each group more homogeneous, or perhaps change k in the k-means algorithm.
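A quick way to see whether this is an issue is to tabulate the cluster sizes; the merge shown in the comment below is only a hypothetical example of how two clusters could be combined:
# Number of LSOAs assigned to each of the 5 clusters
table(Km$cluster)
# Hypothetical merge: if cluster 5 were very small, it could be folded into cluster 4
# join_data_cluster$cluster[join_data_cluster$cluster == 5] <- 4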
3.7.2 Cluster interpretation
The map above not only displays the cluster to which each LSOA belongs, but also shows that there is a tendency for LSOAs belonging to the same cluster to be geographically close. This indicates that people with similar demographic characteristics tend to live close to each other. However, we still need to understand what each cluster represents.
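One informal way to explore this spatial tendency is to cross-tabulate clusters by borough. Here we assume (as a simplification) that the borough name can be recovered by stripping the trailing code, e.g. '001A', from LSOA11NM:
# Assumed LSOA11NM format: "<borough name> NNNX"; drop the final token to get the borough
borough <- sub(" [0-9]{3}[A-Z]$", "", join_data_cluster$LSOA11NM)
# Cross-tabulate boroughs against cluster labels
table(borough, join_data_cluster$cluster)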
The so-called cluster centers (stored below in KmCenters) summarise, for each cluster, the average characteristics of its members based, of course, on the variables used in the classification. The data used in the clustering process was Z-score standardised, so the values of each variable corresponding to the cluster centers are still expressed as Z-scores: zero indicates the mean of a variable across all the data points in the sample, and values above or below zero indicate the number of standard deviations from that mean. This makes it easy to understand the unique characteristics of each cluster relative to the entire sample. To visualise the characteristics and meaning of the cluster centers and their corresponding clusters, we use radial plots. Below we produce a radial plot for cluster 1. Can you see which variables are higher or lower than average in this cluster? If you want to visualise the radial plot for other clusters, you will need to change the number inside the brackets of KmCenters[1,].
# creates a radial plot for cluster 1
# boxed.radial = FALSE prevents white boxes forming under the labels
# radlab rotates the labels
KmCenters <- as.matrix(Km$centers)
KmCenters <- as.data.frame(KmCenters)
radial.plot(KmCenters[1,], labels = colnames(KmCenters),
            boxed.radial = FALSE, show.grid = TRUE,
            line.col = "blue", radlab = TRUE, rp.type="p", label.prop=0.9, mar=c(3,3,3,3))
3.7.3 Testing
We will evaluate the fit of the k-means model with 5 clusters by creating an x-y plot of the first two principal components of each data point. Each point is coloured according to the cluster to which it belongs. Remember that the aim of principal component analysis is to create the minimum number of new variables, based on combinations of the original variables, that can explain the variability in the data. The first principal component is the new variable that captures the most variability.
In the plot below, we can see the first and second principal components on the x and y axes respectively, with the axis labels indicating the amount of variability that these components are able to explain. To create the plot, we use the fviz_cluster() function from the factoextra library.
fviz_cluster(Km, data = data[,c(colnames_data[2: ncol(data)])], geom = "point", ellipse = F, pointsize = 0.5,
ggtheme = theme_classic())
There are obvious clusters in the plot, but some points lie in regions where two or more clusters overlap, making it unclear to which cluster they should really belong. This does not mean that our classification is wrong; instead, it is a result of the fact that the plot only represents the first two principal components, and some of the variability in the data is not captured in this 2-dimensional representation.
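If you want to check how much of the total variability the two plotted components actually capture, you can compute the principal components directly with base R's prcomp(). This is only a complementary check, and the exact figures may differ slightly from those reported by fviz_cluster() depending on its internal scaling options:
# Principal component analysis on the same standardised variables used for clustering
pca <- prcomp(data[, c(colnames_data[2:ncol(data)])])
# Proportion of variance explained by the first two components
summary(pca)$importance["Proportion of Variance", 1:2]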
3.8 Questions
For this set of questions, we will be using the same datasets that we used throughout this chapter: the London LSOA dataset and the shapefile for the LSOA boundaries.
<- read.csv("./data/geodemographics/lsoa-data-clean.csv")
df_LSOA
# Import LSOA boundaries for GLA
<- st_read("./data/geodemographics/LSOA_2011_London_gen_MHW/LSOA_2011_London_gen_MHW.shp") st_LSOA
Reading layer `LSOA_2011_London_gen_MHW' from data source
`/Users/PIETROST/Library/CloudStorage/Dropbox/GitHub/r4ps/data/geodemographics/LSOA_2011_London_gen_MHW/LSOA_2011_London_gen_MHW.shp'
using driver `ESRI Shapefile'
Simple feature collection with 4835 features and 14 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 503574.2 ymin: 155850.8 xmax: 561956.7 ymax: 200933.6
Projected CRS: OSGB36 / British National Grid
This time, we will focus on demographic variables related to ethnicity, country of birth, housing ownership status, income and age group. Let us select these demographic variables for the questions below.
<- df_LSOA[, c("LSOA11NM", "White", "MixedMulti", "Asian", "BlackAfricanCaribbean", "Ethnic_Other", "BAME", "BirthUK", "BirthNotUK", "OwnedOutright", "OwnedMortgage", "SocialRent", "PrivateRent", "Hh_MeanIncome", "MidYrPop0to15", "MidYrPop16to29", "MidYrPop30to44", "MidYrPop45to64", "MidYrPop65")] df_LSOA
Prepare your data for a geodemographic classification (GDC). To do this, start by standardising the selected variables. Then, check for variable association using a correlation matrix. Discard any variables if necessary. Join the resulting dataset with the LSOA boundary data. Now you should be ready to group the data into clusters using the k-means algorithm. Based on the average silhouette score method, select the number of clusters for a GDC with k-means. Every time you apply the kmeans()
function, you should set nstart=20 and iter.max=1000.
Essay questions:
- Describe how you prepared your data for the GDC. There is no need to include figures, but you should briefly explain how you reached certain decisions. For example, did you discard any variables due to their strong association with other variables in the dataset? How did you pick the number of clusters for your GDC?
- Map the resulting clusters and generate a radial plot for one of the clusters. You should create just one figure with as many subplots as needed.
- Describe what you observe and comment on your results. Do you observe any interesting patterns? Do the results of this GDC agree with what you would expect? Justify your answer.