Basic Statistical Graphics for Archaeology with R:
Life Beyond Excel

Mike Baxter* and Hilary Cool**

* Nottingham Trent University, School of Science and Technology (Emeritus)
** Barbican Research Associates, 16 Lady Bay Road, Nottingham NG2 5BJ, UK

October 2016
Contents

Preface

1 Introduction
  1.1 Why statistics?
  1.2 Why R?
  1.3 The need for graphics, or not
  1.4 How this book was written
  1.5 Structure of the text
  1.6 Additional resources – data sets and R code

2 R – an introduction
  2.1 Introduction
  2.2 R basics
    2.2.1 R – getting started
    2.2.2 Data import
    2.2.3 Data structures
  2.3 Functions and packages
    2.3.1 Functions
    2.3.2 Packages – general
    2.3.3 Packages – graphics
    2.3.4 Plot construction – ggplot2
    2.3.5 Colour
  2.4 Don’t panic

3 Descriptive statistics
  3.1 Introduction
  3.2 Types of data
  3.3 Summary statistics
    3.3.1 Definitions
    3.3.2 Usage (and abusage)
  3.4 Obtaining summary statistics in R

4 Histograms
  4.1 Introduction
  4.2 Histograms in base R
  4.3 Histograms using ggplot2
  4.4 Histograms from count data
  4.5 Histograms with unequal bin-widths
  4.6 Using histograms for comparison
  4.7 Frequency polygons

5 User-defined functions
  5.1 User-defined functions
    5.1.1 Introduction
    5.1.2 Writing functions – a very basic introduction
    5.1.3 More realistic illustrations
  5.2 Summary

6 Kernel density estimates
  6.1 Introduction
  6.2 Example – loomweights from Pompeii
  6.3 Comparing KDEs
  6.4 Two-dimensional (2D) KDEs

7 Boxplots and related graphics
  7.1 Introduction
  7.2 Construction and interpretation
  7.3 Using plots for comparing groups
  7.4 Discussion

8 Scatterplots
  8.1 Introduction
  8.2 Presentational aspects
    8.2.1 Plotting symbols, axis labels, axis limits
    8.2.2 Adding lines to plots
    8.2.3 Points
    8.2.4 Scatterplot matrices

9 Pie charts
  9.1 Introduction
  9.2 Examples
    9.2.1 Pointless pie charts
    9.2.2 Dishonest pie charts
    9.2.3 Could do better
    9.2.4 From bad to worse?

10 Barplots
  10.1 Introduction
  10.2 One-way data
  10.3 Two-way data
    10.3.1 Preliminary considerations
    10.3.2 Barplots for two-way data in base R
    10.3.3 Barplots in ggplot2
  10.4 Examples, including things to avoid
    10.4.1 Glass ‘counters’ and their function
    10.4.2 Chartjunk I
    10.4.3 Chartjunk II

11 Ternary diagrams
  11.1 Introduction
  11.2 Examples
    11.2.1 Palaeolithic assemblages
    11.2.2 Romano-British civilian faunal assemblages
    11.2.3 Staffordshire Hoard gold compositions

12 Correspondence analysis
  12.1 Introduction
  12.2 Correspondence analysis – basics
  12.3 Some data
  12.4 R basics
    12.4.1 ‘Default’ analyses
    12.4.2 Customising plots
    12.4.3 Interpretation
  12.5 Glass ‘counters’ from Pompeii
  12.6 Anglo-Saxon male graves and seriation
  12.7 Flavian drinking-vessels

References

Index
List of Figures

3.1 Some possible distributional ‘shapes’ for continuous data.
3.2 A histogram and KDE for late Romano-British hairpin lengths.

4.1 Romano-British hairpins.
4.2 A Roman pillar moulded bowl.
4.3 Histograms for the Romano-British hairpin lengths I.
4.4 Histograms for the Romano-British hairpin lengths II.
4.5 Histograms for the Romano-British hairpin lengths using ggplot2.
4.6 Different ways of representing the data from Table 4.3.
4.7 Using stacked histograms for comparison.
4.8 Using superimposed histograms for comparison.
4.9 Frequency polygons for Romano-British hairpin lengths.

6.1 Construction of a kernel density estimate.
6.2 The effect of bandwidth on KDE construction.
6.3 KDEs of the weights of loomweights I.
6.4 A view of Insula VI.1, Pompeii.
6.5 Loomweights from Insula VI.1, Pompeii.
6.6 KDEs of the weights of loomweights II.
6.7 KDEs and histograms of the weights of loomweights.
6.8 KDEs comparing lengths of early and late Romano-British hairpins.
6.9 Two-dimensional KDEs using ggplot2.
6.10 Two-dimensional KDEs using base R.
6.11 Contour plots based on 2D KDEs using base R.

7.1 Basic boxplots, violin plots and beanplots.
7.2 Boxplots and beanplots used to compare weights of loomweights.
7.3 Kernel density estimates of weights of phased loomweights.
7.4 Boxplots and violin plots used to compare weights of loomweights.
7.5 Boxplots and violin plots for Romano-British hairpin lengths.
7.6 Boxplots for skewed data.

8.1 Default scatterplots for loomweight heights and weights.
8.2 Enhanced scatterplots for loomweight heights and weights.
8.3 Basic code for a base R scatterplot.
8.4 Available symbols for plotting points in R.
8.5 Basic code for a ggplot2 scatterplot.
8.6 Adding lines to plots I.
8.7 Available lines for plotting in R.
8.8 Adding lines to plots II.
8.9 Adding points to plots.
8.10 Plotting multiple groups I.
8.11 Plotting multiple groups II.
8.12 A scatterplot matrix for loomweight dimensions using base R.
8.13 An enhanced scatterplot matrix for loomweight dimensions.
8.14 A ggpairs scatterplot for loomweight dimensions.

9.1 Pie charts of the distribution of contexts by cauldron group.
9.2 Pie charts showing the distribution of amphora types in the Lower Danube region.
9.3 Pie charts for faunal remains from Wroxeter.
9.4 Soay sheep and Dexter cattle.
9.5 A clustered barplot for faunal remains from Wroxeter.
9.6 A ternary diagram for faunal remains from Wroxeter.
9.7 Pie charts for counter colours by phase from Pompeii, Insula VI.1.

10.1 Default and enhanced barplots for amphora presence on Romano-British sites.
10.2 Default barplots for species presence on Romano-British sites.
10.3 Enhanced barplots for species presence on Romano-British sites using base R.
10.4 ‘Default’ barplots for species presence on Romano-British sites using ggplot2.
10.5 A modified clustered barplot for species presence on Romano-British sites using ggplot2.
10.6 Glass counters from Gloucester and Pompeii.
10.7 A clustered barplot for glass ‘counter’ dimensions from different contexts.
10.8 An alternative (and poorly) ‘enhanced’ barplot for amphora presence on Romano-British sites.
10.9 A version of Figure 10.3a showing the visual impact of grid lines.
10.10 Monthly distribution of sacrifices to Saturn.
10.11 Examples of unnecessary three-dimensional barplots.

11.1 Ternary diagrams for the assemblage data of Table 11.1.
11.2 ‘Default’ and modified ternary diagrams for Romano-British civilian faunal assemblages.
11.3 A ‘modified’ ternary diagram for Romano-British civilian faunal assemblages II.
11.4 A ternary diagram for Romano-British civilian faunal assemblages overlaid with a two-dimensional KDE.
11.5 Ternary diagrams for Romano-British civilian faunal assemblages for different site types overlaid with two-dimensional KDEs.
11.6 Objects from the Staffordshire Hoard.
11.7 A ternary diagram for Au/Ag/Cu compositions of artefacts from the Staffordshire Hoard.

12.1 A selection of Romano-British bow brooches of 1st to 4th century date.
12.2 Basic correspondence analyses for the Romano-British brooch data set.
12.3 An enhanced correspondence analysis for the Romano-British brooch data set using the ca function and base R.
12.4 An enhanced correspondence analysis for the Romano-British brooch data set using the ca function and ggplot2.
12.5 A correspondence analysis of glass counter colours by phase from Insula VI.1, Pompeii, Italy.
12.7 Examples of Flavian glass vessels.
12.8 Correspondence analysis, Flavian drinking-vessel glass assemblages.
List of Tables

3.1 Lengths of late Romano-British copper alloy hairpins.

4.1 Lengths of Romano-British copper alloy hairpins.
4.2 The Romano-British hairpin data as entered into R.
4.3 Chronology of Roman glass pillar-moulded bowls.

6.1 Loomweight dimensions.

7.1 Phased loomweight data with six variables.

9.1 Counts of Iron Age and early Roman cauldrons by context.
9.2 Percentages of major domesticates from Wroxeter by phase.
9.3 Colours of glass ‘counters’ by phase, from Insula VI.1, Pompeii.

10.1 Amphora as a proportion of total pottery assemblages by site.
10.2 Later first- to mid-second century animal bone assemblages from Romano-British sites.
10.3 The data of Table 10.2 recast for use with the barplot function in base R.
10.4 The data of Table 10.2 recast for use with ggplot2.
10.5 Glass ‘counter’ sizes from Insula VI.1 Pompeii and other contexts.
10.6 Individuals commemorated with and setting up funerary altars.

11.1 Counts of cores, blanks and tools from middle levels of the palaeolithic site at Ksar Akil (Lebanon).
11.2 Faunal remains from Romano-British civilian sites.
11.3 Au/Ag/Cu compositions for artefacts from the Staffordshire Hoard.

12.1 Regional counts of late Iron Age and Roman brooches.
12.2 Assemblage profiles for Flavian drinking-vessel glass.
Preface
As the late, great Terry Pratchett put it, in his Discworld novel Hogfather,
“Everything starts somewhere, though many physicists disagree.
But people have always been dimly aware of the problem with the
start of things.”
The genesis of this text is multiple. But the idea of producing it, to the best
of our recollection, crystallized during a conversation in a pub – as many good
and bad ideas do. The first author is an academic statistician with reasonably
extensive youthful experience of the practicalities of archaeological excavation and
its aftermath. The second author is an archaeologist who has used statistical
methodology in her work since her university days. We were working on separate
projects with the finishing line in sight and vaguely wondering what to do next.
We’ve published jointly since the 1990s, latterly, when it comes to statistics,
including graphical presentation using the open-source software R about which
more is said in Chapter 2. Some of the simpler methodology that we’ve used – graphical presentation of the kind that is the staple of archaeological publications that touch on statistics at all – should be straightforward. Yet the archaeological literature is littered with statistical graphics that should have been
strangled at birth or, in some cases, not conceived at all.
We resort to Excel from time to time and take the trouble to customise the
graphs. We thought that a text that discussed how to produce the most common
types of graph that appear in archaeological publications, with detailed instructions on how to construct them in R, might be useful. The immediate thought at
the time was that it would enable the second author to produce her own graphs
in R without having to trouble the first author.
This would be a solipsistic reason for inflicting a text on an unsuspecting world,
though this has never stopped anyone. There is a more serious purpose – of the kind
that occurs to you on your second glass of wine (or beer or cup of tea or whatever).
You might also conceive of our pub conversation as a ‘conception’ arising, in a less
precisely datable fashion, from mutual long-standing dissatisfaction with some
of the statistical graphics to be found in the archaeological literature.
With apologies for the bad language used, this is best conveyed by some of
the kinds of things we’ve said. ‘If I see another bl**dy three-dimensional exploded pie-chart I’ll scream’. ‘The editor(s) who allowed this monstrous, three-dimensional, optically disturbing bar chart to sully their publication should be
shot’. ‘Mummy, why do they use half a page to display a pie/bar chart with two
categories? Couldn’t they just cite the numbers?’. That sort of thing. It’s possible
that we might be exaggerating a bit, but possibly not and certainly not as much
as you would wish.
Somewhat over 20 years ago Kris Lockyear (Lockyear, 1994), in the appropriately named Ranter’s Corner of the now defunct Archaeological Computing
Newsletter, sounded off, entertainingly, more in anger than sorrow, about problems with graphics in archaeological publications. The article was entitled PUPs
(pretty useless pictures); the acronym PPPs (pretty pointless pictures) was also
used, as well as others coined to be equally derogatory. Re-reading it – it is partly
a product of its time in terms of references to the technology then available – a
lot of his critique is, depressingly, still applicable. Everything passes but, over the
time-scale we’re talking about, everything doesn’t necessarily change, or at least
for the better.
PPPs and PUPs still abound. And things could be done about it. Quite a lot
of the more heinous graphics to be seen in archaeological publications are produced
using Excel or software purposely designed to emulate the worst that Excel has
to offer. For many purposes Excel is quite capable of delivering decent graphs,
but it has to be controlled/customised and not everyone bothers.
The royal ‘We’ is used a lot in the text; this is supposed to be bad practice
(along with over-use of the first-person singular) in academic texts, but here it is
quite deliberate. One reason is that the text is opinionated in places, and opinions
differ (particularly among academics); we thought it best to make it quite clear
when disputable opinion was being expressed. The other reason is that we’ve been
at some pains to use real data for illustrative purposes, a lot of it based on our own
research; it seemed silly to refer to this in neutral terms as if it had nothing to do
with us; hence ‘We’.
The other decision made was to use real data and examples of what we considered to be bad ‘graphical practice’, drawn from texts and journals
lying around in our house – it proved to be distressingly easy, as is discussed in
the introductory chapter. What, for some perverse reason, struck us as amusing
was that the richest source of awful graphics proved to be the annual proceedings
of the Theoretical Roman Archaeology Conference (TRAC ), of which we have an
almost complete set. We’ve resisted the temptation to cite everything, but enough
examples are included in the text to indicate what we mean.
What amused us – and we shall make it clear that we are not criticising the
scholarship of authors singled out for attention – is that a lot of the really awful
graphical stuff was produced by authors who have passed through the hallowed
portals of our more prestigious universities – most notably Cambridge – usually
as PhD students or post-doctoral researchers, and have gone on to lectureships in other noted
universities. What’s worrying is that their supervisors obviously did nothing to
discourage the graphics used and that bad habits will perhaps be condoned in
future generations. To quote the poet Philip Larkin out of context, ‘Man hands on
misery to man/It deepens like the coastal shelf’.
If this text has a serious purpose it is to encourage archaeologists to give more
thought to the statistical graphics that they use, possibly even to have the courage
to eschew them on occasion. The emphasis is on how you might do things in R,
which isn’t as ‘comfortable’ as Excel and other menu-driven software, but allows
more control and forces you to think.
Chapter 1
Introduction
1.1 Why statistics?
Archaeology is a data-rich subject. Forget anything you might have read about
data being ‘theory-laden’ – data exist independently of theory, though ‘theory’
may inform what you choose to pay attention to and how you interpret it.
A piece of pot is a piece of pot; you can measure its dimensions, classify it by
fabric, determine its chemical composition, establish its petrography and so on.
With luck you may be able to assign a function or date to it, albeit with a degree
of uncertainty. If it’s recovered from an excavation you know where it comes from;
that is, you know what the spatial, and possibly temporal, relationship is to other
artefacts of both the same and different types. If you’re dealing with an assemblage,
of the kind generated by a serious excavation, you can count occurrences of types,
classify them by context and date, and make comparisons with other assemblages.
All this generates numbers and you want to do something with them, or ought
to. Stratigraphy, one of the bedrocks of practical archaeology, is inherently numerical. You try to place contexts/buildings etc., sometimes with only partial success,
into an ordered (numerical) sequence that reflects their temporal relationship.
It is, we suppose, possible to live in the realm of ‘pure theory’ (a theoretical
womb perhaps, from which you don’t emerge into reality). No PhD dissertation
is complete without an obligatory chapter on theory, before getting to the real
business on the basis of which the PhD is awarded. Aspirants to university lectureships in archaeology may possibly find that competence in ‘theory’ is one of
the more important things needed for the job.
We’ve nothing against theory. Some of our best friends do theory. The second
author has been accused of doing theory, albeit in an unobtrusive way; and the
first author of possibly understanding some of it. But . . .
One view of what theory sometimes tries to do is that it involves an interpretive synthesis of large amounts of information from excavated archaeological
sites, distinguished both spatially and temporally. This isn’t the whole story. But
archaeological excavation is at the heart of much of archaeology as an academic
discipline. This is, perhaps, too blindingly obvious to say explicitly. And you
need to interpret the information to be gleaned from any individual site. And
this involves finding things, cataloguing them, counting them, classifying them,
measuring them, interpreting them and so on. And then you might think it useful
to do something with the numbers thus generated. And then – dreaded word –
statistics come into play. Which is what this text is about – more specifically,
graphical presentation.
1.2 Why R?
Excluding publications in archaeological science journals and the like, the statistical methods that appear in what we shall call ‘straight’ archaeological publications,
where they do appear at all, tend to be at the more basic but important end of the
scale. Apparently ‘simple’ graphics such as pie charts, barplots, histograms, scatterplots and so on. It’s this kind of thing on which the text largely concentrates,
though we have strayed from our original self-imposed remit.
The motivations for writing the text are discussed in the preface. The original thought was to provide a detailed ‘how to’ guide about producing the commoner kind of graphics in the open-source software R. The software is discussed
in Chapter 2. A lot of what you see in the literature is obviously produced using
menu-driven software, quite often Excel; R does need to be learned, but its
reputation for being difficult to learn is, we think, exaggerated.
Part of the thinking is that R offers more control over the graphics than other
software does. You’re also forced to think about presentational issues and whether
you need the graphic at all. Excel has a lot to answer for, but it’s probably down
to the fact that users don’t exercise their right to customise graphs and/or are
seriously lacking in taste. Even if the merits of R don’t attract you, think several
times before choosing a ‘default’ option in some conveniently available menu-driven
software package.
1.3 The need for graphics, or not
What follows in later chapters is sometimes opinionated. Ironically, given the
emphasis on graphical presentation, one message of the text is that what you see
is often unnecessary and, literally, a waste of space. That a ‘picture is worth a
thousand words’ is a cliché. Like all good clichés it contains an element of truth
but discretion in the use of graphics should be exercised.¹ It’s criminal to waste
half a page of printed text on a pie chart or barplot with just one category (this,
unfortunately, is not a joke) or even two or three categories. Many experts in data
visualization (which we’re not) would like to dispense with pie charts entirely.
They are also very rude about pseudo-three-dimensional pie charts, barplots and
histograms, which should be banned from academic publications. This is illustrated
with unfortunately real examples in Chapters 9 and 10.

¹ If you believe Wikipedia the origins go back to American journalism and advertising from
the 1910s, which might sow an element of doubt in your mind. ‘A picture paints a thousand
words’ might be a more familiar rendition and seems to stem from much the same period, possibly
‘popularised’, if that is the right term, in a song by the ‘soft-rock’ band Bread in the 1970s called
‘If’, not to be confused with Rudyard Kipling’s poem with the same title.
A lot of what’s seen in the literature, illustrated graphically, could be explained
in a simple sentence or two, or conveyed more effectively by commentary on a small
table. You need to trust the intended audience to be capable of comprehending
things like ‘of the 1000 artefacts studied the overwhelming majority (99.5%) were of
type X’. We have real examples in mind here, some to be noted in later chapters.
A specific instance comes from Wainwright’s (1984: 8) presidential address to
‘The Prehistoric Society’. It shows a pie chart of the regional distribution of
scheduled monuments in England, by period. It is stated that the chart shows
49% of prehistoric date, 43% medieval, 7% Roman and 3% post-medieval. Surely
this is comprehensible without a chart that occupies half a page; if you really must
have a graphic a barplot would have been better. This example could readily be
multiplied.
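To show how little effort the better alternative takes, here is a minimal base R sketch of a barplot for the percentages as quoted above (the object name `percent` is ours, and the figures are simply those cited from Wainwright):

```r
# Percentages of scheduled monuments in England by period, as quoted
# from Wainwright (1984); a barplot conveys these more honestly than
# a half-page pie chart
percent <- c(49, 43, 7, 3)
names(percent) <- c("Prehistoric", "Medieval", "Roman", "Post-medieval")
barplot(percent, ylab = "% of scheduled monuments", las = 1)
```

Whether even this is needed, rather than simply citing the four numbers, is the real question.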
If a graphic is needed, and it often helps, give thought to it. Considerable
thought goes into the choice of artefacts for illustration, and how plans and sections
will be presented. Graphics summarising data deserve just as much care.
We, the authors, are on dangerous territory here since the disclaimer that we
have no special expertise in data visualization has already been issued. There are
plenty of experts around, not all of whom necessarily agree about everything. Our
position, and we exaggerate a little, is along the lines that ‘we don’t know much
about art but we know what we don’t like’. What this means will become apparent
later.
You may identify problems with some of the graphics presented here; the ‘syntax’, inappropriate choice of colour and so on. It’s the first thing we’ll notice when
it’s made public. However, the idea is to show how things might be done using
R, including what we think should be avoided, and it’s up to the user to make
stylistic/aesthetic decisions about exact presentation, which may depend on the
medium in which they wish to publish.
What’s being discussed here is graphics with an intended audience. But you
can undertake graphical data analysis for your own edification, to understand your
data, without immediate thoughts of publication. That is, there is a distinction
to be made between what we shall call ‘exploratory’ and ‘presentational’ graphical
analysis. Unwin’s (2015) recent and excellent book on the subject, Graphical Data
Analysis with R, emphasises the former. Here we have both aspects in mind and
this has informed the way the text is structured.
There are different graphics systems available in R. If a quick preliminary look
at the data is all that is needed we often choose to use base R; it’s fine for presentational purposes as well but, increasingly, the package of choice for presentation
seems to be ggplot2. This is discussed in more detail in Chapter 2. The latter
system is geared towards colour, though this can be done in base R as well. We
sense a growing consensus that the future of academic publication is that it will
be online so colour will be no problem; but it hasn’t happened yet. For black-and-white graphics we have sometimes found it more convenient to use base R; you can
customise ggplot2 to do the same, but it can require some effort. For some kinds
of graphic, though not necessarily those most commonly seen in the archaeological
literature, ggplot2 is undoubtedly the best choice.
The strategy adopted in much of what follows is to present the minimal code in
both base R and ggplot2 necessary to see what things look like, then give detailed
code that illustrates the presentational options available. Our own experience in
learning and using R is that (a) it helps to be motivated by a data set of your own
that you’re keen to interrogate, (b) imitate existing code that seems to do what
you want to do and, if you can’t, (c) search online or read a book or two. Hence
the detailed code provided here, but it’s the motivation that counts.
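To give a flavour of what ‘minimal code’ means here, and using a hypothetical data frame `pins` with a numeric `Length` variable (the values below are invented purely for illustration; in practice the data would be imported as described in Section 2.2.2), the bare-bones histogram in each system is a single line:

```r
library(ggplot2)

# invented illustrative data, standing in for an imported data set
pins <- data.frame(Length = c(61, 74, 80, 92, 105, 118, 120, 133))

# minimal base R histogram
hist(pins$Length)

# minimal ggplot2 equivalent
ggplot(pins, aes(x = Length)) + geom_histogram(bins = 5)
```

The ‘detailed’ code in later chapters builds on skeletons like these by adding labels, limits, colours and so on.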
1.4 How this book was written
The first author drafted the text while the second author was engaged in completing a long-term project. When this was complete she read through the text,
correcting, adding and deleting as appropriate. Most importantly she checked all
the R code thoroughly to ensure that it worked (not always) and whether she
understood it (not always). This led to major revisions in places.
Apart from rewriting some code, to make it easier to understand, we have devoted more space to explicitly advocating the routine use of user-defined functions.
This topic now has a chapter of its own (Chapter 5), following a first introduction
to code for graphical analysis – histograms – in Chapter 4. The basic idea is that
you write a body of code to produce a graph but, for reasons discussed in Chapter 5, it is often more convenient to embed it in a function. The graph is then
produced only when the function is executed.
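The idea can be sketched in a few lines. Here `hairpin.hist` is a hypothetical user-defined function, and the data are invented for illustration; Chapter 5 gives the full treatment:

```r
# a body of plotting code embedded in a user-defined function;
# nothing is drawn when the function is defined
hairpin.hist <- function(data) {
  hist(data$Length, main = "", xlab = "Length (mm)")
}

pins <- data.frame(Length = c(61, 74, 80, 92, 105, 118))  # invented values

hairpin.hist(pins)  # the graph is produced only when this is executed
```

Editing and re-running a single function call is far more convenient than re-pasting a long body of code each time a plot needs tweaking.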
In the original draft the bodies of code were presented, but not explicitly within
a user-defined function. We urged the use of such functions on the reader, but
obviously not emphatically enough. This, we hope, is now remedied. We have now
explicitly presented some of the code embedded within functions so, in addition to
Chapter 5, readers can see what is required. Some code is still not shown within
a function, but readers who wish to emulate our analyses and code are urged to
incorporate such code within a function. See Section 1.6 for additional comment
on this.
Another source of confusion, when the second author read the original draft,
was the appropriate way to prepare the data for analysis. The issue is that we have
presented code for analysis in both base R and ggplot2, and for the same data
they may (depending on the analysis) require the data to be differently formatted.
This can also differ from the form most natural for publication.
R has some powerful data manipulation facilities so that you can convert from
one format to another within R, but this is not necessarily for beginners. Accordingly we have recommended that data be prepared in an appropriate format
outside of R, in the format needed for any particular analysis, before importing it
into R. This is most explicit in Section 10.3, where the same data are presented in
three formats, one for publication, one for base R and one for ggplot2.
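To make the formatting issue concrete before Section 10.3, here is a small two-way table (all names and counts invented) in the two forms in question: a ‘wide’ matrix of counts of the kind base R’s `barplot` expects, and the ‘long’ one-row-per-combination layout that `ggplot2` prefers:

```r
library(ggplot2)

# 'wide' format: a matrix of counts, suitable for base R's barplot()
wide <- matrix(c(10, 4, 6, 8), nrow = 2,
               dimnames = list(Phase = c("Early", "Late"),
                               Type  = c("A", "B")))
barplot(t(wide), beside = TRUE, legend.text = TRUE)

# 'long' format: one row per Phase/Type combination, for ggplot2
long <- data.frame(Phase = c("Early", "Early", "Late", "Late"),
                   Type  = c("A", "B", "A", "B"),
                   Count = c(10, 6, 4, 8))
ggplot(long, aes(x = Type, y = Count, fill = Phase)) +
  geom_col(position = "dodge")
```

Preparing each version in a spreadsheet before import, as recommended above, avoids any within-R reshaping.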
To facilitate the use of these notes, either to reproduce our own analyses or
use them as the basis for your own data, we are making available an Excel file
containing all the data used, including some data sets too large to publish in the
text, and a text file containing all the code. More details are provided in Section 1.6.
It should be possible to copy-and-paste these into R. We have tried to make it
clearer in the text than was originally the case, at the risk of some repetition,
what format the data needs to be in. An obvious point, but one that led to some
confusion when the second author reviewed the code, is that the name you give
your data file in R needs to match that in the function constructed to analyse it!
1.5 Structure of the text
Chapter 2 necessarily provides enough information to get started with R and about
things we’ve found useful or which recur in the text; for those who like detailed
and systematic treatments references are provided. Chapter 3 discusses descriptive
statistics, their uses, and how to obtain them in R. This is not the main focus of the
text but it helps to understand some of the graphics to follow. Chapters 4, 9 and 10
deal with histograms, pie charts and barplots respectively – the kind of statistical
graphics that, collectively, dominate the ‘straight’ archaeological literature where
statistics is used at all. Chapter 6 deals with kernel density estimates; these
aren’t exactly mainstream yet, but are becoming so in some specialised areas
of application and can be an appealing alternative to the histogram for some
purposes. Chapter 7 deals with boxplots and ‘variants’ of them; boxplots are now
widely used, the variants – ‘violin plots’ and ‘beanplots’ – are hardly visible in
the archaeological literature but address some of the problems you may have with
boxplots. Chapter 8 deals with scatterplots, another common graphic to be seen.
Chapter 11 illustrates the construction and use of ternary diagrams. This is
something of a ‘niche’ subject – several niches in fact – but they are widely used.
Chapter 12, on correspondence analysis, might be regarded as an indulgence that
strays beyond our original remit, which was to confine ourselves to the simpler
methods of univariate and bivariate data display.
Chapters 4, and 6 to 8, deal with methods mostly appropriate for continuous
data, things like height and weight. The chapters on pie charts and barplots deal
with categorical data where, for example, artefacts are classified by type. If two
classifications are used, such as type and context for example, they can be cross-tabulated, to see how the distribution of types changes across contexts. If the
number of categories within each classification ‘system’ is small barplots can be
a good way of looking at the data; if the number of categories is at all ‘large’
correspondence analysis is an attractive alternative for looking at the tabulated
data, widely used, and – we think – readily explained and easy to implement;
hence its inclusion.
Apart from providing detailed R code we have made an effort to illustrate everything using real, mostly published, data that derives either from our own work
or from that to be found in the books and journals on our bookshelves. The archaeological context, including conclusions to be drawn from an analysis, is usually
summarised, sometimes in detail. There is something of a bias towards Roman
material, reflecting our own interests and the content of papers, other than our
own, that we have drawn on. The methodology is general, and independent of
the specific context. With very few exceptions the data are drawn from what
we’ve called ‘straight’ archaeological publications as opposed to archaeological science publications.² For the latter, parts of Baxter (2016) cover some of the same
ground as here but in much less detail, mostly without R code, and with a specific
focus on multivariate methods and archaeometry.
We made the decision, after serious thought, to reproduce examples of what
we consider to be bad graphical practice, with critical comment. This involves
citing specific papers. It was distressingly easy to find these; less than an hour
rummaging around our bookshelves and the detritus on our floors located most of
what is presented. We’d like to make it absolutely clear that we do not, in any
way, wish to impugn the archaeological scholarship exhibited in the articles singled
out for this sort of treatment. But some of the graphics make you want to weep.
The text attempts to explain what we think some of the problems are and what
might be done about it.
We anticipate that readers familiar and comfortable with the software they use will not be much tempted to move to R. It's not the only software package that can produce decent graphics, though we would discourage the use of Excel. We hope, however, that the examples and discussion in the text will encourage readers to give more thought to graphical presentation than is sometimes evident, possibly to the extent of recognising that a space-filling graphic based on simple data with a clear message is not always necessary.
1.6 Additional resources – data sets and R code

For those interested in pursuing R some additional resources relating to the data sets and code used have been made available, along with this text, on our academia.edu web pages, and on the Barbican Research Associates website.3

2 The distinction is made with some unease but we assume readers will know what we intend.
3 http://nottinghamtrent.academia.edu/MikeBaxter
  http://independent.academia.edu/HilaryCool
  http://www.barbicanra.co.uk/resources.html
Data sets
All the data sets analysed for which we provide code are contained in worksheets in an Excel file that can be accessed on the websites cited above, Data Sets for 'Basic Statistical Graphics for Archaeology with R: Life Beyond Excel'. To use these in the way that we envisage, using the code that is also available, the different data sets need to be imported into R, as described in Section 2.2.2, with identical names to those provided for the worksheets. Because of their size not all of the data are given in full in this text.
R code
A Word file, Figures for ‘Basic Statistical Graphics for Archaeology with R: Life
Beyond Excel’ - R code is also available on the websites noted. Provided the data
sets have been successfully imported it should be possible to copy and paste the
complete chunk of code for each figure into R to get the figure in the text.
In the text we have urged the use of user-defined functions for all but the simplest and shortest bodies of code (Chapter 5). Not all the code in the text itself
has been presented in this way, and it is sometimes incomplete with presentational
arguments left implicit. The initial thought was to encourage users to complete
the code, if not already present, and then embed it in a function. In the event,
based on the experience of the second author in testing and trying to understand
the code, it was thought better to make available the complete code used, already
embedded in a user-defined function.
The idea is that this will provide an entrée into R for those unfamiliar with it.
The intention is to show what can be done, and how, before coming to grips with
analysis of your own data. To emulate the analyses illustrated, the minimum required would be to change the name of the data set used to match your own, modify labels, and possibly play around with presentational aspects. The excitement begins when you want to tweak analyses and presentational options not covered in this text. If it isn't obvious from the R help facilities what to do there's plenty of online help, and books, available for guidance. It's quite easy to want to do something that seems quite simple, but turns out to be more challenging than you might expect! And with most menu-driven software you don't have the same facilities, and advice, available.
Chapter 2
R – an introduction
2.1
Introduction
R is a powerful open-source software system, now the package of choice of many
applied statisticians, and others, who engage in data analysis. It can be thought
of as the ‘son of S’, a language developed in the 1980s that, in the sub-title of
the text that introduced it, was described as a ‘programming environment for
data analysis and graphics’ (Becker et al. 1988). The classic texts of Venables
and Ripley, Modern Applied Statistics with S dealt, as the title implies, with a
wide range of applied statistical methods using S. The last and fourth edition of
the text (Venables and Ripley, 2002) encompassed R, which was developed in the
1990s, and remains a valuable source of reference we consult often. It is aimed,
though, primarily at statisticians.
R has the merits of being both an exceptional tool for data analysis and being
free. Since the early 2000s it has grown into something of an ‘industry standard’.
There are numerous texts on its use pitched at all levels, directed at different specialisations, some concentrating on the more technical and computational aspects
of the language.
Although the situation is changing R has yet to be widely used in some areas
of archaeological data analysis and its adoption has been patchy. Baxter (2015b)
is an online book-length introduction to the use of R for quantitative data analysis
in archaeology that touches on some of the material discussed in this text. It
does not pretend to be a systematic treatment of R but includes a lot of data
analysis with associated code. Baxter (2016) is a more specialised treatment of
multivariate methods in archaeometric data analysis but includes a discussion of
simpler methods of graphical analysis with some code. David Carlson’s useful
website1 has a more systematic introduction, with a focus on how to use R to
implement the methodologies in the textbooks of Shennan (1997) and Drennan
(2009). A book is promised.
It’s far easier to find applications in publications such as the Journal of Archae1
http://people.tamu.edu/ dcarlson/ (accessed 10/10/2016)
8
ological Science than it is in ‘straight’ archaeological journals such as American
Antiquity, Antiquity, Antiquaries Journal, The Archaeological Journal, Britannia,
Journal of Roman Archaeology, Medieval Archaeology, Proceedings of the Prehistoric Society etc., to name only some of those on our bookshelves or which we
consult regularly. If there is a default in this latter group for data analysis it’s
probably Excel. If you customise things properly you can get perfectly acceptable
graphics with Excel but lots of authors don't. There are plenty of other examples around, using other software whose origins are often obscure, that produce far worse graphics than anything that can be achieved in Excel.
The presumed reason for this is the accessibility (in the sense that it's free or what's available in the environment you work in) of the software used. Since R is free and designed, among other things, to produce decent graphics, by people who know what a 'decent' graphic is, it's an obvious and better alternative to what's often used.
A frequently cited reason for the reluctance of some potential users to engage
with R is that it is command-driven, rather than the menu-driven software many
users are reared on. That is, it’s not ‘user-friendly’. This may be a particular
problem for statistical software where the subject matter can be seen as challenging
before you even engage with implementation. Our attitude is that if you can write
a sentence, with due attention to syntax and capitalisation, you can do it. That
said, in the first instance, it can be helpful to emulate code produced by others. There is plenty of it around – it's a good way of learning, and the reason detailed code is supplied here.
The attractions of menu-driven software are obvious. You can produce a
barplot (possibly quite an awful one) in Excel almost instantaneously. It takes
more effort in R to produce a plot fit for publication than you might wish. But you
can exercise more control over its appearance which forces you to think more about
it, possibly even concluding that it’s unnecessary (as many published barplots are).
In fact, because of the perceived problem, more user-friendly interfaces with R
exist. David Carlson’s website exploits Rcmdr, a graphical user interface. RStudio2
is increasingly recommended and there are other options (see, for example, Baddeley et al. 2016, 20). Our experience with R predates the development of RStudio
by some years and we haven’t used it much but we’ve seen it suggested that this
is where many people new to R might wish to start from. Ben Marwick’s CRAN
Task View: Archaeological Science usefully summarises a wide range of R resources
potentially useful for archaeological data analysis well beyond what is covered in
this text3 .
Although the first author has used R for about 15 years, and S before that, we
don’t claim any particular expertise in the subject compared to what we recognise
as 'expert', and this isn't false modesty. We tend to work with quite 'small' data sets, using code which works for us ('does the job' if you like) and which we
2 https://www.rstudio.com/ (accessed 10/10/2016)
3 https://github.com/benmarwick/ctv-archaeology (accessed 10/10/2016)
hopefully understand. That is, by the highest professional standards, some of
what follows may not be ‘efficient’ or too readily generalised – anyone with an
aptitude for programming can probably improve on some of the code. R develops
and is updated fairly rapidly, so sometimes you don’t pick up on changes, or exploit
them, as quickly as might be hoped. As discussed in more detail later, ggplot2 is a powerful graphics package developed for use with R that has now been around for some time (Wickham 2009; a second edition, Wickham 2016, has recently appeared) and has an enthusiastic following. We've only started using it regularly fairly recently, so some of what's to come has involved a learning experience on our part.
R is available on several different platforms; our experience is entirely with
Windows; if you use Linux or Macs some of what follows may not work as described
but there will be plenty of online advice. The next section outlines the essentials
needed to get started with R, and things we think it might be useful to know at the
outset. The treatment is informal; attention will be drawn to more thorough and
‘respectable’ treatments of topics covered which the reader may wish to consult if
they get interested in R.
2.2 R basics

2.2.1 R – getting started
Search online for CRAN R (the Comprehensive R Archive Network) or use
http://cran.r-project.org/
to get started. You are directed to pages that tell you how to install R on various
platforms, and more information, if needed, is provided in the FAQs. R is updated
frequently. Versions 3.1.2 to 3.3.1 were used for the analyses in the text.
Material on CRAN includes Venables et al. (2016),4 a roughly 100-page introduction to the language, though there is little in the way of data analysis.
A comprehensive list of over 150 books on the subject is available5 as well as
free material6 . David Carlson’s website, noted earlier, includes a ‘blow-by-blow’
account of the steps involved in installation.
2.2.2 Data import
Once R is installed, the next thing is to get some data into it and do something with it. You're faced with a prompt > and – the way things are done in this text – something needs to be typed after it to make things happen.
Small amounts of data can be entered directly. Suppose you have a set of
numbers (1, 2, 3, 4). Type
4 https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf (accessed 10/10/2016)
5 https://www.r-project.org/doc/bib/R-books.html (accessed 10/10/2016)
6 https://cran.r-project.org/other-docs.html (accessed 10/10/2016)
x <- c(1,2,3,4)
then mean(x) will return the (arithmetic) mean of the data (Section 3.3). You
could also do something like m <- mean(x) then type m; this stores the value in m
for any later use.
The object x is a vector and the term variable will also be used. It consists
entirely of numbers and is an example of a numeric variable. This and other
terminology to be introduced will recur later in the text.
More usually you will have a (possibly large) pre-prepared data set available
for study and it needs to be imported. There is extensive documentation on data
import/export7 which, depending on how your data are stored and formatted, you
may need to consult. Given that we do not usually deal with extremely large data
sets, and they are usually in an Excel spreadsheet or readily converted from a
database into that form, we usually do something that is both simple and quick,
but frowned upon by aficionados. That’s just to copy the spreadsheet, or the part
of it you want to import, go into R, and type something like
mydata <- read.table(file = "clipboard", header = T)
This can be done in different ways. Using read.table and providing a path to a comma-delimited file that contains the data (in place of "clipboard") is often recommended. However it is done, the result is what's called a data frame (or dataframe), discussed in more detail in Section 2.2.3.
It is assumed here that headers naming the columns are given, with the caveat
to follow. Spaces in headers must be avoided (and if other illegal characters are
used an error message in R will inform you). Missing data should be coded as NA
and there should be no blank entries in the data file. A rectangular table of data
is expected.
You have, in fact, two options. If all the columns have headers the row names
are assigned as 1, 2, 3, . . .; if the first column doesn’t have a header (i.e. the
top-left corner entry in the spreadsheet is blank) the entries in it are assigned as
row names. Which is to be preferred depends on how exactly you want to analyse
the data and is illustrated in context in later chapters.
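As a self-contained sketch of the two options (the file finds.csv and its contents are invented here for illustration; a real analysis would use your own file):

```r
# Write a tiny comma-delimited file purely for illustration
writeLines(c("Site,Length,Width", "A,10,4", "B,12,5"), "finds.csv")

# All columns have headers: row names are assigned as 1, 2, 3, ...
mydata <- read.table(file = "finds.csv", header = TRUE, sep = ",")

# Force the first column to supply the row names instead
mydata2 <- read.table(file = "finds.csv", header = TRUE, sep = ",",
                      row.names = 1)
rownames(mydata2)   # "A" "B"
```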
2.2.3 Data structures
Data can be ‘structured’ in various ways in R. It matters because some analyses
one might wish to undertake depend on the data having a particular structure.
What follows is a very simplified discussion, limited to what’s needed later in the
text. See some of the references cited earlier for systematic discussions.
As we’ve treated data entry above, the data are stored in an object, mydata
in the example, that is a data frame. You can query whether this is the case
using is.data.frame(mydata). It will be rectangular and consist of p columns.
7 https://cran.r-project.org/doc/manuals/r-release/R-data.html (accessed 10/10/2016).
Those that are numeric consist purely of numbers; columns containing non-numeric information, such as a context or artefact type identifier, will be treated as a factor.
Factors are important but can be a pain for a variety of reasons, even at the
basic level they are used for in this text. One reason is that, for the purposes of
data entry, the names you give them may not be those you wish to use in graphical
or tabular displays. Another is that some of the tools (functions) available for
analysis will, by default, order the factors alphabetically in a graphical display
and this may not be what you want. Sorting this kind of thing out is ‘messy’;
we discuss how to do it when the problems arise in later chapters. Chang (2012:
343–352) usefully discusses the various issues and remedies involved.
For some purposes it is useful to extract subsets of variables from a data frame.
To anticipate later material, if the data frame mydata contains a variable, Weight,
that you want to extract and it is the third column in the data frame then
Weight <- mydata$Weight or Weight <- mydata[,3]
will do it. If you want to select a subset of the data, say columns 2, 3, 4, then
newdata <- mydata[, 2:4] or newdata <- mydata[,c(2,3,4)]
does it. The latter form here is more general since it allows you to select non-contiguous columns. Rows can be similarly selected by placing the selection before the comma in [,], and you can combine row and column selections in an obvious way.
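A small invented data frame illustrates the selections just described:

```r
# Invented data for illustration
mydata <- data.frame(Site = c("A", "B", "C"), Length = c(10, 12, 9),
                     Width = c(4, 5, 3), Weight = c(20, 25, 18))

mydata[, c(2, 4)]      # non-contiguous columns: Length and Weight
mydata[1:2, ]          # first two rows, all columns
mydata[1:2, c(2, 4)]   # rows and columns combined
```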
An issue that sometimes arises is that what should be treated as a factor is
sometimes recorded numerically, for example in a stratigraphic sequence 1, 2, 3,
etc. This may need to be converted explicitly to a factor and can be important for
some uses, including that of the package ggplot2 to be discussed. If, for example,
the data frame mydata contained a variable Context, coded numerically but to be
treated as a factor, you can use
mydata$Context <- as.factor(mydata$Context)
Illustrative examples follow later.
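A minimal invented example of the conversion, with checks before and after:

```r
# Invented data: Context recorded numerically
mydata <- data.frame(Context = c(1, 2, 2, 3), Weight = c(20, 25, 18, 22))

is.factor(mydata$Context)    # FALSE: stored as numeric
mydata$Context <- as.factor(mydata$Context)
is.factor(mydata$Context)    # TRUE
levels(mydata$Context)       # "1" "2" "3"
```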
Some analyses that also follow later require the data to be recognised as a
matrix. This is a rectangular array of data consisting solely of numerical values.
Visually a data frame can look like this but sometimes irritating error messages,
to the effect that a matrix or some other structure is needed, will be provided. If
the contents of mydata are numerical and this sort of thing happens use
newdata <- as.matrix(mydata)
If the data frame consists of both numeric variables and factors, extract the numeric columns needed into newdata. Use is.matrix(newdata) to check that it has the structure you want; if not, convert it using as.matrix.
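A short invented illustration of these checks and conversions:

```r
# Invented data frame mixing a factor with numeric variables
mydata <- data.frame(Type = factor(c("A", "B")),
                     Length = c(10, 12), Width = c(4, 5))

newdata <- mydata[, c("Length", "Width")]   # extract the numeric columns
is.matrix(newdata)                          # FALSE: still a data frame
newdata <- as.matrix(newdata)
is.matrix(newdata)                          # TRUE
```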
2.3 Functions and packages

2.3.1 Functions
Once you’ve entered some data you do something with it by calling a function to
which you supply arguments. If, for example, you have a variable x that you want
to produce a histogram of (Chapter 4) then hist(x) will do it. This is using the
function hist. If you wanted to save the outcome of a call for future use myhist
<- hist(x) will do it. As it stands this will also plot the histogram; if you wanted
to suppress the plot use the argument plot = FALSE thus:8
myhist <- hist(x, plot = FALSE)
To get help on a function type ?name (e.g., ?hist). This will provide a list
of the required and optional arguments, their defaults and – if you are lucky –
illustrative examples that you can learn from by emulating with your own data.
If you are not so lucky plenty of online advice, not all of it sound, is available.
For the purposes of this text the arguments supplied to a function can often
be split into two groups, the ‘business’ and ‘presentational’ parts. We think of
the ‘business part’ as the minimum needed to get something suitable for your own
personal inspection. This can be both very minimal and quick. If two variables,
x and y are available, say weight and height, and you want to see how they are
related, plot(x, y) may be sufficient for your purposes. If the graph is deemed
to be worthy of public exposure attention needs to be given to its appearance.
plot(x, y, pch = 16, col = "red", xlab = "Weight (g)", ylab = "Height (cm)",
     main = "A graph of height against weight",
     cex = 1.2, cex.axis = 1.2, cex.lab = 1.3, cex.main = 1.4)
will endow the graph with axis labels, a title (main), and plotting characters (pch)
that are solid circles coloured red (col). The character expansion (cex) arguments
control the sizes of the points, axis text, axis labels and title; the appropriate choice
will depend on the context in which the graph is to be presented.
Plots may be further embellished by using additional functions, placed after
the initial command, that add lines, extra points or text and so on. Later chapters
provide examples.
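As a hedged sketch of the idea (the data here are invented; later chapters show real examples):

```r
# Invented data for illustration
x <- c(2, 4, 6, 8, 10)
y <- c(21, 35, 52, 79, 96)

plot(x, y, pch = 16, col = "red", xlab = "Weight (g)", ylab = "Height (cm)")
abline(lm(y ~ x), lty = 2)                        # add a fitted line
text(4, 90, "fitted line")                        # add text at (4, 90)
legend("bottomright", legend = "sample", pch = 16, col = "red")
```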
At the lower, but important, levels of statistical endeavour that are mostly the
subject of this text, preparing a graphic that is suitable for presentation is often far
more time-consuming than the initial data exploration. This can be off-putting
until you get used to it, but has the merit of allowing far more control than is sometimes exercised in graphs published in the archaeological literature.
It’s also possible to write your own functions, which ultimately can save a lot
of time if you are undertaking repetitive analyses or writing lengthy code. This
8 This can be abbreviated, as in similar functions, to plot = F; plot = TRUE is the default and TRUE can be abbreviated to T.
will be discussed in context at greater length in Chapter 5 once we get round to
some real data analysis. It is strongly recommended that you read this and get
into the habit of embedding all but the simplest code in a user-defined function.
It makes it a lot easier to correct errors and add code, without having to retype
code at the terminal.
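As a foretaste of Chapter 5, a trivial user-defined function might look like this (the function name and defaults are our own invention for illustration):

```r
# Hypothetical illustration: wrap a customised plot in a function so it can
# be edited and re-run without retyping everything at the terminal
myplot <- function(x, y, main = "Height against weight") {
  plot(x, y, pch = 16, col = "red",
       xlab = "Weight (g)", ylab = "Height (cm)", main = main)
}

myplot(c(2, 4, 6), c(20, 40, 55))   # call it like any other function
```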
2.3.2 Packages – general
Packages are collections of functions and, commonly, data sets. David Carlson’s
archdata package, for example, consists of data sets that have been used in the
archaeological literature though, in other packages, the emphasis is usually on
functions. Packages exist that consist of a single function.
Download R and you have immediate access to several packages that will be
referred to as ‘base R’. Other packages are also bundled with R but need to be
loaded before you use them. For example, if the splines package is needed the
following may be used (comments follow #).
library(splines)          # loads splines
library(help = splines)   # lists available functions
There are a vast number – over 6,000 and counting – of user-contributed packages that need to be installed before they can be loaded. There’s more than one
way of doing this; what’s described is what we typically do on our PCs. In R, from
the Packages menu, select Set CRAN mirror to choose a site to import from then,
from the same menu, select Install package(s) and select the package you need.
This can then be loaded whenever you want it; you should only need to install it
once.
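Equivalently, and assuming an internet connection, installation and loading can be done from the R prompt itself; a sketch using the archdata package mentioned above:

```r
install.packages("archdata")   # install once, from a CRAN mirror
library(archdata)              # load in each session where it is needed
```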
2.3.3 Packages – graphics
We’re used to base R and have found it more than adequate for producing graphs
fit for publication for many years. A variety of graphics packages for R that extend
its functionality, or are intended to improve on it or make things easier for the
user, exist. Those usually cited are the grid system developed by Murrell (2005,
2011), lattice (Sarkar 2008) and ggplot2 (Wickham 2016). The later packages
draw on elements of the earlier ones, as well as base R itself.
The ggplot2 package has something of a cult following. It is increasingly the
package of choice in texts that place some emphasis on graphics in R (e.g., Chang
2012; Unwin 2015). We had to make a choice about what graphics system(s)
we’d favour in producing this text. In the event we’ve opted – where the choice
exists – for a ‘parallel’ presentation of base R and ggplot2 graphics. There are
a number of reasons for this, of varying merit. One is that we were familiar,
and comfortable, with base R which, as already noted, is more than adequate for
producing publication-quality graphs, particularly if they have to be in black-and-white. For the purpose of writing this text we had to 'learn' how to use ggplot2
– it is quite clearly superior to base R for some (but not all) purposes – which
managed to be both enjoyable and sometimes frustrating at the same time.
These are the less meritorious reasons for choosing the form of presentation
adopted. There are other reasons to be mentioned, but first a little more detail
about ggplot2 is provided. In the words of its creator, Hadley Wickham, ‘it is
unlike most other graphics packages because it has a deep underlying grammar’
(Wickham 2016: 3). The ‘grammar’ is derived from a text, The Grammar of
Graphics, of Wilkinson (2005). We must admit to not having read this but our
(possibly simple-minded) understanding is that a graph is broken down into its
constituent elements (the different ‘atoms’ of which it is composed if you wish)
and ggplot2 is intended to allow complete control over the different elements.
If you stray from menu-driven software you have to learn the language to use. Although Wickham (2016: 3) claims ggplot2 is 'easy to learn', we don't entirely agree. For those with an interest in, and aptitude for, things computational learning it is presumably not a problem; for the innocent end-user not immersed in such things who simply wants to produce a nice graph it can be a barrier.9
The package is also quite prescriptive in some ways, has defaults not all of
which you may like, and doesn’t always easily allow you to do what you want.
You can override the defaults, though it can take time to work out how to do this.
Sometimes, with ingenuity, it’s possible to coerce it into doing what you want if
your vision doesn’t accord with the way ggplot2 is designed.
As far as ‘prescription’ goes the data must usually be in the form of a data
frame; if it is not naturally in this format and you need to bind different data
sets/columns/rows of data you need to ensure that what emerges is a data frame.
For some purposes the ‘natural’ data structure comes in the form of an n × p
data frame, possibly with identifying information such as ‘type’, ‘origin’, etc., but
this may need to be converted to ‘long’ format. That is, the numeric data are
stacked to form a column vector, with identifying information similarly treated.
The reshape2 package provides functions to facilitate this; Chang (2012, Chapter
15) usefully covers this, as well as other aspects of data manipulation, including the
construction and modification of data frames. The approach adopted in this text
is to suggest that the data be prepared in the format you want to use outside of R.
This may differ according to the graphics system you use and the kind of graph it’s
intended to produce, and data manipulation within R is sometimes unavoidable.
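An invented sketch of the 'wide' to 'long' conversion using the melt function from reshape2 (assuming the package is installed):

```r
library(reshape2)

# Invented data: two measurements per artefact type
wide <- data.frame(Type = c("A", "B"), Length = c(10, 12), Width = c(4, 5))

# Stack the numeric columns into one, keeping Type as an identifier
long <- melt(wide, id.vars = "Type",
             variable.name = "Measure", value.name = "Value")
long   # four rows: one per Type/Measure combination
```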
There is a difference between what you might want to do for the purposes of
initial data exploration and what you might want to do for publication. For the
first purpose base R can be quicker; once you get used to it ggplot2 can be quick
as well, especially if you have a ‘template’ to accomplish repetitive tasks, but it
typically involves a little more coding effort. This is discussed in more detail in
the next subsection.
9 There are plenty of discussion forums on the Web that are a valuable resource if you want advice about how to do something specific. A problem, though, is that some of the people who populate these seem more interested in demonstrating their own 'computational nous' than providing a simple answer, which often exists, to a simple question.
For some analytical purposes that involve a graphic, and with a possible eye
to publication, we think it doesn’t much matter whether base R or ggplot2 is
used. For some purposes ggplot2 is clearly superior. We will make it clear in
the text where we have views about the superiority or otherwise of either approach, but provide examples and code for both so readers can make up their own
minds. It partly depends on how the outside world is to be privileged to view
your work should it be made public. For online publication colour is now natural
(Section 2.3.5) and ggplot2 is more suited to this though you may wish to exercise control over the colours. It’s straightforward enough to produce colour in
base R. For black-and-white publication (alas, despite rumours to the contrary, this is still often unavoidable) some effort needs to be invested in making ggplot2 comply, though it's not too onerous.10
A ‘selling point’ of ggplot2 is that legend construction is relatively painless
compared to base R. We don’t entirely agree and, to judge from online traffic, are
not alone. For fairly simple graphs there is not a problem, once you work out how
to control legend placement. For more complicated graphs legends can pop up all
over the place, can look horrible, and it’s sometimes far from obvious why this is
happening and how it can be controlled. Chang (2012) provides a lot of useful
practical guidance; otherwise go online for help. Legend construction in base R
can be tedious; we don’t have any problems with it, but this, like everything, is
probably a function of familiarity.
2.3.4 Plot construction – ggplot2
For a detailed account of ggplot2 see Wickham (2016). Chang (2012) is a useful
practical guide, while Unwin (2015) is laden with analyses of real data and code
that can usefully be emulated. What follows here is more simplistic.
Later chapters provide the code for a variety of plots in both base R and
ggplot2. The former will be dealt with as we go along. In this section a more
general account of plot construction in ggplot2, as we see and use it, is provided
to ‘set the scene’ for what follows.
In the beginning you need to have installed ggplot2 and other packages such
as grid, and loaded them into R using the library function as described in Section 2.3.2. Depending on what you do you may get error messages telling you some
necessary package is missing, in which case install and load it.
For the purposes of exposition we shall suppose we have a data set, data1,
consisting of two variables, X and Y, and a factor, Z, with two categories that might,
for example, be two different contexts/periods. For a specific example, think of X
and Y as the length and width of some appropriate artefact type recovered from
two different sites.
10 As an aside, it's obvious from online traffic that some ggplot2 enthusiasts (and others) think that letterpress is, or should be, a thing of the past and that everything should be online. They may be living in the future but not in the real world as presently constituted, at least as far as archaeological academic publication is concerned.
All the ggplot2 graphs in the text begin with, as a minimum, something like
ggplot(data1, aes(x = X))
where the first argument to the ggplot function, data1, specifies the data set that
will be used. The aes argument, which is shorthand for aesthetics, takes a little getting used to but is important as it controls the appearance of the graph that
will be born from subsequent commands to be issued. In the present instance it’s
establishing that subsequent operations are to be performed on the variable X –
you don’t get a plot at this stage.
To get a plot you need to specify what you want; what follows is again minimal
but suppose a histogram is what's needed (Chapter 4):

ggplot(data1, aes(x = X)) + geom_histogram()
will do it, though the result is not especially satisfactory. See Chapter 4 for details
about how the plot can be customised by adding arguments to geom_histogram().
Note the use of + to add additional commands; as presented above a plot will
emerge. If you need to add additional commands – normally the case – just stick
a + at the end of the line.
A lot of the code in the text is presented in this kind of format. Other ways of
getting the same result are
p <- ggplot(data1, aes(x=X)) + geom_histogram()
p
which will store the code in an object, p, and only produce the output when you
type p. The same result is produced using
p <- ggplot(data1, aes(x=X))
p <- p + geom_histogram()
p
which builds up the code incrementally before printing the result.
What’s been presented so far is what we think of as the ‘business part’ which
has already been mentioned in the context of base R. In most of the examples to
follow most of the code is presentational. Since each element of a ggplot2 graph
can be controlled separately, and since some of the necessary code can be rather
‘wordy’, it can make for tedious coding. Once you’ve mastered it, though, and
developed a ‘template’, it can be copied from one analysis to the next.
Now to complicate things a little, suppose that what’s needed is a plot of X
against Y, dealt with at much greater length in Chapter 8.
p <- ggplot(data1, aes(x=X, y=Y))
p <- p + geom_point()
p
does the job. It’s not terribly exciting at this point, but what’s going on, in
general, is that you create an ‘invisible’ background framework then bring a graph
into being by adding all the things you want it to show, in this case just the points.
The fun really begins when you want to differentiate between points according to
the level of the factor, Z, they are associated with. It’s easy in principle
p <- ggplot(data1, aes(x = X, y = Y, colour=Z, shape=Z))
p <- p + geom_point()
p
maps colours and symbol shapes to the points according to the level of the factor Z.11 At this point you may decide you don't like the default colours (we don't) and
may want to change them, and possibly the symbol shapes. The next bit of code
shows how this can be done.
p <- ggplot(data1, aes(x = X, y = Y, colour=Z, shape=Z))
p <- p + geom_point()
p <- p + scale_colour_manual(values = c("blue", "red"))
p <- p + scale_shape_manual(values = c(16,17))
This is for illustration – more detail occurs later. Colour is discussed in the section
to follow and symbol shapes (and line types) are covered in Chapter 8. And so
you build up the plot, customising as you wish. The sort of code you end up with
may look like what follows.
p <- ggplot(data1, aes(x = X, y = Y, colour=Z, shape=Z))
p <- p + geom_point()
p <- p + scale_colour_manual(values = c("blue", "red"))
p <- p + scale_shape_manual(values = c(16,17))
p <- p + xlab("Suitable label") + ggtitle("Suitable title") +
theme(panel.grid.minor = element_blank()) +
theme(axis.text=element_text(size=12),
axis.title=element_text(size=14)) +
theme(plot.title = element_text(size = 16, face="bold")) +
# control position and size of legend and text
theme(legend.position="top",
legend.text=element_text(size=14),
legend.title=element_text(size=16),
legend.key.height=unit(.8, "cm"), legend.key.width=unit(.8, "cm"))
Some of what’s now to be said will be repeated, without apology, in later
chapters – a bit of verbal redundancy never did anyone any harm. By default
ggplot2 graphs come with a light grey background and major and minor grid
lines. As these things go they’re not obtrusive, but we find the minor grid lines
a bit ‘fussy’ and get rid of them with the first of the theme commands – this
is entirely a matter of taste and you can get rid of everything and construct a
black-and-white graph if you want.
11 The American spelling color will also work.
The next two theme commands just control aspects of the appearance of the
axis, axes labels and title – other options are available. The price you pay for the
control you can exercise is the ‘wordiness’ of the necessary code.
There’s more than one way of getting the kind of plot that the above code
would produce. You can, for example, specify the aesthetics within geom_point
rather than ggplot and aesthetics can be specified outside, rather than within
aes(). We’ll meet examples later.
2.3.5 Colour
Whereas, in this text at least, the choice of plotting symbols (and line type) mainly
arises in the context of scatterplots (and hence is introduced in Chapter 8), colour
is ubiquitous and discussed briefly here. It is, or can be, a complex topic and we
largely only touch on what’s needed in the text. More detailed treatments are
available in Section 6.6.2 of Wickham (2016) and Chapter 12 of Chang (2012) and
in online material related to Chang’s book12 . Earl F. Glynn’s web page is also a
valuable resource13 .
For the most part we do simple and usually specify colours of bars or points
by name. We shall only discuss this here; the reader is referred to the sources
cited for a more thorough discussion. If you type colors() or colours() a list
of 657 colours that are immediately available is obtained, numbered according to
their position in the list. Thus colour 450 is magenta. If, as a simple example,
you wanted a plot of x against y, with a solid circular plotting symbol coloured
magenta then
plot(x, y, pch = 16, col = "magenta")
does what you want. Replacing "magenta" with colours()[450], which extracts the
name by its position in the list, or using the hexadecimal code, "#FF00FF", has
the same effect. (Note that a bare number passed to col indexes the short palette()
vector rather than the full colours() list.) For the last of these
you need to know the code; Glynn’s web page is valuable in this respect since it
provides colour charts to show what the colours look like and how they contrast,
including a listing of the hexadecimal codes.
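A minimal sketch, with made-up x and y, of the equivalent ways of specifying a colour just described:

```r
x <- 1:10
y <- x^2
colours()[450]                               # the name at position 450: "magenta"
plot(x, y, pch = 16, col = "magenta")        # by name
plot(x, y, pch = 16, col = colours()[450])   # by position in the colours() list
plot(x, y, pch = 16, col = "#FF00FF")        # by hexadecimal code
```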
A variety of colour palettes are available in base R and in contributed packages.
We do not make much use of these in the text, but will draw attention to them when
we do. Chapter 12 of Chang (2012) (or the associated web page) is recommended
for those wishing to pursue this in more detail.
It’s possible to get side-tracked by colour theory – both its scientific and psychological aspects. Section 6.6.2 of Wickham (2016) and Glynn provide brief accounts.
Wickham (2016: 142) counsels against the use of ‘overwhelming’ bright colours on
bars, suggesting that subtle colours are better. Where we appear to have ignored
this advice think of it as an indication of how things are done rather than a specific
recommendation.
12 http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/ (accessed 10/10/2016)
13 http://research.stowers-institute.org/efg/R/Color/Chart/index.htm (accessed 10/10/2016)
Other general points in no particular order of importance include the following.
• It’s possible to produce transparent figures in various ways, perhaps most
simply by adding two extra numbers at the end of a hexadecimal code to
control the degree of transparency. We comment in the text when we do
this, but haven’t used it much.
• Most authors who deal with the subject point out that the colours you see
on your screen won’t necessarily be the same if you publish in other media.
The message, we think, is that if colour printing matters to you, check what it is
likely to look like before submitting anything.
• Don’t neglect grey-scale colours; these may be needed for black-and-white
publication, and can look perfectly OK regardless.
• Both Wickham (2016) and Chang (2012) (and others) draw attention to the
fact that a non-negligible number of potential readers of your text (about
10%, mostly male, the weaker of the species) will suffer some form of colour
blindness. Apparently if you submit a text for publication, with colour, and
it’s seen by three male referees there is about a 22% chance that at least one
of them will be colour blind. Chang provides a useful discussion of how to
alleviate the problem.
• Because of this last issue the juxtaposition of red and green in a plot is
frowned upon and should be avoided. Both authors discuss strategies for
dealing with colour blindness and draw attention to packages with palettes
designed to be colour-blind ‘friendly’. Another strategy, for graphs that
employ point symbols, is ‘redundancy’ in the graphical presentation: if you
differentiate plotting symbols using different colours, use different shapes
as well, in case you have readers who can’t differentiate the colours.
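A short sketch of some of these points, with made-up data (the hexadecimal alpha digits and grey names are standard base R conventions):

```r
x <- rnorm(50)
y <- rnorm(50)
# Transparency: two extra hex digits at the end control the alpha channel
plot(x, y, pch = 16, col = "#FF00FF80")   # magenta at roughly 50% opacity
# Grey-scale: "grey0" (black) through "grey100" (white)
plot(x, y, pch = 16, col = "grey40")
# Redundant coding: differentiate two groups by shape as well as colour
grp <- rep(c(1, 2), 25)
plot(x, y, pch = c(16, 17)[grp], col = c("blue", "red")[grp])
```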
2.4 Don’t panic
Advice that is good in many circumstances. Arthur C. Clarke said that
Douglas Adams’ use of it in The Hitchhiker’s Guide to the Galaxy was perhaps the
best advice that could be given to humanity.
When you begin using R from scratch, by typing in the commands, you will
almost inevitably get error messages. Quite frequently these will be typing errors; R
doesn’t like spelling mistakes and is case sensitive. Common errors that the second
author found herself making included misreading and mistyping commands that
required the underscore character, typing geom.histogram or geom-histogram
instead of the correct geom_histogram. Also, given the proximity of zero (0) and the letter o on
the standard English keyboard, she often discovered she had typed o rather than
0. The error messages attempt to tell you what is wrong, but it is always useful to
read with care what you have typed in to see if you have made similar slips of the
typing fingers. Once you get used to the R process, you will find yourself making
them less often. The second author found this process had started to kick in by
about Chapter 6 as she worked her way through the text.
Cutting and pasting the code either directly from this document (if it’s complete) or from the text file that is also provided (Section 1.6) is, of course, a way
to avoid such errors. At some point, though, you will want to start applying it
to your own data and changing defaults. At that point the error messages may
well start. Don’t despair. Troubleshooting starts with checking for typing errors.
You should also check whether you have imported the data correctly, defined the
variables correctly and so on. Checking data and variables is easily done by typing
in the name of the data set (e.g., pins) or the variable name (e.g., length). The
package will helpfully print out what it thinks the data set pins and the variable
length consist of.
The other thing that can happen is you get ‘trapped’ when constructing code.
Instead of getting the prompt sign (>) inviting you to continue you get a plus
sign (+) which tells you the current piece of code is unfinished. This can happen
because you’ve omitted a closing ), in which case type it and all will be well. If
it’s worse than this press the Esc key to escape, return to the prompt, and start
again.
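As a made-up illustration of being ‘trapped’ at the continuation prompt:

```
> mean(c(51, 62, 70)
+ )
[1] 61
```

Typing the missing ) at the + prompt completes the command; pressing Esc abandons it and returns you to the > prompt.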
The problem with making typing errors is the reason why we advocate using
functions as described in Chapter 5. Once you have edited them into a state where
you know they work (see Section 5.1.2), you will be able to use them again and
again to produce different graphs with different data sets. It becomes very easy to
create your own house style or, indeed, the house style of any publication you are
writing for.
Chapter 3
Descriptive statistics
3.1 Introduction
Descriptive or summary statistics are numbers intended to summarise some property of a set of data. The main emphasis in this text is on the graphical representation of data but an acquaintance with descriptive statistics helps to describe what
these show. The graphs themselves may also indicate which descriptive statistics
are useful summaries.
Most readers should be acquainted with much of what is presented in what follows – it is staple fare in the opening chapters of introductory statistics texts,
including those written for archaeologists such as Shennan (1997) and Drennan
(2009). The discussion will therefore be fairly brief with an emphasis on what’s
relevant for later chapters. Forward reference is made to later chapters that illustrate the discussion and some readers may wish to skim this chapter in the first
instance and refer back as needed.
Section 3.2 discusses the types of data encountered in the text; Section 3.3
gives details about the definitions of summary statistics and their use; Section 3.4
indicates how they can be calculated in R.
3.2 Types of data
Text books can go on at some length about the different kinds of data that can
arise in applications. What follows is restricted to what’s needed for later chapters.
Continuous data: This will be used for any variable that can, in principle, have
any numerical value within some plausible range. The height of a pot in
centimetres, for example, could be recorded as 10.123456 etc. In practice
there is a limit to the accuracy of measurement so this would be recorded as
(or rounded to) something like 10.12 or even 10.
Discrete data: This will be used for numerical data that can only assume a
distinct set of values, usually the integers 0, 1, 2, etc. An example might be
the number of different decorative motifs on an artefact. Where continuous
data are rounded to integers covering a reasonably small range it may be
convenient to treat them as discrete data (Section 10.4.1).
Categorical data: The frequency (counts) of observations within a set of classes
(categories) is available, such as the colour of glass beads (black, white, blue
etc.). In the instance given the ordering of the categories is arbitrary. If
it’s possible to order the categories (e.g., small, medium, large) the data
are ordinal. If data are unordered categorical it may often be sensible to
rearrange the data, by frequency of counts, for example, or any other sensible
archaeological criterion. This would not usually be advisable for ordinal data.
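In R, categorical data are usually held as factors and ordinal data as ordered factors; a small sketch with invented bead colours and sizes:

```r
# Unordered categories: the level ordering is arbitrary
colour <- factor(c("blue", "white", "blue", "black", "blue"))
table(colour)               # counts per category

# Ordinal categories: levels have a meaningful order
size <- factor(c("small", "large", "medium", "small"),
               levels = c("small", "medium", "large"), ordered = TRUE)
size[1] < size[2]           # TRUE - comparisons respect the ordering
```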
Continuous data for single variables is the main concern of Chapters 4, 6 and 7.
Chapter 8 deals with the graphics for investigating the relationship between two
continuous variables. Chapter 11 deals with the special case of ternary diagrams
which can be used, after suitable data transformation, with either continuous data
for three variables or three-category categorical data. Categorical data is dealt
with in Chapters 9, 10 and 12. There is more of an emphasis in these chapters
on dealing with crosstabulated, or crossclassified, two-way tables of data based on
two categorical variables.
3.3 Summary statistics
3.3.1 Definitions
The summary statistics to be defined are mostly applicable to continuous or discrete data. It is usual to distinguish between two kinds of statistic.
Measures of location: A measure of location is an attempt to summarise what the
‘central tendency’ or typical value for a data set is. The ones most commonly
encountered are the mean, median and mode.
Measures of dispersion: A measure of dispersion attempts to summarise the
spread of a set of data. The standard deviation (or its square, the variance),
the range, and the interquartile range (IQR) are perhaps the most familiar.
These will now be described using words for the most part; for more formal
presentations see any elementary statistics text (or Wikipedia). The subsection
that follows discusses their use. Some of the statistics to be discussed can be
very sensitive to extreme values in a data set (outliers), particularly with small
samples. Statistics which try to avoid this problem are often referred to as robust.
The following list is of the most common measures of location.
Mean: More properly called the arithmetic mean1 , this is just the sum of a set
of numbers divided by the sample size, n. It can be sensitive to outliers in
the data, particularly for small n.
Median: This is the middle value in a set of numbers ordered by size. If n is an
even number there will be two central values and their mean is calculated.
The median is an example of a robust statistic since it will be little affected
by extreme values, if at all.
Mode: This is the most commonly occurring value in a set of data. It can be fairly
useless with a set of continuous data where every number may be different
or only ‘accidentally’ the same because of rounding. If the data are grouped
into classes (as in histograms, discussed in Chapter 4) it makes sense to talk
of a modal class and this terminology can be used with one-way categorical
data. If the distribution of the data is smoothed using some technique such
as kernel density estimation (Chapter 6) the mode will be apparent as a peak
in the data. If there are several peaks the data are multimodal. Sometimes
a distinction may be made between the highest peak, the major mode, and
minor modes.
Trimmed means: These aren’t much used in archaeology. An arbitrary percentage
of the smallest and largest values in the data is removed and the mean recalculated. Typically the same small number of cases at each end of the scale
is trimmed in the hope that this will produce a robust statistic unaffected
by outliers.
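The sensitivity to outliers is easily seen with a small invented data set:

```r
x <- c(2, 3, 5, 7, 100)     # 100 is an outlier
mean(x)                     # 23.4 - dragged upwards by the outlier
median(x)                   # 5 - robust, barely affected
mean(x, trim = 0.2)         # 5 - the smallest and largest 20% are trimmed first
```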
The mean is probably the most familiar but also the least ‘intuitive’ of the first
three measures listed. It’s the ‘centre of gravity’ of a set of numbers. Measures of
dispersion can be a bit trickier to define.
Range: The largest and smallest values in a data set are the maximum and
minimum (rather obviously). The range is the difference between them.
It is perhaps the most obvious measure of dispersion but is very sensitive to
outliers, though it can be of interest in its own right.
Interquartile range (IQR): The IQR is a more robust measure of spread; quartiles
need to be defined first. Order the data by size then find three numbers that
divide the ordered data into four equal-sized groups as closely as possible.
The middle one of the numbers is the median; the value which separates
out the smallest ‘quarter’ of numbers from the rest is the first quartile (or
lower quartile); the value separating out the largest ‘quarter’ is the third
1 It’s possible to define other kinds of mean such as geometric and harmonic means and they
do sometimes occur in the archaeological literature, but mean will be used in the text as defined
here.
quartile (or upper quartile)2 . The IQR is the difference between the third
and first quartiles; it encompasses the central 50% of the data and is often
used as a measure of dispersion in conjunction with the median as a measure
of location3 .
Standard deviation: The standard deviation is often used as a measure of dispersion to accompany the mean as a measure of location. Difference each
observation from the mean; square it (so everything has a positive sign) and
calculate the mean of the resulting numbers (i.e. the mean of squared deviations from the mean!). This is the variance; the standard deviation (sd) is
the square-root of the variance and is in the same units as the mean4 . As
with the mean it’s not ‘intuitively obvious’ but makes sense once you think
about it. By definition the number obtained is positive; the more spread
out the data are the larger will be the average (squared) distance from the
mean, so large values indicate a more dispersed set of data. The caveat
should be entered that the variance and standard deviation, like the mean,
can be sensitive to outliers in the data.
Mean absolute deviation: If you had to design a measure of dispersion from
scratch the average of differences from the mean seems more ‘natural’ to
use. You have to ignore the sign of the differences (i.e. use absolute values)
otherwise the result will be zero. This is the mean absolute deviation, otherwise ‘MAD’. It can be found in archaeological publications but is not that
widely used. This is partly because the standard deviation (and variance) is
mathematically easier to handle, with nice properties in some circumstances,
and their use is hallowed by tradition.
2 The ‘proper’ definition of quartile exercises some people, though most readers may find the
discussion esoteric. As just defined the quartiles are numbers that, along with the median, divide
the data into roughly four equal parts. An alternative view is that a quartile is an interval; so, for
example, the first quartile is the interval between the minimum and the number that separates
out the lower quarter of the data. We use the term in its first sense.
3 We’re skimming over the fact that how to divide a data set into ‘quarters’ is not immediately
obvious for inconvenient values of n and different ways of doing so exist. This is touched on again
in Section 3.4; it shouldn’t matter much if n is reasonably large. You may see the terms 25th
and 75th quantile or percentile used to refer to the first and third (or lower and upper) quartiles.
This terminology can be extended to numbers that separate out fractions of the data other than
the smallest quarter and smallest three-quarters.
4 It’s not important for the purposes of this text, but what has just been described is what’s
used if you are dealing with numbers that constitute a population. If you conceive of the numbers
as a random sample from a population and want to draw statistical inferences about properties
of the population the mean of the squared deviations is usually calculated using a divisor of
(n − 1) rather than n. Since we’re not discussing statistical inference there’s no need to go into
the reasons for this. It doesn’t much matter if n is at all ‘large’ and good introductory statistics
texts will alert you to why this is done if it interests you.
3.3.2 Usage (and abusage)
It’s common to see the (arithmetic) mean referred to as the average of a set of
numbers. Don’t do this. Some people who use the term ‘average’ – and the usage
appears in both the popular and unpopular press – appear to be referring to the
mode or median, so it is a vague and often undefined term5.
Most of the statistics listed above are widely used; they are not always useful.
One of the reasons for graphical analysis, and why it should be de rigueur, is that
apart from revealing pattern/structure in the data, and any problems you have
with it, it is informative about the kind of summary statistics that may be useful
for describing the data if that’s the way you wish to go6 .
Figure 3.1, which uses continuous data, illustrates a variety of ‘shapes’ that
data might have, using histograms (Chapter 4). The data have been randomly
generated to display the features emphasised.
Figure 3.1: Some possible distributional ‘shapes’ for continuous data; (a) symmetric
unimodal, (b) bimodal, (c) skewed, (d) with an obvious outlier.
1. Figure 3.1a is a sample from a symmetrical distribution (normal/Gaussian).
The mean, median and mode/modal class will be very similar and data
looking like this are frequently summarised using the mean and standard
deviation and this may often be all that is needed.
2. Figure 3.1b illustrates a bimodal distribution. If the aim is to summarise
what a ‘typical’ value looks like then the mean and median would be fairly
useless as they will be closest to the least typical values (the antimode). The
data could be summarised by stating that it is bimodal, with an indication
5 You’re allowed to use the word in everyday conversation. Depending on context the person
you’re talking to will either think that they (and you) understand what you’re saying or, if they
care about numeracy and it matters, ask for clarification.
6 In view of some of the content of later chapters, this is not the same as saying the graphical
analyses undertaken, initially in the privacy of your own study, should necessarily be inflicted on
an unwitting public in the form of a published paper. The gist of some of what’s to follow is that
the graphics may inform you but that the message they have to impart can often be conveyed
in a more efficient form than a redundant pie chart or barplot.
of where the modes lie. The usual measures of dispersion are also useless
and such data are often best presented in graphical form.
3. Figure 3.1c shows a skewed distribution with a long right-tail. Such data are
quite common. For the ‘shape’ illustrated the mode will be smaller than the
median and the median smaller than the mean. There is no single ‘correct’
measure of location or dispersion; they tell different stories, all are equally
valid. Such a data set is often best summarised graphically.
4. Figure 3.1d shows an obvious outlier. The other figures have used a sample size of 1000; this uses a sample size of 30 (from a normal distribution)
with an added outlier. Faced with this kind of thing in practice, and if it’s
unanticipated, the first reaction should probably be ‘What!’ quickly followed
by ‘Why?’. Depending on how close you are to the data collection it may
become apparent that it’s a measurement or recording error that you may be
able to do something about. If it’s genuine you get interested; if obviously
not and you can’t explain why then you probably discard it. If it is ‘real’
and you want a summary of the data several options are open. The mean
and standard deviation will be misleading and, if you want to summarise the
data using them, omit the offending observation(s) and report on it (them)
separately. This is why robust statistics are advocated in some areas of the
statistical literature; if you use the median and IQR for summary purposes
they are largely unaffected by extremes.
The above is by way of saying that, unless the data are very ‘well behaved’,
continuous data are often best summarised in the form of a graphic. We are rather
more ambivalent about the use of graphical methods for displaying simple categorical data sets, at least as witnessed by countless examples in the archaeological
literature (Chapters 9 and 10). Given data on a single variable, classified into
a small number of categories, anything the data have to tell you that you want
to draw attention to can frequently be described in a single sentence or by commentary on a small table rather than the often large amount of space a pointless
graphic occupies. Pie charts and barplots, probably the most common type of
graphic to be found in the archaeological literature, where the data fall into just
one category (yes, such examples exist!) or two or three, are usually unnecessary
if the story is obvious. Ignoring aesthetic considerations – frequently pie charts
and barplots offend your sensibility – charts/plots are not necessarily technically
incorrect (they can be) but are not always needed7 .
Some graphical displays need to be understood in terms of the statistics that
the display embodies. Boxplots, for example and as typically used (Chapter 7),
7 Possibly the intended audience is judged to be incapable of understanding a sentence to
the effect that ‘Most (95%) of artefacts are of Category A with other categories rare or not
represented’. Or ‘Two categories, A and B, dominate the assemblage (95%) in roughly equal
proportions, with the six other categories represented by three or fewer examples’. We have real
published examples in mind here.
show the median, first and third quartiles, and maximum and minimum values.
This is sometimes called a ‘five-number summary’. Refinements, such as devices for
indicating potential ‘outliers’, are usually in place. Boxplots can be uninformative
if the data are multimodal, and misleading (in terms of the number of ‘outliers’
suggested) if the data are skewed. Violin plots and beanplots, which we don’t
recall seeing in archaeological publications, can be more informative than the quite
common use of boxplots (Chapter 7). The main point here is that you should look
at data in more than one way, graphically, and the graphics will inform you about
the most useful way to display or summarise the data.
3.4
Obtaining summary statistics in R
It’s straightforward to obtain individual descriptive statistics in R, as it should be in
any decent statistical software package (Excel does not fall into this category). In
R create the data set of interest. For illustrative purposes the data of Table 3.1 will
be used; this shows the lengths (mm) of 35 late Romano-British hairpins, and will
be discussed in more substantive detail in Chapter 4 in connection with Table 4.1
where data on early hairpins are also provided. For brevity of presentation these
will be read into R in a variable (vector) x; in practice a more informative name is
advisable.
Table 3.1: Lengths (mm) of late Romano-British copper alloy hairpins from southern
Britain (Cool, 1990).
51 52 54 56 57 58 60 60 61 62
62 63 63 63 65 65 66 67 68 68
70 70 70 70 71 74 75 77 78 78
80 80 82 82 87
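The lengths in Table 3.1 can be entered directly as a vector (with the brief name x used in the text):

```r
x <- c(51, 52, 54, 56, 57, 58, 60, 60, 61, 62,
       62, 63, 63, 63, 65, 65, 66, 67, 68, 68,
       70, 70, 70, 70, 71, 74, 75, 77, 78, 78,
       80, 80, 82, 82, 87)
length(x)   # 35
```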
It is advisable, if not essential, to examine the data graphically before doing
anything even as simple as calculating a mean. To anticipate the material of
Chapters 4 and 6, Figure 3.2 presents a histogram and kernel density estimate
(KDE) of the data.
There is nothing dramatically untoward about the graphs in the way of obvious outliers or multimodality so that, for example, using the mean and standard
deviation as summary statistics is not ruled out. The distribution is, however,
not especially symmetrical so that, for example, different measures of location will
produce different results and it is worth looking at more than one.
Detailed aspects of graph construction and interpretation are discussed in the
relevant chapters; for the record the defaults hist(x) and plot(density(x))
were used to produce the graphs; they’ve been tidied up a bit for presentational
purposes, and code is given in later chapters, but this is not essential if you just
want a quick exploratory look at the data.
Figure 3.2: (a) A histogram and (b) a kernel density estimate for late Romano-British
hairpin lengths.
Once you’ve decided what’s sensible and what statistics you want to look at
how do you get them? Several ‘short-cuts’ exist. A quick way to obtain the mean,
median, quartiles, maximum and minimum in R is to use summary(x) which returns
the following
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  51.00   61.50   67.00   67.57   74.50   87.00
while fivenum(x) produces the same statistics other than the mean and is less
usefully labelled. These are available in base R. Other functions that produce
differing sets of descriptive statistics are available in various packages. Rather
than attempting a comprehensive summary here the best advice is to search the
web to see what’s around8 . Some of these functions include statistics that relate
to the skewness and kurtosis of a set of data; we’re not aware that they are much
reported in archaeological studies – they relate to the shape of a distribution which
is usually best appreciated graphically.
If individual statistics are of interest there is usually a function to produce
them. For example range(x) and IQR(x) produce the range and interquartile
range; min(x), max(x), mean(x), median(x), mad(x), sd(x) and var(x) are all
pretty self-explanatory. For trimmed means, if you want one, use something like
mean(x, trim = .025) which in this case would trim 0.025 of the observations
from each end of the data set.
For the quartiles use quantile(x); the minimum, maximum and median will
also be obtained. The quantile function can be used to determine values that
divide the data into other than four parts. For example, quantile(x, probs =
seq(0, 1, 0.1)) will do its best to identify values that divide the data set into 10
roughly equal-sized groups – it’s not very sensible to try this with a sample size of
35! The seq(0, 1, 0.1) function generates values between 0 and 1 at intervals
of 0.1 and, in effect, the quantile function then finds values that separate out
the lower 10% (i.e. .1), 20% (.2) of the data and so on.
8 e.g., http://www.statmethods.net/stats/descriptives.html
ways in which the quartiles can be produced in the quantile function. The choice
does actually make some difference here, though nothing to get excited about. If
it does make much of a difference it will probably be because a small data set is
being used and the quartiles are fairly meaningless.
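The choice among the nine algorithms is made with the type argument of quantile; a small sketch, using a deliberately tiny invented data set where the choice is visible:

```r
x <- c(51, 62, 70, 80, 87)
quantile(x, probs = 0.25, type = 7)   # R's default definition
quantile(x, probs = 0.25, type = 6)   # an alternative used by some other software
```

With n this small the two definitions give noticeably different first quartiles; for reasonably large n the differences are negligible.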
Chapter 4
Histograms
4.1 Introduction
Histograms are one of the most familiar methods for presenting continuous data.
For illustration unpublished data from Cool (1983) will be used. They are the
lengths (mm) of 90 copper alloy hairpins from southern Britain, 55 classified as
early and 35 as late on archaeological grounds (see Cool, 1990, for a review of the
use of such hairpins). The data are given in Table 4.1.
Table 4.1: Lengths (mm) of Romano-British copper alloy hairpins from southern Britain
(Cool, 1990). The upper set is early and the lower set late.
54 56 74 84 85 85 87 88 89 90
90 92 92 92 92 93 93 93 93 93
94 94 94 95 95 95 96 96 97 97
97 98 98 100 100 100 100 101 102 103
104 104 104 104 105 107 108 108 111 115
115 116 123 128 134

51 52 54 56 57 58 60 60 61 62
62 63 63 63 65 65 66 67 68 68
70 70 70 70 71 74 75 77 78 78
80 80 82 82 87
The data are continuous since, in principle, any length within realistic limits is
possible; in practice data are always rounded, in this case to the nearest millimetre.
A data frame, pins, is created consisting of two columns, Length and Date, the
latter being coded as either Early or Late. The first and last few rows as entered
into R are shown in Table 4.2. A variable for length only can be created using
either Length <- pins$Length or Length <- pins[,1] where $name will extract
the variable with the name given and [,k] will extract the kth column; similarly
Date <- pins$Date can be used to create a Date variable.
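A sketch of creating the data frame, assuming the data have been saved in a file pins.csv with columns Length and Date (the file name here is hypothetical):

```r
pins <- read.csv("pins.csv")
Length <- pins$Length        # extract the Length column by name
Date   <- pins$Date          # the Early/Late coding
table(Date)                  # check the counts: 55 Early, 35 Late
```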
Figures 4.1 and 4.2 show examples of Roman artefacts of the kind that are
subject to the analyses in this chapter. The hairpins data is also used in Chapter 7.
Figure 4.1: A selection of Romano-British hair pins in copper alloy, bone and jet from
(left to right) Worcester, Kenchester, Leicester, Winchester and York. (Source: authors).
Figure 4.2: A pillar moulded bowl from Thonon-les-Bains, France. (Source: J. Paul
Getty Museum, Los Angeles).
32
Table 4.2: The Romano-British hairpin data as entered into R

Length  Date
54      Early
56      Early
74      Early
...     ...
82      Late
82      Late
87      Late

4.2 Histograms in base R
The base R histogram is produced using the hist function, hist(Length). It is
shown in Figure 4.3a.
Figure 4.3: Histograms for the Romano-British hairpin lengths. (a) The R default using
hist; (b) the same data with modified bin-widths, labels and colour.
Histograms are constructed by dividing an interval that covers the range of the
data into classes, otherwise known as bins. The lower boundary of the histogram is
the origin (or anchor point). The number of data values that fall within each bin is
counted and the resultant frequencies represented in the histogram as contiguous
bars. The defining feature of a histogram is that the area of the bar is proportional
to the frequency but if, as is invariably the case in statistical software packages,
bin-widths are the same then the height of the bar is also proportional to the
frequency.
A problem with histograms is that the appearance, and hence interpretation,
can be affected in a non-trivial way by the choice of origin and bin-width (Whallon
1987). Software defaults are not always satisfactory; to our eye there are often too
few bins so the data are oversmoothed. It is straightforward to deal with this in
R; the possibilities can be explored using the breaks argument of hist.
Figure 4.3b was produced using the following code
hist(Length, breaks = 20, main = " ", col = "skyblue",
     xlab = "Romano-British hairpin lengths (mm)",
     cex.main = 1.4, cex.lab = 1.3, cex.axis = 1.2)
where 20 bins are specified. You won’t necessarily get exactly this number because
R will determine sensible boundaries between bins (i.e. nicely rounded numbers)
that may result in a slightly different number of bins. Other uses of breaks
are illustrated later. The opportunity is also taken to show how labels can be
changed and bars coloured. Using main = " " eliminates the original title, but
add text between the quotation marks if it’s needed. Colour can be specified, via
the col argument, in various ways, most simply by providing a name as shown.
Type colours() or colors() to see what’s available (Section 2.3.5). The cex
(character expansion) arguments control the size of the title, labels and axis text.
What’s appropriate will depend on the scale on which a graph is intended to be
viewed.
Figure 4.4: Histograms for the Romano-British hairpin lengths (a) showing the effect
of increasing the number of bins (b) on a probability rather than frequency scale with a
superimposed kernel density estimate.
Figure 4.4 shows some variations on Figure 4.3. The effect of increasing the
number of bins specified to 30 is illustrated in Figure 4.4a. This perhaps makes
the existence of two different size classes even more apparent than in Figure 4.3b.
It will be seen later that these classes can be related to date with the later hairpins
tending to be smaller. The colour used was grey80.
Figure 4.4b was produced using the same code as Figure 4.3b with some additions. The default is to use a frequency scale so that the total area of the histogram
will be equivalent to the total frequency. It is sometimes useful to rescale the histogram so that the total area is 1. Add the argument freq = FALSE to the previous
code to do this. Technically the density histogram so obtained is a non-parametric
density estimate of the distribution of the data. The colour of the borders of the
bars is changed to that of the bars using the argument border = "skyblue". Superimposed on the plots is a kernel density estimate (KDE) which can be thought
of as a smoothed form of density histogram. These will be discussed in Chapter 6;
for now it is sufficient to note that they have presentational advantages compared
to histograms. The KDE was produced from the following code, placed after the
hist command.
lines(density(Length, bw = 4), lwd = 2, col = "red")
where the lwd argument controls the line width.
4.3 Histograms using ggplot2
Use library(ggplot2) to load the package and library(grid) to control aspects
of later graphs. The data will be used in the form given in Table 4.2 and will be
called pins. A quick way of getting a histogram is
qplot(Length, data = pins, bins = 20)
which is equivalent to
ggplot(data = pins, aes(x = Length)) + geom_histogram(bins = 20)
Figure 4.5a shows the output from these.
Figure 4.5: Histograms for the Romano-British hairpin lengths using ggplot2. (a) The
default output, except that 20 rather than 30 bins are used (b) a customized version.
Using qplot is simpler but ggplot is more extensible and it is this that will be
exploited here. Figure 4.5a is not especially satisfactory (the default for bins is
30). To add labels using qplot the arguments main and xlab can be used exactly
as in previous examples. If ggplot is used customization of the histogram is more
complex. Figure 4.5b shows a customized histogram using the code following.
fig5.b <- function() {
  library(ggplot2); library(grid)
  ggplot(data = pins, aes(x = Length)) +
    geom_histogram(bins = 20, fill = "white", colour = "black") +
    # remove minor grid lines
    theme(panel.grid.minor = element_blank()) +
    # add labels, title etc. and control their appearance
    xlab("lengths (mm)") + ggtitle("Romano-British hairpin lengths") +
    theme(axis.text = element_text(size = 12),
          axis.title = element_text(size = 14)) +
    theme(plot.title = element_text(size = 16, face = "bold"))
}
fig5.b()
The code needed to produce the graph has been embedded within a user-defined
function (Chapter 5). We note that it is usually more efficient to code all but the
smallest examples in this way.
In principle every element of the graph can be separately controlled using the
theme function. Everything after the first block of four lines is ‘presentational’.
It can make for rather lengthy code but once you settle on a preferred ‘style’ it’s
easy enough to cut and paste from one analysis to the next. With appropriate
changes of labels the ‘style’ above will be used for most further ggplot2 graphics
without always explicitly giving it, and will be indicated by an ellipsis (...). New
presentational features will be introduced as they arise.
4.4 Histograms from count data
Construction of the foregoing histograms presumes access to the raw data. The
software produces the counts within bins and plots the histogram. On occasion
you may only have access to the counts and the boundaries or mid-points of the
bins. Thus the histogram in Figure 4.3b was based on the following counts
4 6 8 8 4 5 5 6 15 11 8 3 3 1 1 1 1
and intervals defined by
50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 135
giving rise to mid-points
52.5 57.5 62.5 67.5 72.5 77.5 82.5 87.5 92.5 97.5 102.5 107.5 112.5 117.5 122.5 127.5 132.5.
To use these you need to create variables within R thus
Counts <- c(4, 6, 8, 8, 4, 5, 5, 6, 15, 11, 8, 3, 3, 1, 1, 1, 1)
with Breaks and Midpoints similarly constructed.
One way to construct a histogram from such data in base R is
Length <- rep(Midpoints, Counts)
then use the code needed to produce Figure 4.3b except that breaks = 20 is
replaced with breaks = Breaks. The replication function, rep, takes each midpoint in turn and replicates it according to the count in the associated interval.
The breaks = Breaks argument in the hist function then imposes the interval
boundaries that would be obtained had the raw data been available.
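Pulling the pieces of this section together, a self-contained sketch (with Breaks and Midpoints constructed via the seq function, as the text suggests) is:

```r
Counts <- c(4, 6, 8, 8, 4, 5, 5, 6, 15, 11, 8, 3, 3, 1, 1, 1, 1)
Breaks <- seq(50, 135, 5)
Midpoints <- Breaks[-length(Breaks)] + diff(Breaks)/2

# Reconstruct pseudo-raw data: each midpoint repeated by its count
Length <- rep(Midpoints, Counts)

# Imposing the breaks recovers the original binning
hist(Length, breaks = Breaks, col = "skyblue",
     xlab = "Romano-British hairpin lengths (mm)", main = " ")
```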
4.5 Histograms with unequal bin-widths
Occasions can arise, rarely, where it may be desirable to use unequal bin-widths. If
you have the raw data define a variable Breaks with the interval boundaries you
want, then use breaks = Breaks in the hist function.
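A minimal sketch of this, with made-up data and break points (not the Colchester data), also illustrates that hist automatically switches to a density scale when bin-widths are unequal, so that bar areas remain proportional to frequency:

```r
x <- c(2, 5, 8, 12, 15, 22, 35, 47, 55, 70, 90, 150)  # made-up data
Breaks <- c(0, 10, 20, 40, 80, 160)                   # unequal widths

h <- hist(x, breaks = Breaks, col = "skyblue", main = " ", xlab = "x")

# With unequal breaks hist plots densities; bar areas sum to 1
sum(h$density * diff(Breaks))  # 1
```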
A more detailed example using data that have already been grouped into bins
of unequal width now follows. This is based on Baxter (2015b: 52–54) using data
originally discussed in Cool and Price (1995: 15–19).
Table 4.3: Chronology of Roman glass pillar-moulded bowls found during excavations
at Colchester, 1971-1985. The data are ordinal, but the periods are of different lengths.
See Cool and Price (1995: 15–19) and the text for discussion.
    Period   Date      Width   Midpoint   Percentage
    I        43-50       8       46          11
    II       51-60      10       55          22
    III      61-70      10       65           9
    IV       71-80      10       75           8
    V        81-90      10       85           6
    VI       91-100     10       95           5
    VII      101-125    25      112           9
    VIII     126-150    25      137           5
    IX       151-175    25      162           4
    X        176-200    25      187           3
    XI       201-225    25      212           2
    XII      226-250    25      237           2
    XIII     251-275    25      262           2
    XIV      276-300    25      287           3
    XV       301-325    25      312           2
    XVI      326-350    25      337           1
    XVII     351-375    25      362           1
    XVIII    376-400    25      387           1
The data in Table 4.3 relate to Roman pillar-moulded bowls found in excavations
at Colchester, UK. Eighteen periods with associated date ranges were identified.
If the date-span is ignored the data can be represented as a barplot. This is
shown in Figure 4.6a. The data are treated as categorical and ordinal since their
chronological sequence is known. This is quite common with chronological data;
the situation here where the date-span can be specified is seen less often.
Figure 4.6: Different ways of representing the data from Table 4.3. (a) The barplot
respects the ordinal nature of the data, but not the fact that periods are of different
spans. (b) The histogram to the right does respect this.
We can do better because information on the date-span of the periods, in
terms of actual dates and the width of the spans, is available. The earlier periods
are more tightly defined than later ones. Pillar-moulded bowls are an early form
and occur predominantly in the earlier periods. Knowing the dates and spans
means that the data can be treated as continuous and represented in the form
of a histogram. The data are ‘pre-binned’, and the midpoints and intervals are
known, so the methods discussed in the previous section can be used to produce
a histogram (Figure 4.6b).
The barplot was produced using the barplot function (Section 10.2). An
alternative approach to producing the histogram that facilitates the labelling of
bars by period is described in Chapter 4 of Baxter (2015b). This exploits the
barplot function. Read the data of Table 4.3 into a data frame pillar then use
Percentage <- pillar$Percentage; Period <- pillar$Period; Width <- pillar$Width
barplot(Percentage/Width, space = 0, width = Width, names.arg = Period,
col = "skyblue", xlab = "Period", axes = F, cex.names = 0.5)
The zero spacing between bars, space = 0, ensures bars are contiguous, and
widths of bars, width = Width, corresponding to the date-spans are specified.
The Percentage variable is rescaled by dividing by Width so that the area of a
bar corresponds to the percentage. The argument names.arg = Period controls
the labelling and cex.names = 0.5 the size of the labels. Because of the unequal
widths a sensible scale for the vertical axis is not available.
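The rescaling can be checked numerically: after division by Width, the area of each bar (height times width) recovers the original percentage. A quick sketch using the Table 4.3 values:

```r
Percentage <- c(11, 22, 9, 8, 6, 5, 9, 5, 4, 3, 2, 2, 2, 3, 2, 1, 1, 1)
Width <- c(8, rep(10, 5), rep(25, 12))

height <- Percentage / Width   # what barplot actually draws
area   <- height * Width       # bar areas

all.equal(area, Percentage)    # TRUE: areas reproduce the percentages
```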
The histogram makes the quite sharp and steady decline in usage over time
readily apparent. The barplot does not, because it was constructed without using
the spans, which are longer for the later periods. The suggestion in the barplot of
a secondary peak in Period VII exists simply because that span is longer than those
of earlier periods and, for the same reason, the decline in usage is less apparent.
4.6 Using histograms for comparison
It’s possible to use histograms for comparative purposes though we would not
usually recommend it. One possibility is to ‘stack’ histograms. This is unwieldy
for more than two or three histograms and often impractical beyond that.
Albarella et al. (2006), for example, needs 10 pages, most consisting of
five histograms, to compare (mostly) pig lower-tooth measurements from Italian
prehistoric contexts. Figure 4.7 shows an example comparing lengths of the early
and late Romano-British hairpins using ggplot2.
Figure 4.7: Early and late hairpin lengths compared using stacked histograms.
We use Length as originally defined in Section 4.1 (i.e. individual values).
The following code creates a plot object fig4.7.
fig4.7 <- ggplot(data = pins, aes(x = Length)) +
geom_histogram(bins = 20, fill = "white", colour = "black") +
facet_grid(Date ~ .) + ...
You then need to type fig4.7 to print the graph. The same presentational arguments as in the code associated with Figure 4.5 were used.
In base R the simplest thing is to generate the histograms separately and format them when typesetting the document. Alternatively, within R, sandwich the
commands that generate them between par(mfrow = c(2, 1)) and par(mfrow =
c(1, 1)). The first of these generates a 2 × 1 grid for the plots, the second returns
to the default state. If you do it this way you need to ensure that the range of the
histogram is the same in both plots. This can be done using the xlim argument
in the commands generating the histograms, (e.g., xlim = c(40, 140)). For an
example and R code see Chapter 3 of Baxter (2015b). We are not illustrating this
here because ggplot2 is more convenient.
The other way to compare histograms is to superimpose them. Figure 4.8
shows the results using both base R and ggplot2.
Figure 4.8: The early and late Romano-British hairpin lengths compared using superimposed histograms with (a) base R, (b) ggplot2.
We shall explain the code in detail as it is the first occasion we have had to
use plot legends. The base R code is
Length <- pins$Length; Date <- pins$Date
Early <- Length[Date == "Early"]
Late <- Length[Date == "Late"]
Breaks <- seq(50, 140, 5)
hist(Early, breaks = Breaks, ylim = c(0, 15),
     col = "#FF000060", main = " ",
     xlab = "Romano-British hairpin lengths (mm)",
     cex.lab = 1.3, cex.axis = 1.2)
hist(Late, breaks = Breaks, col = "#0000FF60", add = T)
legend("topright", c("Early", "Late"),
       fill = c("#FF000060", "#0000FF60"),
       bty = "n", cex = 1.5)
where the first line creates Length and Date variables and the next two create
variables for the early and late hairpins. The fourth line creates the break points
to be used for both histograms using the seq function. This defines a range of
50 mm to 140 mm with breaks at intervals of 5 mm.
The main idea is straightforward; create two histograms and superimpose one
on the other using add = T in the second hist command. The col arguments
use hexadecimal code, the last two numbers of which control the degree of transparency. The ylim argument specifies the y-axis range and needs to be chosen
to accommodate both histograms. The first three legend arguments specify the
legend position, labels and fill colour. The default is to place the legend in a box;
we usually prefer to omit this using bty = "n". The cex argument controls the
text size of the legend.
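When choosing ylim by hand it can help to inspect the binned counts first; hist with plot = FALSE returns the counts without drawing anything. A small sketch with made-up values standing in for the hairpin data:

```r
# Made-up early/late lengths standing in for the hairpin data
Early <- c(85, 90, 92, 95, 100, 104, 110)
Late  <- c(55, 60, 63, 68, 72, 80)
Breaks <- seq(50, 140, 5)

# plot = FALSE returns the binned counts without drawing anything,
# useful for choosing a ylim that accommodates both histograms
hE <- hist(Early, breaks = Breaks, plot = FALSE)
hL <- hist(Late,  breaks = Breaks, plot = FALSE)
max(hE$counts, hL$counts)  # 2: the lower bound needed for ylim
```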
The R help on legend lists a lot of options and takes some getting used to. Looking
at lots of examples is probably the best way to go. One of the ‘selling points’ of
ggplot2 is that it takes the pain out of legend construction. To judge from online
traffic not everyone would agree; you can modify the defaults but not always in
an obvious way. The code for ggplot2 in Figure 4.8b provides some idea of what
is involved. We have placed the code in a function fig4.8b which produces the
graph when fig4.8b() is typed (see Chapter 5).
fig4.8b <- function() {
  library(ggplot2); library(grid)
  p <- ggplot(data = pins, aes(x = Length, fill = Date)) +
    geom_histogram(binwidth = 5, boundary = 50,
                   position = "identity", alpha = 0.6) +
    scale_fill_manual(values = c("red", "blue")) +
    xlab("Romano-British hairpin lengths (mm)") +
    theme(panel.grid.minor = element_blank()) +
    theme(axis.text = element_text(size = 12),
          axis.title = element_text(size = 14)) +
    theme(plot.title = element_text(size = 16, face = "bold")) +
    # control position and size of legend and text
    theme(legend.position = "top",
          legend.text = element_text(size = 14),
          legend.title = element_text(size = 16),
          legend.key.height = unit(.8, "cm"), legend.key.width = unit(.8, "cm"))
  p
}
fig4.8b()
In addition to ggplot2 the grid package is needed for aspects of the construction.
In the initial ggplot2 command aes(x = Length, fill = Date) specifies
that the Length variable from the pins data frame is to be used for analysis and
split into subsets determined by the value of Date. What happens in this example
is that the two values for Date, Early and Late, are mapped to different fill
colours for the two histograms subsequently produced. In the geom_histogram
function the binwidth and boundary arguments control the bin-width and origin
of the histogram. They are chosen here to be the same as those used in the base
R plot. The alpha argument controls the transparency of the colours used.
Everything that follows the geom_histogram command is presentational. The
code to create the plot is held in p, which is typed at the end so that the plot is
printed when fig4.8b() is executed.
The use of colour was discussed in Section 2.3.5; Chapter 12 of Chang (2012)
provides examples that might be used as ‘templates’. To our eyes the default
colours in ggplot2 sometimes seem ‘washed out’. They can be replaced with your
own choice using the scale_fill_manual function as shown.
When the data are subsetted, as here, ggplot will automatically produce a legend. How you modify the appearance of a legend is not intuitively obvious; Chapter
10 of Chang (2012) provides practical guidance. The default is to place the legend
to the right; we often prefer to have it at the top and legend.position="top"
does this. Replace "top" with "none" if you don’t want a legend. The next two
components of the code control the text and title size of the legend. The final
two components control the size of the legend key – the boxes containing the fill
colours in this instance.
The fill is an example of an aesthetic (aes). Other aesthetics we shall meet
are colour, linetype, shape (of data points) and size. As with legend in base R
all this takes time to get used to though, once you have painstakingly constructed
a few plots, presentational code that is not specific to a particular analysis can be
‘recycled’.
Code can be built up incrementally as

p <- ggplot(data = pins, aes(x = Length, fill = Date))
p <- p + geom_histogram(binwidth = 5, boundary = 50,
                        position = "identity", alpha = 0.6)
p <- p + scale_fill_manual(values = c("red", "blue"))
p <- p + ...
p
where the ellipsis (...) indicates remaining presentational code; it is often
convenient to do this. The plots work well enough here because the two groups are
well-separated. With more than two groups better methods are available (Chapter 6).
4.7 Frequency polygons
In Chapter 6 kernel density estimates (KDEs) are discussed, including their use
as an alternative to the histogram. A difference that needs to be appreciated is
that the histogram is a direct representation of the (grouped) raw data, whereas
a KDE is an estimate of the distribution from which the data are presumed to be
a sample. Other things being equal, we think the KDE has aesthetic advantages,
particularly when it comes to comparing the distribution of several sets of data.
For this purpose, and if one wishes to retain the ‘directness’ of the histogram
representation, another alternative is to use frequency polygons.
Put very simply, these can be constructed by connecting the mid-points of the
upper parts of the bars from which a histogram is constructed, dispensing with
the bars entirely. This makes the comparison of more than one set of data more
straightforward. If the data are pre-binned a frequency polygon can be constructed
directly, without the need for the ‘tricks’ required to represent such data in the
form of a histogram (Section 4.4). As with a histogram the precise appearance
will depend on the choice of bin width, which can be controlled.
It is most convenient to use ggplot2, exploiting the geom_freqpoly function
and this is all that is attempted here. Figure 4.9a shows a frequency polygon for
all the hairpin lengths in pins; Figure 4.9b contrasts the frequency polygons for
the early and late hairpins. For consistency with earlier histograms a bin width of
5mm is used.
Figure 4.9: Frequency polygons for Romano-British hairpin lengths (a) using all the
pins, (b) distinguishing between early and late pins.
The code used for Figure 4.9a was
ggplot(pins, aes(x = Length)) + geom_freqpoly(binwidth = 5) + ...
and for Figure 4.9b
ggplot(pins, aes(x = Length, colour = Date)) +
geom_freqpoly(binwidth = 5, size = 1.2) + ...
omitting presentational arguments. The advantage of the frequency polygons compared to the analogous histograms in Figure 4.8 may not be especially obvious here;
there are only two groups which are well-separated. With more groups and less
clear separation frequency polygons or KDEs are to be preferred.
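For pre-binned data of the kind met in Section 4.4 a frequency polygon can also be sketched directly in base R by joining the (midpoint, count) pairs; this is offered only as an illustration and is not used for any figure here:

```r
Counts <- c(4, 6, 8, 8, 4, 5, 5, 6, 15, 11, 8, 3, 3, 1, 1, 1, 1)
Breaks <- seq(50, 135, 5)
Midpoints <- Breaks[-length(Breaks)] + diff(Breaks)/2

# Joining the (midpoint, count) pairs gives the frequency polygon directly
plot(Midpoints, Counts, type = "b", lwd = 2,
     xlab = "Lengths (mm)", ylab = "Count")
```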
Chapter 5
User-defined functions
5.1 User-defined functions

5.1.1 Introduction
In Section 2.3.1 it was promised that user-defined functions would be introduced
in context. This is it: having introduced some code in the previous chapter,
including an example of a user-defined function in connection with Figure 4.8b,
we elaborate on their construction here.
The basic idea is straightforward. The structure of a function is
name <- function(arguments) {code}
where nothing immediately happens until typing name(arguments) executes the
function. In this text code is constructed so that the output is a graph.
In general it is the use of arguments which endows functions with some of their
power, since they allow the same kind of analysis to be undertaken on different
data sets without rewriting the code. Although we indicate how this can be done,
in practice for much of the code in the text we have left the arguments blank. This
is partly because we have treated analyses as ‘once-off’; it would be possible to
generalise the code in some instances, but at the expense of complicating matters,
and we have largely chosen not to do this.
There are still good reasons for embedding code in a function, even if it is
‘once-off’. Readers new to, or fairly unfamiliar with, R who have taken the trouble
to work through the previous examples will have realised that it is easy to make
typing errors, tedious to correct them, and you have to repeat the exercise every
time you want to construct a histogram. This is particularly irksome if the code
is at all lengthy. If embedded within a function, as explained in the next section,
all that’s needed is to locate and correct errors; the rest of the code does not need
re-typing.
5.1.2 Writing functions – a very basic introduction
If you type ?mean when in R you will see that the function mean is defined as
mean(x, trim = 0, na.rm = FALSE, ...)
where trim is the amount of trimming to apply if you want a trimmed mean
(Section 3.3).[1]
The arguments to the function that concern us are x and trim. It may help to
think of these as two types of argument, a distinction that applies to most other
functions that use arguments. The first argument, x, can be thought of as a
‘dummy’ argument with no default, so it has to be replaced with some data when
the function is called; trim has a default value so can be ignored in the call to the
function unless you want to vary it.
Thus, given a variable weight, mean(weight) will calculate the (arithmetic)
mean and mean(weight, trim = .025) omits the 5% most extreme cases, 2.5%
at each end. Suppose you want to write your own function, exploiting mean to
calculate a mean. You wouldn’t actually do this but we shall present several
approaches that anticipate what occurs later in the text. Begin with
my.mean <- function() {}
my.mean <- edit(my.mean)
which will bring up an edit screen. What you then do is sandwich the necessary
code between the curly brackets. Arguments for the function, if you need them,
are placed within the rounded brackets.
If there are errors in the code, which is likely if it is at all lengthy, R will report
this. It will also fail to save what you’ve done, so you need to take precautions
to avoid having to repeat things. There is more than one approach to this; our
usual practice is to copy the code before exiting and, if things go wrong, go back
into edit mode, paste the text back in, and correct it. The error messages you
get from R are usually, but not always, helpful in this respect. The advantage of
putting code within a function is that you only need to locate and correct errors,
you don’t need to re-type anything else.
We now illustrate some possible approaches. Suppose you know exactly what
you want, the mean of weight with trim = .025. Create the following function
my.mean <- function() {
  m <- mean(weight, trim = .025)
  m
}
my.mean()
[1] For the record na.rm = FALSE says that missing values (NA) should be retained in the
calculation; if there are any the result will be NA. To get a mean that ignores missing values
replace FALSE with TRUE. This and the following ellipsis are ignored for present purposes.
which will return the mean once you execute the function using my.mean() (there is
more than one way of returning results). Functions like this, without arguments,
are useful for ‘once-off’ analyses where the code is at all lengthy; plenty of
examples follow in later chapters. We have sometimes explicitly presented these as
functions and sometimes as code not embedded within a function. We recommend,
however, that if you want to emulate anything that follows, you get into the habit
of placing code within a function. See Section 1.6 for additional help on this.
This is more general.
my.mean <- function(data, TRIM = 0) {
  m <- mean(data, trim = TRIM)
  m
}
In this instance, within the body of the function data is a ‘dummy’ argument
that will be replaced by whatever variable is specified when the function is called.
The argument trim will be replaced by whatever value we choose to give TRIM,
but it has a default of 0.
Thus my.mean(weight) will return the mean of weight, and my.mean(weight,
TRIM = .025) will replace TRIM within the body of the function with 0.025 and
produce a trimmed mean.
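A quick numerical check of the trimming, using a made-up weight vector containing one outlier:

```r
my.mean <- function(data, TRIM = 0) {
  mean(data, trim = TRIM)
}

weight <- c(10, 11, 12, 13, 14, 200)  # made-up data with one outlier

my.mean(weight)              # 43.33 (approx.), pulled up by the outlier
my.mean(weight, TRIM = 0.2)  # 12.5: one value trimmed from each end
```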
5.1.3 More realistic illustrations
Code for base R
Suppose you have a data set that you want a histogram of; we will use Length as
the example. If a quick look at the data is all that is needed then, as already seen,
something like
hist(Length, breaks = 20)
is needed. Here the first argument Length is the data, and the second argument
is the (approximate) number of bins needed. You can play around with the latter
until a satisfactory histogram is obtained and can also change the variable for
which a histogram is required.
For something more suitable for publication you’d want to change some of the
default arguments such as
hist(Length, 20, col = "grey80", main = "", xlab = "Hairpin lengths (mm)")
which removes the default title main, adds a more informative label for the x-axis,
and colours the bars of the histogram a light grey. More flexibility is provided
if a function, my.hist is created to do the work. We proceed by listing some
possibilities.
my.hist <- function(data, nbins = 20, COL = "skyblue", XLAB = "My text") {
  hist(data, nbins, col = COL, main = "", xlab = XLAB)
}
then
my.hist(Length, 20, COL = "grey80", XLAB = "Hairpin lengths (mm)")
will produce a histogram identical to that produced by the previous code. One advantage of doing things this way is flexibility. The arguments in the hist function,
where not fixed, are replaced by those in the call to the function. It’s necessary
to replace data with the name of the data set of interest; other arguments have
defaults but can be varied as wished making it easy to experiment with different
numbers of bins and colour, for example.
The other advantage is generalisation. To anticipate later chapters, data on
the weights of loomweights from excavations at Pompeii are provided. Suppose
these are held in a variable Weight and that a histogram is needed.
my.hist(Weight, 30, COL = "red", XLAB = "Weight (g)")
will get you going.
All the base R code in the text was embedded in functions, though this is not
always shown explicitly. If you want to emulate the examples it’s recommended
that you do the same. The example given above is intentionally simple, to illustrate
the main ideas.
For more complicated code extra paraphernalia is usually desirable. A list of
some of the things we do now follows – it should be emphasised that this is what
we are accustomed to doing rather than ‘best practice’.
• It helps to annotate code if it’s at all complicated and you might need to
return to it in the distant future (this is the voice of bitter experience speaking!).
• We normally start the code with graphics.off(), which has the effect of
clearing the screen of any graphs already there.
• Then we load the packages needed, using the library() function, within
the function itself.
• If the aim of a function is to produce a single graph then forget the next
bit. But you can construct a function to ‘multi-task’ and produce several
graphs, which we often do, and you need to save them. Our approach, and
there are others, is to precede each block of code that produces a graph
with win.graph(). When you execute the function this will result in all
the graphs appearing on the screen. The reason we do it this way is that
the graphs can be saved in Postscript (eps) format – which is what we need
for subsequent use, frequently incorporation into LaTeX documents. Other
formats, including PDF, PNG, BMP, TIFF and JPEG, are also available. If
you click on the graph a menu appears in the top left (on our system) which
allows you to save it in the format of your choice.
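Note that win.graph is Windows-specific. A portable, scriptable alternative (not our own practice, but useful on other systems) is to open a file-based device such as pdf before plotting; postscript() plays the same role for eps output:

```r
# Open a file-based graphics device, draw, then close it
f <- file.path(tempdir(), "hist.pdf")
pdf(f, width = 6, height = 4)
hist(rnorm(100), col = "skyblue", main = " ")
dev.off()
file.exists(f)  # TRUE
```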
Code for ggplot2
Here we draw on code given in Section 4.3, but embed it within functions. The
following code chunks all have the same effect.
my.histGG <- function() {
  library(ggplot2); library(grid)
  ggplot(data = pins, aes(x = Length)) +
    geom_histogram(bins = 20)
}
my.histGG()
my.histGG <- function() {
  library(ggplot2); library(grid)
  p <- ggplot(data = pins, aes(x = Length)) +
    geom_histogram(bins = 20)
  p
}
my.histGG()
my.histGG <- function() {
  library(ggplot2); library(grid)
  p <- ggplot(data = pins, aes(x = Length))
  p <- p + geom_histogram(bins = 20)
  p
}
my.histGG()
Presentational arguments have been omitted; once they are added, as can be seen
for the code given for Figure 4.5b, this can result in moderately wordy code, which
is one reason for placing it in a function.
Note that to get the histogram my.histGG() is typed. We emphasise this because
the second author, who had no previous experience of user-defined R functions,
when checking the code in the text and exiting from an edit session with code in
either of the last two formats, thought that the final p would produce a graph. It
doesn’t, because it is internal to the my.histGG function and only produces a plot
when my.histGG() is invoked.
We’ve kept things simple above by using a specific example. It can be generalized by providing arguments that control the data set, aesthetics, number of bins,
labels, titles and so on. This can become a bit messy if there are many aesthetics
to specify. We do this if we want to use a function frequently with different data
sets, but it is sometimes convenient, if usage is only occasional, to copy and edit
the ‘source’ function according to the needs of the moment.
5.2 Summary
As noted, this is intended as a basic guide to writing your own functions in R.
Decent introductory textbooks will cover the topic much more thoroughly, but
tend to do this late in the text. This is because the early part of a text devotes
far more time to the mechanics of R, data manipulation, and examples of a wide
variety of statistical analyses, often involving fairly small chunks of code. That is,
the treatment is more extensive and thorough, and less specialised, than anything
attempted here. Readers are advised to acquire a decent introductory text on the
subject – this is partly a matter of taste so we shall not recommend any particular
text.
We think, however, that for the purposes of this text an early acquaintance
with user-defined functions is desirable. As stated already, it’s useful if only to
facilitate emulation of code that occurs in later chapters, should you wish to do
so.
Chapter 6
Kernel density estimates
6.1
Introduction
At their simplest kernel density estimates (KDEs) can be thought of as smoothed
histograms. Though mathematically quite complex (Baxter 2003/2015a: 29–37)
they are readily explained and easily constructed in R. An illustration has already
been given in Figure 4.4. To illustrate their construction a small data set (3.7,
4.2, 6.5, 6.8, 7.0) is used. Symmetrical ‘bumps’ are placed at each data point and
‘added up’ to get the KDE. Figure 6.1 shows this.
Figure 6.1: The plot shows how a KDE is constructed; see the text for details.
The KDE is rescaled so that the area under the curve produced is 1 – that is,
it is an estimate of the probability density of which the data are a sample. It is
thus an alternative to a density histogram for displaying data. For those wanting
a mathematical treatment see Silverman (1986) or Wand and Jones (1995).
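The construction just described can be imitated directly in base R: centre a normal density on each data point, scale each by 1/n, and add them up. This is a sketch of the idea rather than how the density function is implemented; the bandwidth of 0.8 is an arbitrary choice.

```r
x    <- c(3.7, 4.2, 6.5, 6.8, 7.0)   # the small data set used above
bw   <- 0.8                          # an arbitrary illustrative bandwidth
grid <- seq(0, 10, length.out = 201)
# one scaled normal 'bump' per data point, held as columns of a matrix
bumps <- sapply(x, function(xi) dnorm(grid, mean = xi, sd = bw) / length(x))
kde <- rowSums(bumps)                # the KDE is the sum of the bumps
plot(grid, kde, type = "l", lwd = 2, xlab = "x", ylab = "Density")
matlines(grid, bumps, lty = 2, col = "grey")   # the individual bumps
```

Because each bump integrates to 1/n, the sum integrates to 1, as required of a density estimate.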
The bumps are called kernels and also have the property that the area they
define is 1 (i.e. they are also probability densities). Different types of kernel are
available but the choice isn’t usually important. A normal (Gaussian) distribution
as the kernel (the famous bell curve) is common and will be used.
What is important is the spread of the kernel, controlled by what is
called the bandwidth or window-width. A large value will oversmooth and a small
value will undersmooth the data, losing information in the first case and failing to
achieve useful ‘simplification’ in the second case. Figure 6.2 illustrates. Rules are
available for making a ‘good’ choice of bandwidth but experimenting with different
values and making a subjective choice is perfectly acceptable. To our eyes default
KDEs often seem oversmoothed; the default can be adjusted downwards until it
induces spurious features in the KDE, often manifest as unsightly bumps.
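Experimentation of this kind is easy to set up; the following sketch (our own, with arbitrary adjust values) overlays KDEs of the same data for several bandwidth adjustments.

```r
x <- c(3.7, 4.2, 6.5, 6.8, 7.0)      # the small data set used above
plot(density(x, adjust = 1), main = "", xlab = "x", lwd = 2)
# smaller and larger adjustments, shown dashed, under- and oversmooth
for (a in c(0.5, 2)) lines(density(x, adjust = a), lty = 2)
```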
Figure 6.2: The plot illustrates the importance of bandwidth choice on KDE construction using (a) a larger and (b) a smaller bandwidth than in Figure 6.1.
Compared to a histogram a KDE does not depend on an origin; the need to
choose a bandwidth is analogous to bin-width choice in a histogram. It is in the
nature of KDE construction that the KDE will ‘spill-over’ into regions beyond
the observed range of the data. If the data are conceived of as a sample from a
population this is not problematic if the spill-over region is plausible. If the data
are bounded – weight is bounded below by zero, for example – and the KDEs
spill-over into an ‘impossible’ region this can give rise to unease. One response is
not to worry about it; another is to truncate the KDE at a boundary, though this
then does not produce a ‘true’ KDE. Producing a proper KDE that respects the
boundary is tricky and beyond the scope of these notes.
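The truncation response can be sketched as follows for data bounded below by zero. The data are invented, and the rescaling step, so that the retained part of the curve again integrates to (approximately) 1, is our own addition.

```r
d <- density(c(0.2, 0.5, 0.7, 1.3, 2.1))     # invented positive data
keep <- d$x >= 0                             # discard the spill-over below zero
area <- sum(d$y[keep]) * (d$x[2] - d$x[1])   # approximate area retained
plot(d$x[keep], d$y[keep] / area, type = "l", xlab = "x", ylab = "Density")
```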
6.2
Example – loomweights from Pompeii
The data in Table 6.1 are the weights and heights of complete loomweights found
in excavations at Insula VI.1, Pompeii, Italy. Analyses of these data were initially
reported in Baxter and Cool (2008) based on 95 loomweights. The subsequent
discovery of a further 42 loomweights (in areas of the finds store where they probably shouldn’t have been located) led to an updated paper (Baxter et al. 2010a).
The 2008 paper suggested that the weights of the loomweights had a clear bimodal
distribution, a conclusion strengthened when using the larger sample in the second paper. The opportunity was taken in the 2010 paper to explore the possible
implications of the bimodality in terms of the nature of the threads that might
have been manufactured, drawing on experimental work published by Mårtensson
et al. (2009) in the interval between the two papers. In 2010 and 2011 a few more
loomweights emerged from the store and it is these, 143 in all, that are reported
in Table 6.1. The site and some loomweights from it are illustrated in Figures 6.4
and 6.5 and additional analysis of the data is provided in Chapters 7 and 8.
Figure 6.3 shows the default KDEs produced in base R and ggplot2. The data
frame, loomweights.all, that was used contains two columns corresponding to
Weight and Height. For some analyses these variables were extracted from the
data frame, with those names, in the manner described in Section 4.1.
Figure 6.3: Kernel density estimates of the weights of 143 loomweights from Table 6.1
using (a) base R, (b) ggplot2. Panel (a) carries the default title density.default(x =
Weight) and the annotation N = 143, Bandwidth = 34.79.
The code used was
plot(density(loomweights.all$Weight))
for Figure 6.3a and
ggplot(loomweights.all, aes(x=Weight)) + geom_density()
for Figure 6.3b.
Table 6.1: Weights (g) and heights (mm) of complete loomweights from the excavation
of Insula VI.1, Pompeii, Italy. The site and loomweights from it are shown in Figures 6.4
and 6.5.
Weight  Height    Weight  Height
 96      69        256     100
106      78        257     100
107      57        258      89
110      74        260      87
111      62        260      98
112      71        266     100
113      69        267      98
117      55        269      92
118      64        269      89
119      56        273      94
133      58        276     102
133      60        277      95
133      70        280      87
134      67        281     104
136      64        283     107
136      58        284      61
137      65        284      91
138      63        286     107
138      74        288      72
139      64        288     110
140      70        290     116
142      74        290     101
143      79        291     107
143      72        296      99
146      61        296     114
152      68        297     103
158      69        298      97
159      83        298      98
159      72        300      98
160      59        301     100
163      64        303      98
163      86        304      91
164      81        304      98
166      57        306      95
166      71        306      97
166      72        310      88
166      61        313     118
167      73        314      97
170      66        316      70
171      69        318      99
173      70        321     112
175      72        322     100
178      87        324     100
180      82        327     104
180      88        327     101
182      68        328     112
187      84        337      95
189      88        340     118
191      60        342     107
193      97        343      87
198      91        343     100
199      79        348     104
201      88        349     111
202      81        350     110
202      95        351      91
202      87        355      94
204      69        356     122
207      87        371     103
211      89        373     112
214      87        374     101
216     100        383     108
222      93        388      99
226      90        391     118
228      96        420     116
230      99        420     128
233      99        449     120
235      78        467     105
239     100        484     116
240      84        564     114
242      88        618     119
247      97        737     127
255     103
Figure 6.4: A view of Insula VI.1, Pompeii, Vesuvius in the background. (Source:
authors).
Figure 6.5: Loomweights from Insula VI.1, Pompeii, illustrating their range of sizes.
That to the far left was probably used for votive purposes. (Source: authors).
Note that the base R plot lists the bandwidth used, while ggplot2 truncates
the KDE at the limits of the data. Both plots have a rather bumpy right tail, the
bumps being attributable to eight outlying points. The main interest lies in what
is happening in the bulk of the data so it’s sensible to remove these.
The data in Table 6.1 are ordered by weight. It can be seen that there is
something of a gap between 391 and 420, with the data very dispersed thereafter,
so the 8/143 values greater than 400 will be omitted. Define a new data
frame as loomweights <- loomweights.all[1:135, ], then Weight as Weight
<- loomweights$Weight and Height as Height <- loomweights$Height.
The variables will be used in this form henceforth.
Figure 6.6 shows modified versions of the plots in Figure 6.3. The default bandwidth of 26.85 gives the dashed red line; multiply this by 0.6 to get a bandwidth
of 16.11, which produces the solid blue line. The value of 16.11 is produced if the
argument adjust = 0.6 is added to the density command, as illustrated in the
code that follows.
Figure 6.6: Kernel density estimates of the weights of loomweights, omitting outliers,
using (a) base R, (b) ggplot2. See the text for details.
The reduction in the bandwidth compared to the default is quite sharp; it
was obtained after some experimentation. As the bandwidth is gradually reduced
the KDE initially remains smooth, while increasingly emphasizing the bimodality.
The KDE starts to look bumpy once values much lower than about 0.6 are used.
The base R code is
kde_lw <- density(Weight, adjust = 0.6)
plot(kde_lw, xlab = "Weight", main = "", lwd = 3, col = "blue",
cex.axis = 1.2, cex.lab = 1.3)
lines(density(Weight, adjust = 1), col = "red", lwd = 3, lty = 2)
legend("topright", c("26.85", "16.11"), col = c("red", "blue"),
lwd = c(3, 3), lty = c(2, 1), title = "Bandwidth", bty = "n", cex = 1.5)
The first line generates the KDE using the density function without plotting
it and illustrates the use of the adjust argument. If desired, the bandwidth can be
extracted with kde_lw$bw. The plot function plots the first KDE and the lines
function adds the second KDE to the plot. The lty = 2 argument specifies line
type 2, a dashed line; lty = "dashed" is also permissible.
The ggplot2 analysis was constructed by separately generating two densities,
with the default bandwidth (the dashed red line) and an adjustment factor of 0.5
(the blue line), and plotting them on the same graph. The way ggplot2 works
deliberately discourages manually added legends, hence the absence of one. The
code is
ggplot(loomweights, aes(x=Weight)) +
geom_line(stat = "density", adjust=0.5, colour="blue",
size=1.3) + xlim(0, 450) +
geom_line(stat="density", adjust=1.0, colour="red",
linetype="dashed", size=1.3) ...
where the ellipsis (...) represents the addition of relevant presentational code similar to that illustrated in connection with Figure 4.8b.
It is sometimes useful to superimpose the KDE on the corresponding histogram
as in Figure 6.7.
Figure 6.7: Kernel density estimates of the weights of loomweights, omitting outliers,
superimposed on a histogram using (a) base R, (b) ggplot2. See the text for details.
The code for base R has already been discussed in connection with Figure 4.4b.
The ggplot2 code is
ggplot(loomweights, aes(x = Weight, y = ..density..)) +
geom_histogram(binwidth = 20, fill = "skyblue") +
geom_density(adjust = 0.6, size = 1.1) + xlim(0,450) + ...
Here xlim(0,450) controls the range of the x-axis and is chosen to avoid truncation of the KDEs.
6.3
Comparing KDEs
For illustration the pins data from Chapter 4 is revisited. Figure 6.8 shows the
base R and ggplot2 graphs. They show the difference between the early and late
pins well. The ggplot2 histogram in Figure 4.8b shows the separation just as well;
which plot you prefer is perhaps a matter of taste. The KDEs have, in our view,
greater aesthetic appeal and are more suitable if more than two groups are to be
compared. An example is provided in Section 7.3.
Figure 6.8: Kernel density estimates comparing lengths of early and late Romano-British
hairpins using (a) base R, (b) ggplot2.
The following R code was used (base R)
# Base R
plot(density(Early), lwd = 3, main = "",
xlab = "Romano-British hairpin lengths (mm)",
col = "blue", xlim = c(40, 140),
cex.main = 1.4, cex.lab = 1.3, cex.axis = 1.2)
lines(density(Late), lwd = 2, col = "red")
legend("topright", c("Early", "Late"), col = c("blue", "red"),
lwd = c(3,3), lty = c(1,1), bty = "n", cex = 1.5)
For ggplot2 the following was used.
# ggplot2
ggplot(pins, aes(x = Length, colour = Date)) +
geom_density(size = 1.3) +
scale_colour_manual(values = c("blue", "red")) + ...
The scale_colour_manual function shows how the default colours can
be modified. The colour = Date component of the aes argument of the ggplot
function maps colours to the early and late dates and you may wish to change
them.
6.4
Two-dimensional (2D) KDEs
A few years ago we’d probably have omitted a treatment of 2D KDEs, but their use
in archaeology is increasingly common and implementation is now straightforward.
Different forms of bivariate KDEs are illustrated in Figure 6.9 for the loomweight
height and weight data using ggplot2. These show the two main modes in the
data with heavier weights tending to be larger, as might be expected.
Figure 6.9: Different approaches to representing two-dimensional KDEs using ggplot2
for the weight and height of loomweights.
The idea is similar to one-dimensional KDEs. Each point in the plane defined
by the two variables is associated with a bump in the form of a ‘hill’. The hill
is a probability density function whose shape, in the most usual applications, is
defined by separate bandwidths for the two variables. The hill may be elongated,
and typically the axes are aligned with those of the plot. The use of a bivariate
normal distribution for the kernels (hills) is common.
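The description above can be turned into a naive implementation: with a diagonal bandwidth matrix the kernel is just the product of two univariate normal densities, and the estimated density at a grid point is the average of the kernels evaluated there. The function name and toy data below are our own, and note that the bandwidth convention differs from that of kde2d (discussed later), which scales its h argument differently.

```r
# Naive 2D KDE with independent normal kernels (a diagonal bandwidth
# matrix); hx and hy are the kernel standard deviations.
kde2d.naive <- function(x, y, hx, hy, gx, gy) {
  outer(gx, gy, Vectorize(function(u, v)
    mean(dnorm(u, mean = x, sd = hx) * dnorm(v, mean = y, sd = hy))))
}
set.seed(1)
wt <- rnorm(50, 250, 60); ht <- rnorm(50, 90, 15)   # invented data
gx <- seq(50, 450, length.out = 40)
gy <- seq(40, 130, length.out = 40)
z  <- kde2d.naive(wt, ht, hx = 30, hy = 8, gx, gy)
contour(gx, gy, z, xlab = "Weight", ylab = "Height")
```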
Baxter et al. (1997) introduced the use of 2D KDEs into the archaeological
literature; the use of KDEs of any kind was still uncommon in this literature at
the start of the 2000s (Baxter 2003: 36–37) but the situation has changed since
then. One of the examples in Baxter et al. (1997) was an intra-site analysis
of the spatial locations of artefacts at the Mask Site studied by Binford (1978).
In their influential introductions to Geographical Information Systems (GIS) for
archaeologists both Wheatley and Gillings (2002: 186–187) and Conolly and Lake
(2006: 175–178) note the potential of 2D KDEs for spatial interpolation.
Examples at the intra-site level involving the study of artefact distributions
include Keeler (2007) at the Late Upper Palaeolithic site of Verberie (France);
Aldeias et al. (2012) at the late Middle Palaeolithic site of Roc de Marsal (France);
and Oron and Goren-Inbar (2014) at the Mousterian site of Quneitra in the Golan
Heights. Sayer and Weinhold (2013) investigated spatial clustering among graves
in a number of Anglo-Saxon cemeteries. At larger spatial scales McMahon (2007)
investigated the distribution of cultural resources in Colorado (USA); Schmidt et
al. (2012) studied the settlement patterns in Iberia of late Middle Palaeolithic and
early Upper Palaeolithic sites; Monna et al. (2013) compared the density of the
spatial distribution of Breton and Norman Middle Bronze Age palstaves across
Northern France; and De Reu et al. (2013) mapped the distribution pattern of
Bronze Age barrows in the highly urbanized landscape of north-western Belgium.
Most, though not all, of these use GIS.
Returning to the code for Figure 6.9 the following was used.
ggplot(loomweights, aes(x=Weight, y=Height)) +
  geom_point(size=3) +
  #stat_density2d(aes(colour="red"), size=1.1, h=c(100,20)) +          # a
  #stat_density2d(aes(fill=..level..), geom="polygon", h=c(100,20)) +  # b
  #stat_density2d(aes(fill=..density..), geom="raster", contour=FALSE,
  #               h=c(100,20)) +                                       # c
  #stat_density2d(aes(alpha=..density..), geom="tile", contour=FALSE,
  #               h=c(100,20)) +                                       # d
  xlim(50,450) + ylim(40,130) + ...
To get Figures 6.9a,b,c,d remove the relevant comment signs, #, before the appropriate stat_density2d command. Other than changing the contour colour to red
in Figure 6.9a, and the h=c(100,20) argument, defaults are used. You can, if you
wish, omit the latter without harm, but it may be helpful to go into a little more
detail as it shows how to control aspects of the appearance of the plots.
The numbers 100 and 20 are the bandwidths used for the KDE. The coordinates used to generate the plots are derived from the kde2d function from the
MASS library associated with the book Modern Applied Statistics with S (Venables and Ripley, 2002). The bandwidths in the kde2d function can be controlled
using h = c(a, b). The defaults can be found using bandwidth.nrd(Weight) and
bandwidth.nrd(Height) which gives (rounded) values of (126, 29) in this instance. These have been changed to (100, 20) in the example to show how they
can be modified if you wish – it doesn’t make too much difference in this instance.
The output from kde2d can be passed to base R functions to visualize the density.
Thus Figure 6.10 uses the image and persp functions to visualize the output.
Contour plots in base R are illustrated in Figure 6.11.
Figure 6.10: (a) An image and (b) a perspective plot representing two-dimensional
KDEs using base R for the weight and height of loomweights.
The default colouring for the image plot has been used; it may be compared
with Figure 6.9. Three-dimensional plots are not supported in ggplot2, so if you
like this kind of thing, persp or functions that do a similar job in
other packages are needed. The code used was
library(MASS)
# Code for Figure 6.10a
dens <- kde2d(Weight, Height, h = c(120, 20), n = 100,
lims = c(50, 450, 40, 130))
image(dens, xlab = "Weight", ylab = "Height")
# Code for Figure 6.10b
persp(dens, col = "green", theta = 30, phi = 25,
ticktype = "detailed", nticks = 3,
xlab = "Height", ylab = "Weight", zlab = "Density",
cex.lab = 1.3, cex.axis = 1.2)
where the lims argument in kde2d controls the range of the axes (50, 450) for the
x-axis and (40,130) for the y-axis. Producing the image plot is straightforward, the
perspective plot less so and it’s best to have a look at the help facilities for detail.
The arguments theta and phi determine the position from which the ‘landscape’
is viewed and experimentation with these would usually be necessary.
It was implied at the start of this section that 2D density estimation might be
regarded as a specialised topic, beyond what might be expected in an introductory
text; the hope is that it has been demonstrated that the production of visually
appealing and informative graphics is straightforward. There is, however, an issue
in the construction of the plots illustrated that is neglected and worth mentioning.
This is that the kernels used so far have been orientated with the axes of the
graph. In much software, including R packages, this is the default and often the
only option.
If you look at Figure 6.9 you will see that the main orientation of the point
scatter is along a south-west/north-east axis. The other ‘natural’ axis is north-west/south-east. The kernels, however, are orientated west-east and north-south
so may not capture the natural patterning of the data as well as they might. At
the expense of some mathematical complication this can be dealt with. The kernel
used is two-dimensional and can be displayed as a 2×2 table (matrix) with identical
off-diagonal elements. The off-diagonal value controls the orientation of the kernel,
but has so far been taken to be zero. A brief account of the mathematics is given
in Baxter (2003: 32–33) and will not be repeated here.
It’s possible to implement the more general methodology using the ks package.
This is illustrated in Figure 6.11 where the opportunity is taken to contrast the
results with those using kde2d with the base R contour function.
Figure 6.11: Contour plots for the loomweight data based on two-dimensional KDEs
using base R; (a) with kernels oriented to the axes (b) with non-diagonal kernels.
Although both plots would probably be interpreted in the same way by most
people, there are noticeable differences. The contouring produced using the
non-diagonal bandwidth matrix better respects the orientation of the data. The code
used is given below, where dens is as defined in the code associated with Figure 6.10.
# Diagonal bandwidth matrix
contour(dens, drawlabels = F, ...)
points(loomweights, pch = 16)
# Non-diagonal bandwidth matrix
library(ks)
H <- Hpi(loomweights)
KH <- kde(loomweights, 2*H)
plot(KH, cont = c(seq(25, 95, 10)), drawlabels = F, ...)
points(loomweights, pch = 16)
The drawlabels = F argument suppresses the default labelling of the contour levels; if
you want these, omit it. The ellipsis (...) should be replaced with the
arguments used to control colour, labels, text size and so on.
For the plot produced using the ks package, with a non-diagonal bandwidth
matrix, Hpi produces an estimate of the matrix; the kde function estimates the
KDE using 2*H, a rescaled bandwidth matrix. The cont = c(seq(25, 95, 10))
specifies the contour levels, the seq function generating levels between 25% and
95% at intervals of 10%. Finally, in both plots, the points function adds the data
points with pch = 16 specifying a solid circle as the plotting symbol.
Chapter 7
Boxplots and related graphics
7.1
Introduction
Boxplots are among the most commonly used methods for univariate data display
and comparison. John Tukey’s seminal 1977 text, Exploratory Data Analysis (Tukey 1977), both advertised and helped popularise what are also known as
box-and-whisker plots, though their origin is to be found in earlier work. Wickham
and Stryjewski (2011) provide a useful account of the early history and subsequent
developments.
We suspect many readers will have some acquaintance with boxplots; violin
plots (Hintze and Nelson 1998) or beanplots (Kampstra 2008), which ‘generalise’
the boxplot, may be less familiar and we cannot recall seeing applications of them
in the archaeological literature.
Figure 7.1 illustrates default output using boxplot in base R, vioplot from
the package of that name, and beanplot from the beanplot package.
The Early data from the pins data set are used. Without writing your own
code ggplot2 does not have an option for beanplots; the functions for producing
boxplots or violin plots are intended for comparing two or more groups and need
to be ‘tricked’ into providing plots for a single variable. This is not illustrated.
Use boxplot(Early), vioplot(Early) and beanplot(Early) after installing
and loading the necessary packages. Their construction and interpretation are
discussed in the next section.
7.2
Construction and interpretation
Descriptive statistics were discussed in Chapter 3. The boxplot is a graphical
representation of what was there called a five-number summary. The lower and
upper bounds of the box are the first and third quartiles so the length of the box
corresponds to the interquartile range (IQR) and contains the central 50% of the
data; the width is arbitrary. The bar through the centre of the box is the median.
The lines (or whiskers) extend from the limits of the box to, at most, 1.5 × IQR;
Figure 7.1: (a) A boxplot, (b) a violin plot, (c) a beanplot for the lengths of early
Romano-British hairpins (Table 4.1). The default output using functions discussed in
the text is shown.
points beyond this range are highlighted as potential outliers. If there are no such
cases, the whiskers extend to the maximum and minimum values and define the
range of the data.
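The recipe just described can be computed directly. The function below is our own sketch using quantile(); note that boxplot itself uses Tukey's hinges, which can differ slightly from these quartiles for small samples.

```r
# Compute the ingredients of a boxplot for a numeric vector: quartiles,
# whisker ends (most extreme points within 1.5*IQR of the box) and outliers.
box.stats <- function(v) {
  q <- unname(quantile(v, c(0.25, 0.5, 0.75)))
  iqr <- q[3] - q[1]
  inside <- v[v >= q[1] - 1.5 * iqr & v <= q[3] + 1.5 * iqr]
  list(median = q[2], quartiles = q[c(1, 3)],
       whiskers = range(inside),
       outliers = v[v < min(inside) | v > max(inside)])
}
# e.g., box.stats(Early) for the early pin lengths
```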
The violin and beanplots show rotated and mirrored KDEs. It is usual, but
not mandatory, to show a violin plot superimposed on a boxplot; in the beanplot
the horizontal lines within the plot correspond to individual values, their length
being proportional to the number. The solid black line that extends to the sides
of the plot is the mean.
For the violin and beanplots it is desirable to be able to control the bandwidth.
The vioplot implementation of the former is limited in this and other respects,
and only the ggplot2 version is illustrated in later examples.
7.3
Using plots for comparing groups
For this section a different version of the Pompeian loomweight data will be used.
At the time Baxter et al. (2010a) was written detailed phasing of the site was
not available. Subsequently this deficiency has been remedied and 118 of the
loomweights (excluding outliers) come from dated contexts. These can be allocated
to seven phases but the numbers of loomweights in some of these is small, so for
present purposes a coarser four-phase grouping is used. Phase I covers the centuries BC;
Phase II covers roughly the first third of the first century AD; Phase III is roughly
the second third of the century to AD 62 when Pompeii was badly shaken by
an earthquake; and Phase IV is the period from then to AD 79 when Pompeii
was destroyed by the eruption of Mount Vesuvius. There is some overlap between
Phases I and II and between Phases II and III. The numbers of loomweights in
each phase are 19, 33, 21, 45.
In addition to Weight and Height the maximum and minimum lengths (mm)
of the rectangular tops and bases of the loomweights (Tmax, Tmin, Bmax and
Bmin) were measured. The variables occupy the first six columns of a data frame
lwphased, with Phase in the seventh column. The first few rows of data are shown
in Table 7.1; the full data set is used in Chapter 8.
Table 7.1: The first few lines of data for phased loomweights and six dimensional variables.
Weight  Height  Tmax  Tmin  Bmax  Bmin  Phase
 96      69      39    24    35    25    IV
106      78      30    26    40    40    IV
107      57      38    21    46    17    II
110      74      23    23    33    30    II
...     ...     ...   ...   ...   ...    ...
If all that’s needed is a quick look at the data minimal plots to compare the
weights across phase can be obtained using the following code.
boxplot(Weight ~ Phase, data = lwphased)                    # base R boxplot
beanplot(Weight ~ Phase, data = lwphased)                   # beanplot
ggplot(lwphased, aes(x=Phase, y=Weight)) + geom_boxplot()   # ggplot2 boxplot
ggplot(lwphased, aes(x=Phase, y=Weight)) + geom_violin()    # ggplot2 violin plot
Figure 7.2 shows a base R boxplot and a beanplot using the beanplot package
for comparing weights across phases.
Figure 7.2: Visualizing the weights of phased loomweights from Table 6.1 using (a) a
base R boxplot and (b) a beanplot from the beanplot package.
It’s possible to embellish or control the plots in a variety of ways for which
it is best to see the help facilities for the function or look at what is on offer
online. Restraint has been exercised in Figure 7.2. Apart from adding to the code
above the usual presentational arguments for boxplot and beanplot, the argument
boxwex = .5 was added to the boxplot code and bw = 25 to the beanplot code.
The first of these reduces the default width of the boxes by a half. The bw argument
in beanplot controls the bandwidth of the KDEs; beanplot(Weight ∼ Phase,
data = lwphased)$bw extracts the default bandwidth of about 36 and this was
reduced to 25 for the plot.
The figure suggests that the use of boxplots is inappropriate here. The hourglass shapes of the beanplots for Phases II and IV suggest that their weights have
a bimodal distribution. This is not apparent from the boxplots which, for useful
interpretation, require the data to be unimodal. Another way of presenting the
data would be to use KDEs as illustrated in Section 6.3. This results in Figure 7.3.
Figure 7.3: Kernel density estimates of weights of phased loomweights.
Looking at this together with the beanplot shows that, other than Phase III,
weights in all phases are spread across the full range. The data for Phases I
and III are skewed towards low and medium/large weights respectively. There is
a tendency for later loomweights to be larger but the bimodal distributions for
Phases II and IV rule out a simple interpretation. The patterns that are visible
can be interpreted in terms of the nature of weaving practice over time in the
Insula but the (necessarily) detailed discussion is beyond the scope of the present
text. The ‘business part’ of the code used was as follows.
fig7.3 <- function () {
  library(ggplot2); library(grid)
  p <- ggplot(lwphased, aes(x = Weight, colour = Phase)) +
    geom_density(size = 1.3, adjust = 0.8) + coord_fixed(40000) + ...
  p
}
fig7.3()
Determining the aspect ratio using coord_fixed requires a little experimentation. A typical value on the x-axis (e.g., 250) divided by the maximum on the
y-axis (0.006) produced a value close to the 40000 used, but we varied the value
about this to see what happened.
Figure 7.4 shows boxplots and violin plots obtained using ggplot2. The latter
plots suggest that some of the boxplots are misleading.
Figure 7.4: Visualizing the weights of phased loomweights from Table 7.1 using ggplot2
for (a) boxplots and (b) violin plots.
Omitting most presentational code the following was used for Figure 7.4a.
fig7.4a <- function () {
  p <- ggplot(lwphased, aes(x = Phase, y = Weight)) +
    geom_boxplot(fill = "skyblue", width = 0.5) +
    theme(legend.position = "none") + ...
  p
}
fig7.4a()
Similarly, for Figure 7.4b use the following.
fig7.4b <- function() {
  p <- ggplot(lwphased, aes(x = Phase, y = Weight)) +
    geom_violin(fill = "magenta", adjust = .9) +
    theme(legend.position = "none") + ...
  p
}
fig7.4b()
These tell the same story as previous figures. The legend is suppressed using
theme(legend.position="none"). As with geom_density, the default in the
violin plot is to truncate the KDEs at the limits of the range for each phase.
This and other aspects of the appearance, and that of the boxplots, can be
modified. For illustration the Romano-British hairpins data will be used. Previous analyses (e.g., Figures 4.8 and 6.8) have shown that, within the early and
late groups, the distribution of lengths is unimodal, so, in contrast to the phased
loomweights, boxplots are a useful method of comparison.
Figure 7.5 shows embellished boxplots obtained using both base R and ggplot2
and violin plots obtained with the latter package.
Figure 7.5: Boxplots and violin plots for Romano-British hairpin lengths; (a) base R
boxplots, (b) ggplot2 boxplots, (c) ggplot2 violin plots.
The code for the base R boxplot is
boxplot(Length ~ Date, data = pins, notch = T, varwidth = TRUE,
boxwex = 0.4, pch = 16, col = c("red", "skyblue"), ...)
The notch argument produces a notched boxplot with the useful interpretation
that if notches do not overlap it is likely that the medians for the corresponding
groups are significantly different. The argument varwidth = TRUE (the default is
FALSE) produces boxes with widths proportional to the square-roots of the number
of observations in the groups. The boxwex argument rescales the size of the boxes
usually resulting, with relatively few groups, in a more pleasing appearance. The
plotting character for outlying points is controlled by pch while col determines
the colour of the fill for the boxes. Just give one colour (e.g., col = "red") if you
want them to be the same. Remaining presentational arguments, omitted in the
code above, control labels and text size.
The corresponding code for ggplot2 is
ggplot(pins, aes(x = Date, y = Length)) +
geom_boxplot(aes(fill = Date), notch = TRUE, width = 0.4,
varwidth = TRUE, outlier.shape = 16, outlier.size = 2) +
scale_fill_manual(values = c("red", "skyblue")) +
theme(legend.position = "none") + ...
where the arguments to geom_boxplot are mostly similar to those used with
boxplot or are self-explanatory. The width argument is equivalent to boxwex. If a
single colour is wanted for box fills, delete aes(fill = Date) in the geom_boxplot
call, add an argument with a colour name (e.g., fill = "blue") outside aes, and
omit scale_fill_manual.
The violin plot was produced using the following code.
ggplot(pins, aes(x = Date, y = Length)) +
geom_violin(aes(fill=Date), adjust=0.95, trim=FALSE, scale = "count",
draw_quantiles = c(0.25, 0.5, 0.75)) + ylim(40, 140) +
scale_fill_manual(values = c("red", "skyblue")) +
theme(legend.position = "none") + ...
where the trim=FALSE argument avoids truncation at the limits of the range;
the limits on the y-axis, specified in ylim, need to extend somewhat beyond the
range of the data to ensure this. If scale = "count" is used areas are scaled
proportionally to the number of observations. The draw_quantiles argument
draws lines at the quantiles listed, the first and third quartiles and median in the
example.
It should be emphasised that these examples are designed to illustrate presentational options and are not intended as recommendations. The possibilities
shown are also not exhaustive, for which see the documentation on the functions.
7.4
Discussion
As noted at the start of this chapter, boxplots in the form they are now most
commonly seen have been around since the 1970s. The earliest texts intended
to introduce mathematical and statistical methods to archaeologists (Doran and
Hodson 1975, Orton 1980, Shennan 1988, Baxter 1994, Drennan 1996) either do
not mention them at all or do so only briefly. Shennan (1988: 45–46), for example
devotes just two paragraphs to them and does not expand on this treatment in
the second edition of the text (Shennan 1997). Baxter (1994: 30–32) (reprinted
as Baxter 2015a) has three pages on the subject, the graphical illustrations being
a good example of how ‘primitive’ things were then compared to what is now
possible. Drennan’s (1996: 39–44) treatment of what he calls ‘box-and-dot plots’
is slightly longer but rather idiosyncratic.
It is, in fact, difficult to find published archaeological applications of boxplots
that pre-date 1990, and then there was not exactly a flood, though usage has increased noticeably since then. Boxplots are a simple tool for comparing dimensional
data of artefacts from different contexts, for example, or other archaeological entities of interest. More recent applications of this kind include Hiscock (2003),
Mills (2007), Jennings (2008), Morehart and Eisenberg (2010), Mackay (2011) and
Scharra-Liška et al. (2016). Boxplots have also been used for comparing variables derived in morphological studies of animals and humans, or variables measured in
the scientific study of the properties of materials (e.g., Nikita et al. 2012, Pollard
et al. 2012).
All the papers just cited use boxplots constructed and presented in a similar
manner to Figure 7.2 (apart from unimportant software dependencies that affect
their appearance). The notched boxplot was introduced by McGill et al. (1978).
The earliest archaeological applications we have located are from the early 1990s
(Stafford, 1991, 1994) but very little else from that decade. Since then their use
has increased (e.g., Wurz et al. 2003, Liebmann et al. 2005, Scarry and Reitz 2005,
Plummer et al. 2008, Seetah et al. 2014, Duval et al. 2015) but to nothing like the
same extent as the more usual form. A notched boxplot makes a brief appearance
in the second edition of Drennan (2009), but it does not otherwise feature in the
archaeological textbook literature.
Little concern seems to have been expressed about problems with boxplots if
the data within groups are multimodal, and perhaps this is because this is rarely
an issue. Before commenting further, one additional issue with the standard
boxplot may be mentioned. This is that, if the data set is large and the distribution
is heavily skewed, far too many outliers may be suggested.
The usual rule is that an outlier is flagged if it is more than 1.5×IQR distant from the box. It can be shown that if the data are sampled from a normal
distribution there is about a 0.007 probability that a case will be flagged as an
outlier. That is, with 1000 observations you’d expect something like 7 outliers to
be flagged (even though there may be nothing untoward about them). Suppose
now that you have a highly skewed distribution with a long tail to the right –
such distributions occur naturally and not uncommonly¹. In these circumstances
a lot of cases may stray quite naturally into the region where they are classified
as outliers even though they are not. This is illustrated in Figures 7.6a and 7.6b.
The histogram in the first of these is generated from a random sample from a
lognormal distribution; this is naturally skewed and, by construction, will have a
long right-tail none of which, unless we are unlucky, will be outliers. The boxplot
of Figure 7.6b suggests, however, quite a few outliers.
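The point is easily checked by simulation. The sketch below (our own illustration, not code from the text) counts the points flagged by the standard 1.5×IQR rule for same-sized samples from a normal and a lognormal distribution, using the built-in boxplot.stats function:

```r
# Count the points flagged as 'outliers' by the standard 1.5 x IQR rule
# for samples of the same size from a normal and a lognormal distribution
set.seed(42)                           # for reproducibility
flagged <- function(x) length(boxplot.stats(x)$out)
flagged(rnorm(1000))                   # close to the expected 7 or so
flagged(rlnorm(1000))                  # typically far more, though none are 'real'
```

Re-running with different seeds shows the lognormal sample consistently flags several times as many points as the normal one.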
There is a straightforward and practical way to get round this problem, which
is to produce an adjusted boxplot using the adjbox function from the robustbase
package which works exactly as boxplot apart from the adjustment for skewness. The mathematics behind the adjustment is quite complex (Hubert and Vandervieren 2008). Roughly what happens is that a (robust) measure of skewness
is defined; the limits beyond which a point is declared an outlier are differentially
modified (in a slightly ad hoc manner) by quantities that depend on this measure.
In particular the whisker that stretches in the direction of the long tail is extended
to suggest far fewer outliers. The effect can be seen in Figure 7.6c where there is
a substantial reduction in the number of cases flagged.
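Assuming the robustbase package is installed, an adjusted boxplot needs no more than a change of function name; a minimal sketch along the lines of Figure 7.6:

```r
# Adjusted boxplot for skewed data; requires the robustbase package
library(robustbase)                    # provides adjbox

set.seed(1)
x <- rlnorm(300)                       # skewed sample, as in Figure 7.6
boxplot(x)                             # standard rule flags many spurious outliers
adjbox(x)                              # skewness-adjusted rule flags far fewer
```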
It would be nice to cite archaeological applications of violin plots and beanplots
but we are not aware of anything. Apart from a possible dereliction of scholarship
on our part a good question to ask is ‘Why?’. Wickham and Stryjewski (2011: 11)
¹ Weight in animal or human populations is one example; trace element concentrations in the materials from which artefacts are manufactured is another.
Figure 7.6: For 300 observations randomly generated from a skewed lognormal distribution (a) a histogram, (b) a boxplot obtained using boxplot, (c) an adjusted boxplot
using adjbox from the robustbase package.
observe that boxplots ‘were created to provide a succinct distributional summary
that could easily be created by hand’; 40 years on nobody would dream of doing
this. They further comment to the effect that the dramatic increase in computer
power since then has led to the development of variants of the boxplot that remain
true to its original spirit ‘while supporting much richer display of the underlying
distributions’. The violin plot and beanplot are among these variants; both are as
compact as the boxplot and as easy to produce, with the advantage of displaying
features of the data not possible with the usual form of boxplot.
Correspondence analysis (Chapter 12) is one of the most useful, and used,
multivariate methods in archaeology for exploring large tables of data. Baxter
(1994/2015a: 133–139) discusses its adoption in archaeology in some detail. In
summary, the method was first applied to archaeological data in the mid-70s; was
explored by French archaeologists in the late 1970s; ‘took off’ in the Anglo-Saxon
speaking world as a result of publications by Scandinavian scholars in the 1980s;
began to be widely adopted in Britain in the early 1990s; but did not hit US
archaeology till the later 1990s. Some sub-disciplines, such as Roman archaeology,
were even later adopters (Cool and Baxter 1999). The point is that archaeologists
can be rather tardy in picking up on patently useful and readily applied methods
that emerge in the statistical literature.
To some extent the history of the use of boxplots in archaeology mirrors this
pattern; violin plots and beanplots may be going the same way. This describes
rather than explains the situation. We won’t speculate too much about the latter.
Unfamiliarity with the statistical literature and/or a reluctance to abandon familiar
software and experiment are almost certainly contributory factors. The beanplot
is a comparatively recent development as is the ease with which violin plots can
now be used. The message is that archaeologists who find boxplots useful for
displaying their data might find it worthwhile to explore these other alternatives.
Chapter 8
Scatterplots
8.1
Introduction
A common type of graph, with which everyone will be familiar, is what we shall call
a (bivariate) scatterplot. The values of two variables are plotted against each other
to see how, if at all, they are related. An example has already been presented,
without comment, in Figure 6.9, where the weight and height of loomweights were
plotted against each other with contours derived from a two-dimensional kernel
density estimate superimposed.
The loomweight data, omitting outliers, as held in the data frame loomweights
will be used for illustration here (see Section 6.2 for the detail about the data).
Remember that this can be stored in different forms, either as a data frame
loomweights with two columns Weight and Height, or as two separate variables,
Weight and Height. If the latter then plot(Weight, Height) produces the default base R scatterplot. If stored as a data frame then, apart from the labelling,
each of the first four lines of the following code will produce the same plot, shown
in Figure 8.1a; the ggplot2 code given is that used for Figure 8.1b.
plot(loomweights$Weight, loomweights$Height)  # a
plot(loomweights[,1], loomweights[,2])        # b
plot(Height ~ Weight, data = loomweights)     # c
plot(loomweights)                             # d
ggplot(loomweights, aes(x = Weight, y = Height)) + geom_point()
Of the base R options when data are in a data frame (d) only works if it contains
just two columns in the correct order for plotting. Options (a) to (c) are different
ways of picking out variables to plot from a data frame containing several variables.
Options (a) and (b) are of the form plot(x, y) the first named being that on the
x-axis. In (a) variables are picked out by name using $name, in (b) by using column
number, [,number]. Option (c), structured as plot(y ~ x, data = dataframe),
was used for Figure 8.1a, where dataframe is the set of data investigated.
Figure 8.1: Default scatterplots for loomweight heights and weights using (a) base R
and (b) ggplot2.
8.2
Presentational aspects
8.2.1
Plotting symbols, axis labels, axis limits
In the previous chapters presentational arguments have sometimes been implied
with an ellipsis (. . .). In this section enhanced forms of the graphs in Figure 8.1 will
be developed, as in Figure 8.2; the full code is given for convenience of reference.
Figure 8.2: Enhanced scatterplots for loomweight heights and weights using (a) base R
and (b) ggplot2.
The base R code to follow should be reasonably self-explanatory. The intention is to show how different elements of the plot can be modified rather than to
recommend the different options shown.
scatterplotR <- function(){
plot(Height ~ Weight, data = loomweights,
  xlim = c(50,450), ylim = c(50,130),            # Modify plot limits
  xlab = "Weight (g)", ylab = "Height (mm)",     # Add axis labels
  main = "A base R scatterplot",                 # Add a title
  col = "red", pch = 16, cex = 1.2,    # colour, shape, size of plot symbols
  cex.axis = 1.2, cex.lab = 1.3, cex.main = 1.4) # Text sizes: axis, labels, title
}
scatterplotR()
Figure 8.3: Basic code for a base R scatterplot.
The use of colour was discussed in Section 2.3.5. Options for the plotting
character (pch) are shown in Figure 8.4; cex controls the size of the plotting
symbol; other cex arguments control the size of the axis text, axis labels and title.
Figure 8.4: Available symbols (shapes) for plotting points in R.
In the ggplot2 code that follows the size, colour and shape (equivalent to pch
in base R) of the plotting symbols are specified by the arguments in the geom_point
function. Note the difference from the base R code in the way the plot limits and
axis labels are determined. If any of the shapes 21–25 are used, specify the fill
colour using, for example, colour = "blue", fill = "red", shape = 21 to get
a circle with a blue border and red fill. In base R pch = 21, col = "blue", bg
= "red" would be used.
fig8.2b <- function() {
ggplot(loomweights, aes(x = Weight, y = Height)) +
geom_point(size = 3, colour = "blue", shape = 15) +
xlim(50,450) + ylim(50,130) +
xlab("Weight (g)") + ylab("Height (mm)") +
ggtitle("A ggplot2 scatterplot") +
theme(panel.grid.minor = element_blank()) +
theme(axis.text=element_text(size=12), axis.title=element_text(size=14)) +
theme(plot.title = element_text(size = 16, face="bold"))
}
fig8.2b()
Figure 8.5: Basic code for a ggplot2 scatterplot.
8.2.2
Adding lines to plots
Simple lines
From Figures 6.6 and 6.7 the division between ‘small’ and ‘large’ loomweights (the
antimode) occurs at about 220 g; the antimode for height is at about 80 mm.
Visually a line running from the coordinates (100, 85) to (350, 60) would appear
to represent a natural ‘break’ in the point scatter. A solid line will be added to the
plots in Figure 8.2 to show this division with dashed lines showing the antimodes.
Figure 8.6 shows the result.
Figure 8.6: Scatterplots for loomweight heights and weights with added reference lines
using (a) base R and (b) ggplot2. See the text for an explanation.
Once you have the base R scatterplot, the following lines of code can be added
to produce the lines seen in Figure 8.6a.
abline(v = 220, lty = 2, lwd = 2)
abline(h = 80, lty = 2, lwd = 2)
segments(100, 85, 350, 60, lwd = 2)
where, in the abline function, v means vertical and h means horizontal lines should
be drawn in the positions specified; lty is the type of line; and lwd controls line
width. In the segments function the first two numbers are the x and y coordinates
at the start of the line to be drawn and the other two numbers are the coordinates
of the end point.
The ggplot2 code is similarly structured. After the geom_point command in
the code of Figure 8.5 add the following.
geom_vline(xintercept = 220, linetype = "dashed") +
geom_hline(yintercept = 80, linetype = "dashed") +
geom_segment(x = 100, y = 85, xend = 350, yend = 60) +
Note that linetype (or lty in base R) can be specified by either name or
number. Figure 8.7 shows the options.
Figure 8.7: Available lines for plotting in R.
More complicated lines
The segments function can be used to fit a ‘theoretical’ line to a plot if this
is wanted. The abline function can also be used for this purpose; for example
abline(0,1) would add a straight line with a zero intercept and slope of 1 to a
plot. In ggplot2 geom_abline(intercept = 0, slope = 1) has the same effect.
Sometimes it may be useful to add a reference line to the plot that can be
thought of as a summary of the pattern in the data. This could be either a
straight or curved line and both are shown in Figure 8.8. The plots are described
as more complicated because they involve a modicum of statistics beyond the level
assumed in much of this text. In the code to follow the straight line is based
on linear regression where weight is predicted from height. The curved line is a
scatterplot smoother obtained by fitting a loess curve to the data.
If you just want a quick look at the plot then ggplot2 is simplest; just add
stat_smooth(method = lm) +
to the code of Figure 8.5 following the geom_point command. The code actually
used for Figure 8.8b was the following
stat_smooth(span = 0.6, se = F, colour = "black", linetype = "dashed") +
stat_smooth(method = lm, se = F, colour = "black") +
where se = F suppresses the default standard error bands that are produced while
colour = "black" overrides the default colouring. Without additional arguments
stat_smooth() fits a loess curve. Very roughly what this does is calculate a rather
complicated kind of average value for predicting height in the region of each data
point and ‘joins up the dots’ so produced. The size of the region about each point
over which the ‘average’ is computed is determined by the span, smaller values
producing more ‘wiggly’ curves. In the second line of code method = lm specifies
that a linear regression model is fitted. You will get different lines if the roles of
weight and height are reversed (i.e. weight is predicted from height).
Figure 8.8: Scatterplots for loomweight heights and weights with added linear regression
lines and loess smooths using (a) base R and (b) ggplot2. See the text for an explanation.
The corresponding base R code, which may be added to the code of Figure 8.3,
is
z1 <- lm(Height ~ Weight, data = loomweights)
lines(loomweights$Weight, predict(z1), lwd = 2)
z2 <- loess(Height ~ Weight, data = loomweights, span = 0.6)
lines(loomweights$Weight, predict(z2), lty = 2, lwd = 2)
where the first and third lines fit a linear regression model and loess curve using the
lm and loess functions. The predicted values from each model are obtained from
the predict function; the lines function adds fitted lines to the point scatter.
8.2.3
Points
Adding points to plots
Thus far a subset of eight outliers has been identified and omitted from plots.
These will now be added back into the data set to illustrate how points can be
added to a plot. The eight cases are in a data frame outliers with column
headings the same as for loomweights.
Adding the outliers to the base R plot requires that something like the following
be added to the code of Figure 8.3
points(outliers$Weight, outliers$Height, pch = 17, col = "blue", cex = 1.2)
legend("bottomright", c("No", "Yes"), col = c("red", "blue"),
pch = c(16, 17), title = "Outlier", bty = "n", cex = 1.3)
after changing the upper values in xlim and ylim to 800 and 140 to accommodate
the outliers. The result is shown in Figure 8.9a.
Figure 8.9: Scatterplots for loomweight heights and weights including outliers using (a)
base R and (b) ggplot2. See the text for an explanation.
Something similar is possible in ggplot2 by adding
geom_point(data = outliers, aes(x = Weight, y = Height),
size = 2, colour = "red", shape = 15) +
to the code of Figure 8.5. This automatically scales the axes but does not produce
a legend. An alternative is to create a data frame using all the data, adding a
third column Outlier with entries No and Yes. Call the data frame loom.all and
replace the first line of code in Figure 8.5 with
ggplot(loom.all, aes(x=Weight, y=Height, colour=Outlier, shape=Outlier)) +
to get Figure 8.9b, where the scale_colour_manual function has also been used
to change the default colours and scale_shape_manual has been used to change
the default symbol shapes.
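One way to build the combined data frame is with rbind; the sketch below is our own illustration, with small mock data frames standing in for the loomweights and outliers data frames described in the text:

```r
# Mock stand-ins for the loomweights and outliers data frames used in the text
loomweights <- data.frame(Weight = c(150, 220, 310), Height = c(75, 82, 95))
outliers    <- data.frame(Weight = c(610, 740),      Height = c(120, 135))

# Combine the two, adding an Outlier column distinguishing their origin
loom.all <- rbind(
  data.frame(loomweights, Outlier = "No"),
  data.frame(outliers,    Outlier = "Yes")
)
loom.all$Outlier <- factor(loom.all$Outlier)
```

The Outlier factor can then be mapped to the colour and shape aesthetics as described above.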
Multiple groups
The preceding example can be thought of as a way of displaying multiple groups in
a set of data. As a more complicated illustration the data set lwphased, introduced
in Section 7.3 is used. Remember that this consists of 118 loomweights, excluding
outliers, that could be assigned to one of four phases; in addition to Weight and
Height and four other variables a factor with four levels, Phase, is included.
We want to produce a plot of height against weight using different coloured
and shaped plotting symbols to distinguish between phases. These are shown in
Figure 8.10.
Figure 8.10: Scatterplots for phased loomweight heights and weights showing the phasing, using (a) base R and (b) ggplot2.
This is much easier in ggplot2 than base R. In base R the ‘messy’ part is
mapping Phase to colours and symbols. A web search reveals several approaches
to this, that given in the code below being among the simplest.
plot(lwphased$Weight, lwphased$Height,
pch = c(21,22,24,25)[as.numeric(lwphased$Phase)],
bg = c("red","blue","magenta","green2")[as.numeric(lwphased$Phase)], ...)
legend("topleft", c("I", "II", "III", "IV"),
pt.bg =c("red", "blue", "magenta", "green2"), pch = c(21,22,24,25),
title = "Phase", bty = "n", cex = 1.3)
What’s happening here is that the as.numeric function maps the four levels
of the Phase factor to the integers 1 to 4 which are then associated with the plot
symbols given by pch and colours given by bg (background). The opportunity
has been taken to illustrate the use of filled symbols with a border (21–25 in
Figure 8.4). The border colour is black by default and bg is the fill colour. If you
want to vary the border colours an additional argument, col, defined in the same
way as bg will do it. Presentational arguments are implicit (. . .) but the legend is
given in full to illustrate the use of pt.bg (rather than bg) to specify fill colours.
The result of all this is Figure 8.10a.
In ggplot2 all that’s needed is to specify Phase as colour and shape aesthetics
in the call to ggplot, thus
ggplot(lwphased, aes(x = Weight, y = Height)) +
geom_point(aes(shape = Phase, colour = Phase), size = 3) + ...
with presentational code as before and default output shown in Figure 8.10b. We
tend not to like the default colours and symbols; to get something similar to
Figure 8.10a use
geom_point(aes(shape = Phase, fill = Phase), size = 3) +
scale_shape_manual(values = c(21,22,24,25)) +
scale_fill_manual(values = c("red", "blue", "magenta", "green2")) + ...
which produces Figure 8.11a. Colour palettes were mentioned in Section 2.3.5
but have not yet been illustrated. Figure 8.11b provides an example using the
Dark2 ColorBrewer palette, for which the RColorBrewer package needs to be installed. Precede the ggplot command with library(RColorBrewer) and replace
the scale_fill_manual command with
scale_fill_brewer(palette = "Dark2") +
Figure 8.11: Scatterplots for phased loomweight heights and weights indicating the phasing, illustrating ggplot2 variations on Figure 8.10b. (a) using scale_fill_manual,
(b) using the Dark2 palette from the RColorBrewer package.
Figures 8.10 and 8.11 have been presented to illustrate some of the options
available; plots (a) in both figures involve a user-defined selection of colours; plots
(b) involve automatic selection. In practice, and particularly for publication, some
experimentation with colour and shape may be desirable. What looks best may
be partly dependent on the nature of the juxtaposition of points from different
groups.
As far as interpretation goes there is no obviously clear relationship between
weight, height and phase. This is a little misleading since we’ve already seen
from Figures 7.2 to 7.4 that the data for Phases II and IV show some evidence of
bimodality and there is a tendency for later weights to be heavier. As discussed
in Section 7.3, however, the overall pattern is quite complex.
8.2.4
Scatterplot matrices
Scatterplot matrices, also called pairs or draftsman plots, obtained using the pairs
function, provide a quick way of examining all possible bivariate scatterplots when
more than two variables are available. For illustration the 118 phased loomweights
are used; in addition to Weight and Height the maximum and minimum lengths
(mm) of the rectangular tops and bases of the loomweights are available (Tmax,
Tmin, Bmax, Bmin). The first few lines of the data frame lwphased are shown in
Table 7.1. Figure 8.12 is produced in base R using pairs(lwphased[, 1:6]).
110
15 25 35 45
20
40
60
250
400
60 80
110
100
Weight
15 25 35 45
60 80
Height
15 25 35 45
Tmax
50
60
Tmin
40
60
30
40
Bmax
20
Bmin
100
250
400
15 25 35 45
30
40
50
60
Figure 8.12: A scatterplot matrix for loomweight dimensions using base R.
It can be seen that weight and height are clearly linearly related; size as measured by these two variables also shows some relationship with the base dimensions
but not those of the top. The two base measurements are related, the striking linear ‘edge’ being associated with square bases; the two top measurements show a
similar pattern but are not related to the base measurements. The plots above the
diagonal are the same as those below the diagonal, except that axes are reversed,
and it’s possible to suppress one or the other if desired.
Some of the presentational arguments available in plot can be used here. Figure 8.13 uses code similar to that which follows Figure 8.10, namely
pairs(lwphased[, 1:6], pch = c(15,16,17,1)[as.numeric(lwphased$Phase)],
col = c("red", "green", "blue", "black")[as.numeric(lwphased$Phase)])
110
15 25 35 45
20
40
60
250
400
60 80
110
100
Weight
15 25 35 45
60 80
Height
15 25 35 45
Tmax
50
60
Tmin
40
60
30
40
Bmax
20
Bmin
100
250
400
15 25 35 45
30
40
50
60
Figure 8.13: An enhanced scatterplot matrix for loomweight dimensions using base R.
Colours and plotting symbols correspond to different phases.
It’s possible, but messy, to add legends and, because of the way the plot is
compressed, the result doesn’t necessarily look that satisfactory, so it is not attempted here.
Our view is that scatterplot matrices are useful tools but, unless the number of
variables is small, are perhaps best viewed in the privacy of one’s office/study
with a view to selecting the more interesting plots to subject to more detailed
scrutiny and publication. The full plot takes up valuable space and the detail can
be difficult to see.
It’s possible to vary the display in the upper- and lower-diagonal, to modify
variable labels, and to have graphs along the diagonal. This, for the most part,
requires some programming and will not be discussed in detail here. If it’s of
interest ?pairs provides code for some of the things that can be done as does
Chang (2013: 112–116). The scatterplotMatrix function from the car package
will do some of the work for you but is not illustrated.
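As a flavour of what can be done without much programming, panel functions can be passed directly to pairs. The sketch below is our own example using the built-in panel.smooth function and the built-in iris data as a stand-in for the loomweight data:

```r
# Scatterplot matrix with loess smooths added in the lower triangle and the
# upper triangle suppressed (a common space-saving choice);
# the built-in iris data stands in for the loomweight data frame here
pairs(iris[, 1:4], lower.panel = panel.smooth, upper.panel = NULL)
```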
There is a function for scatterplot matrices, ggpairs, in the GGally package
which we must confess to finding difficult to use. Figure 8.14 shows a fairly simple
example where 2D KDEs are plotted in the upper diagonal.
Figure 8.14: A ggpairs scatterplot for loomweight dimensions. The upper triangle
shows 2D KDEs for pairs of variables and the diagonal shows univariate KDEs for the
individual variables.
The code to produce this is given below.
library(GGally)
ggpairs(data=lwphased, columns=1:6,
upper = list(continuous = "density"),
lower = list(continuous = "points"))
Note that the syntax is more akin to that used in base R plots than the usual
ggplot2 plots. As it stands phases are not differentiated. It’s possible to rectify
this by adding aes(colour=Phase) as an argument, but the KDEs then become
difficult to read.
Chapter 9
Pie charts
Tables are preferable to graphics for many small data sets. A table is nearly always
better than a dumb pie chart; the only worse design than a pie chart is several of
them, for then the viewer is asked to compare quantities located in spatial disarray,
both within and between pies . . . Given their low data-density and failure to order
numbers along a visual dimension, pie charts should never be used.
(Tufte 1983: 178)
9.1
Introduction
Tufte’s opinion concerning pie charts is not extreme – it is, if anything, the orthodox view of statisticians who take an interest in graphical presentation. Unwin
(2015: 60), in a text devoted to graphical analysis in R, notes that pie charts are
‘not generally recommended’ and that pie charts ‘in 3D should most definitely
always be avoided’.
The base R function for constructing pie charts, pie, is fairly minimalist as
R functions go. It is accompanied by the warning
‘Pie charts are a very bad way of displaying information. The eye
is good at judging linear measures and bad at judging relative areas.
A bar chart or dot chart is a preferable way of displaying this type of
data.’
This statement is accompanied by others to the effect that, if you think you need a
pie chart, other methods of graphical presentation are better. Frequently they are
wholly unnecessary and can be replaced by a succinct summary of the information
conveyed by the numbers used to construct them, saving space in the process.
There is not a purpose designed function for creating pie charts in ggplot2,
though there are plenty of posts on the web that remedy this deficiency¹. Chang
(2012: 307–309) devotes just over one page to the topic, about half the space being
¹ e.g., http://mathematicalcoffee.blogspot.co.uk/2014/06/ggpie-pie-graphs-in-ggplot2.html (accessed 25/05/2016).
devoted to an illustrative base R pie chart. He comments that ‘the lowly pie chart
is the subject of frequent abuse from data visualization experts’ and that more
effective methods are usually available, but does comment that pie charts have
the virtue ‘that everyone knows how to read them’. This last statement might be
queried.
Given a set of counts for each of a number of categories, converted to percentages of their sum, a properly constructed pie chart is a circular diagram divided
into slices with the property that the area (or arc or angle) of a slice is proportional
to the percentage of the category it represents. A frequent objection to pie charts,
supported by psychological studies of visual perception, is that humans find it
harder to judge relative proportions from areas than from the linear encodings
used by alternatives such as barplots (Chapter 10).
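The comparison is easily made for oneself. The sketch below (our own toy counts, invented for illustration) shows the same figures as a pie chart and as the dot chart that R’s own documentation recommends in its place:

```r
# The same toy counts as a pie chart and as the dot chart that R's own
# documentation recommends instead (the counts are invented for illustration)
counts <- c(A = 40, B = 30, C = 20, D = 10)
pie(counts)                            # relative areas are hard to judge by eye
dotchart(counts)                       # positions on a common scale are easier
```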
If a pie chart is unannotated with numbers one gains a qualitative impression
of the relative abundance of different categories, often best summarised in a table
of the numbers themselves with a commentary. This applies particularly when,
as is often seen, a pie chart consists of very few categories (sometimes only one!).
Charts can be annotated to show the numbers or percentages involved but, once
again, direct commentary on these is preferable.
Another prevalent problem, which can apply to barplots as well, is that the
absolute numbers involved can be rather small. In these circumstances conversion
to percentages does not make much sense; to do so and then display the percentages
in a pie chart provides a meaningless picture that would be changed markedly with
the addition of a few extra observations.
Regrettably the pie chart, along with barplots, is one of the most common
graphics to be found in archaeological publications. The examples to follow illustrate all the problems noted above. While we concur with the negative sentiments
that have been widely expressed about pie charts, rather than simply noting them,
moaning, and moving on, it was thought to be useful to illustrate the problems
with real examples. Those to follow were located in a short period of browsing
through the books and journals on our bookshelves. There is nothing pathological
about them – they are the rule rather than the exception – and represent the tip of
a very large iceberg. Many other published applications would have served equally
well for illustration.
9.2 Examples

9.2.1 Pointless pie charts
Table 9.1 is adapted from Table 3 of Joy’s (2014) study of Iron Age and early
Roman cauldrons from Britain and Northern Ireland. The original table, of cauldron types (groups) by depositional context, was ordered alphabetically by context
for those with a known context and numerically by group. There is no ‘natural’
ordering to either category and a clearer picture of any patterns in the data can
be obtained by rearranging rows and columns according to the magnitude of the
marginal totals, as is done here.
Table 9.1: Counts of Iron Age and early Roman cauldrons from Britain and Northern
Ireland by depositional context (Source: Joy 2014, Table 3).

                 Cauldron groups
Context        III   IV    I    II   Total
Burial           0    0    0     1       1
Beach            0    1    0     0       1
Settlement       0    0    0     3       3
Lake/Loch        2    0    3     0       5
River            0    0    0     5       5
Bog/Moss         0    3    3     0       6
Hoard            0    0    1    18      19
Unknown          2    1    5     1       9
Total            4    5   12    28      49
What’s most obvious from the table is that cauldrons found in hoards are much
the most common, accounting for 45% of the cauldrons from known contexts,
‘Unknown’ being the second most common ‘context’. All but one of the hoard
cauldrons come from Group (Type) II, this group accounting for 67.5% of the
cauldrons from known contexts. The other obvious feature of the table is that
the totals for most contexts and groups are too small for anything other than
describing the composition of the sample available for study – there is no real
basis for generalizing from the table to a wider population.
Joy (2014: 334) states that Groups I, III and IV ‘are predominantly found in
watery contexts’; allowing for unknowns this applies to 13/21 or just over three-fifths of the cauldrons in these groups. There is little more that needs to be said
that adds to this tabular description, but Figure 6 in the paper presents a set of
pie charts to go with the table; Figure 9.1 reproduces this in essence. Joy states
that this, along with Table 3 in the paper, provides more detail about the ‘perhaps
surprising’ range of depositional contexts – it doesn’t.
There are some problems with Figure 6 in Joy (2014). The pie charts from
Groups I and II are correct; that labelled Group III repeats the plot for Group I;
the Group III plot is labelled as Group IV, with Group IV missing entirely.2 We
should emphasise that the critique to follow is presented because aspects of it apply
to a large number of pie charts that have appeared in the archaeological literature.
Footnote 2: The fact that no-one associated with publishing this paper in a leading international journal noticed this error cannot but underline the fact that multiple pie chart figures are frequently a waste of space. If a correction has been published we have not noticed it. No-one appears to look at them.

[Figure: four pie charts, one per cauldron group (I, II, III, IV), slices labelled by depositional context and annotated with percentages.]
Figure 9.1: Pie charts showing context by cauldron group (Source: Joy 2014).

The first point is that converting numbers to percentages with samples as small as
about 10 or less, and certainly for samples as small as four and five (Groups III,
IV), is largely meaningless and potentially misleading. Adding a single cauldron to
one of the two groups represented in the chart for Group III would have a marked
effect on its impact.
Even if the numbers sustain representation as a pie chart it seems pointless
to produce a chart where there are a very small number of categories to represent. It’s far more informative, and succinct, to state the relative proportions
(e.g., 50:50, 50:30:20, 2:15:33:50) and provide a commentary on these (usually
stating the obvious) rather than wasting space on a pie chart. The use of pie
charts for comparative purposes is also to be discouraged since there are better
ways of displaying the data. That the words ‘Figure 6’ and the figure itself could
be removed from Joy (2014) without any effect whatsoever on the subsequent discussion – given the numerical information in the table – is telling evidence of the
superfluity of the pie charts often seen in other studies.
9.2.2 Dishonest pie charts
Figure 9.2a is a fairly close reproduction of the three-dimensional ‘pie chart’ of
Figure 1 in Egri (2007: 44). It represents the numbers of amphorae for wine
from the Lower Danube region, in the Roman provinces of Pannonia and Upper
Moesia, from Augustus to Hadrian; the numbers around the circle are those of
each amphora type available for study.
[Figure: two pie charts of the amphora counts (10, 12, 28, 111, 201) by type (Dressel 6A, Lamboglia 2, Selov, Rhodian, Dressel 2-4): (a) pseudo-three-dimensional, (b) two-dimensional.]
Figure 9.2: Pie charts showing the distribution of amphorae reflecting wine consumption
in the Roman provinces of Pannonia and Upper Moesia, from Augustus to Hadrian, (a)
‘three-dimensional’ after Figure 1 of Egri (2007), (b) two-dimensional and much better
if you have to have one.
The pie chart is pointless and dishonest. It’s pointless because, as Egri (2007:
44) notes, ‘the amphorae of Dressel 2-4 . . . and Rhodian dominate’. This is the
only information extracted from the chart, used to justify a subsequent focus on
these types. Stating that 86% of the amphorae are of these types does the job
more efficiently and succinctly. If you must have a pie chart then Figure 9.2b is
better but not needed since what you want it to tell you is simply expressed in
words.
Why is it dishonest? It’s an example of what are sometimes called presentational graphics, thought to be suitable for business presentations, advertising
etc. A pie chart is a two-dimensional construct, the areas of the slices showing
the relative proportions of the different categories. This interpretation is lost in
the elliptical representation imposed by the use of a non-existent third dimension.
Thus, three-dimensional pie charts are a distorted representation of the truth with
added fictional features and should be banned from academic publication. Excel
is to blame for a lot of examples, or, more properly, Excel users.3
Footnote 3: This isn’t an isolated example; the journal Archeologia e Calcolatori is a fruitful source
9.2.3 Could do better
Table 9.2 shows the percentages of cattle, pig and sheep/goat remains from seven
phases identified in excavations of the Wroxeter baths basilica (Barker et al. 1997).
The data have been extracted from a series of pie charts given as Figure 2 in
Hammon (2011). He discusses the charts with reference to percentages shown on
them, with a particular focus on pig proportions.
Table 9.2: Percentages of cattle, pig and sheep/goat (Sh/Gt) bones (based on
NME) from seven phases of the Wroxeter baths basilica

Phase   Cattle %   Pig %   Sh/Gt %      n
T-V           43      19        38    236
W             59      16        25   2631
X             45      20        35     94
X-Y           67      19        14   1422
Y             56      26        18   2639
Y-Z           58      24        18   1849
Z             59      25        19    755
A multiple pie chart representation of these data, which essentially reproduces
that of Hammon, in colour, is shown in Figure 9.3.
[Figure: seven pie charts, one per phase (T-V, W, X, X-Y, Y, Y-Z, Z), slices for Cattle, Pig and Sheep annotated with the percentages of Table 9.2.]
Figure 9.3: Pie charts showing the cattle, pig and sheep/goat proportions from seven
phases from the Wroxeter baths basilica (after Hammon, 2011, Figure 2).
(e.g., Volumes 16: 136, 144; 21: 115–121; 22: 187); Chester-Kadwell (2009: 70); Christensen (2006: 1638) (elliptical two-dimensional pie charts), De La Fuente (2011: 244) and Cocis (2015: 61) are particularly horrible examples elsewhere in the recent journal literature that defy the imagination.
Figure 9.4 shows ‘descendants’ of Roman sheep and cattle, believed to resemble
them (Cool 2006: 86–87); similar data to that of Hammon (2011) are analysed in
Sections 10.3 and 11.2.2. Archaeologists, of course, work with excavated bones but
pictures of live animals are more exciting.
Figure 9.4: Soay sheep and Dexter cattle. Similar data to that of Hammon (2011) are
used in Sections 10.3 and 11.2.2. (Sources: Photographs by Evelyn Simak (top) and
Annie Kavanagh (bottom), distributed under a CC-BY 2.0 license).
From Table 9.2 the obvious features are that (a) cattle predominate throughout,
(b) there is a sharp decline in the proportion of sheep/goat after Phase X, at
which point the relative proportion of pig to sheep/goat also increases markedly,
(c) Phase X-Y is a little anomalous, (d) the last three phases have very similar
profiles. Hammon (2011: 284) says most of this and, given a table, the need for
the pie charts (which occupy half a page) is questionable.
Using pie charts the dominance of cattle is fairly obvious, as is the similarity
of profiles for the last three phases. The relationship between pig and sheep/goat
proportions, a focus of the table, is more difficult to judge other than qualitatively.
To make quantitative statements about the relationship, as is done in the paper,
resort to the numbers that annotate the charts is needed; this is more naturally
done with the data expressed in tabular form.
If you do want to make quantitative assessments from graphical analysis, a
(clustered) barplot does the job much better (Figure 9.5); barplots involve judgments
based on length rather than area which, as Unwin (2015: 53) suggests, is ‘easier’.
These are discussed in detail in Chapter 10 but are common in the archaeological
literature.
[Figure: clustered barplot; y-axis ‘Percentage’ (0–80), x-axis ‘Phase’ (T-V, W, X, X-Y, Y, Y-Z, Z), bars keyed Cattle, Pig, Sheep.]
Figure 9.5: A clustered barplot showing the cattle, pig and sheep/goat proportions from
seven phases from the Wroxeter baths basilica.
The barplot makes it much easier to effect comparisons both within and between phases. The actual proportions are represented much more directly than
in the pie chart, the ‘linear’ arrangement of the plot in a single row also aiding
comparison. Apart from the more obvious features previously mentioned, the decline in the proportion of sheep/goat after Phase X and the changes in the relative
proportion of pigs to sheep/goat are much more readily seen than in the pie chart.
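A clustered barplot of this sort can be sketched in base R along the following lines; the matrix below simply re-enters the percentages of Table 9.2, and beside = TRUE produces the clustered rather than stacked form:

```r
# Percentages from Table 9.2, species by phase.
wrox <- matrix(c(43, 59, 45, 67, 56, 58, 59,   # cattle
                 19, 16, 20, 19, 26, 24, 25,   # pig
                 38, 25, 35, 14, 18, 18, 19),  # sheep/goat
               nrow = 3, byrow = TRUE,
               dimnames = list(c("Cattle", "Pig", "Sheep"),
                               c("T-V", "W", "X", "X-Y", "Y", "Y-Z", "Z")))
barplot(wrox, beside = TRUE, legend.text = TRUE,
        xlab = "Phase", ylab = "Percentage")
```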
A large number of barplots (22) similar to those of Figure 9.5, contrasting
up to 12 assemblages, are used in Hesse (2011). Some of her plots draw on data
from a paper by King (1999a), who makes extensive use (there and elsewhere) of yet
another approach to displaying the data. This is the ternary diagram (or triangular
graph) designed specifically for the purpose of displaying three-category data sets
expressed as percentages. They are particularly useful for effecting comparisons
when a large number of assemblages are involved. A brief illustration, using the
data of Table 9.2, is provided here; a fuller account including R code is provided
in Chapter 11.
Figure 9.6 was obtained using the ggtern package, one of several based on the
ggplot2 ‘philosophy’. The position of each point in the diagram is determined by
the three percentages; because they sum to 100%, any two of the values automatically determine the third. Technically this means the data are two-dimensional
and can be plotted in two dimensions in what is called a simplex.
[Figure: ternary diagram with vertices Cattle, Pig and Sheep, axes graduated in tens; points keyed by Phase (T-V, W, X, X-Y, Y, Y-Z, Z).]
Figure 9.6: A ternary diagram showing the cattle, pig and sheep/goat proportions from
seven phases from the Wroxeter baths basilica.
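A diagram of this kind can be sketched as follows (a fuller account with code is given in Chapter 11); the data frame name wroxeter and its column names are assumptions about how the data of Table 9.2 have been read in:

```r
# Sketch only: assumes Table 9.2 is available as a data frame 'wroxeter'
# with columns Cattle, Pig, Sheep (percentages) and Phase.
library(ggtern)  # ternary diagrams built on the ggplot2 'philosophy'
ggtern(wroxeter, aes(x = Cattle, y = Pig, z = Sheep, colour = Phase)) +
  geom_point(size = 3)
```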
For the moment terminology and detail need not concern us. What’s important
is that each point on the plot represents the data for the corresponding category
exactly; it embodies precisely the same information as a pie chart. The distance
between any two points on the plot can be read as the distance between the two
cases involved, something not readily done with either a pie chart or barplot.
Reading a ternary diagram can take a little practice to get used to, but the close
similarity between Phases Y, Y-Z and Z, previously noted, is immediately apparent, attested by the adjacency of the symbols representing them on the plot. The
close similarity of Phases T-V and X is also evident with the comparatively lower
proportion of cattle an obvious feature. The other two phases stand out as slightly
anomalous, Phase X-Y most obviously because of the comparatively higher proportion of cattle, Phase W because of the relatively high proportion of cattle in
conjunction with a ratio of sheep/goat to pigs greater than 1. This contrasts with
other phases having a similar proportion of cattle but with a sheep/goat to pig
ratio of less than 1.
Ternary diagrams are only useful with precisely three categories in the data to
represent; the technique of correspondence analysis is available for larger tables
(Chapter 12). Clustered barplots are available with more than three categories
but can become unwieldy or impractical with large tables and, once again, correspondence analysis is available. The main message of this example is that some
method of presentation (tabular, barplots, correspondence analysis) is almost invariably a better way of extracting useful information from the data in situations
where multiple pie charts have been used.
Robeerst’s (2005: 83) application is identical to that of Hammon (2011);
Chester-Kadwell (2009: 70) presents an example using multiple pie charts showing
the percentage of metals in finds from different contexts that, in addition to using
three dimensions and gratuitous shading (‘chart-junk’, a term frequently employed in Tufte 1983), is wholly unnecessary. The charts show that metal-detected
finds are largely made of copper alloy, whereas finds from other contexts are made
from iron or copper alloy in roughly equal proportions; this can be stated in a
single sentence without the need for the pie charts (a table would make the point
equally well but is not provided). The pottery report of Leach (1993) in Woodward and Leach (1993) contains several multiple pie charts, linked to positions on
maps to illustrate the spatial distribution of different forms and fabrics. They are
borderline unreadable because of the number of categories involved and the small
numbers in many categories; where differences are obvious they could have been
more concisely expressed. One could go on!
9.2.4 From bad to worse?
The following story is anecdotal, but regrettably true. Correspondence analysis
(Chapter 12) has been mentioned above and, briefly, in Section 7.4. Today it is a
widely-used method for the graphical display of large, possibly complex, tables of
categorical data. It was not always so. The method only really began to attract
archaeological attention in British publications from the early 1990s.
We first collaborated on a problem using the method for a conference presentation in 1993, published (eventually) as Cool and Baxter (1995). A fuller, more
archaeologically oriented, account was published in Cool et al. (1995). The eminent archaeologist with editorial responsibility for the series in which this appeared
asked, rather plaintively, if the analysis might not be done more readily with pie
charts and had to be told, politely but firmly ‘No!’.
At the time the excuse, perhaps, existed that correspondence analysis would
have been unfamiliar in archaeological publications in the Anglo-Saxon speaking
world. This is no longer the case, and one would have hoped it was now possible
to submit papers for publication without the choice of correspondence analysis as
a means of presenting the data being queried. Alas! It isn’t. On several subsequent occasions referees or editors have asked if we could replace a correspondence
analysis with pie charts or barplots, on the grounds that these would be more
‘comprehensible’. Apart from being wrong this might be read as sad testimony
to the fact that there are eminent and literate archaeologists – conversant with
current theoretical intricacies and/or the area(s) they specialise in – who regard
the pie chart as the acme of graphical aspiration.
For illustration we draw on the most recent paper to suffer this fate (Cool
2016), where some effort was needed to resist the suggestion. The data in question
are given in Table 9.3. The table shows counts of coloured glass ‘counters’ by
phase recovered in the excavation of Insula VI.1, Pompeii, Italy, undertaken by
the Anglo-American Pompeii Project between 1995 and 2006.
Table 9.3: Colours of glass ‘counters’ by phase, from excavations at Insula VI.1, Pompeii.

                                         Phase
Colour                    Code      I    II   III    IV     V
Multi-Coloured            Multi     1     6    11     5     4
Opaque-White              OpW       1     1     4     3     3
Black                     Bl        1     7    15    10     8
Dark_Yellow_Brown         DYB       0     5     5     5     1
Deep_Blue                 DpB       1    14    11    11     5
Peacock                   Pea       1     4     6     3     3
Mid/Light Blue            MiB       0     4     1     4     1
Yellow/Brown              YB        4    17    18    15     7
Yellow/Green              YG        4     8     5     3     6
Pale_Green                PaG       2    10    13    12     7
Green-tinged colourless   GTC       0     5     5     2     2
Colourless                Cls       2    17    14     6     5
Blue/Green                BG        4    14    24    31    25
n                                  19   112   132   110    77
More properly the ‘counters’ are plano-convex discs of glass found throughout
the ancient world. They are normally called ‘counters’ in English, ‘pedine’ in Italian, ‘pion’ in French and ‘Spielsteine’ in German, with the connotation that they
were used as playing pieces in the various board games known to have been played
by people in the Roman world. The thrust of Cool (2016) is that this interpretation may often be wrong, many ‘counters’ being more likely to be important
elements of interior decoration. The arguments for this are not relevant here, but
will be discussed in connection with the use of barplots in Section 10.4.
Table 9.3 is a subset of the full 20 × 7 data set based on 20 standardised colours
and seven phases. Owing to small sample sizes seven colours have been omitted,
one phase has been eliminated, and two merged. As given here Phase I is the first
century BC; Phases II and III cover, roughly, the first and second thirds of the first
century AD; Phase IV runs from AD 62 to AD 79, between the earthquake that
ruined parts of Pompeii and the eruption of Vesuvius which destroyed it; Phase V
is the eruption level. Given that 13 categories of glass colour are used, the sample size
of 19 for Phase I is rather small but retained as it leads to interpretable results in
the correspondence analysis to be reported in Section 12.5.
In what might be thought to be a spirit of irony the referee’s suggestion of
showing the data in the form of pie charts has been used here for illustration. The
outcome is shown in Figure 9.7.
[Figure: five pie charts, one per phase (I–V), slices for the 13 colour categories of Table 9.3, each annotated with its percentage; slice colours chosen to resemble the glass.]
Figure 9.7: Pie charts for counter colours by phase from Pompeii, Insula VI.1.
Other than ‘multi-coloured’ some effort has gone into making the colours of the
slices correspond roughly to those of the glass, hence the rather muted appearance.
As it happens, differentiation between slices is not too bad; more generally, particularly with more categories to represent or if publishing in black and white, as
was necessary for Cool (2016), such differentiation can be problematic. We have
seen pie charts using as many as 30 categories and, once annotation is added, the
result can look both awful and incomprehensible.
Shennan (1997: 23–24) – along with Drennan (2009) one of the textbooks most
widely used for introducing quantitative methods to archaeologists – rather surprisingly, in our view, states that the pie chart ‘is a very helpful mode of data
presentation when the aim is to illustrate relative proportions of unordered categories, especially when making comparisons’. We obviously disagree entirely. More
pertinently, he does make the point that ‘it can be confusing if there are numerous
categories, or categories with zero or very small entries’.
If a pie chart is used at all it ought to be for a purpose. In many examples we
have seen where some sort of purpose is evident, it is often to emphasise
the dominance or otherwise of a subset, often small, of the categories involved.
This quite often takes the form of a comment on the percentages involved, which
are frequently included on the chart. This essentially renders the chart redundant
and is better done with reference to a table of the data. The same is true when pie
charts are used for comparative purposes; other graphical methods are preferable
but, if patterns are present to comment on, a judiciously constructed table will
reveal the patterns if the table is not too large.
As far as Table 9.3 is concerned it was constructed to investigate whether there
was any trend in the way different colours were used over time. The correspondence
analysis to be reported in Section 12.5 shows clear and interpretable temporal
variation. Figure 9.7 may or may not look pleasing, or at least not offensive, to
some eyes. It is wholly useless for the purpose for which the data were collected;
that such a representation can be urged on one by journal editors and referees,
in preference to methods that are much more effective, is a cause for regret and
concern.
Chapter 10
Barplots

10.1 Introduction
Barplots (or bar charts) are one of the commonest kinds of graphic to be found
in the archaeological literature. Histograms (Chapter 4) are a specialised type of
barplot. In this form the counts of continuous data within contiguous intervals
are represented by bars whose areas are proportional to the frequencies. That the
data are continuous is indicated by the contiguity of the bars.
Barplots are also commonly used for representing categorical data where each
case in a sample is assigned to a category (e.g., red, green, blue). For a single categorical variable counts within each category are obtained, and the counts (possibly
after conversion to percentages) are displayed in a barplot where the heights of the
bars are proportional to the frequency/count/percentage. This is an instance of
nominal data. The nature of the data – in contrast to the histogram – is indicated
by gaps between the bars. Provided they are the same, both the width of the bars
and the gap between them are irrelevant.
For the kind of data just discussed the order of categories as displayed on
the barplot is arbitrary. Choosing an order that reflects the relative proportion
of different categories (e.g., most to least common) often aids interpretation. If
the categories have a natural order (e.g., small, medium, large) then the data are
ordinal and this should be reflected in the order in which bars are displayed.
In the case just described categories are ordered but not associated with a
numerical value, so the distance between bars on a plot carries no information.
If categories can be equated with numerical values, usually integers (i.e. 0, 1, 2
. . .), the data will be referred to as discrete; that is, values between 0 and 1 etc.
can’t occur. An example might be the number of distinct decorative motifs on an
artefact. The data are ordinal and the distances between bars on a barplot are
meaningful.
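The discrete case can be sketched as follows; the motif counts are invented, purely for illustration, and retaining empty categories is what keeps the numerical spacing of the values meaningful:

```r
# Invented illustrative data: number of distinct decorative motifs
# on each of 20 artefacts (discrete, ordinal values).
motifs <- c(0, 1, 1, 2, 0, 3, 1, 2, 2, 1, 0, 1, 4, 2, 1, 0, 2, 1, 3, 1)
counts <- table(factor(motifs, levels = 0:4))  # keep empty categories
barplot(counts, xlab = "Number of motifs", ylab = "Frequency")
```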
There is a ‘grey area’ here where data are continuous, but rounded to the
nearest integer of some unit. If the range of the data is reasonably small it is
sometimes convenient to treat the data as discrete. An example is provided in
Section 10.4.1 where diameters of glass counters from Pompeii and elsewhere were
measured to the nearest millimetre. Another situation where barplots are used is
where data are categorised in some way (e.g., Context A, B, . . .) and then within
each category values of a continuous variable are contrasted (e.g., total weight of
pot) and the values represented by bars.
So barplots are widely used for a variety of purposes. We’d also argue they are overused,
misused and abused; problematic uses are illustrated in Sections 10.4.2 and 10.4.3.
In summary the use of barplots can be sullied for the same reasons we think pie
charts are largely unnecessary. You don’t need a barplot if you only have two
categories to deal with; if you have a large number of categories but counts for
most categories are generally low just say so; pseudo three-dimensional barplots
should be banned from academic publications. Our preference, when dealing with
a small number of categories, is to present the data in a judiciously structured
table with added commentary. This is, perhaps, a matter of taste. Unwin (2015:
58) presents a barplot using just three categories and comments that a ‘table would
show [the pattern] just as well in principle but a graphic is more convincing’, so
we won’t be too dogmatic.
Another point to be noted, that we will not dwell on too much, is that there are
plenty of examples in the literature of what should be histograms masquerading
as barplots for discrete/categorical data. This is irritating rather than fatal since,
other than misrepresenting the nature of the data, interpretation is not distorted
too much. At least some of this misuse can be attributed to a dependence on
Excel, or similar packages, which – for a long time – was unable to provide a
proper histogram. Even now, as Unwin (2015: 21) notes, ‘histograms are difficult
to draw in Excel’.
The next two sections discuss, in detail, how to produce barplots in base R
and ggplot2. Doing so is not always as straightforward as one might wish, so the
attractions of menu-driven software such as Excel are obvious. The archaeological
literature is disfigured with such plots. The real problem is not so much Excel itself
– you can produce perfectly good plots and we have published such – as the fact
that users do not exercise their inalienable right (and capacity) to customise plots,
opting instead for default or ‘jazzy looking’ presentations that can look awful.
Taking the trouble to customise an Excel graph so that it is truly worthy of
publication can be as time-consuming as doing the same in R. Once the mechanics
have been dealt with, a variety of applications are provided in Section 10.4.
10.2 One-way data
The simplest case arises when a single set of numbers is displayed. For illustration
Table 10.1, adapted from Table 3.1 in Cool (2006), will be used. The adaptation
is to facilitate entry of the data into R; the column ‘Amphora’ is the percentage of
amphora in the total pottery assemblage, measured by weight.
The first unlabelled column is the ‘Site’; when read into R in this form the sites
Table 10.1: Amphora as a proportion of total pottery assemblages for various types of
Romano-British site. Source: Table 3.1 in Cool (2006).

                Type       Amphora
Bainesse        Military      27.8
Leicester       Urban         18.1
Lincoln         Urban         11.7
Alchester       Suburban      11.1
Kingscote       Rural          8.2
Ivy_Chimneys    Rural          5.4
Asthall         Roadside       4.9
Chignal         Villa          2.3
[Figure: two barplots of the amphora percentages: (a) the default, bare bars on an unlabelled axis; (b) enhanced, y-axis ‘Percentage’ (0–25), bars coloured by site type with a legend (Military, Urban, Suburban, Rural, Roadside, Villa) and angled site labels.]
are treated as the row names. Sites have been ordered by amphora percentage and
it is immediately apparent that the order is a very good indication of what sort of
settlement the assemblage is coming from (Cool 2006: 19). No further elaboration
is needed and a barplot is superfluous though they are used here in various forms
for illustrative purposes.
Read the data as given into R (i.e. with a blank header for the first column),
calling it amphora; extract the percentages as Amphora <- amphora$Amphora and
create a site variable using Site <- row.names(amphora). Figure 10.1a, which is
not very useful, is obtained in base R using barplot(Amphora). The presentation
in Figure 10.1b is more informative.
Figure 10.1: Default (a) and enhanced (b) barplots for amphora presence on Romano-British
sites using the data of Table 10.1.
Producing Figure 10.1b is not as straightforward as one might wish. The
sixth element of the row names (Site) is renamed from Ivy_Chimneys – which
was needed to allow the data to be read into R – to Ivy Chimneys; this is purely
for presentational purposes. Colouring the bars makes sense here; the particular
choice is illustrative only and different shades of grey will do as well; use solid
colours rather than differently angled lines, dots etc. which should be avoided if
possible – apart from being displeasing to the eye, ‘op-art’ effects can be created
that distract from any message the plot might have to offer. The way the legend
is created also differs from other plot functions in that it is created within the
call to the barplot function rather than separately from it. The legend.text
argument indicates that both a legend is needed and the text to appear within it;
args.legend controls the appearance in a list whose arguments are the same as
those of the legend function used with other plots.
Labels for the bars can be produced using the names.arg argument but results
are unsatisfactory if there are lots of bars and/or labels are lengthy. One solution
is to reduce the size of the print using the cex.names argument. This can be
unsatisfactory since labels may become too small to read easily. Vertical labels
can be produced with the argument las = 2 but the result can look ugly. Angled
labels are nicer and the last line in the code to follow shows how this can be done.
In the following code bar <- barplot produces the plot and saves the coordinates
of the x-axis in bar; the axis function creates an x-axis with the plotting positions
for the labels; the text function adds the labels.
This has been written as a function, fig10.1b. Type fig10.1b(), once it’s
working, to get Figure 10.1b.
fig10.1b <- function() {
  row.names(amphora)[6] <- "Ivy Chimneys"
  Amphora <- amphora$Amphora
  Site <- row.names(amphora)
  Colour <- c("red", "skyblue", "skyblue", "orange", "yellowgreen",
              "yellowgreen", "grey70", "pink")
  bar <- barplot(Amphora, ylab = "Percentage", col = Colour,
                 legend.text = c("Military", "Urban", "Suburban", "Rural",
                                 "Roadside", "Villa"),
                 args.legend = list(fill = c("red", "skyblue", "orange",
                                             "yellowgreen", "grey70", "pink")))
  axis(1, at = bar, labels = FALSE, tick = FALSE)
  text(x = bar, y = -1.25, Site, xpd = TRUE, srt = 45, pos = 2)
}
fig10.1b()
In ggplot2 the comparable function is designed for two-way tables of data. It’s
possible to persuade it to deal with one-way data but is not illustrated here.
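For the curious, one such possibility is sketched below (an editorial sketch rather than the book’s approach, assuming the amphora data frame read in earlier; geom_col maps bar heights directly to the values):

```r
# Sketch only: assumes 'amphora' has been read in as above, with sites
# as row names and an 'Amphora' percentage column.
library(ggplot2)
df <- data.frame(Site = row.names(amphora), Amphora = amphora$Amphora)
df$Site <- factor(df$Site, levels = df$Site)  # preserve the table's ordering
ggplot(df, aes(x = Site, y = Amphora)) +
  geom_col() +
  ylab("Percentage") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```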
10.3 Two-way data

10.3.1 Preliminary considerations
Even if barplots for one-way data are judged necessary they can be fairly boring.
Things get much more exciting when dealing with two-way tables of counts. This
is partly because several decisions need to be made about the objective of the plot
and presentational details, even if these are often made more-or-less unconsciously.
For illustration a set of data from Table 9.1 of Cool (2006) is used, reproduced in
part as Table 10.2.
Table 10.2: Later first- to mid-second century animal bone assemblages from Romano-British sites, after Table 9.1 in Cool (2006).

Site         Type    Cattle (%)   Sheep (%)   Pig (%)       n
Castleford   Fort            64          26        10   14155
Ribchester   Fort            63          25        11    2473
Castleford   Vicus           59          28        13    3648
Wroxeter     Urban           53          29        18    1548
Leicester    Urban           44          39        17    2039
Wilcote      Rural           38          52         9    3116
The table shows the percentages of cattle, sheep and pig bones found at
Romano-British sites of different types. Sites are ordered by the proportion of
cattle. This illustrates a more general pattern, that proportions of cattle and pig
bones tend to be higher on more ‘Romanized’ sites – military establishments, major towns and villas. More specifically, Cool (2006: 81) observes that the ‘rural
site of Wilcote stands out as the only one where sheep are dominant. The urban
sites, though interestingly not the military ones, stand apart with their higher
proportion of pigs’. She did not feel the need to embellish the text with a barplot
that would have told the same story.
Note that Table 10.2 shows the numbers on which the percentages are based.
Such information is sadly lacking in some archaeological publications that exploit
barplots, as indeed, sometimes, is any quantitative information other than that to
be gleaned from the plot itself. En passant it can be noted that Cool additionally
provided information on the different methods of quantification used, going on to
argue that the percentages did not accurately reflect meat consumption because of
different animal sizes and differential amounts of waste generated in the butchering
process. Crudely put, a pig produces more pork than a sheep does lamb, and more
bits of it get eaten.
That’s by-the-by. One decision has already been made, standard for these kinds
of data, and that’s to effect comparisons between assemblages in terms of relative
rather than absolute quantities. In some circumstances a comparison of absolute
quantities may be needed, which is why having access to the original numbers may
be useful.
If you do have access to the raw data, and even if relative proportions are of
interest, there is a choice between using row proportions or column proportions.
In the present example interest lies in comparing assemblages with respect to the
proportions of animal bones from different species. In principle one could also
compare species with respect to their distribution across sites; it’s not of great
interest here because it would largely reflect relative sample sizes, but in other
situations a choice needs to be made and it can be of interest to look at the data
in both possible ways.
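Both views can be obtained directly in base R with prop.table, where the margin argument selects rows or columns; a minimal sketch with made-up counts:

```r
# Hypothetical site-by-species counts (not the real data)
tab <- matrix(c(640, 260, 100,
                380, 520,  90),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("SiteA", "SiteB"),
                              c("Cattle", "Sheep", "Pig")))

prop.table(tab, margin = 1)  # row proportions: each site sums to 1
prop.table(tab, margin = 2)  # column proportions: each species sums to 1
```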
That’s two decisions to make. The third one is what kind of barplot to use.
The choice is between a clustered or stacked barplot. The rest of this section
illustrates how these may be constructed in base R and ggplot2.
10.3.2 Barplots for two-way data in base R
The first thing to realise is that the ‘natural’ way to prepare a table for publication,
as in Table 10.2, isn’t necessarily the best way to prepare it for analysis in R. The
approach adopted here is to prepare the data before entry into R, tailored to the
needs of the function(s) to be used for analysis.
If the barplot function in base R is to be used, recast the data in the form of
Table 10.3. The species in the first column, when the table is read into R, will be
treated as row names, as a column heading is lacking. Here the Castleford sites
are distinguished by ‘F’ for Fort and ‘V’ for Vicus. ‘Default’ barcharts are shown
for the data in Figure 10.2 following the table.
Table 10.3: The data of Table 10.2 recast for use with the barplot function in base R.

        Castleford-F  Ribchester  Castleford-V  Wroxeter  Leicester  Wilcote
Cattle       64           63           59           53        44        38
Sheep        26           25           28           29        39        52
Pig          10           11           13           18        17         9
Table 10.3 is constructed such that the columns correspond to the categories
to be compared, with respect to the categories designated by the row names.
Figure 10.2 shows ‘default’ stacked and clustered barplots for the data, which are
imported into R as a data frame faunalR. A minimalist approach, to produce a
stacked barplot, is to try using barplot(faunalR), which produces an error.
Error in barplot.default(faunalR) :
  ‘height’ must be a vector or a matrix
Effectively this is telling you that barplot won’t work with a data frame,
even if the contents are entirely numeric. Use tab <- as.matrix(faunalR) to
convert the (numeric) data frame to a matrix, then barplot(tab) will produce
Figure 10.2a. To get Figure 10.2b add the argument beside = T.
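The conversion step can be sketched in self-contained form (here the Table 10.3 values are typed in directly; in the text faunalR is imported from a file):

```r
# Type in the Table 10.3 values; check.names is left at its default, so the
# hyphens in the site names become dots (as in the Figure 10.2 labels)
faunalR <- data.frame("Castleford-F" = c(64, 26, 10),
                      "Ribchester"   = c(63, 25, 11),
                      "Castleford-V" = c(59, 28, 13),
                      "Wroxeter"     = c(53, 29, 18),
                      "Leicester"    = c(44, 39, 17),
                      "Wilcote"      = c(38, 52, 9),
                      row.names = c("Cattle", "Sheep", "Pig"))

tab <- as.matrix(faunalR)    # barplot() needs a matrix, not a data frame
barplot(tab)                 # stacked, as in Figure 10.2a
barplot(tab, beside = TRUE)  # clustered, as in Figure 10.2b
```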
Figure 10.2: Default barplots, (a) stacked, (b) clustered, for species presence on Romano-British sites, using the data of Table 10.3.
The figures are almost serviceable but not all the bar labels are printed and
a legend is needed somewhere. The first issue can be dealt with by reducing the
character size of the labels using cex.names, but angled labels are more satisfactory. It’s easy enough to add a legend using legend.text but some work is
needed to get it to display in a satisfactory fashion. Modified barplots are shown
in Figure 10.3.
Figure 10.3: Enhanced barplots, (a) stacked, (b) clustered, for species presence on Romano-British sites using the data of Table 10.3 and base R.
We shall give the code for the stacked barplot, embedded within a function,
then discuss how it needs to be modified to get the clustered barplot. Part of the
aim here is to show how to get nicely angled labels, which is not straightforward
for clustered barplots. The data of Table 10.3 are held in a file called faunalR.
fig10.3a <- function() {
  tab1 <- as.matrix(faunalR)
  bar <- barplot(tab1, beside = F, ylab = "Percentage",
                 main = "Stacked barplot - base R", yaxt = "n",
                 legend.text = T, names.arg = rep("", 6), ylim = c(0, 120),
                 args.legend = list(x = "top", horiz = TRUE,
                                    title = "Species", cex = 1.3, bty = "n"),
                 cex.names = 0.8, cex.lab = 1.3, cex.main = 1.4)
  axis(2, at = seq(0, 100, 20))
  text(x = bar, y = -2, names(faunalR), xpd = TRUE, srt = 45, pos = 2)
}
fig10.3a()
We have already met some of this kind of code in connection with the simple
barplot of Figure 10.1a but will repeat some of the detail here.
• ylim = c(0,120) expands the y-axis to accommodate the legend but produces a tick mark at 120, which exceeds the maximum range of the data and we want to get rid of it. To do so yaxt = "n" suppresses the axis and the
axis command replaces it with one having a range (0,100) with tick marks
at intervals of 20.
• names.arg = rep("",6) replaces the default labels with blanks; 6 is the
number of labelled columns in the data table. This is needed because otherwise the default labels will overlap the angled labels we are going to produce.
• bar, as well as producing the plot, stores the plotting positions for the 6
labels later used in text which adds angled labels.
• In text the argument y = -2 determines the placement of the labels below
the barplot; you may wish to experiment with this to see what pleases most.
For the clustered barplot call the function fig10.3b and make the following
changes to the function for the stacked barplot.
• Use beside = T and delete yaxt = "n" and the axis command (and obviously change the main title!).
• Replace the second element of ylim = c(0,120) with a smaller number,
large enough to accommodate the legend without overlapping the bars. Here
80 works, but you may need to experiment a bit.
• In text replace x=bar with x=bar[2,]. It is not obvious why something like
this is needed, so we explain below.
Assigning the result of barplot to bar captures the otherwise invisible plotting
positions of the bars, used here to place the labels. For simple and stacked barplots this is not a problem,
since the number of bars and plotting positions are the same. In a clustered
barplot there are R × C bars where R is the number of rows and C the number of
columns, so 3 × 6 = 18 for our example.
If you just use x=bar with a clustered barplot you will thus get 18 labels, three
for each of the six clusters of bars; this is unsatisfactory since we only want one
label for each cluster. Discussion forums on the web, where the question has been
asked more than once, proved unsatisfactory for our purposes. We shall see shortly
that producing angled labels for clustered barplots in ggplot2 is simpler, though
overall the amount of code needed is not reduced.
The following worked for our problem. When constructing the function for
the clustered barplot insert the command print(bar) at some point. When you
execute the function the following will appear (using our data).
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]  1.5  5.5  9.5 13.5 17.5 21.5
[2,]  2.5  6.5 10.5 14.5 18.5 22.5
[3,]  3.5  7.5 11.5 15.5 19.5 23.5
where rows correspond to species and columns to sites. Thus each column contains
three x-coordinates that determine the plotting position for each of the three
species associated with the cluster. We only want one label per cluster so extracting
a single row – the central one seems sensible – to get one plotting position per
cluster will do the trick. Hence the use of x=bar[2,] in the text function to do
this, with the result shown in Figure 10.3b.
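The structure of the returned object can be verified with a toy example (any 3 × 6 matrix of counts will do):

```r
tab <- matrix(1:18, nrow = 3)       # any 3 x 6 table of counts
bar <- barplot(tab, beside = TRUE)  # draws 18 bars, returns their x-positions
dim(bar)                            # 3 x 6: one row of positions per species
bar[2, ]                            # the central bar of each cluster
```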
10.3.3 Barplots in ggplot2
For use with ggplot2 the data of Table 10.2 can be restructured, outside of R, as
in Table 10.4, then imported into R as a data frame, here called faunalG.
The code used to get the clustered barplot in the figure was
ggplot(faunalG, aes(Site, Percentage, fill = Species)) +
geom_bar(position = "dodge", stat="identity")
where the data of Table 10.4 are in faunalG. Remove the position = "dodge"
argument to get the stacked barplot.
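The long-format data frame of Table 10.4 can also be built inside R rather than restructured outside it; a base-R sketch using the Table 10.2 values:

```r
sites <- c("Castleford-F", "Ribchester", "Castleford-V",
           "Wroxeter", "Leicester", "Wilcote")
types <- c("Fort", "Fort", "Vicus", "Urban", "Urban", "Rural")

faunalG <- data.frame(
  Site       = rep(sites, times = 3),
  Type       = rep(types, times = 3),
  Species    = rep(c("Cattle", "Sheep", "Pig"), each = 6),
  Percentage = c(64, 63, 59, 53, 44, 38,   # Cattle
                 26, 25, 28, 29, 39, 52,   # Sheep
                 10, 11, 13, 18, 17,  9))  # Pig
```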
Table 10.4: The data of Table 10.2 recast for use with ggplot2.

Site           Type    Species   Percentage
Castleford-F   Fort    Cattle        64
Ribchester     Fort    Cattle        63
Castleford-V   Vicus   Cattle        59
Wroxeter       Urban   Cattle        53
Leicester      Urban   Cattle        44
Wilcote        Rural   Cattle        38
Castleford-F   Fort    Sheep         26
Ribchester     Fort    Sheep         25
Castleford-V   Vicus   Sheep         28
Wroxeter       Urban   Sheep         29
Leicester      Urban   Sheep         39
Wilcote        Rural   Sheep         52
Castleford-F   Fort    Pig           10
Ribchester     Fort    Pig           11
Castleford-V   Vicus   Pig           13
Wroxeter       Urban   Pig           18
Leicester      Urban   Pig           17
Wilcote        Rural   Pig            9
Figure 10.4 shows ‘default’ clustered and stacked barplots using ggplot2 and
the geom_bar function.
Figure 10.4: ‘Default’ barplots, (a) clustered, (b) stacked, for species presence on Romano-British sites using the data of Table 10.4 and ggplot2.
The main problem with the plots, if you don’t dislike the colours, is that the
bars are ordered alphabetically by site rather than in the informative order of
Table 10.2. A modified version of the clustered barplot is shown in Figure 10.5.
The opportunity is taken to show how the width between bars can be altered,
colours changed, and rotated bar labels produced.
Figure 10.5: A modified clustered barplot for species presence on Romano-British sites using the data of Table 10.4 and ggplot2.
The code used follows, presented within a function as it is quite lengthy. Type
fig10.5() to get the plot once the function is constructed.
fig10.5 <- function() {
  library(ggplot2); library(grid)
  # Re-order factor levels
  faunalG$Site <- factor(faunalG$Site,
                         levels = c("Castleford-F", "Ribchester", "Castleford-V",
                                    "Wroxeter", "Leicester", "Wilcote"))
  p <- ggplot(faunalG, aes(Site, Percentage, fill = Species)) +
    geom_bar(width = 0.6, position = "dodge", stat = "identity") +
    theme(panel.grid.minor = element_blank()) + xlab("Site") +
    coord_fixed(ratio = 1/20) +
    scale_fill_manual(values = c("chartreuse3", "orchid", "skyblue")) +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
    theme(axis.text = element_text(size = 12),
          axis.title = element_text(size = 14)) +
    theme(plot.title = element_text(size = 18, face = "bold")) +
    theme(legend.position = "top", legend.text = element_text(size = 14),
          legend.title = element_text(size = 16, face = "bold"),
          legend.key.height = unit(.8, "cm"), legend.key.width = unit(.8, "cm"))
  p
}
fig10.5()
The width argument in geom_bar controls the width of the bars (and hence the spacing between them); the theme
command that follows scale_fill_manual (used to change fill colours) produces
rotated labels. The coord_fixed command is used to change the aspect ratio of
the plot, in this case with the intention of elongating it; you need to play around
with the ratio used to get the effect required.
The various barplots presented, once bars have been appropriately ordered, tell
the same story as Table 10.2. Our preference is to base interpretation on the table
but it’s partly a matter of taste if you think the impact of a graphic is greater and
makes the same points. It’s helpful to include a table of the data even if you go
down this route. As far as style of barplot goes, patterns are sufficiently strong,
and the categories sufficiently few, that both the clustered and stacked barplots
are easily ‘read’. More generally, with more categories and less obvious pattern,
we prefer to use clustered barplots; it’s a lot easier to effect comparisons both
between and within the clustered bars than it is with a stacked barplot.
10.4 Examples, including things to avoid

10.4.1 Glass ‘counters’ and their function
The example presented in Section 9.2.4 was based on glass ‘counters’ discovered in
excavations at Insula VI.1, Pompeii. The term ‘counters’ usually implies that the
artefacts were used as gaming pieces. Cool (2016) queries the universality of this
assumption; here the evidence against it is presented. Table 10.5 shows counts
of counters with different diameters from Insula VI.1 and from grave contexts
elsewhere where they were associated with gaming pieces such as dice.
Table 10.5: Glass ‘counter’ sizes from Insula VI.1 Pompeii and grave contexts where the counters were associated with dice. Source: Cool, 2016.

Diameter   VI.1   Grave      Diameter   VI.1   Grave
    5        2      0            17       16     39
    6        4      0            18       12     33
    7       18      0            19        8     11
    8       22      0            20        2      6
    9       47      1            21        4      5
   10       66      1            22        1      1
   11       66      0            23        1      0
   12       63      0            24        0      0
   13       55      2            25        0      0
   14       48      9            26        1      0
   15       35     19            27        1      0
   16       23     29
The first, unnamed, column ‘Diameter’ is a continuous variable but was only
measured to the nearest millimetre and for comparative purposes it is convenient
to treat the data as discrete. The lack of a header means that ‘Diameter’ will form
the row names. Other headers are convenient for data entry and will be modified
in Figure 10.7. This uses a clustered barplot to contrast the diameters of ‘counters’
from Insula VI.1 with those from the graves which, it seems reasonable to assume,
were in fact used for gaming purposes. Figure 10.6 shows some of the material on
which, here and in Sections 9.2.4 and 12.5, analysis is based.
Figure 10.6: Glass counters from Gloucester and Pompeii. (Source: the authors).
It is immediately obvious from Figure 10.7 that the distribution of size is bimodal, with counters associated with gaming pieces in graves being generally larger, with a somewhat smaller range, than those from Pompeii.

Figure 10.7: A clustered barplot for glass ‘counter’ dimensions from Insula VI.1, Pompeii and from graves where they are associated with gaming pieces.

This casts doubt on the
interpretation of most of the Insula VI.1 ‘counters’ as gaming pieces. Cool (2016)
suggested they were more likely to be elements of interior decoration and there
is other archaeological evidence from elsewhere that suggests this is a plausible
explanation. Additionally, those counters that can reasonably be assumed to have
been used for gaming purposes tend to have strongly contrasting colours (black
and white); the Pompeii counters come in a wide range of colours, not all easily
distinguished.
The data are held in a data frame, countersgraves, that is renamed data
within the function. The base R code used for the graph was
fig10.7 <- function() {
  data <- countersgraves
  barplot(t(as.matrix(data)), beside = T, ylim = c(0, 60),
          xlab = "Diameter (mm)", ylab = "Count", legend.text = T,
          col = c("orange", "black"), names.arg = 5:27,
          args.legend = list(legend = c("Insula VI.1", "Graves"),
                             title = "Context", cex = 1.3, bty = "n"),
          cex.names = .7, cex.axis = 1.2, cex.lab = 1.3, cex.main = 1.4)
}
fig10.7()
where the data are held in a three-column data frame called countersgraves.
This has to be converted to a matrix and transposed (i.e. columns become rows
and rows columns) to get the effect desired. Although this is just a single command
it is worth embedding it in a function as you may need to play around with some
of the arguments to get a figure suitable for publication.
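The effect of the transpose is easy to check on a fragment of the data; barplot draws one cluster per column, so the two contexts have to end up as the rows:

```r
# First three diameters of Table 10.5 only (a sketch)
df <- data.frame(VI.1  = c(2, 4, 18),
                 Grave = c(0, 0, 0),
                 row.names = c("5", "6", "7"))

tab <- t(as.matrix(df))     # rows: contexts; columns: diameters
barplot(tab, beside = TRUE) # one pair of bars per diameter
```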
10.4.2 Chartjunk I
Tufte (1983: 107), in a chapter devoted to chartjunk, starts thus.
‘The interior decoration of graphics generates a lot of ink that does
not tell the viewer anything new. The purpose of decoration varies –
to make the graphic appear more scientific and precise, to enliven the
display, to give the designer an opportunity to exercise artistic skills.
Regardless of its cause, it is all non-data-ink or redundant data-ink,
and it is often chartjunk.’
Among the things he singles out are ‘over-busy grid lines’, ‘redundant representations of the simplest data’ and ‘unintentional optical art’. He further comments
that ‘like weeds, many varieties of chartjunk flourish’. Some of this has already
been discussed in Chapter 9 with reference to pie charts; sadly, archaeological uses
of the barplot are not immune from this critique.
It can be difficult to reproduce some of the awfulness of plots appearing in
the literature. Figure 10.8 uses the data of Table 10.1 to illustrate some of the
problems.
Figure 10.8: An alternative (and poorly) ‘enhanced’ barplot for amphora presence on Romano-British sites using the data of Table 10.1.
There are far too many grid lines which are far too intrusive and distract
from the message the graph has to offer. The visual impact of grid lines can be
particularly pernicious with stacked barplots. This is illustrated in the alternative
rendition of Figure 10.3a shown in Figure 10.9.
Figure 10.9: A version of Figure 10.3a illustrating the visual impact of grid lines.
It hurts visually to look at Figure 10.9; stare long enough and the gaps between
the bars leap to the foreground, distracting from the otherwise clear message of the
plot. If you must have grid lines use fewer and make them less obtrusive. Unwin
(2015: 3–4) quotes Tukey (1993) to the effect that graphics ‘are for the qualitative/descriptive . . . never for the carefully quantitative (tables do that better)’ and
that graphics ‘are for comparison . . . not for access to individual amounts’. If you
accept this there is no real need for the grid lines – they’re just what’s produced by
the defaults in the software that gets used. Adorning the bars with the numbers
as in Figure 10.8 is superfluous. If you do want to comment on these do so on the
basis of a table.
The examples above are not pathological; plenty of comparable plots are to be
found in the literature. Figure 10.10 is a reasonable attempt to reproduce Figure
5 of Schörner (2007), except that we’ve been unable to reproduce the rather garish
shading of the bars used there.
The paper in which this appears is derived from inscribed stelae dedicated to
Saturn found in Roman North Africa. They are associated with the ritual sacrifice
of animals and some of the stelae are dated; Figure 10.10 is intended to show that
the sacrificial ritual ‘was regularly performed all year round’ (Schörner 2007: 98).
Other than via the barplot the data are not given, but it’s easy enough to see –
and without the grid lines – that for January to December the number of sacrifices
attested to is (0 2 1 2 4 4 2 1 4 4 2 3). Note that the numbers are rather
small and integers so that the grid lines in the plot at 0.5, 1.5 etc. are wholly
unnecessary, as arguably are the others; the ‘jazzy’ shading of the bars, more
intrusive in the original, is also unwanted.

Figure 10.10: Monthly distribution of sacrifices to Saturn (after Schörner 2007).

The message of the plot is that, January
apart, between one and four sacrifices are attested to in the remaining months and
there is no evidence that they were confined to official, or ‘seasonal’ events. You
can say this in a single sentence, given the numbers, without recourse to a barplot.
The plot falls into the category of ‘redundant representations of the simplest
data’. The same might be said of several of the barplots in White (2007: 122) that
appear in the same volume as Schörner (2007). Two barplots are used to display
the differing sources of food offerings in burials at Poundbury, UK, in the period
surrounding the Roman conquest of Britain and in the third/fourth centuries AD.
The sample sizes are small (10 and 16) as is the number of categories represented.
The message of the data is that in the earlier period only pig and sheep/goat
offerings were found, in roughly equal numbers (4:6) while pig is not found in the
later period where chicken and sheep/goat dominate (6:8) with single examples of
two other species (cattle, dog). This can be stated in a single sentence and would
be obvious from a small table; using two-thirds of a page to show barplots is a
waste of space.
10.4.3 Chartjunk II
The previous section presented unfortunately real examples, quite common in practice, of barplots that were redundant and/or awful to look at. If they have any
merit it is that they recognise that barplots are two-dimensional constructs, and
you can (through gritted teeth) ‘read’ them if it is the only form in which data
are presented. In this section we stray beyond the merely pointless or awful into
the criminal use of barplots – namely pseudo-three-dimensional constructs. The
arguments against this sort of representation have already been rehearsed in the
context of pie charts and will not be repeated in detail here.
Here we shall illustrate their misuse, as with several previous illustrations, using
examples taken from TRAC proceedings. It is not possible (easily) to produce
three-dimensional barplots in R – it’s not allowed. We have therefore had to write
our own code to produce the following examples.
The first example, in Figure 10.11a, is an imperfect attempt to reproduce Figure
3 of Weber (2012) – our version actually looks better. In the original it occupies
half a page. It shows the distribution of samian ware with potters’ stamps from
the Castleford (UK) pottery shop assemblage, from two production sources. The
overwhelming majority, 419/425, come from a single production centre, Lezoux
(France), and that’s all that needs to be said. A graphic is entirely unnecessary
unless it’s assumed that some aspirant theoretical archaeologists can’t understand
a simple sentence with numbers in it.
Given it’s there, the most egregious thing about the graphic is the use of a
three-dimensional barplot. Because of the distorting effect of the non-existent
third dimension it is usually difficult to relate the bars to the numbers associated
with the grid lines (of which there are too many). Our graph is better than that
in Weber (2012), where the bars appear to float with no obvious relation to the
grid, and need to be endowed with numbers to make sense.
Figure 10.11: Examples of unnecessary three-dimensional barplots, (a) after Figure 3 of Weber (2012), (b) after Figure 4 of Tupman (2005). See the text for contextual details.
The same paper, in Figure 1, contains a similar and equally pointless chart,
albeit with four categories. As an added bonus the paper (Figures 2 and 4) contains
two pie charts – thankfully two-dimensional – which are borderline unreadable,
and too ‘busy’, because of the large number of categories involved. The important
information the pie charts are intended to convey would be better displayed in
tabular form.
For a second example Table 10.6 is that on which Figure 4 of Tupman (2005) is
based. It’s taken from a study of the relationship between social status and commemoration as testified by inscriptions on funerary monuments for the non-elite
from Iberia, dating to the first to third centuries AD. Four classes of monuments
Table 10.6: Individuals commemorated with and setting up funerary altars (after
Tupman, 2005, Figure 4).

                Ingenui   Liberti   Servi   Incerti
Deceased           3         3        3        6
Commemorators      2         3        2       10
were defined, the table relating to inscriptions on altars. Tupman’s terminology is
used in the table and the associated plot of Figure 10.11b; the status designations
correspond to freeborn citizens, freedmen, slaves and uncertain. Inscriptions are
categorised as referring either to the deceased or to commemorators.
In Tupman (2005) the table and associated figure is one of a suite of four,
and it is the contrast between them that is of interest. It is therefore perhaps
a little unfair to treat a single table and figure in isolation; we do so in order
to reiterate a number of points made elsewhere. Ironically, the bulk of Tupman
(2005) is taken up with the archaeological background and theory. The actual data
analysis is confined to about one page at the end of the paper, and is conducted in
quantitative terms without reference to the graphs. The discussion is also couched
in terms of percentages, some based on quite small numbers.
As far as Figure 10.11b goes it does not exactly reproduce Figure 4 of Tupman
(2005), where the bars ‘float’ and it is difficult to relate them to the numbers
associated with the grid lines; the iniquities of pseudo-three-dimensional barplots
have already been commented on and this is one of them. We have not attempted
to reproduce the wholly unnecessary variable grey-shaded background to the plots
that Tupman uses. One problem with this kind of graph is the optical illusions
it can generate. Look long enough and what is intended as the background grid
leaps to the foreground.
In isolation, and ignoring Incerti, Table 10.6 shows that for each status category
there are few inscriptions (5 or 6) more-or-less equally split between the deceased
and commemorators. In fact, given the other monument types, the contrasting
graphs are interesting – despite the poor presentation – and some commentary on
them would have been welcome. All that is actually said is that ‘Figures 3, 4, 5
and 6 show the proportion of people commemorated with each type of monument
belonging to each social status’ (Tupman 2005: 127). That’s it. Incidentally,
the figures show numbers of people, not proportions. The numbers are converted
to proportions (percentages) for the purposes of interpretation in Tupman (2005:
130). This is in the form of a list which would have been better presented, and
more comprehensible and interpretable, as a table.
If we can emphasise a point made earlier, in singling out papers for critical
comment, sometimes in detail, we have no intention of impugning the archaeological scholarship of the authors involved. The intention is to use genuine examples to
illustrate general points about the quality of graphical representation in a seriously
non-negligent number of papers that use graphs.
What is disturbing – and we are not claiming this is universal – is the disjunction often to be seen between the quality of the archaeology and the quality of the
statistical graphics that accompany any data analysis that takes place. Graphs
of the kind illustrated in the two ‘chartjunk’ sections are all too readily produced
in Excel, and unfortunately often are. If you need a graphic, and they are often
superfluous, they deserve the same care and attention as the rest of the text and
should be properly integrated into it. This is not always the case.
Chapter 11

Ternary diagrams

11.1 Introduction
An example of a ternary diagram was shown in Figure 9.6; here further examples
and details of their construction are provided. Ternary diagrams go under a variety
of names – triangular, tripolar or phase diagrams among them. They are used when
p = 3 variables are available that are scaled to sum to 100% (or 1). The data may
be categorical or continuous. This scaling (or compositional constraint) means the
values of any two variables determine the value of the third variable. This implies
that the data are exactly two-dimensional.
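Trivially, in R, using the Castleford fort percentages of Table 10.2:

```r
cattle <- 64
sheep  <- 26
pig    <- 100 - cattle - sheep  # the third component is determined
pig                             # 10
```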
The data can be plotted in an equilateral triangle where a point represents
the relative proportions of the three variables as measured by the (perpendicular)
distances to the three axes. In effect, and as previously noted, a point in a ternary
diagram encapsulates the same information as a pie chart for three categories, with
the considerable advantage that the distances between points are visualized.
Howarth’s (1996) history of the ternary diagram traces its origins to the mid-18th century. Its use in archaeology is widespread, often occurring in the more
‘specialised’ literature. Howarth (1996: 338) notes that they are widely used in
geology, physical chemistry and metallurgy and are to be found in archaeological
publications that interact with these disciplines. Recent papers in this vein that use
ternary diagrams include Hughes et al. (2010) study of flint compositions, Panich
(2016) of obsidian, and Chirikure et al. (2010) of slags. For data sets with more
than 3 variables composite variables (i.e. linear combinations of those available) are
sometimes defined to enable their representation in a ternary diagram (e.g., Hein
et al. 2007: 149; Plaza and Martinón-Torres 2015: 93). Another use is in phase
diagrams where a plot is ‘zoned’ in some way to identify, for example, different
classes of material or manufacturing technologies corresponding to different regions
of the diagram. Data are then plotted to characterize the cases being studied (e.g.,
Thornton and Rehren 2009; Radivojević et al. 2010).
Archaeozoological applications, to data of the kind used for Figure 9.6, are also
common. An important body of work making extensive use of ternary diagrams
for this kind of purpose is attributable to King (e.g., King 1984, 1999a, 1999b,
2005); Levitan’s (1993) report on the vertebrate remains from excavation of the
Uley shrines (Woodward and Leach 1995) also makes extensive and effective use
of the method.
Stiner (1990, 1994) popularised the use of ternary diagrams for studying mortality patterns (i.e. old, prime, juvenile) in archaeofauna. Steele and Weaver
(2002), for similar data, proposed what they called a ‘modified triangular graph’
allowing an indication of sampling uncertainty to be incorporated into the diagram in the form of density contours (confidence ellipsoids) surrounding individual
points. Weaver et al. (2011) further develop their ideas. Stiner introduced the
idea of demarcation of ‘zones’ in ternary diagrams to identify specific mortality
patterns; Discamps and Costamagno (2015) provide a critique of the methodology
and propose an alternative species-specific approach to zoning.¹
This by no means exhausts the uses to which ternary diagrams can be, or have
been, put in the service of archaeological data analysis. Meighan (1959) is the
earliest archaeological application of ternary diagrams we have seen; he used the
relative proportions of three types of pottery, displayed in a ternary diagram, to
seriate pottery assemblages. The illustrative example in the next section can be
viewed in this light.
Implementation is possible in a variety of packages in R, the triax.plot function from the plotrix package and plot.acomp from the compositions package
among them. There are others but only the ggtern package, dedicated to the
production of ternary diagrams, will be illustrated.
11.2
11.2.1
Examples
Palaeolithic assemblages
For an initial illustration, data from Table 9.12 in Doran and Hodson (1975) are
used (Table 11.1). The left-hand table shows the counts of cores, blanks and stone
tools found in different levels at the palaeolithic site of Ksar Akil (Lebanon); the
levels are numbered from earliest (25) to latest (12). The right-hand table shows
the counts converted to percentages for each level, which can be represented in a
ternary diagram [2].
Read the data (i.e. the headers and what follows) into a data frame Ksar –
you will need to select a subset of the columns for analysis. Either Ksar[, 1:4]
or Ksar[, c(1, 5:7)] will do it. We'll use the former; ggtern will transform these
[1] The last three papers are quite mathematical and go beyond what is discussed in any detail in the text.
[2] Note that the row sums are not all exactly 100% because numbers have been rounded to integers for presentational purposes; more exact values were used for the analysis. If you see a table like this in the literature where row sums are all exactly 100% the chances are that the table has been 'doctored' a little to achieve this.
Table 11.1: Counts of cores, blanks and tools from middle levels of the palaeolithic site at Ksar Akil (Lebanon), together with the counts converted to row percentages, rounded to integers. The counts are Table 9.12 from Doran and Hodson (1975).

Levels   Cores   Blanks   Tools   CoresP   BlanksP   ToolsP
25          21       12      70       20        12       68
24          36       52     115       18        26       57
23         126      650     549       10        49       41
22         159     2342    1633        4        57       40
21          75      487     511        7        45       48
20         176     1090     912        8        50       42
19         132      713     578        9        50       41
18          46      374     266        7        55       39
17         550     6182    1541        7        75       19
16          76      846     349        6        67       27
15          17      182      51        7        73       20
14           4       21      14       10        54       36
13          29      228     130        7        59       34
12         133     2227     729        4        72       24
to the percentages needed. The ternary diagram for these data, in Figure 11.1a,
is followed by the code used.
[Figure 11.1 appears here: two ternary diagrams, (a) and (b), with axes Cores, Blanks and Tools; in (b) the points are labelled by level.]
Figure 11.1: Ternary diagrams for the palaeolithic assemblage data of Table 11.1.
ggtern(data=Ksar[,1:4],aes(Cores, Blanks, Tools, label=Levels)) +
geom_point(size = 3) + theme_showarrows()
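As a check on what ggtern does internally, the count-to-percentage conversion can be done by hand with prop.table; a minimal sketch using the first three rows of Table 11.1:

```r
# First three rows of Table 11.1 (levels 25, 24, 23)
counts <- matrix(c( 21,  12,  70,
                    36,  52, 115,
                   126, 650, 549),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("25", "24", "23"),
                                 c("Cores", "Blanks", "Tools")))
# prop.table with margin = 1 divides each row by its row sum;
# this is the normalization ggtern applies before plotting
pct <- 100 * prop.table(counts, margin = 1)
round(pct)  # matches the rounded percentages in Table 11.1
```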
The slightly curvilinear pattern of points suggests the possibility of seriation,
presumably related to the sequencing of the levels [3]. To investigate this, replace geom_point(size = 3) with geom_text(), which adds the text contained
in Levels. The result, Figure 11.1b, confirms the interpretation as a seriation,
though it is not perfect. Later levels are to the top of the diagram and earlier
ones to the lower-right. The (partial) seriation could, of course, have been inferred
from the percentages displayed in Table 11.1.
11.2.2
Romano-British civilian faunal assemblages
For a more detailed illustration, data on the faunal assemblages from Romano-British
civilian sites, extracted from Table 2 of King (1984) and Table E of King (1999a),
are used. Pre-Roman assemblages from the first of these tables have been omitted.
The first few lines of the data, as entered into R, are shown in Table 11.2.
Table 11.2: Faunal remains from Romano-British civilian sites; C = cattle, P = pig, SG
= sheep/goat. (Source: King 1984, King 1999a).
Type           C      P     SG
Settlement    37   10.4   52.6
Vicus       37.4   13.7   48.9
Settlement  74.4    4.7   20.9
Settlement    62   10.8   27.2
Villa       48.7   29.2     22
Villa       56.2   12.9   30.9
Villa       59.5   10.1   30.4
Villa       71.3    8.5   20.2
...
The codes for site type in King (1984), sett, urb, vic and vill, have been renamed
here as Settlement, Urban, Vicus and Villa; King (1984: 14) provides a fuller
description of the site types embraced by these terms (sett = unromanised (rural)
settlements; urb = larger town/civitas capital, colonia etc.; vic = civilian vicus,
Romanised settlement, smaller town/civitas capital; vill = villas). It is convenient
here, for presentational purposes, to use our ‘shorthand’ code, but readers should
bear in mind the fuller definitions. The additional data extracted from the 1999
paper are those with identical site type codes to those used in the 1984 paper.
There are 254 assemblages of the percentage presence of cattle, pig and sheep/goat
bones, based on the total number of identified pieces of bone for those three species
groups. In R the data were read into a data frame King with column names Type,
C, P and SG for the site type and cattle, pig and sheep/goat percentages. For an
[3] label = Levels is not actually needed for Figure 11.1a but does no harm. It's included because it is needed for Figure 11.1b.
initial look at the data the following code was used. The upper part of Figure 11.2
is the result.
library(ggplot2); library(ggtern)
ggtern(data=King,aes(C, P, SG, colour=Type, shape=Type)) +
geom_point() + theme_showarrows()
[Figure 11.2 appears here: upper and lower ternary diagrams with axes C, P and SG; each has a Type legend, the lower plot using the fuller labels Rural settlement, Urban site, Vicus and Villa.]
Figure 11.2: The upper plot is the ‘default’ ternary diagram for Romano-British civilian
faunal assemblages using ggtern; the lower plot is a ‘modified’ version. C = Cattle, P
= Pig, SG = Sheep/Goat.
The code is the ‘default’ in the sense that it is almost the minimum code needed
to get a ternary diagram, using ggtern. The theme_showarrows() command isn’t
essential, but we find that the arrows produced, aligned with the axes, help in
interpreting plots. We don’t much like the defaults; an alternative rendition with
modified presentational details is shown in the lower part of Figure 11.2. The code
used to produce it was as follows.
fig11.2lower <- function() {
  library(ggplot2); library(grid); library(ggtern)
  King$Type <- factor(King$Type, labels = c("Rural settlement", "Urban site",
    "Vicus", "Villa"))
  ggtern(data=King, aes(C, P, SG, colour=Type, shape=Type, fill = Type)) +
    geom_point(size = 3) + theme_showarrows() +
    scale_shape_manual(values=c(21,22,24,25)) +
    scale_colour_manual(values=rep("black", 4)) +
    scale_fill_manual(values=c("skyblue", "red", "yellow", "white")) +
    theme_legend_position("tl") +
    theme(legend.title=element_text(size=16),
      legend.key.height=unit(1, "cm"), legend.key.width=unit(1, "cm"))
}
fig11.2lower()
Here King$Type renames the levels of Type to produce a more informative
legend. The different scale commands, in sequence, select open plotting symbols;
specify their border colour as black; and specify different colours for their fill. The
legend is placed at the top-left using theme_legend_position("tl").
King’s papers can be consulted for an interpretation of the data; these also use
assemblages for military sites. A brief summary of some of King’s (1999a: 178–
180) conclusions is that ‘sites in Britain show that the high cattle/high pig pattern
correlates with the apparent “Romanized” nature of the sites. There is a gradient
towards higher average cattle and pig percentages that goes in the sequence; rural
settlements, villas, secondary urban centres, urban sites . . .’ with military sites
and legionary sites completing the sequence.
This can be envisaged by imagining straight lines drawn from the vertex for
cattle (C) through the scatters of points for each site type. The imagined line
for (rural) settlements, relative to the sheep/goat axis, would make a sharper
angle with the horizontal axis than that for urban sites. This reflects differences
in the relative proportions of the different species that King interprets in terms
of 'Romanization'; rural settlements tend to have relatively more sheep, with the
presence of cattle and pig being correspondingly less.
With several different types to display and lots of data such differences can
be difficult to discern. One approach to alleviating this is to compare types in
a pairwise fashion, which could be done by modifying point sizes. This is illustrated in Figure 11.3 which emphasises the contrast between the patterns for rural
settlements and urban sites.
[Figure 11.3 appears here: a ternary diagram with axes C, P and SG and a legend of site types.]
Figure 11.3: A ‘modified’ ternary diagram for Romano-British civilian faunal assemblages emphasising the distinction between two types of site. C = Cattle, P = Pig, SG
= Sheep/Goat.
This was effected by modifying the geom_point command in the previous chunk
of code and using scale_size_manual to set the point sizes as desired.
# Link point sizes to site type
geom_point(aes(size = Type))
# Not especially nice so manually set point sizes
scale_size_manual(values = c(3,3,.5,.5)) +
Another approach would be to display separate ternary diagrams for each site
type. Figure 18 in King (1999a) overlays summary ternary diagrams on a map
of the Roman Empire for Roman Imperial assemblages. Each diagram includes a
‘blob’ summarizing the compositional variation within a region. In ggtern this
kind of display can be achieved by overlaying a ternary diagram with a contoured
two-dimensional kernel density estimate for the data. This is illustrated for the
British civilian faunal assemblages in Figure 11.4, where site distinctions have been
ignored, using the following code.
fig11.4 <- function() {
library(ggplot2); library(grid); library(ggtern)
ggtern(data=King, aes(C, P, SG)) + theme_showarrows() +
stat_density_tern(geom="polygon", n=200, aes(fill=..level..)) +
geom_point() + scale_fill_gradient(low="blue", high="red") +
guides(color = "none", fill = "none")
}
fig11.4()
[Figure 11.4 appears here: a ternary diagram with axes C, P and SG; the points are overlaid with filled density contours.]
Figure 11.4: A ternary diagram for Romano-British civilian faunal assemblages overlaid
with a contoured two-dimensional kernel density estimate.
[Figure 11.5 appears here: four ternary diagrams with axes C, P and SG, one per site type, with panel titles Rural settlements, Villas, Urban sites and Small towns.]
Figure 11.5: Ternary diagrams for Romano-British civilian faunal assemblages by site
type overlaid with contoured two-dimensional kernel density estimates.
To get the plot for rural settlements, for example, replace King in the first line
of code with King[King$Type=="Settlement",]; the title is added using
ggtitle("Rural settlements") +
  theme(plot.title = element_text(size = 20, face="bold"))
For the other plots replace "Settlement" with "Urban", "Villa" and "Vicus",
adjusting the titles accordingly.
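The King[King$Type=="Settlement",] construction is ordinary logical row indexing of a data frame; a minimal sketch on a toy data frame (values invented, not the King data):

```r
# Toy data frame with the same structure as King (invented values)
toy <- data.frame(Type = c("Settlement", "Urban", "Settlement", "Villa"),
                  C  = c(37.0, 62.0, 74.4, 48.7),
                  P  = c(10.4, 10.8,  4.7, 29.2),
                  SG = c(52.6, 27.2, 20.9, 22.0))
# Keep only the rows whose Type is "Settlement"
rural <- toy[toy$Type == "Settlement", ]
nrow(rural)  # 2
```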
11.2.3
Staffordshire Hoard gold compositions
The Staffordshire Anglo-Saxon Hoard is a treasure find from 2009, consisting of
about 7kg of metal pieces, mostly of gold and, to a lesser extent, silver. The
project design established for the study of the Hoard (Cool, 2013: 21–22) provides
the data given in Table 11.3 and background; some of the objects analysed are
shown in Figure 11.6. The data are measurements of the surface composition,
with respect to gold (Au), silver (Ag) and copper (Cu), of 22 metal pieces from
six objects found in the Staffordshire Anglo-Saxon Hoard, normalized to sum to
100%. A ternary diagram is shown in Figure 11.7.
Table 11.3: Gold/silver/copper compositions for the surfaces of artefacts from the
Staffordshire Hoard, normalized to sum to 100%.
Artefact    Cu     Ag     Au
a          3.3   10.9   85.5
a          2.3   11.5   85.9
a          2.6   11.0   85.9
a          3.0   12.0   84.4
b          2.1   18.8   78.8
b          2.6   20.8   76.1
b          2.0   17.7   79.7
c          5.3   29.0   65.5
c          3.7   26.9   57.1
c          3.9   27.9   60.7
c          3.7   28.1   60.7
c          5.3   14.0   79.6
c          6.1   16.7   71.8
c          7.8   26.0   65.6
d          2.2   18.4   79.0
d          1.7   15.3   82.6
e          1.1   24.4   74.2
e          5.4   25.8   68.5
f          1.6   15.2   83.0
f          1.9   17.7   78.2
f          2.4   18.1   78.7
f          2.2   18.1   78.2
The analysis was undertaken to see if pieces that could be associated with the
same object were compositionally similar with respect to their surface and distinct from other objects. The plot shows that objects are not as compositionally
consistent and distinct from other objects as one might wish; it has subsequently transpired that surface measurements do not adequately reflect the body
composition, as it had been hoped they might (Blakelock, 2016).
The code used is similar to that already illustrated, using a data file staffgold,
with one change: only a subset of the full triangle is displayed, the better to
see any patterns in the data. This is achieved, in this case, by adding the
code
Figure 11.6: A selection of objects from the Staffordshire Hoard analysed for Table 11.3.
(Source: Photographer Guy Evans. Copyright Barbican Research Associates).
tern_limits(T = .4, L = .4, R = 1)
which determines the top, left and right apices of the triangle. This was previously used, without comment, to produce Figure 9.6 where the plotting region was
delimited by values of 0.5, 0.8 and 0.45.
fig11.7 <- function() {
  library(ggplot2); library(grid); library(ggtern)
  p <- ggtern(data=staffgold, aes(Cu, Ag, Au, color=Artefact,
         shape = Artefact, fill = Artefact)) +
    geom_point(size = 3) + theme_showarrows() +
    scale_shape_manual(values=c(21,22,21,23,24,25)) +
    scale_colour_manual(values=rep("black", 6)) +
    scale_fill_manual(values=c(NA, "yellow", "skyblue", "darkorange",
      "purple", NA)) +
    theme_legend_position("tl") +
    theme(legend.title=element_text(size=16),
      legend.key.height=unit(1, "cm"), legend.key.width=unit(1, "cm")) +
    tern_limits(T = .4, L = .4, R = 1)
  p
}
[Figure 11.7 appears here: a ternary diagram with axes Cu, Ag and Au, restricted to part of the full triangle, with a legend for artefacts a–f.]
Figure 11.7: A ternary diagram for the Au/Ag/Cu surface compositions of 22 pieces
from six metal artefacts from the Staffordshire Hoard.
fig11.7()
Chapter 12
Correspondence analysis
12.1
Introduction
Several previous chapters have concentrated on the analysis of tabulated data,
either in the form of a set of counts for two or more categories of a single categorical
variable, or two-way tables obtained by crosstabulation of two categorical variables.
Pie charts, sometimes used to inspect such data in archaeological applications,
are usually not ‘fit for purpose’ (Chapter 9). For small tables of data, if you must
have a graphic, barplots are generally better (Chapter 10). It will be clear, from
discussion in these chapters, that we think simple tabular presentation with an
intelligent commentary is often preferable, though we recognise that some archaeologists prefer graphical presentation even when their subsequent discussion of a
graph is based on the numbers represented by it.
It’s a different story if you have a large table to deal with, and this is where
correspondence analysis (CA) comes in. Any informative pattern in the data may
be both non-obvious and subtle; CA is one way to get a handle on this.
Put very simply, the rows of a table of data are reduced to points that can be
plotted on a map that allows the similarity between rows, and any pattern in the
data, to be assessed. The method is commonly used for the purposes of seriation,
for example. The same sort of thing can be done with the columns of the table,
generating a second set of points. Joint inspection of the two sets of points/maps
can be informative about the reasons for any pattern in the data.
This is discussed in more detail in the next section. Section 12.3 introduces
data on the regional and temporal distribution of brooch use in Late Iron Age and
Roman Britain, used for detailed illustration in Section 12.4 which discusses how
a basic CA can be carried out in R and shows some options for displaying results.
It’s possible to use CA for pattern recognition; in archaeological applications it is
widely used for seriation which generates what can be thought of as a ‘specialised’
form of pattern. Sections 12.5 to 12.7 provide some additional examples that
illustrate chronological and spatial seriation. The different sections also illustrate
different presentational options available with CA.
The method is a valuable tool for the graphical exploration of data. In the
examples to follow sufficient detail has been provided about the archaeological
context, including the substantive conclusions that CA can give rise to, in the
hope that readers not already convinced by the merits of CA might see that it is
a method worth knowing about.
12.2
Correspondence analysis – basics
Apologies to anyone reading this who is familiar with CA and uses it; they will
recognise that some corners are being cut in the account that follows.
The more theoretical treatments of CA can be borderline unreadable, even for
someone with a training in statistics. But the basic idea is very simple. Greenacre
(2007) is an accessible introduction; Baxter (1994/2015a) provides a fairly detailed
account aimed at archaeologists, including a review of the history of the use of
CA in archaeology to that date; Chapter 13 of Shennan (1997) is also a detailed
introductory account aimed at archaeologists; Baxter (2003) updates the story and
provides a more concise account of the methodology; Chapter 9 of Baxter (2015b)
provides a number of illustrative examples using R. Some of this earlier work is
repeated here, but we go into more detail about implementation in R. Baxter and
Cool (2010b) was intended as an introduction to CA in R for archaeologists with
little or no statistical training, who nevertheless appreciated what CA might have
to offer in their own work; it is now outdated. Alberti (2013a) is in a similar spirit
while taking a different approach to exposition; see also Alberti (2013b, 2015).
At a more advanced methodological level are Peeples and Schachner (2012) and
Lockyear (2013), the former providing access to R code.
The story begins with a table having R rows and C columns, consisting of raw
counts based on the crosstabulation of two categorical variables. The values of R
and C are sufficiently large that it may not be obvious what the story, if any, is
that the data have to tell you, or it may not lend itself to immediate narrative
discourse. So to simplify things, and allow such a discourse, the data are reduced
to two pictures, one for rows and one for columns.
The pictures are in the form of scatterplots (Chapter 8). Roughly, and concentrating on the rows, each row is converted to row percentages – call these row
profiles. The ‘distance’ between each pair of row profiles is measured; this generates a
set of x- and y-coordinates for the R rows which can be plotted on an ordinary
scatterplot, so that’s the first picture. The second picture is obtained by treating
the columns in a similar way.
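The profile idea is easy to sketch in base R. The distance CA actually uses is the chi-squared distance, which weights each column by its overall share of the table; the following sketch (with a small invented table, not archaeological data) computes row profiles and compares two of them:

```r
# Small invented table of counts, 3 rows x 3 columns
N <- matrix(c(20, 10,  5,
              18, 12,  6,
               2, 10, 30), nrow = 3, byrow = TRUE)
P <- N / sum(N)          # table as proportions of the grand total
r <- rowSums(P)          # row masses
cm <- colSums(P)         # column masses
profiles <- P / r        # row profiles: each row now sums to 1
# Chi-squared distance between the profiles of rows i and j
chisq_dist <- function(i, j)
  sqrt(sum((profiles[i, ] - profiles[j, ])^2 / cm))
chisq_dist(1, 2)  # small: rows 1 and 2 have similar profiles
chisq_dist(1, 3)  # larger: row 3 is quite different
```

Rows with small chi-squared distances between their profiles end up as nearby points in the CA map.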
The mathematics is quite ‘messy’ but the idea is simple; you are just measuring
how similar pairs of row profiles are, and computational aspects have been taken
care of by the people who write software for this kind of thing. Call the points
row markers; rows that plot closely to each other should have similar profiles and
if they are far from each other they should have dissimilar profiles. It needs to be
recognised that the maps produced by CA are only an approximation to ‘reality’,
so the quality of the approximation needs to be judged.
This might be done in various ways, including expert assessment of any pattern
revealed. For example, if rows correspond to assemblages and columns to artefact
types the raw data can be inspected to see if it makes archaeological sense. More
technically a quantity called inertia is often used; this will be dealt with in context
in the examples to follow.
At the heart of many applications of CA is a comparison of the relative positioning of row and column markers. This can be effected in various ways, often by
the superimposition of the two maps. An alternative, which is sometimes better
with large data sets, is to present the maps as adjacent plots. Sometimes the main
focus may be on the rows (or columns) only, and there is nothing wrong with this.
The superimposition of the two maps constitutes a biplot and the term may be
used more loosely to refer to the presentation of adjacent maps. A certain amount
of mathematical ink has been spilled on the topic of the best/most ‘correct’ way
to display a biplot. The problem, if a ‘purist’ view is adopted, is that plots can
become difficult to read; the points for one map tend to lie around the periphery
of a plot while those for the other map get bunched-up in the middle. Therefore a
‘compromise’ approach, which can be explained mathematically, is often adopted,
which gives both maps equal prominence. This is the approach adopted here, but
software for CA often allows a choice if you want to explore presentational options.
The fundamental issue is that it is meaningful to talk about distances between
points on the individual maps, but it does not make sense to talk about distances
between points from different maps. Although this latter interpretation of ‘distance’ is not legitimate it is the case, and this is the point of CA, that the relative location
of points on each map is informative. Imagine that maps are divided into four
quadrants, centred on the (0, 0) point. Rows that plot in the north-east quadrant,
for example, will tend to have larger values on those column categories that plot
in the same region, smaller values on the column categories that plot opposite in
the south-west quadrant, and so on.
As noted at the outset of the section this is a simplified account. Interpretation
in practice is often more complex and best illustrated by examples, to which we
shall turn after introducing one of the data sets to be used in the next section.
12.3
Some data
Table 12.1 has previously been published in the supplement to Cool and Baxter
(2016a) and in Baxter and Cool (2016). The first of these papers is a detailed
archaeological exploration of temporal and regional patterns of brooch use in Late
Iron Age and Roman Britain; the second paper is an extended, and generalised,
account of some of the statistical methodology developed for analysis in the first
paper. The data were extracted, by the second author, from Mackreth’s (2011)
corpus Brooches in Late Iron Age and Roman Britain. Figure 12.1 shows a selection
of the brooches studied.
Table 12.1: Regional counts of late Iron Age and Roman brooches by TPQ and period.
The regions are EA = East Anglia, N = North, S = South, EM = East Midlands, WM =
West Midlands, SW = South-West. Periods are IA = Iron Age, AT = Augusto-Tiberian,
CN = Claudio-Neronian, FL = Flavian, TH = Trajanic-Hadrianic, AS = Antonine-Severan, L = Late.

TPQ    EA    N    S   EM   WM   SW   Period
-70    25    0   88   20    3   24   IA
-50    25    1  107   29    3   26   IA
-30    13    2   49   11    1   25   IA
-20     0    1    7    1    0   20   IA
 -5     6    0   24    3    0    1   IA
  1    62   11  141   64   30   95   AT
  5     2    0   19    5    0   82   AT
 10   173   14  532  157   37  119   AT
 15     9    0   26    7    3    3   AT
 20    38    4  277   51   15   81   AT
 25     3    0   13    0    0    1   AT
 30    32    0   56   20    1   14   AT
 35   247   26  561  176   50  319   AT
 40    31    2   72   41    5    8   CN
 45    82    8  203   54   20  339   CN
 50   190    8  106   77   28  142   CN
 55   297   24  182  119   34   98   CN
 60    68    4   68   22   19   48   CN
 65    55   10   74   45   64  146   CN
 70    29   11   27   22   10   17   FL
 75    89   58  106   90   53  140   FL
 80   130   70   57   79   35   94   FL
 85   161   89   73  142   63  145   FL
 95    70   73   26   18   10   22   FL
100     4    5    5    5   13    4   TH
105    10    1   12    2    1   14   TH
110    52   37   58   37   40   49   TH
115    12   25   18   11   13   32   TH
120    10    6    7   17    9   17   TH
125    36   10   38   15   11   53   TH
135     8    7    6    2    3    6   TH
140    21   17   19    7    7    5   TH
145    23   15   11    4    0   15   AS
150    34   50   37   18   11   39   AS
155    10    2    8    7    1    6   AS
160    29   21   15   10    6   14   AS
170    13   18   31    5    2   14   AS
175    22    3   12   10    0   13   AS
180     7    7   15    6    4    6   AS
185    16    3   12    0    0    6   AS
190    68   16   40   16   11   18   AS
200    44   25   28   14    3   16   AS
210    15   19    9    3    0    5   AS
215    19   33   33    8    1   11   AS
275    23    2    3    2    2   11   L
290     3    7    7    2    1    0   L
310     5    3    6    1    1    2   L
340     3    3    3    1    3    1   L
350     8    5   12    2    5    6   L
370     5    0    3    0    2    5   L
400     4    0    7    2    2    2   L
Figure 12.1: A selection of Romano-British bow brooches of first to fourth century date.
(Source: authors).
A brooch can be assigned to a type and the region in which it was found. Many
of the brooches were not recovered from dated contexts, but the type to which they
belong can be assigned a terminus post quem (TPQ) based on the earliest date for
brooches of the type found from dated contexts. Several types may be associated
with the same TPQ and have been grouped together in Table 12.1. That is,
the entries refer to counts of brooches from a region whose type has a particular
TPQ with type differences for that TPQ being ignored. The period labels are
conventional, named, for the most part, after the more memorable emperors or
their families who dominated the period.
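A table like 12.1 is just a crosstabulation of record-level data; in R the construction can be sketched with xtabs, here with a few invented brooch records (the region and TPQ labels mimic Table 12.1 but the records are made up):

```r
# Invented record-level data: one row per brooch
brooch_records <- data.frame(
  Region = c("EA", "EA", "N", "S", "S", "S", "EA"),
  TPQ    = c(-70, 50, 50, -70, 50, 75, 75))
# Crosstabulate into a TPQ-by-region table of counts
tab <- xtabs(~ TPQ + Region, data = brooch_records)
tab
```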
For present purposes the data will be used to illustrate the mechanics of carrying out a correspondence analysis in R and some of the presentational possibilities.
Some of this was undertaken in Cool and Baxter (2016a) and the reader will be
alerted to other analyses undertaken there, as appropriate.
12.4
12.4.1
R basics
‘Default’ analyses
Enter the data of Table 12.1 into R with the name brooches. Figure 12.2 shows
the (almost) default output from two implementations of CA in R, using columns
2 to 7 only (i.e. just the counts).
[Figure 12.2 appears here: two biplots, (a) from corresp and (b) from ca; the ca plot labels its axes Dimension 1 (46.5%) and Dimension 2 (25.8%).]
Figure 12.2: Basic correspondence analyses for the Romano-British brooch data set
of Table 12.1; (a) using the corresp function from the MASS package, (b) using the ca
function from the ca package.
Figure 12.2a was obtained using the corresp function from the MASS package
(Venables and Ripley 2002); Figure 12.2b was obtained using the ca function from
the ca package (Nenadic and Greenacre 2007). The code follows.
library(MASS)
biplot(corresp(brooches[,2:7], nf = 2))
library(ca)
plot(ca(brooches[,2:7]))
The corresp and ca functions carry out the CA and biplot and plot produce
the graphs from the output generated. The default in the corresp function is to
extract just one axis, so the argument nf = 2 is necessary to extract two axes
for plotting purposes. Both plots are informative in different ways, but both can
clearly be improved on, for presentational and interpretive purposes alike.
A more detailed interpretation will follow in Section 12.4.3; for the moment
note that the south-west (SW) region sits apart from the other regions and Figure 12.2a shows that this is attributable to three ‘outliers’, rows 4, 7, and 15 which
correspond to TPQs of 20 BC, 5 AD and 45 AD. Inspection of Table 12.1 shows
that the SW contributed an unusually high proportion of the brooches for each
of these TPQs. Detailed analysis of the brooches associated with these TPQs,
reported in Cool and Baxter (2016a), revealed that this was being caused by the
number of Durotrigian brooches present in these date-assemblages. For reasons
discussed in that paper these TPQs were removed from subsequent analysis and
this will be emulated in this section.
There is a general lesson here. The ‘outliers’ are not ignored; they merit interpretation in their own right. But they can also obscure other informative structure
in the data so it is sensible to remove them and see what happens. The same applies when an analysis reveals clear structure in a data set, such as clusters of
cases. It can be useful to ‘break down’ the data set into its ‘component parts’ and
examine these separately to see if more subtle structure is also present. Cool and
Baxter (1999) is a more extended treatment of this kind of thing.
In Figure 12.2b the percentage figures associated with the two axes/dimensions
are the inertias, which sum to about 72%. Remember that this is a measure of the
‘quality’ of approximation to the ‘reality’ of the data set. It’s generally considered
good practice to report the values and 72% can be regarded as respectable. It’s
not perfect, however, which suggests that looking at axes/dimensions other than
the first two may be worthwhile and this will be illustrated shortly. To read the
plots as maps it’s important that the axes are equally scaled. In technical terms
the aspect ratio should be 1.
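The inertias themselves can be computed without the ca package: they are the squared singular values of the matrix of standardized residuals. A sketch using base R only and a small invented table (this is essentially the computation behind the percentages on the axes of Figure 12.2b):

```r
# Small invented table of counts
N <- matrix(c(20, 10,  5,
              18, 12,  6,
               2, 10, 30), nrow = 3, byrow = TRUE)
P <- N / sum(N)                       # proportions of the grand total
r <- rowSums(P); cm <- colSums(P)     # row and column masses
E <- r %o% cm                         # expected proportions under independence
S <- (P - E) / sqrt(E)                # standardized residuals
inertias <- svd(S)$d^2                # principal inertias, one per axis
round(100 * inertias / sum(inertias), 1)  # percentage of inertia per axis
```

For a table with three rows and three columns at most two axes carry inertia, so the last value is numerically zero.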
12.4.2
Customising plots
In this section enhanced versions of the plots in Figure 12.2 are presented using
both base R and ggplot2. We prefer the former, which seems to us to be simpler,
but provide code for the latter as well. The general idea, which applies to either
approach, is to undertake a CA and extract, separately, the information needed to
plot the maps for both rows and columns. These can then be superimposed, as is
done here, or plotted as adjacent maps – this approach allows much greater control
over presentational matters than do the defaults. The use of the third dimension
in a CA is illustrated and is informative. Initially the graphs, with accompanying
code and commentary, are presented. The archaeological interpretation of the
plots is discussed in Section 12.4.3.
In all the analyses the three outliers associated with the SW region are omitted
and this changes the picture a bit as regards the positioning of the SW in relation
to other regions. To reiterate, they are not being ignored; the reasons for their
‘outlyingness’ can be explained on archaeological grounds and a more ‘balanced’
picture of the overall pattern is obtained if they are removed. Figure 12.3 shows
the base R plot; we’ve chosen to do this in black-and-white but it’s easy to have
coloured symbols, labels and arrows if you wish.
[Figure 12.3 appears here: two CA plots, (a) and (b), each with a Period legend (IA, AT, CN, FL, TH, AS, L); the axes are labelled Axis 1 (inertia = 53.8%), Axis 2 (inertia = 21.1%) and Axis 3 (inertia = 15.5%).]
Figure 12.3: Enhanced correspondence analyses for the Romano-British brooch data set
using the ca function and base R: (a) the CA plot based on the first two axes; (b) a plot
based on the first and third axes.
The code used can be broken down into different chunks. The first chunk
is used for both the base R and ggplot2 analyses. The ca function is used; a
data frame, df is created that omits the outliers previously identified; the regional
counts are extracted in regions; and a correspondence analysis is carried out with
the results held in CA. The final two lines extract the row and column coordinates
used in subsequent plots. This could be done in other ways that use less code, but
the approach used here is intended to be more transparent.
library(ca)
df <- brooches[-c(4,7,15),]  # omit outliers
regions <- df[,-c(1,8)]      # extract regional data
CA <- ca(regions)            # carry out CA
x1 <- CA$rowcoord[,1]; x2 <- CA$rowcoord[,2]; x3 <- CA$rowcoord[,3]
y1 <- CA$colcoord[,1]; y2 <- CA$colcoord[,2]; y3 <- CA$colcoord[,3]
The base R code used now follows. That for the plot based on the first two
axes is given.
xlab <- df$Period
ylab <- names(regions)
PCH <- rep(c(16,3,7,8,15,2,17), c(4,7,5,5,8,12,7))
plot(x1, x2, asp = 1, pch = PCH, xlab = "Axis 1 (inertia = 53.8%)",
ylab = "Axis 2 (inertia = 21.1%)", cex.axis=1.2, cex.lab=1.3, cex=1.2)
text(y1, y2, ylab, cex=1.2)
arrows(0, 0, .85*y1, .85*y2, length = .15, angle = 10, lwd = 2)
legend("topleft", c("IA", "AT", "CN", "FL", "TH", "AS", "L"),
pch = c(16,3,7,8,15,2,17), title = "Period", bty = "n", cex = 1.2)
The chunks of code above could usefully be combined in a user-defined function.
The first three lines in the second chunk set up the labels to be used in the
subsequent plot; this will be discussed further after the ggplot2 code has been
presented. In the plot function the asp = 1 argument ensures that the aspect
ratio is 1; everything after that is presentational. If the axis labels include the
inertias, which is recommended, these need to be obtained. When the initial
correspondence analysis is undertaken you can type CA. It will generate a lot of
output you may not wish to see, but the first part gives the principal inertias,
which is what’s needed. Alternatively, plot(CA) will produce a plot labelled with
the inertias for the first two axes.
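If you'd rather not read the percentages off the printed output, they can be computed directly. The base-R sketch below (using a small made-up table, not the brooch data) shows where the numbers come from: the principal inertias are the squared singular values of the matrix of standardised residuals. For a ca object the singular values are stored in CA$sv (check str(CA) if in doubt), so 100 * CA$sv^2 / sum(CA$sv^2) should give the same percentages.

```r
# A small made-up contingency table (illustration only, not the brooch data)
N <- matrix(c(20, 5, 3,
               4, 18, 6,
               2, 7, 25), nrow = 3, byrow = TRUE)
P <- N / sum(N)                           # correspondence matrix
rmass <- rowSums(P); cmass <- colSums(P)  # row and column masses
# standardised residuals; their singular value decomposition underlies CA
S <- diag(1/sqrt(rmass)) %*% (P - rmass %o% cmass) %*% diag(1/sqrt(cmass))
inertia <- svd(S)$d^2                     # principal inertias
pct <- round(100 * inertia / sum(inertia), 1)
xlab <- paste0("Axis 1 (inertia = ", pct[1], "%)")
ylab <- paste0("Axis 2 (inertia = ", pct[2], "%)")
```

As a check, the total inertia computed this way equals the chi-squared statistic for the table divided by the grand total.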
The arrows function is optional; not all authors display analyses in this way,
though we find it can make interpretation easier. If it’s omitted plots somewhat
similar to those of Figure 12.2 will be obtained and this kind of presentation is
often seen in the literature.
Figure 12.4: Enhanced correspondence analyses for the Romano-British brooch data set using the ca function and ggplot2; (a) the CA plot based on the first two axes; (b) a plot based on the first and third axes.
The ggplot2 code, the output from which can be seen in Figure 12.4, can be broken into two chunks in addition to that needed to get the plotting coordinates in the first place. The three chunks involved should be embedded, sequentially, in a user-defined function. The first chunk below is a bit tedious, but necessary to construct the data frames that are subsequently needed. It is not always immediately obvious how to do this.
Period <- data.frame(df$Period)
dfscores <- data.frame(cbind(Period, x1, x2, x3))
names(dfscores) <- c("Period", "X1", "X2", "X3")
dfscores$Period <- factor(dfscores$Period,
levels = c("IA", "AT", "CN", "FL", "TH", "AS", "L"))
colnames <- data.frame(names(regions))
colscores <- data.frame(cbind(y1, y2, y3))
dfcols <- cbind(colnames, colscores)
names(dfcols) <- c("Region", "Y1", "Y2", "Y3")
dfcols$angle <- with(dfcols, (180/pi) * atan(Y2/Y1))
Essentially the extracted coordinates from the original CA need to be combined with the labelling information to be used in two separate data frames, one
corresponding to rows and the other to columns. Some of the coding is needed to
ensure that what’s being created is actually a data frame; the columns need to be
named in a suitable way for subsequent use. The factor command on the fourth
line is needed if you want the legend to reflect the temporal order of the periods; otherwise they are arranged alphabetically. The last line is needed if you want to have appropriately angled column markers; if you don't, it can be dispensed with.
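A self-contained sketch of the two 'extras' just described, using made-up labels rather than the brooch data frame:

```r
# Made-up period labels, deliberately out of chronological order
Period <- c("FL", "IA", "AT", "IA", "CN")
f <- factor(Period, levels = c("IA", "AT", "CN", "FL"))
levels(f)   # legend order now follows the levels given, not the alphabet

# The angle (in degrees) used to rotate a column marker at (Y1, Y2)
# so that its label lies along the direction of the arrow
Y1 <- 1; Y2 <- 1
angle <- (180/pi) * atan(Y2/Y1)   # 45 degrees for this point
```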
The code to produce the CA plot requires the three packages loaded in its first line to be available. It's stored in an object p which the final line plots.
library(ggplot2); library(grid); library(scales)
p <- ggplot(dfscores, aes(x = X1, y = X2)) +
  xlab("Axis 1 (inertia = 53.8%)") + ylab("Axis 2 (inertia = 21.1%)") +
  coord_fixed() +
  geom_point(data = dfscores, aes(x = X1, y = X2, shape = Period,
             colour = Period), size = 3) +
  geom_segment(data = dfcols, aes(x = 0, y = 0, xend = Y1*.8, yend = Y2*.8),
               arrow = arrow(length = unit(1/2, "picas")),
               color = muted("red")) +
  geom_text(data = dfcols, aes(label = Region, x = Y1, y = Y2,
            angle = angle), color = "darkred", size = 5) +
  scale_colour_manual(values = c("darkred", "brown", "green3", "black",
                                 "blue", "red", "magenta", "red")) +
  scale_shape_manual(values = c(16,3,7,8,15,2,17,16)) + xlim(-3,3) +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 14)) +
  theme(panel.grid.minor = element_blank()) +
  theme(legend.position = "top", legend.text = element_text(size = 12),
        legend.title = element_text(size = 14),
        legend.key.height = unit(.8, "cm"),
        legend.key.width = unit(.8, "cm"))
p
The full code for Figure 12.4a is given; much of it is presentational. For Figure 12.4b, in the second chunk of code, change X2 to X3 and Y2 to Y3 where they occur, and change ylab("Axis 2 (inertia = 21.1%)") to ylab("Axis 3 (inertia = 15.5%)"). The coord_fixed() function produces an aspect ratio of 1; the xlim(-3,3) command is designed to produce a square plot rather than the rectangular one otherwise obtained. If you're happy to accept default colours and horizontally aligned column markers without the arrows, which is perfectly acceptable, the code could be simplified quite a lot, but we were curious to see what was needed to obtain the kind of plots illustrated in Figure 12.4.
The scale_shape_manual command is needed. If there are more than six
categories to be represented, and there are seven here, then without it a stern
message to the effect that it ‘becomes difficult to discriminate with more than 6
categories . . . consider specifying shapes manually if you must have them’ appears.
This rather overstates the problem, but a judicious choice of shapes/symbols that
takes into account the configuration of points on the plot may be needed. It’s
possible to use the same symbol twice with suitably contrasting colours. In general
it is simpler in ggplot2 to specify colours and shapes via the colour and shape
arguments in the aesthetics (aes). In base R it’s necessary to create variables that
govern the colour and symbol shape and this can be tedious. It’s simplified in the
present case, when defining the plotting symbols in PCH which exploits the rep
function, because the data are grouped by period.
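The rep trick works because the rows are sorted by period, so one symbol code per group can be expanded into one code per observation; a minimal sketch with made-up group sizes:

```r
# Three made-up groups of sizes 2, 3 and 1, in sorted order
sizes <- c(2, 3, 1)
PCH <- rep(c(16, 3, 7), sizes)   # one symbol code per observation
PCH                              # 16 16 3 3 3 7
```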
12.4.3 Interpretation
We’ve already noted that, at least for the approach adopted here, the base R plot
is somewhat simpler to implement and we often prefer to do CA in this way,
even when a presentation in colour is feasible. We don’t think, in this instance,
that ggplot2 confers any particular advantages either from a presentational or interpretational perspective – the interpretation is the same using either approach.
The plot based on the first two axes shows a clear south-north opposition.
Other regions plot between them. It is possible to read a kind of geographical progress with time in the plots, starting in the south, running through the diagonal East Anglian/Midlands/South-West band, and arriving in the north.
If the relationship between the column and row markers is inspected it is easy to
see that the northern end is dominated by the later periods and the southern end
by the earlier periods, though the separation is not perfect. The CA was merely
the first stage of analysis in Cool and Baxter (2016a) and the other methodologies
used in that paper show the regional divisions more clearly.
The plot of the first against third axis is interesting. The third axis accounts
for 15.5% of the inertia, which is not negligible; the first three axes account for
about 90% of the total inertia. What’s interesting is the positioning of East Anglia on the plot, which separates out, and thereby hangs a tale. The Portable
Antiquities Scheme (PAS) in Britain encourages the reporting and recording of
metal-detected finds and promotes interaction between archaeologists and metal-detectorists. Coins apart, brooches are one of the most commonly reported finds.
The PAS was in its infancy at the time Mackreth closed his corpus so, for the
most part, metal-detected finds are poorly represented in the corpus. The exception is East Anglia where a considerable fraction of the relevant corpus was
metal-detected. This is because archaeologists in the region pioneered collaboration with metal-detectorists well before the PAS came into being.
The interesting bit is that there is a considerable, not always appreciated, bias
in the kind of brooch recovered, or at least reported, by metal detecting. This
is discussed at length, and demonstrated, in Cool and Baxter (2016b). Roughly,
bulky and/or gaudy brooches are much more likely to be reported than slighter,
less glamorous types, which are, however, common. Furthermore, there is a temporal dimension in that the bulky/gaudy types tend to occur later than their less
‘attractive’ forebears. It means that there is almost certainly a serious bias in the
East Anglian assemblage related both to the types of brooch recorded and temporal patterns in the data. The end result is that any interpretation of the East
Anglian data in relation to other regions has to be undertaken with considerable
caution. Cool and Baxter (2016b), in addition to illustrating this in detail, examine more recent PAS data on brooches – the issue remains and poses questions
about how such data should be used to address archaeological questions that go
beyond the purely descriptive. The critical appraisals of PAS data that we are
aware of – mostly undertaken by the PAS itself or in collaboration with it – have
been largely concerned with the nature of geographical bias in the locations of
metal detecting rather than bias in the nature of artefact types of a particular
kind that are recovered.
12.5 Glass ‘counters’ from Pompeii
In Section 9.2.4, Table 9.3, data on the colours of glass ‘counters’, found in excavations of Insula VI.1, Pompeii, Italy, were presented. The data appear in Cool
(2016), the main purpose of which was to question the common assumption that
such ‘counters’ were usually gaming pieces, suggesting that it was equally, if not
more, likely that they were used as elements of ‘interior decoration’. Some of the
evidence for and thinking behind this is discussed in Section 10.4.1.
In the present instance the focus is on the variation in the use of colour over
time, regardless of function. Correspondence analysis was used in Cool (2016) to demonstrate that there was such variation, and this analysis is repeated here. Section 9.2.4 was
written in a spirit of some irony to show that presenting the data using pie charts
rather than CA, as suggested by a referee, was not the best idea anyone has ever
had.
More detailed information is given about the original data set here, since it
illustrates the kinds of decisions that may need to be made before undertaking
a CA. In total 566 glass counters were recovered, making them the commonest small find from the excavation other than coins. The counters were found in a considerable number of colours, ‘rationalised’ in Cool (2016) into 20 categories.
Multi-coloured counters formed 6% of the assemblage with the remainder being
monochrome. Opaque colours accounted for just under 6%, counters which appeared black for just under 10% and the remainder were monochrome translucent
glass. Of those, blue/green was the commonest (19%), followed by yellow/brown
with the very dark shades accounting for 4% and the lighter shades, often variously
described as amber or honey-coloured, for 13%. Other common colours (8-10% of
the assemblage) were deep blue, pale green and colourless.
An early decision was made to use only those colours for which there were 10 or
more counters, reducing the number of colour categories used to 13. Phases I to V
in Table 9.3 correspond to the first century BC (C1BC), Augusto-Tiberian (AugT),
Augusto-Neronian (AugN), the period between the AD 62 earthquake and the
eruption of Vesuvius that destroyed Pompeii in AD 79 (Post 62), and the eruption
level itself for which the term ‘Modern’ is used as a proxy. This can be regarded
as a chronological sequence with some overlapping arising from uncertainty in
some of the phasing. The last two phases are secure; the Augusto-Tiberian phase
overlaps with those either side of it. Omitted entirely is the earliest third to
second century BC assemblage. This was comparatively small, atypical in several
respects, and there is uncertainty about some of the dating evidence for it. The end
result, Table 9.3, subjected to CA was therefore a result of informed archaeological
judgment and statistical considerations related to sample sizes. Eventually 452
counters were used in the analysis, presented in Figure 12.5.
[Figure 12.5 shows row markers for the 13 colour categories and column markers for the five phases (C1BC, AugT, AugN, Post 62, Modern), the latter joined in chronological order; the axes are labelled axis 1 (inertia = 49.1%) and axis 2 (inertia = 29.6%).]
Figure 12.5: A correspondence analysis of glass counter colours by phase from Insula
VI.1, Pompeii, Italy.
Presentation differs from that exploited so far in this chapter and is designed
to emphasise the fact that there is chronological variation in the use of colour.
In archaeological applications CA can be used for pattern recognition; perhaps
most frequently it is used for the purposes of seriation – that is, patterns are
sought in the data that can be clearly interpreted in terms of a chronological
sequence. Section 12.6 will illustrate this in more detail; Section 12.7 provides an example where a seriation is interpretable in spatial rather than chronological terms. Broadly speaking – and it is a demonstrable mathematical thing – if the
data seriate then the row and/or column markers should display a ‘horseshoe’
shaped pattern; interpretation as a chronological sequence, or whatever, has to be
based on criteria external to the CA itself.
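The 'demonstrable mathematical thing' can be seen in a small simulation. The sketch below (entirely artificial data, assumed purely for illustration) builds a table in which each 'period' shares types only with its chronological neighbours, carries out a CA from first principles via the singular value decomposition, and checks that the ordering of the rows on the first axis recovers the built-in sequence; plotting the first two axes for such data produces the horseshoe.

```r
# An artificial, perfectly 'seriated' table: row i has counts only in
# columns i and i + 1, so rows overlap only with their neighbours
n <- 8
N <- matrix(0, n, n + 1)
for (i in 1:n) N[i, i:(i + 1)] <- c(10, 6)

# CA from first principles (as in any standard account of the method)
P <- N / sum(N)
rmass <- rowSums(P); cmass <- colSums(P)
S <- diag(1/sqrt(rmass)) %*% (P - rmass %o% cmass) %*% diag(1/sqrt(cmass))
dec <- svd(S)
row1 <- dec$u[, 1] / sqrt(rmass)   # standard row coordinates, axis 1
ord <- order(row1)                 # 1..n or n..1: the seriation order
```

The sign of a singular vector is arbitrary, so the recovered order may run either way; the point is that axis 1 sorts the rows into the known sequence.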
Often, and as in Section 12.6, the focus is on the row markers. In the present
case the chronological sequencing of the columns is known. If you ‘join up the dots’
corresponding to the column markers, in chronological order, and get a ‘horseshoe’
shaped pattern without any crossovers of lines it’s interpretable as a chronological
seriation, and that’s what’s happening in Figure 12.5. It wasn’t expected.
This enables the patterning associated with the colours to be interpreted. Yellow/green counters are typical of the first century BC assemblage, moving on to colourless and green-tinged colourless ones in the following Augusto-Tiberian period, with strongly coloured translucent and multi-coloured ones more prominent
moving into the longer Augusto-Neronian period. Black is clearly becoming more
important as time progressed, being typical of the post-earthquake period, whilst
blue/green and opaque white characterise the modern deposits that can be taken
as a proxy for the eruption level. The inertias of the first two axes sum to about 79%, which is satisfactory. The reasons for the chronological shift are discussed in
Cool (2016); it suffices here to note that it exists and explanations can be advanced
for it.
The code used to produce Figure 12.5 is given below as it introduces features not covered earlier in the text. A data frame countercolour was created from Table 9.3; its first column, countercolour[,1], contains the shortened names of the colours that will be used for labelling purposes. The second line of code executes a CA of the count data held in countercolour[,-1], with the following two lines extracting the coordinates to be used for plotting purposes.
library(ca)
countersCA <- ca(countercolour[,-c(1)])
x1 <- countersCA$rowcoord[,1]; x2 <- countersCA$rowcoord[,2]
y1 <- countersCA$colcoord[,1]; y2 <- countersCA$colcoord[,2]
plot(x1, x2, asp = 1, type = "n", xlab = "axis 1 (inertia = 49.1%)",
ylab = "axis 2 (inertia = 29.6%)", main = "",
cex.axis = 1.2, cex.lab = 1.3, cex.main = 1.3, xlim = c(-3.5,3.5))
text(x1, x2, countercolour[,1], col = "black", cex = 0.8)
text(y1, y2, c("C1BC", "AugT", "AugN", "Post 62", "Modern"),
col = "red", cex = 1.3)
segments(y1[1:4], y2[1:4], y1[2:5], y2[2:5], lwd = 2)
In practice we’d embed this within a function, fig12.5 say, and obtain the plot using fig12.5(). In the plot function the argument type = "n" suppresses
any plotting of points which are then added, with appropriate labelling, using the
text functions. These add, respectively, the row and column markers, the first two
arguments specifying the plotting positions and the third the labels to be used.
The segments function ‘joins up the dots’ corresponding to the column markers,
which have been arranged in chronological order. It’s perhaps best to look at the
help facility in R for segments if you feel moved to use it. Essentially it’s necessary
to specify the (x, y) coordinates for each pair of points you wish to unite with a
line segment.
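A way of seeing what segments is doing here: the column markers are in chronological order, so each segment runs from point i to point i + 1; the ‘from’ coordinates are the vector with its last element dropped and the ‘to’ coordinates the vector with its first element dropped. A self-contained sketch with made-up coordinates:

```r
x <- c(0, 1, 2, 1, 0)    # five markers in known (chronological) order
y <- c(0, 1, 0, -1, 0)
xs <- head(x, -1); ys <- head(y, -1)   # segment start points
xe <- tail(x, -1); ye <- tail(y, -1)   # segment end points
pdf(tempfile())           # throwaway device so the sketch runs anywhere
plot(x, y, type = "n")
segments(xs, ys, xe, ye, lwd = 2)      # same idea as x[1:4], x[2:5] above
dev.off()
```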
In the present application the construction of the data frame is such that an approximate horseshoe pattern should be obtained if the pattern can be interpreted
as a seriation, and that’s what’s happening. It should perhaps be noted that, to the best of our recollection, this kind of presentation is less common in the archaeological literature than the other forms of display used in this chapter. Often, and as in the next two sections, the emphasis is on the pattern of the row markers, and we suspect (not entirely without reason) that a seriation only gets published if it is
obviously manifest in the form of a clear horseshoe that doesn’t need additional
presentational embellishment to clarify the picture. This is discussed further in
the next section, which includes an example of CA that is much more typical of
seriation studies.
12.6 Anglo-Saxon male graves and seriation
This section is designed to illustrate the more thoroughgoing use of CA for the
purposes of chronological seriation, in this case of 272 male Anglo-Saxon burials
characterised by the presence or absence of 80 types of grave goods, coded as 1 or 0.¹ The section is based on Section 9.5 of Baxter (2015b) which, in turn, drew on a more extended account in Baxter (2014a).
The use of CA for the purposes of seriation in Anglo-Saxon burial studies is
quite common; typically there is a clear expectation that a chronological seriation
exists. Stratigraphy enabling the relative dating of graves may be unavailable, so methods used elsewhere for this purpose – relative dating being what is attempted in CA applications – are not readily applied. It is expected that changing fashions associated with the grave goods will result in graves that are temporally close to each other having assemblages more similar to each other than to temporally more distant
graves. It can be shown that in such situations CA is expected ‘to work’.
The data were collected for, and analysed in, Bayliss et al. (2013). Analysis, to be found in Chapter 6 there, proceeded iteratively, over 50 pages or so.
Rather than entering all the data at the start, only a subset of the types was used
¹ Given the size of the data set it is not reproduced here but can, with a little effort, be extracted from the Archaeology Data Service archives at http://archaeologydataservice.ac.uk/archives/view/aschron_eh_2013/
initially. Graves and types were then omitted if their representation was unsatisfactory; further types added; and the process repeated until a seriation judged
to be satisfactory was achieved. There is some further ‘tweaking’ at the end that
is discussed shortly. Figure 12.6 emulates the preferred seriation of Bayliss et al.
(2013: 286) with some embellishment.
[Figure 12.6 shows the plot titled ‘CA (row plot) of Anglo-Saxon male graves’, with dated graves marked by phase (p, q, r, s, t), on axes labelled axis 1 and axis 2.]
Figure 12.6: A seriation of 6th and 7th century Anglo-Saxon graves from Britain.
Dashed vertical lines are phase boundaries. (Data source: Bayliss et al., 2013.)
Ignoring, for a brief moment, the embellishments (coloured dots, dashed vertical lines, and legend), this is absolutely typical of what gets published in archaeological seriation studies. You get as good a horseshoe pattern as you can reasonably
expect; it can be interpreted unambiguously as a chronological sequence; nobody
is going to argue with you. The ‘philosophy’ that underpins the way the CA was
conducted follows that espoused in Jensen and Høilund Nielsen (1997), the latter
being a co-author of the book under discussion. A condensed summary of the
‘philosophy’ is that an underlying seriation is assumed to exist at the outset; analysis proceeds iteratively as just described, omission of graves and types that do
not conform to a ‘seriation pattern’ being justified by the assumption that such
a pattern exists. Ideally this is also rationalized on archaeological grounds; for
example, some grave goods are of a type of some antiquity (compared to other
types in the burial assemblage) at the time the burial took place. As described
thus there is a resemblance to the ‘peeling’ process of Cool and Baxter (1999).
There are, however, differences. The Jensen/Høilund Nielsen approach assumes
a particular type of structure in the data, whereas Cool/Baxter do not, and an
emphasis is on identifying ‘outliers’ that ‘conceal’ the seriation. Cool and Baxter,
by contrast, envisage the use of CA for purposes other than seriation, where the
emphasis is on identifying patterns in the data, and removing the more obvious
structure to reveal more subtle features.
It’s possible to both understand the reasoning behind this approach and be
uneasy about it. Its application goes some way towards explaining why – if the
concluding paragraph of the previous section is at all valid – only ‘convincing’
seriations tend to see the light of day. It’s assumed that a seriation exists; artefact
types and cases that might cast doubt on this are excluded from a final presentation, so ending up with an ‘unconvincing’ published seriation can be quite difficult.
Some unease about the analytical approach described stems from the fact that it
is not always clear how much of the data originally collected have been excluded
from final publication because they don’t ‘fit in’, and why. One can be left wondering what any omitted data might usefully have to say, including doubts about the
apparent ‘certainty’ of a seriation. Ideally we’d like to see ‘warts and all’ presentations of seriations, in addition to the chosen picture, coupled with a full discussion
of the reasons for omissions – this might actually tell you something beyond the
seriation itself. We’re realistic enough to recognise that the exigencies of academic
publication may often militate against ‘full disclosure’ even when authors have
fully engaged with the issues raised in the course of their research.
As an historical aside, Scandinavian scholars are usually credited with introducing CA to the English-speaking (and reading) archaeological world, or at least popularising the method (Bølviken et al. 1982; Madsen 1988a,b). French archaeologists, publishing in that language, were on the scene earlier. Much of this was associated with the work of François Djindjian and collaborators who, if memory serves,
advocated an analytical approach not dissimilar to Jensen and Høilund Nielsen’s
(1997) at an earlier date. We’re not disputing the validity of what’s done, but you
would sometimes like a little more information in published applications.
Returning to the application that is the subject of this section, it has so far
been presented as a typical application of CA which, in some aspects, it is. The
embellishments in Figure 12.6, so far ignored, evidence a more singular story.
In addition to the kind of information normally used for this kind of exercise, 48
radiocarbon dates were available for 40 of the male graves, indicated in Figure 12.6.
This is the preferred seriation in Bayliss et al.² The seriation provides a relative
² Their Figure 6.49 on p. 286. Figure 12.6 reproduces this closely, but with some embellishment not in the original; R was used, rather than the software used in the book.
ordering of the graves and is used to suggest phasing for the data. What is novel
is the way Bayesian modelling is used in conjunction with the seriation to provide
date estimates for the phase boundaries. In Figure 12.6 the ordering of the dated
graves is determined by the ordering of the graves on the first CA axis. Some of
the phasing is rather ‘fine’ and some phases contain few dated graves.³
What the phasing does is provide prior information about the relative chronological position of subsets of the dated graves that feeds into the modelling process.
Not all of the original data set contribute to the seriation and some of the graves
originally omitted have associated radiocarbon dates. The ‘almost’ final seriation
provides partial information on the relative chronology of these graves that allows
them to be fitted into what becomes the preferred seriation.
The combination of techniques used converts the relative chronology provided
by the seriation into an estimated absolute chronology. This provides finer dating
evidence for the sixth to seventh centuries than that previously available, including an estimate of the cessation of furnished Anglo-Saxon burials that places it in the later seventh century rather than the earlier eighth century, as previously accepted.
This is an eye-catching conclusion (for Anglo-Saxon scholars at least) and is a
good example of innovative statistical methodology leading to archaeologically
important interpretations.
There are open questions, as the authors acknowledge. The preferred seriation
is based on calibrated radiocarbon dates that assume fish is not an important
component of diet. Were fish consumption non-negligible – and the evidence in
the book allows this reading – the estimated end-date would be later, though still
earlier than the previously presumed end. For some of the later, and dated, burials,
there is also an unresolved conflict with dates derived from numismatic evidence
which are later. Methodologically the way in which the relative dating provided
by the CA is interpreted in absolute terms is far superior to other work we are
aware of on similar lines.
The methodological development is, therefore, interesting and leads to substantively important archaeological conclusions. Unfortunately the book is, to put it
kindly, ‘heavy going’ in places; this is particularly true of Chapters 6 and 7 which
make considerable use of statistics and are central to the book. A major problem is
that far too much detail, which obscures the essentials, is provided; Baxter (2014a,
b) provides a commentary that attempts to separate out the essential material.
For a frank assessment of the book from an informed archaeological perspective
see Hills (2014).
³ Given the data the CA is straightforward and is all that is reported here other than labelling the plot with the phased dates. The Bayesian analysis is methodologically and computationally complex and best left to experts. We understand that some of the analyses reported in the book took, literally, weeks to run so it’s not something you should attempt at home.
12.7 Flavian drinking-vessels
Usually CA is presented as a method for analysing tables of counts. It can, in
fact, be applied to any table of non-negative data and sometimes is, though the
‘theoretical’ justification for data other than counts may be less clear-cut and data
may be transformed before application. One purpose of this final example is to
illustrate the application of CA to fractional counts in the form of estimated vessel
equivalents (EVEs). The example is based on drinking vessels, current during the
Flavian period in England. A selection of vessels is illustrated in Figure 12.7.
Figure 12.7: (a) A Hofheim cup; (b) a facet-cut beaker; (c) a circus cup. (Sources:
(a) Courtesy of the Metropolitan Museum, New York. Cesnola Collection. Purchased
by subscription, 1874-76. (b) Courtesy of the J. Paul Getty Museum Los Angeles. (c)
Courtesy of the Metropolitan Museum, New York. Gift of Henry G. Marquand, 1881
(from Montagnole, France).)
Table 12.2, originally published in Cool and Baxter (1999), shows glass vessel
EVEs for seven types of drinking-vessel, current during the Flavian period in
England, from 10 sites ordered by their north-south orientation, with three from
the north, two from the Midlands, and five from the south.
In pottery studies, in Britain at least, since the seminal work of Orton (1975,
1993) it has been recognised that EVEs are the best way to quantify fragmented
data where the aim is to effect comparisons across assemblages. The merits of
EVEs are sometimes honoured more in the breach than the observance; they seem
not to be much used in pottery studies on the European mainland or the USA.
It means that a lot of pottery studies based on fragmented data that use other
methods do not lend themselves well to quantitative comparative studies. This
comment applies to measures such as fragment count, minimum number of vessels
represented and so on.
Moreno-García et al. (1996) extended the thinking behind EVEs to animal
bone counts. This was based on a characterisation of bones that divided them
into recognisable zones. The equivalent of an EVE, for a single bone, was the
proportion of recognisable zones present. A pre-publication view of this was the
Table 12.2: Assemblage profiles (EVEs) for seven drinking vessel types, from the Flavian period in England. ‘Order’ is geographical from north to south. ‘Sport’, ‘Rib’ and ‘Hof’ refer to mould-blown sport cups, mould-blown ribbed cups and Hofheim cups; ‘Tall’, ‘Ind’, ‘FC’ and ‘WC’ are beaker types, tall mould-blown, indented, facet-cut and wheel-cut respectively (Source: Cool and Baxter, 1999).

Site         Date    N-S Order  Sport  Tall  Rib  Hof  Ind  FC   WC   Total EVE
Carlisle     71-105  1          0.8    0.2   1.0  0    0.2  0.2  0.4  2.8
York         71-100  2          0      0     0.2  2.2  0    0    0.6  2.4
Castleford   71-100  3          0      0.4   1.4  0    0.2  0.2  0.2  2.4
Wroxeter     80-120  4          0      0     0.6  0.4  0    0.2  0.2  1.4
Caersws      69-100  5          0      0.2   0.6  2.2  0    0    0    3.0
Colchester   65-100  6          0.8    0     0    2.4  0.6  0.4  0.2  4.4
Gloucester   70-98   7          0      0.6   0    2.8  0    0    0    3.4
Caerleon     74-100  8          0      0     0    1.2  1.0  1.8  0.8  4.8
London       65-100  9          2.0    1.0   0    2.8  3.0  3.2  1.0  13.0
Fishbourne   75-100  10         0      0.2   0    0    0.4  0.8  0    1.4
inspiration for Cool’s (1994) development of glass vessel EVEs (see also Cool and Baxter 1996). Figure 12.8 shows the result of applying CA to the data of
Table 12.2.
[Figure 12.8 shows two panels: ‘CA (row plot) of Flavian drinking vessels’, with sites coded North/Midlands/South, and ‘CA (column plot) of Flavian drinking vessels’; both are plotted on Dimension 1 (inertia = 41.4%) and Dimension 2 (inertia = 37.0%).]
Figure 12.8: Correspondence analysis row and column plots for the Flavian drinking-vessels of Table 12.2.
The manner of presentation chosen illustrates the use of adjacent rather than
superimposed plots. Both plots are divided into quadrants, so that a comparison
between them is much easier. The code used to obtain the plots is given towards
the end of the section.
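The base R mechanics of the layout are simple enough to sketch here (made-up points, not the code used for Figure 12.8): par(mfrow = c(1, 2)) places two plots side by side, and abline(h = 0, v = 0) draws the quadrant divisions on each.

```r
pdf(tempfile())                  # throwaway device; layout sketch only
op <- par(mfrow = c(1, 2))       # two panels, side by side
x <- c(-1, 0.5, 1, -0.3); y <- c(0.4, -1, 0.2, 0.9)   # made-up scores
plot(x, y, asp = 1, main = "row plot")
abline(h = 0, v = 0, lty = 2)    # divide the panel into quadrants
plot(y, x, asp = 1, main = "column plot")
abline(h = 0, v = 0, lty = 2)
par(op)                          # restore the previous layout
dev.off()
```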
With the exception of Site 1 (Carlisle) an almost perfect seriation of the data is obtained in the row plot. With sites labelled by their north-south position, Site 2 (York) sits apart from others in its region but, though not perfect, the seriation
admits a spatial interpretation. Reading around the horseshoe, sites from the
north and Midlands are perfectly separated from southern sites on the first axis.
Cool and Baxter (1999: 90–91) can be referred to for a discussion of these
results, which admit more than one interpretation. One is that the north is characterised by earlier vessel forms and the south by later ones, and this may be the
‘rule rather than the exception’, contradicting ‘traditionally’ held views that the
earlier forms found in the north and Midlands were ‘isolated survivals’. An alternative interpretation is that drinkers in the north/Midlands favoured low cups
while those in the south preferred tall beakers.
Ribbed cups only appear in the northern and Midland assemblages, all of which
plot, along with these cups, to the left. Hofheim cups are not confined to any
particular region, but are particularly prominent in Sites 2, 5 and 7 (York, Caersws, Gloucester) which accordingly plot in a similar region towards the bottom.
The indented and facet-cut beakers are particularly prominent in the three most
southerly sites, 8, 9 and 10 (Caerleon, London, Fishbourne), and also plot closely
in the north-east quadrant.
The analysis of these data raises an interesting and more general question. We
are aware that Cool and Baxter (1999) has been used for teaching purposes in a
leading British university archaeology department where students were invited to
criticise it. We’ve been told (in more than one personal communication) that students often comment that the numbers (of EVEs) on which the analysis is based are small, and ask whether this raises questions about the validity of any interpretation based on them. Both the observation and the question are legitimate. The
‘generality’ of the question arises because in archaeological data analysis one has
to work with the data to hand, and the sample sizes involved may often be quite
‘small’ according to the desiderata sometimes laid down for what constitutes an
adequate sample (not, incidentally, a question that has an easy answer). It can
also be argued that most archaeological data analysis is concerned with pattern
recognition in some form or other, using the terms ‘data’ and ‘pattern’ in a very
broad sense.
It is difficult to put the following thought into words, but it’s along the lines
that if there is obvious pattern in a data set and if a plausible archaeological
interpretation can be advanced for that pattern (the emphases are important here)
then this transcends any purely statistical concerns one might have about sample
sizes. That is, intelligent archaeological pattern recognition and interpretation may
‘trump’ statistical formalism when the two appear to conflict because of sample
size doubts.
Something of this kind, in more general terms, is expressed in Section 3.15 of
Doran and Hodson (1975). This was around the time when quantitative methodology began to be widely applied in archaeology, and those involved were grappling with the conflicting requirements of archaeological and statistical inference.
Some scholars rejected the use of formal statistical methodology entirely while
proponents of what was then the ‘New Archaeology’ rather exaggerated what was
possible with such methodology. Both positions are extreme; statistics are useful
but, in an applied context, should be subservient to archaeology.
The patterns in Table 12.2 revealed by the CA are undeniable. The numbers
are, however, small so the possibility exists that future discoveries may overturn
some of the interpretations offered. We have yet to see it in print, but are aware
that the paper has been criticised on the basis of finds made in the south, subsequent to Cool and Baxter (1999), that are supposed to contradict the suggestion
that drinkers in the south preferred tall vessels. It would be necessary to quantify
the new data and subject it to the kind of statistical analysis that would set it in the
context of the data from other regions to assess this claim. If, hypothetically, the
‘new’ glass amounts to 2 EVEs of ribbed bowls split equally between the southern
sites (where previously there were none), the interpretation does not change. To
change the broad picture you would need somewhat larger EVE values spread
across small vessel types in the south and/or more tall vessels in the north/Midlands.
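The arithmetic behind this hypothetical is easy to check directly. The sketch below uses an invented two-region table of EVEs (the numbers are ours, not those of Table 12.2): adding 2 EVEs of short vessels to the southern row shifts its profile, but tall vessels remain dominant there, so the broad picture stands.

```r
# Robustness check on a made-up table of EVEs (hypothetical numbers,
# not those of Table 12.2): regions as rows, vessel classes as columns.
tab <- matrix(c(6, 1,    # north/Midlands: mostly short cups
                1, 7),   # south: mostly tall beakers
              nrow = 2, byrow = TRUE,
              dimnames = list(c("North", "South"), c("Short", "Tall")))
profile <- function(m) m / rowSums(m)     # row profiles (proportions)
tab2 <- tab
tab2["South", "Short"] <- tab2["South", "Short"] + 2  # the 'new' 2 EVEs
profile(tab2)["South", ]                  # tall vessels still dominate
```

With these numbers the southern profile moves from (0.125, 0.875) to (0.3, 0.7); the ordering of vessel classes within the region is unchanged.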
The R code used to produce Figure 12.8 follows.
fig12.8 <- function() {
  library(ca)
  EVE <- Flavian[, 2:8]   # extract EVE counts
  CA <- ca(EVE)           # undertake the CA
  # extract plotting coordinates
  x1 <- CA$rowcoord[, 1]; x2 <- CA$rowcoord[, 2]
  y1 <- CA$colcoord[, 1]; y2 <- CA$colcoord[, 2]
  # set up plot symbols and colours to be used
  Col <- c(rep("pink", 3), rep("yellowgreen", 2), rep("skyblue", 5))
  Sym <- c(rep(16, 3), rep(17, 2), rep(15, 5))
  # row plot
  win.graph()
  plot(x1, x2, asp = 1, pch = Sym, col = Col, cex = 3,
       main = "CA (row plot) of Flavian drinking vessels")
  text(x1, x2, 1:10, cex = 1)
  abline(h = 0, v = 0, lty = 2, lwd = 1.5)
  legend("topright", c("North", "Midlands", "South"),
         col = c("pink", "yellowgreen", "skyblue"),
         pch = c(16, 17, 15), cex = 1.2, bty = "n")
  # column plot
  win.graph()
  plot(y1, y2, asp = 1, type = "n",
       main = "CA (column plot) of Flavian drinking vessels")
  text(y1, y2, names(EVE), cex = 1.5)
  abline(h = 0, v = 0, lty = 2, lwd = 1.5)
}
fig12.8()
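One portability caveat: win.graph(), used above to open each plotting window, exists only on Windows. A small helper along the following lines (our own suggestion, not part of the original listing) works on any platform by falling back to dev.new():

```r
# Open a fresh graphics window portably; win.graph() is Windows-only,
# while dev.new() works everywhere. This wrapper is our suggestion,
# not part of the fig12.8 code above.
new_window <- function() {
  if (.Platform$OS.type == "windows") win.graph() else dev.new()
}
```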
This introduces one or two new features that have not previously been used.
These are not specific to CA and could equally well have been introduced in
Chapter 8. The site and EVE count data, other than the total, are read into a file
Flavian (i.e. only columns 1 and 4–10 are used). The code is given in summary
form, omitting presentational aspects covered elsewhere (e.g., text sizes) unless it
seems useful to make it explicit. The first two blocks of code set up the information
to be used in the plots. The idea was to use different plotting symbols, differently
coloured to distinguish between regions, then label the points in a manner that
both distinguishes between sites and is informative about their geographical location (i.e. the ‘N-S Order’ in Table 12.2). To do this it’s best to use pale colours,
created in Col. If you want to have the site labels lying within the symbols you
need to play around with their sizes, so we’ve made explicit the cex arguments
used. The abline function adds the reference grid. Dashed (lty = 2) horizontal
and vertical lines (h=0, v=0) that intersect at the (0, 0) point are created, lwd
controlling the line width. In the column plot, type = "n" suppresses the printing of points and the text function replaces them with the names of the columns
contained in the data frame EVE.
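These devices can be seen in miniature in the following sketch, which uses made-up coordinates and writes to a temporary pdf file so that it runs on any system:

```r
# Minimal illustration of type = "n", text() and abline(), using
# invented coordinates; output goes to a temporary pdf file.
x <- c(-0.8, 0.3, 0.9); y <- c(0.4, -0.6, 0.2)
pdf(f <- tempfile(fileext = ".pdf"))
plot(x, y, asp = 1, type = "n",          # axes set up, points suppressed
     main = "Labelled points")
text(x, y, c("A", "B", "C"), cex = 1.5)  # labels where points would be
abline(h = 0, v = 0, lty = 2, lwd = 1.5) # dashed reference grid
dev.off()
```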
A final comment is that we have used correspondence analysis for the graphical display of counted data, and this is the dominant use in the archaeological
literature. Where pattern exists it is usually possible to infer, from the positioning
of the row and column markers, what the more important determinants of the
pattern are. This can be done more formally and numerically by calculating a
variety of diagnostic statistics, which is possible using the ca function. Examples
are scattered throughout Greenacre (2007) including an account of the theory.
Shennan’s (1997: 327–341) case study of Danish Bronze Age hoards, which predates the development of the ca package, provides a detailed illustration of the use
of the statistics, and is the earliest archaeological example we have seen. Other
archaeological illustrations include Baxter (2015b: 138–143) and Lockyear (2016),
but the ‘technology’ has not been widely used.
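For readers who want to see what such statistics rest on, the core CA computation can be sketched in a few lines of base R. The count matrix N below is invented for illustration (it is not data from the book); the quantities mirror those that ca() returns.

```r
# CA from first principles via the singular value decomposition,
# applied to a small hypothetical count table (not data from the book).
N <- matrix(c(10, 2, 1,
               3, 8, 2,
               1, 3, 9), nrow = 3, byrow = TRUE)
P <- N / sum(N)                           # correspondence matrix
rmass <- rowSums(P); cmass <- colSums(P)  # row and column masses
S <- diag(1/sqrt(rmass)) %*% (P - rmass %o% cmass) %*% diag(1/sqrt(cmass))
sv <- svd(S)
inertias <- sv$d^2                        # principal inertias
rowcoord <- diag(1/sqrt(rmass)) %*% sv$u  # standard row coordinates
sum(inertias)                             # total inertia = chi-squared / n
```

The total inertia equals the Pearson chi-squared statistic for N divided by the grand total, which gives a quick check that the decomposition is right.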
Bibliography
Albarella, U., Tagliacozzo, A., Dobney, K. and Rowley-Conwy, P. 2006: Pig
hunting and husbandry in prehistoric Italy: a contribution to the domestication debate. Proceedings of the Prehistoric Society 72, 193–227.
Alberti, G. 2013a: An R script to facilitate correspondence analysis: a guide to
the use and the interpretation of results from an archaeological perspective.
Archeologia e Calcolatori 24, 25–53.
Alberti, G. 2013b: Making sense of contingency tables in archaeology: The
aid of correspondence analysis to intra-site activity areas research. Journal
of Data Science 11, 479–499.
Alberti, G. 2015: CAinterprTools: An R package to help interpreting Correspondence Analysis’ results. SoftwareX 1–2, 26–31.
Aldeias, V., Goldberg, P., Sandgathe, D., Berna, F., Dibble, H. L., McPherron, S. P., Turq, A. and Rezek, Z. 2012: Evidence for Neandertal use of
fire at Roc de Marsal (France). Journal of Archaeological Science 39, 2414–
2423.
Baddeley, A., Rubak, E. and Turner, R. 2016: Spatial Point Patterns. Boca
Raton, FL: Chapman & Hall/CRC Press.
Barker, P., White, R., Pretty, K., Bird, H. and Corbishley, M. 1997: The
Baths Basilica Wroxeter: Excavations 1966–90. London: English Heritage
Archaeological Report 8.
Baxter, M. J. 1994: Exploratory Multivariate Analysis in Archaeology. Edinburgh: Edinburgh University Press.
Baxter, M. J. 2003: Statistics in Archaeology. London: Arnold.
Baxter, M. J. 2014a: Anglo-Saxon Chronology I – the male graves.
A commentary on Chapter 6 of Anglo-Saxon Graves and Grave
Goods of the 6th and 7th Centuries AD: A Chronological Framework. https://nottinghamtrent.academia.edu/MikeBaxter/Archaeology (accessed
10/10/2016).
Baxter, M. J. 2014b: Anglo-Saxon Chronology II – the female graves. A commentary on Chapter 7 of Anglo-Saxon Graves and Grave Goods of the 6th and 7th Centuries AD: A Chronological Framework. https://nottinghamtrent.academia.edu/MikeBaxter/Archaeology (accessed 10/10/2016).
Baxter, M. J. 2015a: Exploratory Multivariate Analysis in Archaeology.
Reprint of Baxter (1994) with a new introduction. New York: Percheron
Press.
Baxter, M. J. 2015b: Notes on Quantitative Archaeology and R.
https://www.academia.edu/12545743/Notes_on_Quantitative_Archaeology_and_R
(accessed 18/05/2016).
Baxter, M. J. 2016: Multivariate Analysis of Archaeometric Data: An Introduction. https://www.academia.edu/24456912/Multivariate_Analysis_of_Archaeometric_Data_An_Introduction (accessed 28/07/2016).
Baxter, M. J. and Cool, H. E. M. 2008: Notes on the statistical analysis of
some loomweights from Pompeii. Archeologia e Calcolatori 19, 49–66.
Baxter, M. J. and Cool, H. E. M. 2010a: Detecting modes in low-dimensional
archaeological data. Journal of Archaeological Science 37, 2379–2385.
Baxter, M. J. and Cool, H. E. M. 2010b: Correspondence Analysis in R for
archaeologists: an educational account. Archeologia e Calcolatori 21, 211–
28.
Baxter, M. J. and Cool, H. E. M. 2016: Reinventing the wheel? Modelling
temporal uncertainty with applications to brooch distributions in Roman
Britain. Journal of Archaeological Science 66, 120–127.
Baxter, M. J., Beardah, C. C. and Wright, R. V. S. 1997: Some archaeological
applications of kernel density estimates. Journal of Archaeological Science
24, 347–354.
Baxter, M. J., Cool, H. E. M. and Anderson, M. A. 2010: Statistical analysis
of some loomweights from Pompeii: a postscript. Archeologia e Calcolatori
21, 185–200.
Bayliss, A., Hines, J., Høilund Nielsen, K. H., McCormac, G. and Scull, C.
2013: Anglo-Saxon Graves and Grave Goods of the 6th and 7th Centuries
AD: A Chronological Framework. London: Society for Medieval Archaeology.
Becker, R. A., Chambers, J. M. and Wilks, A. R. 1988: The S Language: A
Programming Environment for Data Analysis and Graphics. Murray Hill,
NJ: Bell Telephone Laboratories.
Binford, L. R. 1978: Dimensional analysis of behavior and site structure:
learning from an Eskimo hunting stand. American Antiquity 43, 330–361.
Blakelock, E. S. 2016: Never judge a gold object by its surface analysis: A
study of surface phenomena in a selection of gold objects from the Staffordshire Hoard. Archaeometry, in press. DOI: 10.1111/arcm.12209
Bølviken, E. E., Helskog, K., Holm-Olsen, I., Solheim, L. and Bertelsen, R.
1982: Correspondence analysis: an alternative to principal components.
World Archaeology 14, 41–60.
Chang, W. 2012: R Graphics Cookbook. Sebastopol, CA: O’Reilly.
Chester-Kadwell, M. 2009: Early Anglo-Saxon Communities in the Landscape
of Norfolk. Oxford: Archaeopress, British Archaeological Reports, British
Series 481.
Chirikure, S., Heimann, R. B. and Killick, D. 2010: The technology of tin
smelting in the Rooiberg Valley, Limpopo Province, South Africa, ca. 1650–
1850 CE. Journal of Archaeological Science 37, 1656–1669.
Christensen, A-M., Holm, P. M., Schuessler, U. and Petrasch, J. 2006: Indications of a major Neolithic trade route? An archaeometric geochemical
and Sr, Pb isotope study on amphibolitic raw material from present day
Europe. Applied Geochemistry 21, 1635–1655.
Cocis, H. 2015: Some remarks on the Roman necropolises of Potaissa. Journal
of Ancient History and Archeology 2, 58–66.
Conolly, J. and Lake, M. 2006: Geographical Information Systems in Archaeology. Cambridge: Cambridge University Press.
Cool, H. E. M. 1983: Roman Personal Ornaments Made of Metal, Excluding
Brooches, from Southern Britain. Cardiff: University of Wales Ph.D. Thesis.
(Available at http://ethos.bl.uk).
Cool, H. E. M. 1990: Roman metal hair pins from southern Britain. Archaeological Journal 147, 148–182.
Cool, H. E. M. 1994: The Quantification of Roman Vessel Glass Assemblages.
Unpublished report: English Heritage.
Cool, H. E. M. 2006: Eating and Drinking in Roman Britain. Cambridge:
Cambridge University Press.
Cool, H. E. M. 2013: Contextualising Metal-Detected Discoveries: Staffordshire Anglo-Saxon Hoard. Nottingham, UK: Barbican Research Associates.
Cool, H. E. M. 2016: Recreation or decoration: what were the glass counters
from Pompeii used for? Papers of the British School at Rome 84, 157–177.
Cool, H. E. M. and Baxter, M. J. 1995: Finds from the fortress: artefacts, buildings and correspondence analysis. In Wilcock, J. and Lockyear, K. (eds.),
Computer Applications and Quantitative Methods in Archaeology 1993. Oxford: Tempus Reparatum, British Archaeological Reports, International
Series 598, 177–182.
Cool, H. E. M. and Baxter, M. J. 1996: Quantifying glass assemblages. Annales du 13e Congrès de l’Association Internationale pour l’Histoire du
Verre, 93–101.
Cool, H. E. M. and Baxter, M. J. 1999: Peeling the onion: an approach
to comparing vessel glass assemblages. Journal of Roman Archaeology 12,
72–100.
Cool, H. E. M. and Baxter, M. J. 2016a: Brooches and Britannia. Britannia 47,
71–98.
Cool, H. E. M. and Baxter, M. J. 2016b: Exploring morphological bias in metal-detected finds. Antiquity (to appear).
Cool, H. E. M. and Price, J. 1995: Roman Vessel Glass from Excavations
in Colchester, 1971-85. Colchester: Colchester Archaeological Reports 8.
http://cat.essex.ac.uk/reports/CAR-report-0008.pdf (accessed 22/10/2016).
Cool, H. E. M., Lloyd Morgan, G. and Hooley, D. 1995: Finds from the
Fortress. The Archaeology of York 17:7. York: Council for British Archaeology.
De La Fuente, G. A. 2011: Urns, bowls, and ollas: pottery-making practices and technical identity in the southern Andes during the late period
(ca. a.d. 900-a.d. 1450) (Catamarca, Northwestern Argentine region, Argentina). Latin American Antiquity 22, 224–252.
De Reu, J., Bourgeois, J., Bats, M., De Smedt, P., Gelorini, V., Zwertvaegher,
A., Antrop, M., De Maeyer, P., Finke, P., Van Meirvenne, M., Verniers,
J. and Crombé, P. 2013: Beyond the unknown: understanding prehistoric
patterns in the urbanised landscape of Flanders. Journal of Historical Geography 40, 1–15.
Discamps, E. and Costamagno, S. 2015: Improving mortality profile analysis in zooarchaeology: a revised zoning for ternary diagrams. Journal of
Archaeological Science 58, 62–76.
Doran, J. E. and Hodson, F. R. 1975: Mathematics and Computers in Archaeology. Edinburgh: Edinburgh University Press.
Drennan, R.D. 1996: Statistics for Archaeologists. New York: Plenum Press.
Drennan, R.D. 2009: Statistics for Archaeologists: Second Edition. New York:
Springer.
Duval, C., Lepetz, S., Horard-Herbin, M-P., and Cucchi, T. 2015: Did Romanization impact Gallic pig morphology? New insights from molar geometric
morphometrics. Journal of Archaeological Science 57, 345–354.
Egri, M. 2007: The use of amphorae for interpreting patterns of consumption. In Croxford, B., Ray, N., Roth, R. and White, N. (eds.), TRAC 2006:
Proceedings of the Sixteenth Annual Theoretical Roman Archaeology Conference. Oxford: Oxbow Books, 43–58.
Greenacre, M. J. 2007: Correspondence Analysis in Practice: Second Edition.
Boca Raton, FL: Chapman & Hall/CRC Press.
Hammon, A. 2011: Understanding the Romano-British–early medieval transition: a zooarchaeological perspective from Wroxeter (Viroconium Cornoviorum). Britannia 42, 275–305.
Hein, A., Kilikoglou, V. and Kassianidou, V. 2007: Chemical and mineralogical examination of metallurgical ceramics from a Late Bronze Age copper
smelting site in Cyprus. Journal of Archaeological Science 34, 141–154.
Hesse, R. 2011: Reconsidering animal husbandry and diet in the northwest
provinces. Journal of Roman Archaeology 24, 215–248.
Hills, C. 2014: Review of Anglo-Saxon Graves and Grave Goods of the 6th
and 7th Centuries AD: a chronological framework. By Bayliss Alex, Hines
John, Høilund Nielsen Karen, McCormac Gerry and Scull Christopher. The
Antiquaries Journal 94, 370–372.
Hintze, J. L. and Nelson, R. D. 1998: Violin plots: A box plot-density trace
synergism. The American Statistician 52, 181–184.
Hiscock, P. 2003: Quantitative exploration of size variation and the extent
of reduction in Sydney Basin assemblages: A tale from the Henry Lawson
Drive rockshelter. Australian Archaeology 57, 64–74.
Howarth, R. J. 1996: Sources for a history of the ternary diagram. The British
Journal for the History of Science 29, 337–356.
Hubert, M. and Vandervieren, E. 2008: An adjusted boxplot for skewed distributions. Computational Statistics and Data Analysis 52, 5186–5201.
Hughes, R. E., Högberg, A. and Olausson, D. 2010: Sourcing flint from Sweden and Denmark: A pilot study employing non-destructive energy dispersive X-ray fluorescence spectrometry. Journal of Nordic Archaeological
Science 17, 15–25.
Jennings, T. A. 2008: San Patrice: An example of Late Paleoindian adaptive
versatility in South-Central North America. American Antiquity 73, 539–559.
Jensen, C.K. and Nielsen, K.H. (eds.) 1997: Burial and Society: The Chronological and Social Analysis of Archaeological Burial Data. Aarhus: Aarhus
University Press.
Joy, J. 2014: ‘Fire burn and cauldron bubble’: Iron Age and Early Roman
cauldrons of Britain and Ireland. Proceedings of the Prehistoric Society 80,
327–362.
Kampstra, P. 2008: Beanplot: A boxplot alternative for visual comparison of
distributions. Journal of Statistical Software, Code Snippets 28(1), 1–9.
https://www.jstatsoft.org/article/view/v028c01 (accessed 18/05/2016).
Keeler, D. 2007: Intrasite spatial analysis of a Late Upper Paleolithic French
site using geographic information systems. Journal of World Anthropology:
Occasional Papers 3.1, 1–40.
King, A. C. 1984: Animal bones and the dietary identity of military and
civilian groups in Roman Britain, Germany and Gaul. In Blagg, T. F. C.
and King, A. C. (eds.), Military and Civilian in Roman Britain. Oxford:
British Archaeological Reports, British Series 136, 187–217.
King, A. C. 1999a: Diet in the Roman world: a regional inter-site comparison
of the mammal bones. Journal of Roman Archaeology 12, 168–202.
King, A. C. 1999b: Animals and the Roman army: the evidence of the animal
bones. In Goldsworthy, A. and Haynes, I. (eds.), The Roman Army as a
Community. Journal of Roman Archaeology Supplement 34. Portsmouth,
RI: Journal of Roman Archaeology, 200–11.
King, A. C. 2005: Animal remains from temples in Roman Britain. Britannia
36, 329–369.
Leach, P. 1993: Pottery. In Woodward, A. and Leach, P. (eds.), The Uley
Shrines: Excavation of a Ritual Complex on West Hill, Uley, Gloucestershire: 1977–9. English Heritage Archaeological Report 17. London: English
Heritage, 219–249.
Levitan, B. 1993: Vertebrate remains. In Woodward, A. and Leach, P. (eds.),
The Uley Shrines: Excavation of a Ritual Complex on West Hill, Uley,
Gloucestershire: 1977–9. English Heritage Archaeological Report 17. London: English Heritage, 257–301.
Liebmann, M., Ferguson, T. J. and Preucel, R. W. 2005: Pueblo settlement,
architecture, and social change in the Pueblo Revolt Era, A.D. 1680 to
1696. Journal of Field Archaeology 30, 45–60.
Lockyear, K. 1994: ‘Ranter’s Corner’ – PUPs. Archaeological Computing
Newsletter 41, 1–2.
Lockyear, K. 2013: Applying bootstrapped correspondence analysis to archaeological data. Journal of Archaeological Science 40, 4744–4753.
Lockyear, K. 2016: Simplifying complexity.
https://www.academia.edu/28521469/ (accessed 06/10/2016).
Mackay, A. 2011: Nature and significance of the Howiesons Poort to postHowiesons Poort transition at Klein Kliphuis rockshelter, South Africa.
Journal of Archaeological Science 38, 1430–1440.
Mackreth, D. F. 2011: Brooches in Late Iron Age and Roman Britain. Oxford:
Oxbow Books.
Madsen, T. (ed.) 1988a: Multivariate Archaeology. Aarhus: Aarhus University
Press.
Madsen, T. 1988b: Multivariate distances and archaeology. In Madsen, T.
(ed.), Multivariate Archaeology. Aarhus: Aarhus University Press, 7–27.
McGill, R., Tukey, J. W. and Larsen, W. 1978: Variations of box plots. The
American Statistician 32, 12–16.
McMahon, T. C. 2007: Discerning prehistoric landscapes in Colorado and
the Mesa Verde region using a kernel density estimate (KDE) Method. In
Clark, J. T. and Hagemeister, E. M. (eds.), Digital Discovery. Exploring
New Frontiers in Human Heritage, CAA 2006. Budapest: Archaeolingua,
151–166.
Mårtensson, L., Nosch, M-L. and Strand, E. A. 2009: Shape of things: understanding a loom weight. Oxford Journal of Archaeology 28, 373–398.
Meighan, C. W. 1959: A new method for the seriation of archaeological collections. American Antiquity 25, 203–211.
Mills, B. J. 2007: Performing the feast: visual display and suprahousehold
commensalism in the Puebloan Southwest. American Antiquity 72, 210–
239.
Monna, F., Jebrane, A., Gabillot, M., Laffont, R., Specht, M., Bohard, B.,
Camizuli, E., Petit, C., Chateau, C. and Alibert, P. 2013: Morphometry of
Middle Bronze Age palstaves. Part II – spatial distribution of shapes in two
typological groups, implications for production and exportation. Journal of
Archaeological Science 40, 507–516.
Morehart, C. T. and Eisenberg, D. T. A. 2010: Prosperity, power, and change:
Modeling maize at Postclassic Xaltocan, Mexico. Journal of Anthropological
Archaeology 29, 94–112.
Moreno-García, M., Orton, C. and Rackham, J. 1996: A new statistical tool for
comparing animal bone assemblages. Journal of Archaeological Science 23,
437–453.
Murrell, P. 2005: R Graphics. Boca Raton, FL: Chapman & Hall/CRC Press.
Murrell, P. 2011: R Graphics: Second Edition. Boca Raton, FL: Chapman &
Hall/CRC Press.
Nenadic, O. and Greenacre, M. 2007: Correspondence Analysis in R, with
two- and three-dimensional graphics: the ca package. Journal of Statistical
Software 20(3), 1–13.
Nikita, E., Mattingly, D. and Mirazón Lahr, M. 2012: Three-dimensional
cranial shape analyses and gene flow in North Africa. Journal of Anthropological Archaeology 31, 564–572.
Oron, M. and Goren-Inbar, N. 2014: Mousterian intra-site spatial patterning
at Quneitra, Golan Heights. Quaternary International 331, 186–202.
Orton, C. 1975: Quantitative pottery studies: some progress, problems and
prospects. Science and Archaeology 16, 30–35.
Orton, C. 1980: Mathematics in Archaeology. London: Collins.
Orton, C. 1993: How many pots make five? An historical review of pottery
quantification. Archaeometry 35, 169–184.
Panich, L. M. 2016: Beyond the colonial curtain: Investigating indigenous use
of obsidian in Spanish California through the pXRF analysis of artifacts
from Mission Santa Clara. Journal of Archaeological Science: Reports 5,
521–530.
Peeples, M. A. and Schachner, G. 2012: Refining correspondence analysis-based ceramic seriation of regional data sets. Journal of Archaeological Science 39, 2818–2827.
Plaza, M. T. and Martinón-Torres, M. 2015: Metallurgical traditions under
Inka rule: a technological study of metals and technical ceramics from the
Aconcagua Valley, Central Chile. Journal of Archaeological Science 54, 86–
98.
Plummer, T. W., Bishop, L. C. and Hertel, F. 2008: Habitat preference of
extant African bovids based on astragalus morphology: operationalizing
ecomorphology for palaeoenvironmental reconstruction. Journal of Archaeological Science 35, 3016–3027.
Pollard, A. M., Ditchfield, P., Piva, E., Wallis, S., Falys, C. and Ford, S. 2012:
‘Sprouting like cockle amongst the wheat’: The St Brice’s day massacre and
the isotopic analysis of human bones from St John’s College, Oxford. Oxford
Journal of Archaeology 31, 83–102.
Radivojević, M., Rehren, T., Pernicka, E., Šljivar, D., Brauns, M. and Borić,
D. 2010: On the origins of extractive metallurgy: new evidence from Europe. Journal of Archaeological Science 37, 2775–2787.
Robeerst, A. 2005: Interaction and exchange in food production in the Nijmegen region frontier area during the Early Roman period. In Bruhn, J.,
Croxford, B. and Grigoropoulos, G. (eds.), TRAC 2004: Proceedings of
the Fourteenth Annual Theoretical Roman Archaeology Conference. Oxford:
Oxbow Books, 43–58.
Sarkar, D. 2008: Lattice: Multivariate Data Visualization with R. New York:
Springer.
Sayer, D. and Weinhold, M. 2013: A GIS investigation of four early
Anglo-Saxon cemeteries: Ripley’s K-function analysis of spatial groupings
amongst graves. Social Science Computer Review 31, 71–89.
Scarry, C. M. and Reitz, E. J. 2005: Changes in foodways at the Parkin site,
Arkansas. Southeastern Archaeology 24, 107–121.
Scharra-Liška, G., Cichocki, O. and Wiltschke-Schrotta, K. 2016: Wooden
coffins in the Avar-period cemetery in Frohsdorf, Lower Austria. Open Archaeology 1, 54–78.
Schmidt, I., Bradtmöller, M., Kehl, M., Pastoors, A., Tafelmaier, Y.,
Weninger, B. and Weniger, G.-C. 2012: Rapid climate change and variability of settlement patterns in Iberia during the Late Pleistocene. Quaternary
International 274, 179–204.
Schörner, G. 2007: New images for old rituals: Stelae of Saturn and personal
cult in Roman North Africa. In Croxford, B., Ray, N., Roth, R. and White,
N. (eds.), TRAC 2006: Proceedings of the Sixteenth Annual Theoretical
Roman Archaeology Conference. Oxford: Oxbow Books, 92–102.
Seetah, K., Cucchi, T., Dobney, K. and Barker, G. 2014: A geometric morphometric re-evaluation of the use of dental form to explore differences in
horse (Equus caballus) populations and its potential zooarchaeological application. Journal of Archaeological Science 41, 904–910.
Shennan, S. 1988: Quantifying Archaeology. Edinburgh: Edinburgh University Press.
Shennan, S. 1997: Quantifying Archaeology: Second Edition. Edinburgh: Edinburgh University Press.
Silverman, B. W. 1986: Density Estimation for Statistics and Data Analysis.
London: Chapman & Hall.
Simmonds, A., Marquez-Grant, N. and Loe, L. 2008: Life and Death in a
Roman City. Excavation of a Roman Cemetery with a Mass Grave at 120-122 London Road, Gloucester, Oxford Archaeology Monograph 6. Oxford:
Oxford Archaeological Unit.
Stafford, C. R. 1991: Archaic Period logistical foraging strategies in West-Central Illinois. Midcontinental Journal of Archaeology 16, 21–246.
Stafford, C. R. 1994: Structural changes in Archaic landscape use in the dissected uplands of Southwestern Indiana. American Antiquity 59, 219–237.
Steele, T. E. and Weaver, T. D. 2002: The modified triangular graph: a
refined method for comparing mortality profiles in archaeological samples.
Journal of Archaeological Science 29, 317–322.
Stiner, M. C. 1990: The use of mortality patterns in archaeological studies of
hominid predatory adaptations. Journal of Anthropological Archaeology 9,
305–351.
Stiner, M. C. 1994: Honor Among Thieves: a Zooarchaeological Study of
Neanderthal Ecology. Princeton: Princeton University Press.
Thornton, C. P. and Rehren, T. 2009: A truly refractory crucible from fourth
millennium Tepe Hissar, Northeast Iran. Journal of Archaeological Science
36, 2700–2712.
Tufte, E. R. 1983: The Visual Display of Quantitative Information. Cheshire,
CT: Graphics Press.
Tukey, J. W. 1977: Exploratory Data Analysis. Reading, MA: Addison-Wesley.
Tukey, J. W. 1993: Graphic comparisons of several linked aspects. Alternatives and suggested principles. Journal of Computational and Graphical
Statistics 2, 1–33.
Tupman, C. 2005: The cupae of Iberia in their monumental context: a study
of the relationship between social status and commemoration with barrel-shaped and semi-cylindrical tombstones. In Bruhn, J., Croxford, B. and
Grigoropoulos, G. (eds.), TRAC 2004: Proceedings of the Fourteenth Annual Theoretical Roman Archaeology Conference. Oxford: Oxbow Books,
119–131.
Unwin, A. 2015: Graphical Data Analysis with R. Boca Raton, FL: Chapman
& Hall/CRC Press.
Venables, W. N. and Ripley, B. D. 2002: Modern Applied Statistics with S:
Fourth Edition. New York: Springer.
Venables, W. N., Smith, D. M. and the R Core Team 2016: An Introduction to R: Notes on R: A Programming Environment for Data Analysis
and Graphics Version 3.3.1. https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf (accessed 10/10/2016).
Wainwright, G. J. 1984: The pressure of the past. Proceedings of the Prehistoric Society 50, 1–22.
Wand, M. P. and Jones, M. C. 1995: Kernel Smoothing. London: Chapman
& Hall.
Weaver, T. D., Boyko, R. H. and Steele, T. E. 2011: Cross-platform program for likelihood-based statistical comparisons of mortality profiles on a
triangular graph. Journal of Archaeological Science 38, 2420–2423.
Weber, M. 2012: The devil is in the (Samian) detail – potter’s stamps on
samian ware and their implications for the Roman provincial economy. In
Duggan, M., McIntosh, F. and Rohl, D. J. (eds.), TRAC 2011: Proceedings
of the Twenty First Theoretical Roman Archaeology Conference. Oxford:
Oxbow Books, 60–75.
Whallon, R. 1987: Simple statistics. In Aldenderfer, M. S. (ed.), Quantitative Research in Archaeology. Newbury Park, California: Sage Publications,
135–150.
Wheatley, D. and Gillings, M. 2002: Spatial Technology and Archaeology.
London: Taylor and Francis.
White, N. 2007: Catering for the cultural identities of the deceased in Roman
Britain: Interpretive potential and problems. In Croxford, B., Ray, N., Roth,
R. and White, N. (eds.), TRAC 2006: Proceedings of the Sixteenth Annual
Theoretical Roman Archaeology Conference. Oxford: Oxbow Books, 115–
132.
Wickham, H. 2009: ggplot2: Elegant Graphics for Data Analysis. New York: Springer.
Wickham, H. 2016: ggplot2: Elegant Graphics for Data Analysis, Second Edition. New York: Springer.
Wickham, H. and Stryjewski, L. 2011: 40 years of boxplots.
http://vita.had.co.nz/papers/boxplots.pdf (accessed 10/10/2016).
Wilkinson, L. 2005: The Grammar of Graphics. New York: Springer.
Woodward, A. and Leach, P. (eds.) 1993: The Uley Shrines: Excavation of
a Ritual Complex on West Hill, Uley, Gloucestershire: 1977–9. English
Heritage Archaeological Report 17. London: English Heritage.
Wurz, S., le Roux, N. J., Gardner, S. and Deacon, H. J. 2003: Discriminating
between the end products of the earlier Middle Stone Age sub-stages at
Klasies River using biplot methodology. Journal of Archaeological Science
30, 1107–1126.
Index
artefact type
amphorae, 89, 99
brooches, 132
cauldrons, 86
faunal remains, 90, 102, 113, 122
glass counters, 94, 141
glass drinking vessels, 149
glass pillar-moulded bowls, 37
glass counters, 109
gold fittings, 127
grave goods, 144
hairpins, 31, 39, 58
inscribed funerary altars, 116
loomweights, 53, 59, 65, 73
pottery (samian ware), 116
stelae, 115
stone tools, 120
aspect ratio, 109, 136, 138
choice of
using ggplot, 68
awful
three-dimensional
barplots, 115–118
pie charts, 89
barplots, 37, 98–118
alternative to pie chart, 92
bad practice, 113–118
clustered vs. stacked, 103, 109
one-way data, 99–101
superfluity of, 100, 102
three-dimensional
avoid, 115
two-way data, 102–109
beanplots, 64
and kernel density estimates, 65
construction, 65
examples, 65, 66
multiple, 66
biplot, 132
boxplots, 64–72
as five-number summary, 64
bad for multimodal data, 67
construction, 64
controlling box size, 67
for skewed data, 71
notched, 69
correspondence analysis, 130–152
archaeological literature on, 131
basic idea, 130
biplot, 132
inertia as measure of fit, 132, 136
outliers, 136
seriation, 130
chronological, 144
data sets
Anglo-Saxon graves, 144
analysis, 145–147
Iberian funerary altars, 117
analysis, 117
Iron Age/Romano-British brooches,
133
analysis, 135–141
Iron Age/Roman cauldrons, 87
analysis, 87–88
Lower Danubian Roman amphorae,
89
analysis, 89
Palaeolithic stone tools from Ksar
Akil, 121
analysis, 121–122
Pompeian glass counter colours, 95
analysis, 96–97, 141–144
Pompeian glass counter diameters,
109
analysis, 110–112
Pompeian loomweights, 53, 54, 66
analysis, 53–57, 59–63, 65–68, 70,
73–84
Roman glass pillar-moulded bowls,
37
analysis, 37–39
Roman North African stelae, 115
analysis, 115
Romano-British amphorae, 100
analysis, 100–101
Romano-British faunal assemblages,
122
analysis, 123–127
Romano-British glass drinking vessels, 149
analysis, 149–152
Romano-British hairpin lengths, 28,
31
analysis, 33–36, 39–42, 58, 64, 65
species presence on Romano-British
sites, 102, 103, 107
analysis, 102–109, 113–114
species remains from Wroxeter, 90
analysis, 90–94
Staffordshire Hoard gold, 127
analysis, 128, 129
stamped samian ware from Castleford (UK), 116
analysis, 116
descriptive statistics, 22–30
computation in R, 28
computation in R, 30
five-number summary
and boxplots, 28
measures of dispersion, 23
interquartile range, 24
mean absolute deviation, 25
range, 24
standard deviation, 25
variance, 25
measures of location, 23
mean, 24
median, 24
mode, 24
trimmed mean, 24
misuse of term ‘average’, 26
quartiles, 24
skewed data
difficulty of summarising, 27
estimated vessel equivalents (EVEs), 148
Excel
awful barplots, 118
awful pie charts
three-dimensional, 89
failure to customise, 9, 99
problems with histograms, 99
widely used in archaeology, 2, 9
frequency polygons, 43–44
histograms, 31–42
as specialised form of barplot, 98
construction, 33
density histograms, 35
dependence on bin-width, 34
dependence on origin, 34
examples of, 26
superimposed with KDE, 35
kernel density estimates, 51–63
and beanplots, 65
and violin plots, 65
bandwidth, 67
adjusting default, 56
choice of, 52
dependence on, 52
comparison with histograms, 52
construction, 51–52
superimposed on density histogram, 35
two-dimensional
contoured, 62, 73
diagonal bandwidth matrix, 60
non-diagonal bandwidth matrix, 62
metal-detected artefacts
bias in recovery, 141
multimodality, 24, 26, 67, 82, 110
antimode, 26, 76
outliers, 23, 79
effects of, 27
in correspondence analysis, 136, 137
removal of, 56
period
Anglo-Saxon (Britain), 127, 144
Iron Age and Romano-British, 132
Palaeolithic (Lebanon), 120
Roman (Lower Danube), 89
Roman Britain, Northern Ireland, 86
Roman Iberia, 116
Roman Italy (Pompeii), 53, 59, 65, 73, 94, 109, 141
Roman North Africa, 115
Romano-British, 31, 37, 39, 58, 90, 99, 102, 113, 116, 122
pie charts, 85–97
three-dimensional
avoid, 89
Portable Antiquities Scheme, 140–141
pseudo-3-dimensional plots, 89, 115–118
quantification
of broken artefacts, 148–149
R, see also R functions, see also R packages
acquiring R, 10
colour, 19–20
palettes, 81
data import, 10–11
data structures
data frames, 11
factor, 12
matrix, 12
missing data, treatment of, 11
vectors, 11
extracting subsets of data, 12
functions, 13–14
graphics packages
base R, 14
ggplot2, 14–19
grid, 14
lattice, 14
legend construction, 40–42
line types, 77
packages, 14
plotting symbols, 75
types of data
numeric, 12
user interfaces
Rcmdr, 9
Rstudio, 9
user-defined functions, 45–50
R functions
abline, 77, 152
adjbox, 71
arrows, 138
as.matrix, 12
as.numeric, 80
axis, 101
bandwidth.nrd, 60
barplot, 38, 101
beanplot, 64
boxplot, 64
ca, 135
colors, 19
colours, 19
contour, 62
coord_fixed, 68, 109, 140
corresp, 135
density, 57
fivenum, 29
geom_bar, 107
geom_boxplot, 66
geom_freqpoly, 43
geom_histogram, 42
geom_point, 19, 75
geom_violin, 66
ggpairs, 84
ggplot, 58
grid, 35
hist, 33
image, 61
is.data.frame, 11
kde, 63
kde2d, 60
legend, 40, 101
library, 48
lines, 57, 78
lm, 78
loess, 78
mad, 29
max, 29
mean, 29
median, 29
min, 29
pairs, 82
par, 40
persp, 61
pie, 85
plot, 57, 138, 144
plot.acomp, 120
points, 63
predict, 78
print, 106
qplot, 35
quantile, 29
read.table, 11
rep, 37, 140
scale_colour_manual, 18, 58, 79
scale_fill_brewer, 81
scale_fill_manual, 42, 109
scale_shape_manual, 18, 140
scale_size_manual, 125
scatterplotMatrix, 84
sd, 29
segments, 77, 144
seq, 30, 41
summary, 29
text, 101
theme, 19, 36
theme_showarrows, 124
triax.plot, 120
var, 29
vioplot, 64
R packages
GGally, 84
MASS, 135
RColorBrewer, 81
beanplot, 64
ca, 135
car, 84
compositions, 120
ggplot2, 35
violin plots, 69
ggtern, 93, 120
grid, 42
ks, 62
plotrix, 120
reshape2, 15
robustbase, 71
splines, 14
vioplot, 64
robust statistics, 23
Romanization, 102, 124
scatterplots, 73–84
adding lines to, 76–78
scatterplot matrices, 82–84
smoothers
linear regression, 77
loess curves, 77
seriation, 120
correspondence analysis, 130, 144–147
spatial, 149–150
with ternary diagrams, 122
ternary diagrams, 119–128
alternative to pie chart, 93
history of, 119–120
seriation, 120, 122
types of data
categorical, 23, 98
continuous, 22, 31
crosstabulated, 23
discrete, 22, 98
nominal, 98
ordinal, 23, 38, 98
violin plots, 64
and kernel density estimates, 65
construction, 65
examples, 65
in ggplot2, 66
multiple, 66