Basic Statistical Graphics for Archaeology with R: Life Beyond Excel

Mike Baxter* and Hilary Cool**

* Nottingham Trent University, School of Science and Technology (Emeritus)
** Barbican Research Associates, 16 Lady Bay Road, Nottingham NG2 5BJ, UK

October 2016

Contents

Preface

1 Introduction
1.1 Why statistics?
1.2 Why R?
1.3 The need for graphics, or not
1.4 How this book was written
1.5 Structure of the text
1.6 Additional resources – data sets and R code

2 R – an introduction
2.1 Introduction
2.2 R basics
2.2.1 R – getting started
2.2.2 Data import
2.2.3 Data structures
2.3 Functions and packages
2.3.1 Functions
2.3.2 Packages – general
2.3.3 Packages – graphics
2.3.4 Plot construction – ggplot2
2.3.5 Colour
2.4 Don’t panic

3 Descriptive statistics
3.1 Introduction
3.2 Types of data
3.3 Summary statistics
3.3.1 Definitions
3.3.2 Usage (and abusage)
3.4 Obtaining summary statistics in R

4 Histograms
4.1 Introduction
4.2 Histograms in base R
4.3 Histograms using ggplot2
4.4 Histograms from count data
4.5 Histograms with unequal bin-widths
4.6 Using histograms for comparison
4.7 Frequency polygons

5 User-defined functions
5.1 User-defined functions
5.1.1 Introduction
5.1.2 Writing functions – a very basic introduction
5.1.3 More realistic illustrations
5.2 Summary

6 Kernel density estimates
6.1 Introduction
6.2 Example – loomweights from Pompeii
6.3 Comparing KDEs
6.4 Two-dimensional (2D) KDEs

7 Boxplots and related graphics
7.1 Introduction
7.2 Construction and interpretation
7.3 Using plots for comparing groups
7.4 Discussion

8 Scatterplots
8.1 Introduction
8.2 Presentational aspects
8.2.1 Plotting symbols, axis labels, axis limits
8.2.2 Adding lines to plots
8.2.3 Points
8.2.4 Scatterplot matrices

9 Pie charts
9.1 Introduction
9.2 Examples
9.2.1 Pointless pie charts
9.2.2 Dishonest pie charts
9.2.3 Could do better
9.2.4 From bad to worse?

10 Barplots
10.1 Introduction
10.2 One-way data
10.3 Two-way data
10.3.1 Preliminary considerations
10.3.2 Barplots for two-way data in base R
10.3.3 Barplots in ggplot2
10.4 Examples, including things to avoid
10.4.1 Glass ‘counters’ and their function
10.4.2 Chartjunk I
10.4.3 Chartjunk II

11 Ternary diagrams
11.1 Introduction
11.2 Examples
11.2.1 Palaeolithic assemblages
11.2.2 Romano-British civilian faunal assemblages
11.2.3 Staffordshire Hoard gold compositions

12 Correspondence analysis
12.1 Introduction
12.2 Correspondence analysis – basics
12.3 Some data
12.4 R basics
12.4.1 ‘Default’ analyses
12.4.2 Customising plots
12.4.3 Interpretation
12.5 Glass ‘counters’ from Pompeii
12.6 Anglo-Saxon male graves and seriation
12.7 Flavian drinking-vessels

References

Index

List of Figures

3.1 Some possible distributional ‘shapes’ for continuous data.
3.2 A histogram and KDE for late Romano-British hairpin lengths.

4.1 Romano-British hairpins.
4.2 A Roman pillar moulded bowl.
4.3 Histograms for the Romano-British hairpin lengths I.
4.4 Histograms for the Romano-British hairpin lengths II.
4.5 Histograms for the Romano-British hairpin lengths using ggplot2.
4.6 Different ways of representing the data from Table 4.3.
4.7 Using stacked histograms for comparison.
4.8 Using superimposed histograms for comparison.
4.9 Frequency polygons for Romano-British hairpin lengths.

6.1 Construction of a kernel density estimate.
6.2 The effect of bandwidth on KDE construction.
6.3 KDEs of the weights of loomweights I.
6.4 A view of Insula VI.1, Pompeii.
6.5 Loomweights from Insula VI.1, Pompeii.
6.6 KDEs of the weights of loomweights II.
6.7 KDEs and histograms of the weights of loomweights.
6.8 KDEs comparing lengths of early and late Romano-British hairpins.
6.9 Two-dimensional KDEs using ggplot2.
6.10 Two-dimensional KDEs using base R.
6.11 Contour plots based on 2D KDEs using base R.

7.1 Basic boxplots, violin plots and beanplots.
7.2 Boxplots and beanplots used to compare weights of loomweights.
7.3 Kernel density estimates of weights of phased loomweights.
7.4 Boxplots and violin plots used to compare weights of loomweights.
7.5 Boxplots and violin plots for Romano-British hairpin lengths.
7.6 Boxplots for skewed data.

8.1 Default scatterplots for loomweight heights and weights.
8.2 Enhanced scatterplots for loomweight heights and weights.
8.3 Basic code for a base R scatterplot.
8.4 Available symbols for plotting points in R.
8.5 Basic code for a ggplot2 scatterplot.
8.6 Adding lines to plots I.
8.7 Available lines for plotting in R.
8.8 Adding lines to plots II.
8.9 Adding points to plots.
8.10 Plotting multiple groups I.
8.11 Plotting multiple groups II.
8.12 A scatterplot matrix for loomweight dimensions using base R.
8.13 An enhanced scatterplot matrix for loomweight dimensions.
8.14 A ggpairs scatterplot for loomweight dimensions.

9.1 Pie charts of the distribution of contexts by cauldron group.
9.2 Pie charts showing the distribution of amphora types in the Lower Danube region.
9.3 Pie charts for faunal remains from Wroxeter.
9.4 Soay sheep and Dexter cattle.
9.5 A clustered barplot for faunal remains from Wroxeter.
9.6 A ternary diagram for faunal remains from Wroxeter.
9.7 Pie charts for counter colours by phase from Pompeii, Insula VI.1.

10.1 Default and enhanced barplots for amphora presence on Romano-British sites.
10.2 Default barplots for species presence on Romano-British sites.
10.3 Enhanced barplots for species presence on Romano-British sites using base R.
10.4 ‘Default’ barplots for species presence on Romano-British sites using ggplot2.
10.5 A modified clustered barplot for species presence on Romano-British sites using ggplot2.
10.6 Glass counters from Gloucester and Pompeii.
10.7 A clustered barplot for glass ‘counter’ dimensions from different contexts.
10.8 An alternative (and poorly ‘enhanced’) barplot for amphora presence on Romano-British sites.
10.9 A version of Figure 10.3a showing the visual impact of grid lines.
10.10 Monthly distribution of sacrifices to Saturn.
10.11 Examples of unnecessary three-dimensional barplots.

11.1 Ternary diagrams for the assemblage data of Table 11.1.
11.2 ‘Default’ and modified ternary diagrams for Romano-British civilian faunal assemblages.
11.3 A ‘modified’ ternary diagram for Romano-British civilian faunal assemblages II.
11.4 A ternary diagram for Romano-British civilian faunal assemblages overlaid with a two-dimensional KDE.
11.5 Ternary diagrams for Romano-British civilian faunal assemblages for different site types overlaid with two-dimensional KDEs.
11.6 Objects from the Staffordshire Hoard.
11.7 A ternary diagram for Au/Ag/Cu compositions of artefacts from the Staffordshire Hoard.

12.1 A selection of Romano-British bow brooches of 1st to 4th century date.
12.2 Basic correspondence analyses for the Romano-British brooch data set.
12.3 An enhanced correspondence analysis for the Romano-British brooch data set using the ca function and base R.
12.4 An enhanced correspondence analysis for the Romano-British brooch data set using the ca function and ggplot2.
12.5 A correspondence analysis of glass counter colours by phase from Insula VI.1, Pompeii, Italy.
12.7 Examples of Flavian glass vessels.
12.8 Correspondence analysis, Flavian drinking-vessel glass assemblages.

List of Tables

3.1 Lengths of late Romano-British copper alloy hairpins.

4.1 Lengths of Romano-British copper alloy hairpins.
4.2 The Romano-British hairpin data as entered into R.
4.3 Chronology of Roman glass pillar-moulded bowls.

6.1 Loomweight dimensions.

7.1 Phased loomweight data with six variables.

9.1 Counts of Iron Age and early Roman cauldrons by context.
9.2 Percentages of major domesticates from Wroxeter by phase.
9.3 Colours of glass ‘counters’ by phase, from Insula VI.1, Pompeii.

10.1 Amphora as a proportion of total pottery assemblages by site.
10.2 Later first- to mid-second century animal bone assemblages from Romano-British sites.
10.3 The data of Table 10.2 recast for use with the barplot function in base R.
10.4 The data of Table 10.2 recast for use with ggplot2.
10.5 Glass ‘counter’ sizes from Insula VI.1 Pompeii and other contexts.
10.6 Individuals commemorated with and setting up funerary altars.

11.1 Counts of cores, blanks and tools from middle levels of the palaeolithic site at Ksar Akil (Lebanon).
11.2 Faunal remains from Romano-British civilian sites.
11.3 Au/Ag/Cu compositions for artefacts from the Staffordshire Hoard.

12.1 Regional counts of late Iron Age and Roman brooches.
12.2 Assemblage profiles for Flavian drinking-vessel glass.

Preface

As the late, great Terry Pratchett put it in his Discworld novel Hogfather, “Everything starts somewhere, though many physicists disagree. But people have always been dimly aware of the problem with the start of things.”

The genesis of this text is multiple. But the idea of producing it, to the best of our recollection, crystallized during a conversation in a pub – as many good and bad ideas do. The first author is an academic statistician with reasonably extensive youthful experience of the practicalities of archaeological excavation and its aftermath. The second author is an archaeologist who has used statistical methodology in her work since her university days. We were working on separate projects with the finishing line in sight and vaguely wondering what to do next.

We’ve published jointly since the 1990s, latterly, when it comes to statistics, including graphical presentation using the open-source software R, about which more is said in Chapter 2. Some of the simpler methodology that we’ve used – graphical presentation of the kind that is the staple of archaeological publications that touch on this kind of thing – should be straightforward.
The archaeological literature is littered with statistical graphics that should have been strangled at birth or, in some cases, not conceived at all. We resort to Excel from time to time and take the trouble to customise the graphs. We thought that a text that discussed how to produce the most common types of graph that appear in archaeological publications, with detailed instructions on how to construct them in R, might be useful. The immediate thought at the time was that it would enable the second author to produce her own graphs in R without having to trouble the first author. This would be a solipsistic reason for inflicting a text on an unsuspecting world, though this has never stopped anyone. There is a more serious purpose – of the kind that occurs to you on your second glass of wine (or beer or cup of tea or whatever).

You might also conceive of our pub conversation as a ‘conception’ arising, in a less precisely datable fashion, from mutual long-standing dissatisfaction with some of the statistical graphics to be found in the archaeological literature. With apologies for the bad language used, this is best conveyed by some of the kinds of things we’ve said. ‘If I see another bl**dy three-dimensional exploded pie-chart I’ll scream’. ‘The editor(s) who allowed this monstrous, three-dimensional, optically disturbing bar chart to sully their publication should be shot’. ‘Mummy, why do they use half a page to display a pie/bar chart with two categories? Couldn’t they just cite the numbers?’. That sort of thing. It’s possible that we might be exaggerating a bit, but possibly not and certainly not as much as you would wish.

Somewhat over 20 years ago Kris Lockyear (Lockyear, 1994), in the appropriately named Ranter’s Corner of the now defunct Archaeological Computing Newsletter, sounded off, entertainingly, more in anger than sorrow, about problems with graphics in archaeological publications. The article was entitled PUPs (pretty useless pictures); the acronym PPPs (pretty pointless pictures) was also used, as well as others coined to be equally derogatory. Re-reading it – it is partly a product of its time in terms of references to the technology then available – a lot of his critique is, depressingly, still applicable. Everything passes but, over the time-scale we’re talking about, everything doesn’t necessarily change, or at least for the better. PPPs and PUPs still abound. And things could be done about it. Quite a lot of the more heinous graphics to be seen in archaeological publications are produced using Excel or software purposely designed to emulate the worst that Excel has to offer. For many purposes Excel is quite capable of delivering decent graphs, but it has to be controlled/customised and not everyone bothers.

The royal ‘We’ is used a lot in the text; this is supposed to be bad practice (along with over-use of the first-person singular) in academic texts, and is quite deliberate. One reason is that the text is opinionated in places, and opinions differ (particularly among academics); we thought it best to make it quite clear when disputable opinion was being expressed. The other reason is that we’ve been at some pains to use real data for illustrative purposes, a lot of it based on our own research; it seemed silly to refer to this in neutral terms as if it had nothing to do with us, so hence ‘We’.
The other decision made was to use real data and examples of what we considered to be bad ‘graphical practice’, drawn from examples in texts and journals lying around in our house – it proved to be distressingly easy, as is discussed in the introductory chapter. What, for some perverse reason, struck us as amusing was that the richest source of awful graphics proved to be the annual proceedings of the Theoretical Roman Archaeology Conference (TRAC), of which we have an almost complete set. We’ve resisted the temptation to cite everything, but enough examples are included in the text to indicate what we mean. What amused us – and we shall make it clear that we are not criticising the scholarship of authors singled out for attention – is that a lot of the really awful graphical stuff was produced by authors who have passed through the hallowed portals of our more prestigious universities – most notably Cambridge – usually as PhD students or post-doctoral, and have gone on to lectureships in other noted universities. What’s worrying is that their supervisors obviously did nothing to discourage the graphics used and that bad habits will perhaps be condoned in future generations. To quote the poet Philip Larkin out of context, ‘Man hands on misery to man/It deepens like the coastal shelf’.

If this text has a serious purpose it is to encourage archaeologists to give more thought to the statistical graphics that they use, possibly even to have the courage to eschew them on occasion. The emphasis is on how you might do things in R, which isn’t as ‘comfortable’ as Excel and other menu-driven software, but allows more control and forces you to think.

Chapter 1
Introduction

1.1 Why statistics?

Archaeology is a data-rich subject. Forget anything you might have read about data being ‘theory-laden’ – data exist independently of theory, though ‘theory’ may inform what you choose to pay attention to and how you interpret it. A piece of pot is a piece of pot; you can measure its dimensions, classify it by fabric, determine its chemical composition, establish its petrography and so on. With luck you may be able to assign a function or date to it, albeit with a degree of uncertainty. If it’s recovered from an excavation you know where it comes from; that is, you know what the spatial, and possibly temporal, relationship is to other artefacts of both the same and different types. If you’re dealing with an assemblage, of the kind generated by a serious excavation, you can count occurrences of types, classify them by context and date, and make comparisons with other assemblages. All this generates numbers and you want to do something with them, or ought to. Stratigraphy, one of the bedrocks of practical archaeology, is inherently numerical. You try to place contexts/buildings etc., sometimes with only partial success, into an ordered (numerical) sequence that reflects their temporal relationship.

It is, we suppose, possible to live in the realm of ‘pure theory’ (a theoretical womb perhaps, from which you don’t emerge into reality). No PhD dissertation is complete without an obligatory chapter on theory, before getting to the real business on the basis of which the PhD is awarded. Aspirants to university lectureships in archaeology may possibly find that competence in ‘theory’ is one of the more important things needed for the job. We’ve nothing against theory. Some of our best friends do theory.
The second author has been accused of doing theory, albeit in an unobtrusive way; and the first author of possibly understanding some of it. But...

One view of what theory sometimes tries to do is that it involves an interpretive synthesis of large amounts of information from excavated archaeological sites, distinguished both spatially and temporally. This isn’t the whole story. But archaeological excavation is at the heart of much of archaeology as an academic discipline. This is, perhaps, too blindingly obvious to say explicitly. And you need to interpret the information to be gleaned from any individual site. And this involves finding things, cataloguing them, counting them, classifying them, measuring them, interpreting them and so on. And then you might think it useful to do something with the numbers thus generated. And then – dreaded word – statistics come into play. Which is what this text is about – more specifically, graphical presentation.

1.2 Why R?

Excluding publications in archaeological science journals and the like, the statistical methods that appear in what we shall call ‘straight’ archaeological publications, where they do appear at all, tend to be at the more basic but important end of the scale – apparently ‘simple’ graphics such as pie charts, barplots, histograms, scatterplots and so on. It’s this kind of thing on which the text largely concentrates, though we have strayed from our original self-imposed remit.

The motivations for writing the text are discussed in the preface. The original thought was to provide a detailed ‘how to’ guide about producing the commoner kind of graphics in the open-source software R. The software is discussed in Chapter 2. A lot of what you see in the literature is obviously produced using menu-driven software, quite often Excel; R, while you do need to learn it, has a reputation for being difficult to learn that we think can be exaggerated. Part of the thinking is that R offers more control over the graphics than other software does. You’re also forced to think about presentational issues and whether you need the graphic at all. Excel has a lot to answer for, but it’s probably down to the fact that users don’t exercise their right to customise graphs and/or are seriously lacking in taste. Even if the merits of R don’t attract you, think several times before choosing a ‘default’ option in some conveniently available menu-driven software package.

1.3 The need for graphics, or not

What follows in later chapters is sometimes opinionated. Ironically, given the emphasis on graphical presentation, one message of the text is that what you see is often unnecessary and, literally, a waste of space. That a ‘picture is worth a thousand words’ is a cliché. Like all good clichés it contains an element of truth, but discretion in the use of graphics should be exercised.[1] It’s criminal to waste half a page of printed text on a pie chart or barplot with just one category (this, unfortunately, is not a joke) or even two or three categories.

[1] If you believe Wikipedia the origins go back to American journalism and advertising from the 1910s, which might sow an element of doubt in your mind. ‘A picture paints a thousand words’ might be a more familiar rendition and seems to stem from much the same period, possibly ‘popularised’, if that is the right term, in a song by the ‘soft-rock’ band Bread in the 1970s called ‘If’, not to be confused with Rudyard Kipling’s poem with the same title.
Many experts in data visualization (which we’re not) would like to dispense with pie charts entirely. They are also very rude about pseudo-three-dimensional pie charts, barplots and histograms, which should be banned from academic publications. This is illustrated with unfortunately real examples in Chapters 9 and 10. A lot of what’s seen in the literature, illustrated graphically, could be explained in a simple sentence or two, or conveyed more effectively by commentary on a small table. You need to trust the intended audience to be capable of comprehending things like ‘of the 1000 artefacts studied the overwhelming majority (99.5%) were of type X’. We have real examples in mind here, some to be noted in later chapters. A specific instance comes from Wainwright’s (1984: 8) presidential address to ‘The Prehistoric Society’. It shows a pie chart of the regional distribution of scheduled monuments in England, by period. It is stated that the chart shows 49% of prehistoric date, 43% medieval, 7% Roman and 3% post-medieval. Surely this is comprehensible without a chart that occupies half a page; if you really must have a graphic a barplot would have been better. This example could readily be multiplied.

If a graphic is needed, and it often helps, give thought to it. Considerable thought goes into the choice of artefacts for illustration, and how plans and sections will be presented. Graphics summarising data deserve just as much care. We, the authors, are on dangerous territory here since the disclaimer that we have no special expertise in data visualization has already been issued. There are plenty of experts around, not all of whom necessarily agree about everything. Our position, and we exaggerate a little, is along the lines that ‘we don’t know much about art but we know what we don’t like’. What this means will become apparent later. You may identify problems with some of the graphics presented here; the ‘syntax’, inappropriate choice of colour and so on. It’s the first thing we’ll notice when it’s made public. However, the idea is to show how things might be done using R, including what we think should be avoided, and it’s up to the user to make stylistic/aesthetic decisions about exact presentation, which may depend on the medium in which they wish to publish.

What’s being discussed here is graphics with an intended audience. But you can undertake graphical data analysis for your own edification, to understand your data, without immediate thoughts of publication. That is, there is a distinction to be made between what we shall call ‘exploratory’ and ‘presentational’ graphical analysis. Unwin’s (2015) recent and excellent book on the subject, Graphical Data Analysis with R, emphasises the former. Here we have both aspects in mind and this has informed the way the text is structured.

There are different graphics systems available in R. If a quick preliminary look at the data is all that is needed we often choose to use base R; it’s fine for presentational purposes as well but, increasingly, the package of choice for presentation seems to be ggplot2. This is discussed in more detail in Chapter 2. The latter system is geared towards colour, though this can be done in base R as well. We sense a growing consensus that the future of academic publication is that it will be online so colour will be no problem; but it hasn’t happened yet. For black-and-white graphics we have sometimes found it more convenient to use base R; you can customise ggplot2 to do the same, but it can require some effort.
For some kinds of graphic, though not necessarily those most commonly seen in the archaeological literature, ggplot2 is undoubtedly the best choice. The strategy adopted in much of what follows is to present the minimal code in both base R and ggplot2 necessary to see what things look like, then give detailed code that illustrates the presentational options available. Our own experience in learning and using R is that (a) it helps to be motivated by a data set of your own that you’re keen to interrogate, (b) imitate existing code that seems to do what you want to do and, if you can’t, (c) search online or read a book or two. Hence the detailed code provided here, but it’s the motivation that counts.

1.4 How this book was written

The first author drafted the text while the second author was engaged in completing a long-term project. When this was complete she read through the text, correcting, adding and deleting as appropriate. Most importantly she checked all the R code thoroughly to ensure that it worked (not always) and whether she understood it (not always). This led to major revisions in places.

Apart from rewriting some code, to make it easier to understand, we have devoted more space to explicitly advocating the routine use of user-defined functions. This topic now has a chapter of its own (Chapter 5), following a first introduction to code for graphical analysis – histograms – in Chapter 4. The basic idea is that you write a body of code to produce a graph but, for reasons discussed in Chapter 5, it is often more convenient to embed it in a function. The graph is then produced only when the function is executed. In the original draft the bodies of code were presented, but not explicitly within a user-defined function. We urged the use of such functions on the reader, but obviously not emphatically enough. This, we hope, is now remedied. We have now explicitly presented some of the code embedded within functions so, in addition to Chapter 5, readers can see what is required. Some code is still not shown within a function, but readers who wish to emulate our analyses and code are urged to incorporate such code within a function. See Section 1.6 for additional comment on this.

Another source of confusion, when the second author read the original draft, was the appropriate way to prepare the data for analysis. The issue is that we have presented code for analysis in both base R and ggplot2, and for the same data they may (depending on the analysis) require the data to be differently formatted. This can also differ from the form most natural for publication. R has some powerful data manipulation facilities so that you can convert from one format to another within R, but this is not necessarily for beginners. Accordingly we have recommended that data be prepared in an appropriate format outside of R, in the format needed for any particular analysis, before importing it into R. This is most explicit in Section 10.3, where the same data are presented in three formats, one for publication, one for base R and one for ggplot2.

To facilitate the use of these notes, either to reproduce our own analyses or use them as the basis for your own data, we are making available an Excel file containing all the data used, including some data sets too large to publish in the text, and a text file containing all the code. More details are provided in Section 1.6. It should be possible to copy-and-paste these into R.
We have tried to make it clearer in the text than was originally the case, at the risk of some repetition, what format the data needs to be in. An obvious point, but one that led to some confusion when the second author reviewed the code, is that the name that you give your data file in R needs to match that in the function constructed to analyse it!

1.5 Structure of the text

Chapter 2 necessarily provides enough information to get started with R and about things we’ve found useful or which recur in the text; for those who like detailed and systematic treatments references are provided. Chapter 3 discusses descriptive statistics, their uses, and how to obtain them in R. This is not the main focus of the text but it helps to understand some of the graphics to follow.

Chapters 4, 9 and 10 deal with histograms, pie charts and barplots respectively – the kind of statistical graphics that, collectively, dominate the ‘straight’ archaeological literature where statistics is used at all. Chapter 6 deals with kernel density estimates; these aren’t exactly mainstream yet, but are becoming so in some specialised areas of application and can be an appealing alternative to the histogram for some purposes. Chapter 7 deals with boxplots and ‘variants’ of them; boxplots are now widely used, the variants – ‘violin plots’ and ‘beanplots’ – are hardly visible in the archaeological literature but address some of the problems you may have with boxplots. Chapter 8 deals with scatterplots, another common graphic to be seen. Chapter 11 illustrates the construction and use of ternary diagrams. This is something of a ‘niche’ subject – several niches in fact – but they are widely used. Chapter 12, on correspondence analysis, might be regarded as an indulgence that strays beyond our original remit, which was to confine ourselves to the simpler methods of univariate and bivariate data display.

Chapters 4, and 6 to 8, deal with methods mostly appropriate for continuous data, things like height and weight. The chapters on pie charts and barplots deal with categorical data where, for example, artefacts are classified by type. If two classifications are used, such as type and context for example, they can be cross-tabulated, to see how the distribution of types changes across contexts. If the number of categories within each classification ‘system’ is small barplots can be a good way of looking at the data; if the number of categories is at all ‘large’ correspondence analysis is an attractive alternative for looking at the tabulated data, widely used, and – we think – readily explained and easy to implement, so hence its inclusion.

Apart from providing detailed R code we have made an effort to illustrate everything using real, mostly published, data that derives either from our own work or from that to be found in the books and journals on our bookshelves. The archaeological context, including conclusions to be drawn from an analysis, is usually summarised, sometimes in detail. There is something of a bias towards Roman material, reflecting our own interests and the content of papers, other than our own, that we have drawn on. The methodology is general, and independent of the specific context. With very few exceptions the data are drawn from what we’ve called ‘straight’ archaeological publications as opposed to archaeological science publications.[2]

[2] The distinction is made with some unease but we assume readers will know what we intend.
For the latter, parts of Baxter (2016) cover some of the same ground as here but in much less detail, mostly without R code, and with a specific focus on multivariate methods and archaeometry.

We made the decision, after serious thought, to reproduce examples of what we consider to be bad graphical practice, with critical comment. This involves citing specific papers. It was distressingly easy to find these; less than an hour rummaging around our bookshelves and the detritus on our floors located most of what is presented. We’d like to make it absolutely clear that we do not, in any way, wish to impugn the archaeological scholarship exhibited in the articles singled out for this sort of treatment. But some of the graphics make you want to weep. The text attempts to explain what we think some of the problems are and what might be done about it.

We anticipate that readers familiar and comfortable with the software they use will not be that tempted to move to R. It’s not the only software package that can produce decent graphics, though we would discourage the use of Excel. We hope, however, that the examples and discussion in the text will encourage readers to give more thought to graphical presentation than is sometimes evident, possibly to the extent of recognising when a space-filling graphic based on simple data with a clear message is not always necessary.

1.6 Additional resources – data sets and R code

For those interested in pursuing R some additional resources relating to the data sets and code used have been made available, along with this text, on our academia.edu web pages, and on the Barbican Research Associates website.[3]

[3] http://nottinghamtrent.academia.edu/MikeBaxter
http://independent.academia.edu/HilaryCool
http://www.barbicanra.co.uk/resources.html

Data sets

All the data sets analysed for which we provide code are contained in worksheets in an Excel file that can be accessed on the websites cited above, Data Sets for ‘Basic Statistical Graphics for Archaeology with R: Life Beyond Excel’. To use these in the way that we envisage, using the code that is also available, the different data sets need to be imported into R, as described in Section 2.2.2, with identical names to those provided for the worksheets. Because of their size not all of the data are given in full in this text.

R code

A Word file, Figures for ‘Basic Statistical Graphics for Archaeology with R: Life Beyond Excel’ – R code, is also available on the websites noted. Provided the data sets have been successfully imported it should be possible to copy and paste the complete chunk of code for each figure into R to get the figure in the text.

In the text we have urged the use of user-defined functions for all but the simplest and shortest bodies of code (Chapter 5). Not all the code in the text itself has been presented in this way, and it is sometimes incomplete with presentational arguments left implicit. The initial thought was to encourage users to complete the code, if not already present, and then embed it in a function. In the event, based on the experience of the second author in testing and trying to understand the code, it was thought better to make available the complete code used, already embedded in a user-defined function. The idea is that this will provide an entrée into R for those unfamiliar with it. The intention is to show what can be done, and how, before coming to grips with analysis of your own data.
To emulate the analyses illustrated, the minimum required would be to change the name of the data set used to match your own, modify labels, and possibly play around with presentational aspects. The excitement begins when you want to tweak analyses and presentational options not covered in this text. If it isn’t obvious from the R help facilities what to do there’s plenty of online help, and books, available for guidance. It’s quite easy to want to do something that seems quite simple, but turns out to be more challenging than you might expect! But you don’t have the same facilities available, and advice, with most menu-driven software.

Chapter 2
R – an introduction

2.1 Introduction

R is a powerful open-source software system, now the package of choice of many applied statisticians, and others, who engage in data analysis. It can be thought of as the ‘son of S’, a language developed in the 1980s that, in the sub-title of the text that introduced it, was described as a ‘programming environment for data analysis and graphics’ (Becker et al. 1988). The classic text of Venables and Ripley, Modern Applied Statistics with S, dealt, as the title implies, with a wide range of applied statistical methods using S. The fourth and final edition of the text (Venables and Ripley, 2002) encompassed R, which was developed in the 1990s, and remains a valuable source of reference we consult often. It is aimed, though, primarily at statisticians.

R has the merits of being both an exceptional tool for data analysis and being free. Since the early 2000s it has grown into something of an ‘industry standard’. There are numerous texts on its use pitched at all levels, directed at different specialisations, some concentrating on the more technical and computational aspects of the language. Although the situation is changing R has yet to be widely used in some areas of archaeological data analysis and its adoption has been patchy. Baxter (2015b) is an online book-length introduction to the use of R for quantitative data analysis in archaeology that touches on some of the material discussed in this text. It does not pretend to be a systematic treatment of R but includes a lot of data analysis with associated code. Baxter (2016) is a more specialised treatment of multivariate methods in archaeometric data analysis but includes a discussion of simpler methods of graphical analysis with some code. David Carlson’s useful website[1] has a more systematic introduction, with a focus on how to use R to implement the methodologies in the textbooks of Shennan (1997) and Drennan (2009). A book is promised.

[1] http://people.tamu.edu/~dcarlson/ (accessed 10/10/2016)

It’s far easier to find applications in publications such as the Journal of Archaeological Science than it is in ‘straight’ archaeological journals such as American Antiquity, Antiquity, Antiquaries Journal, The Archaeological Journal, Britannia, Journal of Roman Archaeology, Medieval Archaeology, Proceedings of the Prehistoric Society etc., to name only some of those on our bookshelves or which we consult regularly. If there is a default in this latter group for data analysis it’s probably Excel. If you customise things properly you can get perfectly acceptable graphics with Excel but lots of authors don’t. There are plenty of other examples around, using other software the origins of which are often obscure, that produce far worse graphics than anything that can be achieved in Excel.
The presumed reason for this is the accessibility (in the sense that it’s free or what’s available in the environment you work in) of the software used. Since R is free and designed, among other things, to produce decent graphics by people who know what a ‘decent’ graphic is, it’s a very obvious and better alternative to what’s often used.

A frequently cited reason for the reluctance of some potential users to engage with R is that it is command-driven, rather than the menu-driven software many users are reared on. That is, it’s not ‘user-friendly’. This may be a particular problem for statistical software where the subject matter can be seen as challenging before you even engage with implementation. Our attitude is that if you can write a sentence, with due attention to syntax and capitalisation, you can do it. That said, in the first instance, it can be helpful to emulate code produced by others. There is lots around – it’s a good way of learning, and the reason detailed code is supplied here.

The attractions of menu-driven software are obvious. You can produce a barplot (possibly quite an awful one) in Excel almost instantaneously. It takes more effort in R to produce a plot fit for publication than you might wish. But you can exercise more control over its appearance, which forces you to think more about it, possibly even concluding that it’s unnecessary (as many published barplots are). In fact, because of the perceived problem, more user-friendly interfaces with R exist. David Carlson’s website exploits Rcmdr, a graphical user interface. RStudio[2] is increasingly recommended and there are other options (see, for example, Baddeley et al. 2016, 20). Our experience with R predates the development of RStudio by some years and we haven’t used it much, but we’ve seen it suggested that this is where many people new to R might wish to start.

[2] https://www.rstudio.com/ (accessed 10/10/2016)

Ben Marwick’s CRAN Task View: Archaeological Science usefully summarises a wide range of R resources potentially useful for archaeological data analysis well beyond what is covered in this text.[3]

[3] https://github.com/benmarwick/ctv-archaeology (accessed 10/10/2016)

Although the first author has used R for about 15 years, and S before that, we don’t claim any particular expertise in the subject compared to what we recognise as ‘expert’, and this isn’t false modesty. We tend to work with quite ‘small’ data sets that use code which works for us (‘does the job’ if you like) and which we hopefully understand. That is, by the highest professional standards, some of what follows may not be ‘efficient’ or too readily generalised – anyone with an aptitude for programming can probably improve on some of the code. R develops and is updated fairly rapidly, so sometimes you don’t pick up on changes, or exploit them, as quickly as might be hoped. As discussed in more detail later, ggplot2 is a powerful graphics software package developed for use with R that has now been around for some time (Wickham 2009; a second edition of the accompanying text, Wickham 2016, has recently appeared) and has an enthusiastic following; we’ve only started using it regularly fairly recently so some of what’s to come has involved a learning experience on our part. R is available on several different platforms; our experience is entirely with Windows; if you use Linux or Macs some of what follows may not work as described but there will be plenty of online advice.
The next section outlines the essentials needed to get started with R, and things we think it might be useful to know at the outset. The treatment is informal; attention will be drawn to more thorough and ‘respectable’ treatments of topics covered which the reader may wish to consult if they get interested in R.

2.2 R basics

2.2.1 R – getting started

Search online for CRAN R (the Comprehensive R Archive Network) or use http://cran.r-project.org/ to get started. You are directed to pages that tell you how to install R on various platforms, and more information, if needed, is provided in the FAQs. R is updated frequently. Versions 3.1.2 to 3.3.1 were used for the analyses in the text. Material on CRAN includes Venables et al. (2016),[4] which is a roughly 100-page introduction to the language, though there is little in the way of data analysis. A comprehensive list of over 150 books on the subject is available,[5] as well as free material.[6] David Carlson’s website, noted earlier, includes a ‘blow-by-blow’ account of the steps involved in installation.

[4] https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf (accessed 10/10/2016)
[5] https://www.r-project.org/doc/bib/R-books.html (accessed 10/10/2016)
[6] https://cran.r-project.org/other-docs.html (accessed 10/10/2016)

2.2.2 Data import

Once installed the next thing is to get some data into R and do something with it. You’re faced with a prompt > and – the way things are done in this text – something needs to be typed after it to make things happen. Small amounts of data can be entered directly. Suppose you have a set of numbers (1, 2, 3, 4). Type

x <- c(1,2,3,4)

then

mean(x)

will return the (arithmetic) mean of the data (Section 3.3). You could also do something like

m <- mean(x)

then type m; this stores the value in m for any later use. The object x is a vector and the term variable will also be used. It consists entirely of numbers and is an example of a numeric variable. This and other terminology to be introduced will recur later in the text.

More usually you will have a (possibly large) pre-prepared data set available for study and it needs to be imported. There is extensive documentation on data import/export[7] which, depending on how your data are stored and formatted, you may need to consult. Given that we do not usually deal with extremely large data sets, and they are usually in an Excel spreadsheet or readily converted from a database into that form, we usually do something that is both simple and quick, but frowned upon by aficionados. That’s just to copy the spreadsheet, or the part of it you want to import, go into R, and type something like

mydata <- read.table(file = "clipboard", header = T)

[7] https://cran.r-project.org/doc/manuals/r-release/R-data.html (accessed 10/10/2016)

This can be done in different ways. Using read.table and providing a path to a comma-delimited file that contains the data (in place of "clipboard") is often recommended. However it’s done, it results in what’s called a data frame (or dataframe), discussed in more detail in Section 2.2.3. It is assumed here that headers naming the columns are given, with the caveat to follow. Spaces in headers must be avoided (and if other illegal characters are used an error message in R will inform you). Missing data should be coded as NA and there should be no blank entries in the data file. A rectangular table of data is expected. You have, in fact, two options. If all the columns have headers the row names are assigned as 1, 2, 3, . . .; if the first column doesn’t have a header (i.e. the top-left corner entry in the spreadsheet is blank) the entries in it are assigned as row names. Which is to be preferred depends on how exactly you want to analyse the data and is illustrated in context in later chapters.
2.2.3 Data structures

Data can be ‘structured’ in various ways in R. It matters because some analyses one might wish to undertake depend on the data having a particular structure. What follows is a very simplified discussion, limited to what’s needed later in the text. See some of the references cited earlier for systematic discussions.

As we’ve treated data entry above, the data are stored in an object, mydata in the example, that is a data frame. You can query whether this is the case using is.data.frame(mydata). It will be rectangular and consist of p columns. Those that are numeric consist purely of numbers; columns containing non-numeric information, such as a context or artefact type identifier, will be treated as a factor. Factors are important but can be a pain for a variety of reasons, even at the basic level they are used for in this text. One reason is that, for the purposes of data entry, the names you give them may not be those you wish to use in graphical or tabular displays. Another is that some of the tools (functions) available for analysis will, by default, order the factors alphabetically in a graphical display and this may not be what you want. Sorting this kind of thing out is ‘messy’; we discuss how to do it when the problems arise in later chapters. Chang (2012: 343–352) usefully discusses the various issues and remedies involved.

For some purposes it is useful to extract subsets of variables from a data frame. To anticipate later material, if the data frame mydata contains a variable, Weight, that you want to extract and it is the third column in the data frame then

Weight <- mydata$Weight

or

Weight <- mydata[, 3]

will do it. If you want to select a subset of the data, say columns 2, 3, 4, then

newdata <- mydata[, 2:4]

or

newdata <- mydata[, c(2,3,4)]

does it. The latter form here is more general since it allows you to select non-contiguous columns. Rows can be similarly selected by placing the selection before the comma in [,], and you can combine row and column selections in an obvious way.

An issue that sometimes arises is that what should be treated as a factor is sometimes recorded numerically, for example in a stratigraphic sequence 1, 2, 3, etc. This may need to be converted explicitly to a factor and can be important for some uses, including that of the package ggplot2 to be discussed. If, for example, the data frame mydata contained a variable Context, coded numerically but to be treated as a factor, you can use

mydata$Context <- as.factor(mydata$Context)

Illustrative examples follow later.

Some analyses that also follow later require the data to be recognised as a matrix. This is a rectangular array of data consisting solely of numerical values. Visually a data frame can look like this but sometimes irritating error messages, to the effect that a matrix or some other structure is needed, will be provided. If the contents of mydata are numerical and this sort of thing happens use

newdata <- as.matrix(mydata)

If the data frame consists of numeric variables and factors extract the numeric columns needed into newdata. Use is.matrix(newdata) to see if it has the structure you want; if not, convert it using as.matrix.
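Pulling these pieces together, a self-contained sketch; the toy data frame is invented purely for illustration.

# a small invented data frame: one numeric code that should be a factor,
# plus two genuinely numeric variables
mydata <- data.frame(Context = c(1, 1, 2, 2),
                     Length = c(82, 75, 91, 88),
                     Weight = c(310, 270, 295, 330))
mydata$Context <- as.factor(mydata$Context)  # treat the numeric code as a factor
Weight <- mydata$Weight                      # extract a single variable
newdata <- as.matrix(mydata[, 2:3])          # the numeric columns as a matrix
is.matrix(newdata)                           # returns TRUE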
2.3 Functions and packages

2.3.1 Functions

Once you’ve entered some data you do something with it by calling a function to which you supply arguments. If, for example, you have a variable x that you want to produce a histogram of (Chapter 4) then

hist(x)

will do it. This is using the function hist. If you wanted to save the outcome of a call for future use

myhist <- hist(x)

will do it. As it stands this will also plot the histogram; if you wanted to suppress the plot use the argument plot = FALSE thus:[8]

myhist <- hist(x, plot = FALSE)

[8] This can be abbreviated, as in similar functions, to plot = F; plot = TRUE is the default and TRUE can be abbreviated to T.

To get help on a function type ?name (e.g., ?hist). This will provide a list of the required and optional arguments, their defaults and – if you are lucky – illustrative examples that you can learn from by emulating with your own data. If you are not so lucky plenty of online advice, not all of it sound, is available.

For the purposes of this text the arguments supplied to a function can often be split into two groups, the ‘business’ and ‘presentational’ parts. We think of the ‘business part’ as the minimum needed to get something suitable for your own personal inspection. This can be both very minimal and quick. If two variables, x and y, are available, say weight and height, and you want to see how they are related,

plot(x, y)

may be sufficient for your purposes. If the graph is deemed to be worthy of public exposure attention needs to be given to its appearance.

plot(x, y, pch = 16, col = "red", xlab = "Weight (g)",
     ylab = "Height (cm)", main = "A graph of height against weight",
     cex = 1.2, cex.axis = 1.2, cex.lab = 1.3, cex.main = 1.4)

will endow the graph with axis labels, a title (main), and plotting characters (pch) that are solid circles coloured red (col). The character expansion (cex) arguments control the sizes of the points, axis text, axis labels and title; the appropriate choice will depend on the context in which the graph is to be presented.

Plots may be further embellished by using additional functions, placed after the initial command, that add lines, extra points or text and so on. Later chapters provide examples. At the lower, but important, levels of statistical endeavour that are mostly the subject of this text, preparing a graphic that is suitable for presentation is often far more time-consuming than the initial data exploration. This can be off-putting until you get used to it, but has the merit of allowing far more control than is sometimes exercised in graphs published in the archaeological literature.

It’s also possible to write your own functions, which ultimately can save a lot of time if you are undertaking repetitive analyses or writing lengthy code. This will be discussed in context at greater length in Chapter 5 once we get round to some real data analysis. It is strongly recommended that you read this and get into the habit of embedding all but the simplest code in a user-defined function. It makes it a lot easier to correct errors and add code, without having to retype code at the terminal.
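To make the recommendation concrete, a minimal sketch of the sort of user-defined function meant; the function name and its contents are our own invention, wrapping the scatterplot code shown above.

myplot <- function(x, y) {
    # the body is ordinary plotting code; the graph appears when the
    # function is called, and edits are made here rather than retyped
    plot(x, y, pch = 16, col = "red", xlab = "Weight (g)",
         ylab = "Height (cm)", main = "A graph of height against weight",
         cex = 1.2, cex.axis = 1.2, cex.lab = 1.3, cex.main = 1.4)
}

myplot(x, y)  # execute the function to produce the graph

Chapter 5 develops this properly.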
2.3.2 Packages – general

Packages are collections of functions and, commonly, data sets. David Carlson’s archdata package, for example, consists of data sets that have been used in the archaeological literature though, in other packages, the emphasis is usually on functions. Packages exist that consist of a single function.

Download R and you have immediate access to several packages that will be referred to as ‘base R’. Other packages are also bundled with R but need to be loaded before you use them. For example, if the splines package is needed the following may be used (comments follow #).

library(splines)         # loads splines
library(help = splines)  # lists available functions

There are a vast number – over 6,000 and counting – of user-contributed packages that need to be installed before they can be loaded. There’s more than one way of doing this; what’s described is what we typically do on our PCs. In R, from the Packages menu, select Set CRAN mirror to choose a site to import from then, from the same menu, select Install package(s) and select the package you need. This can then be loaded whenever you want it; you should only need to install it once.

2.3.3 Packages – graphics

We’re used to base R and have found it more than adequate for producing graphs fit for publication for many years. A variety of graphics packages for R that extend its functionality, or are intended to improve on it or make things easier for the user, exist. Those usually cited are the grid system developed by Murrell (2005, 2011), lattice (Sarkar 2008) and ggplot2 (Wickham 2016). The later packages draw on elements of the earlier ones, as well as base R itself. The ggplot2 package has something of a cult following. It is increasingly the package of choice in texts that place some emphasis on graphics in R (e.g., Chang 2012; Unwin 2015).

We had to make a choice about what graphics system(s) we’d favour in producing this text. In the event we’ve opted – where the choice exists – for a ‘parallel’ presentation of base R and ggplot2 graphics. There are a number of reasons for this, of varying merit. One is that we were familiar, and comfortable, with base R which, as already noted, is more than adequate for producing publication-quality graphs, particularly if they have to be in black-and-white. For the purpose of writing this text we had to ‘learn’ how to use ggplot2 – it is quite clearly superior to base R for some (but not all) purposes – which managed to be both enjoyable and sometimes frustrating at the same time.

These are the less meritorious reasons for choosing the form of presentation adopted. There are other reasons to be mentioned, but first a little more detail about ggplot2 is provided. In the words of its creator, Hadley Wickham, ‘it is unlike most other graphics packages because it has a deep underlying grammar’ (Wickham 2016: 3). The ‘grammar’ is derived from a text, The Grammar of Graphics, of Wilkinson (2005). We must admit to not having read this but our (possibly simple-minded) understanding is that a graph is broken down into its constituent elements (the different ‘atoms’ of which it is composed if you wish) and ggplot2 is intended to allow complete control over the different elements.

If you stray from menu-driven software you have to learn the language to use it. Although Wickham (2016: 3) claims ggplot2 is ‘easy to learn’, we don’t entirely agree. For those with an interest in, and aptitude for, things computational, learning it is presumably not a problem; for the innocent end-user not immersed in such things who simply wants to produce a nice graph it can be a barrier.[9]

[9] There are plenty of discussion forums on the Web that are a valuable resource if you want advice about how to do something specific. A problem, though, is that some of the people who populate these seem more interested in demonstrating their own ‘computational nous’ than providing a simple answer, which often exists, to a simple question.

The package is also quite prescriptive in some ways, has defaults not all of which you may like, and doesn’t always easily allow you to do what you want. You can override the defaults, though it can take time to work out how to do this.
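As one small example of overriding a default – assuming p holds a ggplot2 graph of the kind built up in the next subsection – the built-in themes replace the grey background wholesale:

p + theme_bw()  # a black-and-white theme; theme_minimal() is similar

Finer-grained control via theme() is illustrated in Section 2.3.4.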
Sometimes, with ingenuity, it’s possible to coerce it into doing what you want if your vision doesn’t accord with the way ggplot2 is designed. As far as ‘prescription’ goes the data must usually be in the form of a data frame; if it is not naturally in this format and you need to bind different data sets/columns/rows of data you need to ensure that what emerges is a data frame. For some purposes the ‘natural’ data structure comes in the form of an n × p data frame, possibly with identifying information such as ‘type’, ‘origin’, etc., but this may need to be converted to ‘long’ format. That is, the numeric data are stacked to form a column vector, with identifying information similarly treated. The reshape2 package provides functions to facilitate this; Chang (2012, Chapter 15) usefully covers this, as well as other aspects of data manipulation, including the construction and modification of data frames.

The approach adopted in this text is to suggest that the data be prepared in the format you want to use outside of R. This may differ according to the graphics system you use and the kind of graph it’s intended to produce, and data manipulation within R is sometimes unavoidable. There is a difference between what you might want to do for the purposes of initial data exploration and what you might want to do for publication. For the first purpose base R can be quicker; once you get used to it ggplot2 can be quick as well, especially if you have a ‘template’ to accomplish repetitive tasks, but it typically involves a little more coding effort. This is discussed in more detail in the next subsection.

For some analytical purposes that involve a graphic, and with a possible eye to publication, we think it doesn’t much matter whether base R or ggplot2 is used. For some purposes ggplot2 is clearly superior. We will make it clear in the text where we have views about the superiority or otherwise of either approach, but provide examples and code for both so readers can make up their own minds. It partly depends on how the outside world is to be privileged to view your work should it be made public. For online publication colour is now natural (Section 2.3.5) and ggplot2 is more suited to this though you may wish to exercise control over the colours. It’s straightforward enough to produce colour in base R. For black-and-white publication, and alas despite rumours to the contrary this is still often unavoidable, some effort needs to be invested in making ggplot2 comply, though it’s not too onerous.[10]

[10] As an aside, it’s obvious from online traffic that some ggplot2 enthusiasts (and others) think that letterpress is, or should be, a thing of the past and that everything should be online. They may be living in the future but not in the real world as presently constituted, at least as far as archaeological academic publication is concerned.

A ‘selling point’ of ggplot2 is that legend construction is relatively painless compared to base R. We don’t entirely agree and, to judge from online traffic, are not alone. For fairly simple graphs there is not a problem, once you work out how to control legend placement. For more complicated graphs legends can pop up all over the place, can look horrible, and it’s sometimes far from obvious why this is happening and how it can be controlled. Chang (2012) provides a lot of useful practical guidance; otherwise go online for help.
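The placement problem, at least, is easily dealt with; a minimal sketch, assuming p holds a ggplot2 graph with a legend (the choices shown are illustrative only):

p + theme(legend.position = "top")          # move the legend above the plot
p + theme(legend.position = "none")         # suppress the legend entirely
p + theme(legend.title = element_blank())   # drop just the legend title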
Legend construction in base R can be tedious; we don't have any problems with it, but this, like everything, is probably a function of familiarity.

2.3.4 Plot construction – ggplot2

For a detailed account of ggplot2 see Wickham (2016). Chang (2012) is a useful practical guide, while Unwin (2015) is laden with analyses of real data and code that can usefully be emulated. What follows here is more simplistic.

Later chapters provide the code for a variety of plots in both base R and ggplot2. The former will be dealt with as we go along. In this section a more general account of plot construction in ggplot2, as we see and use it, is provided to 'set the scene' for what follows.

In the beginning you need to have installed ggplot2 and other packages such as grid, and loaded them into R using the library function as described in Section 2.3.2. Depending on what you do you may get error messages telling you some necessary package is missing, in which case install and load it.

For the purposes of exposition we shall suppose we have a data set, data1, consisting of two variables, X and Y, and a factor, Z, with two categories that might, for example, be two different contexts/periods. For a specific example, think of X and Y as the length and width of some appropriate artefact type recovered from two different sites.

All the ggplot2 graphs in the text begin with, as a minimum, something like

ggplot(data1, aes(x = X))

where the first argument to the ggplot function, data1, specifies the data set that will be used. The aes argument, which is shorthand for aesthetics, takes a little getting used to but is important as it controls the appearance of the graph that subsequent commands will bring into being. In the present instance it's establishing that subsequent operations are to be performed on the variable X – you don't get a plot at this stage. To get a plot you need to specify what you want; what follows is again minimal but suppose a histogram is what's needed (Chapter 4);

ggplot(data1, aes(x = X)) + geom_histogram()

will do it, though the result is not especially satisfactory. See Chapter 4 for details about how the plot can be customised by adding arguments to geom_histogram(). Note the use of + to add additional commands; as presented above a plot will emerge. If you need to add additional commands – normally the case – just stick a + at the end of the line. A lot of the code in the text is presented in this kind of format. Other ways of getting the same result are

p <- ggplot(data1, aes(x = X)) + geom_histogram()
p

which will store the code in an object, p, and only produce the output when you type p. The same result is produced using

p <- ggplot(data1, aes(x = X))
p <- p + geom_histogram()
p

which builds up the code incrementally before printing the result. What's been presented so far is what we think of as the 'business part' which has already been mentioned in the context of base R. In most of the examples to follow most of the code is presentational. Since each element of a ggplot2 graph can be controlled separately, and since some of the necessary code can be rather 'wordy', it can make for tedious coding.
Once you've mastered it, though, and developed a 'template', it can be copied from one analysis to the next. Now to complicate things a little, suppose that what's needed is a plot of X against Y, dealt with at much greater length in Chapter 8.

p <- ggplot(data1, aes(x = X, y = Y))
p <- p + geom_point()
p

does the job. It's not terribly exciting at this point, but what's going on, in general, is that you create an 'invisible' background framework then bring a graph into being by adding all the things you want it to show, in this case just the points. The fun really begins when you want to differentiate between points according to the level of the factor, Z, they are associated with. It's easy in principle;

p <- ggplot(data1, aes(x = X, y = Y, colour = Z, shape = Z))
p <- p + geom_point()
p

maps colours and symbol shapes to the points according to the level of the factor Z11. At this point you may decide you don't like the default colours (we don't) and may want to change them, and possibly the symbol shapes. The next bit of code shows how this can be done.

p <- ggplot(data1, aes(x = X, y = Y, colour = Z, shape = Z))
p <- p + geom_point()
p <- p + scale_colour_manual(values = c("blue", "red"))
p <- p + scale_shape_manual(values = c(16, 17))

This is for illustration – more detail occurs later. Colour is discussed in the section to follow and symbol shapes (and line types) are covered in Chapter 8. And so you build up the plot, customising as you wish. The sort of code you end up with may look like what follows.

p <- ggplot(data1, aes(x = X, y = Y, colour = Z, shape = Z))
p <- p + geom_point()
p <- p + scale_colour_manual(values = c("blue", "red"))
p <- p + scale_shape_manual(values = c(16, 17))
p <- p + xlab("Suitable label") + ggtitle("Suitable title") +
  theme(panel.grid.minor = element_blank()) +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 14)) +
  theme(plot.title = element_text(size = 16, face = "bold")) +
  # control position and size of legend and text
  theme(legend.position = "top", legend.text = element_text(size = 14),
        legend.title = element_text(size = 16),
        legend.key.height = unit(.8, "cm"),
        legend.key.width = unit(.8, "cm"))

Some of what's now to be said will be repeated, without apology, in later chapters – a bit of verbal redundancy never did anyone any harm. By default ggplot2 graphs come with a light grey background and major and minor grid lines. As these things go they're not obtrusive, but we find the minor grid lines a bit 'fussy' and get rid of them with the first of the theme commands – this is entirely a matter of taste and you can get rid of everything and construct a black-and-white graph if you want.

11 The American spelling color will also work.

The next two theme commands just control aspects of the appearance of the axes, axis labels and title – other options are available. The price you pay for the control you can exercise is the 'wordiness' of the necessary code. There's more than one way of getting the kind of plot that the above code would produce. You can, for example, specify the aesthetics within geom_point rather than ggplot, and aesthetics can be specified outside, rather than within, aes(). We'll meet examples later.

2.3.5 Colour

Whereas, in this text at least, the choice of plotting symbols (and line type) mainly arises in the context of scatterplots (and hence is introduced in Chapter 8), colour is ubiquitous and is discussed briefly here. It is, or can be, a complex topic and we largely touch only on what's needed in the text.
More detailed treatments are available in Section 6.6.2 of Wickham (2016) and Chapter 12 of Chang (2012) and in online material related to Chang's book12. Earl F. Glynn's web page is also a valuable resource13.

12 http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/ (accessed 10/10/2016)
13 http://research.stowers-institute.org/efg/R/Color/Chart/index.htm (accessed 10/10/2016)

For the most part we keep things simple and usually specify colours of bars or points by name. We shall only discuss this here; the reader is referred to the sources cited for a more thorough discussion. If you type colors() or colours() a list of 657 colours that are immediately available is obtained, numbered according to their position in the list. Thus colour 450 is magenta. If, as a simple example, you wanted a plot of x against y, with a solid circular plotting symbol coloured magenta, then

plot(x, y, pch = 16, col = "magenta")

does what you want. Replacing "magenta" with its position in the list, 450, or using hexadecimal code, "#FF00FF", has the same effect. For the last of these you need to know the code; Glynn's web page is valuable in this respect since it provides colour charts to show what the colours look like and how they contrast, including a listing of the hexadecimal codes.

A variety of colour palettes are available in base R and in contributed packages. We do not make much use of these in the text, but will draw attention to them when we do. Chapter 12 of Chang (2012) (or the associated web page) is recommended for those wishing to pursue this in more detail.

It's possible to get side-tracked by colour theory – both its scientific and psychological aspects. Section 6.6.2 of Wickham (2016) and Glynn provide brief accounts. Wickham (2016: 142) counsels against the use of 'overwhelming' bright colours on bars, suggesting that subtle colours are better. Where we appear to have ignored this advice think of it as an indication of how things are done rather than a specific recommendation.

Other general points in no particular order of importance include the following.

• It's possible to produce transparent figures in various ways, perhaps most simply by adding two extra numbers at the end of a hexadecimal code to control the degree of transparency (a sketch follows this list). We comment in the text when we do this, but haven't used it much.

• Most authors who deal with the subject point out that the colours you see on your screen won't necessarily be the same if you publish in other media. The message, we think, is that if you want colour printing, check what it is likely to look like before submitting anything.

• Don't neglect grey-scale colours; these may be needed for black-and-white publication, and can look perfectly OK regardless.

• Both Wickham (2016) and Chang (2012) (and others) draw attention to the fact that a non-negligible number of potential readers of your text (about 10%, mostly male) will suffer some form of colour blindness. Apparently if you submit a text for publication, with colour, and it's seen by three male referees there is about a 22% chance that at least one of them will be colour blind. Chang provides a useful discussion of how to alleviate the problem and draws attention to packages with palettes designed to be colour-blind 'friendly'.

• Because of this last issue the juxtaposition of red and green in a plot is best avoided. Another strategy, cited for graphs that employ point symbols, is 'redundancy' in the graphical presentation: if you differentiate points using different colours, endow them with different shapes as well, in case you have readers who can't differentiate the colours.
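By way of illustration of the transparency point above – a minimal sketch, assuming x and y are numeric vectors (the trailing "80" is an alpha value, roughly 50% opacity, appended to the magenta code mentioned earlier):

plot(x, y, pch = 16, col = "#FF00FF80")   # semi-transparent magenta points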
2.4 Don't panic

Advice that is good in many circumstances. Arthur C. Clarke said that Douglas Adams' use of it in The Hitchhiker's Guide to the Galaxy was perhaps the best advice that could be given to humanity.

When you begin using R from scratch, by typing in the commands, you will almost inevitably get error messages. Quite frequently these will be typing errors; R doesn't like spelling mistakes and is case sensitive. Common errors that the second author found herself making included misreading and mistyping commands that required the underscore character – typing, say, geom.histogram or geom-histogram instead of the correct geom_histogram. Also, given the proximity of zero (0) and the letter o on the standard English keyboard, she often discovered she had typed o rather than 0. The error messages attempt to tell you what is wrong, but it is always useful to read with care what you have typed in to see if you have made similar slips of the typing fingers. Once you get used to the R process, you will find yourself making them less often. The second author found this process had started to kick in by about Chapter 6 as she worked her way through the text.

Cutting and pasting the code either directly from this document (if it's complete) or from the text file that is also provided (Section 1.6) is, of course, a way to avoid such errors. At some point, though, you will want to start applying it to your own data and changing defaults. At that point the error messages may well start. Don't despair. Troubleshooting starts with checking for typing errors. You should also check whether you have imported the data correctly, defined the variables correctly and so on. Checking data and variables is easily done by typing in the name of the data set (e.g., pins) or the variable name (e.g., length). The package will helpfully print out what it thinks the data set pins and the variable length consist of.

The other thing that can happen is you get 'trapped' when constructing code. Instead of getting the prompt sign (>) inviting you to continue you get a plus sign (+) which tells you the current piece of code is unfinished. This can happen because you've omitted a closing ), in which case type it and all will be well. If it's worse than this press the Esc key to escape, return to the prompt, and start again.

The problem with making typing errors is the reason why we advocate using functions as described in Chapter 5. Once you have edited them into a state where you know they work (see Section 5.1.2), you will be able to use them again and again to produce different graphs with different data sets. It becomes very easy to create your own house style or, indeed, the house style of any publication you are writing for.

Chapter 3
Descriptive statistics

3.1 Introduction

Descriptive or summary statistics are numbers intended to summarise some property of a set of data.
The main emphasis in this text is on the graphical representation of data but an acquaintance with descriptive statistics helps to describe what these show. The graphs themselves may also indicate which descriptive statistics are useful summaries. Most readers should be acquainted with much of what is presented in what follows – they are the staple of the opening chapters of introductory statistics texts, including those written for archaeologists such as Shennan (1997) and Drennan (2009). The discussion will therefore be fairly brief with an emphasis on what's relevant for later chapters. Forward reference is made to later chapters that illustrate the discussion and some readers may wish to skim this chapter in the first instance and refer back as needed. Section 3.2 discusses the types of data encountered in the text; Section 3.3 gives details about the definitions of summary statistics and their use; Section 3.4 indicates how they can be calculated in R.

3.2 Types of data

Textbooks can go on at some length about the different kinds of data that can arise in applications. What follows is restricted to what's needed for later chapters.

Continuous data: This will be used for any variable that can, in principle, have any numerical value within some plausible range. The height of a pot in centimetres, for example, could be recorded as 10.123456 etc. In practice there is a limit to the accuracy of measurement so this would be recorded as (or rounded to) something like 10.12 or even 10.

Discrete data: This will be used for numerical data that can only assume a distinct set of values, usually the integers 0, 1, 2, etc. An example might be the number of different decorative motifs on an artefact. Where continuous data are rounded to integers covering a reasonably small range it may be convenient to treat them as discrete data (Section 10.4.1).

Categorical data: The frequency (counts) of observations within a set of classes (categories) is available, such as the colour of glass beads (black, white, blue etc.). In the instance given the ordering of the categories is arbitrary. If it's possible to order the categories (e.g., small, medium, large) the data are ordinal. If data are unordered categorical it may often be sensible to rearrange the data, by frequency of counts, for example, or any other sensible archaeological criterion. This would not usually be advisable for ordinal data.

Continuous data for single variables are the main concern of Chapters 4, 6 and 7. Chapter 8 deals with the graphics for investigating the relationship between two continuous variables. Chapter 11 deals with the special case of ternary diagrams which can be used, after suitable data transformation, with either continuous data for three variables or three-category categorical data. Categorical data are dealt with in Chapters 9, 10 and 12. There is more of an emphasis in these chapters on dealing with crosstabulated, or crossclassified, two-way tables of data based on two categorical variables.

3.3 Summary statistics

3.3.1 Definitions

The summary statistics to be defined are mostly applicable to continuous or discrete data. It is usual to distinguish between two kinds of statistic.

Measures of location: A measure of location is an attempt to summarise what the 'central tendency' or typical value for a data set is. The ones most commonly encountered are the mean, median and mode.

Measures of dispersion: A measure of dispersion attempts to summarise the spread of a set of data.
The standard deviation (or its square, the variance), the range, and the interquartile range (IQR) are perhaps the most familiar.

These will now be described using words for the most part; for more formal presentations see any elementary statistics text (or Wikipedia). The subsection that follows discusses their use. Some of the statistics to be discussed can be very sensitive to extreme values in a data set (outliers), particularly with small samples. Statistics which try to avoid this problem are often referred to as robust. The following list is of the most common measures of location.

Mean: More properly called the arithmetic mean1, this is just the sum of a set of numbers divided by the sample size, n. It can be sensitive to outliers in the data, particularly for small n.

Median: This is the middle value in a set of numbers ordered by size. If n is an even number there will be two central values and their mean is calculated. The median is an example of a robust statistic since it will be little affected by extreme values, if at all.

Mode: This is the most commonly occurring value in a set of data. It can be fairly useless with a set of continuous data where every number may be different or only 'accidentally' the same because of rounding. If the data are grouped into classes (as in histograms, discussed in Chapter 4) it makes sense to talk of a modal class and this terminology can be used with one-way categorical data. If the distribution of the data is smoothed using some technique such as kernel density estimation (Chapter 6) the mode will be apparent as a peak in the data. If there are several peaks the data are multimodal. Sometimes a distinction may be made between the highest peak, the major mode, and minor modes.

Trimmed means: These aren't much used in archaeology. An arbitrary percentage of the smallest and largest values in the data is removed and the mean recalculated. Typically the same small number of cases at each end of the scale is trimmed in the hope that this will produce a robust statistic unaffected by outliers.

The mean is probably the most familiar but also the least 'intuitive' of the first three measures listed. It's the 'centre of gravity' of a set of numbers. Measures of dispersion can be a bit trickier to define.

Range: The largest and smallest values in a data set are the maximum and minimum (rather obviously). The range is the difference between them. It is perhaps the most obvious measure of dispersion but is very sensitive to outliers, though it can be of interest in its own right.

Interquartile range (IQR): The IQR is a more robust measure of spread; quartiles need to be defined first. Order the data by size then find three numbers that divide the ordered data into four equal-sized groups as closely as possible. The middle one of the numbers is the median; the value which separates out the smallest 'quarter' of numbers from the rest is the first quartile (or lower quartile); the value separating out the largest 'quarter' is the third quartile (or upper quartile)2. The IQR is the difference between the third and first quartiles; it encompasses the central 50% of the data and is often used as a measure of dispersion in conjunction with the median as a measure of location3.
Standard deviation: The standard deviation is often used as a measure of dispersion to accompany the mean as a measure of location. Take the difference of each observation from the mean; square it (so everything has a positive sign) and calculate the mean of the resulting numbers (i.e. the mean of squared deviations from the mean!). This is the variance; the standard deviation (sd) is the square-root of the variance and is in the same units as the mean4. As with the mean it's not 'intuitively obvious' but makes sense once you think about it. By definition the number obtained is positive; the more spread out the data are the larger will be the average (squared) distance from the mean, so large values indicate a more dispersed set of data. The caveat should be entered that the variance and standard deviation, like the mean, can be sensitive to outliers in the data.

Mean absolute deviation: If you had to design a measure of dispersion from scratch the average of differences from the mean seems the more 'natural' one to use. You have to ignore the sign of the differences (i.e. use absolute values) otherwise the result will be zero. This is the mean absolute deviation, otherwise 'MAD'. It can be found in archaeological publications but is not that widely used. This is partly because the standard deviation (and variance) is mathematically easier to handle, with nice properties in some circumstances, and their use is hallowed by tradition.

1 It's possible to define other kinds of mean such as geometric and harmonic means and they do sometimes occur in the archaeological literature, but mean will be used in the text as defined here.

2 The 'proper' definition of quartile exercises some people, though most readers may find the discussion esoteric. As just defined the quartiles are numbers that, along with the median, divide the data into roughly four equal parts. An alternative view is that a quartile is an interval; so, for example, the first quartile is the interval between the minimum and the number that separates out the lower quarter of the data. We use the term in its first sense.

3 We're skimming over the fact that how to divide a data set into 'quarters' is not immediately obvious for inconvenient values of n and different ways of doing so exist. This is touched on again in Section 3.4; it shouldn't matter much if n is reasonably large. You may see the terms 25th and 75th quantile or percentile used to refer to the first and third (or lower and upper) quartiles. This terminology can be extended to numbers that separate out fractions of the data other than the smallest quarter and smallest three-quarters.

4 It's not important for the purposes of this text, but what has just been described is what's used if you are dealing with numbers that constitute a population. If you conceive of the numbers as a random sample from a population and want to draw statistical inferences about properties of the population the mean of the squared deviations is usually calculated using a divisor of (n − 1) rather than n. Since we're not discussing statistical inference there's no need to go into the reasons for this. It doesn't much matter if n is at all 'large' and good introductory statistics texts will alert you to why this is done if it interests you.
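The verbal definitions above translate directly into R. A minimal sketch, using a made-up vector (note that R's built-in var() and sd() use the (n − 1) divisor mentioned in footnote 4):

x <- c(51, 62, 70, 80, 87)    # any small set of numbers will do
m <- mean(x)                  # arithmetic mean
v <- mean((x - m)^2)          # variance, population form (divisor n)
s <- sqrt(v)                  # standard deviation, same units as the mean
mad.mean <- mean(abs(x - m))  # mean absolute deviation ('MAD' as defined above)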
3.3.2 Usage (and abusage)

It's common to see the (arithmetic) mean referred to as the average of a set of numbers. Don't do this. Some people who use the term 'average' – and the usage appears in both the popular and unpopular press – appear to be referring to the mode or median, so it is a vague and often undefined term5.

5 You're allowed to use the word in everyday conversation. Depending on context the person you're talking to will either think that they (and you) understand what you're saying or, if they care about numeracy and it matters, ask for clarification.

Most of the statistics listed above are widely used; they are not always useful. One of the reasons for graphical analysis, and why it should be de rigueur, is that apart from revealing pattern/structure in the data, and any problems you have with it, it is informative about the kind of summary statistics that may be useful for describing the data if that's the way you wish to go6. Figure 3.1, which uses continuous data, illustrates a variety of 'shapes' that data might have, using histograms (Chapter 4). The data have been randomly generated to display the features emphasised.

6 In view of some of the content of later chapters, this is not the same as saying the graphical analyses undertaken, initially in the privacy of your own study, should necessarily be inflicted on an unwitting public in the form of a published paper. The gist of some of what's to follow is that the graphics may inform you but that the message they have to impart can often be conveyed in a more efficient form than a redundant pie chart or barplot.

Figure 3.1: Some possible distributional 'shapes' for continuous data; (a) symmetric unimodal, (b) bimodal, (c) skewed, (d) with an obvious outlier.

1. Figure 3.1a is a sample from a symmetrical distribution (normal/Gaussian). The mean, median and mode/modal class will be very similar and data looking like this are frequently summarised using the mean and standard deviation; this may often be all that is needed.

2. Figure 3.1b illustrates a bimodal distribution. If the aim is to summarise what a 'typical' value looks like then the mean and median would be fairly useless as they will be closest to the least typical values (the antimode). The data could be summarised by stating that they are bimodal, with an indication of where the modes lie. The usual measures of dispersion are also useless and such data are often best presented in graphical form.

3. Figure 3.1c shows a skewed distribution with a long right-tail. Such data are quite common. For the 'shape' illustrated the mode will be smaller than the median and the median smaller than the mean. There is no single 'correct' measure of location or dispersion; they tell different stories and all are equally valid. Such a data set is often best summarised graphically.

4. Figure 3.1d shows an obvious outlier. The other figures have used a sample size of 1000; this uses a sample size of 30 (from a normal distribution) with an added outlier. Faced with this kind of thing in practice, and if it's unanticipated, the first reaction should probably be 'What!' quickly followed by 'Why?'. Depending on how close you are to the data collection it may become apparent that it's a measurement or recording error that you may be able to do something about. If it's genuine you get interested; if obviously not and you can't explain why then you probably discard it. If it is 'real' and you want a summary of the data several options are open. The mean and standard deviation will be misleading; if you want to summarise the data using them, omit the offending observation(s) and report on it (them) separately.

This is why robust statistics are advocated in some areas of the statistical literature; if you use the median and IQR for summary purposes they are largely unaffected by extremes.
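That claim is easy to check numerically; a minimal sketch with made-up data (rnorm draws a random sample, so exact values will differ from run to run):

set.seed(1)                  # for reproducibility
x <- rnorm(30, mean = 10)    # 30 'well-behaved' values
xo <- c(x, 50)               # the same values plus one wild outlier
c(mean(x), mean(xo))         # the mean shifts noticeably
c(sd(x), sd(xo))             # the standard deviation even more so
c(median(x), median(xo))     # the median barely moves
c(IQR(x), IQR(xo))           # nor does the IQR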
The above is by way of saying that, unless the data are very 'well behaved', continuous data are often best summarised in the form of a graphic. We are rather more ambivalent about the use of graphical methods for displaying simple categorical data sets, at least as witnessed by countless examples in the archaeological literature (Chapters 9 and 10). Given data on a single variable, classified into a small number of categories, anything the data have to tell you that you want to draw attention to can frequently be described in a single sentence or by commentary on a small table rather than the often large amount of space a pointless graphic occupies. Pie charts and barplots, probably the most common type of graphic to be found in the archaeological literature, where the data fall into just one category (yes, such examples exist!) or two or three, are usually unnecessary if the story is obvious. Ignoring aesthetic considerations – frequently pie charts and barplots offend your sensibility – charts/plots are not necessarily technically incorrect (they can be) but are not always needed7.

7 Possibly the intended audience is judged to be incapable of understanding a sentence to the effect that 'Most (95%) of artefacts are of Category A with other categories rare or not represented' or 'Two categories, A and B, dominate the assemblage (95%) in roughly equal proportions, with the six other categories represented by three or fewer examples'. We have real published examples in mind here.

Some graphical displays need to be understood in terms of the statistics that the display embodies. Boxplots, for example and as typically used (Chapter 7), show the median, first and third quartiles, and maximum and minimum values. This is sometimes called a 'five-number summary'. Refinements, such as devices for indicating potential 'outliers', are usually in place. Boxplots can be uninformative if the data are multimodal, and misleading (in terms of the number of 'outliers' suggested) if the data are skewed. Violin plots and beanplots, which we don't recall seeing in archaeological publications, can be more informative than the quite common use of boxplots (Chapter 7). The main point here is that you should look at data in more than one way, graphically, and the graphics will inform you about the most useful way to display or summarise the data.

3.4 Obtaining summary statistics in R

It's straightforward to obtain individual descriptive statistics in R, as it should be in any decent statistical software package (Excel does not fall into this category). In R create the data set of interest. For illustrative purposes the data of Table 3.1 will be used; this shows the lengths (mm) of 35 late Romano-British hairpins, and will be discussed in more substantive detail in Chapter 4 in connection with Table 4.1 where data on early hairpins are also provided. For brevity of presentation these will be read into R in a variable (vector) x; in practice a more informative name is advisable.

Table 3.1: Lengths (mm) of late Romano-British copper alloy hairpins from southern Britain (Cool, 1990).

51 52 54 56 57 58 60 60 61 62
62 63 63 63 65 65 66 67 68 68
70 70 70 70 71 74 75 77 78 78
80 80 82 82 87

It is advisable, if not essential, to examine the data graphically before doing anything even as simple as calculating a mean.
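For concreteness, one way of entering the Table 3.1 values as the vector x (the name, as noted, is just shorthand) is:

x <- c(51, 52, 54, 56, 57, 58, 60, 60, 61, 62,
       62, 63, 63, 63, 65, 65, 66, 67, 68, 68,
       70, 70, 70, 70, 71, 74, 75, 77, 78, 78,
       80, 80, 82, 82, 87)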
To anticipate the material of Chapters 4 and 6, Figure 3.2 presents a histogram and kernel density estimate (KDE) of the data. There is nothing dramatically untoward about the graphs in the way of obvious outliers or multimodality so that, for example, using the mean and standard deviation as summary statistics is not ruled out. The distribution is, however, not especially symmetrical so that, for example, different measures of location will produce different results and it is worth looking at more than one. Detailed aspects of graph construction and interpretation are discussed in the relevant chapters; for the record the defaults hist(x) and plot(density(x)) were used to produce the graphs. They've been tidied up a bit for presentational purposes, and code is given in later chapters, but this is not essential if you just want a quick exploratory look at the data.

Figure 3.2: (a) A histogram and (b) a kernel density estimate for late Romano-British hairpin lengths.

Once you've decided what's sensible and what statistics you want to look at, how do you get them? Several 'short-cuts' exist. A quick way to obtain the mean, median, quartiles, maximum and minimum in R is to use summary(x) which returns the following

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  51.00   61.50   67.00   67.57   74.50   87.00

while fivenum(x) produces the same statistics other than the mean and is less usefully labelled. These are available in base R. Other functions that produce differing sets of descriptive statistics are available in various packages. Rather than attempting a comprehensive summary here the best advice is to search the web to see what's around8. Some of these functions include statistics that relate to the skewness and kurtosis of a set of data; we're not aware that they are much reported in archaeological studies – they relate to the shape of a distribution which is usually best appreciated graphically.

8 e.g., http://www.statmethods.net/stats/descriptives.html

If individual statistics are of interest there is usually a function to produce them. For example range(x) and IQR(x) produce the range and interquartile range; min(x), max(x), mean(x), median(x), mad(x), sd(x) and var(x) are all pretty self-explanatory. For trimmed means, if you want one, use something like mean(x, trim = .025) which in this case would trim 0.025 of the observations from each end of the data set.

For the quartiles use quantile(x); the minimum, maximum and median will also be obtained. The quantile function can be used to determine values that divide the data into other than four parts. For example, quantile(x, probs = seq(0, 1, 0.1)) will do its best to identify values that divide the data set into 10 roughly equal-sized groups – it's not very sensible to try this with a sample size of 35! The seq(0, 1, 0.1) function generates values between 0 and 1 at intervals of 0.1 and, in effect, the quantile function then finds values that separate out the lower 10% (i.e. .1), 20% (.2) of the data and so on.

There are nine different ways in which the quartiles can be produced in the quantile function. The choice does actually make some difference here, though nothing to get excited about. If it does make much of a difference it will probably be because a small data set is being used and the quartiles are fairly meaningless.
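To see how much difference the choice makes, the type argument to quantile selects among the nine definitions; a minimal sketch:

quantile(x)            # the default (type = 7)
quantile(x, type = 6)  # an alternative definition; compare the quartiles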
Chapter 4
Histograms

4.1 Introduction

Histograms are one of the most familiar methods for presenting continuous data. For illustration unpublished data from Cool (1983) will be used. They are the lengths (mm) of 90 copper alloy hairpins from southern Britain, 55 classified as early and 35 as late on archaeological grounds (see Cool, 1990, for a review of the use of such hairpins). The data are given in Table 4.1.

Table 4.1: Lengths (mm) of Romano-British copper alloy hairpins from southern Britain (Cool, 1990). The upper set is early and the lower set late.

54 56 74 84 85 85 87 88 89 90
90 92 92 92 92 93 93 93 93 93
94 94 94 95 95 95 96 96 97 97
97 98 98 100 100 100 100 101 102 103
104 104 104 104 105 107 108 108 111 115
115 116 123 128 134

51 52 54 56 57 58 60 60 61 62
62 63 63 63 65 65 66 67 68 68
70 70 70 70 71 74 75 77 78 78
80 80 82 82 87

The data are continuous since, in principle, any length within realistic limits is possible; in practice data are always rounded, in this case to the nearest millimetre. A data frame, pins, is created consisting of two columns, Length and Date, the latter being coded as either Early or Late. The first and last few rows as entered into R are shown in Table 4.2. A variable for length only can be created using either Length <- pins$Length or Length <- pins[,1] where $name will extract the variable with the name given and [,k] will extract the kth column; similarly Date <- pins$Date can be used to create a Date variable.

Figures 4.1 and 4.2 show examples of Roman artefacts of the kind that are subject to the analyses in this chapter. The hairpin data are also used in Chapter 7.

Figure 4.1: A selection of Romano-British hair pins in copper alloy, bone and jet from (left to right) Worcester, Kenchester, Leicester, Winchester and York. (Source: authors).

Figure 4.2: A pillar moulded bowl from Thonon-les-Bains, France. (Source: J. Paul Getty Museum, Los Angeles).

Table 4.2: The Romano-British hairpin data as entered into R

Length  Date
54      Early
56      Early
74      Early
...     ...
82      Late
82      Late
87      Late

4.2 Histograms in base R

The base R histogram is produced using the hist function: hist(Length). It is shown in Figure 4.3a.

Figure 4.3: Histograms for the Romano-British hairpin lengths. (a) The R default using hist; (b) the same data with modified bin-widths, labels and colour.

Histograms are constructed by dividing an interval that covers the range of the data into classes, otherwise known as bins. The lower boundary of the histogram is the origin (or anchor point). The number of data values that fall within each bin is counted and the resultant frequencies represented in the histogram as contiguous bars. The defining feature of a histogram is that the area of the bar is proportional to the frequency but if, as is invariably the case in statistical software packages, bin-widths are the same then the height of the bar is also proportional to the frequency.

A problem with histograms is that the appearance, and hence interpretation, can be affected in a non-trivial way by the choice of origin and bin-width (Whallon 1987). Software defaults are not always satisfactory; to our eye there are often too few bins so the data are oversmoothed. It is straightforward to deal with this in R; the possibilities can be explored using the breaks argument of hist.
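For quick exploration it is enough to vary breaks and compare; a minimal sketch (the value is treated as a suggestion – R picks 'pretty' boundaries, as explained below):

hist(Length, breaks = 10)   # fewer, wider bins
hist(Length, breaks = 30)   # more, narrower bins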
Figure 4.3b was produced using the following code

hist(Length, breaks = 20, main = " ", col = "skyblue",
     xlab = "Romano-British hairpin lengths (mm)",
     cex.main = 1.4, cex.lab = 1.3, cex.axis = 1.2)

where 20 bins are specified. You won't necessarily get exactly this number because R will determine sensible boundaries between bins (i.e. nicely rounded numbers) that may result in a slightly different number of bins. Other uses of breaks are illustrated later. The opportunity is also taken to show how labels can be changed and bars coloured. Using main = " " eliminates the original title, but add text between the quotation marks if it's needed. Colour can be specified, via the col argument, in various ways, most simply by providing a name as shown. Type colours() or colors() to see what's available (Section 2.3.5). The cex (character expansion) arguments control the size of the title, labels and axis text. What's appropriate will depend on the scale on which a graph is intended to be viewed.

Figure 4.4: Histograms for the Romano-British hairpin lengths (a) showing the effect of increasing the number of bins, (b) on a probability rather than frequency scale with a superimposed kernel density estimate.

Figure 4.4 shows some variations on Figure 4.3. The effect of increasing the number of bins specified to 30 is illustrated in Figure 4.4a. This perhaps makes the existence of two different size classes even more apparent than in Figure 4.3b. It will be seen later that these classes can be related to date with the later hairpins tending to be smaller. The colour used was grey80.

Figure 4.4b was produced using the same code as Figure 4.3b with some additions. The default is to use a frequency scale so that the total area of the histogram will be equivalent to the total frequency. It is sometimes useful to rescale the histogram so that the total area is 1. Add the argument freq = FALSE to the previous code to do this. Technically the density histogram so obtained is a non-parametric density estimate of the distribution of the data. The colour of the borders of the bars is changed to that of the bars using the argument border = "skyblue". Superimposed on the plots is a kernel density estimate (KDE) which can be thought of as a smoothed form of density histogram. These will be discussed in Chapter 6; for now it is sufficient to note that they have presentational advantages compared to histograms. The KDE was produced from the following code, placed after the hist command.

lines(density(Length, bw = 4), lwd = 2, col = "red")

where the lwd argument controls the line width.
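Assembling those fragments into one chunk, something like the following should reproduce Figure 4.4b:

hist(Length, breaks = 20, freq = FALSE, col = "skyblue",
     border = "skyblue", main = " ",
     xlab = "Romano-British hairpin lengths (mm)",
     cex.lab = 1.3, cex.axis = 1.2)
lines(density(Length, bw = 4), lwd = 2, col = "red")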
4.3 Histograms using ggplot2

Use library(ggplot2) to load the package and library(grid) to control aspects of later graphs. The data will be used in the form given in Table 4.2 and will be called pins. A quick way of getting a histogram is

qplot(Length, data = pins, bins = 20)

which is equivalent to

ggplot(data = pins, aes(x = Length)) + geom_histogram(bins = 20)

Figure 4.5a shows the output from these.

Figure 4.5: Histograms for the Romano-British hairpin lengths using ggplot2. (a) The default output, except that 20 rather than 30 bins are used; (b) a customized version.

Using qplot is simpler but ggplot is more extensible and it is this that will be exploited here. Figure 4.5a is not especially satisfactory (the default for bins is 30). To add labels using qplot the arguments main and xlab can be used exactly as in previous examples. If ggplot is used customization of the histogram is more complex. Figure 4.5b shows a customized histogram using the following code.

fig5.b <- function() {
library(ggplot2); library(grid)
ggplot(data = pins, aes(x = Length)) +
  geom_histogram(bins = 20, fill = "white", colour = "black") +
  # remove minor grid lines
  theme(panel.grid.minor = element_blank()) +
  # add labels, title etc. and control their appearance
  xlab("lengths (mm)") +
  ggtitle("Romano-British hairpin lengths") +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 14)) +
  theme(plot.title = element_text(size = 16, face = "bold"))
}
fig5.b()

The code needed to produce the graph has been embedded within a user-defined function (Chapter 5). We note that it is usually more efficient to code all but the smallest examples in this way. In principle every element of the graph can be separately controlled using the theme function. Everything after the first block of four lines is 'presentational'. It can make for rather lengthy code but once you settle on a preferred 'style' it's easy enough to cut and paste from one analysis to the next. With appropriate changes of labels the 'style' above will be used for most further ggplot2 graphics without always explicitly giving it and will be indicated by an ellipsis (. . .). New presentational features will be introduced as they arise.

4.4 Histograms from count data

Construction of the foregoing histograms presumes access to the raw data. The software produces the counts within bins and plots the histogram. On occasion you may only have access to the counts and the boundaries or mid-points of the bins. Thus the histogram in Figure 4.3b was based on the following counts

4 6 8 8 4 5 5 6 15 11 8 3 3 1 1 1 1

and intervals defined by

50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 135

giving rise to mid-points

52.5 57.5 62.5 67.5 72.5 77.5 82.5 87.5 92.5 97.5 102.5 107.5 112.5 117.5 122.5 127.5 132.5.

To use these you need to create variables within R thus

Counts <- c(4, 6, 8, 8, 4, 5, 5, 6, 15, 11, 8, 3, 3, 1, 1, 1, 1)

with Breaks and Midpoints similarly constructed. One way to construct a histogram from these data in base R is

Length <- rep(Midpoints, Counts)

then use the code needed to produce Figure 4.3b except that breaks = 20 is replaced with breaks = Breaks. The replication function, rep, takes each midpoint in turn and replicates it according to the count in the associated interval. The breaks = Breaks argument in the hist function then imposes the interval boundaries that would be obtained had the raw data been available.
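Assembled into a single runnable chunk (the seq call and the midpoint arithmetic are our shorthand for the 'similarly constructed' variables):

Counts <- c(4, 6, 8, 8, 4, 5, 5, 6, 15, 11, 8, 3, 3, 1, 1, 1, 1)
Breaks <- seq(50, 135, 5)                   # the interval boundaries listed above
Midpoints <- Breaks[-length(Breaks)] + 2.5  # 52.5, 57.5, ..., 132.5
Length <- rep(Midpoints, Counts)            # pseudo-raw data, one value per count
hist(Length, breaks = Breaks, col = "skyblue", main = " ",
     xlab = "Romano-British hairpin lengths (mm)")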
4.5 Histograms with unequal bin-widths

Occasions can arise, rarely, where it may be desirable to use unequal bin-widths. If you have the raw data define a variable Breaks with the interval boundaries you want, then use breaks = Breaks in the hist function. A more detailed example using data that have already been grouped into bins of unequal width now follows. This is based on Baxter (2015b: 52–54) using data originally discussed in Cool and Price (1995: 15–19).

Table 4.3: Chronology of Roman glass pillar-moulded bowls found during excavations at Colchester, 1971-1985. The data are ordinal, but the periods are of different lengths. See Cool and Price (1995: 15–19) and the text for discussion.

Period   Date      Width   Midpoint   Percentage
I        43-50       8        46         11
II       51-60      10        55         22
III      61-70      10        65          9
IV       71-80      10        75          8
V        81-90      10        85          6
VI       91-100     10        95          5
VII      101-125    25       112          9
VIII     126-150    25       137          5
IX       151-175    25       162          4
X        176-200    25       187          3
XI       201-225    25       212          2
XII      226-250    25       237          2
XIII     251-275    25       262          2
XIV      276-300    25       287          3
XV       301-325    25       312          2
XVI      326-350    25       337          1
XVII     351-375    25       362          1
XVIII    376-400    25       387          1

The data in Table 4.3 relate to Roman pillar-moulded bowls found in excavations at Colchester, UK. Eighteen periods with associated date ranges were identified. If the date-span is ignored the data can be represented as a barplot. This is shown in Figure 4.6a. The data are treated as categorical and ordinal since their chronological sequence is known. This is quite common with chronological data; the situation here where the date-span can be specified is seen less often.

Figure 4.6: Different ways of representing the data from Table 4.3. (a) The barplot respects the ordinal nature of the data, but not the fact that periods are of different spans. (b) The histogram to the right does respect this.

We can do better because information on the date-span of the periods, in terms of actual dates and the width of the spans, is available. The earlier periods are more tightly defined than later ones. Pillar-moulded bowls are an early form and occur predominantly in the earlier periods. Knowing the dates and spans means that the data can be treated as continuous and represented in the form of a histogram. The data are 'pre-binned', and the midpoints and intervals are known, so the methods discussed in the previous section can be used to produce a histogram (Figure 4.6b). The barplot was produced using the barplot function (Section 10.2).

An alternative approach to producing the histogram that facilitates the labelling of bars by period is described in Chapter 4 of Baxter (2015b). This exploits the barplot function. Read the data of Table 4.3 into a data frame pillar then use

Percentage <- pillar$Percentage; Period <- pillar$Period
Width <- pillar$Width
barplot(Percentage/Width, space = 0, width = Width, names.arg = Period,
        col = "skyblue", xlab = "Period", axes = F, cex.names = 0.5)

The zero spacing between bars, space = 0, ensures bars are contiguous, and widths of bars corresponding to the date-spans are specified with width = Width. The Percentage variable is rescaled by dividing by Width so that the area of a bar corresponds to the percentage. The argument names.arg = Period controls the labelling and cex.names = 0.5 the size of the labels. Because of the unequal widths a sensible scale for the vertical axis is not available.

The histogram makes the quite sharp and steady decline in usage over time readily apparent. The barplot does not, because it was constructed without use of the spans, which are longer for the later periods. There is the suggestion in the barplot of a secondary peak in Period VII that exists simply because the span is longer than in earlier periods and, for the same reason, the decline in usage over time is less apparent.

4.6 Using histograms for comparison

It's possible to use histograms for comparative purposes though we would not usually recommend this. One possibility is to 'stack' histograms.
This is unwieldy for more than two or three histograms and often impractical beyond that. Albarella et al. (2006), for example, needs 10 pages, most consisting of five histograms, to compare (mostly) pig lower-tooth measurements from Italian prehistoric contexts. Figure 4.7 shows an example comparing lengths of the early and late Romano-British hairpins using ggplot2.

Figure 4.7: Early and late hairpin lengths compared using stacked histograms.

We use Length as originally defined in Section 4.1 (i.e. individual values). The code that follows creates the plot object fig4.7.

fig4.7 <- ggplot(data = pins, aes(x = Length)) +
  geom_histogram(bins = 20, fill = "white", colour = "black") +
  facet_grid(Date ~ .) + ...

You then need to type fig4.7 to print the graph (Chapter 5). The same presentational arguments as in the code associated with Figure 4.5 were used.

In base R the simplest thing is to generate the histograms separately and format them when typesetting the document. Alternatively, within R, sandwich the commands that generate them between par(mfrow = c(2, 1)) and par(mfrow = c(1, 1)). The first of these generates a 2 × 1 grid for the plots, the second returns to the default state. If you do it this way you need to ensure that the range of the histogram is the same in both plots. This can be done using the xlim argument in the commands generating the histograms (e.g., xlim = c(40, 140)). For an example and R code see Chapter 3 of Baxter (2015b). We are not illustrating this here because ggplot2 is more convenient.

The other way to compare histograms is to superimpose them. Figure 4.8 shows the results using both base R and ggplot2.

Figure 4.8: The early and late Romano-British hairpin lengths compared using superimposed histograms with (a) base R, (b) ggplot2.

We shall explain the code in detail as it is the first occasion we have had to use plot legends. The base R code is

Length <- pins$Length; Date <- pins$Date
Early <- Length[Date == "Early"]
Late <- Length[Date == "Late"]
Breaks <- seq(50, 140, 5)
hist(Early, breaks = Breaks, ylim = c(0, 15), col = "#FF000060",
     main = " ", xlab = "Romano-British hairpin lengths (mm)",
     cex.lab = 1.3, cex.axis = 1.2)
hist(Late, breaks = Breaks, col = "#0000FF60", add = T)
legend("topright", c("Early", "Late"), fill = c("#FF000060", "#0000FF60"),
       bty = "n", cex = 1.5)

where, after extracting Length and Date, the first two lines create variables for the early and late hairpins. The next line creates the break points to be used for both histograms using the seq function; this defines a range of 50 mm to 140 mm with breaks at intervals of 5 mm. The main idea is straightforward; create two histograms and superimpose one on the other using add = T in the second hist command. The col arguments use hexadecimal code, the last two numbers of which control the degree of transparency. The ylim argument specifies the y-axis range and needs to be chosen to accommodate both histograms.

The first three legend arguments specify the legend position, labels and fill colour. The default is to place the legend in a box; we usually prefer to omit this using bty = "n". The cex argument controls the text size of the legend. The R help on legend lists a lot of options and it takes getting used to. Looking at lots of examples is probably the best way to go. One of the 'selling points' of ggplot2 is that it takes the pain out of legend construction.
To judge from online traffic not everyone would agree; you can modify the defaults but not always in an obvious way. The code for ggplot2 in Figure 4.8b provides some idea of what is involved. We have placed the code in a function fig4.8b which produces the graph when fig4.8b() is typed (see Chapter 5).

fig4.8b <- function() {
library(ggplot2); library(grid)
p <- ggplot(data = pins, aes(x = Length, fill = Date)) +
  geom_histogram(binwidth = 5, boundary = 50, position = "identity",
                 alpha = 0.6) +
  scale_fill_manual(values = c("red", "blue")) +
  xlab("Romano-British hairpin lengths (mm)") +
  theme(panel.grid.minor = element_blank()) +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 14)) +
  theme(plot.title = element_text(size = 16, face = "bold")) +
  # control position and size of legend and text
  theme(legend.position = "top", legend.text = element_text(size = 14),
        legend.title = element_text(size = 16),
        legend.key.height = unit(.8, "cm"),
        legend.key.width = unit(.8, "cm"))
p
}
fig4.8b()

In addition to ggplot2 the grid package is needed for aspects of the construction. In the initial ggplot2 command aes(x = Length, fill = Date) specifies that the Length variable from the pins data frame is to be used for analysis and split into subsets determined by the value of Date. What happens in this example is that the two values for Date, Early and Late, are mapped to different fill colours for the two histograms subsequently produced.

In the geom_histogram function the binwidth and boundary arguments control the bin-width and origin of the histogram. They are chosen here to be the same as those used in the base R plot. The alpha argument controls the transparency of the colours used. Everything that follows the geom_histogram command is presentational. The code to create the plot is held in p, which needs to be typed at the end to print the plot once fig4.8b is executed.

The use of colour was discussed in Section 2.3.5; Chapter 12 of Chang (2012) provides examples that might be used as 'templates'. To our eyes the default colours in ggplot2 sometimes seem 'washed out'. They can be replaced with your own choice using the scale_fill_manual function as shown.

When the data are subsetted, as here, ggplot will automatically produce a legend. How you modify the appearance of a legend is not intuitively obvious; Chapter 10 of Chang (2012) provides practical guidance. The default is to place the legend to the right; we often prefer to have it at the top and legend.position = "top" does this. Replace "top" with "none" if you don't want a legend. The next two components of the code control the text and title size of the legend. The final two components control the size of the legend key – the boxes containing the fill colours in this instance. The fill is an example of an aesthetic (aes). Other aesthetics we shall meet are colour, linetype, shape (of data points) and size.

As with legend in base R all this takes time to get used to though, once you have painstakingly constructed a few plots, presentational code that is not specific to a particular analysis can be 'recycled'. Code can be built up incrementally as

p <- ggplot(data = pins, aes(x = Length, fill = Date))
p <- p + geom_histogram(binwidth = 5, boundary = 50,
                        position = "identity", alpha = 0.6)
p <- p + scale_fill_manual(values = c("red", "blue"))
p <- p + ...
p

where the ellipsis (. . .) indicates remaining presentational code; it is often convenient to do this. The plots work well enough here because the two groups are well-separated.
With more than two groups better methods are available (Chapter 6).

4.7 Frequency polygons

In Chapter 6 kernel density estimates (KDEs) are discussed, including their use as an alternative to the histogram. A difference that needs to be appreciated is that the histogram is a direct representation of the (grouped) raw data, whereas a KDE is an estimate of the distribution from which the data are presumed to be a sample. Other things being equal, we think the KDE has aesthetic advantages, particularly when it comes to comparing the distribution of several sets of data.

For this purpose, and if one wishes to retain the 'directness' of the histogram representation, another alternative is to use frequency polygons. Put very simply, these can be constructed by connecting the mid-points of the upper parts of the bars from which a histogram is constructed, dispensing with the bars entirely. This makes the comparison of more than one set of data more straightforward. If the data are pre-binned a frequency polygon can be constructed directly, without the need for the 'tricks' required to represent such data in the form of a histogram (Section 4.4). As with a histogram the precise appearance will depend on the choice of bin width, which can be controlled.

It is most convenient to use ggplot2, exploiting the geom_freqpoly function, and this is all that is attempted here. Figure 4.9a shows a frequency polygon for all the hairpin lengths in pins; Figure 4.9b contrasts the frequency polygons for the early and late hairpins. For consistency with earlier histograms a bin width of 5 mm is used.

Figure 4.9: Frequency polygons for Romano-British hairpin lengths (a) using all the pins, (b) distinguishing between early and late pins.

The code used for Figure 4.9a was

ggplot(pins, aes(x = Length)) + geom_freqpoly(binwidth = 5) + ...

and for Figure 4.9b

ggplot(pins, aes(x = Length, colour = Date)) +
  geom_freqpoly(binwidth = 5, size = 1.2) + ...

omitting presentational arguments. The advantage of the frequency polygons compared to the analogous histograms in Figure 4.8 may not be especially obvious here; there are only two groups which are well-separated. With more groups and less clear separation frequency polygons or KDEs are to be preferred.

Chapter 5
User-defined functions

5.1 User-defined functions

5.1.1 Introduction

In Section 2.3.1 it was promised that user-defined functions would be introduced in context. This is it: having introduced some code in the previous chapter, including an example of a user-defined function in connection with Figure 4.8b, we elaborate on their construction here.

The basic idea is straightforward. The structure of a function is

name <- function(arguments) {code}

where nothing immediately happens until typing name(arguments) executes the function. In this text code is constructed so that the output is a graph. In general it is the use of arguments which endows functions with some of their power, since they allow the same kind of analysis to be undertaken on different data sets without rewriting the code. Although we indicate how this can be done, in practice for much of the code in the text we have left the arguments blank.
Chapter 5

User-defined functions

5.1 User-defined functions

5.1.1 Introduction

In Section 2.3.1 it was promised that user-defined functions would be introduced in context. This is it: having first introduced some code in the previous chapter, including an example of a user-defined function in connection with Figure 4.8b, we elaborate on their construction here.

The basic idea is straightforward. The structure of a function is

name <- function(arguments) {code}

where nothing immediately happens until typing name(arguments) executes the function. In this text code is constructed so that the output is a graph. In general it is the use of arguments which endows functions with some of their power, since they allow the same kind of analysis to be undertaken on different data sets without rewriting the code. Although we indicate how this can be done, in practice for much of the code in the text we have left the arguments blank. This is partly because we have treated analyses as ‘once-off’; it would be possible to generalise the code in some instances, but at the expense of complicating matters, and we have largely chosen not to do this.

There are still good reasons for embedding code in a function, even if it is ‘once-off’. Readers new to, or fairly unfamiliar with, R who have taken the trouble to work through the previous examples will have realised that it is easy to make typing errors, tedious to correct them, and that the exercise has to be repeated every time you want to construct a histogram. This is particularly irksome if the code is at all lengthy. If the code is embedded within a function, as explained in the next section, all that’s needed is to locate and correct errors; the rest of the code does not need re-typing.

5.1.2 Writing functions – a very basic introduction

If you type ?mean when in R you will see that the function mean is defined as

mean(x, trim = 0, na.rm = FALSE, ...)

where trim is the amount of trimming to apply if you want a trimmed mean (Section 3.3).¹ The arguments to the function are x and trim; it may help to think of these as two types of argument, and this applies to most other functions that use arguments. The first argument, x, can be thought of as a ‘dummy’ argument with no default, so it has to be replaced with some data when the function is called; trim has a default value so can be ignored in the call to the function unless you want to vary it. Thus, given a variable weight, mean(weight) will calculate the (arithmetic) mean and mean(weight, trim = .025) omits the 5% most extreme cases, 2.5% at each end.

Suppose you want to write your own function, exploiting mean to calculate a mean. You wouldn’t actually do this but we shall present several approaches that anticipate what occurs later in the text. Begin with

my.mean <- function() {}
my.mean <- edit(my.mean)

which will bring up an edit screen. What you then do is sandwich the necessary code between the curly brackets. Arguments for the function, if you need them, are placed within the rounded brackets.

If there are errors in the code, which is likely if it is at all lengthy, R will report this. It will also fail to save what you’ve done, so you need to take precautions to avoid having to repeat things. There is more than one approach to this; our usual practice is to copy the code before exiting and, if things go wrong, go back into edit mode, paste the text back in, and correct it. The error messages you get from R are usually, but not always, helpful in this respect. The advantage of putting code within a function is that you only need to locate and correct errors, you don’t need to re-type anything else.

We now illustrate some possible approaches. Suppose you know exactly what you want, the mean of weight with trim = .025. Create the following function

my.mean <- function() {
  m <- mean(weight, trim = .025)
  m
}
my.mean()

which will return the mean once you execute the function using my.mean() (there is more than one way of returning results). Functions like this, without arguments, are useful for ‘once-off’ analyses where the code is at all lengthy and plenty of examples follow in later chapters.

¹ For the record na.rm = FALSE says that missing values (NA) should be retained in the calculation; if there are any the result will be NA. To get a mean that ignores missing values replace FALSE with TRUE. This and the following ellipsis are ignored for present purposes.
We have sometimes explicitly presented these as functions and sometimes as code not embedded within a function. We recommend, however, that if you want to emulate anything that follows, you get into the habit of placing code within a function. See Section 1.6 for additional help on this.

The following is more general.

my.mean <- function(data, TRIM = 0) {
  m <- mean(data, trim = TRIM)
  m
}

In this instance, within the body of the function data is a ‘dummy’ argument that will be replaced by whatever variable is specified when the function is called. The argument trim will be replaced by whatever value we choose to give TRIM, but it has a default of 0. Thus my.mean(weight) will return the mean of weight, and my.mean(weight, TRIM = .025) will replace TRIM within the body of the function with 0.025 and produce a trimmed mean.

5.1.3 More realistic illustrations

Code for base R

Suppose you have a data set that you want a histogram of; we will use Length as the example. If a quick look at the data is all that is needed then, as already seen, something like

hist(Length, breaks = 20)

will do. Here the first argument Length is the data, and the second argument is the (approximate) number of bins needed. You can play around with the latter until a satisfactory histogram is obtained and can also change the variable for which a histogram is required. For something more suitable for publication you’d want to change some of the default arguments, such as

hist(Length, 20, col = "grey80", main = "", xlab = "Hairpin lengths (mm)")

which removes the default title main, adds a more informative label for the x-axis, and colours the bars of the histogram a light grey.

More flexibility is provided if a function, my.hist, is created to do the work. We proceed by listing some possibilities. Given

my.hist <- function(data, nbins = 20, COL = "skyblue", XLAB = "My text") {
  hist(data, nbins, col = COL, main = "", xlab = XLAB)
}

then

my.hist(Length, 20, COL = "grey80", XLAB = "Hairpin lengths (mm)")

will produce a histogram identical to that produced by the previous code. One advantage of doing things this way is flexibility. The arguments in the hist function, where not fixed, are replaced by those in the call to the function. It’s necessary to replace data with the name of the data set of interest; other arguments have defaults but can be varied as wished, making it easy to experiment with different numbers of bins and colour, for example.

The other advantage is generalisation. To anticipate later chapters, data on the weights of loomweights from excavations at Pompeii are provided. Suppose these are held in a variable Weight and that a histogram is needed. Then

my.hist(Weight, 30, COL = "red", XLAB = "Weight (g)")

will get you going. All the base R code in the text was embedded in functions, though this is not always shown explicitly. If you want to emulate the examples it’s recommended that you do the same.

The example given above is intentionally simple, to illustrate the main ideas. For more complicated code extra paraphernalia is usually desirable. A list of some of the things we do now follows – it should be emphasised that this is what we are accustomed to doing rather than ‘best practice’.

• It helps to annotate code if it’s at all complicated and you might need to return to it in the distant future (this is the voice of bitter experience speaking!).

• We normally start the code with graphics.off(), which has the effect of clearing the screen of any graphs already there.
• Then we load the packages needed, using the library() function, within your own function.

• If the aim of a function is to produce a single graph then forget the next bit. But you can construct a function to ‘multi-task’ and produce several graphs, which we often do, and you need to save them. Our approach, and there are others, is to precede each block of code that produces a graph with win.graph(). When you execute the function this will result in all the graphs appearing on the screen. The reason we do it this way is that the graphs can be saved in Postscript (eps) format – which is what we need for subsequent use, frequently incorporation into LaTeX documents. Other formats, including PDF, PNG, BMP, TIFF and JPEG, are also available. If you click on the graph a menu appears in the top left (on our system) which allows you to save it in the format of your choice.

Code for ggplot2

Here we draw on code given in Section 4.3, but embed it within functions. The following code chunks all have the same effect.

my.histGG <- function() {
  library(ggplot2); library(grid)
  ggplot(data = pins, aes(x = Length)) + geom_histogram(bins = 20)
}
my.histGG()

my.histGG <- function() {
  library(ggplot2); library(grid)
  p <- ggplot(data = pins, aes(x = Length)) + geom_histogram(bins = 20)
  p
}
my.histGG()

my.histGG <- function() {
  library(ggplot2); library(grid)
  p <- ggplot(data = pins, aes(x = Length))
  p <- p + geom_histogram(bins = 20)
  p
}
my.histGG()

Presentational arguments have been omitted; once they are added, as can be seen for the code given for Figure 4.5b, this can result in moderately wordy code, which is one reason for placing it in a function. Note that to get the histogram my.histGG() is typed. We emphasise this because the second author, who had no previous experience of user-defined R functions, when checking the code in the text and exiting from an edit session with code in either of the last two formats, thought that the final p would produce a graph. It doesn’t, because it is internal to the my.histGG function and only produces a plot when my.histGG() is invoked.

We’ve kept things simple above by using a specific example. It can be generalised by providing arguments that control the data set, aesthetics, number of bins, labels, titles and so on. This can become a bit messy if there are many aesthetics to specify. We do this if we want to use a function frequently with different data sets, but it is sometimes convenient, if usage is only occasional, to copy and edit the ‘source’ function according to the needs of the moment.

5.2 Summary

As noted, this is intended as a basic guide to writing your own functions in R. Decent introductory textbooks will cover the topic much more thoroughly, but tend to do this late in the text. This is because the early part of a text devotes far more time to the mechanics of R, data manipulation, and examples of a wide variety of statistical analyses, often involving fairly small chunks of code. That is, the treatment is more extensive and thorough, and less specialised, than anything attempted here. Readers are advised to acquire a decent introductory text on the subject – this is partly a matter of taste so we shall not recommend any particular text. We think, however, that for the purposes of this text an early acquaintance with user-defined functions is desirable. As stated already, it’s useful if only to facilitate emulation of code that occurs in later chapters, should you wish to do so. A skeleton pulling together the conventions listed above is sketched below.
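The sketch is ours and is intended only as a template; the data are invented and win.graph assumes a Windows system (on other systems a function such as dev.new can play the same role). Note the print wrapped around the ggplot2 code – inside a function a ggplot object is not printed automatically.

fig.template <- function() {
  graphics.off()                    # clear any graphs already on screen
  library(ggplot2); library(grid)   # load the packages needed
  x <- rnorm(100)                   # illustrative data; replace with your own
  win.graph()                       # window for the first graph (Windows-specific)
  hist(x, col = "grey80", main = "")
  win.graph()                       # window for the second graph
  print(ggplot(data.frame(x = x), aes(x = x)) + geom_histogram(bins = 20))
}
fig.template()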
Chapter 6

Kernel density estimates

6.1 Introduction

At their simplest kernel density estimates (KDEs) can be thought of as smoothed histograms. Though mathematically quite complex (Baxter 2003/2015a: 29–37) they are readily explained and easily constructed in R. An illustration has already been given in Figure 4.4. To illustrate their construction a small data set (3.7, 4.2, 6.5, 6.8, 7.0) is used. Symmetrical ‘bumps’ are placed at each data point and ‘added up’ to get the KDE. Figure 6.1 shows this.

Figure 6.1: The plot shows how a KDE is constructed; see the text for details.

The KDE is rescaled so that the area under the curve produced is 1 – that is, it is an estimate of the probability density of which the data are a sample. It is thus an alternative to a density histogram for displaying data. For those wanting a mathematical treatment see Silverman (1986) or Wand and Jones (1995).

The bumps are called kernels and also have the property that the area they define is 1 (i.e. they are also probability densities). Different types of kernel are available but the choice isn’t usually important. A normal (Gaussian) distribution as the kernel (the famous bell curve) is common and will be used. What is important is the spread of the kernel, which is measured by what is called the bandwidth or window-width. A large value will oversmooth and a small value will undersmooth the data, losing information in the first case and failing to achieve useful ‘simplification’ in the second case. Figure 6.2 illustrates. Rules are available for making a ‘good’ choice of bandwidth but experimenting with different values and making a subjective choice is perfectly acceptable. To our eyes default KDEs often seem oversmoothed; the default can usefully be adjusted downwards, stopping at the point where it begins to induce spurious features in the KDE, often manifest as unsightly bumps.

Figure 6.2: The plot illustrates the importance of bandwidth choice on KDE construction using (a) a larger and (b) a smaller bandwidth than in Figure 6.1.

Compared to a histogram a KDE does not depend on an origin; the need to choose a bandwidth is analogous to bin-width choice in a histogram. It is in the nature of KDE construction that the KDE will ‘spill-over’ into regions beyond the observed range of the data. If the data are conceived of as a sample from a population this is not problematic if the spill-over region is plausible. If the data are bounded – weight is bounded below by zero, for example – and the KDEs spill-over into an ‘impossible’ region this can give rise to unease. One response is not to worry about it; another is to truncate the KDE at a boundary, though this then does not produce a ‘true’ KDE. Producing a proper KDE that respects the boundary is tricky and beyond the scope of these notes.
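The small data set used above can be used to experiment with bandwidth choice; a minimal sketch (the adjust values are arbitrary and chosen only to exaggerate the effect):

x <- c(3.7, 4.2, 6.5, 6.8, 7.0)
plot(density(x), main = "")                 # default bandwidth
lines(density(x, adjust = 2), lty = 2)      # oversmoothed, cf. Figure 6.2a
lines(density(x, adjust = 0.5), lty = 3)    # undersmoothed, cf. Figure 6.2b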
6.2 Example – loomweights from Pompeii

The data in Table 6.1 are the weights and heights of complete loomweights found in excavations at Insula VI.1, Pompeii, Italy. Analyses of these data were initially reported in Baxter and Cool (2008) based on 95 loomweights. The subsequent discovery of a further 42 loomweights (in areas of the finds store where they probably shouldn’t have been located) led to an updated paper (Baxter et al. 2010a). The 2008 paper suggested that the weights of the loomweights had a clear bimodal distribution, a conclusion strengthened when using the larger sample in the second paper. The opportunity was taken in the 2010 paper to explore the possible implications of the bimodality in terms of the nature of the threads that might have been manufactured, drawing on experimental work published by Mårtensson et al. (2009) in the interval between the two papers. In 2010 and 2011 a few more loomweights emerged from the store and it is these, 143 in all, that are reported in Table 6.1. The site and some loomweights from it are illustrated in Figures 6.4 and 6.5 and additional analysis of the data is provided in Chapters 7 and 8.

Figure 6.3 shows the default KDEs produced in base R and ggplot2. The data frame, loomweights.all, that was used contains two columns corresponding to Weight and Height. For some analyses these variables were extracted from the data frame, with those names, in the manner described in Section 4.1.

Figure 6.3: Kernel density estimates of the weights of 143 loomweights from Table 6.1 using (a) base R, (b) ggplot2.

The code used was

plot(density(loomweights.all$Weight))

for Figure 6.3a and

ggplot(loomweights.all, aes(x = Weight)) + geom_density()

for Figure 6.3b.

Table 6.1: Weights (g) and heights (mm) of complete loomweights from the excavation of Insula VI.1, Pompeii, Italy. The site and loomweights from it are shown in Figures 6.4 and 6.5.

Weight Height Weight Height
96     69     256    100
106    78     257    100
107    57     258    89
110    74     260    87
111    62     260    98
112    71     266    100
113    69     267    98
117    55     269    92
118    64     269    89
119    56     273    94
133    58     276    102
133    60     277    95
133    70     280    87
134    67     281    104
136    64     283    107
136    58     284    61
137    65     284    91
138    63     286    107
138    74     288    72
139    64     288    110
140    70     290    116
142    74     290    101
143    79     291    107
143    72     296    99
146    61     296    114
152    68     297    103
158    69     298    97
159    83     298    98
159    72     300    98
160    59     301    100
163    64     303    98
163    86     304    91
164    81     304    98
166    57     306    95
166    71     306    97
166    72     310    88
166    61     313    118
167    73     314    97
170    66     316    70
171    69     318    99
173    70     321    112
175    72     322    100
178    87     324    100
180    82     327    104
180    88     327    101
182    68     328    112
187    84     337    95
189    88     340    118
191    60     342    107
193    97     343    87
198    91     343    100
199    79     348    104
201    88     349    111
202    81     350    110
202    95     351    91
202    87     355    94
204    69     356    122
207    87     371    103
211    89     373    112
214    87     374    101
216    100    383    108
222    93     388    99
226    90     391    118
228    96     420    116
230    99     420    128
233    99     449    120
235    78     467    105
239    100    484    116
240    84     564    114
242    88     618    119
247    97     737    127
255    103

Figure 6.4: A view of Insula VI.1, Pompeii, Vesuvius in the background. (Source: authors).

Figure 6.5: Loomweights from Insula VI.1, Pompeii, illustrating their range of sizes. That to the far left was probably used for votive purposes. (Source: authors).

Note that the base R plot lists the bandwidth used, while ggplot2 truncates the KDE at the limits of the data. Both plots have a rather bumpy right tail, the bumps being attributable to eight outlying points. The main interest lies in what is happening in the bulk of the data so it’s sensible to remove these. The data in Table 6.1 are ordered by weight. It can be seen that there is something of a gap between 391 and 420 with the data very dispersed thereafter, so the 8/143 values greater than 400 will be omitted. Define a new data frame as

loomweights <- loomweights.all[1:135, ]

with Weight defined as Weight <- loomweights$Weight and Height as Height <- loomweights$Height. The variables will be used in this form from henceforth.
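The row-number subsetting above relies on the data being ordered by weight; a sketch of an equivalent that does not:

# Equivalent subsetting that does not depend on row order
loomweights <- subset(loomweights.all, Weight <= 400)
Weight <- loomweights$Weight
Height <- loomweights$Height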
Figure 6.6 shows modified versions of the plots in Figure 6.3. The default bandwidth of 26.85 gives the dashed red line; multiply this by 0.6 to get a bandwidth of 16.11 which produces the solid blue line. The value of 16.11 is produced if the argument adjust = 0.6 is added to the density command and will be illustrated in the code that follows later.

Figure 6.6: Kernel density estimates of the weights of loomweights, omitting outliers, using (a) base R, (b) ggplot2. See the text for details.

The reduction in the bandwidth compared to the default is quite sharp; it was obtained after some experimentation. As the bandwidth is gradually reduced the KDE initially remains smooth, while increasingly emphasizing the bimodality. The KDE starts to look bumpy once adjustment values much lower than about 0.6 are used. The base R code is

kde_lw <- density(Weight, adjust = 0.6)
plot(kde_lw, xlab = "Weight", main = "", lwd = 3, col = "blue",
     cex.axis = 1.2, cex.lab = 1.3)
lines(density(Weight, adjust = 1), col = "red", lwd = 3, lty = 2)
legend("topright", c("26.85", "16.11"), col = c("red", "blue"),
       lwd = c(3, 3), lty = c(2, 1), title = "Bandwidth", bty = "n", cex = 1.5)

The first line generates the KDE using the density function without plotting it and illustrates the use of the adjust argument. If desired, the bandwidth can be extracted with kde_lw$bw. The plot function plots the first KDE and the lines function adds the second KDE to the plot. The lty = 2 argument specifies line type 2, a dashed line; lty = "dashed" is also permissible.

The ggplot2 analysis was constructed by separately generating two densities, with the default bandwidth (the dashed red line) and an adjustment factor of 0.6 (the blue line), and plotting them on the same graph. The way ggplot2 works deliberately discourages manually added legends, hence the absence of one. The code is

ggplot(loomweights, aes(x = Weight)) +
  geom_line(stat = "density", adjust = 0.6, colour = "blue", size = 1.3) +
  xlim(0, 450) +
  geom_line(stat = "density", adjust = 1.0, colour = "red",
            linetype = "dashed", size = 1.3) + ...

where the ellipsis (...) represents the addition of relevant presentational code similar to that illustrated in connection with Figure 4.8b.

It is sometimes useful to superimpose the KDE on the corresponding histogram as in Figure 6.7.

Figure 6.7: Kernel density estimates of the weights of loomweights, omitting outliers, superimposed on a histogram using (a) base R, (b) ggplot2. See the text for details.

The code for base R has already been discussed in connection with Figure 4.4b. The ggplot2 code is

ggplot(loomweights, aes(x = Weight, y = ..density..)) +
  geom_histogram(binwidth = 20, fill = "skyblue") +
  geom_density(adjust = 0.6, size = 1.1) + xlim(0, 450) + ...

Here xlim(0, 450) controls the range of the x-axis and is chosen to avoid truncation of the KDEs.

6.3 Comparing KDEs

For illustration the pins data from Chapter 4 is revisited. Figure 6.8 shows the base R and ggplot2 graphs. They show the difference between the early and late pins well. The ggplot2 histogram in Figure 4.8b shows the separation just as well; which plot you prefer is perhaps a matter of taste. The KDEs have, in our view, greater aesthetic appeal and are more suitable if more than two groups are to be compared. An example is provided in Section 7.3.
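The base R code below uses vectors Early and Late which have not been explicitly created in this chapter; if the lengths are held in the pins data frame they can be extracted with something like the following sketch (assuming, as before, columns Length and Date):

Early <- pins$Length[pins$Date == "Early"]
Late  <- pins$Length[pins$Date == "Late"]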
Figure 6.8: Kernel density estimates comparing lengths of early and late Romano-British hairpins using (a) base R, (b) ggplot2.

The following base R code was used.

# Base R
plot(density(Early), lwd = 3, main = "",
     xlab = "Romano-British hairpin lengths (mm)", col = "blue",
     xlim = c(40, 140), cex.main = 1.4, cex.lab = 1.3, cex.axis = 1.2)
lines(density(Late), lwd = 2, col = "red")
legend("topright", c("Early", "Late"), col = c("blue", "red"),
       lwd = c(3, 3), lty = c(1, 1), bty = "n", cex = 1.5)

For ggplot2 the following was used.

# ggplot2
ggplot(pins, aes(x = Length, colour = Date)) + geom_density(size = 1.3) +
  scale_colour_manual(values = c("blue", "red")) + ...

The scale_colour_manual function shows how the default colours can be modified. The colour = Date component of the aes argument of the ggplot function maps colours to the early and late dates and you may wish to change them.

6.4 Two-dimensional (2D) KDEs

A few years ago we’d probably have omitted a treatment of 2D KDEs, but their use in archaeology is increasingly common and implementation is now straightforward. Different forms of bivariate KDEs are illustrated in Figure 6.9 for the loomweight height and weight data using ggplot2. These show the two main modes in the data with heavier weights tending to be larger, as might be expected.

Figure 6.9: Different approaches to representing two-dimensional KDEs using ggplot2 for the weight and height of loomweights.

The idea is similar to one-dimensional KDEs. Each point in the plane defined by the two variables is associated with a bump in the form of a ‘hill’. The hill is a probability density function whose shape, in the most usual applications, is defined by separate bandwidths for the two variables. The hill may be elongated, and typically the axes are aligned with those of the plot. The use of a bivariate normal distribution for the kernels (hills) is common.

Baxter et al. (1997) introduced the use of 2D KDEs into the archaeological literature; the use of KDEs of any kind was still uncommon in this literature at the start of the 2000s (Baxter 2003: 36–37) but the situation has changed since then. One of the examples in Baxter et al. (1997) was an intra-site analysis of the spatial locations of artefacts at the Mask Site studied by Binford (1978). In their influential introductions to Geographical Information Systems (GIS) for archaeologists both Wheatley and Gillings (2002: 186–187) and Conolly and Lake (2006: 175–178) note the potential of 2D KDEs for spatial interpolation. Examples at the intra-site level involving the study of artefact distributions include Keeler (2007) at the Late Upper Palaeolithic site of Verberie (France); Aldeias et al. (2012) at the late Middle Palaeolithic site of Roc de Marsal (France); and Oron and Goren-Inbar (2014) at the Mousterian site of Quneitra in the Golan Heights. Sayer and Weinhold (2013) investigated spatial clustering among graves in a number of Anglo-Saxon cemeteries.
At larger spatial scales McMahon (2007) investigated the distribution of cultural resources in Colorado (USA); Schmidt et al. (2012) studied the settlement patterns in Iberia of late Middle Palaeolithic and early Upper Palaeolithic sites; Monna et al. (2013) compared the density of the spatial distribution of Breton and Norman Middle Bronze Age palstaves across Northern France; and De Reu et al. (2013) mapped the distribution pattern of Bronze Age barrows in the highly urbanized landscape of north-western Belgium. Most, though not all, of these use GIS.

Returning to the code for Figure 6.9 the following was used.

ggplot(loomweights, aes(x = Weight, y = Height)) + geom_point(size = 3) +
  #stat_density2d(aes(colour = "red"), size = 1.1, h = c(100, 20)) +    # a
  #stat_density2d(aes(fill = ..level..), geom = "polygon",
  #  h = c(100, 20)) +                                                  # b
  #stat_density2d(aes(fill = ..density..), geom = "raster",
  #  contour = FALSE, h = c(100, 20)) +                                 # c
  #stat_density2d(aes(alpha = ..density..), geom = "tile",
  #  contour = FALSE, h = c(100, 20)) +                                 # d
  xlim(50, 450) + ylim(40, 130) + ...

To get Figures 6.9a,b,c,d remove the relevant comment signs, #, before the appropriate stat_density2d command. Other than changing the contour colour to red in Figure 6.9a, and the h = c(100, 20) argument, defaults are used. You can, if you wish, omit the latter without harm, but it may be helpful to go into a little more detail as it shows how to control aspects of the appearance of the plots.

The numbers 100 and 20 are the bandwidths used for the KDE. The coordinates used to generate the plots are derived from the kde2d function from the MASS library associated with the book Modern Applied Statistics with S (Venables and Ripley, 2002). The bandwidths in the kde2d function can be controlled using h = c(a, b). The defaults can be found using bandwidth.nrd(Weight) and bandwidth.nrd(Height) which gives (rounded) values of (126, 29) in this instance. These have been changed to (100, 20) in the example to show how they can be modified if you wish – it doesn’t make too much difference in this instance.

The output from kde2d can be passed to base R functions to visualize the density. Thus Figure 6.10 uses the image and persp functions to visualize the output. Contour plots in base R are illustrated in Figure 6.11.

Figure 6.10: (a) An image and (b) a perspective plot representing two-dimensional KDEs using base R for the weight and height of loomweights.

The default colouring for the image plot has been used; it may be compared with Figure 6.9. Three-dimensional plots are not supported in ggplot2 so if you like this kind of thing something like persp or functions that do a similar job in other packages are needed. The code used was

library(MASS)
# Code for Figure 6.10a
dens <- kde2d(Weight, Height, h = c(120, 20), n = 100,
              lims = c(50, 450, 40, 130))
image(dens, xlab = "Weight", ylab = "Height")
# Code for Figure 6.10b
persp(dens, col = "green", theta = 30, phi = 25, ticktype = "detailed",
      nticks = 3, xlab = "Height", ylab = "Weight", zlab = "Density",
      cex.lab = 1.3, cex.axis = 1.2)

where the lims argument in kde2d controls the range of the axes, (50, 450) for the x-axis and (40, 130) for the y-axis. Producing the image plot is straightforward, the perspective plot less so and it’s best to have a look at the help facilities for detail.
The arguments theta and phi determine the position from which the ‘landscape’ is viewed and experimentation with these would usually be necessary.

It was implied at the start of this section that 2D density estimation might be regarded as a specialised topic, beyond what might be expected in an introductory text; the hope is that it has been demonstrated that the production of visually appealing and informative graphics is straightforward. There is, however, an issue in the construction of the plots illustrated that is neglected and worth mentioning. This is that the kernels used so far have been orientated with the axes of the graph. In much software, including R packages, this is the default and often the only option. If you look at Figure 6.9 you will see that the main orientation of the point scatter is along a south-west/north-east axis. The other ‘natural’ axis is north-west/south-east. The kernels, however, are orientated west-east and north-south so may not capture the natural patterning of the data as well as they might.

At the expense of some mathematical complication this can be dealt with. The kernel used is two-dimensional and can be displayed as a 2 × 2 table (matrix) with identical off-diagonal elements. The off-diagonal value controls the orientation of the kernel, but has so far been taken to be zero. A brief account of the mathematics is given in Baxter (2003: 32–33) and will not be repeated here.

It’s possible to implement the more general methodology using the ks package. This is illustrated in Figure 6.11 where the opportunity is taken to contrast the results with those using kde2d with the base R contour function.

Figure 6.11: Contour plots for the loomweight data based on two-dimensional KDEs using base R; (a) with kernels oriented to the axes, (b) with non-diagonal kernels.

Although both plots would probably be interpreted in the same way by most people there are noticeable differences. The contouring produced using the non-diagonal bandwidth matrix better respects the orientation of the data. The code used is given below, where dens is as defined in the code associated with Figure 6.10.

# Diagonal bandwidth matrix
contour(dens, drawlabels = F, ...)
points(loomweights, pch = 16)

# Non-diagonal bandwidth matrix
library(ks)
H <- Hpi(loomweights)
KH <- kde(loomweights, 2*H)
plot(KH, cont = c(seq(25, 95, 10)), drawlabels = F, ...)
points(loomweights, pch = 16)

The drawlabels = F suppresses the default labelling of the contour levels; if you want these omit this argument. The ellipsis (...) should be replaced with the arguments used to control colour, labels, text size and so on. For the plot produced using the ks package, with a non-diagonal bandwidth matrix, Hpi produces an estimate of the matrix; the kde function estimates the KDE using 2*H, a rescaled bandwidth matrix. The cont = c(seq(25, 95, 10)) specifies the contour levels, the seq function generating levels between 25% and 95% at intervals of 10%. Finally, in both plots, the points function adds the data points with pch = 16 specifying a solid circle as the plotting symbol.

Chapter 7

Boxplots and related graphics

7.1 Introduction

Boxplots are among the most commonly used methods for univariate data display and comparison.
The seminal text of John Tukey (1977), Exploratory Data Analysis, both advertised and helped popularise what are also known as box-and-whisker plots, though their origin is to be found in earlier work. Wickham and Stryjewski (2011) provide a useful account of the early history and subsequent developments. We suspect many readers will have some acquaintance with boxplots; violin plots (Hintze and Nelson 1998) or beanplots (Kampstra 2008), which ‘generalise’ the boxplot, may be less familiar and we cannot recall seeing applications of them in the archaeological literature.

Figure 7.1 illustrates default output using boxplot in base R, vioplot from the package of that name, and beanplot from the beanplot package. The Early data from the pins data set are used. Without writing your own code ggplot2 does not have an option for beanplots; the functions for producing boxplots or violin plots are intended for comparing two or more groups and need to be ‘tricked’ into providing plots for a single variable. This is not illustrated. Use boxplot(Early), vioplot(Early) and beanplot(Early) after installing and loading the necessary packages. Their construction and interpretation are discussed in the next section.

Figure 7.1: (a) A boxplot, (b) a violin plot, (c) a beanplot for the lengths of early Romano-British hairpins (Table 4.1). The default output using functions discussed in the text is shown.

7.2 Construction and interpretation

Descriptive statistics were discussed in Chapter 3. The boxplot is a graphical representation of what was there called a five-number summary. The lower and upper bounds of the box are the first and third quartiles, so the length of the box corresponds to the interquartile range (IQR) and contains the central 50% of the data; the width is arbitrary. The bar through the centre of the box is the median. The lines (or whiskers) extend from the limits of the box to, at most, 1.5 × IQR; points beyond this range are highlighted as potential outliers. If there are no such cases, the whiskers extend to the maximum and minimum values and define the range of the data. The correspondence with the five-number summary can be checked numerically, as sketched below.

The violin and beanplots show rotated and mirrored KDEs. It is usual, but not mandatory, to show a violin plot superimposed on a boxplot; in the beanplot the horizontal lines within the plot correspond to individual values, their length being proportional to the number. The solid black line that extends to the sides of the plot is the mean. For the violin and beanplots it is desirable to be able to control the bandwidth. The vioplot implementation of the former is limited in this and other respects, and only the ggplot2 version is illustrated in later examples.
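As noted above, the correspondence between the boxplot and the five-number summary can be checked numerically; a minimal sketch for the Early hairpin lengths using base R:

fivenum(Early)              # minimum, lower hinge, median, upper hinge, maximum
boxplot.stats(Early)$stats  # the five values actually used to draw the boxplot
boxplot.stats(Early)$out    # any points flagged as potential outliers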
7.3 Using plots for comparing groups

For this section a different version of the Pompeian loomweight data will be used. At the time Baxter et al. (2010a) was written detailed phasing of the site was not available. Subsequently this deficiency has been remedied and 118 of the loomweights (excluding outliers) come from dated contexts. These can be allocated to seven phases but the numbers of loomweights in some of these is small, so for present purposes a coarser four-phase grouping is used. Phase I covers the centuries BC; Phase II covers roughly the first third of the first century AD; Phase III is roughly the second third of the century to AD 62, when Pompeii was badly shaken by an earthquake; and Phase IV is the period from then to AD 79 when Pompeii was destroyed by the eruption of Mount Vesuvius. There is some overlap between Phases I and II and between Phases II and III. The numbers of loomweights in each phase are 19, 33, 21, 45.

In addition to Weight and Height the maximum and minimum lengths (mm) of the rectangular tops and bases of the loomweights (Tmax, Tmin, Bmax and Bmin) were measured. The variables occupy the first six columns of a data frame lwphased, with Phase in the seventh column. The first few rows of data are shown in Table 7.1; the full data set is used in Chapter 8.

Table 7.1: The first few lines of data for phased loomweights and six dimensional variables.

Weight Height Tmax Tmin Bmax Bmin Phase
96     69     39   24   35   25   IV
106    78     30   26   40   40   IV
107    57     38   21   46   17   II
110    74     23   23   33   30   II
...

If all that’s needed is a quick look at the data minimal plots to compare the weights across phase can be obtained using the following code.

boxplot(Weight ~ Phase, data = lwphased)                      # base R boxplot
beanplot(Weight ~ Phase, data = lwphased)                     # beanplot
ggplot(lwphased, aes(x = Phase, y = Weight)) + geom_boxplot() # ggplot2 boxplot
ggplot(lwphased, aes(x = Phase, y = Weight)) + geom_violin()  # ggplot2 violin plot

Figure 7.2 shows a base R boxplot and a beanplot using the beanplot package for comparing weights across phases.

Figure 7.2: Visualizing the weights of phased loomweights from Table 7.1 using (a) a base R boxplot and (b) a beanplot from the beanplot package.

It’s possible to embellish or control the plots in a variety of ways for which it is best to see the help facilities for the function or look at what is on offer online. Restraint has been exercised in Figure 7.2. Apart from adding to the code above the usual presentational arguments for boxplot and beanplot, the argument boxwex = .5 was added to the boxplot code and bw = 25 to the beanplot code. The first of these reduces the default width of the boxes by a half. The bw argument in beanplot controls the bandwidth of the KDEs; beanplot(Weight ~ Phase, data = lwphased)$bw extracts the default bandwidth of about 36 and this was reduced to 25 for the plot.

The figure suggests that the use of boxplots is inappropriate here. The hourglass shapes of the beanplots for Phases II and IV suggest that their weights have a bimodal distribution. This is not apparent from the boxplots which, for useful interpretation, require the data to be unimodal. Another way of presenting the data would be to use KDEs as illustrated in Section 6.3. This results in Figure 7.3.

Figure 7.3: Kernel density estimates of weights of phased loomweights.

Looking at this together with the beanplot shows that, other than Phase III, weights in all phases are spread across the full range. The data for Phases I and III are skewed towards low and medium/large weights respectively. There is a tendency for later loomweights to be larger but the bimodal distributions for Phases II and IV rule out a simple interpretation.
The patterns that are visible can be interpreted in terms of the nature of weaving practice over time in the Insula but the (necessarily) detailed discussion is beyond the scope of the present text.

The ‘business part’ of the code used was as follows.

fig7.3 <- function() {
  library(ggplot2); library(grid)
  p <- ggplot(lwphased, aes(x = Weight, colour = Phase)) +
    geom_density(size = 1.3, adjust = 0.8) + coord_fixed(40000) + ...
  p
}
fig7.3()

Determining the aspect ratio using coord_fixed requires a little experimentation. A typical value on the x-axis (e.g., 250) divided by the maximum on the y-axis (0.006) produced a value close to the 40000 used, but we varied the value about this to see what happened.

Figure 7.4 shows boxplots and violin plots obtained using ggplot2. The latter plots suggest that some of the boxplots are misleading.

Figure 7.4: Visualizing the weights of phased loomweights from Table 7.1 using ggplot2 for (a) boxplots and (b) violin plots.

Omitting most presentational code the following was used for Figure 7.4a.

fig7.4a <- function() {
  p <- ggplot(lwphased, aes(x = Phase, y = Weight)) +
    geom_boxplot(fill = "skyblue", width = 0.5) +
    theme(legend.position = "none") + ...
  p
}
fig7.4a()

Similarly, for Figure 7.4b use the following.

fig7.4b <- function() {
  p <- ggplot(lwphased, aes(x = Phase, y = Weight)) +
    geom_violin(fill = "magenta", adjust = .9) +
    theme(legend.position = "none") + ...
  p
}
fig7.4b()

These tell the same story as previous figures. The legend is suppressed using theme(legend.position = "none"). As with geom_density, the default in the violin plot is to truncate the KDEs at the limits of the range for each phase. This and other aspects of the appearance, and that of the boxplots, can be modified.

For illustration the Romano-British hairpins data will be used. Previous analyses (e.g., Figures 4.8 and 6.8) have shown that, within the early and late groups, the distribution of lengths is unimodal, so in contrast to the phased loomweights boxplots are a useful method of comparison. Figure 7.5 shows embellished boxplots obtained using both base R and ggplot2 and violin plots obtained with the latter package.

Figure 7.5: Boxplots and violin plots for Romano-British hairpin lengths; (a) base R boxplots, (b) ggplot2 boxplots, (c) ggplot2 violin plots.

The code for the base R boxplot is

boxplot(Length ~ Date, data = pins, notch = T, varwidth = TRUE,
        boxwex = 0.4, pch = 16, col = c("red", "skyblue"), ...)

The notch argument produces a notched boxplot with the useful interpretation that if notches do not overlap it is likely that the medians for the corresponding groups are significantly different. The argument varwidth = TRUE (the default is FALSE) produces boxes with widths proportional to the square-roots of the number of observations in the groups. The boxwex argument rescales the size of the boxes usually resulting, with relatively few groups, in a more pleasing appearance. The plotting character for outlying points is controlled by pch while col determines the colour of the fill for the boxes.
Just give one colour (e.g., col = "red") if you want them to be the same. Remaining presentational arguments, omitted in the code above, control labels and text size. The corresponding code for ggplot2 is

ggplot(pins, aes(x = Date, y = Length)) +
  geom_boxplot(aes(fill = Date), notch = TRUE, width = 0.4,
               varwidth = TRUE, outlier.shape = 16, outlier.size = 2) +
  scale_fill_manual(values = c("red", "skyblue")) +
  theme(legend.position = "none") + ...

where the arguments to geom_boxplot are mostly similar to those used with boxplot or are self-explanatory. The width argument is equivalent to boxwex. If a single colour is wanted for box fills delete aes(fill = Date) in the geom_boxplot, add an argument with a colour name (e.g., fill = "blue") outside aes, and omit scale_fill_manual.

The violin plot was produced using the following code.

ggplot(pins, aes(x = Date, y = Length)) +
  geom_violin(aes(fill = Date), adjust = 0.95, trim = FALSE,
              scale = "count", draw_quantiles = c(0.25, 0.5, 0.75)) +
  ylim(40, 140) + scale_fill_manual(values = c("red", "skyblue")) +
  theme(legend.position = "none") + ...

where the trim = FALSE argument avoids truncation at the limits of the range; the limits on the y-axis, specified in ylim, need to extend somewhat beyond the range of the data to ensure this. If scale = "count" is used areas are scaled proportionally to the number of observations. The draw_quantiles argument draws lines at the quantiles listed, the first and third quartiles and median in the example.

It should be emphasised that these examples are designed to illustrate presentational options and are not intended as recommendations. The possibilities shown are also not exhaustive, for which see the documentation on the functions.

7.4 Discussion

As noted at the start of this chapter, boxplots in the form they are now most commonly seen have been around since the 1970s. The earliest texts intended to introduce mathematical and statistical methods to archaeologists (Doran and Hodson 1975, Orton 1980, Shennan 1988, Baxter 1994, Drennan 1996) either do not mention them at all or do so only briefly. Shennan (1988: 45–46), for example, devotes just two paragraphs to them and does not expand on this treatment in the second edition of the text (Shennan 1997). Baxter (1994: 30–32) (reprinted as Baxter 2015a) has three pages on the subject, the graphical illustrations being a good example of how ‘primitive’ things were then compared to what is now possible. Drennan’s (1996: 39–44) treatment of what he calls ‘box-and-dot plots’ is slightly longer but rather idiosyncratic. It is, in fact, difficult to find published archaeological applications of boxplots that pre-date 1990, and then there was not exactly a flood, though usage has increased noticeably since then.

Boxplots are a simple tool for comparing dimensional data of artefacts from different contexts, for example, or other archaeological entities of interest. More recent applications of this kind include Hiscock (2003), Mills (2007), Jennings (2008), Morehart and Eisenberg (2010), Mackay (2011) and Scharra-Liška et al. (2016). Boxplots have also been used for comparing variables derived in morphological studies of animals and humans, or variables measured in the scientific study of the properties of materials (e.g., Nikita et al. 2012, Pollard et al. 2012). All the papers just cited use boxplots constructed and presented in a similar manner to Figure 7.2 (apart from unimportant software dependencies that affect their appearance).
The notched boxplot was introduced by McGill et al. (1978). The earliest archaeological applications we have located are from the early 1990s (Stafford, 1991, 1994) but very little else from that decade. Since then their use has increased (e.g., Wurz et al. 2003, Liebmann et al. 2005, Scarry and Reitz 2005, Plummer et al. 2008, Seetah et al. 2014, Duval et al. 2015) but to nothing like the same extent as the more usual form. A notched boxplot makes a brief appearance in the second edition of Drennan (2009), but it does not otherwise feature in the archaeological textbook literature.

Little concern seems to have been expressed about problems with boxplots if the data within groups are multimodal, and perhaps this is because this is rarely an issue. Before commenting further one additional issue with the standard boxplot may be mentioned. This is that, if the data set is large and the distribution is heavily skewed, far too many outliers may be suggested. The usual rule is that an outlier is flagged if it is more than 1.5 × IQR distant from the box. It can be shown that if the data are sampled from a normal distribution there is about a 0.007 probability that a case will be flagged as an outlier. That is, with 1000 observations you’d expect something like 7 outliers to be flagged (even though there may be nothing untoward about them).

Suppose now that you have a highly skewed distribution with a long tail to the right – such distributions occur naturally and not uncommonly.¹ In these circumstances a lot of cases may stray quite naturally into the region where they are classified as outliers even though they are not. This is illustrated in Figures 7.6a and 7.6b. The histogram in the first of these is generated from a random sample from a lognormal distribution; this is naturally skewed and, by construction, will have a long right-tail none of which, unless we are unlucky, will be outliers. The boxplot of Figure 7.6b suggests, however, quite a few outliers.

There is a straightforward and practical way to get round this problem, which is to produce an adjusted boxplot using the adjbox function from the robustbase package, which works exactly as boxplot apart from the adjustment for skewness. The mathematics behind the adjustment is quite complex (Hubert and Vandervieren 2008). Roughly what happens is that a (robust) measure of skewness is defined; the limits beyond which a point is declared an outlier are differentially modified (in a slightly ad hoc manner) by quantities that depend on this measure. In particular the whisker that stretches in the direction of the long tail is extended to suggest far fewer outliers. The effect can be seen in Figure 7.6c where there is a substantial reduction in the number of cases flagged.

¹ Weight in animal or human populations is one example; trace element concentrations in the materials from which artefacts are manufactured is another.
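Figure 7.6 can be reproduced in spirit with simulated data; the lognormal parameters below are invented for illustration and are not necessarily those used for the figure.

set.seed(1)                                 # for reproducibility
x <- rlnorm(300, meanlog = 3, sdlog = 0.5)  # skewed ‘random data’
hist(x, main = "")                          # cf. Figure 7.6a
boxplot(x)                                  # many spurious ‘outliers’, cf. Figure 7.6b
library(robustbase)
adjbox(x)                                   # far fewer flagged, cf. Figure 7.6c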
Figure 7.6: For 300 observations randomly generated from a skewed lognormal distribution (a) a histogram, (b) a boxplot obtained using boxplot, (c) an adjusted boxplot using adjbox from the robustbase package.

It would be nice to cite archaeological applications of violin plots and beanplots but we are not aware of anything. Apart from a possible dereliction of scholarship on our part a good question to ask is ‘Why?’. Wickham and Stryjewski (2011: 11) observe that boxplots ‘were created to provide a succinct distributional summary that could easily be created by hand’; 40 years on nobody would dream of doing this. They further comment to the effect that the dramatic increase in computer power since then has led to the development of variants of the boxplot that remain true to its original spirit ‘while supporting much richer display of the underlying distributions’. The violin plot and beanplot are among these variants; both are as compact as the boxplot and as easy to produce, with the advantage of displaying features of the data not possible with the usual form of boxplot.

Correspondence analysis (Chapter 12) is one of the most useful, and used, multivariate methods in archaeology for exploring large tables of data. Baxter (1994/2015a: 133–139) discusses its adoption in archaeology in some detail. In summary, the method was first applied to archaeological data in the mid-70s; was explored by French archaeologists in the late 1970s; ‘took off’ in the English-speaking world as a result of publications by Scandinavian scholars in the 1980s; began to be widely adopted in Britain in the early 1990s; but did not hit US archaeology till the later 1990s. Some sub-disciplines, such as Roman archaeology, were even later adopters (Cool and Baxter 1999). The point is that archaeologists can be rather tardy in picking up on patently useful and readily applied methods that emerge in the statistical literature. To some extent the history of the use of boxplots in archaeology mirrors this pattern; violin plots and beanplots may be going the same way.

This describes rather than explains the situation. We won’t speculate too much about the latter. Unfamiliarity with the statistical literature and/or a reluctance to abandon familiar software and experiment are almost certainly contributory factors. The beanplot is a comparatively recent development, as is the ease with which violin plots can now be used. The message is that archaeologists who find boxplots useful for displaying their data might find it worthwhile to explore these other alternatives.

Chapter 8

Scatterplots

8.1 Introduction

A common type of graph, with which everyone will be familiar, is what we shall call a (bivariate) scatterplot. The values of two variables are plotted against each other to see how, if at all, they are related. An example has already been presented, without comment, in Figure 6.9, where the weight and height of loomweights were plotted against each other with contours derived from a two-dimensional kernel density estimate superimposed.

The loomweight data, omitting outliers, as held in the data frame loomweights will be used for illustration here (see Section 6.2 for the detail about the data). Remember that this can be stored in different forms, either as a data frame loomweights with two columns Weight and Height, or as two separate variables, Weight and Height. If the latter then plot(Weight, Height) produces the default base R scatterplot.
If stored as a data frame then, apart from the labelling, each of the first four lines of the following code will produce the same plot, shown in Figure 8.1a; the ggplot2 code given is that used for Figure 8.1b.

plot(loomweights$Weight, loomweights$Height)   # a
plot(loomweights[,1], loomweights[,2])         # b
plot(Height ~ Weight, data = loomweights)      # c
plot(loomweights)                              # d

ggplot(loomweights, aes(x = Weight, y = Height)) + geom_point()

Of the base R options when data are in a data frame (d) only works if it contains just two columns in the correct order for plotting. Options (a) to (c) are different ways of picking out variables to plot from a data frame containing several variables. Options (a) and (b) are of the form plot(x, y), the first named being that on the x-axis. In (a) variables are picked out by name using $name, in (b) by using column number, [,number]. Option (c), structured as plot(y ~ x, data = dataframe), was used for Figure 8.1a, where dataframe is the set of data investigated.

Figure 8.1: Default scatterplots for loomweight heights and weights using (a) base R and (b) ggplot2.

8.2 Presentational aspects

8.2.1 Plotting symbols, axis labels, axis limits

In the previous chapters presentational arguments have sometimes been implied with an ellipsis (...). In this section enhanced forms of the graphs in Figure 8.1 will be developed, as in Figure 8.2; the full code is given for convenience of reference.

Figure 8.2: Enhanced scatterplots for loomweight heights and weights using (a) base R and (b) ggplot2.

The base R code to follow should be reasonably self-explanatory. The intention is to show how different elements of the plot can be modified rather than to recommend the different options shown.

scatterplotR <- function() {
  plot(Height ~ Weight, data = loomweights,
       xlim = c(50, 450), ylim = c(50, 130),      # modify plot limits
       xlab = "Weight (g)", ylab = "Height (mm)", # add axis labels
       main = "A base R scatterplot",             # add a title
       col = "red", pch = 16, cex = 1.2,  # colour, shape, size of plot symbols
       cex.axis = 1.2, cex.lab = 1.3, cex.main = 1.4)  # text sizes
}
scatterplotR()

Figure 8.3: Basic code for a base R scatterplot.

The use of colour was discussed in Section 2.3.5. Options for the plotting character (pch) are shown in Figure 8.4; cex controls the size of the plotting symbol; other cex arguments control the size of the axis text, axis labels and title.

Figure 8.4: Available symbols (shapes) for plotting points in R.
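A chart like Figure 8.4 can be regenerated with a couple of lines; a minimal sketch (the layout arithmetic is ours):

# Display plotting symbols 0 to 25 with their numbers
plot(0:25 %% 13, 0:25 %/% 13, pch = 0:25, cex = 1.5,
     axes = FALSE, xlab = "", ylab = "", ylim = c(-0.5, 1.5))
text(0:25 %% 13, 0:25 %/% 13 + 0.25, labels = 0:25, cex = 0.8)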
In the ggplot2 code that follows the size, colour and shape (equivalent to pch in base R) of the plotting symbols are specified by the arguments in the geom_point function. Note the difference from the base R code in the way the plot limits and axis labels are determined. If any of the shapes 21–25 are used, specify the fill colour using, for example, colour = "blue", fill = "red", shape = 21 to get a circle with a blue border and red fill. In base R pch = 21, col = "blue", bg = "red" would be used.

fig8.2b <- function() {
  ggplot(loomweights, aes(x = Weight, y = Height)) +
    geom_point(size = 3, colour = "blue", shape = 15) +
    xlim(50, 450) + ylim(50, 130) +
    xlab("Weight (g)") + ylab("Height (mm)") +
    ggtitle("A ggplot2 scatterplot") +
    theme(panel.grid.minor = element_blank()) +
    theme(axis.text = element_text(size = 12),
          axis.title = element_text(size = 14)) +
    theme(plot.title = element_text(size = 16, face = "bold"))
}
fig8.2b()

Figure 8.5: Basic code for a ggplot2 scatterplot.

8.2.2 Adding lines to plots

Simple lines

From Figures 6.6 and 6.7 the division between ‘small’ and ‘large’ loomweights (the antimode) occurs at about 220 g; the antimode for height is at about 80 mm. Visually a line running from the coordinates (100, 85) to (350, 60) would appear to represent a natural ‘break’ in the point scatter. A solid line will be added to the plots in Figure 8.2 to show this division with dashed lines showing the antimodes. Figure 8.6 shows the result.

Figure 8.6: Scatterplots for loomweight heights and weights with added reference lines using (a) base R and (b) ggplot2. See the text for an explanation.

Once you have the base R scatterplot, the following lines of code can be added to produce the lines seen in Figure 8.6a.

abline(v = 220, lty = 2, lwd = 2)
abline(h = 80, lty = 2, lwd = 2)
segments(100, 85, 350, 60, lwd = 2)

where, in the abline function, v means vertical and h means horizontal lines should be drawn in the positions specified; lty is the type of line; and lwd controls line width. In the segments function the first two numbers are the x and y coordinates at the start of the line to be drawn and the other two numbers are the coordinates of the end point.

The ggplot2 code is similarly structured. After the geom_point command in the code of Figure 8.5 add the following.

geom_vline(xintercept = 220, linetype = "dashed") +
geom_hline(yintercept = 80, linetype = "dashed") +
geom_segment(x = 100, y = 85, xend = 350, yend = 60) +

Note that linetype (or lty in base R) can be specified by either name or number. Figure 8.7 shows the options.

Figure 8.7: Available lines for plotting in R.

More complicated lines

The segments function can be used to fit a ‘theoretical’ line to a plot if this is wanted. The abline function can also be used for this purpose; for example abline(0, 1) would add a straight line with a zero intercept and slope of 1 to a plot. In ggplot2 using geom_abline(intercept = 0, slope = 1) has the same effect.

Sometimes it may be useful to add a reference line to the plot that can be thought of as a summary of the pattern in the data. This could be either a straight or curved line and both are shown in Figure 8.8. The plots are described as more complicated because they involve a modicum of statistics beyond the level assumed in much of this text. In the code to follow the straight line is based on linear regression where height is predicted from weight. The curved line is a scatterplot smoother obtained by fitting a loess curve to the data. If you just want a quick look at the plot then ggplot2 is simplest; just add

stat_smooth(method = lm) +

to the code of Figure 8.5 following the geom_point command.
The code actually used for Figure 8.8b was the following

stat_smooth(span = 0.6, se = F, colour = "black", linetype = "dashed") +
stat_smooth(method = lm, se = F, colour = "black") +

where se = F suppresses the default standard error bands that are produced, while colour = "black" overrides the default colouring. Without additional arguments stat_smooth() fits a loess curve. Very roughly, what this does is calculate a rather complicated kind of average value for predicting height in the region of each data point and ‘joins up the dots’ so produced. The size of the region about each point over which the ‘average’ is computed is determined by the span, smaller values producing more ‘wiggly’ curves. In the second line of code method = lm specifies that a linear regression model is fitted. You will get different lines if the roles of weight and height are reversed (i.e. weight is predicted from height).

Figure 8.8: Scatterplots for loomweight heights and weights with added linear regression lines and loess smooths using (a) base R and (b) ggplot2. See the text for an explanation.

The corresponding base R code, which may be added to the code of Figure 8.3, is

z1 <- lm(Height ~ Weight, data = loomweights)
lines(loomweights$Weight, predict(z1), lwd = 2)
z2 <- loess(Height ~ Weight, data = loomweights, span = 0.6)
lines(loomweights$Weight, predict(z2), lty = 2, lwd = 2)

where the first and third lines fit a linear regression model and loess curve using the lm and loess functions. The predicted values from each model are obtained from the predict function; the lines function adds the fitted lines to the point scatter. (Note that lines joins points in the order in which they occur in the data; the loomweights data are assumed here to be ordered by Weight – if they are not, the loess curve in particular should be plotted after sorting.)

8.2.3 Points

Adding points to plots

Thus far a subset of eight outliers has been identified and omitted from plots. These will now be added back into the data set to illustrate how points can be added to a plot. The eight cases are in a data frame outliers with column headings the same as for loomweights. Adding the outliers to the base R plot requires that something like the following be added to the code of Figure 8.3

points(outliers$Weight, outliers$Height, pch = 17, col = "blue", cex = 1.2)
legend("bottomright", c("No", "Yes"), col = c("red", "blue"),
       pch = c(16, 17), title = "Outlier", bty = "n", cex = 1.3)

after changing the upper values in xlim and ylim to 800 and 140 to accommodate the outliers. The result is shown in Figure 8.9a.

Figure 8.9: Scatterplots for loomweight heights and weights including outliers using (a) base R and (b) ggplot2. See the text for an explanation.

Something similar is possible in ggplot2 by adding

geom_point(data = outliers, aes(x = Weight, y = Height),
           size = 2, colour = "red", shape = 15) +

to the code of Figure 8.5. This automatically scales the axes but does not produce a legend. An alternative is to create a data frame using all the data, adding a third column Outlier with entries No and Yes. Call the data frame loom.all and replace the first line of code in Figure 8.5 with

ggplot(loom.all, aes(x = Weight, y = Height, colour = Outlier, shape = Outlier)) +

to get Figure 8.9b, where the scale_colour_manual function has also been used to change the default colours and scale_shape_manual has been used to change the default symbol shapes.
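A sketch of how loom.all might be assembled, with illustrative (not the published) colour and shape choices:

# Combine the main data and the outliers, flagging each case.
library(ggplot2)
loom.all <- rbind(data.frame(loomweights, Outlier = "No"),
                  data.frame(outliers, Outlier = "Yes"))
ggplot(loom.all, aes(x = Weight, y = Height, colour = Outlier, shape = Outlier)) +
  geom_point(size = 2) +
  scale_colour_manual(values = c(No = "red", Yes = "blue")) +
  scale_shape_manual(values = c(No = 16, Yes = 17))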
Multiple groups

The preceding example can be thought of as a way of displaying multiple groups in a set of data. As a more complicated illustration the data set lwphased, introduced in Section 7.3, is used. Remember that this consists of 118 loomweights, excluding outliers, that could be assigned to one of four phases; in addition to Weight and Height and four other variables a factor with four levels, Phase, is included. We want to produce a plot of height against weight using different coloured and shaped plotting symbols to distinguish between phases. These are shown in Figure 8.10.

Figure 8.10: Scatterplots for phased loomweight heights and weights showing the phasing, using (a) base R and (b) ggplot2.

This is much easier in ggplot2 than base R. In base R the ‘messy’ part is mapping Phase to colours and symbols. A web search reveals several approaches to this, that given in the code below being among the simplest.

plot(lwphased$Weight, lwphased$Height,
     pch = c(21,22,24,25)[as.numeric(lwphased$Phase)],
     bg = c("red","blue","magenta","green2")[as.numeric(lwphased$Phase)], ...)
legend("topleft", c("I", "II", "III", "IV"),
       pt.bg = c("red", "blue", "magenta", "green2"),
       pch = c(21,22,24,25), title = "Phase", bty = "n", cex = 1.3)

What’s happening here is that the as.numeric function maps the four levels of the Phase factor to the integers 1 to 4, which are then associated with the plot symbols given by pch and colours given by bg (background). The opportunity has been taken to illustrate the use of filled symbols with a border (21–25 in Figure 8.4). The border colour is black by default and bg is the fill colour. If you want to vary the border colours an additional argument, col, defined in the same way as bg, will do it. Presentational arguments are implicit (. . .) but the legend is given in full to illustrate the use of pt.bg (rather than bg) to specify fill colours. The result of all this is Figure 8.10a.

In ggplot2 all that’s needed is to specify Phase as colour and shape aesthetics in the call to ggplot, thus

ggplot(lwphased, aes(x = Weight, y = Height)) +
  geom_point(aes(shape = Phase, colour = Phase), size = 3) + ...

with presentational code as before and default output shown in Figure 8.10b. We tend not to like the default colours and symbols; to get something similar to Figure 8.10a map Phase to the fill aesthetic – the fillable shapes 21–25 take a fill rather than a colour – and use

geom_point(aes(shape = Phase, fill = Phase), size = 3) +
scale_shape_manual(values = c(21,22,24,25)) +
scale_fill_manual(values = c("red", "blue", "magenta", "green2")) + ...

which produces Figure 8.11a. Colour palettes were mentioned in Section 2.3.5 but have not yet been illustrated. Figure 8.11b provides an example using the Dark2 ColorBrewer palette, for which the RColorBrewer package needs to be installed. Precede the ggplot command with library(RColorBrewer) and replace the scale_fill_manual command with

scale_fill_brewer(palette = "Dark2") +

Figure 8.11: Scatterplots for phased loomweight heights and weights indicating the phasing, illustrating ggplot2 variations on Figure 8.10b: (a) using scale_fill_manual, (b) using the Dark2 palette from the RColorBrewer package.
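Pulling the pieces together, a sketch of the complete call (presentational settings omitted; the palette is one possibility, not the one used for the figures):

library(ggplot2)
# 'lwphased' is assumed to contain Weight, Height and the factor Phase.
ggplot(lwphased, aes(x = Weight, y = Height)) +
  geom_point(aes(shape = Phase, fill = Phase), size = 3) +
  scale_shape_manual(values = c(21, 22, 24, 25)) +
  scale_fill_brewer(palette = "Dark2") +   # or scale_fill_manual(values = ...)
  xlab("Weight (g)") + ylab("Height (mm)")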
Figures 8.10 and 8.11 have been presented to illustrate some of the options available; plots (a) in both figures involve a user-defined selection of colours; plots (b) involve automatic selection. In practice, and particularly for publication, some experimentation with colour and shape may be desirable. What looks best may be partly dependent on the nature of the juxtaposition of points from different groups.

As far as interpretation goes there is no obviously clear relationship between weight, height and phase. This is a little misleading since we’ve already seen from Figures 7.2 to 7.4 that the data for Phases II and IV show some evidence of bimodality and there is a tendency for later weights to be heavier. As discussed in Section 7.3, however, the overall pattern is quite complex.

8.2.4 Scatterplot matrices

Scatterplot matrices, also called pairs or draftsman plots, obtained using the pairs function, provide a quick way of examining all possible bivariate scatterplots when more than two variables are available. For illustration the 118 phased loomweights are used; in addition to Weight and Height the maximum and minimum lengths (mm) of the rectangular tops and bases of the loomweights are available (Tmax, Tmin, Bmax, Bmin). The first few lines of the data frame lwphased are shown in Table 7.1. Figure 8.12 is produced in base R using pairs(lwphased[, 1:6]).

Figure 8.12: A scatterplot matrix for loomweight dimensions using base R.

It can be seen that weight and height are clearly linearly related; size as measured by these two variables also shows some relationship with the base dimensions but not those of the top. The two base measurements are related, the striking linear ‘edge’ being associated with square bases; the two top measurements show a similar pattern but are not related to the base measurements.

The plots above the diagonal are the same as those below the diagonal, except that axes are reversed, and it’s possible to suppress one or the other if desired. Some of the presentational arguments available in plot can be used here. Figure 8.13 uses code similar to that which follows Figure 8.10, namely

pairs(lwphased[, 1:6],
      pch = c(15,16,17,1)[as.numeric(lwphased$Phase)],
      col = c("red", "green", "blue", "black")[as.numeric(lwphased$Phase)])

Figure 8.13: An enhanced scatterplot matrix for loomweight dimensions using base R. Colours and plotting symbols correspond to different phases.

It’s possible, but messy, to add legends; because of the way the plot is compressed the result doesn’t necessarily look that satisfactory, so it is not attempted here. Our view is that scatterplot matrices are useful tools but, unless the number of variables is small, are perhaps best viewed in the privacy of one’s office/study with a view to selecting the more interesting plots to subject to more detailed scrutiny and publication. The full plot takes up valuable space and the detail can be difficult to see.

It’s possible to vary the display in the upper- and lower-diagonal panels, to modify variable labels, and to have graphs along the diagonal. This, for the most part, requires some programming and will not be discussed in detail here.
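One small variation worth knowing (a sketch; upper.panel is a documented argument of pairs) suppresses the redundant upper panels:

# Show only the lower triangle of the scatterplot matrix.
pairs(lwphased[, 1:6], upper.panel = NULL)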
If it’s of interest ?pairs provides code for some of the things that can be done, as does Chang (2013: 112–116). The scatterplotMatrix function from the car package will do some of the work for you but is not illustrated. There is a function for scatterplot matrices, ggpairs, in the GGally package, which we must confess to finding difficult to use. Figure 8.14 shows a fairly simple example where 2D KDEs are plotted in the upper diagonal.

Figure 8.14: A ggpairs scatterplot for loomweight dimensions. The upper triangle shows 2D KDEs for pairs of variables and the diagonal shows univariate KDEs for the individual variables.

The code to produce this is given below.

library(GGally)
ggpairs(data = lwphased, columns = 1:6,
        upper = list(continuous = "density"),
        lower = list(continuous = "points"))

Note that the syntax is more akin to that used in base R plots than the usual ggplot2 plots. As it stands phases are not differentiated. It’s possible to rectify this by adding aes(colour = Phase) as an argument, but the KDEs then become difficult to read.

Chapter 9

Pie charts

Tables are preferable to graphics for many small data sets. A table is nearly always better than a dumb pie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial disarray, both within and between pies . . . Given their low data-density and failure to order numbers along a visual dimension, pie charts should never be used. (Tufte 1983: 178)

9.1 Introduction

Tufte’s opinion concerning pie charts is not extreme – it is, if anything, the orthodox view of statisticians who take an interest in graphical presentation. Unwin (2015: 60), in a text devoted to graphical analysis in R, notes that pie charts are ‘not generally recommended’ and that pie charts ‘in 3D should most definitely always be avoided’.

The function for constructing pie charts in base R, pie, is fairly minimalist as R functions go. It is accompanied by the warning ‘Pie charts are a very bad way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A bar chart or dot chart is a preferable way of displaying this type of data.’ This statement is accompanied by others to the effect that, if you think you need a pie chart, other methods of graphical presentation are better. Frequently pie charts are wholly unnecessary and can be replaced by a succinct summary of the information conveyed by the numbers used to construct them, saving space in the process.

There is no purpose-designed function for creating pie charts in ggplot2, though there are plenty of posts on the web that remedy this deficiency¹. Chang (2012: 307–309) devotes just over one page to the topic, about half the space being devoted to an illustrative base R pie chart. He comments that ‘the lowly pie chart is the subject of frequent abuse from data visualization experts’ and that more effective methods are usually available, but does comment that pie charts have the virtue ‘that everyone knows how to read them’. This last statement might be queried.

1 e.g., http://mathematicalcoffee.blogspot.co.uk/2014/06/ggpie-pie-graphs-in-ggplot2.html (accessed 25/05/2016).
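For what it’s worth, the base R construction is minimal; a sketch with invented counts, together with the dot chart the help page recommends instead:

counts <- c(A = 20, B = 12, C = 5, D = 3)  # illustrative data only
pie(counts)        # the discouraged option
dotchart(counts)   # the alternative suggested by ?pie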
Given a set of counts for each of a number of categories, converted to percentages of their sum, a properly constructed pie chart is a circular diagram divided into slices with the property that the area (or arc, or angle) of a slice is proportional to the percentage of the category it represents. A frequent objection to pie charts, supported by psychological studies of visual perception, is that humans find it difficult to judge relative proportions from areas, compared with other methods such as the barplots (Chapter 10) often suggested as an alternative.

If a pie chart is unannotated with numbers one gains only a qualitative impression of the relative abundance of different categories, often best summarised in a table of the numbers themselves with a commentary. This applies particularly when, as is often seen, a pie chart consists of very few categories (sometimes only one!). Charts can be annotated to show the numbers or percentages involved but, once again, direct commentary on these is preferable.

Another prevalent problem, which can apply to barplots as well, is that the absolute numbers involved can be rather small. In these circumstances conversion to percentages does not make much sense; to do so and then display the percentages in a pie chart provides a meaningless picture that would be changed markedly by the addition of a few extra observations.

Regrettably the pie chart, along with the barplot, is one of the most common graphics to be found in archaeological publications. The examples to follow illustrate all the problems noted above. While we concur with the negative sentiments that have been widely expressed about pie charts, rather than simply noting them, moaning, and moving on, it was thought useful to illustrate the problems with real examples. Those to follow were located in a short period of browsing through the books and journals on our bookshelves. There is nothing pathological about them – they are the rule rather than the exception – and represent the tip of a very large iceberg. Many other published applications would have served equally well for illustration.

9.2 Examples

9.2.1 Pointless pie charts

Table 9.1 is adapted from Table 3 of Joy’s (2014) study of Iron Age and early Roman cauldrons from Britain and Northern Ireland. The original table, of cauldron types (groups) by depositional context, was ordered alphabetically by context for those with a known context and numerically by group. There is no ‘natural’ ordering to either category and a clearer picture of any patterns in the data can be obtained by rearranging rows and columns according to the magnitude of the marginal totals, as is done here.

Table 9.1: Counts of Iron Age and early Roman cauldrons from Britain and Northern Ireland by depositional context (Source: Joy 2014, Table 3).

Context      Group III   Group IV   Group I   Group II   Total
Burial               0          0         0          1       1
Beach                0          1         0          0       1
Settlement           0          0         0          3       3
Lake/Loch            2          0         3          0       5
River                0          0         0          5       5
Bog/Moss             0          3         3          0       6
Hoard                0          0         1         18      19
Unknown              2          1         5          1       9
Total                4          5        12         28      49

What’s most obvious from the table is that cauldrons found in hoards are much the most common, accounting for 45% of the cauldrons from known contexts, ‘Unknown’ being the second most common ‘context’. All but one of the hoard cauldrons come from Group (Type) II, this group accounting for 67.5% of the cauldrons from known contexts.
The other obvious feature of the table is that the totals for most contexts and groups are too small for anything other than describing the composition of the sample available for study – there is no real basis for generalizing from the table to a wider population. Joy (2014: 334) states that Groups I, III and IV ‘are predominantly found in watery contexts’; allowing for unknowns this applies to 13/21 or just over three-fifths of the cauldrons in these groups. There is little more that needs to be said that adds to this tabular description, but Figure 6 in the paper presents a set of pie charts to go with the table; Figure 9.1 reproduces this in essence. Joy states that this, along with Table 3 in the paper, provides more detail about the ‘perhaps surprising’ range of depositional contexts – it doesn’t.

There are some problems with Figure 6 in Joy (2014). The pie charts for Groups I and II are correct; that labelled Group III repeats the plot for Group I; the Group III plot is labelled as Group IV, with Group IV missing entirely².

Figure 9.1: Pie charts showing context by cauldron group (Source: Joy 2014).

We should emphasise that the critique to follow is presented because aspects of it apply to a large number of pie charts that have appeared in the archaeological literature. The first point is that converting numbers to percentages with samples as small as about 10 or less, and certainly for samples as small as four and five (Groups III, IV), is largely meaningless and potentially misleading. Adding a single cauldron to one of the two groups represented in the chart for Group III would have a marked effect on its impact. Even if the numbers sustain representation as a pie chart it seems pointless to produce a chart where there are a very small number of categories to represent. It’s far more informative, and succinct, to state the relative proportions (e.g., 50:50, 50:30:20, 2:15:33:50) and provide a commentary on these (usually stating the obvious) rather than wasting space on a pie chart. The use of pie charts for comparative purposes is also to be discouraged since there are better ways of displaying the data. That the words ‘Figure 6’ and the figure itself could be removed from Joy (2014) without any effect whatsoever on the subsequent discussion – given the numerical information in the table – is telling evidence of the superfluity of the pie charts often seen in other studies.

2 The fact that no-one associated with publishing this paper in a leading international journal noticed this error cannot but underline the fact that multiple pie chart figures are frequently a waste of space. If a correction has been published we have not noticed it. No-one appears to look at them.

9.2.2 Dishonest pie charts

Figure 9.2a is a fairly close reproduction of the three-dimensional ‘pie chart’ of Figure 1 in Egri (2007: 44). It represents the numbers of amphorae for wine from the Lower Danube region, in the Roman provinces of Pannonia and Upper Moesia, from Augustus to Hadrian; the numbers around the circle are those of each amphora type available for study.
Figure 9.2: Pie charts showing the distribution of amphorae reflecting wine consumption in the Roman provinces of Pannonia and Upper Moesia, from Augustus to Hadrian: (a) ‘three-dimensional’, after Figure 1 of Egri (2007); (b) two-dimensional, and much better if you have to have one.

The pie chart is pointless and dishonest. It’s pointless because, as Egri (2007: 44) notes, ‘the amphorae of Dressel 2-4 . . . and Rhodian dominate’. This is the only information extracted from the chart, used to justify a subsequent focus on these types. Stating that 86% of the amphorae are of these types does the job more efficiently and succinctly. If you must have a pie chart then Figure 9.2b is better, but it is not needed since what you want it to tell you is simply expressed in words.

Why is it dishonest? It’s an example of what are sometimes called presentational graphics, thought to be suitable for business presentations, advertising etc. A pie chart is a two-dimensional construct, the areas of the slices showing the relative proportions of the different categories. This interpretation is lost in the elliptical representation imposed by the use of a non-existent third dimension. Thus, three-dimensional pie charts are a distorted representation of the truth with added fictional features and should be banned from academic publication. Excel is to blame for a lot of examples, or, more properly, Excel users³.

3 This isn’t an isolated example; the journal Archeologia e Calcolatori is a fruitful source (e.g., Volumes 16: 136, 144; 21: 115–121; 22: 187); Chester-Kadwell (2009: 70), Christensen (2006: 1638) (elliptical two-dimensional pie charts), De La Fuente (2011: 244) and Cocis (2015: 61) are particularly horrible examples elsewhere in the recent journal literature that defy the imagination.

9.2.3 Could do better

Table 9.2 shows the percentages of cattle, pig and sheep/goat remains from seven phases identified in excavations of the Wroxeter baths basilica (Barker et al. 1997). The data have been extracted from a series of pie charts given as Figure 2 in Hammon (2011). He discusses the charts with reference to percentages shown on them, with a particular focus on pig proportions.

Table 9.2: Percentages of cattle, pig and sheep/goat (Sh/Gt) bones (based on NME) from seven phases of the Wroxeter baths basilica.

Phase   Cattle %   Pig %   Sh/Gt %      n
T-V           43      19        38    236
W             59      16        25   2631
X             45      20        35     94
X-Y           67      19        14   1422
Y             56      26        18   2639
Y-Z           58      24        18   1849
Z             59      25        19    755

A multiple pie chart representation of these data, which essentially reproduces that of Hammon, in colour, is shown in Figure 9.3.

Figure 9.3: Pie charts showing the cattle, pig and sheep/goat proportions for seven phases from the Wroxeter baths basilica (after Hammon, 2011, Figure 2).

Figure 9.4 shows ‘descendants’ of Roman sheep and cattle, believed to resemble them (Cool 2006: 86–87); similar data to that of Hammon (2011) are analysed in Sections 10.3 and 11.2.2. Archaeologists, of course, work with excavated bones but pictures of live animals are more exciting.

Figure 9.4: Soay sheep and Dexter cattle. Similar data to that of Hammon (2011) are used in Sections 10.3 and 11.2.2.
(Sources: Photographs by Evelyn Simak (top) and Annie Kavanagh (bottom), distributed under a CC-BY 2.0 license).

From Table 9.2 the obvious features are that (a) cattle predominate throughout, (b) there is a sharp decline in the proportion of sheep/goat after Phase X, at which point the relative proportion of pig to sheep/goat also increases markedly, (c) Phase X-Y is a little anomalous, (d) the last three phases have very similar profiles. Hammon (2011: 284) says most of this and, given a table, the need for the pie charts (which occupy half a page) is questionable.

Using pie charts the dominance of cattle is fairly obvious, as is the similarity of profiles for the last three phases. The relationship between pig and sheep/goat proportions, a focus of the table, is more difficult to judge other than qualitatively. To make quantitative statements about the relationship, as is done in the paper, resort to the numbers that annotate the charts is needed; this is more naturally done with the data expressed in tabular form.

If you do want to make quantitative assessments from graphical analysis a (clustered) barplot does the job much better (Figure 9.5); barplots involve judgments based on length rather than area which, as Unwin (2015: 53) suggests, is ‘easier’. These are discussed in detail in Chapter 10; they are common in the archaeological literature.

Figure 9.5: A clustered barplot showing the cattle, pig and sheep/goat proportions from seven phases from the Wroxeter baths basilica.

The barplot makes it much easier to effect comparisons both within and between phases. The actual proportions are represented much more directly than in the pie chart, the ‘linear’ arrangement of the plot in a single row also aiding comparison. Apart from the more obvious features previously mentioned, the decline in the proportion of sheep/goat after Phase X and the changes in the relative proportion of pigs to sheep/goat are much more readily seen than in the pie chart.

A large number of barplots (22) similar to those of Figure 9.5, contrasting up to 12 assemblages, are used in Hesse (2011). Some of her plots draw on data from a paper of King (1999a) who makes extensive use (there and elsewhere) of yet another approach to displaying the data. This is the ternary diagram (or triangular graph) designed specifically for the purpose of displaying three-category data sets expressed as percentages. They are particularly useful for effecting comparisons when a large number of assemblages are involved. A brief illustration, using the data of Table 9.2, is provided here; a fuller account including R code is provided in Chapter 11.

Figure 9.6 was obtained using the ggtern package, one of several based on the ggplot2 ‘philosophy’. The position of each point in the diagram is determined by the three percentages; because they sum to 100%, any two of the values automatically determine the third. Technically this means the data are two-dimensional and can be plotted in two dimensions in what is called a simplex.

Figure 9.6: A ternary diagram showing the cattle, pig and sheep/goat proportions from seven phases from the Wroxeter baths basilica.

For the moment terminology and detail need not concern us.
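For the curious, a minimal sketch of how a plot like Figure 9.6 might be produced, assuming the percentages of Table 9.2 are in a data frame wroxeter with columns Cattle, Pig and Sheep (fuller code is given in Chapter 11):

library(ggtern)
ggtern(data = wroxeter, aes(x = Cattle, y = Pig, z = Sheep)) +
  geom_point(size = 3)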
What’s important is that each point on the plot represents the data for the corresponding category exactly; it embodies precisely the same information as a pie chart. The distance between any two points on the plot can be read as the distance between the two cases involved, something not readily done with either a pie chart or barplot.

Reading a ternary diagram can take a little practice to get used to, but the close similarity between Phases Y, Y-Z and Z, previously noted, is immediately apparent, testified by the adjacency of the symbols representing them on the plot. The close similarity of Phases T-V and X is also evident, with the comparatively lower proportion of cattle an obvious feature. The other two phases stand out as slightly anomalous: Phase X-Y most obviously because of the comparatively higher proportion of cattle, Phase W because of the relatively high proportion of cattle in conjunction with a ratio of sheep/goat to pigs greater than 1. This contrasts with other phases having a similar proportion of cattle but with a sheep/goat to pig ratio of less than 1.

Ternary diagrams are only useful with precisely three categories in the data to represent; the technique of correspondence analysis is available for larger tables (Chapter 12). Clustered barplots are available with more than three categories but can become unwieldy or impractical with large tables and, once again, correspondence analysis is available. The main message of this example is that some other method of presentation (tables, barplots, correspondence analysis) is almost invariably a better way of extracting useful information from the data in situations where multiple pie charts have been used.

Robeerst’s (2005: 83) application is identical to that of Hammon (2011); Chester-Kadwell (2009: 70) presents an example using multiple pie charts showing the percentage of metals in finds from different contexts that, in addition to using three dimensions and gratuitous shading (‘chart-junk’, a term frequently employed in Tufte 1983), is wholly unnecessary. The charts show that metal-detected finds are largely made of copper alloy, whereas finds from other contexts are made from iron or copper alloy in roughly equal proportions; this can be stated in a single sentence without the need for the pie charts (a table would make the point equally well but is not provided). The pottery report of Leach (1993) in Woodward and Leach (1993) contains several multiple pie charts, linked to positions on maps to illustrate the spatial distribution of different forms and fabrics. They are borderline unreadable because of the number of categories involved and the small numbers in many categories; where differences are obvious they could have been more concisely expressed. One could go on!

9.2.4 From bad to worse?

The following story is anecdotal, but regrettably true. Correspondence analysis (Chapter 12) has been mentioned above and, briefly, in Section 7.4. Today it is a widely-used method for the graphical display of large, possibly complex, tables of categorical data. It was not always so. The method only really began to attract archaeological attention in British publications from the early 1990s. We first collaborated on a problem using the method for a conference presentation in 1993, published (eventually) as Cool and Baxter (1995). A fuller, more archaeologically oriented, account was published in Cool et al. (1995).
The eminent archaeologist with editorial responsibility for the series in which this appeared asked, rather plaintively, if the analysis might not be done more readily with pie charts and had to be told, politely but firmly, ‘No!’. At the time the excuse, perhaps, existed that correspondence analysis would have been unfamiliar in archaeological publications in the Anglo-Saxon speaking world. This is no longer the case, and one would have hoped it was now possible to submit papers for publication without the choice of correspondence analysis as a means of presenting the data being queried. Alas! It isn’t. On several subsequent occasions referees or editors have asked if we could replace a correspondence analysis with pie charts or barplots, on the grounds that these would be more ‘comprehensible’. Apart from being wrong this might be read as sad testimony to the fact that there are eminent and literate archaeologists – conversant with current theoretical intricacies and/or the area(s) they specialise in – who regard the pie chart as the acme of graphical aspiration.

For illustration we draw on the most recent paper to suffer this fate (Cool 2016) where some effort was needed to resist the suggestion. The data in question are given in Table 9.3. The table shows counts of coloured glass ‘counters’ by phase recovered in the excavation of Insula VI.1, Pompeii, Italy, undertaken by the Anglo-American Pompeii Project between 1995 and 2006.

Table 9.3: Colours of glass ‘counters’ by phase, from excavations at Insula VI.1, Pompeii.

                                            Phase
Colour                     Code      I    II   III    IV     V
Multi-Coloured             Multi     1     6    11     5     4
Opaque-White               OpW       1     1     4     3     3
Black                      Bl        1     7    15    10     8
Dark_Yellow_Brown          DYB       0     5     5     5     1
Deep_Blue                  DpB       1    14    11    11     5
Peacock                    Pea       1     4     6     3     3
Mid/Light Blue             MiB       0     4     1     4     1
Yellow/Brown               YB        4    17    18    15     7
Yellow/Green               YG        4     8     5     3     6
Pale_Green                 PaG       2    10    13    12     7
Green-tinged colourless    GTC       0     5     5     2     2
Colourless                 Cls       2    17    14     6     5
Blue/Green                 BG        4    14    24    31    25
n                                   19   112   132   110    77

More properly the ‘counters’ are plano-convex discs of glass found throughout the ancient world. They are normally called ‘counters’ in English, ‘pedine’ in Italian, ‘pion’ in French and ‘Spielsteine’ in German, with the connotation that they were used as playing pieces in the various board games known to have been played by people in the Roman world. The thrust of Cool (2016) is that this interpretation may often be wrong, many ‘counters’ being more likely to be important elements of interior decoration. The arguments for this are not relevant here, but will be discussed in connection with the use of barplots in Section 10.4.

Table 9.3 is a subset of the full 20 × 7 data set based on 20 standardised colours and seven phases. Owing to small sample sizes seven colours have been omitted, one phase has been eliminated, and two merged. As given here Phase I is the first century BC; Phases II and III cover, roughly, the first and second thirds of the first century AD; Phase IV runs from AD 62 to AD 79, between the earthquake that ruined parts of Pompeii and the eruption of Vesuvius which destroyed it; Phase V is the eruption level. Given that 13 categories of glass colour are used the sample size of 19 for Phase I is rather small, but it is retained as it leads to interpretable results in the correspondence analysis to be reported in Section 12.5.

In what might be thought to be a spirit of irony the referee’s suggestion of showing the data in the form of pie charts has been used here for illustration. The outcome is shown in Figure 9.7.
Figure 9.7: Pie charts for counter colours by phase from Pompeii, Insula VI.1.

Other than ‘multi-coloured’ some effort has gone into making the colours of the slices correspond roughly to that of the glass, hence the rather muted appearance. As it happens, differentiation between slices is not too bad; more generally, particularly with more categories to represent or if publishing in black and white, as was necessary for Cool (2016), such differentiation can be problematic. We have seen pie charts using as many as 30 categories and, once annotation is added, the result can look both awful and incomprehensible.

Shennan (1997: 23–24) – along with Drennan (2009), one of the textbooks most widely used for introducing quantitative methods to archaeologists – rather surprisingly, in our view, states that the pie chart ‘is a very helpful mode of data presentation when the aim is to illustrate relative proportions of unordered categories, especially when making comparisons’. We obviously disagree entirely. More pertinently, he does make the point that ‘it can be confusing if there are numerous categories, or categories with zero or very small entries’.

If a pie chart is used at all it ought to be for a purpose. In many examples we have seen where some sort of purpose is evident, the purpose is often to emphasise the dominance or otherwise of a subset, often small, of the categories involved. This quite often takes the form of a comment on the percentages involved, which are frequently included on the chart. This essentially renders the chart redundant and is better done with reference to a table of the data. The same is true when pie charts are used for comparative purposes; other graphical methods are preferable but, if patterns are present to comment on, a judiciously constructed table will reveal the patterns if the table is not too large.

As far as Table 9.3 is concerned it was constructed to investigate whether there was any trend in the way different colours were used over time. The correspondence analysis to be reported in Section 12.5 shows clear and interpretable temporal variation. Figure 9.7 may or may not look pleasing, or at least not offensive, to some eyes. It is wholly useless for the purpose for which the data were collected; that such a representation can be urged on one by journal editors and referees, in preference to methods that are much more effective, is a cause for regret and concern.

Chapter 10

Barplots

10.1 Introduction

Barplots (or bar charts) are one of the commonest kinds of graphic to be found in the archaeological literature. Histograms (Chapter 4) are a specialised type of barplot. In this form the counts of continuous data within contiguous intervals are represented by bars whose areas are proportional to the frequencies. That the data are continuous is indicated by the contiguity of the bars.

Barplots are also commonly used for representing categorical data where each case in a sample is assigned to a category (e.g., red, green, blue).
For a single categorical variable counts within each category are obtained, and the counts (possibly after conversion to percentages) are displayed in a barplot where the heights of the bars are proportional to the frequency/count/percentage. This is an instance of nominal data. The nature of the data – in contrast to the histogram – is indicated by gaps between the bars. Provided they are the same, both the width of the bars and the gap between them are irrelevant.

For the kind of data just discussed the order of categories as displayed on the barplot is arbitrary. Choosing an order that reflects the relative proportion of different categories (e.g., most to least common) often aids interpretation. If the categories have a natural order (e.g., small, medium, large) then the data are ordinal and this should be reflected in the order in which bars are displayed. In the case just described categories are ordered but not associated with a numerical value, so the distance between bars on a plot carries no information.

If categories can be equated with numerical values, usually integers (i.e. 0, 1, 2 . . .), the data will be referred to as discrete; that is, values between 0 and 1 etc. can’t occur. An example might be the number of distinct decorative motifs on an artefact. The data are ordinal and the distances between bars on a barplot are meaningful. There is a ‘grey area’ here where data are continuous, but rounded to the nearest integer of some unit. If the range of the data is reasonably small it is sometimes convenient to treat the data as discrete. An example is provided in Section 10.4.1 where diameters of glass counters from Pompeii and elsewhere were measured to the nearest millimetre.

Another situation where barplots are used is where data are categorised in some way (e.g., Context A, B, . . .) and then within each category values of a continuous variable are contrasted (e.g., total weight of pot) and the values represented by bars.

So barplots are widely used for a variety of purposes. We’d also argue overused, misused and abused; problematic uses are illustrated in Sections 10.4.2 and 10.4.3. In summary, the use of barplots can be criticised for some of the same reasons we think pie charts are largely unnecessary. You don’t need a barplot if you only have two categories to deal with; if you have a large number of categories but counts for most categories are generally low, just say so; pseudo three-dimensional barplots should be banned from academic publications.

Our preference, when dealing with a small number of categories, is to present the data in a judiciously structured table with added commentary. This is, perhaps, a matter of taste. Unwin (2015: 58) presents a barplot using just three categories and comments that a ‘table would show [the pattern] just as well in principle but a graphic is more convincing’, so we won’t be too dogmatic.

Another point to be noted, that we will not dwell on too much, is that there are plenty of examples in the literature of what should be histograms masquerading as barplots for discrete/categorical data. This is irritating rather than fatal since, other than misrepresenting the nature of the data, interpretation is not distorted too much. At least some of this misuse can be attributed to a dependence on Excel, or similar packages, which – for a long time – was unable to provide a proper histogram. Even now, as Unwin (2015: 21) notes, ‘histograms are difficult to draw in Excel’.
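Before turning to the mechanics, a small sketch of the nominal case just described (invented data):

# Counts of a categorical variable, ordered most to least common.
finds <- c("iron", "copper", "iron", "bone", "iron", "copper")
counts <- sort(table(finds), decreasing = TRUE)
barplot(counts, ylab = "Count")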
The next two sections discuss, in detail, how to produce barplots in base R and ggplot2. Doing so is not always as straightforward as one might wish, so the attractions of menu-driven software such as Excel are obvious. The archaeological literature is disfigured with such plots. The real problem is not so much Excel itself – you can produce perfectly good plots and we have published such – as the fact that users do not exercise their inalienable right (and capacity) to customise plots, opting instead for default or ‘jazzy looking’ presentations that can look awful. Taking the trouble to customise an Excel graph so that it is truly worthy of publication can be as time-consuming as doing the same in R. Once the mechanics have been dealt with, a variety of applications are provided in Section 10.4.

10.2 One-way data

The simplest case arises when a single set of numbers is displayed. For illustration Table 10.1, adapted from Table 3.1 in Cool (2006), will be used. The adaptation is to facilitate entry of the data into R; the column ‘Amphora’ is the percentage of amphora in the total pottery assemblage, measured by weight.

Table 10.1: Amphora as a proportion of total pottery assemblages for various types of Romano-British site. Source: Table 3.1 in Cool (2006).

               Type       Amphora
Bainesse       Military      27.8
Leicester      Urban         18.1
Lincoln        Urban         11.7
Alchester      Suburban      11.1
Kingscote      Rural          8.2
Ivy_Chimneys   Rural          5.4
Asthall        Roadside       4.9
Chignal        Villa          2.3

The first, unlabelled, column is the ‘Site’; when read into R in this form the sites are treated as the row names. Sites have been ordered by amphora percentage and it is immediately apparent that the order is a very good indication of what sort of settlement the assemblage is coming from (Cool 2006: 19). No further elaboration is needed and a barplot is superfluous, though they are used here in various forms for illustrative purposes.

Read the data as given into R (i.e. with a blank header for the first column), calling it amphora; extract the percentages as Amphora <- amphora$Amphora and create a site variable using Site <- row.names(amphora). Figure 10.1a, which is not very useful, is obtained in base R using barplot(Amphora). The presentation in Figure 10.1b is more informative.

Figure 10.1: Default and enhanced barplots for amphora presence on Romano-British sites using the data of Table 10.1.

Producing Figure 10.1b is not as straightforward as one might wish. The sixth element of the row names (Site) is renamed from Ivy_Chimneys – which was needed to allow the data to be read into R – to Ivy Chimneys; this is purely for presentational purposes. Colouring the bars makes sense here; the particular choice is illustrative only and different shades of grey will do as well; use solid colours rather than differently angled lines, dots etc., which should be avoided if possible – apart from being displeasing to the eye, ‘op-art’ effects can be created that distract from any message the plot might have to offer.

The way the legend is created also differs from other plot functions in that it is created within the call to the barplot function rather than separately from it.
The legend.text argument indicates both that a legend is needed and the text to appear within it; args.legend controls the appearance via a list whose arguments are the same as those of the legend function used with other plots.

Labels for the bars can be produced using the names.arg argument but results are unsatisfactory if there are lots of bars and/or labels are lengthy. One solution is to reduce the size of the print using the cex.names argument. This can be unsatisfactory since labels may become too small to read easily. Vertical labels can be produced with the argument las = 2 but the result can look ugly. Angled labels are nicer and the last line in the code to follow shows how this can be done.

In the following code bar <- barplot(...) produces the plot and saves the coordinates of the x-axis in bar; the axis function creates an x-axis with the plotting positions for the labels; the text function adds the labels. This has been written as a function, fig10.1b. Type fig10.1b(), once it’s working, to get Figure 10.1b.

fig10.1b <- function() {
  row.names(amphora)[6] <- "Ivy Chimneys"
  Amphora <- amphora$Amphora
  Site <- row.names(amphora)
  Colour <- c("red","skyblue","skyblue","orange","yellowgreen",
              "yellowgreen","grey70","pink")
  bar <- barplot(Amphora, ylab = "Percentage", col = Colour,
                 legend.text = c("Military", "Urban", "Suburban", "Rural",
                                 "Roadside", "Villa"),
                 args.legend = list(fill = c("red","skyblue","orange",
                                             "yellowgreen","grey70","pink")))
  axis(1, at = bar, labels = F, tick = F)
  text(x = bar, y = -1.25, Site, xpd = TRUE, srt = 45, pos = 2)
}
fig10.1b()

In ggplot2 the comparable function is designed for two-way tables of data. It’s possible to persuade it to deal with one-way data but this is not illustrated here.

10.3 Two-way data

10.3.1 Preliminary considerations

Even if a barplot for one-way data is judged necessary it can be fairly boring. Things get much more exciting when dealing with two-way tables of counts. This is partly because several decisions need to be made about the objective of the plot and presentational details, even if these are often made more-or-less unconsciously. For illustration a set of data from Table 9.1 of Cool (2006) is used, reproduced in part as Table 10.2.

Table 10.2: Later first- to mid-second century animal bone assemblages from Romano-British sites, after Table 9.1 in Cool (2006).

Site         Type    Cattle (%)   Sheep (%)   Pig (%)       n
Castleford   Fort            64          26        10   14155
Ribchester   Fort            63          25        11    2473
Castleford   Vicus           59          28        13    3648
Wroxeter     Urban           53          29        18    1548
Leicester    Urban           44          39        17    2039
Wilcote      Rural           38          52         9    3116

The table shows the percentages of cattle, sheep and pig bones found at Romano-British sites of different types. Sites are ordered by the proportion of cattle. This illustrates a more general pattern, that proportions of cattle and pig bones tend to be higher on more ‘Romanized’ sites – military establishments, major towns and villas. More specifically, Cool (2006: 81) observes that the ‘rural site of Wilcote stands out as the only one where sheep are dominant. The urban sites, though interestingly not the military ones, stand apart with their higher proportion of pigs’. She did not feel the need to embellish the text with a barplot that would have told the same story.

Note that Table 10.2 shows the numbers on which the percentages are based. Such information is sadly lacking in some archaeological publications that exploit barplots, as indeed, sometimes, is any quantitative information other than that to be gleaned from the plot itself.
En passant it can be noted that Cool additionally provided information on the different methods of quantification used, going on to argue that the percentages did not accurately reflect meat consumption because of different animal sizes and the differential amounts of waste generated in the butchering process. Crudely put, a pig produces more pork than a sheep does lamb, and more bits of it get eaten. That’s by-the-by.

One decision has already been made, standard for these kinds of data, and that’s to effect comparisons between assemblages in terms of relative rather than absolute quantities. In some circumstances a comparison of absolute quantities may be needed, which is why having access to the original numbers may be useful. If you do have access to the raw data, and even if relative proportions are of interest, there is a choice between using row proportions or column proportions. In the present example interest lies in comparing assemblages with respect to the proportions of animal bones from different species. In principle one could also compare species with respect to their distribution across sites; it’s not of great interest here because it would largely reflect relative sample sizes, but in other situations a choice needs to be made and it can be of interest to look at the data in both possible ways. That’s two decisions to make. The third one is what kind of barplot to use. The choice is between a clustered or a stacked barplot. The rest of this section illustrates how these may be constructed in base R and ggplot2.

10.3.2 Barplots for two-way data in base R

The first thing to realise is that the ‘natural’ way to prepare a table for publication, as in Table 10.2, isn’t necessarily the best way to prepare it for analysis in R. The approach adopted here is to prepare the data before entry into R, tailored to the needs of the function(s) to be used for analysis. If the barplot function in base R is to be used, recast the data in the form of Table 10.3. The species in the first column, when the table is read into R, will be treated as row names, as a column heading is lacking. Here the Castleford sites are distinguished by ‘F’ for Fort and ‘V’ for Vicus. ‘Default’ barplots are shown for the data in Figure 10.2 following the table.

Table 10.3: The data of Table 10.2 recast for use with the barplot function in base R.

         Castleford-F   Ribchester   Castleford-V   Wroxeter   Leicester   Wilcote
Cattle             64           63             59         53          44        38
Sheep              26           25             28         29          39        52
Pig                10           11             13         18          17         9

Table 10.3 is constructed such that the columns correspond to the categories to be compared, with respect to the categories designated by the row names. Figure 10.2 shows ‘default’ stacked and clustered barplots for the data, which are imported into R as a data frame faunalR. A minimalist approach, to produce a stacked barplot, is to try using barplot(faunalR), which produces an error.

Error in barplot.default(faunalR) : ‘height’ must be a vector or a matrix

Effectively this is telling you that barplot won’t work with a data frame, even if the contents are entirely numeric. Use tab <- as.matrix(faunalR) to convert the (numeric) data frame to a matrix, then barplot(tab) will produce Figure 10.2a. To get Figure 10.2b add the argument beside = T.

Figure 10.2: Default barplots, (a) stacked, (b) clustered, for species presence on Romano-British sites, using the data of Table 10.3.
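Putting the steps together (a sketch, assuming Table 10.3 has been read in as the data frame faunalR):

tab <- as.matrix(faunalR)    # barplot needs a matrix, not a data frame
barplot(tab)                 # stacked barplot (Figure 10.2a)
barplot(tab, beside = TRUE)  # clustered barplot (Figure 10.2b)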
The figures are almost serviceable but not all the bar labels are printed and a legend is needed somewhere. The first issue can be dealt with by reducing the character size of the labels using cex.names, but angled labels are more satisfactory. It’s easy enough to add a legend using legend.text but some work is needed to get it to display in a satisfactory fashion. Modified barplots are shown in Figure 10.3.

Figure 10.3: Enhanced barplots, (a) stacked, (b) clustered, for species presence on Romano-British sites using the data of Table 10.3 and base R.

We shall give the code for the stacked barplot, embedded within a function, then discuss how it needs to be modified to get the clustered barplot. Part of the aim here is to show how to get nicely angled labels, which is not straightforward for clustered barplots. The data of Table 10.3 are held in a file called faunalR.

fig10.3a <- function() {
  tab1 <- as.matrix(faunalR)
  bar <- barplot(tab1, beside = F, ylab = "Percentage",
                 main = "Stacked barplot - base R", yaxt = "n",
                 legend.text = T, names.arg = rep("", 6), ylim = c(0, 120),
                 args.legend = list(x = "top", horiz = TRUE, title = "Species",
                                    cex = 1.3, bty = "n"),
                 cex.names = 0.8, cex.lab = 1.3, cex.main = 1.4)
  axis(2, at = seq(0, 100, 20))
  text(x = bar, y = -2, names(faunalR), xpd = TRUE, srt = 45, pos = 2)
}
fig10.3a()

We have already met some of this kind of code in connection with the simple barplot of Figure 10.1b but will repeat some of the detail here.

• ylim = c(0,120) expands the y-axis to accommodate the legend but produces a tick mark at 120, which exceeds the maximum range of the data and we want to get rid of it. To do so yaxt = "n" gets rid of the axis and the axis command replaces it with one having a range (0,100) with tick marks at intervals of 20.

• names.arg = rep("",6) replaces the default labels with blanks; 6 is the number of labelled columns in the data table. This is needed because otherwise the default labels will overlap the angled labels we are going to produce.

• bar, as well as producing the plot, stores the plotting positions for the 6 labels later used in text, which adds angled labels.

• In text the argument y = -2 determines the placement of the labels below the barplot; you may wish to experiment with this to see what pleases most.

For the clustered barplot call the function fig10.3b and make the following changes to the function for the stacked barplot.

• Use beside = T and delete yaxt = "n" and the axis command (and obviously change the main title!).

• Replace the second element of ylim = c(0,120) with a smaller number, large enough to accommodate the legend without overlapping the bars. Here 80 works, but you may need to experiment a bit.

• In text replace x=bar with x=bar[2,]. It is not obvious why something like this is needed, so we explain below.

When using bar the resultant object holds the otherwise invisible plotting positions of the labels. For simple and stacked barplots this is not a problem, since the number of bars and plotting positions are the same. In a clustered barplot there are R × C bars where R is the number of rows and C the number of columns, so 3 × 6 = 18 for our example.
If you just use x=bar with a clustered barplot you will thus get 18 labels, three for each of the six clusters of bars; this is unsatisfactory since we only want one label for each cluster. Discussion forums on the web, where the question has been asked more than once, proved unsatisfactory for our purposes. We shall see shortly that producing angled labels for clustered barplots in ggplot2 is simpler, though overall the amount of code needed is not reduced. The following worked for our problem. When constructing the function for the clustered barplot insert the command print(bar) at some point. When you execute the function the following will appear (using our data).

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]  1.5  5.5  9.5 13.5 17.5 21.5
[2,]  2.5  6.5 10.5 14.5 18.5 22.5
[3,]  3.5  7.5 11.5 15.5 19.5 23.5

where rows correspond to species and columns to sites. Thus each column contains the three x-coordinates that determine the plotting positions of the three bars in the cluster. We only want one label per cluster, so extracting a single row – the central one seems sensible – to get one plotting position per cluster will do the trick. Hence the use of x=bar[2,] in the text function, with the result shown in Figure 10.3b.

10.3.3 Barplots in ggplot2

For use with ggplot2 the data of Table 10.2 can be restructured, outside of R, as in Table 10.4, then imported into R as a data frame, here called faunalG. The code used to get the clustered barplot in Figure 10.4 was

ggplot(faunalG, aes(Site, Percentage, fill = Species)) +
  geom_bar(position = "dodge", stat = "identity")

Remove the position = "dodge" argument to get the stacked barplot.

Table 10.4: The data of Table 10.2 recast for use with ggplot2.

Site           Type    Species   Percentage
Castleford-F   Fort    Cattle            64
Ribchester     Fort    Cattle            63
Castleford-V   Vicus   Cattle            59
Wroxeter       Urban   Cattle            53
Leicester      Urban   Cattle            44
Wilcote        Rural   Cattle            38
Castleford-F   Fort    Sheep             26
Ribchester     Fort    Sheep             25
Castleford-V   Vicus   Sheep             28
Wroxeter       Urban   Sheep             29
Leicester      Urban   Sheep             39
Wilcote        Rural   Sheep             52
Castleford-F   Fort    Pig               10
Ribchester     Fort    Pig               11
Castleford-V   Vicus   Pig               13
Wroxeter       Urban   Pig               18
Leicester      Urban   Pig               17
Wilcote        Rural   Pig                9

Figure 10.4 shows ‘default’ clustered and stacked barplots using ggplot2 and the geom_bar function.

Figure 10.4: ‘Default’ barplots, (a) clustered, (b) stacked, for species presence on Romano-British sites using the data of Table 10.4 and ggplot2.

The main problem with the plots, if you don’t dislike the colours, is that the bars are ordered alphabetically by site rather than in the informative order of Table 10.2. A modified version of the clustered barplot is shown in Figure 10.5. The opportunity is taken to show how the width between bars can be altered, colours changed, and rotated bar labels produced.

Figure 10.5: A modified clustered barplot for species presence on Romano-British sites using the data of Table 10.4 and ggplot2.

The code used follows, presented within a function as it is quite lengthy. Type fig10.5() to get the plot once the function is constructed.
The code used for Figure 10.5 follows, presented within a function as it is quite lengthy. Type fig10.5() to get the plot once the function is constructed.

fig10.5 <- function() {
library(ggplot2); library(grid)
# Re-order factor levels
faunalG$Site <- factor(faunalG$Site, levels = c("Castleford-F",
  "Ribchester", "Castleford-V", "Wroxeter", "Leicester", "Wilcote"))
p <- ggplot(faunalG, aes(Site, Percentage, fill = Species)) +
  geom_bar(width = 0.6, position = "dodge", stat = "identity") +
  theme(panel.grid.minor = element_blank()) + xlab("Site") +
  coord_fixed(ratio = 1/20) +
  scale_fill_manual(values = c("chartreuse3", "orchid", "skyblue")) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  theme(axis.text = element_text(size = 12),
    axis.title = element_text(size = 14)) +
  theme(plot.title = element_text(size = 18, face = "bold")) +
  theme(legend.position = "top", legend.text = element_text(size = 14),
    legend.title = element_text(size = 16, face = "bold"),
    legend.key.height = unit(.8, "cm"), legend.key.width = unit(.8, "cm"))
p
}
fig10.5()

The width argument in geom_bar controls the spacing between bars; the theme command that follows scale_fill_manual (used to change fill colours) produces rotated labels. The coord_fixed command is used to change the aspect ratio of the plot, in this case with the intention of elongating it; you need to play around with the ratio used to get the effect required.

The various barplots presented, once bars have been appropriately ordered, tell the same story as Table 10.2. Our preference is to base interpretation on the table, but it's partly a matter of taste; you may think the impact of a graphic is greater and that it makes the same points. It's helpful to include a table of the data even if you go down this route. As far as style of barplot goes, patterns are sufficiently strong, and the categories sufficiently few, that both the clustered and stacked barplots are easily 'read'. More generally, with more categories and less obvious pattern, we prefer to use clustered barplots; it's a lot easier to effect comparisons both between and within the clustered bars than it is with a stacked barplot.

10.4 Examples, including things to avoid

10.4.1 Glass 'counters' and their function

The example presented in Section 9.2.4 was based on glass 'counters' discovered in excavations at Insula VI.1, Pompeii. The term 'counters' usually implies that the artefacts were used as gaming pieces. Cool (2016) queries the universality of this assumption; here the evidence against it is presented. Table 10.5 shows counts of counters with different diameters from Insula VI.1 and from grave contexts elsewhere where they were associated with gaming pieces such as dice.

Table 10.5: Glass 'counter' sizes from Insula VI.1 Pompeii and grave contexts where the counters were associated with dice. Source: Cool, 2016.

      VI.1  Grave            VI.1  Grave
 5      2     0         17     16    39
 6      4     0         18     12    33
 7     18     0         19      8    11
 8     22     0         20      2     6
 9     47     1         21      4     5
10     66     1         22      1     1
11     66     0         23      1     0
12     63     0         24      0     0
13     55     2         25      0     0
14     48     9         26      1     0
15     35    19         27      1     0
16     23    29

The first, unnamed, column 'Diameter' is a continuous variable but was only measured to the nearest millimetre, and for comparative purposes it is convenient to treat the data as discrete. The lack of a header means that 'Diameter' will form the row names. The other headers are convenient for data entry and will be modified in Figure 10.7. This uses a clustered barplot to contrast the diameters of 'counters' from Insula VI.1 with those from the graves which, it seems reasonable to assume, were in fact used for gaming purposes. Figure 10.6 shows some of the material on which, here and in Sections 9.2.4 and 12.5, analysis is based.
Figure 10.6: Glass counters from Gloucester and Pompeii. (Source: the authors).

It is immediately obvious from Figure 10.7 that the distribution of size is bimodal, with counters associated with gaming pieces in graves being generally larger, with a somewhat smaller range, than those from Pompeii. This casts doubt on the interpretation of most of the Insula VI.1 'counters' as gaming pieces.

Figure 10.7: A clustered barplot for glass 'counter' dimensions from Insula VI.1, Pompeii and from graves where they are associated with gaming pieces.

Cool (2016) suggested they were more likely to be elements of interior decoration, and there is archaeological evidence from elsewhere that suggests this is a plausible explanation. Additionally, those counters that can reasonably be assumed to have been used for gaming purposes tend to have strongly contrasting colours (black and white); the Pompeii counters come in a wide range of colours, not all easily distinguished.

The data are held in a data frame, countersgraves, that is renamed data within the function. The base R code used for the graph was

fig10.7 <- function() {
data <- countersgraves
barplot(t(as.matrix(data)), beside = T, ylim = c(0, 60),
  xlab = "Diameter (mm)", ylab = "Count", legend.text = T,
  col = c("orange", "black"), names.arg = 5:27,
  args.legend = list(legend = c("Insula VI.1", "Graves"),
    title = "Context", cex = 1.3, bty = "n"),
  cex.names = .7, cex.axis = 1.2, cex.lab = 1.3, cex.main = 1.4)
}
fig10.7()

where countersgraves is the data frame described above, with the diameters as row names and two columns of counts. This has to be converted to a matrix and transposed (i.e. columns become rows and rows columns) to get the effect desired. Although this is just a single command it is worth embedding it in a function, as you may need to play around with some of the arguments to get a figure suitable for publication.

10.4.2 Chartjunk I

Tufte (1983: 107), in a chapter devoted to chartjunk, starts thus.

'The interior decoration of graphics generates a lot of ink that does not tell the viewer anything new. The purpose of decoration varies – to make the graphic appear more scientific and precise, to enliven the display, to give the designer an opportunity to exercise artistic skills. Regardless of its cause, it is all non-data-ink or redundant data-ink, and it is often chartjunk.'

Among the things he singles out are 'over-busy grid lines', 'redundant representations of the simplest data' and 'unintentional optical art'. He further comments that 'like weeds, many varieties of chartjunk flourish'. Some of this has already been discussed in Chapter 9 with reference to pie charts; sadly, archaeological uses of the barplot are not immune from this critique.

It can be difficult to reproduce some of the awfulness of plots appearing in the literature. Figure 10.8 uses the data of Table 10.1 to illustrate some of the problems.

Figure 10.8: An alternative (and poorly) 'enhanced' barplot for amphora presence on Romano-British sites using the data of Table 10.1.

There are far too many grid lines, which are far too intrusive and distract from the message the graph has to offer.
The visual impact of grid lines can be particularly pernicious with stacked barplots. This is illustrated in the alternative rendition of Figure 10.3a shown in Figure 10.9.

Figure 10.9: A version of Figure 10.3a illustrating the visual impact of grid lines.

It hurts visually to look at Figure 10.9; stare long enough and the gaps between the bars leap to the foreground, distracting from the otherwise clear message of the plot. If you must have grid lines use fewer and make them less obtrusive.

Unwin (2015: 3–4) quotes Tukey (1993) to the effect that graphics 'are for the qualitative/descriptive . . . never for the carefully quantitative (tables do that better)' and that graphics 'are for comparison . . . not for access to individual amounts'. If you accept this there is no real need for the grid lines – they're just what's produced by the defaults in the software that gets used. Adorning the bars with the numbers as in Figure 10.8 is superfluous. If you do want to comment on these do so on the basis of a table.

The examples above are not pathological; plenty of comparable plots are to be found in the literature. Figure 10.10 is a reasonable attempt to reproduce Figure 5 of Schörner (2007), except that we've been unable to reproduce the rather garish shading of the bars used there. The paper in which this appears draws on inscribed stelae dedicated to Saturn found in Roman North Africa. They are associated with the ritual sacrifice of animals and some of the stelae are dated; Figure 10.10 is intended to show that the sacrificial ritual 'was regularly performed all year round' (Schörner 2007: 98).

Figure 10.10: Monthly distribution of sacrifices to Saturn (after Schörner 2007).

Other than via the barplot the data are not given, but it's easy enough to see – and without the grid lines – that for January to December the number of sacrifices attested to is (0 2 1 2 4 4 2 1 4 4 2 3). Note that the numbers are rather small and integers, so the grid lines in the plot at 0.5, 1.5 etc. are wholly unnecessary, and the others arguably are too; the 'jazzy' shading of the bars, more intrusive in the original, is also unwanted. The message of the plot is that, January apart, between one and four sacrifices are attested to in the remaining months and there is no evidence that they were confined to official, or 'seasonal', events. You can say this in a single sentence, given the numbers, without recourse to a barplot. The plot falls into the category of 'redundant representations of the simplest data'.

The same might be said of several of the barplots in White (2007: 122) that appear in the same volume as Schörner (2007). Two barplots are used to display the differing sources of food offerings in burials at Poundbury, UK, in the period surrounding the Roman conquest of Britain and in the third/fourth centuries AD. The sample sizes are small (10 and 16), as is the number of categories represented. The message of the data is that in the earlier period only pig and sheep/goat offerings were found, in roughly equal numbers (4:6), while pig is not found in the later period, where chicken and sheep/goat dominate (6:8) with single examples of two other species (cattle, dog).
This can be stated in a single sentence and would be obvious from a small table; using two-thirds of a page to show barplots is a waste of space.

10.4.3 Chartjunk II

The previous section presented unfortunately real examples, quite common in practice, of barplots that were redundant and/or awful to look at. If they have any merit it is that they recognise that barplots are two-dimensional constructs, and you can (through gritted teeth) 'read' them if it is the only form in which data are presented. In this section we stray beyond the merely pointless or awful into the criminal use of barplots – namely pseudo-three-dimensional constructs. The arguments against this sort of representation have already been rehearsed in the context of pie charts and will not be repeated in detail here.

Here we shall illustrate their misuse, as with several previous illustrations, using examples taken from TRAC proceedings. It is not possible (easily) to produce three-dimensional barplots in R – it's not allowed. We have therefore had to write our own code to produce the following examples.

The first example, in Figure 10.11a, is an imperfect attempt to reproduce Figure 3 of Weber (2012) – our version actually looks better. In the original it occupies half a page. It shows the distribution of samian ware with potters' stamps from the Castleford (UK) pottery shop assemblage, from two production sources. The overwhelming majority, 419/425, come from a single production centre, Lezoux (France), and that's all that needs to be said. A graphic is entirely unnecessary unless it's assumed that some aspirant theoretical archaeologists can't understand a simple sentence with numbers in it. Given it's there, the most egregious thing about the graphic is the use of a three-dimensional barplot. Because of the distorting effect of the non-existent third dimension it is usually difficult to relate the bars to the numbers associated with the grid lines (of which there are too many). Our graph is better than that in Weber (2012), where the bars appear to float with no obvious relation to the grid, and need to be endowed with numbers to make sense.

Figure 10.11: Examples of unnecessary three-dimensional barplots, (a) after Figure 3 of Weber (2012), (b) after Figure 4 of Tupman (2005). See the text for contextual details.

The same paper, in Figure 1, contains a similar and equally pointless chart, albeit with four categories. As an added bonus the paper (Figures 2 and 4) contains two pie charts – thankfully two-dimensional – which are borderline unreadable, and too 'busy', because of the large number of categories involved. The important information the pie charts are intended to convey would be better displayed in tabular form.

For a second example Table 10.6 is that on which Figure 4 of Tupman (2005) is based. It's taken from a study of the relationship between social status and commemoration as testified by inscriptions on funerary monuments for the non-elite from Iberia, dating to the first to third centuries AD. Four classes of monuments were defined, the table relating to inscriptions on altars.

Table 10.6: Individuals commemorated with and setting up funerary altars (after Tupman, 2005, Figure 4).

               Ingenui  Liberti  Servi  Incerti
Deceased          3        3       3       6
Commemorators     2        3       2      10
Tupman's terminology is used in the table and the associated plot of Figure 10.11b; the status designations correspond to freeborn citizens, freedmen, slaves and uncertain. Inscriptions are categorised as referring either to the deceased or to commemorators. In Tupman (2005) the table and associated figure is one of a suite of four, and it is the contrast between them that is of interest. It is therefore perhaps a little unfair to treat a single table and figure in isolation; we do so in order to reiterate a number of points made elsewhere.

Ironically, the bulk of Tupman (2005) is taken up with the archaeological background and theory. The actual data analysis is confined to about one page at the end of the paper, and is conducted in quantitative terms without reference to the graphs. The discussion is also couched in terms of percentages, some based on quite small numbers.

As far as Figure 10.11b goes, it does not exactly reproduce Figure 4 of Tupman (2005), where the bars 'float' and it is difficult to relate them to the numbers associated with the grid lines; the iniquities of pseudo-three-dimensional barplots have already been commented on and this is one of them. We have not attempted to reproduce the wholly unnecessary variable grey-shaded background to the plots that Tupman uses. One problem with this kind of graph is the optical illusions it can generate. Look long enough and what is intended as the background grid leaps to the foreground.

In isolation, and ignoring Incerti, Table 10.6 shows that for each status category there are few inscriptions (5 or 6) more-or-less equally split between the deceased and commemorators. In fact, given the other monument types, the contrasting graphs are interesting – despite the poor presentation – and some commentary on them would have been welcome. All that is actually said is that 'Figures 3, 4, 5 and 6 show the proportion of people commemorated with each type of monument belonging to each social status' (Tupman 2005: 127). That's it. Incidentally, the figures show numbers of people, not proportions. The numbers are converted to proportions (percentages) for the purposes of interpretation in Tupman (2005: 130). This is in the form of a list which would have been better presented, and more comprehensible and interpretable, as a table.

If we can emphasise a point made earlier, in singling out papers for critical comment, sometimes in detail, we have no intention of impugning the archaeological scholarship of the authors involved. The intention is to use genuine examples to illustrate general points about the quality of graphical representation in a seriously non-negligible number of papers that use graphs. What is disturbing – and we are not claiming this is universal – is the disjunction often to be seen between the quality of the archaeology and the quality of the statistical graphics that accompany any data analysis that takes place. Graphs of the kind illustrated in the two 'chartjunk' sections are all too readily produced in Excel, and unfortunately often are. If you need a graphic, and they are often superfluous, they deserve the same care and attention as the rest of the text and should be properly integrated into it. This is not always the case.

Chapter 11 Ternary diagrams

11.1 Introduction

An example of a ternary diagram was shown in Figure 9.6; here further examples and details of their construction are provided. Ternary diagrams go under a variety of names – triangular, tripolar or phase diagrams among them.
They are used when p = 3 variables are available that are scaled to sum to 100% (or 1). The data may be categorical or continuous. This scaling (or compositional constraint) means that the values of any two variables determine the value of the third, which implies that the data are exactly two-dimensional. The data can be plotted in an equilateral triangle where a point represents the relative proportions of the three variables as measured by the (perpendicular) distances to the three axes. In effect, and as previously noted, a point in a ternary diagram encapsulates the same information as a pie chart for three categories, with the considerable advantage that the distances between points are visualized.

Howarth's (1996) history of the ternary diagram traces its origins to the mid-18th century. Their use in archaeology is widespread, often occurring in the more 'specialised' literature. Howarth (1996: 338) notes that they are widely used in geology, physical chemistry and metallurgy, and they are to be found in archaeological publications that interact with these disciplines. Recent papers in this vein that use ternary diagrams include Hughes et al.'s (2010) study of flint compositions, Panich's (2016) of obsidian, and Chirikure et al.'s (2010) of slags. For data sets with more than 3 variables composite variables (i.e. linear combinations of those available) are sometimes defined to enable their representation in a ternary diagram (e.g., Hein et al. 2007: 149; Plaza and Martinón-Torres 2015: 93). Another use is in phase diagrams where a plot is 'zoned' in some way to identify, for example, different classes of material or manufacturing technologies corresponding to different regions of the diagram. Data are then plotted to characterize the cases being studied (e.g., Thornton and Rehren 2009; Radivojević et al. 2010).

Archaeozoological applications, to data of the kind used for Figure 9.6, are also common. An important body of work making extensive use of ternary diagrams for this kind of purpose is attributable to King (e.g., King 1984, 1999a, 1999b, 2005); Levitan's (1993) report on the vertebrate remains from excavation of the Uley shrines (Woodward and Leach 1995) also makes extensive and effective use of the method. Stiner (1990, 1994) popularised the use of ternary diagrams for studying mortality patterns (i.e. old, prime, juvenile) in archaeofauna. Steele and Weaver (2002), for similar data, proposed what they called a 'modified triangular graph' allowing an indication of sampling uncertainty to be incorporated into the diagram in the form of density contours (confidence ellipsoids) surrounding individual points. Weaver et al. (2011) further develop their ideas. Stiner introduced the idea of demarcation of 'zones' in ternary diagrams to identify specific mortality patterns; Discamps and Costamagno (2015) provide a critique of the methodology and propose an alternative species-specific approach to zoning.[1]

[1] The last three papers are quite mathematical and go beyond what is discussed in any detail in the text.

This by no means exhausts the uses to which ternary diagrams can be, or have been, put in the service of archaeological data analysis. Meighan (1959) is the earliest archaeological application of ternary diagrams we have seen; he used the relative proportions of three types of pottery, displayed in a ternary diagram, to seriate pottery assemblages. The illustrative example in the next section can be viewed in this light.

Implementation is possible in a variety of packages in R, the triax.plot function from the plotrix package and plot.acomp from the compositions package among them. There are others, but only the ggtern package, dedicated to the production of ternary diagrams, will be illustrated.
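By way of orientation only, a minimal plotrix sketch might look like the following. The compositions are made up, and we assume that triax.plot expects a matrix with rows of three proportions summing to 1 and that it picks up column names as axis labels; see ?triax.plot for the exact conventions.

library(plotrix)
# Made-up three-part compositions, one row per case, rows summing to 1
comp <- matrix(c(0.2, 0.3, 0.5,
                 0.1, 0.6, 0.3,
                 0.4, 0.4, 0.2), ncol = 3, byrow = TRUE)
colnames(comp) <- c("Cores", "Blanks", "Tools")
triax.plot(comp, main = "A minimal ternary diagram")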
11.2 Examples

11.2.1 Palaeolithic assemblages

For an initial illustration, data from Table 9.12 in Doran and Hodson (1975) are used (Table 11.1). The left-hand table shows the counts of cores, blanks and stone tools found in different levels at the palaeolithic site of Ksar Akil (Lebanon); the levels are numbered from earliest (25) to latest (12). The right-hand table shows the counts converted to percentages for each level and can be represented in a ternary diagram.[2]

[2] Note that the row sums are not all exactly 100% because numbers have been rounded to integers for presentational purposes; more exact values were used for the analysis. If you see a table like this in the literature where row sums are all exactly 100% the chances are that the table has been 'doctored' a little to achieve this.

Table 11.1: To the left, counts of cores, blanks and tools from middle levels of the palaeolithic site at Ksar Akil (Lebanon). This is Table 9.12 from Doran and Hodson (1975). The right-hand table converts the data to row percentages, rounded to integers.

Levels  Cores  Blanks  Tools    CoresP  BlanksP  ToolsP
25        21     12      70       20      12       68
24        36     52     115       18      26       57
23       126    650     549       10      49       41
22       159   2342    1633        4      57       40
21        75    487     511        7      45       48
20       176   1090     912        8      50       42
19       132    713     578        9      50       41
18        46    374     266        7      55       39
17       550   6182    1541        7      75       19
16        76    846     349        6      67       27
15        17    182      51        7      73       20
14         4     21      14       10      54       36
13        29    228     130        7      59       34
12       133   2227     729        4      72       24

Read the data (i.e. the headers and what follows) into a data frame Ksar – you will need to select a subset of the columns for analysis. Either Ksar[, 1:4] or Ksar[, c(1, 5:7)] will do it. We'll use the former; ggtern will transform these to the percentages needed. The ternary diagram for these data, in Figure 11.1a, is followed by the code used.

Figure 11.1: Ternary diagrams for the palaeolithic assemblage data of Table 11.1.

ggtern(data = Ksar[, 1:4], aes(Cores, Blanks, Tools, label = Levels)) +
  geom_point(size = 3) + theme_showarrows()

The slightly curvilinear pattern of points suggests the possibility of seriation, presumably related to the sequencing of the levels.[3] To investigate this replace geom_point(size = 3) with geom_text(), which adds the text contained in Levels. The result, Figure 11.1b, confirms the interpretation as a seriation, though it is not perfect. Later levels are to the top of the diagram and earlier ones to the lower-right. The (partial) seriation could, of course, have been inferred from the percentages displayed in Table 11.1.

[3] label = Levels is not actually needed for Figure 11.1a but does no harm. It's included because it is needed for Figure 11.1b.
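Assembled, the labelled version used for Figure 11.1b would read as follows (a one-line variation on the code above).

ggtern(data = Ksar[, 1:4], aes(Cores, Blanks, Tools, label = Levels)) +
  geom_text() + theme_showarrows()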
11.2.2 Romano-British civilian faunal assemblages

For more detailed illustration, data on the faunal assemblages from Romano-British civilian sites, extracted from Table 2 of King (1984) and Table E of King (1999a), are used. Pre-Roman assemblages from the first of these tables have been omitted. The first few lines of the data, as entered into R, are shown in Table 11.2.

Table 11.2: Faunal remains from Romano-British civilian sites; C = cattle, P = pig, SG = sheep/goat. (Source: King 1984, King 1999a).

Type        C     P     SG
Settlement  37    10.4  52.6
Vicus       37.4  13.7  48.9
Settlement  74.4   4.7  20.9
Settlement  62    10.8  27.2
Villa       48.7  29.2  22
Villa       56.2  12.9  30.9
Villa       59.5  10.1  30.4
Villa       71.3   8.5  20.2
...

The codes for site type in King (1984), sett, urb, vic and vill, have been renamed here as Settlement, Urban, Vicus and Villa; King (1984: 14) provides a fuller description of the site types embraced by these terms (sett = unromanised (rural) settlements; urb = larger town/civitas capital, colonia etc.; vic = civilian vicus, Romanised settlement, smaller town/civitas capital; vill = villas). It is convenient here, for presentational purposes, to use our 'shorthand' code, but readers should bear in mind the fuller definitions. The additional data extracted from the 1999 paper are those with identical site type codes to those used in the 1984 paper.

There are 254 assemblages of the percentage presence of cattle, pig and sheep/goat bones, based on the total number of identified pieces of bone for those three species groups. In R the data were read into a data frame King with column names Type, C, P and SG for the site type and cattle, pig and sheep/goat percentages. For an initial look at the data the following code was used. The upper part of Figure 11.2 is the result.

library(ggplot2); library(ggtern)
ggtern(data = King, aes(C, P, SG, colour = Type, shape = Type)) +
  geom_point() + theme_showarrows()

Figure 11.2: The upper plot is the 'default' ternary diagram for Romano-British civilian faunal assemblages using ggtern; the lower plot is a 'modified' version. C = Cattle, P = Pig, SG = Sheep/Goat.

The code is the 'default' in the sense that it is almost the minimum code needed to get a ternary diagram using ggtern. The theme_showarrows() command isn't essential, but we find that the arrows produced, aligned with the axes, help in interpreting plots. We don't much like the defaults; an alternative rendition with modified presentational details is shown in the lower part of Figure 11.2. The code used to produce it was as follows.

fig11.2lower <- function() {
library(ggplot2); library(grid); library(ggtern)
King$Type <- factor(King$Type, labels = c("Rural settlement",
  "Urban site", "Vicus", "Villa"))
ggtern(data = King, aes(C, P, SG, colour = Type, shape = Type,
  fill = Type)) +
  geom_point(size = 3) + theme_showarrows() +
  scale_shape_manual(values = c(21, 22, 24, 25)) +
  scale_colour_manual(values = rep("black", 4)) +
  scale_fill_manual(values = c("skyblue", "red", "yellow", "white")) +
  theme_legend_position("tl") +
  theme(legend.title = element_text(size = 16),
    legend.key.height = unit(1, "cm"), legend.key.width = unit(1, "cm"))
}
fig11.2lower()

Here King$Type renames the levels of Type to produce a more informative legend. The different scale commands, in sequence, select open plotting symbols; specify their border colour as black; and specify different colours for their fill. The legend is placed at the top-left using theme_legend_position("tl").

King's papers can be consulted for an interpretation of the data; these also use assemblages for military sites. A brief summary of some of King's (1999a: 178–180) conclusions is that 'sites in Britain show that the high cattle/high pig pattern correlates with the apparent "Romanized" nature of the sites.
There is a gradient towards higher average cattle and pig percentages that goes in the sequence; rural settlements, villas, secondary urban centres, urban sites . . .' with military sites and legionary sites completing the sequence. This can be envisaged by imagining straight lines drawn from the vertex for cattle (C) through the scatters of points for each site type. The imagined line for (rural) settlements, relative to the sheep/goat axis, would make a sharper angle with the horizontal axis than that for urban sites. This reflects differences in the relative proportions of the different species that King interprets in terms of 'Romanization'; rural settlements tend to have relatively more sheep, with the presence of cattle and pig being correspondingly less.

With several different types to display and lots of data such differences can be difficult to discern. One approach to alleviating this is to compare types in a pairwise fashion, which could be done by modifying point sizes. This is illustrated in Figure 11.3, which emphasises the contrast between the patterns for rural settlements and urban sites.

Figure 11.3: A 'modified' ternary diagram for Romano-British civilian faunal assemblages emphasising the distinction between two types of site. C = Cattle, P = Pig, SG = Sheep/Goat.

This was effected by modifying the geom_point command in the previous chunk of code and using scale_size_manual to set the point sizes as desired.

# Link point sizes to site type
geom_point(aes(size = Type)) +
# Not especially nice so manually set point sizes
scale_size_manual(values = c(3, 3, .5, .5)) +

Another approach would be to display separate ternary diagrams for each site type. Figure 18 in King (1999a) overlays summary ternary diagrams on a map of the Roman Empire for Roman Imperial assemblages. Each diagram includes a 'blob' summarizing the compositional variation within a region. In ggtern this kind of display can be achieved by overlaying a ternary diagram with a contoured two-dimensional kernel density estimate for the data. This is illustrated for the British civilian faunal assemblages in Figure 11.4, where site distinctions have been ignored, using the following code.

fig11.4 <- function() {
library(ggplot2); library(grid); library(ggtern)
ggtern(data = King, aes(C, P, SG)) + theme_showarrows() +
  stat_density_tern(geom = "polygon", n = 200, aes(fill = ..level..)) +
  geom_point() +
  scale_fill_gradient(low = "blue", high = "red") +
  guides(color = "none", fill = "none")
}
fig11.4()

Figure 11.4: A ternary diagram for Romano-British civilian faunal assemblages overlaid with a contoured two-dimensional kernel density estimate.

Figure 11.5: Ternary diagrams for Romano-British civilian faunal assemblages by site type overlaid with contoured two-dimensional kernel density estimates.

To get the plot for rural settlements in Figure 11.5, for example, replace King in the first line of code with King[King$Type == "Settlement", ]; the title is added using ggtitle("Rural settlements") + theme(plot.title = element_text(size = 20, face = "bold")). Replace "Settlement" with "Urban", "Villa" and "Vicus" for the other plots, adjusting titles accordingly; an assembled sketch for one panel follows.
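Putting those pieces together for a single panel might look like the following sketch; fig11.5rural is our name for the function, which simply combines the fig11.4 code with the subsetting and title just described.

fig11.5rural <- function() {
library(ggplot2); library(grid); library(ggtern)
# The rural settlement panel; change the subset and title for the others
ggtern(data = King[King$Type == "Settlement", ], aes(C, P, SG)) +
  theme_showarrows() +
  stat_density_tern(geom = "polygon", n = 200, aes(fill = ..level..)) +
  geom_point() +
  scale_fill_gradient(low = "blue", high = "red") +
  guides(color = "none", fill = "none") +
  ggtitle("Rural settlements") +
  theme(plot.title = element_text(size = 20, face = "bold"))
}
fig11.5rural()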
11.2.3 Staffordshire Hoard gold compositions

The Staffordshire Anglo-Saxon Hoard is a treasure find from 2009, consisting of about 7kg of metal pieces, mostly of gold and, to a lesser extent, silver. The project design established for the study of the Hoard (Cool, 2013: 21–22) provides the data given in Table 11.3 and the background; some of the objects analysed are shown in Figure 11.6. The data are measurements of the surface composition, with respect to gold (Au), silver (Ag) and copper (Cu), of 22 metal pieces from six objects found in the Staffordshire Anglo-Saxon Hoard, normalized to sum to 100%. A ternary diagram is shown in Figure 11.7.

Figure 11.6: A selection of objects from the Staffordshire Hoard analysed for Table 11.3. (Source: Photographer Guy Evans. Copyright Barbican Research Associates).

Table 11.3: Gold/silver/copper compositions for the surfaces of artefacts from the Staffordshire Hoard, normalized to sum to 100%.

Artefact  Cu   Ag    Au      Artefact  Cu   Ag    Au
a         3.3  10.9  85.5    c         5.3  14.0  79.6
a         2.3  11.5  85.9    c         6.1  16.7  71.8
a         2.6  11.0  85.9    c         7.8  26.0  65.6
a         3.0  12.0  84.4    d         2.2  18.4  79.0
b         2.1  18.8  78.8    d         1.7  15.3  82.6
b         2.6  20.8  76.1    e         1.1  24.4  74.2
b         2.0  17.7  79.7    e         5.4  25.8  68.5
c         5.3  29.0  65.5    f         1.6  15.2  83.0
c         3.7  26.9  57.1    f         1.9  17.7  78.2
c         3.9  27.9  60.7    f         2.4  18.1  78.7
c         3.7  28.1  60.7    f         2.2  18.1  78.2

The analysis was undertaken to see if pieces that could be associated with the same object were compositionally similar with respect to their surface and distinct from other objects. The plot shows that objects are not as compositionally consistent and distinct from other objects as one might wish, though it has subsequently transpired that surface measurements do not adequately reflect the body composition, as it had been hoped they might (Blakelock, 2016).

The code used is similar to that already illustrated, using a data file staffgold, with one change: only a subset of the full triangle is displayed, the better to see any patterns in the data. This is achieved, in this case, by adding the code

tern_limits(T = .4, L = .4, R = 1)

which determines the top, left and right apices of the triangle. This was previously used, without comment, to produce Figure 9.6, where the plotting region was delimited by values of 0.5, 0.8 and 0.45.

fig11.7 <- function() {
library(ggplot2); library(grid); library(ggtern)
p <- ggtern(data = staffgold, aes(Cu, Ag, Au, color = Artefact,
  shape = Artefact, fill = Artefact)) +
  geom_point(size = 3) + theme_showarrows() +
  scale_shape_manual(values = c(21, 22, 21, 23, 24, 25)) +
  scale_colour_manual(values = rep("black", 6)) +
  scale_fill_manual(values = c(NA, "yellow", "skyblue", "darkorange",
    "purple", NA)) +
  theme_legend_position("tl") +
  theme(legend.title = element_text(size = 16),
    legend.key.height = unit(1, "cm"), legend.key.width = unit(1, "cm")) +
  tern_limits(T = .4, L = .4, R = 1)
p
}
fig11.7()

Figure 11.7: A ternary diagram for the Au/Ag/Cu surface compositions of 22 pieces from six metal artefacts from the Staffordshire Hoard.
Chapter 12 Correspondence analysis

12.1 Introduction

Several previous chapters have concentrated on the analysis of tabulated data, either in the form of a set of counts for two or more categories of a single categorical variable, or two-way tables obtained by crosstabulation of two categorical variables. Pie charts, sometimes used to inspect such data in archaeological applications, are usually not 'fit for purpose' (Chapter 9). For small tables of data, if you must have a graphic, barplots are generally better (Chapter 10). It will be clear, from discussion in these chapters, that we think simple tabular presentation with an intelligent commentary is often preferable, though we recognise that some archaeologists prefer graphical presentation even when their subsequent discussion of a graph is based on the numbers represented by it.

It's a different story if you have a large table to deal with, and this is where correspondence analysis (CA) comes in. Any informative pattern in the data may be both non-obvious and subtle; CA is one way to get a handle on this. Put very simply, the rows of a table of data are reduced to points that can be plotted on a map that allows the similarity between rows, and any pattern in the data, to be assessed. The method is commonly used for the purposes of seriation, for example. The same sort of thing can be done with the columns of the table, generating a second set of points. Joint inspection of the two sets of points/maps can be informative about the reasons for any pattern in the data. This is discussed in more detail in the next section.

Section 12.3 introduces data on the regional and temporal distribution of brooch use in Late Iron Age and Roman Britain, used for detailed illustration in Section 12.4, which discusses how a basic CA can be carried out in R and shows some options for displaying results. It's possible to use CA for pattern recognition; in archaeological applications it is widely used for seriation, which generates what can be thought of as a 'specialised' form of pattern. Sections 12.5 to 12.7 provide some additional examples that illustrate chronological and spatial seriation. The different sections also illustrate different presentational options available with CA.

The method is a valuable tool for the graphical exploration of data. In the examples to follow sufficient detail has been provided about the archaeological context, including the substantive conclusions that CA can give rise to, in the hope that readers not already convinced by the merits of CA might see that it is a method worth knowing about.

12.2 Correspondence analysis – basics

Apologies to anyone reading this who is familiar with CA and uses it; they will recognise that some corners are being cut in the account that follows. The more theoretical treatments of CA can be borderline unreadable, even for someone with a training in statistics. But the basic idea is very simple. Greenacre (2007) is an accessible introduction; Baxter (1994/2015a) provides a fairly detailed account aimed at archaeologists, including a review of the history of the use of CA in archaeology to that date; Chapter 13 of Shennan (1997) is also a detailed introductory account aimed at archaeologists; Baxter (2003) updates the story and provides a more concise account of the methodology; Chapter 9 of Baxter (2015b) provides a number of illustrative examples using R. Some of this earlier work is repeated here, but we go into more detail about implementation in R.
Baxter and Cool (2010b) was intended as an introduction to CA in R for archaeologists with little or no statistical training, who nevertheless appreciated what CA might have to offer in their own work; it is now outdated. Alberti (2013a) is in a similar spirit while taking a different approach to exposition; see also Alberti (2013b, 2015). At a more advanced methodological level are Peeples and Schachner (2012) and Lockyear (2013), the former providing access to R code.

The story begins with a table having R rows and C columns, consisting of raw counts based on the crosstabulation of two categorical variables. The values of R and C are sufficiently large that it may not be obvious what the story, if any, is that the data have to tell you, or it may not lend itself to immediate narrative discourse. So to simplify things, and allow such a discourse, the data are reduced to two pictures, one for rows and one for columns. The pictures are in the form of scatterplots (Chapter 8).

Roughly, and concentrating on the rows, each row is converted to row percentages – call these row profiles. The 'distance' between each pair of row profiles is measured; this generates a set of x- and y-coordinates for the R rows which can be plotted on an ordinary scatterplot, so that's the first picture. The second picture is obtained by treating the columns in a similar way. The mathematics is quite 'messy' but the idea is simple; you are just measuring how similar pairs of row profiles are, and computational aspects have been taken care of by the people who write software for this kind of thing. Call the points row markers; rows that plot closely to each other should have similar profiles and if they are far from each other they should have dissimilar profiles.
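For concreteness, a minimal sketch of what a row profile is, using made-up counts.

# Row profiles are just row percentages (made-up counts)
tab <- matrix(c(20, 10, 5,
                 4, 16, 10), nrow = 2, byrow = TRUE)
prop.table(tab, margin = 1) * 100   # each row now sums to 100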
The fundamental issue is that it is meaningful to talk about distances between points on the individual maps but does not make sense to talk about distances between points from different maps. Although this latter interpretation of ‘distance’ is not legitimate it is the case, the point of CA, that the relative location of points on each map is informative. Imagine that maps are divided into four quadrants, centred on the (0, 0) point. Rows that plot in the north-east quadrant, for example, will tend to have larger values on those column categories that plot in the same region, smaller values on the column categories that plot opposite in the south-west quadrant, and so on. As noted at the outset of the section this is a simplified account. Interpretation in practice is often more complex and best illustrated by examples to which we shall turn to after introducing one of the data sets to be used in the next section. 12.3 Some data Table 12.1, has previously been published in the supplement to Cool and Baxter (2016a) and in Baxter and Cool (2016). The first of these papers is a detailed archaeological exploration of temporal and regional patterns of brooch use in Late Iron Age and Roman Britain; the second paper is an extended, and generalised, account of some of the statistical methodology developed for analysis in the first paper. The data were extracted, by the second author, from Mackreth’s (2011) corpus Brooches in Late Iron Age and Roman Britain. Figure 12.1 shows a selection of the brooches studied. 132 Table 12.1: Regional counts of late Iron Age and Roman brooches by TPQ and period. The regions are EA = East Anglia, N = North, S = South, EM = East Midlands, WM = West Midlands, SW = South-West. Periods are IA = Iron Age, AT = Augusto-Tiberian, CN = Claudio-Neronian, FL = Flavian, TH = Trajanic-Hadrianic, AS = AntonineSeveran, L = Late. TPQ -70 -50 -30 -20 -5 1 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 95 100 105 110 115 120 125 135 140 145 150 155 160 170 175 180 185 190 200 210 215 275 290 310 340 350 370 400 EA 25 25 13 0 6 62 2 173 9 38 3 32 247 31 82 190 297 68 55 29 89 130 161 70 4 10 52 12 10 36 8 21 23 34 10 29 13 22 7 16 68 44 15 19 23 3 5 3 8 5 4 N 0 1 2 1 0 11 0 14 0 4 0 0 26 2 8 8 24 4 10 11 58 70 89 73 5 1 37 25 6 10 7 17 15 50 2 21 18 3 7 3 16 25 19 33 2 7 3 3 5 0 0 S 88 107 49 7 24 141 19 532 26 277 13 56 561 72 203 106 182 68 74 27 106 57 73 26 5 12 58 18 7 38 6 19 11 37 8 15 31 12 15 12 40 28 9 33 3 7 6 3 12 3 7 EM 20 29 11 1 3 64 5 157 7 51 0 20 176 41 54 77 119 22 45 22 90 79 142 18 5 2 37 11 17 15 2 7 4 18 7 10 5 10 6 0 16 14 3 8 2 2 1 1 2 0 2 133 WM 3 3 1 0 0 30 0 37 3 15 0 1 50 5 20 28 34 19 64 10 53 35 63 10 13 1 40 13 9 11 3 7 0 11 1 6 2 0 4 0 11 3 0 1 2 1 1 3 5 2 2 SW 24 26 25 20 1 95 82 119 3 81 1 14 319 8 339 142 98 48 146 17 140 94 145 22 4 14 49 32 17 53 6 5 15 39 6 14 14 13 6 6 18 16 5 11 11 0 2 1 6 5 2 Period IA IA IA IA IA AT AT AT AT AT AT AT AT CN CN CN CN CN CN FL FL FL FL FL TH TH TH TH TH TH TH TH AS AS AS AS AS AS AS AS AS AS AS AS L L L L L L L Figure 12.1: A selection of Romano-British bow brooches of first to fourth century date. (Source: authors). 134 A brooch can be assigned to a type and the region in which it was found. Many of the brooches were not recovered from dated contexts, but the type to which they belong can be assigned a terminus post quem (TPQ) based on the earliest date for brooches of the type found from dated contexts. Several types may be associated with the same TPQ and have been grouped together in Table 12.1. 
That is, the entries refer to counts of brooches from a region whose type has a particular TPQ, with type differences for that TPQ being ignored. The period labels are conventional, named, for the most part, after the more memorable emperors or their families who dominated the period.

For present purposes the data will be used to illustrate the mechanics of carrying out a correspondence analysis in R and some of the presentational possibilities. Some of this was undertaken in Cool and Baxter (2016a) and the reader will be alerted to other analyses undertaken there, as appropriate.

12.4 R basics

12.4.1 'Default' analyses

Enter the data of Table 12.1 into R with the name brooches. Figure 12.2 shows the (almost) default output from two implementations of CA in R, using columns 2 to 7 only (i.e. just the counts).

Figure 12.2: Basic correspondence analyses for the Romano-British brooch data set of Table 12.1; (a) using the corresp function from the MASS package, (b) using the ca function from the ca package.

Figure 12.2a was obtained using the corresp function from the MASS package (Venables and Ripley 2002); Figure 12.2b was obtained using the ca function from the ca package (Nenadic and Greenacre 2007). The code follows.

library(MASS)
biplot(corresp(brooches[, 2:7], nf = 2))

library(ca)
plot(ca(brooches[, 2:7]))

The corresp and ca functions carry out the CA, and biplot and plot produce the graphs from the output generated. The default in the corresp function is to extract just one axis, so the argument nf = 2 is necessary to extract two axes for plotting purposes.

Both plots are informative in different ways, but can clearly be improved on for both presentational and interpretive purposes. A more detailed interpretation will follow in Section 12.4.3; for the moment note that the south-west (SW) region sits apart from the other regions and Figure 12.2a shows that this is attributable to three 'outliers', rows 4, 7, and 15, which correspond to TPQs of 20 BC, 5 AD and 45 AD. Inspection of Table 12.1 shows that the SW contributed an unusually high proportion of the brooches for each of these TPQs. Detailed analysis of the brooches associated with these TPQs, reported in Cool and Baxter (2016a), revealed that this was being caused by the number of Durotrigian brooches present in these date-assemblages. For reasons discussed in that paper these TPQs were removed from subsequent analysis and this will be emulated in this section.

There is a general lesson here. The 'outliers' are not ignored; they merit interpretation in their own right. But they can also obscure other informative structure in the data, so it is sensible to remove them and see what happens. The same applies when an analysis reveals clear structure in a data set, such as clusters of cases. It can be useful to 'break down' the data set into its 'component parts' and examine these separately to see if more subtle structure is also present. Cool and Baxter (1999) is a more extended treatment of this kind of thing.

In Figure 12.2b the percentage figures associated with the two axes/dimensions are the inertias, which sum to about 72%.
Remember that this is a measure of the 'quality' of approximation to the 'reality' of the data set. It's generally considered good practice to report the values, and 72% can be regarded as respectable. It's not perfect, however, which suggests that looking at axes/dimensions other than the first two may be worthwhile and this will be illustrated shortly. To read the plots as maps it's important that the axes are equally scaled. In technical terms the aspect ratio should be 1.

12.4.2 Customising plots

In this section enhanced versions of the plots in Figure 12.2 are presented using both base R and ggplot2. We prefer the former, which seems to us to be simpler, but provide code for the latter as well. The general idea, which applies to either approach, is to undertake a CA and extract, separately, the information needed to plot the maps for both rows and columns. These can then be superimposed, as is done here, or plotted as adjacent maps – this approach allows much greater control over presentational matters than do the defaults. The use of the third dimension in a CA is illustrated and is informative.

Initially the graphs, with accompanying code and commentary, are presented. The archaeological interpretation of the plots is discussed in Section 12.4.3. In all the analyses the three outliers associated with the SW region are omitted and this changes the picture a bit as regards the positioning of the SW in relation to other regions. To reiterate, they are not being ignored; the reasons for their 'outlyingness' can be explained on archaeological grounds and a more 'balanced' picture of the overall pattern is obtained if they are removed. Figure 12.3 shows the base R plot; we've chosen to do this in black-and-white but it's easy to have coloured symbols, labels and arrows if you wish.

Figure 12.3: Enhanced correspondence analyses for the Romano-British brooch data set using the ca function and base R; (a) the CA plot based on the first two axes, (b) a plot based on the first and third axes.

The code used can be broken down into different chunks. The first chunk is used for both the base R and ggplot2 analyses. The ca function is used; a data frame, df, is created that omits the outliers previously identified; the regional counts are extracted in regions; and a correspondence analysis is carried out with the results held in CA. The final two lines extract the row and column coordinates used in subsequent plots. This could be done in other ways that use less code, but the approach used here is intended to be more transparent.

library(ca)
df <- brooches[-c(4, 7, 15), ]   # omit outliers
regions <- df[, -c(1, 8)]        # extract regional data
CA <- ca(regions)                # carry out CA
x1 <- CA$rowcoord[, 1]; x2 <- CA$rowcoord[, 2]; x3 <- CA$rowcoord[, 3]
y1 <- CA$colcoord[, 1]; y2 <- CA$colcoord[, 2]; y3 <- CA$colcoord[, 3]
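As an aside, the inertia percentages quoted in the axis labels of the figures below can be recovered from this object; a minimal sketch, relying on the sv component of a ca object, which holds the singular values.

# Principal inertias are the squared singular values; as percentages
# of the total they give the figures used in the axis labels
pct <- 100 * CA$sv^2 / sum(CA$sv^2)
round(pct, 1)   # the first three should be roughly 53.8, 21.1, 15.5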
The base R code used now follows. That for the plot based on the first two axes is given.

xlab <- df$Period
ylab <- names(regions)
PCH <- rep(c(16, 3, 7, 8, 15, 2, 17), c(4, 7, 5, 5, 8, 12, 7))
plot(x1, x2, asp = 1, pch = PCH, xlab = "Axis 1 (inertia = 53.8%)",
  ylab = "Axis 2 (inertia = 21.1%)", cex.axis = 1.2, cex.lab = 1.3,
  cex = 1.2)
text(y1, y2, ylab, cex = 1.2)
arrows(0, 0, .85*y1, .85*y2, length = .15, angle = 10, lwd = 2)
legend("topleft", c("IA", "AT", "CN", "FL", "TH", "AS", "L"),
  pch = c(16, 3, 7, 8, 15, 2, 17), title = "Period", bty = "n", cex = 1.2)

The chunks of code above could usefully be combined in a user-defined function. The first three lines in the second chunk set up the labels to be used in the subsequent plot; this will be discussed further after the ggplot2 code has been presented. In the plot function the asp = 1 argument ensures that the aspect ratio is 1; everything after that is presentational.

If the axis labels include the inertias, which is recommended, these need to be obtained. When the initial correspondence analysis is undertaken you can type CA. It will generate a lot of output you may not wish to see, but the first part gives the principal inertias, which is what's needed. Alternatively, plot(CA) will produce a plot labelled with the inertias for the first two axes.

The arrows function is optional; not all authors display analyses in this way, though we find it can make interpretation easier. If it's omitted plots somewhat similar to those of Figure 12.2 will be obtained, and this kind of presentation is often seen in the literature.

Figure 12.4: Enhanced correspondence analyses for the Romano-British brooch data set using the ca function and ggplot2; (a) the CA plot based on the first two axes, (b) a plot based on the first and third axes.

The ggplot2 code, the output from which can be seen in Figure 12.4, can be broken into two chunks in addition to that needed to get the plotting coordinates in the first place. The three chunks involved should be embedded, sequentially, in a user-defined function. The first chunk below is a bit tedious, but necessary to construct the data frames that are subsequently needed. It is not always immediately obvious how to do this.

Period <- data.frame(df$Period)
dfscores <- data.frame(cbind(Period, x1, x2, x3))
names(dfscores) <- c("Period", "X1", "X2", "X3")
dfscores$Period <- factor(dfscores$Period,
  levels = c("IA", "AT", "CN", "FL", "TH", "AS", "L"))
colnames <- data.frame(names(regions))
colscores <- data.frame(cbind(y1, y2, y3))
dfcols <- cbind(colnames, colscores)
names(dfcols) <- c("Region", "Y1", "Y2", "Y3")
dfcols$angle <- with(dfcols, (180/pi) * atan(Y2/Y1))

Essentially the extracted coordinates from the original CA need to be combined with the labelling information to be used in two separate data frames, one corresponding to rows and the other to columns. Some of the coding is needed to ensure that what's being created is actually a data frame; the columns need to be named in a suitable way for subsequent use. The factor command on the fourth line is needed if you want the legend to reflect the temporal order of the periods, otherwise they are arranged alphabetically. The last line is needed if you want to have appropriately angled column markers; if you don't it can be dispensed with.

The code to produce the CA plot requires the three packages noted to be available.
It's stored in an object p which the final line plots.

library(ggplot2); library(grid); library(scales)
p <- ggplot(dfscores, aes(x = X1, y = X2)) +
  xlab("Axis 1 (inertia = 53.8%)") + ylab("Axis 2 (inertia = 21.1%)") +
  coord_fixed() +
  geom_point(data = dfscores, aes(x = X1, y = X2, shape = Period,
    colour = Period), size = 3) +
  geom_segment(data = dfcols, aes(x = 0, y = 0, xend = Y1*.8,
    yend = Y2*.8), arrow = arrow(length = unit(1/2, "picas")),
    color = muted("red")) +
  geom_text(data = dfcols, aes(label = Region, x = Y1*1, y = Y2*1,
    angle = angle), color = "darkred", size = 5) +
  scale_colour_manual(values = c("darkred", "brown", "green3", "black",
    "blue", "red", "magenta", "red")) +
  scale_shape_manual(values = c(16, 3, 7, 8, 15, 2, 17, 16)) +
  xlim(-3, 3) +
  theme(axis.text = element_text(size = 12),
    axis.title = element_text(size = 14)) +
  theme(panel.grid.minor = element_blank()) +
  theme(legend.position = "top", legend.text = element_text(size = 12),
    legend.title = element_text(size = 14),
    legend.key.height = unit(.8, "cm"), legend.key.width = unit(.8, "cm"))
p

The full code for Figure 12.4a is given; much of it is presentational. For Figure 12.4b, in the second chunk of code, change X2 to X3 and Y2 to Y3, where they occur, and change ylab("Axis 2 (inertia = 21.1%)") to ylab("Axis 3 (inertia = 15.5%)"). The coord_fixed() function produces an aspect ratio of 1; the xlim(-3,3) command is designed to produce a square plot rather than the rectangular one otherwise obtained. If you're happy to accept default colours and horizontally aligned column markers without the arrows, which is perfectly acceptable, the code could be simplified quite a lot, but we were curious to see what was needed to obtain the kind of plots illustrated in Figure 12.4.

The scale_shape_manual command is needed. If there are more than six categories to be represented, and there are seven here, then without it a stern message to the effect that it 'becomes difficult to discriminate with more than 6 categories . . . consider specifying shapes manually if you must have them' appears. This rather overstates the problem, but a judicious choice of shapes/symbols that takes into account the configuration of points on the plot may be needed. It's possible to use the same symbol twice with suitably contrasting colours.

In general it is simpler in ggplot2 to specify colours and shapes via the colour and shape arguments in the aesthetics (aes). In base R it's necessary to create variables that govern the colour and symbol shape and this can be tedious. It's simplified in the present case, when defining the plotting symbols in PCH, which exploits the rep function, because the data are grouped by period.

12.4.3 Interpretation

We've already noted that, at least for the approach adopted here, the base R plot is somewhat simpler to implement and we often prefer to do CA in this way, even when a presentation in colour is feasible. We don't think, in this instance, that ggplot2 confers any particular advantages from either a presentational or interpretational perspective; the interpretation is the same using either approach.

The plot based on the first two axes shows a clear south-north opposition, with other regions plotting between them. It is possible to read a kind of geographical progression with time in the plots, starting in the south, then running through the diagonal East Anglian/Midlands/South-West band, and then arriving in the north.
12.4.3 Interpretation

We've already noted that, at least for the approach adopted here, the base R plot is somewhat simpler to implement, and we often prefer to do CA in this way even when a presentation in colour is feasible. We don't think, in this instance, that ggplot2 confers any particular advantages from either a presentational or an interpretational perspective; the interpretation is the same using either approach.

The plot based on the first two axes shows a clear south-north opposition, with the other regions plotting between them. It is possible to read a kind of geographical progression with time in the plots: starting in the south, then running through the diagonal East Anglian/Midlands/South-West band, and finally arriving in the north. If the relationship between the column and row markers is inspected it is easy to see that the northern end is dominated by the later periods and the southern end by the earlier periods, though the separation is not perfect. The CA was merely the first stage of analysis in Cool and Baxter (2016a), and the other methodologies used in that paper show the regional divisions more clearly.

The plot of the first against the third axis is interesting. The third axis accounts for 15.5% of the inertia, which is not negligible; the first three axes account for about 90% of the total inertia. What's interesting is the positioning of East Anglia on the plot, which separates out, and thereby hangs a tale.

The Portable Antiquities Scheme (PAS) in Britain encourages the reporting and recording of metal-detected finds and promotes interaction between archaeologists and metal-detectorists. Coins apart, brooches are one of the most commonly reported finds. The PAS was in its infancy at the time Mackreth closed his corpus so, for the most part, metal-detected finds are poorly represented in the corpus. The exception is East Anglia, where a considerable fraction of the relevant corpus was metal-detected. This is because archaeologists in the region pioneered collaboration with metal-detectorists well before the PAS came into being.

The interesting bit is that there is a considerable, not always appreciated, bias in the kind of brooch recovered, or at least reported, by metal detecting. This is discussed at length, and demonstrated, in Cool and Baxter (2016b). Roughly, bulky and/or gaudy brooches are much more likely to be reported than slighter, less glamorous types, which are, however, common. Furthermore, there is a temporal dimension, in that the bulky/gaudy types tend to occur later than their less 'attractive' forebears. It means that there is almost certainly a serious bias in the East Anglian assemblage, related both to the types of brooch recorded and to temporal patterns in the data. The end result is that any interpretation of the East Anglian data in relation to other regions has to be undertaken with considerable caution.

Cool and Baxter (2016b), in addition to illustrating this in detail, examine more recent PAS data on brooches – the issue remains, and poses questions about how such data should be used to address archaeological questions that go beyond the purely descriptive. The critical appraisals of PAS data that we are aware of – mostly undertaken by the PAS itself or in collaboration with it – have been largely concerned with the nature of geographical bias in the locations of metal detecting, rather than bias in the nature of artefact types of a particular kind that are recovered.

12.5 Glass 'counters' from Pompeii

In Section 9.2.4, Table 9.3, data on the colours of glass 'counters' found in excavations of Insula VI.1, Pompeii, Italy, were presented. The data appear in Cool (2016), the main purpose of which was to question the common assumption that such 'counters' were usually gaming pieces, suggesting that it was equally, if not more, likely that they were used as elements of 'interior decoration'. Some of the evidence for and thinking behind this is discussed in Section 10.4.1. In the present instance the focus is on variation in the use of colour over time, regardless of function. Correspondence analysis was used in Cool (2016) to demonstrate that such variation exists, and that analysis is repeated here.
Section 9.2.4 was written in a spirit of some irony, to show that presenting the data using pie charts rather than CA, as suggested by a referee, was not the best idea anyone has ever had. More detailed information is given about the original data set here, since it illustrates the kinds of decisions that may need to be made before undertaking a CA.

In total 566 glass counters were recovered, making them the commonest small find from the excavation other than coins. The counters were found in a considerable number of colours, 'rationalised' in Cool (2016) into 20 categories. Multi-coloured counters formed 6% of the assemblage, the remainder being monochrome. Opaque colours accounted for just under 6%, counters which appeared black for just under 10%, and the remainder were monochrome translucent glass. Of those, blue/green was the commonest (19%), followed by yellow/brown, with the very dark shades accounting for 4% and the lighter shades, often variously described as amber or honey-coloured, for 13%. Other common colours (8–10% of the assemblage) were deep blue, pale green and colourless. An early decision was made to use only those colours for which there were 10 or more counters, reducing the number of colour categories used to 13.

Phases I to V in Table 9.3 correspond to the first century BC (C1BC), Augusto-Tiberian (AugT), Augusto-Neronian (AugN), the period between the AD 62 earthquake and the eruption of Vesuvius that destroyed Pompeii in AD 79 (Post 62), and the eruption level itself, for which the term 'Modern' is used as a proxy. This can be regarded as a chronological sequence with some overlapping arising from uncertainty in some of the phasing. The last two phases are secure; the Augusto-Tiberian phase overlaps with those either side of it. Omitted entirely is the earliest, third to second century BC, assemblage. This was comparatively small, atypical in several respects, and there is uncertainty about some of the dating evidence for it. The end result, Table 9.3, subjected to CA, was therefore the product of informed archaeological judgment and statistical considerations related to sample sizes.
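Decisions of this kind are easy to express in R. A minimal sketch of our own – the counts below are invented, not the Pompeii data:

counts <- c(YG = 25, Cls = 40, BG = 86, DpB = 30, MiB = 8, Other = 4)
counts[counts >= 10]   # keep categories with 10 or more counters

Logical subsetting of the named vector drops the poorly represented categories; the same idea, applied to the rows of a table of counts, gives a reduced data set of the kind used here.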
Eventually 452 counters were used in the analysis, presented in Figure 12.5.

[Figure 12.5: A correspondence analysis of glass counter colours by phase from Insula VI.1, Pompeii, Italy. Axes: axis 1 (inertia = 49.1%) and axis 2 (inertia = 29.6%).]

Presentation differs from that exploited so far in this chapter, and is designed to emphasise the fact that there is chronological variation in the use of colour. In archaeological applications CA can be used for pattern recognition; perhaps most frequently it is used for the purposes of seriation – that is, patterns are sought in the data that can be clearly interpreted in terms of a chronological sequence. Section 12.6 will illustrate this in more detail; Section 12.7 provides an example where a seriation is interpretable in spatial rather than chronological terms. Broadly speaking – and it is a demonstrable mathematical thing – if the data seriate then the row and/or column markers should display a 'horseshoe' shaped pattern; interpretation as a chronological sequence, or whatever, has to be based on criteria external to the CA itself. Often, and as in Section 12.6, the focus is on the row markers. In the present case the chronological sequencing of the columns is known.

If you 'join up the dots' corresponding to the column markers, in chronological order, and get a 'horseshoe' shaped pattern without any crossovers of lines, it's interpretable as a chronological seriation, and that's what's happening in Figure 12.5. It wasn't expected. This enables the patterning associated with the colours to be interpreted. Yellow/green counters are typical of the first century BC assemblage, moving on to colourless and green-tinged colourless ones in the following Augusto-Tiberian period, with strongly coloured translucent and multi-coloured ones more prominent moving into the longer Augusto-Neronian period. Black clearly became more important as time progressed, being typical of the post-earthquake period, whilst blue/green and opaque white characterise the modern deposits that can be taken as a proxy for the eruption level. The inertias of the first two axes sum to about 79%, which is satisfactory. The reasons for the chronological shift are discussed in Cool (2016); it suffices here to note that it exists and that explanations can be advanced for it.

The code used to produce Figure 12.5 is given below, as it introduces features not covered earlier in the text. A data frame countercolour was created from Table 9.3: its first column, countercolour[,1], contains the shortened names of the colours, used for labelling purposes, while its second to sixth columns contain the counts for the five phases. The second line of code executes a CA of the count data held in countercolour[,-1], with the following two lines extracting the coordinates to be used for plotting purposes.

library(ca)
countersCA <- ca(countercolour[,-c(1)])
x1 <- countersCA$rowcoord[,1]; x2 <- countersCA$rowcoord[,2]
y1 <- countersCA$colcoord[,1]; y2 <- countersCA$colcoord[,2]
plot(x1, x2, asp = 1, type = "n", xlab = "axis 1 (inertia = 49.1%)",
  ylab = "axis 2 (inertia = 29.6%)", main = "", cex.axis = 1.2,
  cex.lab = 1.3, cex.main = 1.3, xlim = c(-3.5,3.5))
text(x1, x2, countercolour[,1], col = "black", cex = 0.8)
text(y1, y2, c("C1BC", "AugT", "AugN", "Post 62", "Modern"),
  col = "red", cex = 1.3)
segments(y1[1:4], y2[1:4], y1[2:5], y2[2:5], lwd = 2)

In practice we'd embed this within a function, fig12.5 say, and obtain the plot using fig12.5(). In the plot function the argument type = "n" suppresses any plotting of points, which are then added, with appropriate labelling, using the text functions. These add, respectively, the row and column markers, the first two arguments specifying the plotting positions and the third the labels to be used. The segments function 'joins up the dots' corresponding to the column markers, which have been arranged in chronological order. It's perhaps best to look at the help facility in R for segments if you feel moved to use it; essentially it's necessary to specify the (x, y) coordinates for each pair of points you wish to unite with a line segment. In the present application the construction of the data frame is such that an approximate horseshoe pattern should be obtained if the pattern can be interpreted as a seriation, and that's what's happening.
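For those unfamiliar with segments, a toy example of our own (the coordinates are invented) may help. Offsetting the coordinate vectors by one joins consecutive points in order, exactly as in the code above.

x <- c(1, 2, 3, 4, 5)
y <- c(1, 3, 2, 4, 3)
plot(x, y, pch = 16)
segments(x[1:4], y[1:4], x[2:5], y[2:5], lwd = 2)   # joins points 1-2, 2-3, 3-4, 4-5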
It should perhaps be noted that, to the best of our recollection, this kind of presentation is less common in the archaeological literature than the other forms of display used in this chapter. Often, and as in the next two sections, the emphasis is on the pattern of the row markers, and we suspect (not entirely without reason) that a seriation only gets published if it is obviously manifest in the form of a clear horseshoe that doesn't need additional presentational embellishment to clarify the picture. This is discussed further in the next section, which includes an example of CA that is much more typical of seriation studies.

12.6 Anglo-Saxon male graves and seriation

This section is designed to illustrate the more thoroughgoing use of CA for the purposes of chronological seriation, in this case of 272 male Anglo-Saxon burials characterised by the presence or absence of 80 types of grave goods, coded as 1 or 0. (Given the size of the data set it is not reproduced here, but it can, with a little effort, be extracted from the Archaeology Data Service archives at http://archaeologydataservice.ac.uk/archives/view/aschron_eh_2013/.) The section is based on Section 9.5 of Baxter (2015b) which, in turn, drew on a more extended account in Baxter (2014a).

The use of CA for the purposes of seriation in Anglo-Saxon burial studies is quite common; typically there is a clear expectation that a chronological seriation exists. Stratigraphy enabling the relative dating of graves may be unavailable, so methods used elsewhere for this purpose are not readily applied; relative dating is what is attempted in applications of CA. It is expected that changing fashions associated with the grave goods will result in graves that are temporally close to each other having assemblages more similar to each other than to those of temporally more distant graves. It can be shown that in such situations CA is expected 'to work'.

The data were collected for, and analysed in, Bayliss et al. (2013). Analysis, to be found in Chapter 6 there, proceeded iteratively over 50 pages or so. Rather than entering all the data at the start, only a subset of the types was used initially. Graves and types were then omitted if their representation was unsatisfactory; further types were added; and the process was repeated until a seriation judged to be satisfactory was achieved. There is some further 'tweaking' at the end that is discussed shortly. Figure 12.6 emulates the preferred seriation of Bayliss et al. (2013: 286) with some embellishment.

[Figure 12.6: A seriation of 6th and 7th century Anglo-Saxon graves from Britain (a CA row plot of the Anglo-Saxon male graves). Dashed vertical lines are phase boundaries; coloured dots mark dated graves by phase (p, q, r, s, t). (Data source: Bayliss et al., 2013.)]

Ignoring, for a brief moment, the embellishments (coloured dots, dashed vertical lines, and legend), this is absolutely typical of what gets published in archaeological seriation studies. You get as good a horseshoe pattern as you can reasonably expect; it can be interpreted unambiguously as a chronological sequence; nobody is going to argue with you.

The 'philosophy' that underpins the way the CA was conducted follows that espoused in Jensen and Høilund Nielsen (1997), the latter being a co-author of the book under discussion. A condensed summary of the 'philosophy' is that an underlying seriation is assumed to exist at the outset; analysis proceeds iteratively, as just described, the omission of graves and types that do not conform to a 'seriation pattern' being justified by the assumption that such a pattern exists.
Ideally this is also rationalised on archaeological grounds; for example, some grave goods are of a type of some antiquity (compared to other types in the burial assemblage) at the time the burial took place. As described, there is a resemblance to the 'peeling' process of Cool and Baxter (1999). There are, however, differences. The Jensen/Høilund Nielsen approach assumes a particular type of structure in the data, with the emphasis on identifying 'outliers' that 'conceal' the seriation; Cool and Baxter, by contrast, assume no such structure and envisage the use of CA for purposes other than seriation, where the emphasis is on identifying patterns in the data and removing the more obvious structure to reveal more subtle features.

It's possible both to understand the reasoning behind this approach and to be uneasy about it. Its application goes some way towards explaining why – if the concluding paragraph of the previous section is at all valid – only 'convincing' seriations tend to see the light of day. It's assumed that a seriation exists; artefact types and cases that might cast doubt on this are excluded from a final presentation, so ending up with an 'unconvincing' published seriation can be quite difficult.

Some unease about the analytical approach described stems from the fact that it is not always clear how much of the data originally collected has been excluded from final publication because it doesn't 'fit in', and why. One can be left wondering what any omitted data might usefully have to say, including doubts about the apparent 'certainty' of a seriation. Ideally we'd like to see 'warts and all' presentations of seriations, in addition to the chosen picture, coupled with a full discussion of the reasons for omissions – this might actually tell you something beyond the seriation itself. We're realistic enough to recognise that the exigencies of academic publication may often militate against 'full disclosure', even when authors have fully engaged with the issues raised in the course of their research.

As an historical aside, Scandinavian scholars are usually credited with introducing CA to the English-speaking (and reading) archaeological world, or at least with popularising the method (Bølviken et al. 1982; Madsen 1988a, b). French archaeologists, publishing in that language, were on the scene earlier. Much of this was associated with the work of François Djindjian and collaborators who, if memory serves, advocated an analytical approach not dissimilar to Jensen and Høilund Nielsen's (1997) at an earlier date. We're not disputing the validity of what's done, but you would sometimes like a little more information in published applications.

Returning to the application that is the subject of this section, it has so far been presented as a typical application of CA which, in some respects, it is. The embellishments in Figure 12.6, so far ignored, evidence a more singular story. In addition to the kind of information normally used for this kind of exercise, 48 radiocarbon dates were available for 40 of the male graves, indicated in Figure 12.6. This is the preferred seriation in Bayliss et al. (their Figure 6.49 on p. 286; Figure 12.6 reproduces it closely, but with some embellishment not in the original, and using R rather than the software used in the book). The seriation provides a relative ordering of the graves and is used to suggest phasing for the data.
What is novel is the way Bayesian modelling is used in conjunction with the seriation to provide date estimates for the phase boundaries. In Figure 12.6 the ordering of the dated graves is determined by the ordering of the graves on the first CA axis. Some of the phasing is rather 'fine' and some phases contain few dated graves. (Given the data, the CA is straightforward and is all that is reported here, other than labelling the plot with the phased dates. The Bayesian analysis is methodologically and computationally complex and best left to experts. We understand that some of the analyses reported in the book took, literally, weeks to run, so it's not something you should attempt at home.) What the phasing does is provide prior information about the relative chronological position of subsets of the dated graves, and this feeds into the modelling process.

Not all of the original data set contributes to the seriation, and some of the graves originally omitted have associated radiocarbon dates. The 'almost' final seriation provides partial information on the relative chronology of these graves that allows them to be fitted into what becomes the preferred seriation. The combination of techniques used converts the relative chronology provided by the seriation into an estimated absolute chronology. This provides finer dating evidence for the sixth to seventh centuries than that previously available, including an estimate of the cessation of furnished Anglo-Saxon burials that places it in the later seventh century rather than the earlier eighth century, as previously accepted. This is an eye-catching conclusion (for Anglo-Saxon scholars at least) and is a good example of innovative statistical methodology leading to archaeologically important interpretations.

There are open questions, as the authors acknowledge. The preferred seriation is based on calibrated radiocarbon dates that assume fish is not an important component of diet. Were fish consumption non-negligible – and the evidence in the book allows this reading – the estimated end-date would be later, though still earlier than the previously presumed end. For some of the later, and dated, burials there is also an unresolved conflict with dates derived from numismatic evidence, which are later.

Methodologically, the way in which the relative dating provided by the CA is interpreted in absolute terms is far superior to other work we are aware of on similar lines. The methodological development is, therefore, interesting and leads to substantively important archaeological conclusions. Unfortunately the book is, to put it kindly, 'heavy going' in places; this is particularly true of Chapters 6 and 7, which make considerable use of statistics and are central to the book. A major problem is that far too much detail is provided, obscuring the essentials; Baxter (2014a, b) provides a commentary that attempts to separate out the essential material. For a frank assessment of the book from an informed archaeological perspective see Hills (2014).

12.7 Flavian drinking-vessels

Usually CA is presented as a method for analysing tables of counts. It can, in fact, be applied to any table of non-negative data and sometimes is, though the 'theoretical' justification for data other than counts may be less clear-cut and data may be transformed before application. One purpose of this final example is to illustrate the application of CA to fractional counts in the form of estimated vessel equivalents (EVEs).
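That the ca function is indifferent to whether the entries are whole numbers is easily checked. A minimal sketch with an invented table of fractional values (not the Flavian data):

library(ca)
m <- matrix(c(0.8, 0.2, 1.0,
              0.0, 0.4, 0.2,
              2.0, 1.0, 0.4), nrow = 3, byrow = TRUE)   # made-up 'abundances'
ca(m)   # runs exactly as it would for a table of counts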
The example is based on drinking vessels current during the Flavian period in England. A selection of vessels is illustrated in Figure 12.7.

[Figure 12.7: (a) A Hofheim cup; (b) a facet-cut beaker; (c) a circus cup. (Sources: (a) courtesy of the Metropolitan Museum, New York, Cesnola Collection, purchased by subscription, 1874–76; (b) courtesy of the J. Paul Getty Museum, Los Angeles; (c) courtesy of the Metropolitan Museum, New York, gift of Henry G. Marquand, 1881 (from Montagnole, France).)]

Table 12.2, originally published in Cool and Baxter (1999), shows glass vessel EVEs for seven types of drinking-vessel, current during the Flavian period in England, from 10 sites ordered by their north-south orientation, with three from the north, two from the Midlands, and five from the south.

In pottery studies, in Britain at least, it has been recognised since the seminal work of Orton (1975, 1993) that EVEs are the best way to quantify fragmented data where the aim is to effect comparisons across assemblages. The merits of EVEs are sometimes honoured more in the breach than the observance; they seem not to be much used in pottery studies on the European mainland or in the USA. It means that a lot of pottery studies based on fragmented data that use other measures – fragment count, minimum number of vessels represented and so on – do not lend themselves well to quantitative comparative studies.

Moreno-García et al. (1996) extended the thinking behind EVEs to animal bone counts. This was based on a characterisation of bones that divided them into recognisable zones; the equivalent of an EVE, for a single bone, was the proportion of recognisable zones present. A pre-publication view of this was the inspiration for Cool's (1994) development of glass vessel EVEs (see, also, Cool and Baxter 1996).

Table 12.2: Assemblage profiles (EVEs) for seven drinking vessel types, from the Flavian period in England. 'Order' is geographical from north to south. 'Sport', 'Rib' and 'Hof' refer to mould-blown sport cups, mould-blown ribbed cups and Hofheim cups; 'Tall', 'Ind', 'FC' and 'WC' are beaker types, tall mould-blown, indented, facet-cut and wheel-cut respectively. (Source: Cool and Baxter, 1999.)

Site         Date    N-S Order  Sport  Tall  Rib  Hof  Ind  FC   WC   Total EVE
Carlisle     71-105   1         0.8    0.2   1.0  0    0.2  0.2  0.4   2.8
York         71-100   2         0      0     0.2  2.2  0    0    0.6   2.4
Castleford   71-100   3         0      0.4   1.4  0    0.2  0.2  0.2   2.4
Wroxeter     80-120   4         0      0     0.6  0.4  0    0.2  0.2   1.4
Caersws      69-100   5         0      0.2   0.6  2.2  0    0    0     3.0
Colchester   65-100   6         0.8    0     0    2.4  0.6  0.4  0.2   4.4
Gloucester   70-98    7         0      0.6   0    2.8  0    0    0     3.4
Caerleon     74-100   8         0      0     0    1.2  1.0  1.8  0.8   4.8
London       65-100   9         2.0    1.0   0    2.8  3.0  3.2  1.0  13.0
Fishbourne   75-100  10         0      0.2   0    0    0.4  0.8  0     1.4

Figure 12.8 shows the result of applying CA to the data of Table 12.2.

[Figure 12.8: Correspondence analysis row and column plots for the Flavian drinking-vessels of Table 12.2. Panel titles: 'CA (row plot) of Flavian drinking vessels' and 'CA (column plot) of Flavian drinking vessels'; the axes are Dimension 1 (inertia = 41.4%) and Dimension 2 (inertia = 37.0%).]

The manner of presentation chosen illustrates the use of adjacent rather than superimposed plots. Both plots are divided into quadrants, so that a comparison between them is much easier.
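If you want the two panels on a single device rather than in separate windows (the code given later opens a new window for each plot), one option in base R is to set the layout up first; a minimal sketch of our own:

op <- par(mfrow = c(1, 2))   # one row, two columns
plot(1:10)                   # stand-in for the row plot
plot(10:1)                   # stand-in for the column plot
par(op)                      # restore the previous settings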
The code used to obtain the plots is given towards the end of the section. With the exception of Site 1 (Carlisle), an almost perfect seriation of the data is obtained in the row plot. Labelling sites by their north-south orientation, Site 2 (York) sits apart from the others in its region but, though not perfect, the seriation admits a spatial interpretation. Reading around the horseshoe, sites from the north and Midlands are perfectly separated from southern sites on the first axis.

Cool and Baxter (1999: 90–91) can be referred to for a discussion of these results, which admit more than one interpretation. One is that the north is characterised by earlier vessel forms and the south by later ones, and that this may be the 'rule rather than the exception', contradicting 'traditionally' held views that the earlier forms found in the north and Midlands were 'isolated survivals'. An alternative interpretation is that drinkers in the north/Midlands favoured low cups while those in the south preferred tall beakers.

Ribbed cups only appear in the northern and Midland assemblages, all of which plot, along with these cups, to the left. Hofheim cups are not confined to any particular region, but are particularly prominent at Sites 2, 5 and 7 (York, Caersws, Gloucester), which accordingly plot in a similar region towards the bottom. The indented and facet-cut beakers are particularly prominent at the three most southerly sites, 8, 9 and 10 (Caerleon, London, Fishbourne), which also plot closely, in the north-east quadrant.

The analysis of these data raises an interesting and more general question. We are aware that Cool and Baxter (1999) has been used for teaching purposes in a leading British university archaeology department, where students were invited to criticise it. We've been told (in more than one personal communication) that students often comment that the numbers (of EVEs) on which the analysis is based are small, and ask if this raises questions about the validity of any interpretation based on them. Both the observation and the question are legitimate.

The 'generality' of the question arises because in archaeological data analysis one has to work with the data to hand, and the sample sizes involved may often be quite 'small' according to the desiderata sometimes laid down for what constitutes an adequate sample (not, incidentally, a question that has an easy answer). It can also be argued that most archaeological data analysis is concerned with pattern recognition in some form or other, using the terms 'data' and 'pattern' in a very broad sense. It is difficult to put the following thought into words, but it's along the lines that if there is obvious pattern in a data set, and if a plausible archaeological interpretation can be advanced for that pattern (the emphases are important here), then this transcends any purely statistical concerns one might have about sample sizes. That is, intelligent archaeological pattern recognition and interpretation may 'trump' statistical formalism when the two appear to conflict because of sample-size doubts.

Something of this kind, in more general terms, is expressed in Section 3.15 of Doran and Hodson (1975). This was around the time when quantitative methodology began to be widely applied in archaeology, and those involved were grappling with the conflicting requirements of archaeological and statistical inference.
Some scholars rejected the use of formal statistical methodology entirely, while proponents of what was then the 'New Archaeology' rather exaggerated what was possible with such methodology. Both positions are extreme; statistics are useful but, in an applied context, should be subservient to archaeology.

The patterns in Table 12.2 revealed by the CA are undeniable. The numbers are, however, small, so the possibility exists that future discoveries may overturn some of the interpretations offered. We have yet to see it in print, but are aware that the paper has been criticised on the basis of finds made in the south, subsequent to Cool and Baxter (1999), that are supposed to contradict the suggestion that drinkers in the south preferred tall vessels. It would be necessary to quantify the new data and subject it to the kind of statistical analysis that would set it in the context of the data from other regions to assess this claim. If, hypothetically, the 'new' glass amounts to 2 EVEs of ribbed cups split equally between the southern sites (where previously there were none), the interpretation does not change. To change the broad picture you'd need somewhat larger EVE values spread around small vessels in the south and/or to find more tall vessels in the north/Midlands.

The R code used to produce Figure 12.8 follows.

fig12.8 <- function() {
  library(ca)
  EVE <- Flavian[, 2:8]   # extract EVE counts
  CA <- ca(EVE)           # undertake the CA
  # extract plotting coordinates
  x1 <- CA$rowcoord[,1]; x2 <- CA$rowcoord[,2]
  y1 <- CA$colcoord[,1]; y2 <- CA$colcoord[,2]
  # set up plot symbols and colours to be used
  Col <- c(rep("pink", 3), rep("yellowgreen", 2), rep("skyblue", 5))
  Sym <- c(rep(16, 3), rep(17, 2), rep(15, 5))
  # row plot
  win.graph()   # opens a new window on Windows; dev.new() is the portable equivalent
  plot(x1, x2, asp = 1, pch = Sym, col = Col, cex = 3,
    main = "CA (row plot) of Flavian drinking vessels")
  text(x1, x2, 1:10, cex = 1)
  abline(h = 0, v = 0, lty = 2, lwd = 1.5)
  legend("topright", c("North", "Midlands", "South"),
    col = c("pink", "yellowgreen", "skyblue"), pch = c(16,17,15),
    cex = 1.2, bty = "n")
  # column plot
  win.graph()
  plot(y1, y2, asp = 1, type = "n",
    main = "CA (column plot) of Flavian drinking vessels")
  text(y1, y2, names(EVE), cex = 1.5)
  abline(h = 0, v = 0, lty = 2, lwd = 1.5)
}
fig12.8()

This introduces one or two new features that have not previously been used. These are not specific to CA and could equally well have been introduced in Chapter 8. The site and EVE count data, other than the total, are read into a data frame Flavian (i.e. only columns 1 and 4–10 of Table 12.2 are used). The code is given in summary form, omitting presentational aspects covered elsewhere (e.g., text sizes) unless it seems useful to make them explicit.

The first two blocks of code set up the information to be used in the plots. The idea was to use different plotting symbols, differently coloured to distinguish between regions, then to label the points in a manner that both distinguishes between sites and is informative about their geographical location (i.e. the 'N-S Order' in Table 12.2). To do this it's best to use pale colours, created in Col. If you want the site labels to lie within the symbols you need to play around with their sizes, so we've made the cex arguments used explicit. The abline function adds the reference grid: dashed (lty = 2) horizontal and vertical lines (h=0, v=0) that intersect at the point (0, 0), with lwd controlling the line width.
In the column plot, type = "n" suppresses the printing of points and the text function replaces them with the names of the columns contained in the data frame EVE.

A final comment is that we have used correspondence analysis for the graphical display of counted data, and this is the dominant use in the archaeological literature. Where pattern exists it is usually possible to infer, from the positioning of the row and column markers, what the more important determinants of the pattern are. This can be done more formally and numerically by calculating a variety of diagnostic statistics, which is possible using the ca function. Examples are scattered throughout Greenacre (2007), including an account of the theory. Shennan's (1997: 327–341) case study of Danish Bronze Age hoards, which predates the development of the ca package, provides a detailed illustration of the use of the statistics, and is the earliest archaeological example we have seen. Other archaeological illustrations include Baxter (2015b: 138–143) and Lockyear (2016), but the 'technology' has not been widely used.
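For the curious, the summary method for objects returned by ca is the easiest way into these diagnostics; a minimal sketch, assuming CA holds the result of a call to ca, as in the code above:

library(ca)
summary(CA)   # principal inertias, plus row and column masses,
              # qualities of representation and contributions to inertia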
Bibliography

Albarella, U., Tagliacozzo, A., Dobney, K. and Rowley-Conwy, P. 2006: Pig hunting and husbandry in prehistoric Italy: a contribution to the domestication debate. Proceedings of the Prehistoric Society 72, 193–227.
Alberti, G. 2013a: An R script to facilitate correspondence analysis. A guide to the use and the interpretation of results from an archaeological perspective. Archeologia e Calcolatori 24, 25–53.
Alberti, G. 2013b: Making sense of contingency tables in archaeology: The aid of correspondence analysis to intra-site activity areas research. Journal of Data Science 11, 479–499.
Alberti, G. 2015: CAinterprTools: An R package to help interpreting Correspondence Analysis' results. SoftwareX 1–2, 26–31.
Aldeias, V., Goldberg, P., Sandgathe, D., Berna, F., Dibble, H. L., McPherron, S. P., Turq, A. and Rezek, Z. 2012: Evidence for Neandertal use of fire at Roc de Marsal (France). Journal of Archaeological Science 39, 2414–2423.
Baddeley, A., Rubak, E. and Turner, R. 2016: Spatial Point Patterns. Boca Raton, FL: Chapman & Hall/CRC Press.
Barker, P., White, R., Pretty, K., Bird, H. and Corbishley, M. 1997: The Baths Basilica Wroxeter: Excavations 1966–90. London: English Heritage Archaeological Report 8.
Baxter, M. J. 1994: Exploratory Multivariate Analysis in Archaeology. Edinburgh: Edinburgh University Press.
Baxter, M. J. 2003: Statistics in Archaeology. London: Arnold.
Baxter, M. J. 2014a: Anglo-Saxon Chronology I – the male graves. A commentary on Chapter 6 of Anglo-Saxon Graves and Grave Goods of the 6th and 7th Centuries AD: A Chronological Framework. https://nottinghamtrent.academia.edu/MikeBaxter/Archaeology (accessed 10/10/2016).
Baxter, M. J. 2014b: Anglo-Saxon Chronology II – the female graves. A commentary on Chapter 7 of Anglo-Saxon Graves and Grave Goods of the 6th and 7th Centuries AD: A Chronological Framework. https://nottinghamtrent.academia.edu/MikeBaxter/Archaeology (accessed 10/10/2016).
Baxter, M. J. 2015a: Exploratory Multivariate Analysis in Archaeology. Reprint of Baxter (1994) with a new introduction. New York: Percheron Press.
Baxter, M. J. 2015b: Notes on Quantitative Archaeology and R. https://www.academia.edu/12545743/Notes_on_Quantitative_Archaeology_and_R (accessed 18/05/2016).
Baxter, M. J. 2016: Multivariate Analysis of Archaeometric Data: An Introduction. https://www.academia.edu/24456912/Multivariate_Analysis_of_Archaeometric_Data_An_Introduction (accessed 28/07/2016).
Baxter, M. J. and Cool, H. E. M. 2008: Notes on the statistical analysis of some loomweights from Pompeii. Archeologia e Calcolatori 19, 49–66.
Baxter, M. J. and Cool, H. E. M. 2010a: Detecting modes in low-dimensional archaeological data. Journal of Archaeological Science 37, 2379–2385.
Baxter, M. J. and Cool, H. E. M. 2010b: Correspondence Analysis in R for archaeologists: an educational account. Archeologia e Calcolatori 21, 211–228.
Baxter, M. J. and Cool, H. E. M. 2016: Reinventing the wheel? Modelling temporal uncertainty with applications to brooch distributions in Roman Britain. Journal of Archaeological Science 66, 120–127.
Baxter, M. J., Beardah, C. C. and Wright, R. V. S. 1997: Some archaeological applications of kernel density estimates. Journal of Archaeological Science 24, 347–354.
Baxter, M. J., Cool, H. E. M. and Anderson, M. A. 2010: Statistical analysis of some loomweights from Pompeii: a postscript. Archeologia e Calcolatori 21, 185–200.
Bayliss, A., Hines, J., Høilund Nielsen, K., McCormac, G. and Scull, C. 2013: Anglo-Saxon Graves and Grave Goods of the 6th and 7th Centuries AD: A Chronological Framework. London: Society for Medieval Archaeology.
Becker, R. A., Chambers, J. M. and Wilks, A. R. 1988: The S Language: A Programming Environment for Data Analysis and Graphics. Murray Hill, NJ: Bell Telephone Laboratories.
Binford, L. R. 1978: Dimensional analysis of behavior and site structure: learning from an Eskimo hunting stand. American Antiquity 34, 330–361.
Blakelock, E. S. 2016: Never judge a gold object by its surface analysis: A study of surface phenomena in a selection of gold objects from the Staffordshire Hoard. Archaeometry, in press. DOI: 10.1111/arcm.12209.
Bølviken, E. E., Helskog, K., Holm-Olsen, I., Solheim, L. and Bertelsen, R. 1982: Correspondence analysis: an alternative to principal components. World Archaeology 14, 41–60.
Chang, W. 2012: R Graphics Cookbook. Sebastopol, CA: O'Reilly.
Chester-Kadwell, M. 2009: Early Anglo-Saxon Communities in the Landscape of Norfolk. Oxford: Archaeopress, British Archaeological Reports, British Series 481.
Chirikure, S., Heimann, R. B. and Killick, D. 2010: The technology of tin smelting in the Rooiberg Valley, Limpopo Province, South Africa, ca. 1650–1850 CE. Journal of Archaeological Science 37, 1656–1669.
Christensen, A-M., Holm, P. M., Schuessler, U. and Petrasch, J. 2006: Indications of a major Neolithic trade route? An archaeometric geochemical and Sr, Pb isotope study on amphibolitic raw material from present day Europe. Applied Geochemistry 21, 1635–1655.
Cocis, H. 2015: Some remarks on the Roman necropolises of Potaissa. Journal of Ancient History and Archeology 2, 58–66.
Conolly, J. and Lake, M. 2006: Geographical Information Systems in Archaeology. Cambridge: Cambridge University Press.
Cool, H. E. M. 1983: Roman Personal Ornaments Made of Metal, Excluding Brooches, from Southern Britain. Cardiff: University of Wales Ph.D. Thesis. (Available at http://ethos.bl.uk.)
Cool, H. E. M. 1990: Roman metal hair pins from southern Britain. Archaeological Journal 147, 148–182.
Cool, H. E. M. 1994: The Quantification of Roman Vessel Glass Assemblages. Unpublished report: English Heritage.
Cool, H. E. M. 2006: Eating and Drinking in Roman Britain. Cambridge: Cambridge University Press.
Cool, H. E. M. 2013: Contextualising Metal-Detected Discoveries: Staffordshire Anglo-Saxon Hoard. Nottingham, UK: Barbican Research Associates.
Cool, H. E. M. 2016: Recreation or decoration: what were the glass counters from Pompeii used for? Papers of the British School at Rome 84, 157–177.
Cool, H. E. M. and Baxter, M. J. 1995: Finds from the fortress: artefacts, buildings and correspondence analysis. In Wilcock, J. and Lockyear, K. (eds.), Computer Applications and Quantitative Methods in Archaeology 1993. Oxford: Tempus Reparatum, British Archaeological Reports, International Series 598, 177–182.
Cool, H. E. M. and Baxter, M. J. 1996: Quantifying glass assemblages. Annales du 13e Congrès de l'Association Internationale pour l'Histoire du Verre, 93–101.
Cool, H. E. M. and Baxter, M. J. 1999: Peeling the onion: an approach to comparing vessel glass assemblages. Journal of Roman Archaeology 12, 72–100.
Cool, H. E. M. and Baxter, M. J. 2016a: Brooches and Britannia. Britannia, 71–98.
Cool, H. E. M. and Baxter, M. J. 2016b: Exploring morphological bias in metal-detected finds. Antiquity (to appear).
Cool, H. E. M. and Price, J. 1995: Roman Vessel Glass from Excavations in Colchester, 1971–85. Colchester: Colchester Archaeological Reports 8. http://cat.essex.ac.uk/reports/CAR-report-0008.pdf (accessed 22/10/2016).
Cool, H. E. M., Lloyd Morgan, G. and Hooley, D. 1995: Finds from the Fortress. The Archaeology of York 17:7. York: Council for British Archaeology.
De La Fuente, G. A. 2011: Urns, bowls, and ollas: pottery-making practices and technical identity in the southern Andes during the late period (ca. A.D. 900–A.D. 1450) (Catamarca, Northwestern Argentine region, Argentina). Latin American Antiquity 22, 224–252.
De Reu, J., Bourgeois, J., Bats, M., De Smedt, P., Gelorini, V., Zwertvaegher, A., Antrop, M., De Maeyer, P., Finke, P., Van Meirvenne, M., Verniers, J. and Crombé, P. 2013: Beyond the unknown: understanding prehistoric patterns in the urbanised landscape of Flanders. Journal of Historical Geography 40, 1–15.
Discamps, E. and Costamagno, S. 2015: Improving mortality profile analysis in zooarchaeology: a revised zoning for ternary diagrams. Journal of Archaeological Science 58, 62–76.
Doran, J. E. and Hodson, F. R. 1975: Mathematics and Computers in Archaeology. Edinburgh: Edinburgh University Press.
Drennan, R. D. 1996: Statistics for Archaeologists. New York: Plenum Press.
Drennan, R. D. 2009: Statistics for Archaeologists: Second Edition. New York: Springer.
Duval, C., Lepetz, S., Horard-Herbin, M-P. and Cucchi, T. 2015: Did Romanization impact Gallic pig morphology? New insights from molar geometric morphometrics. Journal of Archaeological Science 57, 345–354.
Egri, M. 2007: The use of amphorae for interpreting patterns of consumption. In Croxford, B., Ray, N., Roth, R. and White, N. (eds.), TRAC 2006: Proceedings of the Sixteenth Annual Theoretical Roman Archaeology Conference. Oxford: Oxbow Books, 43–58.
Greenacre, M. J. 2007: Correspondence Analysis in Practice: Second Edition. Boca Raton, FL: Chapman & Hall/CRC Press.
Hammon, A. 2011: Understanding the Romano-British–Early-Medieval transition: a zooarchaeological perspective from Wroxeter (Viroconium Cornoviorum). Britannia 42, 275–305.
Hein, A., Kilikoglou, V. and Kassianidou, V. 2007: Chemical and mineralogical examination of metallurgical ceramics from a Late Bronze Age copper smelting site in Cyprus. Journal of Archaeological Science 34, 141–154.
Hesse, R. 2011: Reconsidering animal husbandry and diet in the northwest provinces. Journal of Roman Archaeology 24, 215–248.
Hills, C. 2014: Review of Anglo-Saxon Graves and Grave Goods of the 6th and 7th Centuries AD: A Chronological Framework, by Alex Bayliss, John Hines, Karen Høilund Nielsen, Gerry McCormac and Christopher Scull. The Antiquaries Journal 94, 370–372.
Hintze, J. L. and Nelson, R. D. 1998: Violin plots: A box plot–density trace synergism. The American Statistician 52, 181–184.
Hiscock, P. 2003: Quantitative exploration of size variation and the extent of reduction in Sydney Basin assemblages: A tale from the Henry Lawson Drive rockshelter. Australian Archaeology 57, 64–74.
Howarth, R. J. 1996: Sources for a history of the ternary diagram. The British Journal for the History of Science 29, 337–356.
Hubert, M. and Vandervieren, E. 2008: An adjusted boxplot for skewed distributions. Computational Statistics and Data Analysis 52, 5186–5201.
Hughes, R. E., Högberg, A. and Olausson, D. 2010: Sourcing flint from Sweden and Denmark: A pilot study employing non-destructive energy dispersive X-ray fluorescence spectrometry. Journal of Nordic Archaeological Science 17, 15–25.
Jennings, T. A. 2008: San Patrice: An example of Late Paleoindian adaptive versatility in South-Central North America. American Antiquity 73, 539–559.
Jensen, C. K. and Høilund Nielsen, K. (eds.) 1997: Burial and Society: The Chronological and Social Analysis of Archaeological Burial Data. Aarhus: Aarhus University Press.
Joy, J. 2014: 'Fire burn and cauldron bubble': Iron Age and Early Roman cauldrons of Britain and Ireland. Proceedings of the Prehistoric Society 80, 327–362.
Kampstra, P. 2008: Beanplot: A boxplot alternative for visual comparison of distributions. Journal of Statistical Software, Code Snippets 28(1), 1–9. https://www.jstatsoft.org/article/view/v028c01 (accessed 18/05/2016).
Keeler, D. 2007: Intrasite spatial analysis of a Late Upper Paleolithic French site using geographic information systems. Journal of World Anthropology: Occasional Papers 3.1, 1–40.
King, A. C. 1984: Animal bones and the dietary identity of military and civilian groups in Roman Britain, Germany and Gaul. In Blagg, T. F. C. and King, A. C. (eds.), Military and Civilian in Roman Britain. Oxford: British Archaeological Reports, British Series 136, 187–217.
King, A. C. 1999a: Diet in the Roman world: a regional inter-site comparison of the mammal bones. Journal of Roman Archaeology 12, 168–202.
King, A. C. 1999b: Animals and the Roman army: the evidence of the animal bones. In Goldsworthy, A. and Haynes, I. (eds.), The Roman Army as a Community. Journal of Roman Archaeology Supplement 34. Portsmouth, RI: Journal of Roman Archaeology, 200–211.
King, A. C. 2005: Animal remains from temples in Roman Britain. Britannia 36, 329–369.
Leach, P. 1993: Pottery. In Woodward, A. and Leach, P. (eds.), The Uley Shrines: Excavation of a Ritual Complex on West Hill, Uley, Gloucestershire: 1977–9. English Heritage Archaeological Report 17. London: English Heritage, 219–249.
Levitan, B. 1993: Vertebrate remains. In Woodward, A. and Leach, P. (eds.), The Uley Shrines: Excavation of a Ritual Complex on West Hill, Uley, Gloucestershire: 1977–9. English Heritage Archaeological Report 17. London: English Heritage, 257–301.
Liebmann, M., Ferguson, T. J. and Preucel, R. W. 2005: Pueblo settlement, architecture, and social change in the Pueblo Revolt Era, A.D. 1680 to 1696. Journal of Field Archaeology 30, 45–60.
Lockyear, K. 1994: 'Ranter's Corner' – PUPs. Archaeological Computing Newsletter 41, 1–2.
Lockyear, K. 2013: Applying bootstrapped correspondence analysis to archaeological data. Journal of Archaeological Science 40, 4744–4753.
Lockyear, K. 2016: Simplifying complexity. https://www.academia.edu/28521469/ (accessed 06/10/2016).
Mackay, A. 2011: Nature and significance of the Howiesons Poort to post-Howiesons Poort transition at Klein Kliphuis rockshelter, South Africa. Journal of Archaeological Science 38, 1430–1440.
Mackreth, D. F. 2011: Brooches in Late Iron Age and Roman Britain. Oxford: Oxbow Books.
Madsen, T. (ed.) 1988a: Multivariate Archaeology. Aarhus: Aarhus University Press.
Madsen, T. 1988b: Multivariate distances and archaeology. In Madsen, T. (ed.), Multivariate Archaeology. Aarhus: Aarhus University Press, 7–27.
Mårtensson, L., Nosch, M-L. and Strand, E. A. 2009: Shape of things: understanding a loom weight. Oxford Journal of Archaeology 28, 373–398.
McGill, R., Tukey, J. W. and Larsen, W. 1978: Variations of box plots. The American Statistician 32, 12–16.
McMahon, T. C. 2007: Discerning prehistoric landscapes in Colorado and the Mesa Verde region using a kernel density estimate (KDE) method. In Clark, J. T. and Hagemeister, E. M. (eds.), Digital Discovery: Exploring New Frontiers in Human Heritage, CAA 2006. Budapest: Archaeolingua, 151–166.
Meighan, C. W. 1959: A new method for the seriation of archaeological collections. American Antiquity 25, 203–211.
Mills, B. J. 2007: Performing the feast: visual display and suprahousehold commensalism in the Puebloan Southwest. American Antiquity 72, 210–239.
Monna, F., Jebrane, A., Gabillot, M., Laffont, R., Specht, M., Bohard, B., Camizuli, E., Petit, C., Chateau, C. and Alibert, P. 2013: Morphometry of Middle Bronze Age palstaves. Part II – spatial distribution of shapes in two typological groups, implications for production and exportation. Journal of Archaeological Science 40, 507–516.
Morehart, C. T. and Eisenberg, D. T. A. 2010: Prosperity, power, and change: Modeling maize at Postclassic Xaltocan, Mexico. Journal of Anthropological Archaeology 29, 94–112.
Moreno-García, M., Orton, C. and Rackham, J. 1996: A new statistical tool for comparing animal bone assemblages. Journal of Archaeological Science 23, 437–453.
Murrell, P. 2005: R Graphics. Boca Raton, FL: Chapman & Hall/CRC Press.
Murrell, P. 2011: R Graphics: Second Edition. Boca Raton, FL: Chapman & Hall/CRC Press.
Nenadic, O. and Greenacre, M. 2007: Correspondence Analysis in R, with two- and three-dimensional graphics: the ca package. Journal of Statistical Software 20(3), 1–13.
Nikita, E., Mattingly, D. and Mirazón Lahr, M. 2012: Three-dimensional cranial shape analyses and gene flow in North Africa. Journal of Anthropological Archaeology 31, 564–572.
Oron, M. and Goren-Inbar, N. 2014: Mousterian intra-site spatial patterning at Quneitra, Golan Heights. Quaternary International 331, 186–202.
Orton, C. 1975: Quantitative pottery studies: some progress, problems and prospects. Science and Archaeology 16, 30–35.
Orton, C. 1980: Mathematics in Archaeology. London: Collins.
Orton, C. 1993: How many pots make 5 – an historical review of pottery quantification. Archaeometry 35, 169–184.
Panich, L. M. 2016: Beyond the colonial curtain: Investigating indigenous use of obsidian in Spanish California through the pXRF analysis of artifacts from Mission Santa Clara. Journal of Archaeological Science: Reports 5, 521–530.
Peeples, M. A. and Schachner, G. 2012: Refining correspondence analysis-based ceramic seriation of regional data sets. Journal of Archaeological Science 39, 2818–2827.
Plaza, M. T. and Martinón-Torres, M. 2015: Metallurgical traditions under Inka rule: a technological study of metals and technical ceramics from the Aconcagua Valley, Central Chile. Journal of Archaeological Science 54, 86–98.
Plummer, T. W., Bishop, L. C. and Hertel, F. 2008: Habitat preference of extant African bovids based on astragalus morphology: operationalizing ecomorphology for palaeoenvironmental reconstruction. Journal of Archaeological Science 35, 3016–3027.
Pollard, A. M., Ditchfield, P., Piva, E., Wallis, S., Falys, C. and Ford, S. 2012: 'Sprouting like cockle amongst the wheat': The St Brice's Day massacre and the isotopic analysis of human bones from St John's College, Oxford. Oxford Journal of Archaeology 31, 83–102.
Radivojević, M., Rehren, T., Pernicka, E., Šljivar, D., Brauns, M. and Borić, D. 2010: On the origins of extractive metallurgy: new evidence from Europe. Journal of Archaeological Science 37, 2775–2787.
Robeerst, A. 2005: Interaction and exchange in food production in the Nijmegen region frontier area during the Early Roman period. In Bruhn, J., Croxford, B. and Grigoropoulos, G. (eds.), TRAC 2004: Proceedings of the Fourteenth Annual Theoretical Roman Archaeology Conference. Oxford: Oxbow Books, 43–58.
Sarkar, D. 2008: Lattice: Multivariate Data Visualization with R. New York: Springer.
Sayer, D. and Weinhold, M. 2013: A GIS investigation of four early Anglo-Saxon cemeteries: Ripley's K-function analysis of spatial groupings amongst graves. Social Science Computer Review 31, 71–89.
Scarry, C. M. and Reitz, E. J. 2005: Changes in foodways at the Parkin site, Arkansas. Southeastern Archaeology 24, 107–121.
Scharra-Liška, G., Cichocki, O. and Wiltschke-Schrotta, K. 2016: Wooden coffins in the Avar-period cemetery in Frohsdorf, Lower Austria. Open Archaeology 1, 54–78.
Schmidt, I., Bradtmöller, M., Kehl, M., Pastoors, A., Tafelmaier, Y., Weninger, B. and Weniger, G.-C. 2012: Rapid climate change and variability of settlement patterns in Iberia during the Late Pleistocene. Quaternary International 274, 179–204.
Schörner, G. 2007: New images for old rituals: Stelae of Saturn and personal cult in Roman North Africa. In Croxford, B., Ray, N., Roth, R. and White, N. (eds.), TRAC 2006: Proceedings of the Sixteenth Annual Theoretical Roman Archaeology Conference. Oxford: Oxbow Books, 92–102.
Seetah, K., Cucchi, T., Dobney, K. and Barker, G. 2014: A geometric morphometric re-evaluation of the use of dental form to explore differences in horse (Equus caballus) populations and its potential zooarchaeological application. Journal of Archaeological Science 41, 904–910.
Shennan, S. 1988: Quantifying Archaeology. Edinburgh: Edinburgh University Press.
Shennan, S. 1997: Quantifying Archaeology: Second Edition. Edinburgh: Edinburgh University Press.
Silverman, B. W. 1986: Density Estimation for Statistics and Data Analysis. London: Chapman & Hall.
Simmonds, A., Marquez-Grant, N. and Loe, L. 2008: Life and Death in a Roman City: Excavation of a Roman Cemetery with a Mass Grave at 120–122 London Road, Gloucester. Oxford Archaeology Monograph 6. Oxford: Oxford Archaeological Unit.
Stafford, C. R. 1991: Archaic Period logistical foraging strategies in West-Central Illinois. Midcontinental Journal of Archaeology 16, 21–246.
Stafford, C. R. 1994: Structural changes in Archaic landscape use in the dissected uplands of Southwestern Indiana. American Antiquity 59, 219–237.
Steele, T. E. and Weaver, T. D. 2002: The modified triangular graph: a refined method for comparing mortality profiles in archaeological samples. Journal of Archaeological Science 29, 317–322.
Stiner, M. C. 1990: The use of mortality patterns in archaeological studies of hominid predatory adaptations. Journal of Anthropological Archaeology 9, 305–351.
Stiner, M. C. 1994: Honor Among Thieves: A Zooarchaeological Study of Neanderthal Ecology. Princeton: Princeton University Press.
Thornton, C. P. and Rehren, T. 2009: A truly refractory crucible from fourth millennium Tepe Hissar, Northeast Iran. Journal of Archaeological Science 36, 2700–2712.
Tufte, E. R. 1983: The Visual Display of Quantitative Information. Cheshire, CT: Graphics Press.
Tukey, J. W. 1977: Exploratory Data Analysis. Reading, MA: Addison-Wesley.
Tukey, J. W. 1993: Graphic comparisons of several linked aspects: Alternatives and suggested principles. Journal of Computational and Graphical Statistics 2, 1–33.
Tupman, C. 2005: The cupae of Iberia in their monumental context: a study of the relationship between social status and commemoration with barrel-shaped and semi-cylindrical tombstones. In Bruhn, J., Croxford, B. and Grigoropoulos, G. (eds.), TRAC 2004: Proceedings of the Fourteenth Annual Theoretical Roman Archaeology Conference. Oxford: Oxbow Books, 119–131.
Unwin, A. 2015: Graphical Data Analysis with R. Boca Raton, FL: Chapman & Hall/CRC Press.
Venables, W. N. and Ripley, B. D. 2002: Modern Applied Statistics with S: Fourth Edition. New York: Springer.
Venables, W. N., Smith, D. M. and the R Core Team 2016: An Introduction to R: Notes on R: A Programming Environment for Data Analysis and Graphics, Version 3.3.1. https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf (accessed 10/10/2016).
Wainwright, G. J. 1984: The pressure of the past. Proceedings of the Prehistoric Society 50, 1–22.
Wand, M. P. and Jones, M. C. 1995: Kernel Smoothing. London: Chapman & Hall.
Weaver, T. D., Boyko, R. H. and Steele, T. E. 2011: Cross-platform program for likelihood-based statistical comparisons of mortality profiles on a triangular graph. Journal of Archaeological Science 38, 2420–2423.
Weber, M. 2012: The devil is in the (Samian) detail – potter's stamps on samian ware and their implications for the Roman provincial economy. In Duggan, M., McIntosh, F. and Rohl, D. J. (eds.), TRAC 2011: Proceedings of the Twenty First Theoretical Roman Archaeology Conference. Oxford: Oxbow Books, 60–75.
Whallon, R. 1987: Simple statistics. In Aldenderfer, M. S. (ed.), Quantitative Research in Archaeology. Newbury Park, California: Sage Publications, 135–150.
Wheatley, D. and Gillings, M. 2002: Spatial Technology and Archaeology. London: Taylor and Francis.
White, N. 2007: Catering for the cultural identities of the deceased in Roman Britain: Interpretive potential and problems. In Croxford, B., Ray, N., Roth, R. and White, N. (eds.), TRAC 2006: Proceedings of the Sixteenth Annual Theoretical Roman Archaeology Conference. Oxford: Oxbow Books, 115–132.
Wickham, H. 2009: ggplot2. New York: Springer.
Wickham, H. 2016: ggplot2: Second Edition. New York: Springer.
Wickham, H. and Stryjewski, L. 2011: 40 years of boxplots. http://vita.had.co.nz/papers/boxplots.pdf (accessed 10/10/2016).
Wilkinson, L. 2005: The Grammar of Graphics. New York: Springer.
Woodward, A. and Leach, P. (eds.) 1993: The Uley Shrines: Excavation of a Ritual Complex on West Hill, Uley, Gloucestershire: 1977–9. English Heritage Archaeological Report 17. London: English Heritage.
Wurz, S., le Roux, N. J., Gardner, S. and Deacon, H. J. 2003: Discriminating between the end products of the earlier Middle Stone Age sub-stages at Klasies River using biplot methodology. Journal of Archaeological Science 30, 1107–1126.