Articles and research related to academic publishing

Why do journals insist that data ‘are’?

Given the controversy over this grammatical point, I argue that journal style guides should allow both ‘data is’ and ‘data are’.

I was recently directed (via @blefurgy and @deb_lavoy on Twitter) to an old blog post on something that frequently bugs me: the question of whether the word ‘data’ is singular or plural. The post, by Norman Gray, an astronomical data management researcher at Glasgow University, UK, dates from 2005 but I haven’t seen a better one on the topic. Gray argues that:

…the word ‘data’, in english, is a singular mass noun. It is thus a grammatical and stylistic error to use it as a plural.

Plural use is barbaric: amongst other crimes, it is a deliberate archaism, and thus a symptom of bad writing.

Strong stuff.

An alternative view is given by Peter Coles (@telescoper), another astronomer, based at Cardiff University, UK, who also explains the issue clearly:

For those of you who aren’t up with such things, English nouns can be of two forms: “count” and “non-count” (or “mass”). Count nouns are those that can be enumerated and therefore have both plural and singular forms: one eye, two eyes, etc. Non-count nouns (which is a better term than “mass nouns”) are those which describe something which is not enumerable, such as “furniture” or “cutlery”. Such things can’t be counted and they don’t have a different singular and plural forms. You can have two chairs (count noun) but can’t have two furnitures (non-count noun)…

…Norman Gray asserts that (a) “data” is a non-count noun and that (b) it should therefore be singular.

I tend to look and listen out for instances of ‘data’, and I have very rarely heard someone say ‘the data are’ in natural speech. As Gray says:

The majority of writers who would dutifully pluralise ‘data’ in writing naturally and consistently use it as a mass noun in conversation: they ask how much data an instrument produces, not how many; they talk of how data is archived, not how they are archived; they talk of less data rather than fewer; and they always talk of data with units, saying they have a megabyte of data, or 10 CDs, or three nights, and never saying ‘I have 1000 data’ and expecting to be understood.

You may wonder why this matters at all. Well, practically every scientific paper contains the word ‘data’ somewhere, and all the journals I edit for insist that it be treated as plural every time. I spend a ridiculous amount of my editing time looking out for instances of ‘the data is’ and similar. And they can’t be found automatically using a macro, either, because the subject and verb can be separated by other words, or the verb may be something else like ‘shows’ or ‘illustrates’ rather than ‘is’. This is a ‘mistake’ that a lot of authors make.
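To illustrate the difficulty, here is a minimal Python sketch of the kind of pattern search a macro might attempt. It is only an illustration under stated assumptions: the verb list, the four-word gap limit and the function name `flag_singular_data` are all hypothetical choices of mine, not part of any journal's tooling.

```python
import re

# Hypothetical, deliberately short list of singular verb forms.
# A realistic list would need to be far longer and would still miss cases.
SINGULAR_VERBS = ["is", "was", "has", "shows", "illustrates", "suggests", "indicates"]

# Assumed limit on how many words may sit between 'data' and its verb.
MAX_GAP = 4

# 'data', then up to MAX_GAP intervening words, then one of the listed verbs.
PATTERN = re.compile(
    r"\bdata\b(?:\s+\w+){0,%d}?\s+(?:%s)\b" % (MAX_GAP, "|".join(SINGULAR_VERBS)),
    re.IGNORECASE,
)


def flag_singular_data(text):
    """Return snippets where 'data' may be taking a singular verb."""
    return [match.group(0) for match in PATTERN.finditer(text)]


print(flag_singular_data("The data is clear."))
# ['data is']
print(flag_singular_data("The data from the second run shows a peak."))
# ['data from the second run shows']
print(flag_singular_data("The quality of the data is poor."))
# ['data is']  (false positive: the subject is 'quality', not 'data')
print(flag_singular_data("The data clearly demonstrates an effect."))
# []  (missed: 'demonstrates' is not in the verb list)
```

The last two calls show why such a search can only produce candidates for a human to check: it flags a sentence in which ‘data’ is not the subject at all, and it misses any verb that is not on the list.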

So is ‘data is’ really a mistake? Are the journals right to insist on this change?

The argument from etymology

The main argument used for ‘data are’ is that the word is derived from a plural Latin word. Gray dismantles this thoroughly by showing that it never was a simple plural in Latin. It is:

…the neuter plural past participle of the first conjugation verb dare, ‘to give’ (it’s actually also the feminine singular past participle, but that really, really, doesn’t matter).

…there was almost certainly no latin word for the concept that we now identify by the english word ‘data’….

…Put another way, that means that the word ‘data’, as a technical term referring to the ore of observations, which can be painstakingly reduced to extract knowledge, is not a latin word at all. It’s a native english word with a latin past, which means, bluntly, that we get to choose how to use it, and if its meaning changes over time – as it has – then its grammatical analysis can reasonably and properly migrate also.

I find this a convincing argument. It reminds me of the pedants who don’t like split infinitives (‘to boldly go’) because Latin infinitive verbs couldn’t be split, which is pretty irrelevant to how we should treat them in English (see Wikipedia for current views on this issue).

Gray goes on to compare ‘data’ with other similar Latin-derived words, such as ‘agenda’, ‘stamina’, ‘media’ and ‘phenomena’. ‘Stamina’ is at one end of a spectrum: its Latin singular (‘stamen’) is never used except in a specialist botanical sense, and ‘stamina’ itself is treated as a singular noun. ‘Phenomena’ is at the other end – the singular ‘phenomenon’ is frequently used and ‘phenomena’ is a plural noun. ‘Agenda’ is almost the same as ‘stamina’ but the singular ‘agendum’ just about makes sense (although ‘agenda item’ would be more usual). ‘Media’ is moving from being a plural of ‘medium’ to being a separate singular noun in its own right. Gray says:

In this spectrum (not ‘spectra’, of course), ‘data’ is clearly located near ‘agenda’.

I would agree with this assessment on the whole, though I disagree with Gray that ‘datum’ is ‘certainly not one of the things that makes up data’. But, just as ‘agenda item’ is more usual than ‘agendum’, the more commonly used term here would be ‘data point’.

In fact, there is a technical use of the word ‘datum’, which Gray has dug out: it is a surveying term. But the plural of this usage of ‘datum’ is ‘datums’, not ‘data’.

Peter Coles doesn’t in fact completely agree with the journal publishers’ stipulation that ‘data’ is never singular – rather, he argues that there are contexts in which the plural use makes sense, and others in which singular use is better:

“If I had less data my disk would have more free space on it.” (Non-count)

“If I had fewer data I would not be able to obtain an astrometric solution.” (Count).

I’m fine with this distinction if people want to use it. But why, then, should journals insist that the singular use is incorrect?

A proposal: stop being prescriptive about data

You may or may not agree with Norman Gray (and me) that ‘data are’ is incorrect. But you can surely agree that there is controversy about the issue. The reasons to insist on plural data are hotly contested, to say the least.

So I propose that publishers remove the stipulation in their style guides that ‘data is’ is incorrect and should be changed to ‘data are’. In fact there is no need to be prescriptive on the issue at all: if the author writes ‘data are’, it can stay, but if they write ‘data is’, that can stay too. This would save copyeditors a significant amount of time, both in searching for and replacing ‘data is’ and in arguing the point with authors. It would probably save authors some time and annoyance too. And it would also make journals look more modern in this age of terabytes of data.

Who is going to be the first publisher to take a leap into the unknown? You have nothing to lose but your fuddy-duddy reputation.

Your opinions

Grammatical issues like this usually generate more heat than light, so I expect there will be comments on this post. I would particularly like to hear from journal editors who have been involved in discussions about this issue for their style guides, and from authors who have railed against the ‘data are’ rule imposed by a journal. I reserve the right to remove comments that simply rehash old arguments or only say that one or other construction is ‘ugly’ or ‘just wrong’.
