corpus linguistics – Paola Trimarco

I’m currently reading Caroline Criado Perez’s wonderful book Invisible Women: Exposing Data Bias in a World Designed for Men. She addresses many issues convincingly, such as the way drivers’ seats in cars are designed and safety-tested with men in mind, and the amount of medical research that uses men as the default, leaving women’s health and medicine in the Middle Ages. Statistics and studies are blended with entertaining – though often infuriating – anecdotes.

But I do have a bone to pick. After making some good points about the male-voice bias in voice-recognition databases, Criado Perez tackles corpora of written texts, which she notes are used by translators, CV-scanning software and web search algorithms. She fails to mention that these corpora were compiled by linguists, who are the main users of these databases for language research. Because she misses this point, her own research using corpora comes up short. This is what she did:

‘Searching the BNC [British National Corpus] (100 million words from a wide range of late twentieth century texts) I found that female pronouns consistently appear at around half the rate of male pronouns. The 520-million-word Corpus of Contemporary American English (COCA) also has a 2:1 male to female pronoun ratio despite including texts as recent as 2015.’

From this, Criado Perez criticises the ‘gap-ridden corpora’ for giving ‘the impression the world is actually dominated by men.’

As someone who has used both corpora, I have a problem here. Representativeness is always taken into account when drawing data from these large corpora. It is as much a part of the discussion as the results of the research itself. If I were looking at gendered pronoun use, I would first restrict my search to newspapers, where I would expect the ratio of male to female pronouns to be even higher than the one Criado Perez found across all text types. Newspapers are not only written mostly by men, but also report and comment on the world around us, with its predominantly male politicians and public figures. And then there are the sports pages, where women’s sports struggle to get even a tenth of the column inches given to men’s sports. That is, newspapers, one of the main sources in the BNC, skew the figures. It might be more accurate to say that the world of newsprint is ‘actually dominated by men.’
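To make the point about text types concrete, here is a minimal sketch in Python of how a male-to-female pronoun ratio could be broken down by text type rather than reported for a whole corpus. It does not query the BNC or COCA. The function names, the pronoun lists and the toy sentences are all my own assumptions, invented purely for illustration, and the numbers they produce say nothing about any real corpus.

```python
import re
from collections import Counter

# Assumed pronoun sets for this sketch; a real study would justify its own list.
MALE = {"he", "him", "his", "himself"}
FEMALE = {"she", "her", "hers", "herself"}


def pronoun_counts(text):
    """Count male and female pronouns in one text (simple lowercase word match)."""
    counts = Counter()
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in MALE:
            counts["male"] += 1
        elif token in FEMALE:
            counts["female"] += 1
    return counts


def ratio_by_text_type(samples):
    """Map each text type to (male count, female count, male/female ratio)."""
    result = {}
    for text_type, texts in samples.items():
        total = Counter()
        for text in texts:
            total.update(pronoun_counts(text))
        male, female = total["male"], total["female"]
        # The ratio is undefined when no female pronouns occur.
        result[text_type] = (male, female, male / female if female else None)
    return result


# Invented toy sentences, NOT corpus data.
toy_samples = {
    "newspaper": [
        "He said his party would win. The minister defended himself.",
        "She scored twice, but he lifted the trophy.",
    ],
    "fiction": [
        "She opened her letter while he waited.",
        "Her sister laughed at herself.",
    ],
}

for text_type, (male, female, ratio) in ratio_by_text_type(toy_samples).items():
    print(text_type, male, female, ratio)
```

Reporting the ratio per text type, alongside how much of the corpus each type makes up, shows whether one genre is driving the overall figure.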

Furthermore, corpus research is not just about frequency – it’s also about the contexts in which search terms appear. For example, a search on the word ‘hysterical’ will show that it often appears in the context of ‘she’ or of a woman mentioned by name. This for me is more telling than the frequency of ‘she’ in printed texts. There is so much more to learn about gendered pronouns from a more rigorous search. The conclusions might then reflect the biases in our societies more than the biases in the collection of data.

Figure: ‘hysterical’, taken from a quick search on WebCorp of internet texts.
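The kind of context search described above can be sketched in a few lines of Python: a keyword-in-context (KWIC) listing plus a count of the words that fall within a few tokens of the search term. This is not how WebCorp or any other concordancer is implemented. The function names, the window size of four tokens and the toy text are assumptions made up for this illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def kwic(tokens, keyword, window=4):
    """Return (left context, keyword, right context) for every hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            lines.append((left, token, right))
    return lines


def collocates(tokens, keyword, window=4):
    """Count the words that occur within `window` tokens of the keyword."""
    counts = Counter()
    for left, _, right in kwic(tokens, keyword, window):
        counts.update(left + right)
    return counts


# Invented toy text, NOT taken from any corpus.
toy_text = (
    "She became hysterical when the train left. "
    "The crowd said she was hysterical again. "
    "He was calm, and the film was hysterical."
)

tokens = tokenize(toy_text)
for left, key, right in kwic(tokens, "hysterical"):
    print(" ".join(left).rjust(30), f"[{key}]", " ".join(right))

nearby = collocates(tokens, "hysterical")
print("she:", nearby["she"], "he:", nearby["he"])
```

Note that this sketch lets the window run across sentence boundaries, so a pronoun from the next sentence can be counted as a neighbour. A more careful search would split on sentences first, and would read the concordance lines rather than rely on counts alone.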

Okay, I’ve had my linguist’s rant and I don’t wish to labour the point. Many of the studies in this book – and it is an avalanche of studies – are thoroughly considered against other studies, often revealing gaps in data, where sex difference hasn’t been taken into account, or where it has, women have been deliberately and shamefully excluded.


Coming to Terms with Invisible Women