Category Archives: CLM

Review of CLGM in Facta Universitatis

Vladan Pavlović reviewed Corpus Linguistics: A Guide to the Methodology for the University of Niš’s open-access journal Facta Universitatis, Series: Linguistics and Literature. The issue containing the review is freely accessible here.

It is a positive review, concluding that “[t]his book represents a valuable source for students and others interested in corpus linguistics, and an excellent starting point for delving further into the area” — which is exactly what I intend it to be!

Review of CLGM in Lingua

Zhen Dong and Fan Pan reviewed Corpus Linguistics: A Guide to the Methodology for Lingua. It is behind the Elsevier paywall here.

It is a very positive review, emphasizing three areas in which the reviewers see the book as particularly valuable: first, the extensive introduction to statistical thought and practice; second, the case studies drawing on a broad range of linguistic phenomena; and third, the focus on reproducibility (something I want to expand on in potential future editions).

The reviewers also raise two critical points that will be useful for me when working on future editions:

  1. They criticize the book for not including a chapter on how to construct specialized corpora in cases where available corpora do not meet the needs of a particular research project. They are right: although Section 2.1 talks about the design of “representative” or “balanced” corpora at length, it does not discuss the design of corpora for specific research projects, nor does it give any practical advice. Part of the reason for this is that I felt that a section (or even a chapter) on this topic would have to raise not only practical issues (where to find texts, how to store and process them, etc.), but also legal issues (how to deal with copyrighted texts). The latter issue, apart from the fact that it is beyond my expertise, is very dependent on the researcher’s jurisdiction, which makes it difficult to discuss in general terms. However, I will certainly think about including such a section in potential future editions. In the meantime, I can only recommend Martin Wynne’s excellent open-access book Developing Linguistic Corpora: a Guide to Good Practice, on which my potential chapter would largely be based!
  2. They criticize the absence of any reference to specific software tools for concordancing and statistical analysis. I do see their point (as I saw Kevin Gerigk’s point concerning my focus on manual statistical analysis when there are tools that will do some of this analysis for you). I just feel that, given the quick pace of software development, a textbook that builds on specific software tools will be outdated too quickly. A discussion of software tools is one of the things that this blog is meant to provide, if only I had time…

Drawing syntax trees with R

A while back I was looking into treebanks (something that a future edition of CLGM should probably spend more time on than the current one, which basically just points out that they exist). I created some small treebanks, trying out different parsers and manually correcting their output. In order to find errors in the parse, I used Yoichiro Hasebe’s great online tool RSyntaxTree – so named because it is written in Ruby, not, unfortunately, in R – to visualize the trees.

Then it struck me how great it would be if I could actually use R to draw the trees for me instead. I looked around for a package that would do this, and I don’t remember if I couldn’t find one or if I just didn’t like what I found. Anyway, I decided to come up with a way on my own – and I did, relying almost exclusively on existing packages. This post describes how.

Exercise: Perfect Progressive Copular Constructions

I’m working on a new case study on statistically underrepresented constructions to complement (or perhaps replace) the case study on negative evidence in Section 8.2.2.3 of CLGM. The case study involves perfect progressive passive constructions (inspired by the broader case study on progressive passives in Manfred Krug and Julia Schlüter’s Research Methods in Language Variation and Change, Cambridge, 2013). It is a complex case study and I’m not sure it will lead to anything, but it has yielded a by-product that might make an interesting exercise.

To get a first overview of perfect progressive passives, I did what Krug and Schlüter (and others) have done, and simply queried the BNC for the sequence “been being” (the CQP query I used was ⟨[word="been"%c] [pos="AV."]? [word="being"%c]⟩, allowing for the potential occurrence of an adverb). This yielded six hits.
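For readers without access to CQP, the logic of the query can be approximated in a few lines of Python. This is only a sketch, not the way the BNC was actually queried: the function name `find_been_being`, the representation of a text as a list of (word, tag) pairs, and the two example sentences are all assumptions made for illustration. The tags are meant to follow the CLAWS5 pattern, in which adverb tags begin with AV.

```python
import re

# Mirrors the CQP condition [pos="AV."]: "AV" plus exactly one character.
ADVERB_TAG = re.compile(r"AV.")


def find_been_being(tokens):
    """Return (start, end) index pairs for 'been' (adverb)? 'being'.

    tokens is a list of (word, pos) pairs. Word comparison ignores case,
    which is what the %c flag does in the CQP query.
    """
    hits = []
    for i, (word, _pos) in enumerate(tokens):
        if word.lower() != "been":
            continue
        j = i + 1
        if j >= len(tokens):
            continue
        # Variant with the optional adverb slot filled.
        if (ADVERB_TAG.fullmatch(tokens[j][1])
                and j + 1 < len(tokens)
                and tokens[j + 1][0].lower() == "being"):
            hits.append((i, j + 1))
        # Variant with 'being' directly after 'been'.
        elif tokens[j][0].lower() == "being":
            hits.append((i, j))
    return hits


# Invented, hand-tagged example sentences (not BNC data).
plain = [("She", "PNP"), ("has", "VHZ"), ("been", "VBN"),
         ("being", "VBG"), ("silly", "AJ0")]
with_adverb = [("It", "PNP"), ("has", "VHZ"), ("been", "VBN"),
               ("constantly", "AV0"), ("being", "VBG"), ("watched", "VVN")]

print(find_been_being(plain))        # [(2, 3)]
print(find_been_being(with_adverb))  # [(2, 4)]
```

Each hit is reported as the index of “been” and the index of “being”, so the surrounding context can be sliced out of the token list for manual inspection.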

Review of CLGM in the IJCL

Kevin Gerigk has reviewed Corpus Linguistics: A Guide to the Methodology for the International Journal of Corpus Linguistics. The review is open access, so you can read it here.

The review is very useful, because it draws attention to ways in which a future edition of the book might be improved. I would like to respond very briefly to three issues raised in the review.

Precision and recall

Precision and recall are discussed in Section 4.1.2 of “Corpus Linguistics: A Guide to the Methodology” (CLGM) (pp. 111–116). Students seem to have more difficulty understanding these concepts than I would have expected, so I looked around to see how other people have explained them. A fishing metaphor is often used, which I quite like, as it has the potential to explain many other aspects of corpus linguistics. So I decided to write my own version of such a metaphorical explanation, which I may or may not include in a future edition of CLGM.
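In case a worked calculation helps alongside the metaphor: precision is the share of retrieved hits that are actually instances of the phenomenon under investigation, and recall is the share of all instances in the corpus that the query managed to retrieve. The following Python sketch computes both from raw counts; the function name and the numbers are invented for illustration and are not taken from CLGM.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Compute precision and recall from hit counts.

    true_positives:  retrieved hits that are instances of the phenomenon
    false_positives: retrieved hits that are not instances
    false_negatives: instances in the corpus that the query missed
    """
    retrieved = true_positives + false_positives
    relevant = true_positives + false_negatives
    # Guard against division by zero for empty result sets.
    precision = true_positives / retrieved if retrieved else 0.0
    recall = true_positives / relevant if relevant else 0.0
    return precision, recall


# Hypothetical query: 100 hits, of which 80 are genuine instances,
# while 120 further instances in the corpus were not retrieved.
precision, recall = precision_recall(80, 20, 120)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
# precision = 0.80, recall = 0.40
```

With these made-up counts, most of what the query returns is relevant (high precision), but it finds fewer than half of the instances that exist (low recall).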

Review of CLGM in the Časopis pro moderní filologii

Lucie Lukešová reviewed Corpus Linguistics: A Guide to the Methodology for the Časopis pro moderní filologii last year. If you read Czech or if you, like me, are willing to trust Google Translate, you can read the text here.

The review is very positive overall, concluding as follows: “I dare say that the author succeeded in what he set out to do – to create a textbook that was lacking in the market. It is full of information, and yet the reader does not find themselves lost or overwhelmed. That is why I am happy to recommend it not only to all my students, but also to colleagues who, like me, sometimes need a reliable beacon (and sometimes a lifeline) in the stormy waters of corpus data.”

I sometimes dream of living in an old lighthouse on the Baltic Sea coast – it will always remain a dream, as I am very much an urbanite who gets nervous when he is more than a few hours away from a major city, but it certainly lets me appreciate Lucie’s maritime metaphor!

A message to my readers

My open-access textbook Corpus Linguistics: A Guide to the Methodology, which took me 15 years to write, was finally published in early 2020, just as the COVID pandemic hit the world.

I had planned to launch the book together with a companion website containing additional resources, study questions, exercises and the like, but like many colleagues, I was overwhelmed by the sudden COVID-induced task of moving my teaching and my administrative duties online, which disrupted all of the comfortable work routines I had adopted to leave time for things like research, family life, and setting up companion websites for textbooks.

As I do not see an end to the pandemic, let alone to the disruptions it has caused, I have decided to launch the website in blog form. On the one hand, this format is more modest than what I had originally planned, as it means that the website will remain perpetually incomplete, growing toward a more complete version of itself post by post whenever I find the time.

On the other hand, this format is more aspirational than I had originally envisioned, as enough time has passed since the publication of the book for me to start thinking about where it might be improved, and this blog will be a place not only for the exercises and study questions I had originally planned, but also (or perhaps, instead) for revisions and additional material to be included in a second edition (which, however, should not be expected for at least another three years, so please keep using the current edition)!