Computers & Texts No. 12
A. P. Berber Sardinha
Applied English Language Studies Unit
University of Liverpool
WordSmith Tools is a recent release from Oxford University Press which assists in the analysis of either a single text or a large corpus. It features the usual concordancer and frequency and alphabetical list makers, as well as other innovations likely to please both those wanting a Windows interface for the traditional concordancer and those seeking new tools. The program is described as a 'Swiss Army knife' of lexical analysis, and its major aim is to study 'how words behave in texts'. In this review I intend to give a brief overview of its analytical tools (Concordancer, WordLister, and KeyWords) and then discuss some possible applications of the program.
An advantage of this program is that it is available on the Internet. The 'official' version can be downloaded from the Oxford University Press homepage; a more recent version can be obtained from the author's own homepage. The program is downloaded as a demonstration version; the full registered version can be unlocked on purchase of a licence number from OUP.
The Concordancer has a generous capacity: it can cope with up to 16,000 lines, ten times MicroConcord's limit. The creation of a concordance is straightforward. A search word is entered, the texts defined, and 'go' selected. The Concordancer is integrated with the other WordSmith tools: double-clicking on a word within another tool will call up a concordance for that word.
The main analytical tool is the WordLister, whose principal function is to create and maintain alphabetical and frequency lists of words. The user can choose to create an ordinary one-item-per-entry list, which is interesting in itself, or, in my opinion, the much more exciting (though occasionally difficult to create) 'cluster' word list, in which the entries are made up of sequences of words as they appeared in the texts. The length of the clusters can be determined by the user. The label 'cluster' was preferred by the author to 'phrase' because the former does not imply a 'grammatical relation'. This is not a trivial distinction: although many clusters will be self-standing expressions (e.g. 'a trivial distinction' in this sentence), more often than not they will not be (e.g. 'than not they'). The cluster list does not, however, cross sentence boundaries. This option is particularly relevant, for example, to language teaching, where lexical phrases rather than individual words are recognised as the building blocks of vocabulary. Thus, not only language teachers but also students will find in 'WordList clusters' a very powerful tool for exploring the combinatory properties of vocabulary.
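The cluster idea can be sketched in a few lines of Python. The sentence-splitting and tokenising rules below are simplifications of my own, not WordSmith's; the point is only that clusters are contiguous word sequences, counted within but never across sentence boundaries.

```python
import re
from collections import Counter

def cluster_list(text, n=3):
    """Count n-word 'clusters' (contiguous word sequences), never
    letting a cluster cross a sentence boundary."""
    counts = Counter()
    # Naive sentence split on ., ! and ? (an illustrative simplification).
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[a-z']+", sentence)
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

clusters = cluster_list("The cat sat on the mat. The cat sat by the door.")
# 'the cat sat' occurs in both sentences, while 'the mat the' is never
# counted because it would span the sentence boundary.
```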
Once the user has the list or lists they need, it is often necessary to account for different word forms: for instance, should 'walk' and 'walks' be conflated as a single entry or remain separate? WordLister offers a lemmatisation feature for such cases, allowing entries to be joined either manually or automatically. With or without lemmatisation, it is possible to examine the word lists for interesting characteristics such as frequency, the presence (or absence) of words, the proportion of different words to the total number of words, and so on. In relation to the latter, sometimes referred to as 'lexical density', WordLister offers two type-token ratios. The first is the traditional division of different words (types) by running words (tokens), expressed in WordSmith as a percentage rather than as a proportion. The second differs from the traditional method by first extracting a type-token ratio for individual blocks of text and then calculating the mean ratio across all blocks, rather than considering the whole text at once. The statistic is expressed as this average (again as a percentage), and users should be aware of the difference before attempting to use it. The size of the blocks is defined by the user (the default is 1,000 words). Texts which are shorter than the given block size will be reported as having a zero type-token ratio.
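The second, blockwise statistic can be illustrated with a short Python sketch (my own reconstruction of the behaviour described above, not WordSmith's code): a ratio is taken over each complete block of running words, and the mean of those ratios is reported as a percentage, with texts shorter than one block falling back to zero.

```python
def mean_tt_ratio(words, block=1000):
    """Blockwise type-token ratio: cut the running words into consecutive
    blocks, take types/tokens for each complete block as a percentage,
    and return the mean.  A text shorter than one block yields 0."""
    ratios = []
    for start in range(0, len(words) - block + 1, block):
        chunk = words[start:start + block]
        ratios.append(100 * len(set(chunk)) / block)
    return sum(ratios) / len(ratios) if ratios else 0.0
```

Because the mean is taken over fixed-size blocks, the result is comparable across texts of different lengths, which is precisely the weakness of the traditional whole-text ratio.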
It can be interesting to discover whether words in one list are also present in another list. There are two main ways to do this. The first is by typing a list of words one is interested in and then using the 'Match List' facility. This will mark the words in the list (with a tilde) and will produce a ratio under the 'Statistics' menu. A useful application of the 'Match List' ratio is for obtaining the proportion of open-set words to the total of words in the text (e.g. Eggins 1994) by specifying a match list that contains grammatical words. Of course, the difficulty here is to provide a list that is as comprehensive as it is unambiguous, which is nearly impossible given homonymous words such as 'like' (verb, adverb, preposition and conjunction). Significantly, this feature compares well with Paul Nation's Vocabulary Profile program, so teachers can use it to explore vocabulary richness with their students (see Laufer and Nation 1995).
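A rough Python sketch of the 'Match List' ratio follows; the tiny grammatical word list is purely illustrative, and, as noted above, no real list can avoid ambiguous items like 'like'.

```python
def open_class_ratio(words, grammatical_words):
    """Percentage of running words NOT found in the match list of
    grammatical (closed-class) words: a rough open-class proportion."""
    grammatical = {w.lower() for w in grammatical_words}
    matched = sum(1 for w in words if w.lower() in grammatical)
    return 100 * (len(words) - matched) / len(words)

words = "The cat sat on the mat".split()
ratio = open_class_ratio(words, ["the", "a", "of", "on", "and"])
# Three of the six running words match the grammatical list,
# so half the text is counted as open-class.
```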
The second way two lists can be compared is through the 'Compare versions' function. The user specifies another WordSmith word list against which they want to compare the current word list. It is not possible to choose precisely which lists will be compared; all open lists will be compared. The output is a table showing the frequencies of all words of the various lists combined and their frequencies in each list. This facility has found its way into EFL research for the investigation of naturalness in student writing (Berber Sardinha 1996).
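The shape of that output can be sketched as follows (an illustrative reconstruction, assuming each word list is a simple word-to-frequency mapping): every word of the combined lists appears once, with its overall frequency and its frequency in each individual list.

```python
from collections import Counter

def compare_versions(word_lists):
    """Merge several word-frequency lists into one table: each row holds
    a word, its combined frequency, and its frequency in each list."""
    combined = Counter()
    for wl in word_lists:
        combined.update(wl)
    return [(word, total, *[wl.get(word, 0) for wl in word_lists])
            for word, total in combined.most_common()]

table = compare_versions([
    {"the": 3, "cat": 1},   # word list A
    {"the": 2, "dog": 1},   # word list B
])
# The most frequent combined word heads the table, with per-list counts.
```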
Fig 1. Sample screens from WordSmith Tools showing wordlists, a concordance, and detailed statistics.
Unlike the WordLister and the Concordancer, which perform mostly well-known functions, the KeyWords utility is intended for less well-established uses. It is described as providing 'a useful way to characterise a text or genre'. Keyword here is defined by frequency. Thus, a word will be key if its frequency is either unusually high (positive keyword) or unusually low (negative keyword). Keyness is obtained by statistical comparison; frequent and infrequent keywords will have occurred more or less often than expected by chance.
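As a hedged illustration of what such a statistical comparison involves (a generic 2x2 chi-square, not necessarily WordSmith's exact formula), keyness can be computed from a word's frequency in each corpus and the two corpus sizes:

```python
def keyness_chi_square(freq_target, size_target, freq_ref, size_ref):
    """2x2 chi-square comparing a word's observed frequencies in a target
    and a reference corpus with those expected if the word were equally
    common in both.  The larger the value, the more 'key' the word;
    comparing the two observed rates shows whether it is a positive or a
    negative keyword."""
    observed = [
        [freq_target, size_target - freq_target],  # target: word / other words
        [freq_ref, size_ref - freq_ref],           # reference corpus
    ]
    grand = size_target + size_ref
    chi2 = 0.0
    for i, row_total in enumerate((size_target, size_ref)):
        for j in range(2):
            col_total = observed[0][j] + observed[1][j]
            expected = row_total * col_total / grand
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2
```

A word occurring at the same rate in both corpora scores zero; the score grows as the rates diverge in either direction.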
A 'keywords' analysis normally involves at least two WordList files. Typically, one will be the target text or texts (the text under consideration), and the other the reference text or texts, but one can simply compare two individual texts. The author of the software has put a huge 'single-word' list of newspaper stories online, containing about 70 million words, which will be excellent as a reference corpus (http://www.liv.ac.uk/~ms2928/homepage.html).
A key words analysis need not involve only two files, though. The program can also handle multiple comparisons, that is, many target files against a single reference file. This type of comparison is done through the program's 'batch processing'. Optionally, a stop list may be specified to weed out the most common words, such as 'the' and 'of'.
Once the program has processed the comparison, which it does fairly quickly, it displays a table containing the words which are 'key', together with their frequencies in the two files and some statistical information (chi-square and p value, if the right conditions are met; see Scott 1996). Although the key words list is useful, the key words plot is more exciting, producing a diagram showing the distribution of the keywords within the text (see figure 2).
Fig 2. KeyWords plot in WordSmith showing distribution of keywords within a newspaper report from The Independent.
The separate key word outputs lead naturally to the question, 'which is the most recurrent of these key words?' The KeyWords program can give a little extra help here because it can compute the number of files in which each key word was key. This is accomplished by means of the 'key keywords' option, which picks out those keywords which occurred at least twice and then lists the percentage of files in which they were key. Thus, a word which was key in at least two texts will be a key keyword. This concept has found an interesting application in the comparison of the testimony of major witnesses in the OJ Simpson trial (Berber Sardinha 1995). First, keywords were obtained for the various kinds of examination, for example 'direct', 'cross', 'redirect', 'recross', etc. Then key keywords were extracted, namely those which were key in most examinations. Finally the defence witness's key keywords were compared with those for the prosecution witness. The results indicated consistent choices of keywords over the length of each witness's testimony. Thus, one of the ways key keywords may be interpreted is as markers of the consistency of one's style or stance.
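The key keywords computation itself is simple to sketch in Python (the example keywords below are illustrative, not data from the study):

```python
from collections import Counter

def key_keywords(keyword_sets):
    """Given the set of keywords from each of several texts, return the
    words that were key in at least two texts, with the percentage of
    texts in which each was key."""
    counts = Counter()
    for kws in keyword_sets:
        counts.update(set(kws))
    n = len(keyword_sets)
    return {w: 100 * c / n for w, c in counts.items() if c >= 2}

kk = key_keywords([
    {"glove", "blood", "tape"},   # keywords from, say, direct examination
    {"glove", "blood"},           # cross-examination
    {"glove", "timeline"},        # redirect
])
# 'glove' was key in all three texts; 'tape' and 'timeline', key in only
# one text each, are dropped.
```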
Given its range of applications, WordSmith Tools will soon find its way into the classroom. Of all tools, perhaps it is the Concordancer which can most readily lend itself as a teaching or learning instrument. Those who have been using OUP's older product 'MicroConcord' should experience no problems using the new WordSmith Concordancer (assuming a familiarity with MS-Windows). The Concordancer online help makes a few suggestions concerning its use for teaching (for example, generating concordances to establish word usage). In addition, teachers could use the program to prepare vocabulary learning activities by blanking out the search word and then using the printed concordance as a means for students to look for patterns of word association (e.g. Johns 1994).
The publication of Cobuild's new book on verb patterns (Francis, Hunston, and Manning 1996) suggests that lexical patterns will become part of foreign language teaching methodology. A larger number of teachers and students are likely to start looking for patterns in their own texts. The 'cluster' facilities of WordSmith will be of help and, to a lesser extent, the KeyWords 'associates' and 'clumps'. Cobuild's book views lexical patterns as co-occurrence within a narrow distance, while the WordSmith 'associates' and 'clumps' address co-occurrence within the same text or group of texts. The KeyWords sense of co-occurrence, therefore, is in many ways similar to the old meaning of 'collocation', advocated by Firth, Sinclair, and Halliday (Scott 1996), which is different from the contemporary meaning of collocation as words which co-occur within a four- or five-word span.
There are a few problems with the WordSmith interface which are not serious and are likely to be eliminated in future versions. There are occasional 'General Protection Fault' crashes within particular WordSmith applications which can result in having to quit the whole WordSmith shell before restarting the application.
The number of windows which can be open at any one time can be a mixed blessing. The alt-tab keys do not rotate through all of the open windows, which means that some shuffling of windows with the mouse may be necessary in order to locate a specific one. However, this is better than not being able to open as many windows at once.
Other users may find portability to be an area which could be improved upon. One may, for example, want to transfer WordSmith data to another application such as SPSS or MS Excel. I have had trouble transferring the data from KeyWords plot links to a spreadsheet: the links are shown word by word in individual windows, so I had either to save each one as text, clean it up, and import it, or simply type in its contents.
Related to this is the current impossibility of extracting the numerical data on which the KeyWords plot is based. At present a graphic image may be obtained but not the actual locations in numbers, which would be essential for a more precise investigation of the distribution of keywords.
Overall, the problems I have faced are few and mostly related to more advanced uses of the data generated by WordSmith, and not ones likely to affect regular use of the tools.
In this review I have tried to present a broad outline of WordSmith Tools. A few features were not covered in depth (e.g. 'links', 'key word databases', 'associates'). The interface as a whole is well organised and intuitive (compare Lexa's rather awkward layout), and getting started simply requires basic knowledge of ordinary Windows commands. Its speed is also appreciable: reportedly a 4.2 million word corpus takes 20 minutes to be processed into a word list on a 486-33 computer, and less than 4 minutes on a Pentium-100 machine. The examples of applications given in this review indicate that WordSmith adapts to the required uses rather than dictating them: a strong point in its favour, and one in keeping with its description as a 'Swiss Army knife' of lexical analysis.
A. P. Berber Sardinha (1995). 'The OJ Simpson trial: Connectivity and consistency'. Paper presented at the BAAL Annual Meeting, Southampton, UK, 14
A. P. Berber Sardinha (1996). 'EFL writing assessment and Corpus Linguistics: The 'sound right' factor'. Paper presented at the Applications of Corpus Linguistics Seminar, 17 April 1996, Aston University, Birmingham, UK.
Suzanne Eggins (1994). An Introduction to Systemic Functional Linguistics. London: Pinter.
Gill Francis, Susan Hunston, and Elizabeth Manning (1996). Verbs. (Grammar Patterns, 1.) London: HarperCollins/Cobuild.
T. Johns (1994). 'From printout to handout: Grammar and vocabulary teaching in the context of Data-driven learning'. In Perspectives on Pedagogical Grammar. (Ed: T. Odlin) Cambridge University Press, 293-313.
Batia Laufer and Paul Nation (1995) 'Vocabulary size and use: Lexical richness in L2 written production'. Applied Linguistics 16, 307-322.
M. Scott (1996) 'PC Analysis of key words-and key key words'. Unpublished manuscript.
Computers & Texts 12 (1996), 19. Not to be republished in any form without the author's permission.