CHAPTER ELEVEN - PROCESSING LARGE TEXT CORPORA

This section focuses on exploiting large files of linguistic material, using the commands already covered together with a number of new ones.

Compressed files

Large files are often compressed to save disk space. If this is the case, the file must be restored to its original format before you can do anything with it. A popular compression command is called, simply, compress. The command:

% compress filename

will cause the file to be replaced by a compressed file with a .Z suffix. The command uncompress will cause it to revert to its original format. It is often not necessary to uncompress a file on disk in order to use it. In fact, the file will often be owned by someone else, and you would have to copy it and then uncompress it, using up a great deal of disk space and processor time. It is often better to use zcat, which sends the uncompressed contents of a compressed file to the standard output while leaving the compressed version of the file in the filestore.
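
As a minimal sketch of this approach, the function below counts the lines that match a pattern in a compressed file by piping the output of zcat into grep. The function name count_in_compressed and the file name in the usage line are made up for illustration.

```shell
# count_in_compressed PATTERN FILE
# Count the lines of a compressed FILE that contain PATTERN.
# zcat writes the uncompressed text to standard output, so the
# compressed file on disk is left untouched.
count_in_compressed() {
    zcat "$2" | grep -c -- "$1"
}
```

It might be called as: count_in_compressed bachelor corpus.Z (where corpus.Z is a hypothetical file name).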

PRACTICE

Try compressing and uncompressing some of your own files.

Find a large compressed file on your system and search it for some appropriate string using grep without uncompressing the file.

Some useful commands for processing text files

The following is a summary of some useful commands for processing text files. Some of them you have met already and some are new to you; both kinds have been included so that this section can easily be used for reference. Not all of these commands are standard Unix, so they may not all work in the way you expect (or at all) on your system. For the same reason, their syntax is somewhat inconsistent and some use different input and output conventions. Not all are included in the command summary in the appendix. See the relevant manual pages for more details.

sort sort into alphabetical order

sort -n sort into numerical order

sort -m merge sorted files into one sorted file

sort -r sort into reverse order (highest first)

sort -c check a file is already sorted

uniq remove duplicate lines (or partly-duplicate lines)

uniq -d output only duplicate lines

uniq -c count identical lines (or lines with identical fields)

grep find lines containing given string or pattern

grep -v find lines not containing given string or pattern

grep -c count lines containing given string or pattern

grep -n give line numbers of lines containing...

fgrep same as grep except that it does not recognise regular expressions

egrep same as grep except that it recognises full regular expressions (grep recognises only certain special characters)

wc -c count characters

wc -w count words

wc -l count lines

NOTE

wc -l file will output the number of lines in the file, and the file name.

wc -l < file just gives the bare line count.

head -17 output first 17 lines

tail -17 output last 17 lines

tail +30 output from line 30 onwards (written tail -n +30 on many current systems)

cut -f3 delete all but third field of each line

cut -f3,5 delete all but third and fifth fields of each line

cut -f3-5,7 delete all but 3rd, 4th, 5th, 7th fields of each line

cut -c2-4,6-8 delete all but 2nd, 3rd, 4th, 6th, 7th, 8th characters

cut -f2 -d":" deletes all but the second field where ":" is the field delimiter (tab is the default)

paste combines files horizontally; corresponding lines are appended

paste -d">" pastes with delimiter defined as ">" (tab is default). The special characters "\n" (newline) and "\0" (null string) may be used.

cat concatenates files vertically (appends files to one another)

cat -n precedes each line with a line number in the output

cat -b as above, but does not number blank lines

cat -s reduces any number of successive blank lines to one blank line

tr "abc-e" "kmx-z" translates a, b, c, d, e to k, m, x, y, z respectively.

tr -d "xy" deletes all occurrences of x and y

tr -s "a" "b" translates all a to b and reduces any string of consecutive b to just one b.

To work at the level of characters rather than fields, sed is the simplest tool for line-by-line processing. sed looks for patterns, so it is not well suited to working with column or field positions.
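
A small sketch of character-level processing with sed: the function below deletes some common punctuation marks from every line, a job that cannot be described in terms of fields. The name strip_punct and the particular set of characters removed are assumptions chosen for illustration.

```shell
# strip_punct [FILE...]
# Delete common punctuation marks from each line.
# With no FILE arguments, sed reads the standard input.
strip_punct() {
    sed 's/[.,;:!?"]//g' "$@"
}
```

Because the function reads the standard input when no file is named, it can be placed in the middle of a pipeline.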

uniq needs an already-sorted file. A common idiom is

sort | uniq

to produce a sorted list of all the different lines in a file. With the -c option, uniq pads its counts with leading spaces, so its output is awkward to use in a pipeline with another command such as cut.
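
The same idiom can be extended into a frequency list. The sketch below counts identical lines, puts the most frequent first, and uses sed to remove the leading spaces that uniq -c puts before each count. The function name freq_list is an illustrative choice.

```shell
# freq_list [FILE...]
# Print each distinct line preceded by the number of times it occurs,
# most frequent first.
freq_list() {
    sort "$@" |           # identical lines must be adjacent for uniq
    uniq -c |             # prefix each distinct line with its count
    sort -rn |            # highest count first
    sed 's/^ *//'         # drop the padding in front of the count
}
```

With the padding removed, the count and the line are separated by a single space, so a command such as cut -d" " -f1 can pick out the counts.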

tr is useful for converting blanks to newlines (hence converting a text to a vertical list of words, which can then be sorted, counted etc.). The command:

% tr " " "\012" < filename

will do this. 012 is the octal code for the linefeed character. This is also useful for converting strings of blanks or tabs to single characters. 011 is the octal code for the tab character.
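
The two uses of tr described here can be combined. In the sketch below, blanks and tabs are both translated to linefeeds, and the -s option squeezes each run of linefeeds to a single one, so that no empty lines are left in the list. The function name one_word_per_line is an illustrative choice.

```shell
# one_word_per_line
# Read text on the standard input and write one word per line.
# The second string repeats \012 so that it is the same length as
# the first: blank -> linefeed, tab -> linefeed.
one_word_per_line() {
    tr -s ' \011' '\012\012'
}
```

Note that tr reads only the standard input, so a file has to be supplied with < filename. Blank lines in the original text are squeezed out as well.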

PRACTICE

Try out the following pipeline on a text file:

tr " " "\012" < input_file | sort | uniq > output_file

Using language corpora

A corpus (plural corpora) is a collection of language data. The corpora with which we will be concerned here are electronic, that is, they are stored on a computer. Corpora may contain data about written or spoken language. They usually contain texts from one language, but they may also be multilingual. Corpora are usually designed and collated for a specific purpose. Many of the major corpora in use today aim to be representative of different domains of language use, and can facilitate comparative studies. For example, the average length of words in academic texts and newspaper reports could be compared by measuring words in texts from these two domains. Computers obviously make this type of number-crunching (or word-crunching) activity much easier than it would be if you had to count words and letters in a printed text. Corpora are particularly useful for checking the intuitions that we have and the generalisations that are made about language use.
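
A rough sketch of the word-length measurement mentioned above is given below. It assumes that awk is available on your system, and it treats anything between blanks, tabs or line ends as a word, so attached punctuation is counted as part of the word. The function name avg_word_length is an illustrative choice.

```shell
# avg_word_length
# Read text on the standard input and print the mean word length
# to two decimal places.
avg_word_length() {
    tr -s ' \011' '\012\012' |      # one word per line
    awk 'length($0) > 0 { chars += length($0); words++ }
         END { if (words > 0) printf "%.2f\n", chars / words }'
}
```

Running the function on a sample of each kind of text gives two figures that can be compared directly.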

The Unix commands learned in this course can be issued directly, or combined in simple scripts, to extract information from language corpora.

Types of Corpora

There are many types of corpora, defined by the types of language that they represent and the formats in which that information is stored. Unix commands for handling strings are sufficiently flexible to handle many different formats. Users, however, need to be sensitive to the arcane minutiae of the format and markup of the different corpora that they use. The ':set list' command in the vi editor can be used to reveal hidden characters in a file: tabs are displayed as ^I and the end of each line is marked with $, which also exposes trailing spaces.

The LOB and Brown corpora

Brown and LOB are parallel corpora, with very similar formats and tagging. Brown, which was constructed first, represents different types of written American English. LOB represents the same categories of British English. All words are lemmatised and given a word class tag. Here is a sample from the so-called 'vertical tagged' version of Brown:

^N01002001	-----	-----	-----
N01002010	-	NP	Alastair
N01002020	-	BEDZ	was
N01002030	-	AT	a
N01002040	-	NN	bachelor
N01002041	-	.	.
^N01002042	-----	-----	-----
N01002050	-	ABN	all
N01002060	-	PP$	his
N01002070	-	NN	life
N01002080	-	PP3A	he
N01002090	-	HVD	had
N01002100	-	BEN	been
N01002110	-	VBN	inclined
N01002120	-	TO	to
N01003010	-	VB	regard
N01003020	-	NNS	women
N01003030	-	IN	as
N01003040	-	PN	something
N01003050	-	WDTR	which
N01003060	-	MD	must
N01003070	-	RB	necessarily
N01003080	-	BE	be
N01003090	-	VBN	subordinated
N01003100	-	IN	to
N01004010	-	PP$	his

And here is a sample from an 'untagged' version (note that it is a different passage from the one shown above):

N01 0010    DAN MORGAN TOLD HIMSELF HE WOULD FORGET Ann Turner. He
N01 0020 was well rid of her. He certainly didn't want a wife who was fickle
N01 0030 as Ann. If he had married her, he'd have been asking for trouble.
N01 0040    But all of this was rationalization. Sometimes he woke up in
N01 0050 the middle of the night thinking of Ann, and then could not get back
N01 0060 to sleep. His plans and dreams had revolved around her so much and for
N01 0070 so long that now he felt as if he had nothing. The easiest thing would
N01 0080 be to sell out to Al Budd and leave the country, but there was
N01 0090 a stubborn streak in him that wouldn't allow it.   The best antidote
N01 0100 for the bitterness and disappointment that poisoned him was hard
N01 0110 work. He found that if he was tired enough at night, he went to sleep

Users can choose the version (from those available to them) which includes the information that they need. If you are only interested in word frequencies, then the grammatical information encoded in the tagged version is redundant, and the untagged version can be used. If however you are looking for the word 'set' used as a noun, then it would be necessary to use a tagged version, so that this word can be differentiated from 'set' used as a verb or adjective.

Processing LOB and Brown
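
As a minimal sketch of one way to query the vertical tagged format, the function below prints every word that carries a given tag. It assumes that the fields are separated by tabs and appear in the order shown in the sample above (reference, a field containing '-', tag, word), and that awk is available; check both against your own copy of the corpus. The function name words_with_tag is an illustrative choice.

```shell
# words_with_tag TAG [FILE...]
# Print the word (4th field) of every line whose tag (3rd field)
# is exactly TAG. With no FILE arguments, the standard input is read.
words_with_tag() {
    tag=$1
    shift
    awk -F'\t' -v tag="$tag" '$3 == tag { print $4 }' "$@"
}
```

The output can be passed on to sort and uniq -c, for example to count how often each singular noun occurs. An exact comparison is used rather than a pattern so that tags such as PP$ are not treated as regular expressions.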

The Susanne corpus

This corpus uses a section of the Brown corpus and marks it up with syntactic information.

N01:0010a	-	YB	<minbrk>	-	[Oh.Oh]
N01:0010b	-	NP1m	DAN	Dan	[O[S[Nns:s.
N01:0010c	-	NP1s	MORGAN	Morgan	.Nns:s]
N01:0010d	-	VVDv	TOLD	tell	[Vd.Vd]
N01:0010e	-	PPX1m	HIMSELF	himself	[Nos:i.Nos:i]
N01:0010f	-	PPHS1m	HE	he	[Fn:o[Nas:s.Nas:s]
N01:0010g	-	VMd	WOULD	will	[Vdc.
N01:0010h	-	VV0v	FORGET	forget	.Vdc]
N01:0010i	-	NP1f	Ann	Ann	[Nns:o.
N01:0010j	-	NP1s	Turner	Turner	.Nns:o]Fn:o]S]
N01:0010k	-	YF	+.	-	.
N01:0010m	-	PPHS1m	He	he	[S[Nas:s.Nas:s]
N01:0020a	-	VBDZ	was	be	[Vsb.Vsb]
N01:0020b	-	RR	well	well	[Tn:e[R:h.R:h]
N01:0020c	-	VVNt	rid	rid	[Vn.Vn]
N01:0020d	-	IO	of	of	[Po:u.
N01:0020e	-	PPHO1f	her	she	.Po:u]Tn:e]S]
N01:0020f	-	YF	+.	-	.
N01:0020g	-	PPHS1m	He	he	[S[Nas:s.Nas:s]
N01:0020h	-	RR	certainly	certainly	[R:m.R:m]
N01:0020i	-	VDD	did	do	[Vde.
N01:0020j	-	XX	+n<apos>t	not	.
N01:0020k	-	VV0v	want	want	.Vde]
N01:0020m	-	AT1	a	a	[Ns:o101.
N01:0020n	-	NN1c	wife	wife	.
N01:0020p	-	PNQSr	who	who	[Fr[Nq:s101.Nq:s101]
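
Because each line of this format carries a base form as well as the word itself, a frequency list of base forms can be made with commands already covered. The sketch below assumes tab-separated fields in the order shown in the sample, with the base form in the fifth field and a lone '-' wherever there is none (as on the punctuation lines). The function name lemma_freq is an illustrative choice.

```shell
# lemma_freq [FILE...]
# Count the base forms in the fifth field, most frequent first.
lemma_freq() {
    cut -f5 "$@" |        # keep only the base-form field
    grep -v '^-$' |       # drop lines that have no base form
    sort |
    uniq -c |
    sort -rn
}
```

In the sample above, 'He' and 'HE' would both be counted under the base form 'he'.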

The London-Lund corpus

This corpus differs from the others that we have looked at because it is a transcription of spoken English. Intonation is marked.

1 1 1 10 1 1 B 11 ((of ^Spanish)) . graph\ology#/

1 1 1 20 1 1 A 11 ^w=ell# ./

1 1 1 30 1 1 A 11 ((if)) did ^y/ou _set _that# - /

1 1 1 40 1 1 B 11 ^well !J\oe and _I#/

1 1 1 50 1 1 B 11 ^set it betw\een _us#/

1 1 1 60 1 1 B 11 ^actually !Joe 'set the :p\aper#/

1 1 1 70 1 1 B 20 and *((3 to 4 sylls))*/

1 1 1 80 1 1 A 11 *^w=ell# ./

1 1 1 90 1 1 A 11 "^m/\ay* I _ask#/

1 1 1 100 1 1 A 11 ^what goes !\into that paper n/ow#/

1 1 1 110 1 1 A 11 be^cause I !have to adv=ise# ./

1 1 1 120 1 1 A 21 ((a)) ^couple of people who are !d\oing [dhi: @]/

1 1 1 130 1 1 B 11 well ^what you :d\/o#/

1 1 1 140 1 2 B 12 ^is to - - ^this is sort of be:tween the :tw\/o of /

1 1 1 140 1 1 B 12 _us# /

1 1 1 150 1 1 B 11 ^what *you* :d\/o#/

1 1 1 160 2 1 B 23 is to ^make sure that your 'own . !c\andidate/

1 1 1 170 1 1 A 11 *^[\m]#*/

1 1 1 160 1 2(B 13 is . *.* ^that your . there`s ^something that your /

1 1 1 160 1 1(B 13 :own candidate can :h\/andle# - -/

CUVOALD

This acronym stands for the Computer Usable Version of the Oxford Advanced Learner's Dictionary. There are in fact two versions. The more useful is usually found in a file called cuv2.dat, which contains 68742 words, including inflected forms and proper nouns. It is most often of use as a wordlist, but the file also contains a phonemic transcription and a part-of-speech tag for every word. Here is a sample of cuv2.dat:

verbs	v3bz	Kj
verdancy	'v3dnsI	L@
verdant	'v3dnt	OA
verdict	'v3dIkt	K6
verdicts	'v3dIkts	Kj
verdigris	'v3dIgrIs	L@
verdure	'v3dj@R	L@
verge	v3dZ	I2,K6	3A
verged	v3dZd	Ic,Id	3A
verger	'v3dZ@R	K6
vergers	'v3dZ@z	Kj
verges	'v3dZIz	Ia,Kj	3A
verging	'v3dZIN	Ib	3A
verifiable	'verIfaI@bl	OA
verification	,verIfI'keISn	M6
verifications	,verIfI'keISnz	Mj
verified	'verIfaId	Hc,Hd	6A
verifies	'verIfaIz	Ha	6A
verify	'verIfaI	H3	6A
verifying	'verIfaIIN	Hb	6A
verily	'ver@lI	Pu
verisimilitude	,verIsI'mIlItjud	M6
verisimilitudes	,verIsI'mIlItjudz	Mj
veritable	'verIt@bl	OA
verities	'verItIz	Mj
verity	'verItI	M8
vermicelli	,v3mI'selI	L@
vermiform	'v3mIfOm	OA
vermilion	v@'mIlI@n	M6,OA

The coding conventions for the phonemic and syntactic tags are explained in a file that comes with the dictionary. Some examples of applications that use the dictionary can be found in the appendix of this course.
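
As a small sketch of using the dictionary as a source of wordlists, the function below prints every entry whose third field contains a given code. It assumes that the fields are separated by tabs as they appear in the sample above, and that awk is available; if your copy of cuv2.dat is laid out differently (for example in fixed-width columns), the field handling would need to change. The function name headwords_with_code is an illustrative choice, and the meaning of each code should be looked up in the accompanying documentation.

```shell
# headwords_with_code CODE [FILE...]
# Print the first field of every entry whose third field contains CODE.
# With no FILE arguments, the standard input is read.
headwords_with_code() {
    code=$1
    shift
    awk -F'\t' -v code="$code" 'index($3, code) > 0 { print $1 }' "$@"
}
```

For a plain wordlist with no filtering, cut -f1 on the same file is enough, under the same assumption about tabs.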

Other texts

Corpus building is currently a growth area, and many more corpora exist besides the examples above. Currently available or under construction are a number of very large corpora, comprehensive corpora aiming to cover all registers of English, international English corpora, corpora of different languages and specialised corpora covering a single well-defined domain of language.

Exercises

1. Find a large text file with a fixed field format (e.g. the Brown or LOB corpora) and inspect the format. Use zcat to view it if necessary.

3. Use cut to strip away the reference material and leave just the text field.

4. Use tr to strip away any tags that are actually in the text (e.g. attached to the words), so that you are left with just the words.

5. Make a sorted wordlist from the file.

6. Combine the above commands in a shell script so that you have a small program for extracting a wordlist.