This section will focus on exploiting large files containing linguistic material, using the commands already covered plus many more.
Often large files are compressed to save disk space. If this is the case, the user must revert the file to its original format before being able to do anything with it. A popular compression command is called, simply, compress. The command:
% compress filename
will cause the file to be replaced by a compressed file with a .Z suffix. The command uncompress will cause it to revert to its original format. It is often not necessary to uncompress a file to use it. In fact, the file will often be owned by someone else, so you would have to copy it and then uncompress it, using up a great deal of disk space and processor time. It is often better to use the zcat command, which sends the uncompressed contents of a compressed file to the standard output while leaving the compressed version of the file in the filestore.
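For example, to page through a compressed file without ever creating an uncompressed copy (the filename here is hypothetical):

% zcat bigfile.Z | more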
Try compressing and uncompressing some of your own files.
Find a large compressed file on your system and search it for some appropriate string using grep without uncompressing the file.
The following is a summary of some useful commands for processing text files; some of them you have met already, others are new to you. Both kinds are included so that this section can easily be used for reference purposes. Not all of these commands are standard Unix, so they may not all work in the way you expect (or at all) on your system. For the same reason, their syntax is somewhat inconsistent, and some use different input and output conventions. Not all are included in the command summary in the appendix below. See the relevant manual pages for more details.
sort                 sort into alphabetical order
sort -n              sort into numerical order
sort -m              merge sorted files into one sorted file
sort -r              sort into reverse order (highest first)
sort -c              check that a file is already sorted
uniq                 remove duplicate lines (or partly-duplicate lines)
uniq -d              output only duplicate lines
uniq -c              count identical lines (or lines with identical fields)
grep                 find lines containing a given string or pattern
grep -v              find lines not containing a given string or pattern
grep -c              count lines containing a given string or pattern
grep -n              give line numbers of lines containing a given string or pattern
fgrep                same as grep, except that it does not recognise regular expressions
egrep                same as grep, except that it recognises full regular expressions (grep recognises only certain special characters)
wc -c                count characters
wc -w                count words
wc -l                count lines
wc -l file           outputs the number of lines in the file, plus the file name
wc -l < file         outputs just the bare line count
head -17             output the first 17 lines
tail -17             output the last 17 lines
tail +30             output from line 30 onwards
cut -f3              delete all but the third field of each line
cut -f3,5            delete all but the third and fifth fields of each line
cut -f3-5,7          delete all but the 3rd, 4th, 5th and 7th fields of each line
cut -c-4,6-8         delete all but the 1st-4th and 6th-8th characters of each line
cut -f2 -d":"        delete all but the second field, where ":" is the field delimiter (tab is the default)
paste                combine files horizontally; corresponding lines are joined
paste -d">"          paste with the delimiter defined as ">" (tab is the default); the special characters "\n" (newline) and "\0" (null string) may be used
cat                  concatenate files vertically (append files to one another)
cat -n               precede each line with a line number in the output
cat -b               as above, but do not number blank lines
cat -s               reduce any number of successive blank lines to one blank line
tr "abc-e" "kmx-z"   translate a, b, c, d, e to k, m, x, y, z respectively
tr -d "xy"           delete all occurrences of x and y
tr -s "a" "b"        translate all a to b and reduce any string of consecutive b to just one b
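The complementary behaviour of cut and paste is easiest to see in combination. A sketch, using a hypothetical tab-delimited file pairs.dat containing a word field and a tag field:

% cut -f1 pairs.dat > words
% cut -f2 pairs.dat > tags
% paste -d":" words tags

The last command outputs each word and its tag rejoined, with ":" as the new delimiter.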
To go down to the character, rather than field, level, sed is simplest for line-by-line processing. sed looks for patterns, so it is not very good with column or field positions.
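For example, suppose tags in a horizontal tagged text are attached to words with an underscore (the format, the tag pattern and the filename here are all assumptions); sed can strip every such tag on every line:

% sed 's/_[A-Z0-9$]*//g' tagged_file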
uniq needs an already-sorted file. A common idiom is

sort | uniq

to produce a sorted list of all the different lines in a file. uniq has a peculiar way of spacing its output, so it is difficult to use in a pipeline with another command such as cut.
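With the -c option added, the same idiom becomes a simple frequency counter. For example, assuming wordlist is a file with one word per line:

% sort wordlist | uniq -c | sort -rn

outputs each distinct word preceded by its frequency, most frequent first.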
tr is useful for converting blanks to newlines (hence converting a text to a vertical list of words, which can then be sorted, counted etc.). The command:
% tr " " "\012" < filename
will do this. 012 is the octal code for the linefeed character. tr is also useful for converting strings of blanks or tabs to single characters; 011 is the octal code for the tab character.
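For instance, with a hypothetical filename:

% tr -s " \011" < filename

squeezes every run of repeated blanks to a single blank, and every run of repeated tabs to a single tab.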
Try out the following pipeline on a text file:
% tr " " "\012" < input_file | sort | uniq > output_file
A corpus (plural corpora) is a collection of language data. The corpora with which we will be concerned here are electronic, that is, they are stored in a computer. Corpora may contain data about written or spoken language. They usually contain texts from one language, but they may also be multilingual. Corpora are usually designed and collated for a specific purpose. Many of the major corpora in use today aim to be representative of different domains of language use, and can facilitate comparative studies. For example, the average length of words in academic texts and newspaper reports could be compared by measuring words in texts from these two domains. Computers obviously make this type of number-crunching (or word-crunching) activity much easier than it would be if you had to count words and letters in a printed text. Corpora are particularly useful for checking the intuitions that we have and the generalisations that are made about language use.
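As a sketch of how such a measurement might be made with the commands in this course plus the standard calculator bc (the filename academic.txt is hypothetical, and the figure is only approximate, since the character count includes spaces and punctuation):

% echo "scale=2; `wc -c < academic.txt` / `wc -w < academic.txt`" | bc

divides the number of characters in the file by the number of words, giving an average word length to two decimal places.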
Unix commands can be used to extract information from language corpora. The commands learned in this course, whether issued directly or combined into simple scripts, are sufficient for many such tasks.
There are many types of corpora, defined by the types of language that they represent and the formats in which that information is stored. Unix commands for handling strings are sufficiently flexible to cope with many different formats. Users, however, need to be sensitive to the arcane minutiae of the format and markup of the different corpora that they use. The 'l' command in the vi editor can be used to view hidden characters (such as spaces and tabs) in a file.
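A non-interactive alternative is the standard od command; for example, with a hypothetical filename:

% od -c corpusfile | head

displays the first part of the file with every character shown explicitly, tabs appearing as \t and newlines as \n.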
Brown and LOB are parallel corpora, with very similar formats and tagging. Brown, which was constructed first, represents different types of written American English. LOB represents the same categories of British English. All words are lemmatised and given a word class tag. Here is a sample from the so-called 'vertical tagged' version of Brown:
^N01002001 ----- ----- -----
N01002010 - NP Alastair
N01002020 - BEDZ was
N01002030 - AT a
N01002040 - NN bachelor
N01002041 - . .
^N01002042 ----- ----- -----
N01002050 - ABN all
N01002060 - PP$ his
N01002070 - NN life
N01002080 - PP3A he
N01002090 - HVD had
N01002100 - BEN been
N01002110 - VBN inclined
N01002120 - TO to
N01003010 - VB regard
N01003020 - NNS women
N01003030 - IN as
N01003040 - PN something
N01003050 - WDTR which
N01003060 - MD must
N01003070 - RB necessarily
N01003080 - BE be
N01003090 - VBN subordinated
N01003100 - IN to
N01004010 - PP$ his
And the 'untagged' version of the same passage, plus the following lines:
N01 0010 DAN MORGAN TOLD HIMSELF HE WOULD FORGET Ann Turner. He
N01 0020 was well rid of her. He certainly didn't want a wife who was fickle
N01 0030 as Ann. If he had married her, he'd have been asking for trouble.
N01 0040 But all of this was rationalization. Sometimes he woke up in
N01 0050 the middle of the night thinking of Ann, and then could not get back
N01 0060 to sleep. His plans and dreams had revolved around her so much and for
N01 0070 so long that now he felt as if he had nothing. The easiest thing would
N01 0080 be to sell out to Al Budd and leave the country, but there was
N01 0090 a stubborn streak in him that wouldn't allow it. The best antidote
N01 0100 for the bitterness and disappointment that poisoned him was hard
N01 0110 work. He found that if he was tired enough at night, he went to sleep
Users can choose the version (from those available to them) which includes the information that they need. If you are only interested in word frequencies, the grammatical information encoded in the tagged version is redundant, and the untagged version can be used. If, however, you are looking for the word 'set' used as a noun, a tagged version is necessary, so that this word can be differentiated from 'set' used as a verb or adjective.
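For example, in the vertical tagged format shown above, each line ends with a tag followed by the word, so occurrences of 'set' as a singular common noun could be found with something like the following (the filename is hypothetical, and the exact spacing of the fields should be checked first):

% grep 'NN set$' brown_tagged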
This corpus, known as SUSANNE, uses a section of the Brown corpus and marks it up with syntactic information.
N01:0010a - YB <minbrk> - [Oh.Oh]
N01:0010b - NP1m DAN Dan [O[S[Nns:s.
N01:0010c - NP1s MORGAN Morgan .Nns:s]
N01:0010d - VVDv TOLD tell [Vd.Vd]
N01:0010e - PPX1m HIMSELF himself [Nos:i.Nos:i]
N01:0010f - PPHS1m HE he [Fn:o[Nas:s.Nas:s]
N01:0010g - VMd WOULD will [Vdc.
N01:0010h - VV0v FORGET forget .Vdc]
N01:0010i - NP1f Ann Ann [Nns:o.
N01:0010j - NP1s Turner Turner .Nns:o]Fn:o]S]
N01:0010k - YF +. - .
N01:0010m - PPHS1m He he [S[Nas:s.Nas:s]
N01:0020a - VBDZ was be [Vsb.Vsb]
N01:0020b - RR well well [Tn:e[R:h.R:h]
N01:0020c - VVNt rid rid [Vn.Vn]
N01:0020d - IO of of [Po:u.
N01:0020e - PPHO1f her she .Po:u]Tn:e]S]
N01:0020f - YF +. - .
N01:0020g - PPHS1m He he [S[Nas:s.Nas:s]
N01:0020h - RR certainly certainly [R:m.R:m]
N01:0020i - VDD did do [Vde.
N01:0020j - XX +n<apos>t not .
N01:0020k - VV0v want want .Vde]
N01:0020m - AT1 a a [Ns:o101.
N01:0020n - NN1c wife wife .
N01:0020p - PNQSr who who [Fr[Nq:s101.Nq:s101]
This corpus (the London-Lund corpus) differs from the others that we have looked at because it is a transcription of spoken English. Intonation is marked.
1 1 1 10 1 1 B 11 ((of ^Spanish)) . graph\ology#/
1 1 1 20 1 1 A 11 ^w=ell# ./
1 1 1 30 1 1 A 11 ((if)) did ^y/ou _set _that# - /
1 1 1 40 1 1 B 11 ^well !J\oe and _I#/
1 1 1 50 1 1 B 11 ^set it betw\een _us#/
1 1 1 60 1 1 B 11 ^actually !Joe 'set the :p\aper#/
1 1 1 70 1 1 B 20 and *((3 to 4 sylls))*/
1 1 1 80 1 1 A 11 *^w=ell# ./
1 1 1 90 1 1 A 11 "^m/\ay* I _ask#/
1 1 1 100 1 1 A 11 ^what goes !\into that paper n/ow#/
1 1 1 110 1 1 A 11 be^cause I !have to adv=ise# ./
1 1 1 120 1 1 A 21 ((a)) ^couple of people who are !d\oing [dhi: @]/
1 1 1 130 1 1 B 11 well ^what you :d\/o#/
1 1 1 140 1 2 B 12 ^is to - - ^this is sort of be:tween the :tw\/o of /
1 1 1 140 1 1 B 12 _us# /
1 1 1 150 1 1 B 11 ^what *you* :d\/o#/
1 1 1 160 2 1 B 23 is to ^make sure that your 'own . !c\andidate/
1 1 1 170 1 1 A 11 *^[\m]#*/
1 1 1 160 1 2(B 13 is . *.* ^that your . there`s ^something that your /
1 1 1 160 1 1(B 13 :own candidate can :h\/andle# - -/
The acronym CUVOALD stands for the Computer Usable Version of the Oxford Advanced Learner's Dictionary. There are in fact two versions. The more useful is usually in a file called cuv2.dat, which contains 68742 words, including inflected forms and proper nouns. It is most often of use as a wordlist, but the file also contains a phonemic transcription and a part-of-speech tag for every word. Here is a sample of cuv2.dat:
verbs v3bz Kj
verdancy 'v3dnsI L@
verdant 'v3dnt OA
verdict 'v3dIkt K6
verdicts 'v3dIkts Kj
verdigris 'v3dIgrIs L@
verdure 'v3dj@R L@
verge v3dZ I2,K6 3A
verged v3dZd Ic,Id 3A
verger 'v3dZ@R K6
vergers 'v3dZ@z Kj
verges 'v3dZIz Ia,Kj 3A
verging 'v3dZIN Ib 3A
verifiable 'verIfaI@bl OA
verification ,verIfI'keISn M6
verifications ,verIfI'keISnz Mj
verified 'verIfaId Hc,Hd 6A
verifies 'verIfaIz Ha 6A
verify 'verIfaI H3 6A
verifying 'verIfaIIN Hb 6A
verily 'ver@lI Pu
verisimilitude ,verIsI'mIlItjud M6
verisimilitudes ,verIsI'mIlItjudz Mj
veritable 'verIt@bl OA
verities 'verItIz Mj
verity 'verItI M8
vermicelli ,v3mI'selI L@
vermiform 'v3mIfOm OA
vermilion v@'mIlI@n M6,OA
The coding conventions for the phonemic and syntactic tags are explained in a file that comes with the dictionary. Some examples of applications that use the dictionary can be found in the appendix of this course.
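Since the headword is the first field of every entry, a bare wordlist can be extracted with cut. A sketch, assuming the fields are separated by single spaces (adjust the -d value if your copy uses tabs):

% cut -f1 -d" " cuv2.dat > wordlist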
Corpus building is currently a growth area, and there are many, many more corpora besides the above examples. Currently available or under construction are a number of very large corpora, comprehensive corpora aiming to cover all registers of English, international English corpora, corpora of different languages and specialised corpora covering a single well-defined domain of language.
1. Find a large text file with a fixed field format (e.g. the Brown or LOB corpora) and inspect the format. Use zcat to view it if necessary.
2. Use cut to strip away the reference material and leave just the text field.
3. Use tr to strip away any tags that are actually in the text (e.g. attached to the words), so that you are left with just the words.
4. Make a sorted wordlist from the file.
5. Combine the above commands in a shell script so that you have a small program for extracting a wordlist (a sketch follows below).
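A minimal sketch of such a script, assuming a Brown/LOB-style untagged file in which the text begins at a fixed character position; the column number and the characters deleted are assumptions to be adapted to the corpus in hand:

#!/bin/sh
# wordlist: extract a sorted list of the different words in a
# fixed-field corpus file.  Usage: wordlist corpusfile > outputfile
# Strip the reference fields (text assumed to start at column 10),
# delete punctuation, put one word per line, drop any blank lines,
# then sort and remove duplicates.
cut -c10- $1 |
tr -d ".,;:?" |
tr -s " \011" "\012\012" |
grep -v "^$" |
sort | uniq

Once the script has been made executable with chmod +x, it can be run on any corpus file with the same layout.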