Corpus searching with AC/DC

Concordances - Context - Search strings

A concordance - concordância - finds all the sentences containing the search word or phrase, and lists them in a standard format - in this case, in order of occurrence in the corpus, with the search phrase emboldened. (Other programs allow more control over the layout of the output.)

Extract from concordance of SEMPRE (14.1.2002)

É uma das mais antigas discotecas do Algarve, situada em Albufeira, que continua a manter os traços decorativos e as clientelas de sempre .

E razão desta escolha é, obviamente, a progressão demente da Frente Nacional, que prospera sempre a apontar o imigrante como bode expiatório e simultaneamente como a fonte de todos males do povo francês.

Carrington fez sempre questão de salientar que as hipóteses de sucesso do cessar-fogo dependem sobretudo dos beligerantes .-

Com argumentos economicistas e de operacionalidade, o Executivo de Cavaco Silva sempre se escusou a concretizar o SIED, cujas competências foram, entretanto, transferidas para o SIM (Serviços de Informações Militares) , por via de um polémico acto administrativo do Governo, que assim chamava a si matérias da exclusiva competência da AR.

The current version of AC/DC displays the whole sentence containing the word(s) requested. This has advantages and disadvantages over the previous method of displaying a fixed number of characters on either side, as can be seen from an earlier version of the first example, where que is truncated to an ambiguous e, but part of the following sentence is included:

e continua a manter os traços decorativos e as clientelas de sempre. É um pouco a versão de uma espécie de «outro lado» da n

(To achieve an output of more than one sentence, it is necessary to put punctuation marks in your search string.)
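The difference between the two display modes can be illustrated outside AC/DC. The following Python sketch is not the AC/DC program: the function names, the naive sentence splitter and the sample text are all assumptions made for illustration.

```python
import re

def sentence_concordance(text, word):
    """Return every whole sentence that contains `word` as a separate word."""
    # Naive sentence splitter (an assumption): break after . ! or ? plus whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    return [s for s in sentences if pattern.search(s)]

def window_concordance(text, word, width=20):
    """Return `width` characters of context on either side of each hit."""
    hits = []
    for m in re.finditer(r"\b" + re.escape(word) + r"\b", text):
        start = max(0, m.start() - width)
        hits.append(text[start:m.end() + width])
    return hits

text = "Ela chegou cedo. Ele fez sempre questão de esperar. Depois saiu."
print(sentence_concordance(text, "sempre"))
print(window_concordance(text, "sempre", width=12))
```

The first function returns the complete sentence; the second cuts the word cedo in half and carries over the end of the preceding sentence, which is the kind of truncation described above.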

Search strings.

words - sequences of words - sets of words - alternatives - exclusions - phrases - punctuation - lexical items - parts of speech - grammatical information

The text-searching program is a device for locating strings, i.e. sequences of characters in the corpus. It does not 'know' anything about Portuguese grammar or orthography, and very little about punctuation, and so has to be tricked into finding what the user really wants. A number of searching devices make this less difficult, though each of them incurs a risk of finding sequences other than the ones being searched for. In the examples that follow, search strings are in RED and output forms are in PURPLE:

1. Words and parts of words. Typing in a single string will find all cases of that 'word', i.e. that string preceded and followed by a space or a major punctuation mark . , ! « » : ; etc.

mas finds , mas «mas mas,

Não finds Não. Não! «Não Não» «Não»

To search for a sequence of words, enclose each word in double quotation marks:

"não" "para" finds all phrases containing não para (but not phrases with just não or just para)

"não para" will get no result, as there is no word of that form; and não para will be rejected as obviously two words.

The full stop "." stands for any letter: so

"." will find all one-letter words

"..." will find all three-letter words.

If you want to use the full stop literally, i.e. to indicate the character ".", you must precede it with "\" (the "escape character"). This applies to all characters with a special value in the search commands, including "?" and "*".

".*" (known as a wildcard character) indicates any number of letters (including zero), and is used to find all words beginning or ending with a given sequence of characters:

cama.* finds cama, camas, camarins, camarária, camada, camarada, camaradagem, camarata etc.

.*ama.* finds Camara, camarada, Camarata, Samara, viamarense etc.
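The behaviour of ".", ".*" and "\" can be tried out with any regular-expression tool. The sketch below uses Python's re module and assumes that a search string has to match a whole word (token), as the examples above suggest. The helper find_tokens and the word list are invented for illustration, and AC/DC's own matching may differ in detail.

```python
import re

def find_tokens(pattern, tokens):
    """Return the tokens that the pattern matches from first to last character."""
    return [t for t in tokens if re.fullmatch(pattern, t)]

tokens = ["a", "cama", "camas", "camarada", "Camara", "acamar", "fim", "."]

print(find_tokens(r"cama.*", tokens))   # words beginning with cama
print(find_tokens(r".*ama.*", tokens))  # words containing ama
print(find_tokens(r"...", tokens))      # three-character words
print(find_tokens(r".", tokens))        # any one-character token
print(find_tokens(r"\.", tokens))       # the literal full stop only
```

In this sketch the unescaped "." also picks up the punctuation token, because to the regular-expression engine it stands for any single character.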

These devices can be combined in more complex formulae:

"[oa]s" ".*r" will find sequences of plural object pronoun and infinitive ... os fazer ... etc

"[Hh].*" "d.*" will find examples of the haver de construction (with a lot of other things too...)

2. Alternatives.

To instruct the program to look for any one of two or more characters at a given point in the string, place the chosen characters between square brackets:

[Tt]udo or [tT]udo will find Tudo and tudo

Paul[oa] will find Paulo and Paula

[Ff]al[oae] will find Falo, Fala, Fale, falo, fala, fale
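Square brackets can be checked in the same way. This is a minimal Python sketch, assuming whole-word matching; the function name matches and the candidate lists are hypothetical.

```python
import re

def matches(pattern, candidates):
    """Return the candidates matched in full by the pattern."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

# The order of the characters inside the brackets makes no difference.
print(matches(r"[Tt]udo", ["Tudo", "tudo", "TUDO", "mudo"]))
print(matches(r"[tT]udo", ["Tudo", "tudo", "TUDO", "mudo"]))

# Each pair of brackets stands for exactly one character.
print(matches(r"Paul[oa]", ["Paulo", "Paula", "Paulos"]))
print(matches(r"[Ff]al[oae]", ["Falo", "fala", "fale", "falar", "fali"]))
```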

To instruct the program to look for any one of two or more sequences of letters, separate the sequences with the vertical line | :

Tudo|tudo will find Tudo and tudo

sim|não will find sim and não

"que|não" "[oa]" will find all phrases with que or não followed by a singular pronoun or article.

The question mark "?" is used to indicate an optional character (i.e. zero or one occurrences of the preceding character):

"quem?" will find que and quem

The asterisk "*" indicates any number of occurrences (including zero) of the previous character. It is mainly used together with the full stop, but may be used with any character:

"1*" will find the numbers 1, 11, 111, 1111

3. Exclusions:

To limit the number of 'hits', you can exclude words from the search sequence, by including a word or formula in the frame [word!=...].

e.g. To search for the adverb sempre excluding the conjunction sempre que, enter

"[Ss]empre" [word!="que"]

4. Phrases

To search for phrases containing two or more words, use

[] (i.e. square brackets with no text inside) to indicate any single word or punctuation mark, or

[] {0,5} (the same, followed by curly brackets containing two numbers, to indicate the minimum and maximum number of words in a given position).

e.g. "não" []{1,4} "nunca|nada" yields phrases like

Lusa pedindo anonimato, «não ajuda em nada a dissuasão dos assaltos

esclarece: não houve nada com o Porto

a mesma convicção: que não; nada se tinha passado

«Tudo calmo». «Nada; não se passou nada».

a algumas empresas, que não «vão ter nada a ver» com os actuais empregados da Ce
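The effect of the {1,4} range can be sketched in Python over a list of tokens. The function nao_gap_nada is hypothetical, and the decision to keep only the shortest match from each starting point is an assumption rather than documented AC/DC behaviour.

```python
import re

def nao_gap_nada(tokens, min_gap=1, max_gap=4):
    """Find 'não', then min_gap..max_gap arbitrary tokens, then nunca or nada."""
    spans = []
    for i, tok in enumerate(tokens):
        if tok != "não":
            continue
        for gap in range(min_gap, max_gap + 1):
            j = i + gap + 1  # position of the closing word
            if j < len(tokens) and re.fullmatch(r"nunca|nada", tokens[j]):
                spans.append(" ".join(tokens[i:j + 1]))
                break  # keep the shortest match for this starting point
    return spans

print(nao_gap_nada("esclarece : não houve nada com o Porto".split()))
print(nao_gap_nada("« Nada ; não se passou nada »".split()))
print(nao_gap_nada("não nada".split()))  # gap of zero words: no hit
```

Punctuation marks count as tokens here, just as [] is described above as standing for any single word or punctuation mark.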

5. Punctuation

Punctuation marks can be searched for like words. To find the end of a sentence, use

"!|\?|\." i.e. exclamation mark, question mark or full stop (the escape character "\" preceding the question mark and the full stop ensures that they are interpreted as literal characters and not as the special characters described in sections 1 and 2).

To find the end of a clause, use:

"!|\?|\.|,|;|:"

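To find verbs with enclitic pronouns, search for

".*-(l|lh|m|n|s).s?" (it really works! Note the round brackets: inside square brackets the vertical line would be read as an ordinary character, not as "or".)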

To find the word "Porém" at the beginning of a sentence, together with the whole of the preceding sentence, use:

[] "\." "Porém"
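The role of the escape character in these punctuation searches can be checked with a short Python sketch. It assumes that punctuation marks are stored as separate tokens; the names SENTENCE_END and sentence_initial and the sample tokens are invented for illustration.

```python
import re

SENTENCE_END = r"!|\?|\."  # escaped ? and . are literal characters

def sentence_initial(tokens, word):
    """Return positions where `word` directly follows a sentence-final mark."""
    return [i for i in range(1, len(tokens))
            if tokens[i] == word and re.fullmatch(SENTENCE_END, tokens[i - 1])]

tokens = "Choveu muito . Porém , ninguém saiu ! Não ?".split()

print([t for t in tokens if re.fullmatch(SENTENCE_END, t)])
print(sentence_initial(tokens, "Porém"))

# Without the backslash, "." matches any one-character token, not just the full stop.
print([t for t in ["a", ".", "ab"] if re.fullmatch(r".", t)])
```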

6.  Lexical units

To search for all forms of a noun or verb without providing a complete list, you can use the "lema" tag which searches for lexical words or lemmas.  Nouns are searched for by their singular form; adjectives, possessives, relative and interrogative pronouns by their masculine singular form; personal pronouns by their subject form; and verbs by their bare infinitive form.

[lema="livro"] will find livro, livros

[lema="bom"] will find bom, bons, boa, boas

[lema="ele"] will find ele, ela, o, a, lhe, but not eles, elas, os, as, lhes.

[lema="cantar"] will find all forms of the verb cantar

[lema="precisar"] "de|dos?|das?" will find most cases of the construction precisar de
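A lemma search amounts to looking up a second label stored alongside each word. The Python sketch below uses a tiny hand-annotated token list; the annotations, the field names word and lema as dictionary keys, and both functions are assumptions for illustration, not the AC/DC data format.

```python
import re

# Toy annotated tokens; the lemma values are invented for this example.
corpus = [
    {"word": "Os", "lema": "o"},
    {"word": "livros", "lema": "livro"},
    {"word": "eram", "lema": "ser"},
    {"word": "bons", "lema": "bom"},
    {"word": "e", "lema": "e"},
    {"word": "ela", "lema": "ele"},
    {"word": "precisava", "lema": "precisar"},
    {"word": "de", "lema": "de"},
    {"word": "um", "lema": "um"},
    {"word": "livro", "lema": "livro"},
]

def by_lemma(corpus, lemma):
    """Return every word form whose lemma equals `lemma`."""
    return [t["word"] for t in corpus if t["lema"] == lemma]

def precisar_de(corpus):
    """Return any form of precisar followed by de, do(s) or da(s)."""
    hits = []
    for first, second in zip(corpus, corpus[1:]):
        if first["lema"] == "precisar" and re.fullmatch(r"de|dos?|das?", second["word"]):
            hits.append(first["word"] + " " + second["word"])
    return hits

print(by_lemma(corpus, "livro"))
print(by_lemma(corpus, "ele"))
print(precisar_de(corpus))
```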

7. Parts of speech

All of the corpora are stored in an "annotated" form, with grammatical information attached to each word. The most useful of these are the "part of speech" or "pos" markers, which can be searched for independently or as part of the specification of a set of words.

The main Part of Speech tags are:

N Noun

V Verb

PERS Personal pronoun

PRP Preposition

ADV Adverb

ADJ Adjective

DET Determiner (articles, demonstratives)

K Conjunction

NUM Numeral

Subclasses of noun and verb are indicated by additional tags attached to the main tag by the underline symbol "_".

N_,

N_

N_

V_

ADJ_n             adjective also used as noun

ADV_rel          relative adverb

DET_arti          indefinite article

DET_artd         definite article

To get a better idea of the POS tags used, select "Distribuição da categoria gramatical (PoS)" in the Resultados box of AC/DC. (Remember that you have to select the anotado version of the chosen corpus.)

To search for parts of speech, include [pos="..."] in the search formula.

[pos="N.*"] will find all nouns. (The search formula needs to use the wildcard character ".*", to allow for subcategories, unless these are specifically included).

[pos="DET.*"] [pos="N.*"] will find all sequences of determiner plus noun

"os?" [pos="N.*"] will find all sequences of o or os followed by nouns

[word="os?" & pos="DET.*"] will find all cases of o or os which are determiners (rather than pronouns)

[word="[cC]ompra" & pos="N.*"] will find all cases of the noun compra but no forms of the verb comprar.

Note that you have to use the word=".." format whenever you wish to combine a letter search and a category search.
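Combining a word condition with a category condition can be sketched in Python as follows. The token list and its tags are invented (using only main tags from the list above), and match_token and find_sequences are hypothetical helpers, not part of AC/DC.

```python
import re

# Toy tagged tokens for "A compra correu bem e ele compra os livros".
corpus = [
    {"word": "A", "pos": "DET_artd"},
    {"word": "compra", "pos": "N"},
    {"word": "correu", "pos": "V"},
    {"word": "bem", "pos": "ADV"},
    {"word": "e", "pos": "K"},
    {"word": "ele", "pos": "PERS"},
    {"word": "compra", "pos": "V"},
    {"word": "os", "pos": "DET_artd"},
    {"word": "livros", "pos": "N"},
]

def match_token(token, **conditions):
    """True if every named attribute is matched in full by its pattern."""
    return all(re.fullmatch(rx, token[attr]) for attr, rx in conditions.items())

def find_sequences(corpus, *positions):
    """Return word sequences whose consecutive tokens satisfy each position."""
    n = len(positions)
    hits = []
    for i in range(len(corpus) - n + 1):
        window = corpus[i:i + n]
        if all(match_token(t, **cond) for t, cond in zip(window, positions)):
            hits.append(" ".join(t["word"] for t in window))
    return hits

print(find_sequences(corpus, {"pos": "DET.*"}, {"pos": "N.*"}))
print(find_sequences(corpus, {"word": "[cC]ompra", "pos": "N.*"}))
print(find_sequences(corpus, {"word": "[cC]ompra", "pos": "V.*"}))
```

One caution that follows from ordinary regular-expression behaviour: a pattern such as N.* matches any tag beginning with N, so with the tag list given above it would also accept NUM.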

8. Other grammatical information.

In addition to the pos tag, the corpora use several other indications of grammatical classification, which can be used in conjunction with other search categories:

deriv searches for the words derived from a lexical base

temcagr identifies tense and mood values for verb forms and case for pronouns

pessnum identifies the number of nouns and adjectives, and the person and number for pronouns and verb forms

gen identifies the gender of nouns, adjectives and pronouns

func indicates the grammatical function of words and phrases

For detailed information on the use of these tags, consult the AC/DC website.


Last edited 17.7.2003