Linux agrep command
agrep is a version of the grep utility that also matches approximate patterns.
Description
agrep searches the named input files (standard input is the default) for records containing strings which either exactly or approximately match a pattern.
A record is by default a single line, but it can be defined differently using the -d option (see below). Normally, each record found is copied to the standard output. Approximate matching finds records that contain the pattern with a bounded number of errors, where an error is a substitution, an insertion, or a deletion.
For example, "Massechusets" matches "Massachusetts" with two errors (one substitution and one insertion). Running agrep -2 Massechusets foo outputs all lines in the file foo containing any string with (at most) 2 errors from "Massechusets".
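The error count in this example can be checked with a short sketch. The Python below is an illustration only, not agrep's actual implementation: it uses a standard dynamic-programming recurrence for approximate substring matching, handles plain strings only, and the names min_errors and approx_lines are hypothetical.

```python
def min_errors(pattern, text):
    """Fewest insertions, deletions, and substitutions needed to turn
    `pattern` into some substring of `text` (each edit counts as one error)."""
    m = len(pattern)
    # col[j]: cheapest way to match pattern[:j] against a substring of the
    # text read so far that ends at the current position.
    col = list(range(m + 1))
    best = col[m]
    for ch in text:
        new = [0]  # an empty pattern prefix matches anywhere for free
        for j in range(1, m + 1):
            mismatch = 0 if pattern[j - 1] == ch else 1
            new.append(min(col[j - 1] + mismatch,  # match or substitution
                           col[j] + 1,             # extra character in the text
                           new[j - 1] + 1))        # pattern character missing
        col = new
        best = min(best, col[m])
    return best


def approx_lines(pattern, lines, max_errors):
    """Rough analogue of `agrep -<max_errors> pattern file` for plain strings."""
    return [line for line in lines if min_errors(pattern, line) <= max_errors]


# Made-up sample lines standing in for the file foo.
sample = ["Boston, Massachusetts", "Augusta, Maine"]
print(approx_lines("Massechusets", sample, 2))
```

With an error limit of 1 the same call returns no lines, because "Massechusets" needs both a substitution and an insertion to become "Massachusetts".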
agrep supports many kinds of queries including arbitrary wildcards, sets of patterns, and in general, all regular expressions. It supports most of the options supported by the grep family plus several more (but it is not 100% compatible with grep).
As with the rest of the grep family, the characters $, ^, *, [, ], |, (, ), !, and \ can cause unexpected results when included in the pattern, because these special characters are also meaningful to the shell. To avoid these problems, always enclose the entire pattern argument in single quotes, i.e., 'pattern'. Do not use double quotes (").
When agrep is applied to more than one input file, the name of the file is displayed at the beginning of each line which matches the pattern. (The file name is not displayed when processing a single file. To make it appear in that case, add /dev/null as a second file in the list.)
Syntax
agrep [ -#cdehiklnpstvwxBDGIS ] pattern [ -f patternfile ] [ filename... ]
Options
-# | # is a non-negative integer (at most 8) specifying the maximum number of errors permitted in finding the approximate matches. It defaults to zero. Generally, each insertion, deletion, or substitution counts as one error. It is possible to adjust the relative cost of insertions, deletions, and substitutions; see -I -D and -S options. |
-c | Display only the count (number of occurrences) of matching records. |
-d 'delim' | Define delim to be the separator between two records. The default value is '$', which matches the end of a line; therefore, by default, a record is a single line. The delim is a string of up to eight characters (with possible use of ^ and $). Text between two delims, before the first delim, and after the last delim is considered as one record. For example, -d '$$' defines paragraphs as records (if paragraphs are separated by two consecutive newlines) and -d '^From ' defines mail messages as records. agrep matches each record separately. This option does not currently work with regular-expression patterns, and delim itself cannot be a regular expression. |
-e pattern | Same as providing a simple pattern argument, but using -e is useful when the pattern begins with a '-'. |
-f patternfile | Match the patterns in patternfile. The output is all lines that match at least one of the patterns in patternfile. Currently, the -f option works only for exact match and for simple patterns (any meta symbol is interpreted as a regular character). It is compatible only with -c, -h, -i, -l, -s, -v, -w, and -x options. |
-h | Do not display file names. |
-i | Case-insensitive search; e.g., "A" and "a" are considered equivalent. |
-k | Use simple pattern matching, i.e., treat no symbols in the pattern as a meta character. For example, agrep -k 'a(b|c)*d' foo finds the occurrences of the literal string "a(b|c)*d" in foo, whereas agrep 'a(b|c)*d' foo finds substrings in foo that match the regular expression 'a(b|c)*d'. |
-l | List only the names of the files that contain a match. For example, agrep -l 'wonderful' * lists the names of those files in the current directory that contain the word "wonderful". |
-n | Each line that is printed is prefixed by its record number in the file. |
-p | Find records in the text that contain a supersequence of the pattern. For example, agrep -p DCS foo will match "Department of Computer Science". |
-s | Work silently; that is, display nothing except error messages. |
-t | Output the record starting from the end of delim to (and including) the next delim. This is useful for cases where delim should come at the end of the record. |
-v | Inverse mode — display only those records that do not contain the pattern. |
-w | Search for the pattern as a word only, i.e., match only when the pattern is surrounded by non-alphanumeric characters, such as a space or a dash. The non-alphanumeric characters must surround the match; they cannot be counted as errors. For example, agrep -w -1 car will match "cars", but not "characters". |
-x | The pattern must match the whole line. |
-y | Used with -B option. When -y is on, agrep always outputs the best matches without giving a prompt. |
-B | Best match mode. When -B is specified and no exact matches are found, agrep continues to search until the closest matches (i.e., the ones with minimum number of errors) are found, at which point the following message will be shown: "the best match contains x errors, there are y matches, output them? (y/n)". The best match mode is not supported for standard input, e.g., pipeline input. When the -#, -c, or -l options are specified, the -B option is ignored. In general, -B may be slower than -#, but not by very much. |
-Dk | Set the cost of a deletion to k (k is a positive integer). This option does not currently work with regular expressions. |
-G | Output the files that contain a match. |
-Ik | Set the cost of an insertion to k (k is a positive integer). This option does not currently work with regular expressions. |
-Sk | Set the cost of a substitution to k (k is a positive integer). This option does not currently work with regular expressions. |
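The word-only behavior of the -w option listed above can be illustrated with a small sketch. This is a simplification under stated assumptions, not agrep's algorithm: the line is split into runs of ASCII letters and digits, and each whole word is compared to the pattern with an ordinary edit distance. The function names are hypothetical.

```python
import re


def edit_distance(a, b):
    """Classic Levenshtein distance between two whole strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (ca != cb),  # match or substitution
                           prev[j] + 1,               # drop a character of a
                           cur[j - 1] + 1))           # add a character of b
        prev = cur
    return prev[-1]


def word_match(pattern, line, max_errors=0):
    """True if some alphanumeric word in `line` is within max_errors of pattern."""
    words = re.findall(r"[A-Za-z0-9]+", line)
    return any(edit_distance(pattern, w) <= max_errors for w in words)


print(word_match("car", "two cars", 1))
print(word_match("car", "many characters", 1))
```

Because the separators are never part of a word, they cannot be absorbed as errors, which is why "characters" is rejected even though it contains "c", "a", and "r" in order.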
Patterns
agrep supports a large variety of patterns, including simple strings, strings with classes of characters, sets of strings, wildcards, and regular expressions.
Strings
A string is any sequence of characters, including the special symbols ^ for beginning of line and $ for end of line. The special characters listed above ( $, ^, *, [, |, (, ), !, and \ ) should be preceded by \ if they are to be matched as regular characters. For example, \^abc\\ corresponds to the string "^abc\", whereas ^abc corresponds to the string "abc" at the beginning of a line.
Character classes
A class of characters is a list of characters inside "[]" (in order), and it corresponds to any single character from the list. A dash represents the range between two characters. For example, [a-ho-z] is any character between a and h or between o and z. The symbol ^ at the start of the list inside [] complements the list, i.e., it denotes which characters not to match. For example, [^i-n] denotes any character except characters i through n. The symbol ^ thus has two meanings, but this is consistent with egrep. The symbol . stands for any character except for the newline character.
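The two class examples can be tried out in Python, whose re module uses the same bracket notation for these simple cases. This only mirrors the notation; it is assumed, not verified, that agrep and Python agree on every corner case.

```python
import re

a_to_h_or_o_to_z = re.compile(r"[a-ho-z]")  # any of a..h or o..z
not_i_to_n = re.compile(r"[^i-n]")          # anything except i..n
any_char = re.compile(r".")                 # any character except newline

for ch in "ckp":
    print(ch, bool(a_to_h_or_o_to_z.fullmatch(ch)), bool(not_i_to_n.fullmatch(ch)))
```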
Boolean operations
agrep supports an AND operation ';' and an OR operation ',', but not a combination of both. For example, fast;network searches for all records containing both "fast" and "network".
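As a rough sketch of these semantics, the hypothetical helper below evaluates a ';' or ',' query against a single record. It assumes exact substring matching of each term and rejects queries that mix the two operators, mirroring the restriction stated above.

```python
def record_matches(query, record):
    """Evaluate an agrep-style ';' (AND) or ',' (OR) query on one record."""
    if ";" in query and "," in query:
        raise ValueError("AND and OR cannot be combined in one pattern")
    if ";" in query:
        # AND: every term must occur somewhere in the record.
        return all(term in record for term in query.split(";"))
    # OR, or a single term when no operator is present.
    return any(term in record for term in query.split(","))


print(record_matches("fast;network", "a fast network"))
print(record_matches("fast;network", "a fast car"))
print(record_matches("fast,network", "a slow network"))
```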
Wildcards
The symbol '#' is used to denote a wildcard. # matches zero or more arbitrary characters. For example, ex#e matches "example". The symbol # is equivalent to .* in egrep. In fact, .* works too, because it is a valid regular expression, but unless it is part of an actual regular expression, # works faster.
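The stated equivalence between # and .* can be sketched as a translation into a Python regular expression. The helper name is hypothetical, and it assumes '#' is the only special symbol in the pattern; every other character is treated literally.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a plain pattern in which '#' is the only special symbol
    into a compiled Python regular expression."""
    literal_parts = [re.escape(part) for part in pattern.split("#")]
    return re.compile(".*".join(literal_parts))


rx = wildcard_to_regex("ex#e")
print(rx.pattern)
print(bool(rx.search("example")), bool(rx.search("exam")))
```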
Combination of Exact and Approximate Matching
Any pattern inside angle brackets <> must match the text exactly, even when errors are allowed in the rest of the pattern. For example, <mathemat>ics matches "mathematical" with one error (replacing the last s with an a), but mathe<matics> does not match "mathematical" no matter how many errors are allowed.
Regular Expressions
The syntax of regular expressions in agrep is in general the same as that for egrep. The union operation '|', Kleene closure '*', and parentheses () are all supported. Currently '+' is not supported. Regular expressions are currently limited to approximately 30 characters (excluding meta characters). Some options (-d, -w, -f, -t, -x, -D, -I, -S) do not currently work with regular expressions. The maximal number of errors for regular expressions that use '*' or '|' is 4.
Examples
agrep -2 -c ABCDEFG foo
Gives the number of lines in file foo that contain "ABCDEFG" within two errors.
agrep -1 -D2 -S2 'ABCD#YZ' foo
Outputs the lines containing "ABCD" followed within arbitrary distance by "YZ", with up to one additional insertion (-D2 and -S2 make deletions and substitutions too "expensive").
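The effect of the cost options can be sketched with a weighted variant of the usual edit-distance recurrence. This is an illustration for plain strings only (it does not handle '#'), not agrep's implementation. It assumes that an insertion means an extra character in the text and a deletion means a pattern character missing from the text; the function and parameter names are hypothetical.

```python
def weighted_errors(pattern, text, ins=1, dele=1, sub=1):
    """Cheapest total cost of turning `pattern` into some substring of `text`."""
    m = len(pattern)
    col = [j * dele for j in range(m + 1)]
    best = col[m]
    for ch in text:
        new = [0]  # a match may start at any position in the text
        for j in range(1, m + 1):
            step = 0 if pattern[j - 1] == ch else sub
            new.append(min(col[j - 1] + step,   # match or substitution
                           col[j] + ins,        # insertion: extra text character
                           new[j - 1] + dele))  # deletion: pattern character missing
        col = new
        best = min(best, col[m])
    return best


# With an error budget of 1 and costs D=2, S=2, only an insertion fits.
print(weighted_errors("ABCD", "xxABzCDxx", dele=2, sub=2))
print(weighted_errors("ABCD", "xxABDxx", dele=2, sub=2))
```

Under these costs a record passes an error limit of 1 only when its cheapest alignment uses a single insertion, which matches the description of -1 -D2 -S2 above.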
agrep -5 -p abcdefghij /path/to/dictionary/words
Outputs the list of all words in the dictionary located at /path/to/dictionary/words containing at least 5 of the first 10 letters of the alphabet in order.
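One way to read the -p examples is as a subsequence test: the pattern's characters must appear in the record in order, and each permitted error lets one pattern character be missing. The sketch below follows that reading using a longest-common-subsequence length. It is an interpretation of the description, not agrep's code, and the names are hypothetical.

```python
def lcs_length(pattern, text):
    """Length of the longest common subsequence of pattern and text."""
    prev = [0] * (len(text) + 1)
    for p in pattern:
        cur = [0]
        for j, t in enumerate(text, 1):
            cur.append(prev[j - 1] + 1 if p == t else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def supersequence_match(pattern, record, max_errors=0):
    """True if record contains, in order, all but at most max_errors
    of the pattern's characters."""
    return lcs_length(pattern, record) >= len(pattern) - max_errors


print(supersequence_match("DCS", "Department of Computer Science"))
print(supersequence_match("abcdefghij", "xaxbxcxdxe", 5))
```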
agrep -1 'abc[0-9](de|fg)*[x-z]' foo
Outputs the lines containing, within up to one error, the string that starts with "abc" followed by one digit, followed by zero or more repetitions of either "de" or "fg", followed by either "x", "y", or "z".
agrep -d '^From ' 'breakdown;internet' mbox
Outputs all mail messages (the pattern "^From " separates mail messages in a mail file) that contain keywords "breakdown" and "internet".
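The record-based behavior in this example can be sketched as follows. The helpers are hypothetical and assume exact keyword matching; a new record starts at every line beginning with "From ", and any text before the first such line forms a record of its own. The mailbox content is made-up sample data.

```python
def split_records(text, start="From "):
    """Split text into records, each beginning at a line that starts with `start`."""
    records, current = [], []
    for line in text.splitlines(keepends=True):
        if line.startswith(start) and current:
            records.append("".join(current))
            current = []
        current.append(line)
    if current:
        records.append("".join(current))
    return records


def records_with_all(text, keywords, start="From "):
    """Records that contain every keyword (the ';' AND query)."""
    return [r for r in split_records(text, start) if all(k in r for k in keywords)]


mbox = (
    "From alice Mon Jan 1\nSubject: outage\nAn internet breakdown occurred.\n"
    "From bob Tue Jan 2\nSubject: lunch\nNo breakdown here.\n"
)
hits = records_with_all(mbox, ["breakdown", "internet"])
print(len(hits))
```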
agrep -d '$$' -1 '<word1> <word2>' foo
Finds all paragraphs that contain word1 followed by word2 with one error in place of the blank. In particular, if word1 is the last word in a line and word2 is the first word in the next line, then the space will be substituted by a newline symbol and it will match. Thus, this is a way to overcome separation by a newline. Note that -d '$$' (or another delim which spans more than one line) is necessary, because otherwise agrep searches only one line at a time.
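The newline case can be illustrated with a narrow sketch. It covers only the substitution case described above, where the single character between the two words may be anything, including a newline; it does not model an inserted or deleted character. Paragraphs are assumed to be separated by one blank line, and the function name is hypothetical.

```python
import re


def paragraphs_with_adjacent(text, word1, word2):
    """Paragraphs in which word1 and word2 are separated by exactly one
    character of any kind (a space, a newline, and so on)."""
    gap = re.compile(re.escape(word1) + r"." + re.escape(word2), re.DOTALL)
    return [p for p in text.split("\n\n") if gap.search(p)]


sample = (
    "see word1\nword2 here\n\n"
    "word1 and word2 apart\n\n"
    "word1 word2 together"
)
print(paragraphs_with_adjacent(sample, "word1", "word2"))
```

Searching line by line would miss the first paragraph, because "word1" and "word2" sit on different lines there.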
Related commands
grep — Filter text which matches a regular expression.