Reposted: Perl Tutorial: Regular Expressions (Regex), File IO and Text Processing
Perl is famous for processing text files via regular expressions.
# 1. Regular Expressions in Perl
A Regular Expression (or Regex) is a pattern (or filter) that describes a set of strings that matches the pattern. In other words, a regex accepts a certain set of strings and rejects the rest.
I shall assume that you are familiar with Regex syntax. Otherwise, you could read:
- “Regex Syntax Summary” for a summary of regex syntax and examples.
- “Regular Expressions” for full coverage.
Perl makes extensive use of regular expressions, with many built-in syntaxes and operators. In Perl (and JavaScript), a regex is delimited by a pair of forward slashes (by default), in the form of `/regex/`. You can use the built-in operators:
- `m/regex/modifier`: Match against the `regex`.
- `s/regex/replacement/modifier`: Substitute the matched substring(s) with the `replacement`.
# 1.1 Matching Operator m//
You can use the matching operator `m//` to check if a regex pattern exists in a string. The syntax is:

    m/regex/
# Delimiter
Instead of using forward slashes (`/`) as the delimiter, you could use other non-alphanumeric characters such as `!`, `@` and `%`, in the form of `m!regex!modifiers`, `m@regex@modifiers` or `m%regex%modifiers`. However, if the forward slash (`/`) is used as the delimiter, the operator `m` can be omitted, in the form of `/regex/modifiers`. Changing the default delimiter can be confusing and is generally not recommended, unless the pattern itself contains many forward slashes.
`m//`, by default, operates on the default variable `$_`. It returns true if `$_` matches the regex, and false otherwise.
# Example 1: Regex [0-9]+
    #!/usr/bin/env perl
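As a minimal sketch of this example (the sample strings are made up for illustration and are not taken from the original listing), the following script tests each string against the regex `[0-9]+` through the default variable `$_`:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical sample inputs, chosen only to show a match and a non-match.
my @inputs = ('abc123', 'no digits here', '2024');
my @results;

for (@inputs) {              # each element is aliased to $_
    if (m/[0-9]+/) {         # m// tests $_ by default
        push @results, "'$_' contains digits";
    } else {
        push @results, "'$_' has no digits";
    }
}
print "$_\n" for @results;
```

Since the forward slash is the delimiter here, `m/[0-9]+/` could equally be written as `/[0-9]+/`.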
# Example 2: Extracting the Matched Substrings
The built-in array variables `@-` and `@+` keep the start and end offsets of the matched substring: `$-[0]` and `$+[0]` for the full match, and `$-[n]` and `$+[n]` for the back-references `$1`, `$2`, …, `$n`. The end offset points just past the last matched character.

    #!/usr/bin/env perl
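A short sketch of how these offsets might be used, assuming a made-up input string: the start and end offsets of the first group are read from `@-` and `@+` and passed to `substr` to recover the matched text.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $str = 'Order 1234 shipped';    # hypothetical input
my ($start, $end, $matched);

if ($str =~ /([0-9]+)/) {
    # $-[1] is the offset where group 1 starts; $+[1] is the offset just past its end.
    ($start, $end) = ($-[1], $+[1]);
    $matched = substr($str, $start, $end - $start);
    print "matched '$matched' at [$start, $end)\n";
}
```

For this input the script prints `matched '1234' at [6, 10)`, and the extracted text is the same as `$1`.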
# Example 3: Modifier ‘g’ (global)
By default, `m//` finds only the first match. To find all matches, include the `g` (global) modifier.

    #!/usr/bin/env perl
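The difference can be sketched as follows, using an invented sample string. In list context, a match with `/g` returns all matches; in scalar context, each evaluation continues from where the previous match ended.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $text = 'tel: 555-1234, fax: 555-9876';   # hypothetical input

# Without /g: list context returns the groups of the first match only.
my @first = $text =~ /([0-9]+)/;

# With /g: list context returns every match.
my @all = $text =~ /[0-9]+/g;

# With /g in scalar context: iterate over the matches one at a time.
my $count = 0;
$count++ while $text =~ /[0-9]+/g;

print "first: @first\n";
print "all:   @all\n";
print "count: $count\n";
```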
# 1.2 Operators =~ and !~
By default, the matching operators operate on the default variable `$_`. To operate on another variable instead of `$_`, use the `=~` and `!~` operators as follows:

    str =~ m/regex/modifiers   # Return true if str matches regex.

`!~` is the negated form: it returns true if `str` does not match the regex. When used with `m//`, `=~` behaves like a comparison (`==` or `eq`).
# Example 4: =~ Operator
    #!/usr/bin/env perl
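A minimal sketch of both binding operators on a named variable (the string and variable names are illustrative only):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $line = 'Hello, World';   # hypothetical input

my $has_world  = ($line =~ /World/)  ? 1 : 0;   # true: the substring is present
my $no_digits  = ($line !~ /[0-9]/)  ? 1 : 0;   # true: no digit anywhere in $line
my $hello_case = ($line =~ /hello/)  ? 1 : 0;   # false: matching is case-sensitive
my $hello_any  = ($line =~ /hello/i) ? 1 : 0;   # true with the /i modifier

print "has_world=$has_world no_digits=$no_digits ",
      "hello_case=$hello_case hello_any=$hello_any\n";
```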
# 1.3 Substitution Operator s///
You can substitute a string (or a portion of a string) with another string using the `s///` substitution operator. The syntax is:

    s/regex/replacement/

Similar to `m//`, `s///` operates on the default variable `$_` by default. To operate on another variable, use the `=~` operator. `s///` modifies the variable in place and returns the number of substitutions made, so when used with `s///`, `=~` behaves like an assignment (`=`).
# Example 5: s///
    #!/usr/bin/env perl
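The following sketch (with invented strings) shows `s///` on `$_`, on a named variable, and with the `g` modifier. The `(my $copy = $str) =~ s/.../.../` idiom copies the string first, so the original is left untouched.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $str = 'cat bat cat';   # hypothetical input

(my $first_only = $str) =~ s/cat/dog/;    # replace the first occurrence in a copy
(my $all        = $str) =~ s/cat/dog/g;   # replace every occurrence in a copy

# s/// returns the number of substitutions and modifies $str in place.
my $count = ($str =~ s/at/AT/g);

# Without =~, s/// works on the default variable $_.
$_ = 'hello world';
s/world/Perl/;
my $default = $_;

print "$first_only | $all | $str ($count) | $default\n";
```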
# 1.4 Modifiers
Modifiers (such as `/g`, `/i`, `/e`, `/o`, `/s` and `/x`) can be used to control the behavior of `m//` and `s///`.
- g (global): By default, only the first occurrence of the matching string is processed. Use the modifier `/g` to process all occurrences.
- i (case-insensitive): By default, matching is case-sensitive. Use the modifier `/i` to enable case-insensitive matching.
- m (multiline): treats the string as multiple lines, so that the anchors `^` and `$` also match at the start and end of each line. `\A` and `\Z` are not affected: they always anchor to the whole string.
- s (single line): permits the metacharacter `.` (dot) to match the newline.
# 1.5 Parenthesized Back-References & Matched Variables $1, …, $9
Parentheses `( )` serve two purposes in regex:
- Firstly, parentheses `( )` can be used to group sub-expressions, for overriding the precedence or applying a repetition operator. For example, `/(a|e|i|o|u){3,5}/` matches a run of 3 to 5 vowels in any combination, the same as `/[aeiou]{3,5}/`. Without the parentheses, in `/a|e|i|o|u{3,5}/`, the repetition would apply to `u` only.
- Secondly, parentheses are used to provide the so-called back-references. A back-reference contains the matched substring. For example, the regex `/(\S+)/` creates one back-reference `(\S+)`, which contains the first word (consecutive non-spaces) in the input string; the regex `/(\S+)\s+(\S+)/` creates two back-references, `(\S+)` and another `(\S+)`, containing the first two words, separated by one or more whitespace characters `\s+`.
The back-references are stored in the special variables `$1`, `$2`, …, `$9` (and beyond, if there are more groups), where `$1` contains the substring matched by the first pair of parentheses, and so on. For example, `/(\S+)\s+(\S+)/` creates two back-references, which match the first two words. The matched words are stored in `$1` and `$2`, respectively.
For example, the following expression swaps the first and second words:

    s/(\S+) (\S+)/$2 $1/;   # Swap the first and second words separated by a single space
Back-references can also be referenced in your program.
For example,

    (my $word) = ($str =~ /(\S+)/);

The parentheses create one back-reference, which matches the first word of `$str` if there is one, and the match is placed in the scalar variable `$word`. If there is no match, `$word` is `undef`.
Another example,

    (my $word1, my $word2) = ($str =~ /(\S+)\s+(\S+)/);

The two pairs of parentheses place the first two words (separated by one or more whitespace characters) of `$str` into the variables `$word1` and `$word2` if there are at least two words; otherwise, both `$word1` and `$word2` are `undef`. Note that the whole regex must match: there is no partial matching that fills only `$word1`.
`\1`, `\2`, `\3` have the same meaning as `$1`, `$2`, `$3`, but are valid only inside the regex part of `s///` or `m//`. For example, `/(\S+)\s\1/` matches a pair of repeated words, separated by a whitespace character.
# 1.6 Character Translation Operator tr///
You can use the translation operator `tr///` to translate a character into another character. The syntax is:

    tr/fromchars/tochars/modifiers

It replaces (translates) `fromchars` with `tochars` in `$_`, and returns the number of characters replaced.
For example,

    tr/a-z/A-Z/   # converts $_ to uppercase.

Instead of the forward slash (`/`), you can use parentheses `()`, brackets `[]` or curly brackets `{}` as delimiters, e.g.,

    tr[0-9][##########]   # replace digits by #.

If `tochars` is shorter than `fromchars`, the last character of `tochars` is used repeatedly.

    tr/a-z/A-E/   # f to z is replaced by E.

`tr///` returns the number of replaced characters. You can use it to count the occurrences of certain characters. For example,

    my $numLetters = ($string =~ tr/a-zA-Z/a-zA-Z/);
# Modifiers /c, /d and /s for tr///
- `/c`: complements (inverts) `fromchars`.
- `/d`: deletes any matched but un-replaced characters.
- `/s`: squashes runs of duplicate replaced characters into just one.

For example,

    tr/A-Za-z/ /c   # replaces all non-alphabets with space
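A small sketch of `tr///` and its modifiers on copies of a made-up string, so that each result can be compared with the original:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $str = 'Hello, World 2024!';   # hypothetical input

(my $upper  = $str) =~ tr/a-z/A-Z/;    # lowercase to uppercase
(my $masked = $str) =~ tr/0-9/#/;      # tochars is shorter: '#' is reused for every digit

# An empty tochars without /d leaves the string unchanged and just counts.
my $num_letters = ($str =~ tr/a-zA-Z//);

(my $letters_only = $str) =~ tr/a-zA-Z//cd;      # /c + /d: delete everything except letters
(my $squeezed = 'aaabbbccc') =~ tr/a-z//s;       # /s: squash runs of the same character

print "$upper\n$masked\n$num_letters\n$letters_only\n$squeezed\n";
```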
# 1.7 String Functions: split and join
- `split(regex, str, [numItems])`: Splits the given `str` using the `regex` as the separator, and returns the items in an array. The optional third parameter specifies the maximum number of items to return; the last item then holds the unsplit remainder.
- `join(joinStr, strList)`: Joins the items in `strList` with the given `joinStr` (possibly empty).

For example,

    #!/usr/bin/env perl
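A minimal sketch of `split` and `join`, assuming a comma-separated line with uneven spacing (the data is invented):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $line = 'apple, banana,cherry ,date';   # hypothetical input

# Split on a comma with optional surrounding whitespace.
my @fruits = split(/\s*,\s*/, $line);

# With a limit of 2, the second item keeps the unsplit remainder.
my @two = split(/\s*,\s*/, $line, 2);

# Join the items back together with a different separator.
my $joined = join('|', @fruits);

print scalar(@fruits), " items: $joined\n";
print "limited: [$two[0]] [$two[1]]\n";
```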
# 1.8 Functions grep, map
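- `grep(regex, array)`: selects those elements of the `array` that match the `regex`. In scalar context, it returns the number of matching elements.
- `map(expr, array)`: returns a new array constructed by evaluating `expr` (for example, a regex operation or a function call) on each element of the `array`.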
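Both functions can be sketched on a small, made-up word list. The block form `{ ... }` used for `map` here is an alternative to the expression form; inside it, `$_` is the current element.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my @words = qw(apple Banana cherry avocado);   # hypothetical data

# grep with a regex: keep the words starting with 'a' (case-insensitive).
my @a_words = grep(/^a/i, @words);

# grep in scalar context: count the words containing 'an'.
my $an_count = grep(/an/, @words);

# map: build new lists from each element.
my @upper   = map { uc($_) } @words;
my @lengths = map { length($_) } @words;

print "@a_words | $an_count | @upper | @lengths\n";
```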
# 2. File Input/Output
# 2.1 Filehandle
A filehandle is a data structure that your program uses to manipulate files. A filehandle acts as a gate between your program and the files, directories, or other programs. Your program first opens a gate, then sends or receives data through the gate, and finally closes the gate. There are many types of gates: one-way vs. two-way, slow vs. fast, wide vs. narrow.
Naming convention: use uppercase for the name of the filehandle, e.g., `FILE`, `DIR`, `FILEIN`, `FILEOUT`, etc.
Once a filehandle is created and connected to a file (or a directory, or a program), you can read from the underlying file through the filehandle using angle brackets, e.g., `<FILEHANDLE>`, and write to it using `print FILEHANDLE ...`.
Example: Read and print the content of a text file via a filehandle.

    #!/usr/bin/env perl

Example: Search and print lines containing a particular search word.

    #!/usr/bin/env perl

Example: Print the content of a directory via a directory handle.

    #!/usr/bin/env perl

You can use the C-style `printf` for formatted output to a file.
# 2.2 File Handling Functions
Function open: `open(filehandle, string)` opens the file named by `string` and associates it with the `filehandle`. It returns true on success and `undef` otherwise.
- If `string` begins with `<` (or nothing), the file is opened for reading.
- If `string` begins with `>`, it is opened for writing.
- If `string` begins with `>>`, it is opened for appending.
- If `string` begins with `+<`, `+>` or `+>>`, it is opened for both reading and writing.
- If `string` is `-`, `STDIN` is opened.
- If `string` is `>-`, `STDOUT` is opened.
- If `string` begins with `-|` or `|-`, your process will `fork()` to execute the pipe command.

Function close: `close(filehandle)` closes the file associated with the filehandle. When the program exits, Perl closes all opened filehandles. Closing a file flushes the output buffer to the file. To ensure data integrity, explicitly close a file as soon as you have finished writing to it, so that the data is safely written even if the user later aborts the program.
A common procedure for modifying a file is to:
- Read in the entire file with `open(FILE, $filename)` and `@lines = <FILE>`.
- Close the filehandle.
- Operate upon `@lines` (which is in the fast RAM) rather than `FILE` (which is on the slow disk).
- Write the new file contents using `open(FILE, ">$filename")` and `print FILE @lines`.
- Close the filehandle.
Example: Read the contents of the entire file into memory, modify them, and write them back to disk.

    #!/usr/bin/env perl
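The read, modify and write-back procedure can be sketched as follows. This is not the original listing: it works on a temporary file created with the core `File::Temp` module so that it is self-contained, and it uses the three-argument `open` with lexical filehandles instead of the two-argument form with bareword filehandles described above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw(tempdir tempfile);

# Set up a scratch file (an assumption made only to keep the sketch runnable).
my $dir = tempdir(CLEANUP => 1);
my ($tmp, $filename) = tempfile(DIR => $dir);
print {$tmp} "line one\nline two\n";
close($tmp) or die "Cannot close $filename: $!";

# 1. Read the entire file into memory, then close it.
open(my $in, '<', $filename) or die "Cannot read $filename: $!";
my @lines = <$in>;
close($in);

# 2. Operate on the in-memory copy.
s/line/LINE/ for @lines;

# 3. Write the new contents back, then close.
open(my $out, '>', $filename) or die "Cannot write $filename: $!";
print {$out} @lines;
close($out) or die "Cannot close $filename: $!";
```

Checking the return value of `open` and `close` with `or die` reports a failure at once, instead of silently working on an unopened handle.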
Example: Reading from a file

    #!/usr/bin/env perl

Example: Writing to a file

    #!/usr/bin/env perl

Example: Appending to a file

    #!/usr/bin/env perl
# 2.3 In-Place Editing
Instead of reading in one file and writing to another file, you could do in-place editing by specifying the `-i` flag or by using the special variable `$^I`.
- The `-ibackupExtension` flag tells Perl to edit files in-place. If a `backupExtension` is provided, a backup file will be created with that extension.
- Setting the special variable `$^I = backupExtension` does the same thing.

Example: In-place editing using the `-i` flag

    #!/usr/bin/env perl -i.old   # In-place edit, backup as '.old'

Example: In-place editing using the `$^I` special variable.

    #!/usr/bin/env perl
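A hedged sketch of in-place editing with `$^I` follows. It is not the original listing: the scratch file, its content and the `.bak` extension are assumptions chosen to make the example self-contained. Inside the `<>` loop, `print` writes to the file being edited rather than to the console.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw(tempdir tempfile);

# Create a scratch file to edit (for illustration only).
my $dir = tempdir(CLEANUP => 1);
my ($tmp, $path) = tempfile(DIR => $dir);
print {$tmp} "apple\nbanana\n";
close($tmp) or die "Cannot close $path: $!";

{
    # Enable in-place editing with a '.bak' backup, for the files in @ARGV.
    local ($^I, @ARGV) = ('.bak', $path);
    while (<>) {
        s/apple/orange/;
        print;            # goes into the edited file, not STDOUT
    }
}

print "edited $path, backup in $path.bak\n";
```

Using `local` confines the in-place mode and the modified `@ARGV` to the block, so the rest of the program is unaffected.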
# 2.4 Functions seek, tell, truncate
`seek(filehandle, position, whence)`: moves the file pointer of the `filehandle` to `position`, as measured from `whence`. `seek()` returns 1 upon success and 0 otherwise. The file position is measured in bytes. A `whence` of 0 measures from the beginning of the file; 1 from the current position; and 2 from the end. For example:

    seek(FILE, 0, 2);   # 0 byte from end-of-file, give file size.

`tell(filehandle)`: returns the current file position of `filehandle`.
`truncate(FILE, length)`: truncates `FILE` to `length` bytes. `FILE` can be either a filehandle or a file name.
To find the length of a file, you could seek to the end of the file and then read the position with `tell()`:

    seek(FILE, 0, 2);   # Move file pointer to end of file.

Example: Truncate the last 2 bytes if they begin with `\x0D`,

    #!/usr/bin/env perl
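One way this example might look is sketched below. It is a reconstruction under assumptions, not the original listing: it creates its own scratch file ending in `\x0D\x0A`, opens it with `+<` for reading and writing, and uses `seek`, `tell` and `truncate` as described above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw(tempdir tempfile);

# Scratch file whose last two bytes are CR LF (for illustration only).
my $dir = tempdir(CLEANUP => 1);
my ($tmp, $path) = tempfile(DIR => $dir);
binmode($tmp);
print {$tmp} "hello\x0D\x0A";
close($tmp) or die "Cannot close $path: $!";

open(my $file, '+<', $path) or die "Cannot open $path: $!";
binmode($file);

seek($file, 0, 2);            # whence 2: measure from the end of the file
my $size = tell($file);       # position at the end = file size in bytes

if ($size >= 2) {
    seek($file, -2, 2);       # 2 bytes before the end
    read($file, my $tail, 2);
    if ($tail =~ /^\x0D/) {
        truncate($file, $size - 2) or die "Cannot truncate $path: $!";
    }
}
close($file);

my $new_size = -s $path;
print "size before: $size, after: $new_size\n";
```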
# 2.5 Function eof
`eof(filehandle)` returns 1 if the file pointer is positioned at the end of the file, or if the filehandle is not opened.
# 2.6 Reading Bytes Instead of Lines
The function `read(filehandle, var, length, offset)` reads `length` bytes from `filehandle`, starting from the current file pointer, and saves them into the variable `var` starting at `offset` (if omitted, the default is 0). It returns the number of bytes actually read, or 0 at end-of-file. The bytes include `\x0A`, `\x0D`, etc.

# Example
    #!/usr/bin/env perl
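A minimal sketch of reading fixed-size chunks with `read`, assuming a 10-byte scratch file and a chunk size of 4 (both chosen only for illustration). Passing `length($data)` as the offset appends each chunk to what has been read so far.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw(tempdir tempfile);

# Scratch file with 10 bytes of data (for illustration only).
my $dir = tempdir(CLEANUP => 1);
my ($tmp, $path) = tempfile(DIR => $dir);
binmode($tmp);
print {$tmp} "ABCDEFGHIJ";
close($tmp) or die "Cannot close $path: $!";

open(my $fh, '<', $path) or die "Cannot read $path: $!";
binmode($fh);

my $data = '';
my @sizes;
# Read up to 4 bytes at a time, storing them at the current end of $data.
while (my $n = read($fh, $data, 4, length($data))) {
    push @sizes, $n;      # number of bytes actually read in this call
}
close($fh);

print "read '$data' in chunks of @sizes bytes\n";
```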
# 2.7 Piping Data To and From a Process
If you want your program to receive data from a process, or to send data to a process, you could open a pipe to an external program.
- `open(handle, "command|")` lets you read from the output of `command`.
- `open(handle, "|command")` lets you write to the input of `command`.

Both of these statements return the Process ID (PID) of the command.
Example: The `dir` command lists the current directory. By opening a pipe from `dir`, you can access its output.

    #!/usr/bin/env perl

Example: This example shows how you can pipe input into the sendmail program.

    #!/usr/bin/env perl

You cannot pipe data both to and from a command. If you want to read the output of a command that you have opened with `|command`, send the output to a file. For example,

    open (PIPETO, "|command > /output.txt");
# 2.8 Deleting file: Function unlink
`unlink(FILES)` deletes the `FILES`, returning the number of files deleted. Do not use `unlink()` to delete a directory; use `rmdir()` instead. For example,

    unlink $filename;
# 2.9 Inspecting Files
You can inspect a file using the `(-test FILE)` condition. The condition returns true if `FILE` satisfies the `test`. `FILE` can be a filehandle or a filename. The available tests are:
- `-e`: exists.
- `-f`: plain file.
- `-d`: directory.
- `-T`: seems to be a text file (data from 0 to 127).
- `-B`: seems to be a binary file (data from 0 to 255).
- `-r`: readable.
- `-w`: writable.
- `-x`: executable.
- `-s`: returns the size of the file in bytes.
- `-z`: empty (zero byte).

# Example
    #!/usr/bin/env perl
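A few of these tests can be sketched on a temporary file and directory (both created by the script itself, as an assumption to keep it self-contained):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw(tempdir tempfile);

# Scratch directory and a 6-byte file inside it (for illustration only).
my $dir = tempdir(CLEANUP => 1);
my ($tmp, $file) = tempfile(DIR => $dir);
binmode($tmp);
print {$tmp} "hello\n";
close($tmp) or die "Cannot close $file: $!";

my $exists   = (-e $file) ? 1 : 0;            # the file exists
my $is_plain = (-f $file) ? 1 : 0;            # and is a plain file
my $is_dir   = (-d $dir)  ? 1 : 0;            # $dir is a directory
my $size     = -s $file;                      # size in bytes
my $is_empty = (-z $file) ? 1 : 0;            # not empty
my $missing  = (-e "$file.missing") ? 1 : 0;  # a name that does not exist

print "exists=$exists plain=$is_plain dir=$is_dir ",
      "size=$size empty=$is_empty missing=$missing\n";
```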
# 2.10 Functions stat and lstat
The function `stat(FILE)` returns a 13-element array giving the vital statistics of `FILE`. `lstat(SYMLINK)` returns the same information for the symbolic link `SYMLINK` itself, rather than for the file it points to.
The elements are:
Index | Value |
---|---|
0 | The device |
1 | The file’s inode |
2 | The file’s mode |
3 | The number of hard links to the file |
4 | The user ID of the file’s owner |
5 | The group ID of the file |
6 | The raw device |
7 | The size of the file |
8 | The last accessed time |
9 | The last modified time |
10 | The last time the file’s status changed |
11 | The block size of the system |
12 | The number of blocks used by the file |
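For example, the command

    perl -e "$size= (stat('test.txt'))[7]; print $size"

prints the file size of `test.txt`. This quoting suits the Windows command prompt; in a Unix shell, the double quotes would let the shell expand `$size` before Perl sees it.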
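The same lookup can be sketched as a script, which avoids shell quoting altogether. The 100-byte scratch file is an assumption made for illustration; indexes 7 and 9 are the size and last-modified time from the table above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw(tempdir tempfile);

# Scratch file with exactly 100 bytes (for illustration only).
my $dir = tempdir(CLEANUP => 1);
my ($tmp, $path) = tempfile(DIR => $dir);
binmode($tmp);
print {$tmp} 'x' x 100;
close($tmp) or die "Cannot close $path: $!";

my @st = stat($path) or die "Cannot stat $path: $!";
my $size  = $st[7];                  # index 7: size of the file in bytes
my $mtime = $st[9];                  # index 9: last modified time (epoch seconds)
my $size_via_slice = (stat($path))[7];   # same value, using a list slice

print "elements: ", scalar(@st), "\n";
print "size: $size bytes, modified: ", scalar(localtime($mtime)), "\n";
```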
# 2.11 Accessing the Directories
- `opendir(DIRHANDLE, dirname)` opens the directory `dirname`.
- `closedir(DIRHANDLE)` closes the directory handle.
- `readdir(DIRHANDLE)` returns the next file from `DIRHANDLE` in scalar context, or the rest of the files in list context.
- `glob(string)` returns an array of filenames matching the wildcard in `string`, e.g., `glob('*.dat')` and `glob('test??.txt')`.
- `mkdir(dirname, mode)` creates the directory `dirname` with the protection specified by `mode`.
- `rmdir(dirname)` deletes the directory `dirname`, only if it is empty.
- `chdir(dirname)` changes the working directory to `dirname`.
- `chroot(dirname)` makes `dirname` the root directory "/" for the current process; it can be used by the superuser only.
Example: Print the contents of a given directory.

    #!/usr/bin/env perl

Example: Removing empty files in a given directory

    #!/usr/bin/env perl
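Both examples can be sketched in one script. This is a reconstruction under assumptions, not the original listings: it builds a scratch directory with one empty and one non-empty file (hypothetical names), lists it with `opendir`/`readdir`, and removes the empty plain files with `unlink`.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Scratch directory with one empty and one non-empty file (for illustration only).
my $dir = tempdir(CLEANUP => 1);
for my $name ('empty.txt', 'full.txt') {
    open(my $out, '>', File::Spec->catfile($dir, $name)) or die "Cannot create $name: $!";
    print {$out} "data\n" if $name eq 'full.txt';
    close($out);
}

# List the directory, skipping the '.' and '..' entries.
opendir(my $dh, $dir) or die "Cannot open directory $dir: $!";
my @entries = sort grep { $_ ne '.' && $_ ne '..' } readdir($dh);
closedir($dh);
print "$_\n" for @entries;

# Remove the plain files that are empty.
my @removed;
for my $entry (@entries) {
    my $path = File::Spec->catfile($dir, $entry);
    if (-f $path && -z $path) {
        unlink($path) or die "Cannot delete $path: $!";
        push @removed, $entry;
    }
}
print "removed: @removed\n";
```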
Example: Display files matching `*.txt`

    my @files = glob('*.txt');

Example: Display files matching the command-line pattern.

    $file = shift;
# 2.12 Standard Filehandles
Perl defines the following standard filehandles:
- `STDIN`: Standard Input, usually refers to the keyboard.
- `STDOUT`: Standard Output, usually refers to the console.
- `STDERR`: Standard Error, usually refers to the console.
- `ARGV`: Command-line arguments.

For example:

    my $line = <STDIN>;   # Set $line to the next line of user input

When you use empty angle brackets `<>` to get input from the user, it uses the `STDIN` filehandle; when you get the input from files named on the command line, it uses the `ARGV` filehandle. Perl fills in `STDIN` or `ARGV` for you automatically. Whenever you use the `print()` function without a filehandle, it uses the `STDOUT` filehandle.
`<>` behaves like `<ARGV>` when there is still data to be read from the command-line files, and behaves like `<STDIN>` otherwise.
# 3. Text Formatting
# 3.1 Function write
`write(filehandle)`: prints formatted text to `filehandle`, using the format associated with `filehandle`. If `filehandle` is omitted, `STDOUT` is used.
# 3.2 Declaring format
    format name =
# 3.3 Picture Field @<, @|, @>
- `@<`: left-flushes the string on the next line of formatting texts.
- `@>`: right-flushes the string on the next line of formatting texts.
- `@|`: centers the string on the next line of the formatting texts.

The `<`, `>` and `|` characters can be repeated to control the number of characters to be formatted. The number of characters to be formatted is the same as the length of the picture field. `@###.##` formats numbers by lining up the decimal points under the `.`.
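A hedged sketch of a format declaration and `write` follows. The format name `REPORT`, the variable names and the sample rows are all invented for illustration. The format is named after the filehandle, so `write(REPORT)` picks it up automatically; the output goes to a temporary file.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

our ($item, $qty, $price);    # variables referenced by the format below

# Picture line: 9-char left-flushed, 4-char right-flushed, numeric with 2 decimals.
format REPORT =
@<<<<<<<< @>>> @##.##
$item,    $qty, $price
.

my $dir  = tempdir(CLEANUP => 1);
my $path = File::Spec->catfile($dir, 'report.txt');   # hypothetical output file

open(REPORT, '>', $path) or die "Cannot write $path: $!";
for my $row (['apple', 3, 1.5], ['kiwi', 12, 0.25]) {
    ($item, $qty, $price) = @$row;
    write(REPORT);            # emits one formatted line per call
}
close(REPORT) or die "Cannot close $path: $!";
```

Each `write` fills the picture line with the current values of the variables listed beneath it, so the columns line up regardless of the length of each value.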
# 3.4 Printing Formatting String printf
`printf(filehandle template, array)`: prints a formatted string to `filehandle` (similar to C's `fprintf()`). Note that there is no comma after the filehandle. For example,

    printf(FILE "The number is %d", 15);

The available formatting fields are:
Field | Expected Value |
---|---|
%s | String |
%c | Character |
%d | Decimal number |
%ld | Long decimal number |
%u | Unsigned decimal number |
%x | Hexadecimal number |
%lx | Long hexadecimal number |
%o | Octal number |
%lo | Long octal number |
%f | Fixed-point floating-point number |
%e | Exponential floating-point number |
%g | Compact floating-point number |
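A few of these fields can be sketched as follows. To keep the example self-contained, the filehandle writes into an in-memory string (an assumption for illustration; a filehandle opened on a real file is used the same way), and the values are made up.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# An in-memory filehandle: everything printed to $out lands in $buffer.
my $buffer = '';
open(my $out, '>', \$buffer) or die "Cannot open in-memory file: $!";

# No comma between the filehandle and the template.
printf {$out} "The number is %d\n", 15;

# %-6s: left-justified in 6 columns; %5.2f: 5 columns wide with 2 decimals.
printf {$out} "%-6s|%5.2f|%x|%o\n", 'abc', 3.14159, 255, 8;

close($out);
print $buffer;
```

`sprintf` accepts the same template and returns the formatted string instead of printing it.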