Robert's Perl Tutorial, Version 4.1.1
The last hack was made on: 20th April 1999 so Henry can concentrate
THIS DOCUMENT IS COPYRIGHTED. Reproduction in whole or part is prohibited. Please email me at robert@netcat.co.uk if you want to use this information anywhere.
The location of this document is http://www.sthomas.net/oldpages/roberts-perl-tutorial.htm mirrored from http://www.netcat.co.uk/rob/perl/win32perltut.html
A basic Perl course primarily for use on Win32 platforms. It assumes that the reader knows nothing of programming whatsoever, but needs a solid grounding for further work. After you finish this course you'll be ready to specialise in CGI, sysadmin or whatever you want to do with Perl.
I've had a fair amount of requests for a ToC, so here it is:
More ways to access arrays
Basic changes
You need to be able to differentiate between a PC and a toaster. No programming experience is necessary. You do need to understand the basics of PC operation. If you don't understand what directories and files are then you'll find this difficult. You might find it difficult even if you do :-)
You do need to exercise the brain cells, and you need time.
Note: You don't even need a Win32 PC if you are comfortable installing Perl under other operating systems like Linux, but not all the information here will be relevant.
You don't need a compiler. Perl is an interpreted language, which means you run your code directly rather than compiling it first and then running it.
Just work through from start to finish.
Generally, the explanation follows the code sample. Before you read the explanation, try and work out what the code does. Then check if you're right. In this way, you'll derive maximum value from the tutorial and exercise the old grey cells a little.
When you finish, please send me a critique. In fact, send one even if you don't finish. I appreciate all feedback! Please note -- I am not a source of free technical support. Do not email me your general Perl problems. If you want support, ask on Usenet or the ActiveState mailing lists. That said, I welcome problems related to the tutorial itself.
The humour is non-conventional. I think. Of more importance, the text is coloured strangely in places. My intention is to aid your comprehension, not attempt beautification. The meaning of the colours:
perl
changeworld.pl parm1 datafile.txt
while (<DATFILE>) {
printf "%2s : $_",$.;
}
split
All the code examples have been tested, and you can just cut'n'paste (brave statement). I haven't listed the output of each example. You need to run it and see for yourself. Consider this course interactive. Consider it any which way you like.
Fine by me, feel free to print a copy for your own use.
Just email me and let me know.
Again, all I ask is an email.
Every so often someone offers to translate the tutorial. Nobody has actually done so. If you want to, the conditions are:
Remember this document is copyrighted and all associated rights are strictly reserved.
--
Robert Pepper
If you already understand what Perl is designed to do, know its features and limitations then you can skip this very small but highly informative section, over which I laboured long and hard for those that didn't know. If you are really sure, jump to the Setup Section.
Perl is a programming language. Perl stands for Practical Extraction and Report Language. You'll notice people refer to 'perl' and "Perl". "Perl" is the programming language as a whole whereas 'perl' is the name of the core executable. There is no language called "Perl5" -- that just means "Perl version 5". Versions of Perl prior to 5 are very old and very unsupported.
Some of Perl's many strengths are:
A company named ActiveState exists to provide Perl tools for the Win32 environment. ActiveState used to be ActiveWare, and before that it was sort of a part of Hip Communications. It now appears to be happy with its current name, having not changed it for over a year. Win32 means, at the time of writing, Windows 95, Windows 98 and Windows NT. It does not mean Windows 3.11, even with Win32s installed.
Prior to Perl version 5.005, there was one version of Perl for Win32, and another for all the other systems. The other version was known as the "native version".
The Win32 version was developed by ActiveState, called "Perl for Win32" and typically lagged slightly behind the native version. As of the 5.005 release, Perl for Win32 and the native version have merged -- the native version now supports Win32 directly and doesn't need any tweaking by ActiveState.
ActiveState have dropped "Perl for Win32" and renamed their distribution, which comes with an InstallShield installer, "ActivePerl".
Incidentally, a few months before the 5.005 merge the native Perl version was changed so it would run on Win32 directly. This version was best known by the creator's name, "Gurusamy Sarathy". However, there were still quite a few differences between it and Perl for Win32, so many people ran both. The merge brought the best of both worlds together.
Probably. Perl runs on everything from Amigas to Macintoshes to Unix boxen. Perl also runs on Microsoft operating systems, namely Windows 95, Windows 98 and Windows NT 3.51 and later. There are versions of Perl that run on earlier versions of these operating systems but they are no longer developed or supported. See http://www.perl.com/ for full details.
Just two popular examples :
Go surf. Notice how many websites have dynamic pages with .pl
or similar as the filename extension? That's Perl. It is the most popular
language for CGI programming for many reasons, most of which are mentioned
above. In fact, there are a great many more dynamic pages written with perl
that may not have a .pl extension. If you code in Active Server
Pages, then you should try using ActiveState's PerlScript. Quite frankly,
coding in PerlScript rather than VBScript or JScript is like driving a car as
opposed to riding a bicycle. Perl powers a good deal of the Internet.
If you are a Unix sysadmin you'll know about sed, awk and shell scripts. Perl can do everything they can do and far more besides. Furthermore, Perl does it much more efficiently and portably. Don't take my word for it, ask around.
If you are an NT sysadmin, chances are you aren't used to programming. In which case, the advantages of Perl may not be clear. Do you need it? Is it worth it?
After you read this tutorial you will know more than enough to start using Perl productively. You really need very little knowledge to save time. Imagine driving a car for years, then realising it has five gears, not four. That's the sort of improvement learning Perl means to your daily sysadminery. When you are proficient, you find the difference like realising the same car has a reverse gear and you don't have to push it backwards. Perl means you can be lazier. Lazy sysadmins are good sysadmins, as I keep telling my boss.
A few examples of how I use Perl to ease NT sysadmin life:
The question is, "what shouldn't I do with Perl". Write office suites is one answer. Perl, like most scripting languages, is a glue language designed for short and relatively simple tasks. Just don't equate this philosophy with a lack of power or "serious" features.
See the FAQs at www.perl.com. Of course there are Usenet groups, but also many mailing lists. Microsoft Windows users will be interested in those hosted by http://www.activestate.com/ which discuss all things Perl and Windows.
Please, before you ask any question, anywhere:
Think to yourself -- honestly -- if I was a busy Perl Professional, would I want to answer my own question?
Does it clearly state what I want an answer to? Preferably just one question at a time. Am I being unreasonable, for example asking for someone to code it for me? Have I shown evidence that I have tried to help myself? Have I made any mistakes in grammar? Is it polite? Is there enough information in there for the answer to be given?
Why should you care? Well, if you ask poorly-formed questions or those already answered in the FAQ...let's just say you won't get the answers you want. If you care about your online reputation and wasting other people's time -- two more reasons.
There are four stages:
An old version of Perl for Win32 is included with the Windows NT Resource Kit. It is sadly out of date. Follow the steps below to get a newer version. Having said that, you can complete the tutorial with the Resource Kit version but you should upgrade as soon as you can.
Go to http://www.activestate.com/
and follow the links to download ActivePerl. It will be a single file, and the
name will be something like api508e.exe. The i stands
for Intel. If you have an Alpha, download apaXXXe.exe. If you're
not sure, download the Intel version.
The 508e is the version number, so expect this to change quite
rapidly. The file size will be just over 5Mb, so it will take a while to
download via modem. If you know how to use FTP, try
ftp.activestate.com/activeperl/.
When you find ActivePerl, save the file into any directory you please. I
like to organise my downloads into c:\downloads but that is just
personal preference. As long as ActivePerl ends up on your hard disk somewhere
it doesn't matter.
So you now have apixxxx.exe. If you forget where you saved it,
don't panic, just run Windows Explorer and search for api*e.exe
apixxxx.exe. You'll see the fantastic ActivePerl
graphic and be advised to close all open applications before proceeding. The lizard thing
is a gecko, which adorns the famous O'Reilly book "Learning Perl on Win32
Systems". This tutorial is aimed at a more basic level than that book, in terms of
the author's knowledge, intended audience and quality of humour. c:\progs\perl rather than c:\program
files\perl because many Win32 programs don't properly handle long
filenames, let alone those with spaces in. Or you could accept the default.
Your choice. perl myscript.plmyscript.pl. Personally, I prefer double-clicking to allow
me to edit the file so I do not select this option. Also, perl has a plethora
of command line arguments which are difficult to pass to a script if you run it
by association. For the purposes of this tutorial I'm assuming that you haven't
associated .pl with perl.
So you know what this tutorial is designed to do. You know what Perl is designed to do, and you have even installed it. It is now time to start the tutorial proper, and actually hack some code.
Assuming all has gone to plan, you can now create your first Perl script. Follow these instructions, but before you start read them through once, then begin. That's a good idea with any form of computer-related procedure. So, to begin:
Create a directory for your scripts, for example c:\scripts\, which is what I'll assume you are using in this tutorial.
Open a text editor such as Notepad and type in this line:
print "My first Perl script\n";
Save the file as c:\scripts\myfirst.pl. Be careful! Notepad may save files with a .txt extension, so you could end up with myfirst.pl.txt by default. Perl won't mind, it'll still execute the file. If your version of Notepad does this, select "All files" before saving or rename the file then load it again. Better yet, use a decent text editor!
Open a command prompt and change to the scripts directory with cd \scripts . Then type
perl myfirst.pl
and you'll see the output. Welcome to the world of Perl! See what I mean about it being easy to start? However, it is difficult to finish with Perl once you begin :-)
So you typed in perl myfirst.pl and you didn't see
My first Perl script on the screen. If you saw "bad command or
filename" then either you haven't installed Perl or perl.exe is not in your
path. Probably the latter. Reboot, then try again.
If you saw Can't open perl script "xxxx.pl": No such file or
directory then perl is definitely installed, but you have either got the
name of the script wrong or the script is not in the directory
you are trying to run it from. For example, maybe you saved the script in
c:\windows and you are in c:\scripts so of course
Perl complains it can't find the script. Could you? Well, don't expect Perl to
then. You don't have to run the script from the directory in which it resides,
but it is easier.
We need to analyse what's going on here a little. First note that the line
ends with a semicolon ; . Almost all lines of code in
Perl have to end with semicolons, and those that don't have to will accept
semicolons anyway. The moral is -- use semicolons. Sorry; the moral is; use
semicolons.
Oh, one more thing -- if you haven't already done so, continue breathing.
Also note the \n . This is the code to tell Perl
to output a newline. What's a newline? Delete the \n so that the line reads
print "My first Perl script";
then run the script again and all should become clear. You have now written your first Perl script.
Almost every Perl book is written for UN*X, which is a problem for Win32. This leads to scripts like:
#!c:/perl/perl.exe
print "I'm a cool Perl hacker\n";
The function of the 'shebang' line is to tell the shell how to execute the file. Under UNIX, this makes sense. Under Win32, the system must already know how to execute the file before it is loaded so the line is not needed.
However, the line is not completely ignored, as it is searched for any
switches you may have given Perl (for example -w to
turn on warnings).
You may also choose to add the line so your scripts run directly on UNIX without modification, as UNIX boxes probably do need it. Win32 systems do not. We shall continue with the lesson.
So Perl is working, and you are working with Perl. Now for something more
interesting than simple printing. Variables. Let's take simple scalar variables
first. A scalar variable is a single value. Like $var=10 which sets the variable $var
to the value of 10. Later, we'll look at lists like arrays and hashes,
where @var refers to more than one value. For the
moment, remember that Scalar is Singular. If weird metaphors help, think
of lots of scaly snakes at a singles bar. If that didn't help, I apologise for
putting the thought into your mind.
If you have any experience with other programming languages you might be
surprised by the code $var=10. With most languages,
if you want to assign the value 10 to a variable
called var you'd write var=10.
Not so in Perl. This is a Feature. All variables are prefixed with a symbol
such as $ @ % . This has certain advantages, like
making programs easier to read. Honestly, I'm serious! It just takes some
getting used to. The prefixes mean that you can see where the variables
are quite easily. And not only that, what sort of variable it is.
The human language German has a similar principle (except nouns are
capitalised, not prefixed with $ and Perl is easier
to pronounce). You'll agree later, I think.
So, ever onwards. Time to try some more variables:
$string="perl";
$num1=20;
$num2=10.75;
print "The string is $string, number 1 is $num1 and number 2 is $num2\n";
A closer look...notice you don't have to say what type of variable
you are declaring. In other languages you need to say if the variable is a
string, array, what sort of number it is and so on. You might even have to
declare what type of number it is. As an example, in Java you'd be saying
things like int var=10 which defines the variable var as an
integer, with the value 10.
So, why do these other programming languages force you to declare exactly what your variables are? Wouldn't it be easier if we could just not bother?
For short programs, yes. For really big projects with many programmers working on the same application, no. That's because forcing variable type declaration also forces a certain discipline and rigour which is what you need on big projects.
As you know, Perl is not designed for gigantic software engineering efforts. It is all about small, quick programs. For these purposes you don't need the rigour of variable controls as much, so Perl doesn't bother.
This idea of forcing a programmer to declare what sort of variable is being created is called typing. As Perl doesn't by default enforce any rules on typing, it is said to be a loosely typed language, as opposed to something like C++ which is strongly typed.
We still haven't finished learning from that humble bit of code. To refresh your memory, here it is again:
$string="perl";
$num1=20;
$num2=10.75;
print "The string is $string, number 1 is $num1 and number 2 is $num2\n";
Notice the way the variables are used in the string. Sticking variables inside of
strings has a technical term - "variable interpolation". Now, if we didn't have the
handy $ prefix we'd have to do something like the example
below, which is pseudocode. Pseudocode is code to demonstrate a concept, not designed to
be run. Like certain Microsoft software.
print "The string is ".string." and the number is ".num."\n";
which is much more work. Convinced about those prefixes yet ?
Try running the following code:
$string="perl";
$num=20;
print "Doubles: The string is $string and the number is $num\n";
print 'Singles: The string is $string and the number is $num\n';
Double quotes allow the aforementioned variable interpolation. Single quotes do not. Both have their uses as you will see later, depending on whether you wish to interpolate anything.
If you want to add 1 to a variable you can, logically, do this; $num=$num+1 . There is a shorter way to do this, which is
$num++. This is an autoincrement. Guess what this is;
$num-- . Yes, an autodecrement.
This example illustrates the above:
$num=10;
print "\$num is $num\n";
$num++;
print "\$num is $num\n";
$num--;
print "\$num is $num\n";
$num+=3;
print "\$num is $num\n";
The last statement, $num+=3, demonstrates that you are not limited to adding or subtracting just 1.
There's something else new in the code above. The \
. You can see what this does -- it 'escapes' the special meaning
of $ .
Escaping means that just the $ symbol is printed
instead of it referring to a variable.
Actually \ has a deeper meaning -- it escapes
all of Perl's special characters, not just $ .
Also, it turns some non-special characters into something special. Like what?
Like n . Add the magic \
and the humble 'n' becomes the mighty NewLine ! The \
character can also escape itself. So if you want to print a single \ try:
print "the MS-DOS path is c:\\scripts\\";
Oh, '\' is also used for other things like references. But that's not even covered here.
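The escapes described above can be collected into one small script. This is only a sketch: the subroutine name escape_demo and the variable names are made up for illustration, and it uses my (which declares a variable) along with use strict and use warnings, none of which the tutorial has introduced at this point.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration: returns strings built with escapes.
sub escape_demo {
    my $price = 5;                          # 'my' declares the variable (not yet covered in the tutorial)
    my $literal      = "Cost: \$price";     # \$ keeps the dollar sign literal
    my $interpolated = "Cost: $price";      # no backslash, so $price becomes 5
    my $path         = "c:\\scripts\\";     # each \\ yields one backslash
    my $quoted       = "She said \"hi\"";   # \" allows a double quote inside double quotes
    my $single       = 'Cost: $price\n';    # single quotes: no interpolation, \n stays as two characters
    return ($literal, $interpolated, $path, $quoted, $single);
}

# Print each result on its own line.
print join("\n", escape_demo()), "\n";
```

Run it and compare the second line of output with the first and the last. Only the plain double-quoted string has the variable's value in it.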
There is a technical term for these 'special characters' such as @ $ %. They are called metacharacters. Perl uses
plenty of metacharacters. In fact, you'll wear your keyboard pretty evenly
during a night's perl hacking. I think it is safe to say that Perl uses every
possible keystroke and shifted keystroke on a standard US PC keyboard.
You'll be working with all sorts of obscure characters in your Perl hacking career, and I also mean those on your keyboard. This has earned perl a reputation for being difficult to understand. That's entirely true. Perl does have such a reputation, no doubt about it.
Is the reputation justified? In my opinion, Perl does have a short but steep learning curve to begin with simply because it is so different. However, once you learn the character meanings reading perl code becomes much easier precisely because of all these strange characters.
Perl uses so many weird characters that there aren't enough to go round. So
sometimes the same character has two or more meanings, depending on its
context. As an example, the humble dot . can join
two variables together, act as a wildcard or become a range operator if there
are two of them together. The caret ^ has different
effects in [^abc] as opposed to [a^bc] .
If this sounds crazy, think about the English language. What do the following mean to you ?
Mean is, in one context, a word used to describe the purpose of something. It is also another word for average. Furthermore, it describes a nasty person, or a person who doesn't like spending money, and is used in slang to refer to something impressive and good.
That's five different uses for 'mean', and you don't have any trouble
understanding which one I mean.
Polish, when capitalised, can either mean pertaining to the country Poland, or the act of making something shiny. And 'like' can mean similar to, or affection for.
So, when you speak or write English (think of two, to and too) you know what these words mean by their context. It is exactly the same way with Perl. Just don't assume a given metacharacter always means what you first thought it did.
To finish off this section, try the following:
$string="perl";
$num=20;
$mx=3;
print "The string is $string and the number is $num\n";
$num*=$mx;
$string++;
print "The string is $string and the number is $num\n";
Note the easy shortcut *= meaning 'multiply $num
by $mx' or, $num=$num*$mx . Of course Perl supports
the usual + - * / ** % operators. The last two are
exponentiation (to the power of) and modulus (remainder of x divided by y).
Also note the way you can increment a string ! Is this language flexible or
what?
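For the curious, here is a rough sketch of the six arithmetic operators and of string incrementing. The subroutine names arithmetic_demo and next_string are hypothetical, and the code passes values into subroutines through @_ and declares variables with my, both of which go beyond what has been covered so far.

```perl
use strict;
use warnings;

# Hypothetical helper: applies + - * / ** % to two numbers.
sub arithmetic_demo {
    my ($x, $y) = @_;    # copy the two arguments into named variables
    return ($x + $y, $x - $y, $x * $y, $x / $y, $x ** $y, $x % $y);
}

# Hypothetical helper: returns the string that follows $text.
sub next_string {
    my ($text) = @_;
    $text++;             # the increment carries over, like an odometer
    return $text;
}

print join(" ", arithmetic_demo(7, 2)), "\n";    # 9 5 14 3.5 49 1
print next_string("perl"), " ", next_string("az"), " ", next_string("Zz"), "\n";    # perm ba AAa
```

Note that the magic only works one way. Decrementing a string with -- treats it as a number instead of stepping back a letter.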
The print function is a list operator.
That means it accepts a list of things to print, separated by commas. As an
example:
print "a doublequoted string ", $var, 'that was a variable called var', $num," and a newline \n";
Of course, you could just put all the above inside a single doublequoted string:
print "a doublequoted string $var that was a variable called var $num and a newline \n";
to achieve the same effect. The advantage of using the print
function in list context is that expressions are evaluated before being printed. For example, try
this:
$var="Perl";
$num=10;
print "Two \$nums are $num * 2 and adding one to \$var makes $var++\n";
print "Two \$nums are ", $num * 2," and adding one to \$var makes ", $var++,"\n";
You might have been slightly surprised by the result of that last
experiment. In particular, what happened to our variable $var? It should have been incremented by one, resulting in
Perm. The reason being that 'm' is the next letter after 'l' :-)
Actually, it was incremented by 1. We are
postincrementing the variable with $var++ , rather
than preincrementing it.
The difference is that with
postincrements, the value of the variable is returned, then the operation is
performed on it. So in the example above, the current value of $var was returned to the print
function, then 1 was added. You can prove this to yourself by adding the line
print "\$var is now $var\n"; to the end of the example
above.
If we want the operation to be performed on $var before the value is returned to the print function,
then preincrement is the way to go. ++$var will do
the trick.
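A minimal sketch of the difference, using a number so the arithmetic is easy to follow. The subroutine name post_then_pre is invented for this example, and my (variable declaration) has not been covered yet.

```perl
use strict;
use warnings;

# Hypothetical helper: shows what each form of increment hands back.
sub post_then_pre {
    my $n    = 10;
    my $post = $n++;    # $post receives 10, then $n becomes 11
    my $pre  = ++$n;    # $n becomes 12 first, then $pre receives 12
    return ($post, $pre, $n);
}

my ($post, $pre, $final) = post_then_pre();
print "postincrement returned $post\n";    # 10
print "preincrement returned $pre\n";      # 12
print "the variable ended up as $final\n"; # 12
```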
Let's take another look at the example we used to show how the autoincrement system works. Messy, isn't it? This is Batch File Writing Mentality. Notice how we use exactly the same code four times. Why not just put it in a subroutine?
$num=10; # sets $num to 10
&print_results; # prints variable $num
$num++;
&print_results;
$num*=3;
&print_results;
$num/=3;
&print_results;
sub print_results {
print "\$num is $num\n";
}
Easier and neater. The subroutine can go anywhere in your script, at the beginning, end, middle...makes no difference. Personally I put all mine at the bottom and reserve the top part for setting variables and main program flow.
A subroutine is just some code you want to use more than once in the same script. In Perl, a subroutine is a user-defined function. There is no difference. For the purposes of clarity I'll refer to them as subroutines.
A subroutine is defined by starting with sub then
the name. After that you need a curly left bracket { ,
then all the code for your subroutine. Finish it off with a closing brace
} . The area between the two braces is called a
block. Remember this. There are such things as anonymous
subroutines but not here. Everything here has a name.
Subroutines are usually called by prefixing their name with an ampersand,
that is one of these -- & , like so
&print_results; . It used to be cool to omit the
& prefix but all perl hackers are now encouraged
to use it to avoid ambiguity. Ambiguity can hurt you if you don't avoid it.
If you are worrying about variable visibility, don't. All the variables we are using so far are visible everywhere. You can restrict visibility quite easily, but that's not important right now. If you weren't worrying about variable visibility, please don't start. I'd tell you it's not important but that'll only make you worried. (paranoid ?) We'll cover it later.
Did you see that a # crept in there? That's a comment.
Everything after a # is ignored. You can't continue
it onto a newline however, so if your comment won't fit on one line start a new
one with # . There are ways to create Plain Old
Documentation (POD) and more ways to comment but they are not detailed here.
An if statement is simple. if the day is
Sunday, then lie in bed. A simple test, with two outcomes. Perl
conversion (don't run this):
if ($day eq "sunday") {
&lie_in_bed;
}
You already know that &lie_in_bed is a
call to a subroutine. We assume $day is set earlier
in the program. If $day is not equal to 'sunday'
&lie_in_bed is not executed (pity). You don't
need to say anything else. Try this:
$day="sunday";
if ($day eq "sunday") {
print "Zzzzz....\n";
}
Note the syntax. The if statement requires
something to test for Truth. This expression must be in (parens), then you have
the braces to form a block.
There are many Perl functions which test for Truth. Some are if,
while, unless . So it is important you know what truth is, as
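defined by Perl, not your tax forms. There are three main rules:
1. Any string is true except for "" and "0".
2. Any number is true except for 0. This includes negative numbers.
3. Any undefined variable is false.
Some example code to illustrate the point: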
&isit; # $test1 is at this moment undefined
$test1="hello"; # a string, not equal to "" or "0"
&isit;
$test1=0.0; # $test1 is now a number, effectively 0
&isit;
$test1="0.0"; # $test1 is a string, but NOT effectively 0 !
&isit;
sub isit {
if ($test1) { # tests $test1 for truth or not
print "$test1 is true\n";
} else { # else statement if it is not true
print "$test1 is false\n";
}
}
The first test fails because $test1 is undefined.
This means it has not been created by assigning a value to it. So according to
Rule 3 it is false. The last two tests are interesting. Of course, 0.0 is the
same as 0 in a numeric context. But it is not the
same as 0 in a string context, so in that case it is true.
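If you want to poke at the edges of these rules, something like the following sketch may help. The subroutine name truth_of is hypothetical, and it takes its value as an argument via @_ and my, which the tutorial has not reached yet.

```perl
use strict;
use warnings;

# Hypothetical helper: reports how Perl judges a single value.
sub truth_of {
    my ($value) = @_;
    if ($value) {
        return "true";
    } else {
        return "false";
    }
}

print 'undef  is ', truth_of(undef), "\n";    # false: undefined
print '""     is ', truth_of(""),    "\n";    # false: the empty string
print '"0"    is ', truth_of("0"),   "\n";    # false: the string "0"
print '0.0    is ', truth_of(0.0),   "\n";    # false: the number 0
print '"0.0"  is ', truth_of("0.0"), "\n";    # true: a string that is neither "" nor "0"
print '"00"   is ', truth_of("00"),  "\n";    # true, for the same reason
print '" "    is ', truth_of(" "),   "\n";    # true: a space is not the empty string
print '-1     is ', truth_of(-1),    "\n";    # true: any non-zero number
```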
So here we are testing single variables. What's more useful is testing the
result of an expression. For example, this is an expression; $x * 2
and so is this; $var1 + $var2 . It is the
end result of these expressions that is evaluated for truth.
An example demonstrates the point:
$x=5;
$y=5;
if ($x - $y) {
print '$x - $y is ',$x-$y," which is true\n";
} else {
print '$x - $y is ',$x-$y," which is false\n";
}
The test fails because 5-5 of course is 0, which is false. The
print statement might look a little strange. Remember
that print is a list operator? So we hand it a list.
First item, a single-quoted string. It is single quoted because we do not
want to perform variable interpolation on it. Next item is an expression
which is evaluated, and the result printed. Finally, a double-quoted string is
used because we want to print a newline, and without the doublequotes the
\n won't be interpolated.
What is probably more useful than testing a specific variable for truth is equality testing. For example, has your lucky number been drawn?
$lucky=15;
$drawnum=15;
if ($lucky == $drawnum) {
print "Congratulations!\n";
} else {
print "Guess who hasn't won!\n";
}
The important point about the above code is the equality operator,
== .
Now pay close attention, otherwise you'll end up posting an annoying question somewhere. This is a FAQ, as in a Frequently Asked Question.
The symbol = is an assignment operator,
not a comparison operator. Therefore:
if ($x = 10) is always true, because the assignment
hands back the value just assigned, 10, and 10 is true.
if ($x == 10) compares the two values,
which might not be equal.
So far we have been testing numbers, but there is more to life than numbers. There are strings too, and these need testing too.
$name = 'Mark';
$goodguy = 'Tony';
if ($name == $goodguy) {
print "Hello, Sir.\n";
} else {
print "Begone, evil peon!\n";
}
Something seems to have gone wrong here. Obviously Mark is different to Tony, so why does perl consider them equal?
Mark and Tony are equal -- numerically. Neither string looks like a number, so in a numeric context both are treated as 0, and 0 == 0. We should be testing them as
strings, not as numbers. To do this, simply replace
== with eq and everything will work as
expected.
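Here is a small sketch that puts the two operators side by side. The subroutine names same_number and same_string are made up, and the use of @_ and my is ahead of where the tutorial has got to. Warnings are deliberately left off: with the -w switch, perl would also complain that "Mark" isn't numeric, which is a useful hint that == is the wrong operator.

```perl
use strict;
# No 'use warnings' here on purpose; see the note above.

# Hypothetical helper: compares as numbers, returns 1 or 0.
sub same_number {
    my ($left, $right) = @_;
    if ($left == $right) { return 1; } else { return 0; }
}

# Hypothetical helper: compares as strings, returns 1 or 0.
sub same_string {
    my ($left, $right) = @_;
    if ($left eq $right) { return 1; } else { return 0; }
}

print "Mark == Tony : ", same_number("Mark", "Tony"), "\n";    # 1, both count as 0
print "Mark eq Tony : ", same_string("Mark", "Tony"), "\n";    # 0
print "10 == 10.0   : ", same_number("10", "10.0"),   "\n";    # 1, same number
print "10 eq 10.0   : ", same_string("10", "10.0"),   "\n";    # 0, different strings
```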
There are two types of comparison operator; numeric and
string. You've already seen two, == and
eq. Run this:
$foo=291;
$bar=30;
if ($foo < $bar) {
print "$foo is less than $bar (numeric)\n";
}
if ($foo lt $bar) {
print "$foo is less than $bar (string)\n";
}
The lt operator compares in a string context, and
of course < compares in a numeric context.
Alphabetically, that is in a string context, 291 comes before 30. It is actually decided by the ASCII value, but alphabetically is close enough. Change the numbers around a little. Notice how Perl doesn't care whether it uses a string comparison operator on a numeric value, or vice versa. This is typical of Perl's flexibility.
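The "ASCII value, not alphabet" point shows up as soon as capital letters are involved. The sketch below uses two invented subroutines, less_numeric and less_string, and again relies on @_ and my, which have not been introduced yet.

```perl
use strict;
use warnings;

# Hypothetical helper: numeric less-than, returns 1 or 0.
sub less_numeric {
    my ($left, $right) = @_;
    if ($left < $right) { return 1; } else { return 0; }
}

# Hypothetical helper: string less-than, returns 1 or 0.
sub less_string {
    my ($left, $right) = @_;
    if ($left lt $right) { return 1; } else { return 0; }
}

print "291 <  30        : ", less_numeric(291, 30), "\n";            # 0
print "291 lt 30        : ", less_string(291, 30),  "\n";            # 1, because "2" sorts before "3"
print "apple lt Banana  : ", less_string("apple", "Banana"), "\n";   # 0, lowercase letters come after capitals
print "Banana lt apple  : ", less_string("Banana", "apple"), "\n";   # 1
```

A dictionary would put apple first. ASCII puts every capital letter before every lowercase one, so Banana wins.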
Bondage and discipline are pretty much alien concepts to Perl (and the author). This flexibility does have a drawback. If you're on a programming precipice, threatening suicide by jumping off, Perl won't talk you out of your decision but will provide several ways of jumping, stepping or falling to your doom while silently watching your early conclusion. So be careful.
The Perl Motto is; "There is More Than One Way to Do It" or TIMTOWTDI. Pronounced 'Tim-Toady'. This tutorial doesn't try and mention all possible ways of doing everything, mainly because the author is far too lazy. Write your Perl programs the way you want to.
The rest of the operators are:
| Comparison | Numeric | String |
| Equal | == | eq |
| Not equal | != | ne |
| Greater than | > | gt |
| Less than | < | lt |
| Greater than or equal to | >= | ge |
| Less than or equal to | <= | le |
They may be odious, but remember the following:
More about if statements. Run this:
$age=25;
$max=30;
if ($age > $max) {
print "Too old !\n";
} else {
print "Young person !\n";
}
It is easy to see what else does. If the
expression is false then whatever is in the else
block is evaluated (or carried out, executed, whatever term you choose
to use). Simple. But what if you want another test ? Perl can do that too.
$age=25;
$max=30;
$min=18;
if ($age > $max) {
print "Too old !\n";
} elsif ($age < $min) {
print "Too young !\n";
} else {
print "Just right !\n";
}
If the first test fails, the second is evaluated. This carries on until
there are no more elsif statements, or an
else statement is reached. An else
statement is optional, and no elsif statements
should come after it. Logical, really.
There is a big difference between the above example and the one below:
if ($age > $max) {
print "Too old !\n";
}
if ($age < $min) {
print "Too young !\n";
}
If you run it, it will return the same result - in this case. However, it is
Bad Programming Practice. In this case we are testing a number, but suppose we
were testing a string to see if it contained R or S. It is possible that a
string could contain both R and S. So it would pass both 'if' tests.
Using an elsif avoids this. As soon as the first
statement is true, no more elsif statements (and no
else statement) are executed.
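The R and S case can be sketched as follows. Everything here beyond if, elsif and else is an assumption made for the sake of the example: the subroutine names, the use of the built-in index function to look for a letter (it returns -1 when the letter is absent), the .= operator to build up a string, and my with @_.

```perl
use strict;
use warnings;

# Hypothetical helper: two independent 'if' tests, so both may fire.
sub separate_ifs {
    my ($text) = @_;
    my $hits = "";
    if (index($text, "R") >= 0) {
        $hits .= "R";    # append R to the result
    }
    if (index($text, "S") >= 0) {
        $hits .= "S";    # this test runs regardless of the first one
    }
    return $hits;
}

# Hypothetical helper: if/elsif/else, so exactly one branch fires.
sub chained {
    my ($text) = @_;
    if (index($text, "R") >= 0) {
        return "R";
    } elsif (index($text, "S") >= 0) {
        return "S";
    } else {
        return "neither";
    }
}

print "separate ifs on ROSE: ", separate_ifs("ROSE"), "\n";    # RS
print "elsif chain on ROSE : ", chained("ROSE"), "\n";         # R
```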
You don't need to take up a whole three lines:
print "Too old\n" if     $age > $max;
print "Too old\n" unless $age < $max;
I added some whitespace there for aesthetic beauty. There are other
operators that you can use instead of if and
unless , but that's for later on.
Incidentally, the two lines of code above do not do exactly the same thing. Consider a maximum age of 50 and input age of 50. Therefore, you should be very careful about your logic when writing code (nice obvious statement there).
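To see the boundary case for yourself, try a sketch along these lines. The subroutine names with_if and with_unless are invented, and the values are handed in through @_ and my, which are not covered until later.

```perl
use strict;
use warnings;

# Hypothetical helper: the 'if' form of the test.
sub with_if {
    my ($age, $max) = @_;
    my $verdict = "";
    $verdict = "Too old" if $age > $max;
    return $verdict;
}

# Hypothetical helper: the 'unless' form of the test.
sub with_unless {
    my ($age, $max) = @_;
    my $verdict = "";
    $verdict = "Too old" unless $age < $max;
    return $verdict;
}

print "if     at 50/50: '", with_if(50, 50),     "'\n";    # '' because 50 > 50 is false
print "unless at 50/50: '", with_unless(50, 50), "'\n";    # 'Too old' because 50 < 50 is also false
```

The two forms agree for every age except the one that equals the maximum, because "unless less than" really means "greater than or equal to".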
For those that were wondering, Perl has no case statement. This is all explained in the FAQ, which is located at http://www.perl.com/.
Sometimes you have to interact with the user. It is a pain, but sometimes necessary, especially for the live ones. To ask for input and do something with it try this:
print "Please tell me your name: ";
$name=<STDIN>;
print "Thanks for making me happy, $name !\n";
New things to learn here. Firstly, <STDIN> .
STDIN is a filehandle. Filehandles are what you use to interact with things
such as files, console input, socket connections and more.
You could say STDIN is the standard source for input. Guess what STDIN stands for. In this case the STDIN filehandle is reading from the console.
The angle brackets <> read data from a
filehandle. Exactly how much is dependent on what you do, but in this case it
is whatever was input at the prompt.
So we are reading from the STDIN filehandle. The value is assigned to $name and printed. Any idea why the ! ends up on a
new line ?
As you pressed
Enter, you of course included a newline with your name. The easy way to get rid
of it is to chop it off:
print "Please tell me your name: ";
$name=<STDIN>;
chop $name
print "Thanks for making me happy, $name !\n"
and that fails with a syntax error. Can you spot why? Look at the error
code, look at the line number and see where the syntax is wrong. The answer is
a missing semicolon ( ; ) on the end of the last two lines.
If you add a ; to the end of line 3, but not to
the last line, then the program works as it should. This is because Perl
doesn't need a semicolon to end the last statement of a block. However, I'd
advise ending all your statements with semicolons because you may well be
adding more code to them and it is only one little keystroke.
When you add the semicolon(s), the program runs correctly. The
chop function removes the last character of whatever
it is given to chop, in this case removing the newline for us. In fact, that
can be shortened:
print "Please tell me your name: ";
chop ($name=<STDIN>);
print "Thanks for making me happy, $name !";
The parentheses ( ) force
chop to act on the result of what is inside them. So
$name=<STDIN> is evaluated first, then the
result from that, which is $name , is chopped. Try
it without.
You can read from STDIN as much as you like. For your entertainment I have created a sophisticated multinational greeting machine:
print "Please tell me your name: ";
chop ($name=<STDIN>);
print "Please tell me your nationality: ";
chop ($nation=<STDIN>);
if ($nation eq "British" or $nation eq "New Zealand") {
    print "Hallo $name, pleased to meet you!\n";
} elsif ($nation eq "Dutch" or $nation eq "Flemish") {
    print "Hoi $name, hoe gaat het met u vandaag?!\n";
} else {
    print "HELLO!!! SPEAKEEE ENGLIEESH???\n";
}
Aside from demonstrating the native English speaker's linguistic talents,
this script also introduces the or logical operator.
We'll cover or and its associates in more detail
later on. First, a word of warning.
Chopping is dangerous, as my friend One Hand Harold will tell you. Everyone is concerned about various forms of safety these days, and your perl code should be no exception.
Rather than just wantonly remove the last character regardless of whatever
it is, without a care in the world, just simply consigning the poor little
thing to the Great Bit Bucket in the Sky, you can remove the last character
only if it is a newline with chomp :
chomp ($name=<STDIN>);
At this point the perl gurus are screaming "I found an error !". Well,
chomp doesn't always remove the last character if it
is a newline but if it doesn't, you have set a special variable, namely
$/ , to something different. I presume that if you do
set $/ you know what it does. It is explained later
in this very document. Of course, being a good pupil, you wouldn't experiment
with the unknown, blindly changing things just for the hell of it to see what
happens.
If you don't, you'll never learn anything useful.
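To see the difference without typing at a prompt, here is a small sketch that uses fixed strings in place of <STDIN>. It assumes $/ has been left at its default (a newline), and the variable names are invented for the illustration.

```perl
$with_newline    = "Harold\n";
$without_newline = "Harold";

chop  ($chopped1 = $with_newline);      # removes the "\n"
chop  ($chopped2 = $without_newline);   # removes the "d", which we wanted to keep
chomp ($chomped1 = $with_newline);      # removes the "\n"
chomp ($chomped2 = $without_newline);   # no newline there, so nothing is removed

print "chop : '$chopped1' and '$chopped2'\n";
print "chomp: '$chomped1' and '$chomped2'\n";
```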
Perl has two types of array, associative arrays (hashes) and arrays. Both types are lists. A list is just a collection of variables referred to as the collection, not as individual elements.
You can think of Perl's lists as a herd of animals. List context refers to the entire herd, scalar context refers to a single element. A list is a herd of variables. The variables don't have to be all of the same type -- you might have a herd of ten sheep, three lions and two wolves. It would probably be just three lions and one wolf before long, but bear with me. In the same way, you might have a Perl list of three scalar variables, two array elements and ten hash elements.
Certain types of lists are known by certain names. Just as a herd of sheep is called a flock, a herd of lions is called a pride, a herd of wolves is called a pack and a herd of managers a confusion, some types of Perl list have special names.
For example, an array is an ordered list of scalar variables. This
list can be referred to as a whole, or you can refer to individual elements in
the list. The program below defines an array, called
@names . It puts five values into the array.
@names=("Muriel","Gavin","Susanne","Sarah","Anna");
print "The elements of \@names are @names\n";
print "The first element is $names[0] \n";
print "The third element is $names[2] \n";
print 'There are ',scalar(@names)," elements in the array\n";
Firstly, notice how we define @names . As it is
in a list context, we are using parens. Each value is comma separated,
which is Perl's default list delimiter. The double quotes are not
necessary, but as these are string values it makes it easier to read and change
later on.
Next, notice how we print it. Simply refer to it as a whole, that is in
list context. List context means referring to more than one element of
a list at a time. The code print @names; will work
perfectly well too. But....
I usually learn something about Perl every time I work with it. When running a course, a student taught me this trick which he had discovered:
@names=("Muriel","Gavin","Susanne","Sarah","Anna","Paul","Trish","Simon");
print @names;
print "\n";
print "@names";
When a list is placed inside double quotes, it is space delimited when interpolated. Useful.
If we want to do anything with the array as a list, that is doing
something with more than one value, then refer to the array as
@array . That's important. The
@ prefix is used when you want to refer to more than
one element of a list.
When you refer to more than one, but not all, elements of an array, that is known as a slice. Cake analogies are appropriate. Pie analogies are probably healthier but equally accurate.
Arrays are not much use unless we can get to individual elements. Firstly,
we are dealing with a single element of the list, so we cannot use
@ which refers to multiple elements of the array.
It is a single, scalar variable, so $ is used.
Secondly, we must specify which element we want. That's easy -
$array[0] for the first,
$array[1] for the second and so forth. Array indexes
start at 0, unless you do something which is so highly deprecated ('deprecated'
means allowed, usually for backwards compatibility, but disapproved of because
there are better ways) I'm not even going to mention it.
Finally, we force what is normally list context (more than one element) into
scalar context (single element) to give us the number of elements in the array.
Without the scalar , it would be the same as the
second line of the program.
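A short sketch of the same array seen three ways. The second route, assigning the array directly to a scalar, is not used in the program above; I mention it only because it also puts the array into scalar context.

```perl
@names = ("Muriel","Gavin","Susanne","Sarah","Anna");

$count  = scalar(@names);     # forced into scalar context: the number of elements
$also   = @names;             # assigning an array to a scalar does the same thing
$joined = "@names";           # inside double quotes: elements separated by spaces

print "There are $count elements ($also by the other route)\n";
print "Interpolated: $joined\n";
```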
Please understand this:
$myvar="scalar variable";
@myvar=("one","element","of","an","array","called","myvar");
print $myvar; # refers to the contents of a scalar variable called myvar
print $myvar[1]; # refers to the second element of the array myvar
print @myvar; # refers to all the elements of array myvar
The two variables $myvar and
@myvar are not, in any way, related. Not even
distantly. Technically, they are in different namespaces.
Going back to the animal analogy, it is like having a dog named 'Myvar' and a goldfish called 'Myvar'. You'll never get the two mixed up because when you call 'Myvar !!!!' or open a can of dog food the 'Myvar' dog will come running and goldfish won't. Now, you couldn't have two dogs called 'Myvar' and in the same way you can't have two Perl variables in the same namespace called 'Myvar'.
The element number can be a variable.
print "Enter a number :";
chomp ($x=<STDIN>);
@names=("Muriel","Gavin","Susanne","Sarah","Anna");
print "You requested element $x who is $names[$x]\n";
print "The index number of the last element is $#names \n";
This is useful. Notice the last line of the example. It returns the index
number of the last element. Of course you could always just do this
$last=scalar(@names)-1; but this is more efficient.
It is an easy way to get the last element, as follows:
print "Enter the number of the element you wish to view :";
chomp ($x=<STDIN>);
@names=("Muriel","Gavin","Susanne","Sarah","Anna","Paul","Trish","Simon");
print "The first two elements are @names[0,1]\n";
print "The first three elements are @names[0..2]\n";
print "You requested element $x who is $names[$x-1]\n"; # starts at 0
print "The elements before and after are : @names[$x-2,$x]\n";
print "The first, second, third and fifth elements are @names[0..2,4]\n";
print "a) The last element is $names[$#names]\n"; # one way
print "b) The last element is $names[-1]\n"; # different way
It looks complex, but it is not. Really. Notice you can have multiple values
separated by a comma. As many as you like, in whatever order. The range
operator .. gives you everything between and
including the values. And finally look at how we print the last element -
remember $#names gives us a number ? Simply
enclose it inside square brackets
and you have the last element.
Do also note that because element accesses such as
[0,1] are more than one variable, we cannot use the
scalar prefix, namely the $ symbol.
We are accessing the array in list context, so we use the
@ symbol. Doesn't matter that it is not the
entire array. Remember, accessing more than one element of an array but not the
entire array is called a slice. I won't go over the food analogies again.
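The $ versus @ rule for elements and slices can be sketched like this. The variable names on the left are invented for the illustration.

```perl
@names = ("Muriel","Gavin","Susanne","Sarah","Anna","Paul","Trish","Simon");

@first_three = @names[0..2];         # a slice (several elements), so @
@picked      = @names[0..2,4];       # ranges and single indexes can be mixed
$last_a      = $names[$#names];      # a single element, so $
$last_b      = $names[-1];           # negative indexes count back from the end

print "First three: @first_three\n";
print "Picked: @picked\n";
print "Last: $last_a / $last_b\n";
```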
All well and good, but what if we want to load each element of the array in turn ? Well, we could build a for loop like this:
@names=("Muriel","Gavin","Susanne","Sarah","Anna","Paul","Trish","Simon");
for ($x=0; $x <= $#names; $x++) {
print "$names[$x]\n";
}
which sets $x to 0, runs the loop once, then adds
one to $x , checks it is less than or equal to
$#names , and if so carries on. By the way, that was your
introduction to for loops. Just to go into a little
detail there, the for loop has three parts to it: an initialisation, a test condition and a modification expression.
In this case, the variable $x is initialised to 0.
It is immediately tested to see if it is smaller than, or equal to
$#names . If that is true, then the block is executed
once. Critically, if it is not true the block is not executed at
all.
Once the block has been executed, the modification expression is evaluated.
That's $x++ . Then, the test condition is checked to
see if the block should be executed or not.
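To show that the test really does happen before the block runs, here is a sketch with an empty array (where $#empty is -1) next to a three-element one. The counters and the @empty array are my own additions for the illustration.

```perl
@empty = ();                             # no elements, so $#empty is -1
$runs_empty = 0;
for ($x=0; $x <= $#empty; $x++) {        # 0 <= -1 is false straight away
    $runs_empty++;
}
$x_empty = $x;                           # still 0, the modification never ran

@names = ("Muriel","Gavin","Susanne");   # $#names is 2
$runs_names = 0;
for ($x=0; $x <= $#names; $x++) {
    $runs_names++;
}
$x_names = $x;                           # 3, the first value to fail the test

print "Empty array: block ran $runs_empty times, \$x ended at $x_empty\n";
print "Three names: block ran $runs_names times, \$x ended at $x_names\n";
```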
There is another version:
for $x (0 .. $#names) {
print "$names[$x]\n";
}
which takes advantage of the range operator ..
(two dots together). This simply gives $x the value of 0, then
increments $x by 1 until it is equal to
$#names.
For true beauty we must use foreach .
foreach $person (@names) {
print "$person";
}
This goes through each element ('iterates', another good technical word to
use) of @names , and assigns each element in turn to
the variable $person . Then you can do what you like
with the variable. Much easier. You can use
for $person (@names) {
print "$person";
}
if you want. Makes no difference at all, aside from a little clarity.
In fact, that gets shorter. And now I need to introduce you to
$_ , which is the Default Input and Pattern
Searching Variable.
foreach (@names) {
print "$_";
}
If you don't specify a variable to put each element into,
$_ is used instead as it is the default for this
operation, and many, many others in Perl. Including the
print function :
foreach (@names) {
print ;
}
As we haven't supplied any arguments to print ,
$_ is printed as default. You'll be seeing
a lot of $_ in Perl. Actually, that statement is not
exactly true. You will be seeing a lot of places where
$_ is used, but quite often when it is used, it is
not actually written. In the above example, you don't actually see
$_ but you know it is there.
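A quick sketch to convince yourself that the named and the unnamed loop variable do the same job. It builds two strings with the .= (append) operator rather than printing inside the loop; that operator and the variable names are my own choices here.

```perl
@names = ("Muriel","Gavin","Susanne");

$explicit = "";
foreach $person (@names) {
    $explicit .= "$person,";     # we named the loop variable ourselves
}

$implicit = "";
foreach (@names) {
    $implicit .= "$_,";          # no variable given, so each element lands in $_
}

print "$explicit\n";
print "$implicit\n";
```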
A loop, by its nature, continues. If that didn't make sense, start reading this sentence again.
The old jokes are the best, aren't they?
The joke above is a loop. You continue re-reading the sentence until you realise I'm trying to be funny. Then you exit the loop. Or maybe somebody doesn't exit it. Whatever, loops always run until the expression they are testing returns false. In the case of the examples above, a false value is returned when all the elements of the array have been cycled through, and the loop ends.
If you want an everlasting loop, just test a condition you know will always be true:
while (1) {
$x++;
print "$x: Did you know you can press CTRL-C to interrupt a perl program?\n";
}
Another way to exit a loop is a simple foreach
over the elements, as we have seen. But what if we don't know when we want to exit a
loop? For example, suppose we want to print out a list of names but stop when
we find one with a particular title? You are throwing a huge party, someone is
allergic to vodka, and this person has drunk from the punch bowl despite being
assured by someone holding two empty bottles of Absolut that he was just using
the bottles to convey yet more orange juice into said punch bowl. So you need a
doctor, and so you write a Perl script to find one from the list of attendees,
wanting the doctor's name to be the last item printed:
@names=('Mrs Smith','Mr Jones','Ms Samuel','Dr Jansen','Sir Philip');
foreach $person (@names) {
print "$person\n";
last if $person=~/Dr /;
}
The last operator is our friend. Don't worry about
the /Dr / business -- that is a regular expression which we cover
next. All you need to know is that it returns true if the name contains
'Dr '. When it does return true, last is
executed and the loop ends early.
So that's easy enough. But wait! We need a medical, human-fixer type doctor, not just anyone with a PhD. So, the same principle applies in this example here:
@names =('Mrs Smith','Mr Jones','Ms Samuel','Dr Jansen','Sir Philip');
@medics =('Dr Black','Dr Waymour','Dr Jansen','Dr Pettle');
foreach $person (@names) {
    print "$person\n";
    if ($person=~/Dr /) {
        foreach $doc (@medics) {
            print "\t$doc\n";
            last if $doc eq $person;
        }
    }
}
Aside from showing one way to indent your code, this also demonstrates a
nested loop. A nested loop is a loop within a loop. What happens is that the
@names array is searched for a 'Dr ', and if it is found then the
@medics array is searched to make sure the doctor is a
human-fixing doctor not a professor of physics or something. The regular
expression has been shifted into an if statement,
where it works nicely as it only returns true or false.
The problem with the code is that after we find our medical doctor we want it to stop. But it doesn't. It only stops the loop it is in, so Dr Pettle never gets printed. However, the code just carries on with Sir Philip who is terribly sorry old chap, but can't be of any bally use at all, what ho! What we need is a way to break out of the entire loop from within a nest. Like so:
@names =('Mrs Smith','Mr Jones','Ms Samuel','Dr Jansen','Sir Philip');
@medics =('Dr Black','Dr Waymour','Dr Jansen','Dr Pettle');
LBL: foreach $person (@names) {
    print "$person\n";
    if ($person=~/Dr /) {
        foreach $doc (@medics) {
            print "\t$doc\n";
            last LBL if $doc eq $person;
        }
    }
}
Only two changes here. We have defined a label, namely LBL.
Instead of breaking out from the current loop, which is the default, we specify
a label to break out to, which is in the outer loop. This works with as many
nested loops as your brain can handle. You don't have to use uppercase names
but for namespace reasons it is recommended, and you can call your labels
whatever you please. I was just being unimaginative with the name of LBL, feel
free to invent labels called DORIS or MATILDA if that's what floats your
personal boat.
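To make the difference between a bare last and a labelled last measurable, this sketch counts how many guests each version looks at. The label GUESTS and the two counters are invented for the illustration, and the regex test has been left out to keep it short.

```perl
@names  = ('Mrs Smith','Mr Jones','Ms Samuel','Dr Jansen','Sir Philip');
@medics = ('Dr Black','Dr Waymour','Dr Jansen','Dr Pettle');

$plain = 0;                              # guests visited using a bare last
foreach $person (@names) {
    $plain++;
    foreach $doc (@medics) {
        last if $doc eq $person;         # leaves the inner loop only
    }
}

$labelled = 0;                           # guests visited using last with a label
GUESTS: foreach $person (@names) {
    $labelled++;
    foreach $doc (@medics) {
        last GUESTS if $doc eq $person;  # leaves both loops at once
    }
}

print "Bare last visited $plain guests, labelled last visited $labelled\n";
```

The bare version plods on to Sir Philip; the labelled version stops at Dr Jansen.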
So we have @names . We want to change it. Run
this:
print "Enter a name :";
chomp ($x=<STDIN>);
@names=("Muriel","Gavin","Susanne","Sarah");
print "@names\n";
push (@names, $x);
print "@names\n";
Fairly self explanatory. The push function just
adds a value on to the end of the array. Of course, Perl being Perl, it
doesn't have to be just the one value:
print "Enter a name :";
chomp ($x=<STDIN>);
@names=("Muriel","Gavin","Susanne","Sarah");
@cities=("Brussels","Hamburg","London","Breda");
print "@names\n";
push (@names, $x, 10, @cities[2..4]);
print "@names\n";
This is worth looking at in more detail. There is no fifth
element of @cities , yet we referred to it with
@cities[2..4] .
Perl does not complain; it supplies an empty (undefined) value in its place. Add this to the end of the example :
print "There are ",scalar(@names)," elements in \@names\n";
There appear to be 8 elements in @names . However,
we have just proved there are in fact 9. The reason there are 9 is that we
referred to a non-existent element of @cities , and
Perl has quite happily extended @names to suit. The array
@cities remains unchanged. Try
popping the array if you don't believe me.
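Here is a cut-down sketch of that experiment without the keyboard input. It uses defined, a Perl function that tests whether a value has been set; the variable names $count and $status are made up for the illustration.

```perl
@names  = ("Muriel","Gavin","Susanne","Sarah");
@cities = ("Brussels","Hamburg","London","Breda");

push (@names, @cities[2..4]);        # index 4 does not exist in @cities

$count = scalar(@names);             # 4 names + 3 slice entries
$last  = pop(@names);                # the entry that came from $cities[4]

$status = "undefined";
$status = "defined" if defined $last;

print "After the push there were $count elements\n";
print "The last one was $status\n";
print "\@cities has ", scalar(@cities), " elements\n";
```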
So that's push . Now for some...
@names=("Muriel","Gavin","Susanne","Sarah");
@cities=("Brussels","Hamburg","London","Breda");
&look;
$last=pop(@names);
unshift (@cities, $last);
&look;
sub look {
print "Names : @names\n";
print "Cities: @cities\n";
}
Now we have two arrays. The pop function removes
the last element of an array and returns it, which means you can do something
like assign the returned value to a variable. The
unshift function adds a value to the beginning of the
array. Hope you didn't forget that
&subroutinename calls a subroutine. Presented
below are the functions you can use to work with arrays:
| push | Adds value to the end of the array |
| pop | Removes and returns value from end of array |
| shift | Removes and returns value from beginning of array |
| unshift | Adds value to the beginning of array |
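All four functions in one small sketch, with the state of the array shown in the comments. The array @queue and its contents are invented for the illustration.

```perl
@queue = ("b","c");

push    (@queue, "d");           # b c d
unshift (@queue, "a");           # a b c d
$from_end   = pop   (@queue);    # takes d, leaving a b c
$from_start = shift (@queue);    # takes a, leaving b c

print "Removed '$from_start' from the start and '$from_end' from the end\n";
print "Left over: @queue\n";
```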
Now, accessing other elements of arrays. May I present the
splice function ?
@names=("Muriel","Sarah","Susanne","Gavin");
&look;
@middle=splice (@names, 1, 2);
&look;
sub look {
print "Names : @names\n";
print "The Splice Girls are: @middle\n";
}
The first argument for splice is an array. The
second is the offset. The offset is the index number of the list element to
begin splicing at. In this case it is 1. Then comes the number of elements to
remove, which is sensibly 1 or more; in this case it is 2. You can set it to 0 and perl,
in true perl style, won't complain. Setting to 0 is handy because
splice can add elements to the middle of an array,
and if you don't want any deleted 0 is the number to use. Like so:
@names=("Muriel","Gavin","Susanne","Sarah");
@cities=("Brussels","Hamburg","London","Breda");
&look;
splice (@names, 1, 0, @cities[1..3]);
&look;
sub look {
print "Names : @names\n";
print "Cities: @cities\n";
}
Notice how the assignment to @middle has gone --
it is no longer relevant.
If you assign the result of a splice to a scalar
then:
@names=("Muriel","Sarah","Susanne","Gavin");
&look;
$middle=splice (@names, 1, 2);
&look;
sub look {
print "Names : @names\n";
print "The Splice Girls are: $middle\n";
}
then the scalar is assigned the last element removed, or undef if no elements are removed.
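The three uses of splice seen so far, side by side. Each one works on its own copy of the list so the results don't interfere; the array names are invented for this sketch.

```perl
# 1. List context: you get back every element that was removed.
@list_ctx = ("Muriel","Sarah","Susanne","Gavin");
@removed  = splice (@list_ctx, 1, 2);

# 2. Scalar context: you get back only the last element removed.
@scalar_ctx   = ("Muriel","Sarah","Susanne","Gavin");
$last_removed = splice (@scalar_ctx, 1, 2);

# 3. Length 0 plus a list: nothing is removed, the list is inserted at the offset.
@inserted = ("Muriel","Gavin");
splice (@inserted, 1, 0, "London", "Breda");

print "Removed (list)  : @removed\n";
print "Removed (scalar): $last_removed\n";
print "After inserting : @inserted\n";
```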
The splice function is also a way to delete
elements from an array. In fact, a discussion of deleting things
is in order. Suppose we want to delete Hamburg from the following array. How do we do it ? Perhaps:
@cities=("Brussels","Hamburg","London","Breda");
&look;
$cities[1]="";
&look;
sub look {
print "Cities: ",scalar(@cities), ": @cities\n";
}
would be appropriate. Certainly Hamburg is removed. Shame, such a great
lake. But note, the array element still exists. There are still four elements
in @cities. So what we need is the appropriate
splice function, which removes the element entirely.
splice (@cities, 1, 1);
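The two approaches compared in one sketch. Two copies of the list are used, with names I made up, so you can see both results at once.

```perl
@blanked = ("Brussels","Hamburg","London","Breda");
$blanked[1] = "";                       # the element is emptied, but still there
$blanked_count = scalar(@blanked);

@spliced = ("Brussels","Hamburg","London","Breda");
splice (@spliced, 1, 1);                # the element is removed, the rest close up
$spliced_count = scalar(@spliced);

print "Blanked: $blanked_count elements: @blanked\n";
print "Spliced: $spliced_count elements: @spliced\n";
```

Notice that after the splice, London has moved down to index 1.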
Now that's all well and good for arrays. What about ordinary variables, such as these:
$car ="Porsche 911";
$aircraft="G-BBNX";
&look;
$car="";
&look;
sub look {
print "Car :$car: Aircraft:$aircraft:\n";
print "Aircraft exists !\n" if $aircraft;
print "Car exists !\n" if $car;
}
It looks like we have deleted the $car variable.
Pity. But think about it. It is not deleted, it is just set to the null string
"". As you recall (hopefully) from previous ramblings, the null string
evaluates to false so the if test fails.
Just because something is false doesn't mean to say it doesn't exist. A wig is false hair, but a wig exists. Your variable is still there. Perl does have a function to test if something exists. Existence, in Perl terms, means defined. So:
print "Car is defined !\n" if defined $car;
will evaluate to true, as the $car variable does
in fact exist.
This begs the question of how to really wipe variables from the face of the earth, or at least your Perl script. Simple.
$car ="Porsche 911";
$aircraft="G-BBNX";
&look;
undef $car; # this undefines $car
&look;
sub look {
print "Car :$car: Aircraft:$aircraft:\n";
print "Aircraft exists !\n" if $aircraft;
print "Car exists !\n" if defined $car;
}
This variable $car is eradicated, deleted, killed,
destroyed.
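False and undefined are easy to muddle, so here is a sketch that records the answers to both questions. The yes/no variables are invented for the illustration, and the zero case is my own addition to the examples above.

```perl
$car = "";                        # the null string: false, but defined
$empty_is_true    = "no";
$empty_is_true    = "yes" if $car;
$empty_is_defined = "no";
$empty_is_defined = "yes" if defined $car;

$count = 0;                       # zero is also false, and also defined
$zero_is_defined  = "no";
$zero_is_defined  = "yes" if defined $count;

undef $car;                       # now $car has no value at all
$undef_is_defined = "no";
$undef_is_defined = "yes" if defined $car;

print "Empty string: true=$empty_is_true defined=$empty_is_defined\n";
print "Zero        : defined=$zero_is_defined\n";
print "After undef : defined=$undef_is_defined\n";
```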
And now for something completely different....
Regular expressions, or regex for short, can be a little intimidating. But I'll bet you have already used something like a regex in your computing life so far. Have you ever said "I'll have any Dutch beer" ? That's a regex which will match a Grolsch or Heineken, but not a Budweiser, orange juice or cheese toastie. What about dir *.txt ? That's the same idea (strictly speaking a wildcard rather than a true regex), listing any files ending in .txt.
Perl's regex often look like this:
$name=~/piper/
That is saying "If 'piper' is inside $name, then
True."
The regular expression itself is between
/ / slashes, and the =~
operator assigns the target for the search.
An example is called for. Run this, and answer it with 'the faq'. Then try 'my tealeaves' and see what happens.
print "What do you read before joining any Perl discussion ? ";
chomp ($_=<STDIN>);
print "Your answer was : $_\n";
if ($_=~/the faq/) {
print "Right ! Join up !\n";
} else {
print "Begone, vile creature !\n";
}
So here $_ is searched for 'the faq'. Guess
what we don't need ! The =~ .
This works just as well:
if (/the faq/) {
because if you don't specify a variable, then perl searches
$_ by default. In this particular case, it would be
better to use
if ($_ eq "the faq") {
as we are testing for exact matches.
But what if someone enters 'The FAQ' ? It fails, because the regex is case sensitive. We can easily fix that:
if (/the faq/i) {
with the /i switch, which specifies
case-insensitivity. Now it works for all variations, such as "the Faq" and
"the FAQ".
Now you can appreciate why a regular expression is better in this situation
than a simple test using eq . As the regex searches
one string for another string, a response of "I would read the FAQ first !"
will also work, because "the FAQ" will match the regex.
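The three tests compared on that longer answer, as a sketch. The answer is hard-coded in place of <STDIN>, and the yes/no variables are made up for the illustration.

```perl
$answer = "I would read the FAQ first !";

$exact = "no";
$exact = "yes" if $answer eq "the faq";      # whole string must be identical

$cased = "no";
$cased = "yes" if $answer =~ /the faq/;      # found anywhere, but case must match

$loose = "no";
$loose = "yes" if $answer =~ /the faq/i;     # found anywhere, any mix of case

print "eq: $exact, regex: $cased, regex with /i: $loose\n";
```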
Study this example just to clarify the above. Tabs and spaces have been added for aesthetic beauty:
$_="perl for Win32"; # sets the string to be searched
if ($_=~/perl/) { print "Found perl\n" }; # is 'perl' inside $_ ? $_ is "perl for Win32".
if (/perl/) { print "Found perl\n" }; # same as the regex above. Don't need the =~ as we are testing $_
if (/PeRl/) { print "Found PeRl\n" }; # this will fail because of case sensitivity
if (/er/) { print "Found er\n" }; # this will work, because there is an 'er' in 'perl'
if (/n3/) { print "Found n3\n" }; # this will work, because there is an 'n3' in 'Win32'
if (/win32/) { print "Found win32\n" }; # this will fail because of case sensitivity
if (/win32/i) { print "Found win32 (i)\n" }; # this will *work* because of case insensitivity (note the /i)
print "Found!\n" if / /; # another way of doing it, this time looking for a space
print "Found!!\n" unless $_!~/ /; # both these are the same, but reversing the logic with unless and !
print "Found!!\n" unless !/ /; # don't do this, it will always never not confuse nobody :-)
# the ~ stays the same, but = is changed to ! (negation)
$find=32; # Create some variables to search for
$find2=" for "; # some spaces in the variable too
if (/$find/) { print "Found '$find'\n" }; # you can search for variables like numbers
if (/$find2/) { print "Found '$find2'\n" }; # and of course strings !
print "Found $find2\n" if /$find2/; # different way to do the above
As you can see from the last example, you can embed a variable in the regex too. Regular expressions could fill entire books (and they have done, see the book critiques at http://www.perl.com/) but here are some useful tricks:
@names=qw(Karlson Carleon Karla Carla Karin Carina Needanotherword);
foreach (@names) { # sets each element of @names to $_ in turn
if (/[KC]arl/) { # this line will be changed a few times in the examples below
print "Match ! $_\n";
} else {
print "Sorry. $_\n";
}
}
This time @names is initialised using whitespace
as a delimiter instead of a comma. qw refers to
'quote words', which means split the list by words. A word ends with whitespace
(like tabs, spaces, newlines etc).
The square brackets enclose single characters to be matched. Here
either Karl or
Carl must be in each element. It doesn't have to be
two characters, and you can use more than one set. Change the line with the if test in the above
program to:
if (/[KCZ]arl[sa]/) {
This matches if the name contains K, C or Z, followed by arl, followed by either s or a. It does not match KCZarl. Negation is possible too, so try this :
if (/[KCZ]arl[^sa]/) {
which returns things beginning with K, C or Z, then
arl, and then anything EXCEPT s or a. The caret
^ has to be the first character, otherwise it doesn't
work as the negation. Having said [ ] defines single
characters only, I should mention that these two are the same :
/[abcdeZ]arl/;
/[a-eZ]arl/;
If you use a hyphen then you get the list of characters including the start and finish characters. And if you want to match a special character (metacharacter), you must escape it:
/[\-K]arl/;
matches Karl or -arl. Although the
- character is represented by two characters,
it is just the one character to match.
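If you would rather see all three character-class tests at once instead of editing the program each time, this sketch sorts the names into three lists. The list names are invented for the illustration.

```perl
@names = qw(Karlson Carleon Karla Carla Karin Carina Needanotherword);

@plain    = ();     # K or C, then arl
@followed = ();     # K, C or Z, then arl, then s or a
@negated  = ();     # K, C or Z, then arl, then anything but s or a

foreach (@names) {
    push (@plain,    $_) if /[KC]arl/;
    push (@followed, $_) if /[KCZ]arl[sa]/;
    push (@negated,  $_) if /[KCZ]arl[^sa]/;
}

print "[KC]arl        : @plain\n";
print "[KCZ]arl[sa]   : @followed\n";
print "[KCZ]arl[^sa]  : @negated\n";
```

Karla and Carla miss the last test because the negated class still needs a character after the l, and the only one available is an a.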
If you want to match at the end of the line, make sure a
$ is the last character in the regex. This one pulls
out all those names ending in a. Slot it into the example above :
if (/a$/) {
And there is a corresponding character, the caret
^ , which in this context matches at the
beginning of the string. Yes, the caret also negates a character class
like this [^KCZ]arl but in this case it anchors
the match to the beginning of the string.
if (/n/i) {
if (/^n/i) {
The first one is true if the word contains an 'n' anywhere in it. The second specifies that the 'n' must be at the beginning of the string to be matched. Use this anchor where you can, because it makes the whole regex faster, and safer if you know what the first character must be.
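The anchors can be tried on the same list of names with this sketch, which again sorts the matches into lists whose names I have made up.

```perl
@names = qw(Karlson Carleon Karla Carla Karin Carina Needanotherword);

@ends_a   = ();
@has_n    = ();
@starts_n = ();

foreach (@names) {
    push (@ends_a,   $_) if /a$/;      # a must be the last character
    push (@has_n,    $_) if /n/i;      # n or N anywhere at all
    push (@starts_n, $_) if /^n/i;     # n or N as the first character
}

print "Ends in a  : @ends_a\n";
print "Contains n : @has_n\n";
print "Starts in n: @starts_n\n";
```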
If you want to negate the entire regex change =~ to
!~ (Remember ! means
'not'.)
if ($_ !~/[KC]arl/) {
Of course, as we are testing $_
this works too:
if (!/[KC]arl/) {
Now things get interesting. What if we want to pull something out of a string ? So far all we have done is test for truth, that is say yea or nay if a string matches, but not return what we found. Run this:
$_='My email address is <Robert@NetCat.co.uk>.';
/(<robert\@netcat.co.uk>)/i;
print "Found it ! $1\n";
Firstly, note the single quotes when $_ is
assigned. If there were double quotes, we'd need \@
instead of @ . Remember, double quotes "" allow
variable interpolation, so Perl would look for an array called
@NetCat which does not exist.
Secondly, look at the parens around the entire regex. If you use parens, a
side effect is that the first match is put into a variable called
$1 . We'll get to the main effect later. The second
match goes into $2 and so on. Also note that the
\@ has been escaped, so perl doesn't think it is an
array. Remember \ either escapes a special character,
or gives a special meaning. Think of it as Superman's telephone box. Imagine
Clark Kent walking around with his magic partner Back Slash.
Notice how we specify in the regex case-insensitivity with
/i and the regex returns the case-sensitive
string - that is, exactly what it found.
Try the regex without parens. Then try this one:
/<(robert)\@netcat.co.uk>/i;
You can put the parens anywhere. More or less. Now, run this :
$_='My email address is <Robert@NetCat.co.uk>.';
/<(robert)\@(netcat.co.uk)>/i;
print "Found it ! $1 at $2\n";
See, you can have more than one ! Look at the above regex. Looks easy now, don't you think ? What about five minutes ago ? It would have looked like a typing mistake ! Well, there are some hairier regex to come, but you'll have a good barber.
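One habit worth sketching here: $1 and $2 are only set by a successful match, and after a failed match they still hold whatever an earlier success left behind. So it is safer to use them inside an if. The names $user and $host, and the escaped dots (\. matches a literal dot rather than any character), are my own tweaks to the example, not part of it.

```perl
$_ = 'My email address is <Robert@NetCat.co.uk>.';

if (/<(robert)\@(netcat\.co\.uk)>/i) {
    ($user, $host) = ($1, $2);          # copy the captures somewhere safe
    print "Found it ! $user at $host\n";
} else {
    print "No match, so \$1 and \$2 cannot be trusted\n";
}
```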
What if we didn't know what the email address was going to be ?
$_='My email address is <webslave@work.com>.';
print "Found it ! :$1:" if /(<.*>)/i;
When you see an if
statement like this, read it right to left. The print
statement is only executed if code on the right of the expression is
true.
We'll discuss this. Firstly, we have the opening parens ( .
So everything from ( to )
will be put into $1 if the match
is successful. Then the first character of what we are searching for,
< . Then we have a dot, or period
. . For this regex, we can assume . matches any character at all.
So we are
now matching < followed by any
character. The * means 0 or more of the previous
character. The regex finishes by requiring > .
This is important. Get the basics right and all regex are easy (I read somewhere once). An example best illustrates the point. Slot this regex in instead:
$_='My email address is <webslave@work.com>.';
print "Found it ! :$1:" if /(<*>)/i;
What's happening here ?
The regex starts, logically, at the start of the string. This doesn't mean it starts at 'M', it starts just before M. There is a 'nothing' between the string start and 'M'.
The regex is searching for <* , which is 0
or more < .
The first thing it finds is not < , but
the nothing in between the start of the string and the 'M' from 'My email...'. Does
this match ?
As the regex is looking for "0 or more" < ,
we can certainly say that there are 0 < at the
start of the string. So the match is, so far, successful. We have dealt with
<* .
However, the next item to match is > .
Unfortunately, the next item in the string is 'M', from 'My email...'. The match
fails at this point. Sure, it matched <* without
any problem, but the complete match has to work.
The only two characters that can match successfully at this point are
< or > .
The 'point' being that <* has been matched
successfully, and we need either > to
complete the match or more of < to continue
the '0 or more' match denoted by * .
'M' is neither of them, so the match fails at this point.
Quick clarification - the regex cannot successfully match < , then skip on ahead through the string until it matches > . The characters in the string between < > also need to match the regex, and
they don't in this case.
All is not lost. Regexes are hardy little beasts and don't give up easily. An attempt is made to match the regex wherever possible. The regex system keeps trying the match at every possible place in the string, working towards the end.
Let's look at the match when it reaches the 'm' in 'work.com'.
Again, we have here 0 < . So the match
works as before. After success on <* the next
character is analysed - it is a > , so the
match is successful.
But, be warned. The match may be successful but your job is not done. Assuming the objective was to return the email address within the angle brackets then that regex is a miserable failure. Watch for traps of this nature when regexing.
That's * explained. Just to consolidate, a
quick look at:
$_='My email address is <webslave@work.com>.';
print "Match 1 worked :$1:" if /(<*)/i;
$_='<My email address is <webslave@work.com>.';
print "Match 2 worked :$1:" if /(<*)/i;
$_='My email address is <webslave@work.com<<<<>.';
print "Match 3 worked :$1:" if /(<*>)/i;
Match 1 is true. It doesn't return anything, but it is true
because there are 0 < at the very
start of the string.
Match 2 works. After the 0 < at the start of
the string, there is 1 < so the regex can
match that too.
Match 3 works. After failing at the first < (it is followed by 'w', not > ),
the regex moves on through the string to the next < . From there, there are plenty more < to match right up until the
required > .
Glad you followed that. Now, pay even closer attention! Concentrate fully on the task at hand ! This should be straightforward now:
$_='HTML <I>munging</I> time !.';
/<I>(.*)<\/I>/i;
print "Found it ! $1\n";
Pretty much the same as the above, except the parens are moved so
we return what's only inside the tags, not including the tags themselves. Also
note how / is escaped like so; \/
otherwise Perl thinks that's the end of the regex.
Now, suppose we change $_ to :
$_='HTML <I>munging</I> time is here <I>again</I> !.';
and run it again. Interesting effect, eh ? This is known as Greedy
Matching. What happens is that when Perl finds the initial match, that is <I> it jumps right to the end of the string and
works back from there to find a match, so the longest string
matches. This is fine unless you want the shortest string. And there is a
solution:
/<I>(.*?)<\/I>/i;
Just add a question mark and Perl does stingy matching. No nationalistic jokes. I have Dutch and Scottish friends I don't want to offend.
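If you want to see greedy and stingy side by side, a small sketch like this runs both regexes against the same string. The variable names are mine, not part of the tutorial.

```perl
use strict;
use warnings;

my $html = 'HTML <I>munging</I> time is here <I>again</I> !.';

# Greedy: .* grabs as much as it can, then gives back only what it must.
my ($greedy) = $html =~ /<I>(.*)<\/I>/i;

# Stingy: .*? grabs as little as it can get away with.
my ($stingy) = $html =~ /<I>(.*?)<\/I>/i;

print "Greedy: $greedy\n";
print "Stingy: $stingy\n";
```

The greedy version runs from the first <I> to the last </I>; the stingy one stops at the first </I> it meets.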
You know what * means, namely match 0 or
more. If you want to match 1 or more, then use + .
The difference is important.
$_='The number is 2200 and the day is Monday';
($star)=/([0-9]*)/;
($plus)=/([0-9]+)/;
print "Star is '$star' and Plus is '$plus'\n";
You'll note that $star has no value. The match was successful
though. It managed to match 0 or more characters from 0 to 9 at the very start
of the string.
The second regex with $plus worked a little better, because we
are matching one or more characters from 0 to 9. Therefore, unless one 0 to 9
is found the match will fail. Once a 0-9 is found, the match continues as long
as the next character is 0-9, then it stops.
Now we know this, there is another way to remove an email address from within angle brackets:
$_='My email address is <robert@netcat.co.uk> !.';
/<([^>]+)/i;
print "Found it ! $1\n";
This regex matches <. Then the capturing parens
start. They have no effect on this regex other than to capture the match.
After that, there is a character class, containing one character. As ^
is the first character in the class, it negates the class. That's why
we are using a character class with only one character in it, because it can
be negated.
So far we have matched < and anything that is not
>. The + ensures we match as many characters that
are not >'s as we can. This has the same effect as
.*? but is more efficient. It may also suit your purposes, as
.*? relies on you knowing what you want to match up to, whereas
[^>]+ simply continues matching until it finds something that
fails its criteria. Just make sure you understand the difference because it is
a crucial part of regexery.
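One way to see the difference is to feed both approaches a string where the closing > has gone missing. This is only a sketch; the second test string and the variable names are my own invention.

```perl
use strict;
use warnings;

my $closed   = 'My email address is <robert@netcat.co.uk> !.';
my $unclosed = 'My email address is <robert@netcat.co.uk !.';

my %result;
for my $text ($closed, $unclosed) {
    # .*? must find a > to stop at, or the whole match fails.
    my ($stingy)  = $text =~ /<(.*?)>/;
    # [^>]+ just keeps going until a > or the end of the string.
    my ($negated) = $text =~ /<([^>]+)/;
    $result{$text} = [$stingy, $negated];
    printf "stingy: %-22s negated: %s\n",
        defined $stingy ? $stingy : '(no match)', $negated;
}
```

With the bracket missing, the stingy version finds nothing at all, while the negated class happily runs to the end of the string. Which behaviour you want depends on the job.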
Suppose we didn't know what HTML tag we had to match ? It could be B, I, EM or whatever, and we want everything that is in between. Well, HTML container tags like B and EM have end tags which are the same as the start tag, except for the / . So what we could do is:
Can this be done ? Of course. This is perl, all things are possible. Now, remember the
side effect of parens. I promise I'll explain the primary effect at some point. If
whatever is in (parens) matches, the result is stored in a variable called $1 . So we can use <(.*?)>
which will find us < then as
many anythings (the . and *
) up to the next, not last > (the
? forces stingy matching).
The result is stored in $1 because we used
parens. Next, we need everything up to the closing tag. That's easy : (.*?) matches everything up until the next character or set of characters. And how exactly do we define where to stop ?
We can use $1 even in the same regex it was
found in. However, it is not referred to within a regex as $1 ,
but \1 .
So we want to match </$1> which in perl
code is <\/\1> . The /
must be escaped because otherwise Perl would treat it as the end of the regex, and 1 is escaped so it refers to $1
instead of matching the number 1.
Still here ? This is what it looks like:
$_='HTML <I>munging</I> time is here <I>again</I> !.';
/<(.*?)>(.*?)<\/\1>/i;
print "Found it ! $2\n";
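To convince yourself that \1 really does insist on the same tag, try the regex on a few fragments, including one where the end tag doesn't match the start tag. The sample fragments here are made up for the purpose.

```perl
use strict;
use warnings;

my @report;
for my $html ('<B>bold</B>', '<EM>stress</EM>', '<B>broken</I>') {
    # \1 must equal whatever the first set of parens captured.
    if ($html =~ /<(.*?)>(.*?)<\/\1>/i) {
        push @report, "$1 contains '$2'";
    } else {
        push @report, "no matching pair in $html";
    }
}
print "$_\n" for @report;
```

The last fragment fails because there is no </B> for \1 to match.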
If you want to know how to return all the matches above, read on. But before that:
You want to match this; http://language.perl.com/faq/ . That's a real
(useful) URL by the way. Hint. To match it, you need to do this:
/http:\/\/language\.perl\.com\/faq\//;
which should make the awful metaphor above clearer, if not
funnier. The slash, / , is not
normally a metacharacter but as it is being used for the regular expression
delimiters, it needs to be escaped. We already know that .
is special.
Fortunately for our eyes, Perl allows you to pick your delimiter if you
prefix it with 'm' as this example shows. We'll use a #:
m#http://language\.perl\.com/faq/#;
Which is a huge improvement, as we change /
to # . We can go further with readability by quoting
everything:
m#\Qhttp://language.perl.com/faq/\E#;
The \Q escapes everything
up until \E or the regex delimiter (so
we don't really need the \E above). In this case #
will not be escaped, as it delimits the regex.
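Here is a short sketch of why the quoting matters, assuming a deliberately wrong lookalike URL of my own making. Without \Q each . matches any character, so the lookalike slips through; with \Q it doesn't. The same trick works on a variable interpolated into the regex.

```perl
use strict;
use warnings;

my $url       = 'http://language.perl.com/faq/';
my $lookalike = 'http://languageXperlXcom/faq/';   # hypothetical bad input

# Unquoted: the dots are metacharacters and match the X's.
my $loose  = ($lookalike =~ m#http://language.perl.com/faq/#)      ? 1 : 0;

# Quoted: \Q ... \E makes every dot a literal dot.
my $quoted = ($lookalike =~ m#\Qhttp://language.perl.com/faq/\E#)  ? 1 : 0;

# \Q also applies to interpolated variables.
my $real   = ($url =~ m#\Q$url\E#)                                 ? 1 : 0;

print "loose=$loose quoted=$quoted real=$real\n";
```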
Someone once posted a question about this to the Perl-Win32-Users mailing
list and I was so intrigued about this apparently undocumented trick I spent
the next twenty minutes figuring it out by trial and error, and posted a reply.
Next day I found lots of messages telling the poster to read the manual because
it was clearly documented. <face colour='red' intensity='high'> My excuse
was I didn't have the docs to hand....moral of the story - RTFM and RTF FAQs !
Suppose you want to replace bits of a string. For example, 'us' with 'them'.
$_='Us ? The bus usually waits for us, unless the driver forgets us.';
print "$_\n";
s/Us/them/; # operates on $_, otherwise you need $foo=~s/Us/them/;
print "$_\n";
What happens here is that the string 'Us' is searched for, and when a match is found it is replaced with the right side of the expression, in this case 'them'. Simple.
You'll notice that only one substitution was made. To match globally use
/g which runs through the entire string,
changing wherever it can. Try:
s/Us/them/g;
which fails. This is because regexes are, by default, case-sensitive, so only the capitalised 'Us' is changed. So:
s/us/them/ig;
would be a better bet. Now, everything is changed. A little too much, but
one problem at a time. Everything you have learned about regex so far
can be used with s/// , like parens, character
classes [ ] , greedy and stingy matching and much
more. Deleting things is easy too. Just specify nothing as the replacement
character, like so s/Us//; .
So we can use some of that knowledge to fix this problem. We need to make sure that a space precedes the 'us'. What about:
s/ us/them/g;
A small improvement. The first 'Us' is now no longer changed, but one problem at a time ! We'll first consider the problem of the regex changing 'usually' and other words with 'us' in them.
What we are looking for is a space, then 'us', then a comma, period or space. We know how to specify one of a number of options - the character class.
s/ us[. ,]/them/g;
Another tiny step. Unfortunately, that step wasn't really in the right direction, more on the slippery slope to Poor Programming Practice. Why? Because we are limiting ourselves. Suppose someone wrote ' send it to us; when we get it'.
You can't think of all the possible permutations. It is often easier, and safer, to simply state what must not follow the match. In this case, it can be anything except a letter. We can define that as a-z. So we can add that to the regex.
s/ us[^a-z]/ them/g;
the caret ^ negates the character class, and
a-z represents every letter from a to z
inclusive. A space has been added to the substitution part - as the original
space was matched, it should be replaced to maintain readability.
What would be more useful is to use a-zA-Z
instead. If we weren't using /i we'd need
that. As a-zA-Z is such a common construct, Perl
provides an easy shorthand:
s/ us[^\w]/ them/g;
The \w construct actually means 'word' -
equivalent to a-zA-Z_0-9 .
So we'll use that instead.
To negate any construct, simply capitalise it:
s/ us[\W]/ them/g;
and of course we don't need the negating caret now. In fact, we don't even need the character class !
s/ us\W/ them/g;
So far, so good. Matching the first 'us' is going to be difficult though.
Fortunately, there is an easy solution. We've seen Perl's definition of a
word - \w . Between each word is a boundary. You
can match this with \b .
s/\bus\W/ them/ig;
that's \b followed by 'us', not 'bus' :-) Now, we
require a word boundary before 'us'. As there is a 'nothing' at the start of
the string, we have a match. There is a space after the first 'Us', so the
match is successful. You might notice an extra space has crept in - that's
the space we added earlier. The match doesn't include the space any more - it
matches on the word boundary, that is just before the word begins. The space
doesn't count.
Did you notice the final period and the comma are replaced ? They are part of the match - it is the
\W that matches them. We can't avoid that. We can
however put back that part of the match.
s/\bus(\W)/them$1/ig;
We start with capturing whatever the \W
matches, using parens. Then, we add it to the replacement string. The
capture is in $1 , and that is how we refer to it in the replacement. The
\1 form belongs inside the pattern itself; Perl tolerates it in the replacement but warns about it.
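As an aside, here is an alternative sketch rather than the route taken above. Since \b matches a position and not a character, putting a second \b after 'us' means the comma and the period never become part of the match, so there is nothing to put back. It assumes the same sentence and uses /i so the capitalised 'Us' is caught too.

```perl
use strict;
use warnings;

my $text = 'Us ? The bus usually waits for us, unless the driver forgets us.';

# Copy the string, then substitute on the copy.
# \b on both sides: 'us' must stand alone as a word.
(my $changed = $text) =~ s/\bus\b/them/ig;

print "$text\n";
print "$changed\n";
```

'bus' and 'usually' are left alone because there is no word boundary in the right place. Capitalisation is still wrong, which is the next problem.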
The final problem is of course capitalising the replacement string when appropriate. Which in old versions of the tutorial I left as an exercise to the reader, having run out of motivation. A reader by the name of Paul Trafford duly solved the problem, and I have just inserted his excellent explanation for the elucidation of all concerned:
# Solution to the us/them problem...
#
# The program works through the text assigning the
# variable $1 to 'U' or 'u' for any words where this
# letter is followed by 's' and then by non 'word'
# characters. The latter is assigned to variable $2.
#
# For each such matching occurrence, $1 is replaced by
# the letter that precedes it in the alphabet using
# operations 'ord' and 'chr' that return the ASCII value
# of a character and the character corresponding to a
# given natural number. After this 'hem' is tacked on
# followed by $2, to retain the shape of the original
# sentence. The '/e' switch is used for evaluation.
#
# NOTES
# 1. This solution will not replace US (short for
#    United States) with Them or them.
#
# 2. If a 'magical' decrement operator '--' existed for
#    strings then the solution could be simplified for we
#    wouldn't need to use the 'chr' and 'ord' operators.
$_='Us ? The bus usually waits for us, unless the driver forgets us.';
print "$_\n";
s/\b([Uu])s(\W)/chr(ord($1)-1).'hem'.$2/eg; # 'hem' quoted so it is not a bareword
print "$_\n";
An excellent solution, thanks Paul.
There are several more constructs. We'll take a quick look at
\d which means anything that is a digit, that is
0-9 . First we'll use the negated form,
\D , which is anything except
0-9 :
print "Enter a number :";
chop ($input=<STDIN>);
if ($input=~/\D/) {
print "Not a number !!!!\n";
} else {
print 'Your answer is ',$input x 3,"\n";
}
this checks that there are no non-number characters in
$input . It's not perfect because it'll choke on
decimal points, but it's just an example. Writing your own number-checker is
actually quite difficult, but it is an interesting exercise. Try it, and see
how accurate yours is.
I hope you trusted me and typed the above in exactly as it is shown (or
pasted it), because the x is not a mistake,
it is a feature. If you were too smart and changed it to a
* or something change it back and see what it does.
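In case you'd rather see the two operators next to each other, this little sketch uses a fixed value instead of reading from STDIN. The value 12 is just an example.

```perl
use strict;
use warnings;

my $input = '12';

my $repeated   = $input x 3;   # x repeats the string three times
my $multiplied = $input * 3;   # * treats it as a number and multiplies

print "x gives $repeated\n";
print "* gives $multiplied\n";
```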
Of course, there is another way to do it :
unless ($input=~/\D/) {
print 'Your answer is ',$input x 3,"\n";
} else {
print "Not a number !!!!\n";
}
which reverses the logic with an unless
statement.
Assume we have:
$_='HTML <I>munging</I> time is here <I>again</I> !.';
and we want to find all the italic words. We know that
/g will match globally, so surely this will work :
$_='HTML <I>munging</I> time is here <I>again</I> ! What <EM>fun</EM> !';
$match=/<i>(.*?)<\/i>/ig;
print "$match\n";
except it returns 1, and there were definitely two matches. The match
operator returns true or false, not the number of matches. So you can test
it for truth with functions like if, while and
unless . Incidentally, the s///
operator does return the number of substitutions.
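A quick sketch of that difference, reusing the bus sentence from earlier. The variable names are only for illustration, and the substitution is done on a copy so the original string survives.

```perl
use strict;
use warnings;

my $text = 'Us ? The bus usually waits for us, unless the driver forgets us.';

# A match in scalar context only says whether it worked.
my $matched = ($text =~ /us/i);

# A substitution with /g says how many changes it made.
my $copy  = $text;
my $count = ($copy =~ s/us/them/ig);

print "Match returned $matched\n";
print "Substitution returned $count\n";
```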
To return what is matched, you need to supply a list.
($match) = /<i>(.*?)<\/i>/i;
which handily puts the first match into
$match . Note that an = is used (for
assignment), as opposed to =~ (to point the regex at a variable
other than $_ ).
The parens force a list context in this case. There is just the one element in the list, but it is still a list. The entire match will be assigned to the list, or whatever is in the parens. Try adding some parens:
$_='HTML <I>munging</I> time is here <I>again</I> ! What <EM>fun</EM> !';
($word1, $word2) = /<i>(.*?)<\/i>/ig;
print "Word 1 is $word1 and Word 2 is $word2\n";
In the example above notice /g has been added
so a global match is done - this means perl carries on matching
even after it finds the first match. Of course, you might not know how many
matches there will be, so you can just use an array, or any other type of list:
$_='HTML <I>munging</I> time is here <I>again</I> ! What <EM>fun</EM> !';
@words = /<i>(.*?)<\/i>/ig;
foreach $word (@words) {
print "Found $word\n";
}
and @words will be grown to the appropriate
size for the matches. You really can supply what you like to be assigned to:
($word1, @words[2..3], $last) = /<i>(.*?)<\/i>/ig;
you'll need more italics for that last one to work. It was only a demonstration.
There is another trick worth knowing. Because a regex returns true each time
it matches, we can test that and do something every time it returns true. The
ideal function is while which means 'do something as
long as the condition I'm testing is true'. In this case, we'll print out the
match every time it is true.
$_='HTML <I>munging</I> time is here <I>again</I> ! What <EM>fun</EM> !';
while (/<(.*?)>(.*?)<\/\1>/g) {
print "Found the HTML tag $1 which has $2 inside\n";
}
So the while operator runs the regex, and if it is true, carries out the statements inside the block.
Try running the program above without the /g .
Notice how it loops forever ? That's because the expression always evaluates to
true. By using the /g we force the match to move on
until it eventually fails.
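If you want to watch the match 'move on', Perl's built-in pos function reports where the next /g search will start. This is a minimal sketch using the same string; the @seen array is just there to collect the results.

```perl
use strict;
use warnings;

my $html = 'HTML <I>munging</I> time is here <I>again</I> ! What <EM>fun</EM> !';

my @seen;
while ($html =~ /<(.*?)>(.*?)<\/\1>/g) {
    push @seen, "$1=$2";
    # pos() is the offset just past the end of the match we just made.
    printf "%-3s %-8s next search starts at offset %d\n", $1, $2, pos($html);
}
```

Each time round the loop the offset gets bigger, until there is nothing left to match and the while stops.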
Now we know this, an easy way to find the number of matches is:
$_='HTML <I>munging</I> time is here <I>again</I> ! What <EM>fun</EM> !';
$found++ while /<i>.*?<\/i>/ig;
print "Found $found matches\n";
You don't need braces in this case as nothing apart from the expression
to be evaluated follows the while function.
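Another way to count, sketched here on the assumption that you don't mind keeping the matches around: assign the /g match to an array, then use the array in scalar context, which gives its size.

```perl
use strict;
use warnings;

my $html = 'HTML <I>munging</I> time is here <I>again</I> ! What <EM>fun</EM> !';

# List context with /g returns every capture.
my @italics = $html =~ /<i>(.*?)<\/i>/ig;

# An array in scalar context is its number of elements.
my $found = @italics;

print "Found $found matches: @italics\n";
```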
The real use for parens: precedence. Try this, and yes you can try it at home:
$_='One word sentences ? Eliminate. Avoid
clichés like the plague. They are old hat.';
while (/o(rd|ne|ld)/gi) {
print "Matched $1\n";
}
Firstly, notice the subtle introduction of the
or operator, in this case |
, the pipe. What I really want to explain however, is that this regex
matches o followed by rd, ne or ld. Without the parens it would be
/ord|ne|ld/ which is definitely not what we want.
That matches just plain ord, or ne or ld.
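To make the difference concrete, this sketch tries both versions on a handful of words I picked for the purpose and records which ones each regex accepts.

```perl
use strict;
use warnings;

my (@grouped, @ungrouped);
for my $word (qw(One word never held cat)) {
    # With parens: 'o' must come first, then rd, ne or ld.
    push @grouped,   $word if $word =~ /o(rd|ne|ld)/i;
    # Without parens: 'ord', or 'ne', or 'ld', anywhere.
    push @ungrouped, $word if $word =~ /ord|ne|ld/i;
}

print "With parens:    @grouped\n";
print "Without parens: @ungrouped\n";
```

'never' and 'held' have no 'o' in them at all, yet the version without parens lets them through.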
In the interests of efficiency, consider this:
print "Give me a name :";
chop($_=<STDIN>);
print "Good name\n" if /Pe(tra|ter|nny)/;
The code above functions correctly. If you were wondering what a good name is, Petra, Peter and Penny qualify. The regex is not as efficient as it could be though. Think about what Perl is doing with the regex, that you are just ignoring. Simply throwing away casually. Without consideration as to the effort that has gone into creating it for you. The resources squandered. The little bytes of memory whose sole function in life is to store this information, which will never be used.
What's happening is that because parens are used, perl is creating
$1 for your usage and abusage. While this may not seem important,
a fair amount of resources go into creating $1, $2
and so on. Not so much the memory used to store them, more the CPU effort
involved. So, if you aren't going to use the parens for capturing purposes, why
bother capturing the match?
print "Give me a name :";
chop($_=<STDIN>);
print "Good name\n" if /Pe(?:tra|ter|nny)/;
print "The match is :$1:\n";
The second print statement demonstrates that nothing is captured this time. You get the benefits of the parens' precedence-changing capabilities, but without the overhead of the capturing. This benefit is especially worthwhile if you are writing CGI programs which use parens in regex -- with CGI, every little bit of efficiency counts.
Finally, take a look at this :
$_='I am sleepy....zzzz....DING ! Wake Up!';
if (/(z{5})/) {
print "Matched $1\n";
} else {
print "Match failed\n";
}
The braces { } specify how many of the
preceding character to match. So z{2} matches exactly
two 'z's and so on. Change z{5} to
z{4} and see how it works. And there's more...
| /z{3}/ | 3 z only |
| /z{3,}/ | At least 3 z |
| /z{1,3}/ | 1 to 3 z |
| /z{4,8}/ | 4 to 8 z |
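The table can be checked against the sleepy string with a short loop. This is a sketch: the quantifiers are kept as plain strings and interpolated into the regex, and z{5} is thrown in as the one that should fail.

```perl
use strict;
use warnings;

my $text = 'I am sleepy....zzzz....DING ! Wake Up!';

my %got;
for my $quantifier ('{3}', '{3,}', '{1,3}', '{4,8}', '{5}') {
    # The quantifier string is interpolated before the regex is compiled.
    $got{$quantifier} = $text =~ /(z$quantifier)/ ? $1 : '(no match)';
    printf "z%-6s %s\n", $quantifier, $got{$quantifier};
}
```

Note that z{3} and z{1,3} both succeed on 'zzzz'. They simply match the first three and leave the fourth alone.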
To any of the above you may suffix a question mark, the effect of which is demonstrated in the following program. Run it a couple of times, inputting 2, 3 and 4:
print "How many letters do you want to match ? ";
chomp($num=<STDIN>);
# we assign and print in one smooth move
print $_="The lowest form of wit is indeed sarcasm, I don't think.\n";
print "Matched \\w{$num,} : $1 \n" if /(\w{$num,})/;
print "Matched \\w{$num,}?: $1 \n" if /(\w{$num,}?)/;
The first match is 'match any word (that's a-zA-Z0-9_) equal to or
longer than $num characters, and return it.' So if you
enter 4, then 'lowest' is returned. The word 'The' doesn't match.
The second match is exactly the same, but the ?
forces a minimal match, so only the part actually matched is returned.
Just to clear this up, amend the program thus:
print "\nMatched \\w{$num,} :";
print "$1 " while /(\w{$num,})/g;
print "\nMatched \\w{$num,}? :";
print "$1 " while /(\w{$num,}?)/g;
Note the addition of /g . Try it without - notice
how the match never moves on ?
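If you don't fancy typing numbers in, here is the same idea as a sketch with the number fixed at 4 and the results collected into arrays, so the two behaviours can be compared directly.

```perl
use strict;
use warnings;

my $text = "The lowest form of wit is indeed sarcasm, I don't think.";

# Greedy: at least 4 word characters, and as many more as are available.
my @greedy  = $text =~ /(\w{4,})/g;

# Minimal: at least 4 word characters, and no more than needed.
my @minimal = $text =~ /(\w{4,}?)/g;

print "Greedy : @greedy\n";
print "Minimal: @minimal\n";
```

The minimal version stops at four characters every time, and the leftovers ('st', 'ed', 'asm', 'k') are too short to match on their own.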
And now on the Regex Programme Today, we have guest stars Prematch, Postmatch and Match. All of whom are going to slow our entire programme down, but are useful anyway :
$_='I am sleepy....snore....DING ! Wake Up!';
/snore/; # look, no parens !
print "Postmatch: $'\n";
print "Prematch: $`\n";
print "Match: $&\n";
If you are wondering what the difference between match and using parens is
you should remember that you can move the parens around, but you can't vary
what $& and its ilk return. Also, using any of
the above three variables does slow your entire program, whereas using parens
will just slow the particular regex you use them for. However, once you've used
one of the three matches you might as well use them all over the place as
you've paid the speed penalty. Use p