Everybody talks about regular expressions, everybody hates regular expressions, and yet everybody ends up using them!
So what is a regular expression? Let's dive into the building blocks of regex, starting with a short intro.
Regular expression:
A regular expression, or rational expression, describes a pattern of characters, and in many languages it is itself an object. It lets us search for specific patterns of text, and it also helps match, locate, and manage text. Regexes look pretty complicated, yet they are very powerful: you can create a regex for almost any pattern of text you can think of.
Building blocks of regular expressions:
Metacharacters are the building blocks of regular expressions. Characters in a regex are understood to be either a metacharacter with a special meaning or a regular character with a literal meaning.
Reserved metacharacters:
Metacharacters that are reserved and need to be escaped to be matched literally:
.[{()\^$|?*+
We will see an example of escaping later.
Common metacharacters are:
Caret (^):
Matches the start of the string, and in multiline mode also matches immediately after each newline.
Example:
^\d{3} will match "456" in "456-112-112".
Dollar ($):
Matches the end of the string or just before the newline at the end of the string, and in multiline mode also matches before a newline.
Example:
\d{3}$ will match the final "112" in "456-112-112".
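To see the two anchors in action, here is a small sketch using Python's built-in re module. The second sample string (lines) is made up for illustration; it only shows how multiline mode changes where ^ and $ can match.

```python
import re

text = "456-112-112"

# ^\d{3} only matches three digits at the very start of the string.
start = re.search(r"^\d{3}", text)

# \d{3}$ only matches three digits at the very end of the string.
end = re.search(r"\d{3}$", text)

print(start.group())  # 456
print(end.group())    # 112

# In multiline mode, ^ and $ also match at the start and end of each line.
lines = "456-112\n789-113"  # made-up two-line sample
line_starts = re.findall(r"^\d{3}", lines, flags=re.MULTILINE)
line_ends = re.findall(r"\d{3}$", lines, flags=re.MULTILINE)
print(line_starts)  # ['456', '789']
print(line_ends)    # ['112', '113']
```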
\d:
Matches a single digit (0–9). The number of \d tokens determines the number of digits our regex will match, i.e. \d means a single digit, \d\d means two digits, and so on.
Example:
\d\d\d will match 327, 123, 787 but not 1223 as a whole, as there are 4 digits in “1223” and our regex matches exactly 3 digits.
\d = 1
\d\d = 12
\d\d\d ≠ 473847, as it matches 3 digits but 473847 contains 6 digits.
\d\d\d ≠ cat, because it matches only digits but cat contains letters.
\D:
The reverse of \d. Matches any character except a digit.
Example:
\D\D = AB
\D\D = xy
\D\D ≠ 12, as it won’t match numeric characters.
\w:
Matches any alphanumeric (word) character, including the underscore.
Example:
\w\w\w = 467
\w\w\w\w = Crow
\w\w\w ≠ python, because python contains 6 characters and \w\w\w matches exactly 3.
\W:
Just as \D is the reverse of \d, \W is the reverse of \w, i.e. it matches anything but alphanumeric characters.
Example:
\W\W = ,, or !! or @#
\W\W\W = !@#
\W\W\W\W ≠ Titanic2, as every character is alphanumeric.
\s:
Matches any whitespace character, such as a space or a tab.
For example, in a text made of two words, the regex \s will match only the space between the two words and ignore everything else.
\S:
Matches any non-whitespace character, the reverse of \s.
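The shorthand classes above can be checked quickly in Python. This is only a sketch: it uses re.fullmatch so that a pattern counts as a match only when it covers the whole text, which is how the "=" and "≠" notation in this article is meant to be read. The sample string "hello world" is made up for the \s and \S part.

```python
import re

# Each tuple is (pattern, text). re.fullmatch succeeds only when the
# whole text fits the pattern.
cases = [
    (r"\d\d\d", "327"),
    (r"\d\d\d", "1223"),
    (r"\d\d\d", "cat"),
    (r"\D\D", "AB"),
    (r"\D\D", "12"),
    (r"\w\w\w\w", "Crow"),
    (r"\w\w\w", "python"),
    (r"\W\W\W", "!@#"),
]

results = {(p, t): re.fullmatch(p, t) is not None for p, t in cases}
for (pattern, text), matched in results.items():
    print(pattern, "matches" if matched else "does not match", text)

# \s picks out whitespace, \S everything else.
sample = "hello world"  # made-up sample text
spaces = re.findall(r"\s", sample)
words = re.findall(r"\S+", sample)
print(spaces)  # [' ']
print(words)   # ['hello', 'world']
```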
Repeaters (*, + and { }):
*, + and { } are called repeaters, as they denote how many times the preceding character is to be repeated.
Asterisk symbol (*):
The asterisk tells the computer to match the preceding character (or set of characters) 0 or more times, with no upper limit.
Example:
Gre*n = Green (e is found 2 times), Grn (e is found 0 times), Greeeeen (e is found 5 times) and so on.
tre* ≠ trees, as there is an “s” following the “ee”.
Plus symbol (+):
The + sign matches when the character preceding it occurs one or more times, with no upper limit.
Example:
Gre+n = Green, Greeeen, Gren and so on.
Gre+n ≠ Grn, as “e” is absent here.
Dot (.):
The period matches any single character (letter, digit, or symbol) except a newline. Because it can take the place of any other character, it is called the wildcard.
Example:
Gre. = Gree, Gren, Gre1 and so on.
Gre. ≠ Green, as . by itself will only match a single character, here in the 4th position of the term. n is the 5th character and is not accounted for in the regex.
But Gre.* will match Green, as .* tells the regex to match any character any number of times.
Alternation (|):
Allows for alternate matches. | works like the Boolean OR.
Example:
A|B creates a regular expression that will match either A or B.
H(i!|ey!) will match either Hi! or Hey!
M(s|r|rs)\.?\s[A-Z]\w+ will match any name starting with Ms, Mr or Mrs (with or without a period after the title)
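The repeaters, the wildcard and alternation described above can be tried out with a few lines of Python. Treat this as a sketch: matches is a small helper written just for this example (not part of the re module), and the sentence with names is made up.

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True when the whole text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

# * allows zero or more of the preceding character.
print(matches(r"Gre*n", "Grn"))       # True
print(matches(r"Gre*n", "Greeeeen"))  # True
print(matches(r"tre*", "trees"))      # False

# + needs at least one.
print(matches(r"Gre+n", "Gren"))  # True
print(matches(r"Gre+n", "Grn"))   # False

# . stands for exactly one character; .* for any run of characters.
print(matches(r"Gre.", "Gre1"))    # True
print(matches(r"Gre.", "Green"))   # False
print(matches(r"Gre.*", "Green"))  # True

# | picks between the alternatives inside the group.
print(matches(r"H(i!|ey!)", "Hey!"))  # True

# finditer is used here because findall would return only the group contents.
title = re.compile(r"M(s|r|rs)\.?\s[A-Z]\w+")
found = [m.group() for m in title.finditer("Mr. Brown met Mrs Green and Ms. White")]
print(found)  # ['Mr. Brown', 'Mrs Green', 'Ms. White']
```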
Question mark (?):
Matches when the character preceding ? occurs 0 or 1 time only, making that character optional.
Example:
Favou?rite = Favourite (u is found 1 time)
Favou?rite = Favorite (u is found 0 times)
Character set ([]):
[] is used to indicate a set of characters. In a set, characters can be listed individually: [cat] will match 'c', 'a', or 't'. Ranges of characters can be indicated by giving two characters and separating them by a '-'.
Example:
[A-Z] will match any uppercase ASCII letter.
[0-9] will match any digit from 0 to 9.
[0-3][0-3] will match any two-digit string whose digits are both between 0 and 3, such as 00, 13 or 33 (but not 04 or 29).
[0-9A-Fa-f] will match any hexadecimal digit.
If '-' is escaped (e.g. [A\-Z]) or if it’s placed as the first or last character (e.g. [A-]), it will match a literal '-'.
[(+*)] will match any of the literal characters '(', '+', '*', or ')', because special characters lose their special meaning inside a set.
[()[\]{}] and []()[{}] will both match any parenthesis, square bracket, or brace. To include a literal ']' in a set, escape it or place it first.
Character group ():
A character group is indicated by () and matches the enclosed characters in exact order.
Example:
(abc) = abc, not acb
(123) = 123, not 321
https?://(www\.)?(\w+)(\.\w+) will match simple URLs of the form http(s)://[www.]name.tld. There are 3 groups here:
1st group: the optional www.
2nd group: the domain name (google, facebook, etc.)
3rd group: the top-level domain (.com, .net, .org)
There is also an implicit group, group 0, which is everything the regex matched (in our case, the entire URL)
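Character sets and groups from the two sections above can be sketched in Python like this. The sample strings, including the google URL inside a sentence, are made up for illustration; group(0) to group(3) follow the numbering just described.

```python
import re

# Character sets: each [] stands for exactly one character from the set.
hex_digit = re.compile(r"[0-9A-Fa-f]")
print(hex_digit.fullmatch("c") is not None)  # True
print(hex_digit.fullmatch("g") is not None)  # False

two_digits = re.compile(r"[0-3][0-3]")
print(two_digits.fullmatch("13") is not None)  # True
print(two_digits.fullmatch("29") is not None)  # False

# Inside a set, (, +, * and ) are literal characters.
symbols = re.findall(r"[(+*)]", "(a+b)*c")
print(symbols)  # ['(', '+', ')', '*']

# Groups: each pair of parentheses captures part of the match.
url = re.compile(r"https?://(www\.)?(\w+)(\.\w+)")
m = url.search("Visit https://www.google.com today")
print(m.group(0))  # https://www.google.com
print(m.group(1))  # www.
print(m.group(2))  # google
print(m.group(3))  # .com
```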
Quantifiers:
Regexes use quantifiers to indicate how many times the preceding character or group must occur. We can use multiple quantifiers in one pattern. The quantifiers are:
{n}:
Matches when the preceding character, or character group, occurs exactly n times.
Example:
\d{3} = 123
pand[ora]{2} = pandar, pandoo
pand[ora]{2} ≠ pandora (the quantifier {2} only allows for 2 letters from the character set [ora])
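Here is a short Python sketch of the optional ? from earlier and the exact-count {n} quantifier. As before, matches is a helper defined only for this example, and it checks that the whole text fits the pattern.

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True when the whole text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

# ? makes the preceding character optional.
print(matches(r"Favou?rite", "Favourite"))  # True
print(matches(r"Favou?rite", "Favorite"))   # True

# {n} demands exactly n repetitions of the preceding item.
print(matches(r"\d{3}", "123"))             # True
print(matches(r"\d{3}", "1234"))            # False
print(matches(r"pand[ora]{2}", "pandar"))   # True
print(matches(r"pand[ora]{2}", "pandora"))  # False
```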
{n,m}:
Matches when the preceding character, or character group, occurs at least n times and at most m times.
Example:
\d{2,6} = 430973, 4303, 38238
\d{2,6} ≠ 3
3 does not match because it is only 1 digit, which is outside the allowed range of 2 to 6.
Escaping Metacharacters:
To search for a character that is a reserved metacharacter (any of .[{()\^$|?*+), we can put a backslash \ in front of it so that it is treated as a literal character.
Example:
The regex below will match most common email addresses. Here we’ve used \ to escape reserved characters.
^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
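To round things off, here is a Python sketch of the {n,m} quantifier, of escaping, and of the email pattern above. The sample strings and the address jane.doe@example.com are made up; re.escape is a real function in Python's re module that adds the backslashes for you.

```python
import re

# {n,m}: between n and m repetitions.
print(re.fullmatch(r"\d{2,6}", "4303") is not None)  # True
print(re.fullmatch(r"\d{2,6}", "3") is not None)     # False

# An unescaped . matches any character; \. matches only a literal dot.
print(re.fullmatch(r"3.14", "3x14") is not None)   # True
print(re.fullmatch(r"3\.14", "3x14") is not None)  # False
print(re.fullmatch(r"3\.14", "3.14") is not None)  # True

# re.escape backslash-escapes the metacharacters in a plain string.
escaped = re.escape("1+1=2?")
print(re.search(escaped, "Is 1+1=2? Yes.") is not None)  # True

# The email pattern from the article, with plain ASCII hyphens in the ranges.
email = re.compile(r"^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$")
m = email.match("jane.doe@example.com")  # made-up address
print(m.groups())  # ('jane.doe', 'example', 'com')
print(email.match("jane.doe@example") is None)  # True
```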
Congratulations! Now you know the very basics of regex, and that’s already plenty for one day!
In my upcoming article we will practice regex with Python.