A Programmer's Guide to Regex or Regular Expressions

by Tasnuva Zaman, June 8th, 2020
Too Long; Didn't Read

A regular expression is an object that describes a pattern of characters. It lets us search for specific patterns of text and helps match, locate, and manage text. The building blocks of regular expressions are metacharacters, some of which are reserved and must be escaped to be matched literally. You can create a regular expression for almost any pattern of text you can think of. The number of \d tokens in a pattern determines the number of digits it matches. Every character in a regex is either a metacharacter with a special meaning or a regular character with a literal meaning.

Everybody talks about regular expressions, everybody hates regular expressions, and yet everybody ends up using them!

So what is a regular expression? To answer that, we need to go a little deeper. Let’s dive into the building blocks of regex, starting with a short intro.

Regular expression:

A regular expression (or rational expression) is an object that describes a pattern of characters. It allows us to search for specific patterns of text, and it also helps match, locate, and manage text. Regular expressions look pretty complicated, but they are very powerful: you can create a regex for almost any pattern of text you can think of.

Building blocks of regular expressions:

Metacharacters are the building blocks of regular expressions. Characters in regex are understood to be either a metacharacter with a special meaning or a regular character with a literal meaning.

Reserved metacharacters:

Metacharacters that are reserved and need to be escaped to be matched literally:

.[{()\^$|?*+

We will see an example of escaping later.

Common metacharacters are:

Caret (^):

(^) matches the start of the string, and in multiline mode also matches immediately after each newline.

example:

^\d{3} will match "456" in "456-112-112".

Dollar ($):

($) matches the end of the string or just before the newline at the end of the string, and in multiline mode also matches before a newline.

example:

\d{3}$ will match the last "112" in "456-112-112".
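The two anchors can be tried out with a short Python sketch. It assumes Python's standard re module; the second sample string (lines) is made up for illustration.

```python
import re

text = "456-112-112"

# ^ anchors the pattern to the start of the string.
start = re.search(r"^\d{3}", text)
# $ anchors the pattern to the end of the string.
end = re.search(r"\d{3}$", text)

print(start.group())  # 456
print(end.group())    # 112

# In multiline mode, ^ also matches right after each newline.
lines = "456-112\n789-000"  # assumed sample text
line_starts = re.findall(r"^\d{3}", lines, flags=re.MULTILINE)
print(line_starts)  # ['456', '789']
```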

\d:

\d matches a single digit (0–9). The number of \d tokens determines the number of digits our regex will match, i.e. \d means a single digit, \d\d means two digits, and so on.

example:

\d\d\d will match 327, 123 or 787 but not 1223 as a whole, as there are 4 digits in “1223” and our regex matches exactly 3 digits.

\d = 1

\d\d = 12

\d\d\d ≠ 473847
as the regex covers only 3 digits but 473847 contains 6 digits (a search inside the text would still find the first three, 473).

\d\d\d ≠ cat
because it will match only digits but cat contains letters.
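As a rough illustration in Python (assuming the standard re module), the difference between matching a whole string and finding the pattern somewhere inside it looks like this:

```python
import re

three_digits = re.compile(r"\d\d\d")

# fullmatch() succeeds only if the whole string fits the pattern.
results = {s: bool(three_digits.fullmatch(s)) for s in ["327", "123", "1223", "cat"]}
print(results)  # {'327': True, '123': True, '1223': False, 'cat': False}

# search() looks for the pattern anywhere, so it finds the first three digits.
first_three = three_digits.search("473847").group()
print(first_three)  # 473
```

The examples in this section read most naturally as whole-string matches, which is what fullmatch() checks.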

\D:

\D is the reverse of \d. It matches any character except a digit.

example:

\D\D = AB

\D\D = xy

\D\D ≠ 12
as it won’t match numeric characters.

\w:

Matches any alphanumeric (word) character: a letter, a digit or the underscore.

example:

\w\w\w = 467

\w\w\w\w = Crow

\w\w\w ≠ python

\w\w\w doesn’t match python as a whole because python contains 6 characters.

\W:

Just as \D is the reverse of \d, \W is the reverse of \w, i.e. it matches anything but alphanumeric characters and the underscore.

example:

\W\W = ,, or !! or @#

\W\W\W = !@#

\W\W\W\W ≠ Titanic2
as every character is alphanumeric.
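The \D, \w and \W examples can be checked with a small Python sketch. The helper fits is a hypothetical name introduced here for illustration; it simply wraps re.fullmatch from the standard re module.

```python
import re

def fits(pattern, text):
    """Hypothetical helper: True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

print(fits(r"\D\D", "AB"))        # True
print(fits(r"\D\D", "12"))        # False
print(fits(r"\w\w\w\w", "Crow"))  # True
print(fits(r"\w\w\w", "python"))  # False
print(fits(r"\W\W\W", "!@#"))     # True
print(fits(r"\w", "_"))           # True, the underscore counts as a word character
```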

\s:

Matches any whitespace character such as a space or a tab.

For example, in a text made of two words separated by a space, the regex \s will match only the space between the two words and ignore everything else.

\S:

Matches any non-whitespace character, the reverse of \s.
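A quick way to see both classes at work in Python is sketched below. The sample string is an assumption made up for this example.

```python
import re

sample = "a b\tc"  # assumed sample: three letters separated by a space and a tab

# \s picks out only the whitespace characters.
whitespace = re.findall(r"\s", sample)
print(whitespace)  # [' ', '\t']

# \S picks out everything that is not whitespace.
visible = re.findall(r"\S", sample)
print(visible)  # ['a', 'b', 'c']
```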

Repeaters (*, + and { }):

*, + and { } are called repeaters as they denote how many times the preceding character is to be repeated.

Asterisk symbol ( * ):

The asterisk matches when the character preceding * occurs 0 or more times, i.e. it tells the computer to match the preceding character (or set of characters) 0 or more times (with no upper limit).

example:

Gre*n = Green (e is found 2 times), Grn (e is found 0 times), Greeeeen (e is found 5 times) and so on.

tre* ≠ trees
as there is an “s” following the “ee”.

Plus symbol ( + ):

(+) matches when the character preceding + occurs one or more times (with no upper limit).

example:

Gre+n = Green, Greeeen, Gren and so on.

Gre+n ≠ Grn
as “e” is absent here.
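The difference between * and + shows up as soon as the repeated character is missing. Here is a minimal Python sketch, assuming the standard re module and whole-string matching:

```python
import re

words = ["Grn", "Gren", "Green", "Greeeeen"]

# * accepts zero or more "e", + needs at least one.
star = [w for w in words if re.fullmatch(r"Gre*n", w)]
plus = [w for w in words if re.fullmatch(r"Gre+n", w)]
print(star)  # ['Grn', 'Gren', 'Green', 'Greeeeen']
print(plus)  # ['Gren', 'Green', 'Greeeeen']

# tre* cannot cover the trailing "s", so the whole word is not a match,
# although a search still finds "tree" inside it.
whole = re.fullmatch(r"tre*", "trees")
inside = re.search(r"tre*", "trees").group()
print(whole)   # None
print(inside)  # tree
```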

Dot(.):

The period matches any single character, whether a letter, a digit or a symbol (by default, everything except a newline). Because it can take the place of any other character, it is called the wildcard.

example:

Gre. = Gree, Gren, Gre1 and so on.

Gre. ≠ Green
as . by itself will only match a single character, here in the 4th position of the term. n is the 5th character and is not accounted for in the regex.

But Gre.* will match Green, as it tells the regex to match any character any number of times.
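The wildcard behaviour can be sketched in Python as follows, assuming the standard re module. The last check relies on Python's default of not letting the dot match a newline.

```python
import re

checks = {
    "Gren": bool(re.fullmatch(r"Gre.", "Gren")),    # one extra character: match
    "Gre1": bool(re.fullmatch(r"Gre.", "Gre1")),    # a digit works too
    "Green": bool(re.fullmatch(r"Gre.", "Green")),  # two extra characters: no match
}
print(checks)  # {'Gren': True, 'Gre1': True, 'Green': False}

# Adding * lets the dot repeat, so Green now fits.
with_star = bool(re.fullmatch(r"Gre.*", "Green"))
print(with_star)  # True

# By default the dot does not match a newline character.
newline = bool(re.fullmatch(r"Gre.", "Gre\n"))
print(newline)  # False
```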

Alternation (|):

Allows for alternate matches. | works like the Boolean OR.

example:

A|B creates a regular expression that will match either A or B.

H(i!|ey!) will match either Hi! or Hey!

M(s|r|rs)\.?\s[A-Z]\w+ will match any name starting with Ms, Mr or Mrs.
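Both alternation patterns can be tried in Python. This is only a sketch: the guest list and the names in it are invented for the example.

```python
import re

greeting = re.compile(r"H(i!|ey!)")
greetings = [g for g in ["Hi!", "Hey!", "Ho!"] if greeting.fullmatch(g)]
print(greetings)  # ['Hi!', 'Hey!']

title = re.compile(r"M(s|r|rs)\.?\s[A-Z]\w+")
text = "Guests: Mr. Smith, Mrs Jones and Ms. Lee"  # made-up sample text
names = [m.group() for m in title.finditer(text)]
print(names)  # ['Mr. Smith', 'Mrs Jones', 'Ms. Lee']
```

Note that the alternatives are tried from left to right, and the engine falls back to rs for "Mrs Jones" after the shorter alternative r fails to lead to a full match.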

Question mark (?):

Matches when the character preceding ? occurs 0 or 1 time only, making the character match optional.

example:

Favou?rite = Favourite (u is found 1 time)

Favou?rite = Favorite (u is found 0 times)
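A short Python check of the optional character, assuming the standard re module; the third spelling is deliberately wrong to show that ? allows at most one occurrence.

```python
import re

pattern = re.compile(r"Favou?rite")
spellings = ["Favourite", "Favorite", "Favouurite"]

# ? makes the "u" optional: zero or one occurrence, never two.
accepted = [bool(pattern.fullmatch(s)) for s in spellings]
print(accepted)  # [True, True, False]
```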

Character set ([]):

[] is used to indicate a set of characters. In a set:

  1. Characters can be listed individually, e.g. [cat] will match 'c', 'a', or 't'.
  2. Ranges of characters can be indicated by giving two characters and separating them by a '-', for example:

    • [A-Z] will match any uppercase ASCII letter,
    • [0-9] will match any digit from 0 to 9,
    • [0-3][0-3] will match any two-digit string whose digits are both between 0 and 3, from 00 to 33 (so 13 matches but 04 does not),
    • [0-9A-Fa-f] will match any hexadecimal digit.
    • If - is escaped (e.g. [A\-Z]) or if it’s placed as the first or last character (e.g. [A-]), it will match a literal '-'.
  3. The order of the characters does not matter.
  4. Special characters lose their special meaning inside sets. For example, [(+*)] will match any of the literal characters '(', '+', '*', or ')'.
  5. To match a literal ']' inside a set, precede it with a backslash, or place it at the beginning of the set. For example, [()[\]{}] and []()[{}] will both match a right bracket, as well as a left bracket, braces and parentheses.
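The set rules above can be sketched in Python with the standard re module. All sample strings are assumptions chosen to keep the output short.

```python
import re

# A range inside a set: any hexadecimal digit.
hex_digits = re.findall(r"[0-9A-Fa-f]", "x1F-g9")
print(hex_digits)  # ['1', 'F', '9']

# Two sets in a row: each digit must be between 0 and 3.
two_digits = [s for s in ["00", "13", "33", "04"] if re.fullmatch(r"[0-3][0-3]", s)]
print(two_digits)  # ['00', '13', '33']

# Special characters are literal inside a set.
literals = re.findall(r"[(+*)]", "2*(3+4)")
print(literals)  # ['*', '(', '+', ')']

# A hyphen placed last in the set is a literal hyphen.
hyphen = re.findall(r"[A-]", "A-Z")
print(hyphen)  # ['A', '-']

# A closing bracket can be escaped or placed first in the set.
escaped = re.findall(r"[()[\]{}]", "a]b[")
leading = re.findall(r"[]()[{}]", "a]b[")
print(escaped, leading)  # [']', '['] [']', '[']
```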

Character group ():

A character group is indicated by () and matches the characters in the exact order given.

example:

(abc) = abc, not acb

(123) = 123, not 321

https?://(www\.)?(\w+)(\.\w+) will match simple URLs. There are 3 groups here.

1st group: the optional www.

2nd group: the domain name, e.g. google, facebook etc.

3rd group: the top-level domain, e.g. .com, .net, .org

There is another, implicit group: group 0. Group 0 is everything that we captured, in our case the entire URL.
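In Python the groups are available on the match object. The sketch below assumes the standard re module; the two sample URLs are illustrative only.

```python
import re

url_pattern = re.compile(r"https?://(www\.)?(\w+)(\.\w+)")

match = url_pattern.search("Visit https://www.google.com today")
print(match.group(0))  # https://www.google.com  (the whole match)
print(match.group(1))  # www.
print(match.group(2))  # google
print(match.group(3))  # .com

# When the optional first group is absent, it is reported as None.
bare = url_pattern.search("http://facebook.com")
print(bare.groups())  # (None, 'facebook', '.com')
```

The pattern is deliberately simple: it stops after the first dot-separated part that follows the domain name, so paths and longer host names are not captured.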

Quantifiers:

Regex uses quantifiers to specify how many times the preceding character or group must occur. We can use multiple quantifiers in one search pattern. The quantifiers are:

{n}:

Matches when the preceding character, or character group, occurs n times exactly.

example:

\d{3} = 123

pand[ora]{2} = pandar, pandoo

pand[ora]{2} ≠ pandora
as the quantifier {2} only allows for 2 letters from the character set [ora].

{n,m}:

Matches when the preceding character, or character group, occurs at least n times, and at most m times.

example:

\d{2,6} = 430973, 4303, 38238

\d{2,6} ≠ 3
3 does not match because it is only 1 digit, which is below the minimum of 2.
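Both quantifier forms can be verified with a few lines of Python, assuming the standard re module and whole-string matching:

```python
import re

# {n}: exactly n repetitions.
exact = bool(re.fullmatch(r"\d{3}", "123"))
print(exact)  # True

candidates = ["pandar", "pandoo", "pandora"]
two_letters = [w for w in candidates if re.fullmatch(r"pand[ora]{2}", w)]
print(two_letters)  # ['pandar', 'pandoo']

# {n,m}: between n and m repetitions, both inclusive.
numbers = ["430973", "4303", "38238", "3"]
in_range = [n for n in numbers if re.fullmatch(r"\d{2,6}", n)]
print(in_range)  # ['430973', '4303', '38238']
```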

Escaping Metacharacters:

To search for a character that is a reserved metacharacter (any of .[{()\^$|?*+), we can use the backslash \ to escape the character so it is treated as a literal.

Example:

The regex below will match most common email addresses. Here we’ve used \ to escape reserved characters.

^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
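Here is a hedged Python sketch of that pattern in use, written with plain hyphens in the 0-9 ranges. The addresses are invented for the example, and the pattern is a simple approximation rather than a complete validator.

```python
import re

email = re.compile(r"^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$")

# The escaped dot before the last group must be a literal "." in the address.
ok = email.match("user.name@example.com")  # made-up address
print(ok.groups())  # ('user.name', 'example', 'com')

# No "@" at all, so there is nothing to match.
missing_at = email.match("not-an-email")
print(missing_at)  # None

# The last part must be 2 to 5 letters long.
short_tld = email.match("a@b.c")
print(short_tld)  # None
```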

Congratulations! Now you know the very basics of regex, and that’s already plenty for one day!

In my upcoming article we will practice regex with Python.