Friday, March 7, 2014

Regular expression to parse a URL

To check the structure of a URL or extract its parts, the following regular expression can be used:
^(?<scheme>[a-zA-Z]*):\/\/
 (?:(?<login>[^:\/]*)(?::(?<password>[^\/]*))?@)?
 (?<host>[^\/:@]+)
 (?::(?<port>[0-9]+))?
 (?:\/?(?<path>.*?)(?<file>[^\/]*?))
 (?:\?(?<query>.*))?$

This expression has a limitation: it cannot correctly extract the password from a URL if the password contains a slash character.
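
To make the limitation concrete, here is a minimal Perl sketch that applies the expression to a URL whose password contains a slash. The sample URL and the %got hash are illustrative assumptions. Note that the match does not fail outright: the login is taken for the host, and the rest of the authority part ends up in the path.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative URL: the password "qwe/qwe" contains a slash.
my $url = 'ftp://d1:qwe/qwe@localhost/bkp1.tar';

my %got;
if ($url =~ /^(?<scheme>[a-zA-Z]*):\/\/
              (?:(?<login>[^:\/]*)(?::(?<password>[^\/]*))?@)?
              (?<host>[^\/:@]+)
              (?::(?<port>[0-9]+))?
              (?:\/?(?<path>.*?)(?<file>[^\/]*?))
              (?:\?(?<query>.*))?
              $/x)
{
  # Copy the named captures; groups that did not take part become "".
  for my $name (qw(scheme login password host port path file query)) {
    $got{$name} = defined($+{$name}) ? $+{$name} : "";
  }
}

# The password group cannot cross the slash, so the whole login:password@
# group is skipped and "d1" is reported as the host instead of "localhost".
print "$_ = $got{$_}\n" for qw(login password host path file);
```

Running it shows an empty login and password, which is why the last test in the script below reports a failure.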

Below is sample Perl code with tests.

#!/usr/bin/perl
use strict;
use warnings;

sub parseUrl
{
  my $url = shift;
  
  if ($url =~ /^(?<scheme>[a-zA-Z]*):\/\/
                (?:(?<login>[^:\/]*)(?::(?<password>[^\/]*))?@)?
                (?<host>[^\/:@]+)
                (?::(?<port>[0-9]+))?
                (?:\/?(?<path>.*?)(?<file>[^\/]*?))
                (?:\?(?<query>.*))?
                $/x)
  {
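    # Named groups are still numbered left to right, so $1..$8 map to
    # scheme, login, password, host, port, path, file, query.
    my %parts;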
    $parts{'scheme'} = (defined $1) ? $1 : "";
    $parts{'login'} = (defined $2) ? $2 : "";
    $parts{'password'} = (defined $3) ? $3 : "";
    $parts{'host'} = (defined $4) ? $4 : "";
    $parts{'port'} = (defined $5) ? $5 : "";
    $parts{'path'} = (defined $6) ? $6 : "";
    $parts{'file'} = (defined $7) ? $7 : "";
    $parts{'query'} = (defined $8) ? $8 : ""; 
    
    return \%parts;
  }

  # The URL did not match the expression.
  return;
}

sub test
{
  my ($url, $scheme, $login, $password, $host, $port, $path, $file, $query) = @_;
  
  print "-------------------\n$url\n";
  my $parts = parseUrl($url);
  
  if (!$parts) {
    print "[FAILED] Invalid URL\n";
    return;
  }
  
  if ($parts->{'scheme'}   eq $scheme &&
      $parts->{'login'}    eq $login &&
      $parts->{'password'} eq $password &&
      $parts->{'host'}     eq $host &&
      $parts->{'port'}     eq $port &&
      $parts->{'path'}     eq $path &&
      $parts->{'file'}     eq $file &&
      $parts->{'query'}    eq $query)
  {
    print "[OK]\n";
  } else {
    print "[FAILED] Parsed results are:\n";
    print "scheme   = $parts->{'scheme'}\n";
    print "login    = $parts->{'login'}\n";
    print "password = $parts->{'password'}\n";
    print "host     = $parts->{'host'}\n";
    print "port     = $parts->{'port'}\n";
    print "path     = $parts->{'path'}\n";
    print "file     = $parts->{'file'}\n";
    print "query    = $parts->{'query'}\n";
  }
}

sub main
{
  # Positive tests
  test("ftp://d1:qwe:qwe\@172.168.1.1/bkp1.tar",  "ftp", "d1", "qwe:qwe", "172.168.1.1", "", "", "bkp1.tar", "");
  test("ftp://d1:qwe:qwe\@172.168.1.1/dir:dir/subdir\@subdir/bkp1.tar",     "ftp", "d1", "qwe:qwe", "172.168.1.1", "", "dir:dir/subdir\@subdir/", "bkp1.tar", "");
  test("ftp://172.168.1.1/dirdir/subdir\@subdir/bkp1.tar",  "ftp", "", "", "172.168.1.1", "", "dirdir/subdir\@subdir/", "bkp1.tar", "");
  test("ftp://172.168.1.1/dirdir/subdir/bkp1.tar",          "ftp", "", "", "172.168.1.1", "", "dirdir/subdir/", "bkp1.tar", "");
  test("ftp://d1:qwe:qwe\@172.168.1.1:21/bkp1.tar",         "ftp", "d1", "qwe:qwe", "172.168.1.1", "21", "", "bkp1.tar", "");
  test("ftp://d1:qwe\@qwe\@172.168.1.1:21/bkp1.tar",         "ftp", "d1", "qwe\@qwe", "172.168.1.1", "21", "", "bkp1.tar", "");
  test("ftp://d1\@d1:qwe\@qwe\@172.168.1.1:21/bkp1.tar",     "ftp", "d1\@d1", "qwe\@qwe", "172.168.1.1", "21", "", "bkp1.tar", "");
  test("http://172.168.1.1/dir/index.php?var=value",         "http", "", "", "172.168.1.1", "", "dir/", "index.php", "var=value");
  
  # Invalid URL
  test("/var/www/mysite",  "", "", "", "", "", "", "", "");
  
  # Parsing error: the regular expression cannot correctly parse URLs whose password contains a slash
  test("ftp://d1:qwe/qwe\@localhost/bkp1.tar",    "ftp", "d1", "qwe/qwe", "localhost", "", "", "bkp1.tar", "");
}

main();
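
Because the groups in the expression are named, the parts can also be read from Perl's %+ hash instead of the positional variables $1 to $8. The following is a minimal sketch of that variant, assuming Perl 5.10 or later; parseUrlNamed is a hypothetical name and is not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical variant of parseUrl that reads the named groups from %+.
sub parseUrlNamed
{
  my $url = shift;

  # Return nothing (undef in scalar context) if the URL does not match.
  return unless $url =~ /^(?<scheme>[a-zA-Z]*):\/\/
                          (?:(?<login>[^:\/]*)(?::(?<password>[^\/]*))?@)?
                          (?<host>[^\/:@]+)
                          (?::(?<port>[0-9]+))?
                          (?:\/?(?<path>.*?)(?<file>[^\/]*?))
                          (?:\?(?<query>.*))?
                          $/x;

  my %parts;
  for my $name (qw(scheme login password host port path file query)) {
    # Groups that did not take part in the match are stored as "".
    $parts{$name} = defined($+{$name}) ? $+{$name} : "";
  }
  return \%parts;
}

my $parts = parseUrlNamed('http://172.168.1.1/dir/index.php?var=value');
print "$_ = $parts->{$_}\n" for qw(scheme host path file query);
```

Looking the parts up by name keeps the code correct if groups are later added to or removed from the expression, since the positional numbers would shift.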

You can use this expression in other programming languages as well. In the examples above it is written on several lines to make it easier to read. To use it as is, enable the "ignore pattern whitespace" option of the regex engine (the /x modifier in Perl), or rewrite the expression on a single line.
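
As a rough illustration of the single-line form, the Perl sketch below removes the whitespace from the multi-line expression and compiles the result without the /x modifier. The variable names and the sample URL are assumptions made for this example. Stripping every whitespace character is safe here only because the expression contains no literal spaces that need to be matched.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The expression as written in the post, one part per line.
my $multiLine = <<'END';
^(?<scheme>[a-zA-Z]*):\/\/
 (?:(?<login>[^:\/]*)(?::(?<password>[^\/]*))?@)?
 (?<host>[^\/:@]+)
 (?::(?<port>[0-9]+))?
 (?:\/?(?<path>.*?)(?<file>[^\/]*?))
 (?:\?(?<query>.*))?$
END

# Remove line breaks and indentation to get the single-line form.
(my $singleLine = $multiLine) =~ s/\s+//g;
print "$singleLine\n";

# Compile the single-line form without the /x modifier.
my $re = qr/$singleLine/;

if ('http://172.168.1.1/dir/index.php?var=value' =~ $re) {
  print "host = $+{host}, file = $+{file}, query = $+{query}\n";
}
```

The printed single-line string can then be pasted into another regex engine, provided that engine supports the (?<name>...) named group syntax.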
