Extract domain name

on 12.11.2004 17:02:57 by Shabam

How do you fetch just the domain name part of a variable in a script? The
variable can be "http://www.domain.com/blahblah/whatever/page.htm" or
"http://sub.domain.com/blahblah/whatever/page.htm".

What I need is to extract just the "domain.com".

Re: Extract domain name

on 12.11.2004 17:23:55 by Paul Lalli

[removed non-existent groups, removed off-topic AOL group, set followups
to c.l.p.m.]

"Shabam" wrote in message
news:3u-dnd1_9JRvQAncRVn-ig@adelphia.com...
> How do you fetch just the domain name part of a variable in a script? The
> variable can be "http://www.domain.com/blahblah/whatever/page.htm" or
> "http://sub.domain.com/blahblah/whatever/page.htm".
>
> What I need is to extract just the "domain.com".

Try using the Regexp::Common module from CPAN. I seem to recall it has
patterns for matching URIs.

Paul Lalli

Re: Extract domain name

on 12.11.2004 18:38:38 by Ryan Thompson

[ Cross-post trimmed ]

Shabam wrote:

> How do you fetch just the domain name part of a variable in a script?
> The variable can be "http://www.domain.com/blahblah/whatever/page.htm"
> or "http://sub.domain.com/blahblah/whatever/page.htm".
>
> What I need is to extract just the "domain.com".

This is definitely a non-trivial problem. Fortunately, it's been
partially solved already. I'm involved in the SpamAssassin and SURBL
projects, where this really became obvious when spammers started
obfuscating URIs, and using domains from many different TLDs where it
takes a lot of research to determine where to chop the hostname to get
the actual registrar domain.

There's much more to it than using a library or regexp.

See get_uri_list() in SpamAssassin 3's PerMsgStatus.pm for one
"industrial strength" solution to this problem, which still has room for
improvement.

- Ryan

--
Ryan Thompson

SaskNow Technologies - http://www.sasknow.com
901-1st Avenue North - Saskatoon, SK - S7K 1Y4

Tel: 306-664-3600 Fax: 306-244-7037 Saskatoon
Toll-Free: 877-727-5669 (877-SASKNOW) North America

Re: Extract domain name

on 12.11.2004 21:09:29 by Andrew Tkachenko

Look at the URI module. IMHO, it's a good and simple tool for parsing URLs.

use URI;
# host() omits any userinfo and port that authority() would include
(my $domain = URI->new("http://www.domain.com/blahblah/whatever/page.htm")->host) =~ s/^www\.//i;


Regards,
Andrew

Shabam wrote on 12 November 2004 16:02:

> How do you fetch just the domain name part of a variable in a script? The
> variable can be "http://www.domain.com/blahblah/whatever/page.htm" or
> "http://sub.domain.com/blahblah/whatever/page.htm".
>
> What I need is to extract just the "domain.com".

--
Andrew

Re: Extract domain name

on 12.11.2004 21:40:45 by Andrew Tkachenko

Sorry, I didn't pay attention to the subdomains in your example.
So, IMHO, it depends on your task: if you can enumerate the possible
TLD values, just split the domain name into parts and keep only the matched
TLD and the SLD.
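
A minimal sketch of this split-and-match idea, assuming the host name has already been pulled out of the URL. The suffix list and the name registered_domain() are made up for illustration; a real list would be far longer.

```perl
use strict;
use warnings;

# Hypothetical, deliberately tiny suffix list (an assumption for this sketch).
my %KNOWN_SUFFIX = map { $_ => 1 } qw(com net org co.uk co.jp);

# Return the known suffix plus the one label in front of it,
# or nothing when no known suffix matches.
sub registered_domain {
    my ($host) = @_;
    my @labels = split /\./, lc $host;
    # Try the longest candidate suffix first so "co.uk" is checked before "uk".
    for my $i (1 .. $#labels) {
        my $suffix = join '.', @labels[$i .. $#labels];
        return join '.', @labels[$i - 1 .. $#labels] if $KNOWN_SUFFIX{$suffix};
    }
    return;
}

for my $host (qw(www.domain.com sub.domain.com www.bbc.co.uk www.toyota.co.jp)) {
    my $domain = registered_domain($host) // '(unknown)';
    print "$host -> $domain\n";
}
```

With this list, both example hosts from the question come out as "domain.com", and "www.bbc.co.uk" comes out as "bbc.co.uk". A host whose suffix is not in the list yields "(unknown)" rather than a guess.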

Regards,
Andrew

Ryan Thompson wrote on 12 November 2004 17:38:

> [ Cross-post trimmed ]
>
> Shabam wrote:
>
>> How do you fetch just the domain name part of a variable in a script?
>> The variable can be "http://www.domain.com/blahblah/whatever/page.htm"
>> or "http://sub.domain.com/blahblah/whatever/page.htm".
>>
>> What I need is to extract just the "domain.com".
>
> This is definitely a non-trivial problem. Fortunately, it's been
> partially solved already. I'm involved in the SpamAssassin and SURBL
> projects, where this really became obvious when spammers started
> obfuscating URIs, and using domains from many different TLDs where it
> takes a lot of research to determine where to chop the hostname to get
> the actual registrar domain.
>
> There's much more to it than using a library or regexp.
>
> See get_uri_list() in SpamAssassin 3's PerMsgStatus.pm for one
> "industrial strength" solution to this problem, which still has room for
> improvement.
>
> - Ryan
>

--
Andrew

Re: Extract domain name

on 14.11.2004 09:22:05 by Joe Smith

Shabam wrote:

> How do you fetch just the domain name part of a variable in a script? The
> variable can be "http://www.domain.com/blahblah/whatever/page.htm" or
> "http://sub.domain.com/blahblah/whatever/page.htm".
>
> What I need is to extract just the "domain.com".

The problem is not well defined.

For "http://www.tacp.toshiba.com/" do you want "tacp.toshiba.com" or just
"toshiba.com"? For "http://story.news.yahoo.com", is "news" included or not?
You can't just use the last two components in all cases, such as
"http://www.toyota.co.jp" or "http://www.bbc.co.uk".

-Joe
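
To make the "last two components" pitfall concrete, here is a small sketch that applies that naive rule to the hosts mentioned above. The helper name last_two_labels() is invented for the example.

```perl
use strict;
use warnings;

# Naive approach: keep only the last two dot-separated labels of a host name.
sub last_two_labels {
    my ($host) = @_;
    my @labels = split /\./, $host;
    return $host if @labels <= 2;
    return join '.', @labels[-2, -1];
}

for my $host (qw(www.tacp.toshiba.com story.news.yahoo.com www.toyota.co.jp www.bbc.co.uk)) {
    printf "%-22s -> %s\n", $host, last_two_labels($host);
}
```

The first two hosts reduce to "toshiba.com" and "yahoo.com", but the last two reduce to "co.jp" and "co.uk", which are not the domains anyone registered.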

Re: Extract domain name

on 14.11.2004 12:12:24 by Shabam

> The problem is not well defined.
>
> For "http://www.tacp.toshiba.com/" do you want "tacp.toshiba.com" or just
> "toshiba.com"? For "http://story.news.yahoo.com", is "news" included or
> not?
> You can't just use the last two components in all cases, such as
> "http://www.toyota.co.jp" or "http://www.bbc.co.uk".

What I would need is just the domain name part. In this case it would be
"toshiba.com" only. No subdomains. My domains will be simple
(com/net/org), so complicated situations like "toyota.co.jp" wouldn't apply.

Re: Extract domain name

on 18.11.2004 08:40:27 by Sam

Shabam wrote:

>> The problem is not well defined.
>>
>> For "http://www.tacp.toshiba.com/" do you want "tacp.toshiba.com" or just
>> "toshiba.com"? For "http://story.news.yahoo.com", is "news" included or
>> not?
>> You can't just use the last two components in all cases, such as
>> "http://www.toyota.co.jp" or "http://www.bbc.co.uk".
>
> What I would need is just the domain name part. In this case it would be
> "toshiba.com" only. No subdomains. My domains will be simple
> (com/net/org), so complicated situations like "toyota.co.jp" wouldn't apply.

I'm not an expert, but the following regex should work:

my $url = "http://www.abc.xyz.toy-0-ota.com";
# The TLD alternatives are grouped so that net and org also need a name and a dot in front.
my ($domain) = ($url =~ m{^https?://(?:[^/]*\.)?([0-9a-zA-Z-]+\.(?:com|net|org))(?=[:/]|\z)});
print "$domain\n";
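
For the restricted com/net/org case, the same idea can be wrapped in a small sub and tried on several inputs. This is only a sketch under that restriction: the name simple_domain() is made up, and URLs with userinfo or non-HTTP schemes are simply not matched.

```perl
use strict;
use warnings;

# Return "name.tld" for http(s) URLs whose host ends in .com, .net or .org;
# return nothing for anything else.
sub simple_domain {
    my ($url) = @_;
    return lc $1 if $url =~ m{
        ^https?://                      # scheme
        (?:[0-9a-z-]+\.)*               # any number of subdomain labels
        ([0-9a-z-]+\.(?:com|net|org))   # name plus one of the allowed TLDs
        (?::\d+)?                       # optional port
        (?:[/?\#]|\z)                   # the host part must end here
    }xi;
    return;
}

for my $url (
    "http://www.domain.com/blahblah/whatever/page.htm",
    "http://sub.domain.com/blahblah/whatever/page.htm",
    "http://www.abc.xyz.toy-0-ota.com",
    "http://www.bbc.co.uk",
) {
    my $domain = simple_domain($url) // '(no match)';
    print "$url -> $domain\n";
}
```

The final check on the end of the host part keeps a host such as "www.company.community.example" from being cut down to "company.com".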

Sam
