Extract domain name
on 12.11.2004 17:02:57 by Shabam
How do you fetch just the domain name part of a variable in a script? The
variable can be "http://www.domain.com/blahblah/whatever/page.htm" or
"http://sub.domain.com/blahblah/whatever/page.htm".
What I need is to extract just the "domain.com".
Re: Extract domain name
on 12.11.2004 17:23:55 by Paul Lalli
[removed non-existent groups, removed off-topic AOL group, set followups
to c.l.p.m.]
"Shabam" wrote in message
news:3u-dnd1_9JRvQAncRVn-ig@adelphia.com...
> How do you fetch just the domain name part of a variable in a script?
> The variable can be "http://www.domain.com/blahblah/whatever/page.htm" or
> "http://sub.domain.com/blahblah/whatever/page.htm".
>
> What I need is to extract just the "domain.com".
Try using the Regexp::Common module from CPAN. I seem to recall it has
patterns for matching URIs.
Paul Lalli
Re: Extract domain name
on 12.11.2004 18:38:38 by Ryan Thompson
[ Cross-post trimmed ]
Shabam wrote:
> How do you fetch just the domain name part of a variable in a script?
> The variable can be "http://www.domain.com/blahblah/whatever/page.htm"
> or "http://sub.domain.com/blahblah/whatever/page.htm".
>
> What I need is to extract just the "domain.com".
This is definitely a non-trivial problem. Fortunately, it's been
partially solved already. I'm involved in the SpamAssassin and SURBL
projects, where this really became obvious when spammers started
obfuscating URIs, and using domains from many different TLDs where it
takes a lot of research to determine where to chop the hostname to get
the actual registrar domain.
There's much more to it than using a library or regexp.
See get_uri_list() in SpamAssassin 3's PerMsgStatus.pm for one
"industrial strength" solution to this problem, which still has room for
improvement.
- Ryan
--
Ryan Thompson
SaskNow Technologies - http://www.sasknow.com
901-1st Avenue North - Saskatoon, SK - S7K 1Y4
Tel: 306-664-3600 Fax: 306-244-7037 Saskatoon
Toll-Free: 877-727-5669 (877-SASKNOW) North America
Re: Extract domain name
on 12.11.2004 21:09:29 by Andrew Tkachenko
Look at the URI module. IMHO, it's a good and simple tool for parsing URLs:
use URI;
# host() omits any port or userinfo that authority() would include
(my $domain = URI->new("http://www.domain.com/blahblah/whatever/page.htm")->host) =~ s/^www\.//i;
Regards,
Andrew
Shabam wrote on 12 November 2004 16:02:
> How do you fetch just the domain name part of a variable in a script? The
> variable can be "http://www.domain.com/blahblah/whatever/page.htm" or
> "http://sub.domain.com/blahblah/whatever/page.htm".
>
> What I need is to extract just the "domain.com".
--
Andrew
Re: Extract domain name
on 12.11.2004 21:40:45 by Andrew Tkachenko
Sorry, I didn't pay attention to the subdomains in your example.
So, IMHO, it depends on your task: if you can enumerate the possible
TLD values in advance, just split the domain name into parts and keep
only the matched TLD and SLD.
Regards,
Andrew
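The split-and-keep approach described above can be sketched roughly as follows. This is a minimal illustration, not anyone's posted code: the helper name sld_and_tld and the three-entry TLD list are assumptions, and the host is pulled out with a simple pattern that expects a plain http:// or https:// URL without userinfo.

```perl
use strict;
use warnings;

# Assumed list of acceptable TLDs; extend it to suit the task.
my %known_tld = map { $_ => 1 } qw(com net org);

# Hypothetical helper: return "sld.tld" for a URL, or nothing if the
# URL does not parse or its TLD is not in the list above.
sub sld_and_tld {
    my ($url) = @_;

    # Take the host part, stopping at any port, path, query or fragment.
    my ($host) = $url =~ m{^https?://([^/:?#]+)}i
        or return;

    my @labels = split /\./, lc $host;
    return unless @labels >= 2 && $known_tld{ $labels[-1] };

    # Keep only the last two labels: second-level domain plus TLD.
    return join '.', @labels[-2, -1];
}

print sld_and_tld($_) // '(no match)', "\n"
    for 'http://www.domain.com/blahblah/whatever/page.htm',
        'http://sub.domain.com/blahblah/whatever/page.htm';
```

Both sample URLs from the original question yield "domain.com". Anything outside the assumed TLD list is rejected rather than guessed at.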
Ryan Thompson wrote on 12 November 2004 17:38:
> [ Cross-post trimmed ]
>
> Shabam wrote:
>
>> How do you fetch just the domain name part of a variable in a script?
>> The variable can be "http://www.domain.com/blahblah/whatever/page.htm"
>> or "http://sub.domain.com/blahblah/whatever/page.htm".
>>
>> What I need is to extract just the "domain.com".
>
> This is definitely a non-trivial problem. Fortunately, it's been
> partially solved already. I'm involved in the SpamAssassin and SURBL
> projects, where this really became obvious when spammers started
> obfuscating URIs, and using domains from many different TLDs where it
> takes a lot of research to determine where to chop the hostname to get
> the actual registrar domain.
>
> There's much more to it than using a library or regexp.
>
> See get_uri_list() in SpamAssassin 3's PerMsgStatus.pm for one
> "industrial strength" solution to this problem, which still has room for
> improvement.
>
> - Ryan
>
--
Andrew
Re: Extract domain name
on 14.11.2004 09:22:05 by Joe Smith
Shabam wrote:
> How do you fetch just the domain name part of a variable in a script? The
> variable can be "http://www.domain.com/blahblah/whatever/page.htm" or
> "http://sub.domain.com/blahblah/whatever/page.htm".
>
> What I need is to extract just the "domain.com".
The problem is not well defined.
For "http://www.tacp.toshiba.com/" do you want "tacp.toshiba.com" or just
"toshiba.com"? For "http://story.news.yahoo.com", is "news" included or not?
You can't just use the last two components in all cases, such as
"http://www.toyota.co.jp" or "http://www.bbc.co.uk".
-Joe
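To make the "last two components" problem concrete, here is a rough sketch that consults a table of suffixes before deciding where to cut. The helper name registered_domain and the five-entry suffix table are assumptions for illustration only; a real table would be much longer and would need maintaining. Note that it cannot answer the "tacp.toshiba.com or toshiba.com" question: it always keeps exactly one label to the left of the suffix.

```perl
use strict;
use warnings;

# Illustrative suffix table only; real-world coverage needs far more entries.
my %suffix = map { $_ => 1 } qw(com net org co.uk co.jp);

# Hypothetical helper: given a host name, return the suffix plus one
# label to its left, or nothing if no known suffix matches.
sub registered_domain {
    my ($host) = @_;
    my @labels = split /\./, lc $host;

    # Try the longest candidate suffix first, so "co.uk" is found
    # before a shorter suffix such as "uk" could be considered.
    for my $n (reverse 1 .. $#labels) {
        my $candidate = join '.', @labels[-$n .. -1];
        next unless $suffix{$candidate};
        return join '.', @labels[-($n + 1) .. -1];
    }
    return;
}

printf "%-22s -> %s\n", $_, registered_domain($_) // '(unknown suffix)'
    for qw(www.tacp.toshiba.com story.news.yahoo.com
           www.toyota.co.jp www.bbc.co.uk);
```

With this table the four hosts map to toshiba.com, yahoo.com, toyota.co.jp and bbc.co.uk respectively.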
Re: Extract domain name
on 14.11.2004 12:12:24 by Shabam
> The problem is not well defined.
>
> For "http://www.tacp.toshiba.com/" do you want "tacp.toshiba.com" or just
> "toshiba.com"? For "http://story.news.yahoo.com", is "news" included or not?
> You can't just use the last two components in all cases, such as
> "http://www.toyota.co.jp" or "http://www.bbc.co.uk".
What I would need is just the domain name part. In this case it would be
"toshiba.com" only. No subdomains. My domains will be simple
(com/net/org), so complicated situations like "toyota.co.jp" wouldn't apply.
Re: Extract domain name
on 18.11.2004 08:40:27 by Sam
Shabam wrote:
>> The problem is not well defined.
>>
>> For "http://www.tacp.toshiba.com/" do you want "tacp.toshiba.com" or just
>> "toshiba.com"? For "http://story.news.yahoo.com", is "news" included or not?
>> You can't just use the last two components in all cases, such as
>> "http://www.toyota.co.jp" or "http://www.bbc.co.uk".
>
> What I would need is just the domain name part. In this case it would be
> "toshiba.com" only. No subdomains. My domains will be simple
> (com/net/org), so complicated situations like "toyota.co.jp" wouldn't apply.
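I'm not an expert, but the following regex should work:
$url = "http://www.abc.xyz.toy-0-ota.com";
# Group the TLD alternatives so "net" and "org" also need a label in front,
# and stop at the end of the host so the path is never searched.
($domain) = ($url =~ m{^http://(?:[^/]*\.)?([0-9a-zA-Z\-]+\.(?:com|net|org))(?:[:/?#]|\z)});
print $domain . "\n";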
Sam
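One detail worth checking in any pattern of this shape is how the alternation groups, because "|" binds more loosely than concatenation. The sketch below compares an ungrouped and a grouped TLD alternation on a made-up URL (www.example.net is purely an assumption for the demonstration).

```perl
use strict;
use warnings;

my $url = 'http://www.example.net/index.html';   # hypothetical test URL

# Ungrouped: the three alternatives are "label.com", "net" and "org",
# so a bare "net" is enough to satisfy the capture group.
my ($loose) = $url =~ m{http://.*\.([0-9a-zA-Z\-]+\.com|net|org)};

# Grouped: the alternation covers only the TLD, so a label is always required.
my ($tight) = $url =~ m{http://.*\.([0-9a-zA-Z\-]+\.(?:com|net|org))};

print "ungrouped: $loose\n";
print "grouped:   $tight\n";
```

The ungrouped pattern captures just "net", while the grouped one captures "example.net".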