don't know where to start??? comparing files

on 12.10.2011 16:01:15 by Natalie Conte

Hi all,
I have two sets of files that I want to compare, and I don't know where to
start to get what I want :(
I have a reference file (ref.txt, see the example below) with a chromosome
name, a start position and an end position:
Chr7 115249090 115859515
Chr8 25255496 29565459
Chr13 198276698 298299815
ChrX 109100951 109130998


and I have a file (file_test) that I want to check against this
reference ref.txt:
Chr1 115249098
Chr1 1362705
Chr8 25255996
Chr8 1362714
Chr1 1362735
ChrX 109100997

So if a position in file_test falls within the range given for that
chromosome in the reference file, it is kept in a new file; if not, it is
discarded.

I am looking for advice / modules I could use to compare those two files.
Many thanks in advance for any tips,
Nat


--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

--
To unsubscribe, e-mail: beginners-unsubscribe@perl.org
For additional commands, e-mail: beginners-help@perl.org
http://learn.perl.org/

Re: don't know where to start??? comparing files

on 12.10.2011 16:32:16 by Shlomi Fish

Hi Nathalie,

On Wed, 12 Oct 2011 15:01:15 +0100
Nathalie Conte wrote:

> So if the position on the file_test is found in ref_file it is kept in a
> new file, if not discarded.

What I would do is build an array of the [start, end] ranges from the
reference file, merge any overlapping ranges, and sort the result, giving a
sorted array of ([$start1, $end1], [$start2, $end2], ...) ranges.

Then I would look up each position in that array using a binary search:

* http://search.cpan.org/dist/Search-Binary/

* http://search.cpan.org/~stevan/Tree-Binary-0.07/lib/Tree/Binary/Search.pm#OTHER_TREE_MODULES
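
The approach described above can be sketched roughly as follows. This is a hand-rolled binary search, not the API of either module linked above. The names merge_ranges and in_ranges are made up for illustration, keeping one range list per chromosome is an assumption, and the second Chr8 range is hypothetical data added only to exercise the merge step.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sort [start, end] ranges by start and merge any that overlap.
# Returns a reference to the sorted, merged list.
sub merge_ranges {
    my @sorted = sort { $a->[0] <=> $b->[0] } @_;
    my @merged;
    for my $range (@sorted) {
        if ( @merged && $range->[0] <= $merged[-1][1] ) {
            # Overlaps the previous range: extend its end if needed.
            $merged[-1][1] = $range->[1] if $range->[1] > $merged[-1][1];
        }
        else {
            push @merged, [@$range];    # copy, so callers' data is untouched
        }
    }
    return \@merged;
}

# Binary search over sorted, non-overlapping ranges.
# Returns 1 if $pos lies inside one of them (bounds inclusive), else 0.
sub in_ranges {
    my ( $ranges, $pos ) = @_;
    my ( $lo, $hi ) = ( 0, $#{$ranges} );
    while ( $lo <= $hi ) {
        my $mid = int( ( $lo + $hi ) / 2 );
        if    ( $pos < $ranges->[$mid][0] ) { $hi = $mid - 1 }
        elsif ( $pos > $ranges->[$mid][1] ) { $lo = $mid + 1 }
        else                                { return 1 }
    }
    return 0;
}

# Assumption: one merged list per chromosome. The second Chr8 range is
# hypothetical and is not part of the posted reference data.
my %ranges_for = (
    Chr7 => merge_ranges( [ 115249090, 115859515 ] ),
    Chr8 => merge_ranges( [ 25255496, 29565459 ], [ 29000000, 30000000 ] ),
    ChrX => merge_ranges( [ 109100951, 109130998 ] ),
);

for my $test ( [ Chr8 => 25255996 ], [ Chr8 => 1362714 ], [ Chr1 => 1362705 ] ) {
    my ( $chr, $pos ) = @$test;
    my $hit = exists $ranges_for{$chr} && in_ranges( $ranges_for{$chr}, $pos );
    print "$chr $pos ", ( $hit ? 'kept' : 'discarded' ), "\n";
}
```

With a single range per chromosome, as in the posted sample, a plain comparison is enough; the binary search only starts to pay off when a chromosome has many ranges.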

Regards,

Shlomi Fish

--
-----------------------------------------------------------------
Shlomi Fish http://www.shlomifish.org/
Stop Using MSIE - http://www.shlomifish.org/no-ie/

Larry Wall is lazy, impatient and full of hubris.

Please reply to list if it's a mailing list post - http://shlom.in/reply .

Re: don't know where to start??? comparing files

on 12.10.2011 17:17:52 by Shawn H Corey

On 11-10-12 10:01 AM, Nathalie Conte wrote:
> and I have a file (file_test) file I want to parse against this
> reference ref.txt
> Chr1 115249098 Chr1 1362705 Chr8 25255996 Chr8 1362714 Chr1 1362735 ChrX
> 109100997
> So if the position on the file_test is found in ref_file it is kept in a
> new file, if not discarded.

Try:

#!/usr/bin/env perl

use strict;
use warnings;

# file names; change as needed
my $ref_file  = 'ref.txt';
my $data_file = 'test.txt';

# a hash to hold the start and end positions from the ref file
my %ref = ();

# main
load_ref();
scan();

# load the ref file into %ref
sub load_ref {

    open my $ref_fh, '<', $ref_file or die "could not open $ref_file: $!\n";

    while( my $line = <$ref_fh> ){

        # extract the items from the line
        my ( $id, $start, $end ) = split ' ', $line;

        # skip blank or incomplete lines
        next unless defined $end;

        # store as HoH; a later line with the same ID replaces an earlier one
        $ref{$id} = {
            start => $start,
            end   => $end,
        };
    }

    close $ref_fh;
}

# print every ID/number pair in the data file that falls inside its ref range
sub scan {

    open my $data_fh, '<', $data_file or die "could not open $data_file: $!\n";

    while( my $line = <$data_fh> ){

        # extract each pair of IDs and numbers; a line may hold several pairs
        while( $line =~ m{ (\S+) \s+ (\S+) }gmsx ){
            my $id     = $1;
            my $number = $2;

            # see if the number is between the start and end
            if( exists $ref{$id}
                && $ref{$id}{start} <= $number
                && $number <= $ref{$id}{end}
            ){
                printf "%-7s %15s\n", $id, $number;
            }
        }
    }
    close $data_fh;
}

__END__

--
Just my 0.00000002 million dollars worth,
Shawn

Confusion is the first step of understanding.

Programming is as much about organization and communication
as it is about coding.

The secret to great software: Fail early & often.

Eliminate software piracy: use only FLOSS.

"Make something worthwhile." -- Dear Hunter

Re: don't know where to start??? comparing files

on 13.10.2011 04:07:58 by merlyn

>>>>> "Shawn" == Shawn H Corey writes:

Shawn> #!/usr/bin/env perl

Please. Don't.

--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095

Smalltalk/Perl/Unix consulting, Technical writing, Comedy, etc. etc.
See http://methodsandmessages.posterous.com/ for Smalltalk discussion

Re: don't know where to start??? comparing files

on 13.10.2011 13:34:19 by dermot

On 13 October 2011 03:07, Randal L. Schwartz wrote:
>>>>>> "Shawn" == Shawn H Corey writes:
>
> Shawn> #!/usr/bin/env perl
>
> Please. Don't.

This is quite relevant for me at the moment. I have a couple of
projects where I will not be using the system perl, and I was under
the impression that using `env perl` was the preferred method. So if
you are using perlbrew, local::lib or just building a perl in some exotic
directory, are you suggesting we give the path to the perl we want to
use? What's the reasoning?
Thanks,
Dermot

Re: don't know where to start??? comparing files

on 13.10.2011 15:43:32 by Igor Dovgiy

Maybe this will help? :)


#!/usr/bin/perl
use strict;
use warnings;
die 'Usage: ' . __FILE__ . " file1[ file2...]\n" unless @ARGV;

my $ref_file = 'ref.txt';
my $new_file = 'new.txt';

# read the reference ranges: one { start, end } pair per chromosome
open my $ref_fh, '<', $ref_file
    or die "Failed to open reference file - $!\n";
my %limits_for;
while (<$ref_fh>) {
    next unless /\d/;    # skipping infoless lines
    my ($chromosome, $start, $end) = split;
    $limits_for{ $chromosome } = {
        start => $start,
        end   => $end,
    };
}
close $ref_fh;

# collect the positions from the files named on the command line
my %positions_for;
while (<>) {
    my ($chromosome, $pos) = split;
    next unless defined $pos;    # skip blank or incomplete lines
    push @{ $positions_for{ $chromosome } }, $pos;
}

# keep only the positions that fall inside the chromosome's range
my %in_limits_for;
foreach my $chromosome (keys %positions_for) {
    next unless exists $limits_for{ $chromosome };
    my @in_limits = grep {
        $limits_for{ $chromosome }->{start} <= $_
            &&
        $_ <= $limits_for{ $chromosome }->{end}
    } @{ $positions_for{ $chromosome } };
    $in_limits_for{ $chromosome } = \@in_limits;
}

# write the kept positions, grouped by chromosome
open my $new_fh, '>', $new_file
    or die "Failed to write out results - $!\n";
foreach my $chromosome (keys %in_limits_for) {
    foreach my $pos ( @{ $in_limits_for{ $chromosome } } ) {
        printf $new_fh
            "%-7s %15s\n", $chromosome, $pos;
    }
    print $new_fh '=' x 80 . "\n";
}
close $new_fh;

-- iD


Re: don't know where to start??? comparing files

on 14.10.2011 08:08:46 by merlyn

>>>>> "Dermot" == Dermot writes:

Dermot> On 13 October 2011 03:07, Randal L. Schwartz wrote:
>>>>>>> "Shawn" == Shawn H Corey writes:
>>
Shawn> #!/usr/bin/env perl
>>
>> Please. Don't.

Dermot> This is quite relevant for me at the moment. I have a couple of
Dermot> projects where I will not be using the system perl, and I was under
Dermot> the impression that using `env perl` was the preferred method. So if
Dermot> you are using perlbrew, local::lib or just building a perl in some exotic
Dermot> directory, are you suggesting we give the path to the perl we want to
Dermot> use? What's the reasoning?

Because this uses *my* environment when I run *your* Perl script.
That's broken.

Hardcode the path. Or install it using any of the module tools, which
will replace #!perl with the proper hardcoded local Perl path.
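
To see which interpreter a given shebang line actually picks, a script can report it itself. The following is only a minimal sketch: `$^X`, `$^V` and `$Config{perlpath}` are standard Perl, while the helper name interpreter_report is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Config;

# Hypothetical helper: describe the perl that is executing this script.
# $^X is the name the running interpreter was started as, and
# $Config{perlpath} is the path recorded when that perl was built.
sub interpreter_report {
    return sprintf 'running under %s (version %vd); configured path is %s',
        $^X, $^V, $Config{perlpath};
}

print interpreter_report(), "\n";
```

With a hardcoded first line such as the one above, the reported interpreter should stay the same for every user. If the first line is changed to `#!/usr/bin/env perl`, the report follows the PATH of whoever runs the script.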

Re: don't know where to start??? comparing files

on 14.10.2011 15:50:33 by Shawn H Corey

On 11-10-14 02:08 AM, Randal L. Schwartz wrote:
> Because this uses *my* environment when I run *your* Perl script.
> That's broken.

Then you should un-break your environment. I can help you if you're
using Linux. If you're using Windows, I'm sure there are many on the
list who can help.


Deployment Issues (Was: don't know where to start??? comparing files)

on 14.10.2011 17:05:45 by RWeidner

> Because this uses *my* environment when I run *your* Perl script.
> That's broken.

> Hardcode the path. Or install it using any of the module tools, which
> will replace #!perl with the proper hardcoded local Perl path.

I also view this as a deployment problem which has probably been solved
many times by now. The other similar but not identical deployment problem
revolves around @INC. Both of these deployment challenges are exacerbated
by lack of root privileges on Linux boxes. Can someone recommend any
particularly good resources or tools for addressing these issues?

--
Ronald Weidner
