parsing file need help please

am 08.06.2011 16:48:22 von Natalie Conte

HI,
Me again with another parsing script :)
this time, I have this file format, see below, I want to assess if the
genotyping data ( 2 letters at the beginning) are the same in column 1
or 2, NN should be ignored (as no results provided).
I should count as well the number of heterozygous ( when the 2 letters
are different) and homozygous ( when the 2 letters are identical).
calculate homo_same (for homo: same 2 identical letters for both column,
ex line 8) and homo_diff(for homo: different 2 identical letter for both
column, ex line 4).
calculate het_same( ex line 1)and het_diff (ex line3).

AG;1.000 \t AG;1.000\n
NN;0.775 NN;0.805
AC;0.999 TC;0.998
AA;0.998 TT;0.998
GG;0.997 GG;0.997
AG;1.000 AG;1.000
AG;0.979 NN;0.661
GG;0.996 GG;0.989

I have managed to calculate the proportion of each genotype for both
column, but I am struggling to find a way to calculate homo_same and
homo_diff, het_same and het_ diff

this is my script which should work with the example file above:
#!/software/bin/perl
use warnings;
use strict;

my
$file=".txt";

my @hets=qw{AT AC AG TA TC TG GA GC GT CA CT CG};
my @homos=qw{AA TT CC GG};

#my ($homo_count_same,$homo_count_diff);
#my ($het_count_same,$het_count_diff);
my %geno;

my $line;

open( my $FH , '<' , $file ) or die( $! );
open(OUT, ">>Test.txt");

while( <$FH> ) {
my @Snps = split /\t/;
foreach my $Snp (@Snps) {
$line++;
if ($Snp =~/^NN/)
{ next; }
foreach my $het (@hets){ #test if $snp is heterozygous
if ($Snp=~/$het/){
print OUT
$line,"\t",$Snp,"\t","heterozygous","\t",$het,"\n"; #print the genotype het

$geno{$het}++;#count het in hashe
}
}
foreach my $hom (@homos){#test if $snp is homo
if ($Snp=~/$hom/){

print OUT $line,"\t",$Snp,"\t","homozygous","\t",$hom,"\n";
#print the genotype homo
$geno{$hom}++;#count homo in hashe
}

}
}
#I guess I need to calculate my $homo_same and Homo_diff from here, but
I am not sure how

}

foreach my $gen (keys %geno){
print "$gen is present ",$geno{$gen}," times","\n";
}

many thanks for any suggestions,
Nat

--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

--
To unsubscribe, e-mail: beginners-unsubscribe@perl.org
For additional commands, e-mail: beginners-help@perl.org
http://learn.perl.org/

Re: parsing file need help please

am 08.06.2011 18:48:01 von Jim Gibson

On 6/8/11 Wed Jun 8, 2011 7:48 AM, "Nathalie Conte"
scribbled:

> HI,
> Me again with another parsing script :)
> this time, I have this file format, see below, I want to assess if the
> genotyping data ( 2 letters at the beginning) are the same in column 1
> or 2, NN should be ignored (as no results provided).
> I should count as well the number of heterozygous ( when the 2 letters
> are different) and homozygous ( when the 2 letters are identical).
> calculate homo_same (for homo: same 2 identical letters for both column,
> ex line 8) and homo_diff(for homo: different 2 identical letter for both
> column, ex line 4).
> calculate het_same( ex line 1)and het_diff (ex line3).
>
> AG;1.000 \t AG;1.000\n
> NN;0.775 NN;0.805
> AC;0.999 TC;0.998
> AA;0.998 TT;0.998
> GG;0.997 GG;0.997
> AG;1.000 AG;1.000
> AG;0.979 NN;0.661
> GG;0.996 GG;0.989

So what numbers do you get for this data set? Do you ever have mixed data
(heterozygous in one column and homozygous in the other)?

>
> I have managed to calculate the proportion of each genotype for both
> column, but I am struggling to find a way to calculate homo_same and
> homo_diff, het_same and het_ diff
>
> this is my script which should work with the example file above:

Here are some helpful hints to help us help you:

1. Use the file handle to include data at the end of your source file
so that people can cut and paste into a single file.
2. Print results to standard output instead of a file to reduce lines of
code.
3. Indent your source code properly to make it more readable (fewer people
will even read your message if the indenting is sloppy).
4. Leave out comments -- we don't really care what your program is doing for
you, just whether it is doing what you expect it to do.

> #!/software/bin/perl
> use warnings;
> use strict;
>
> my
> $file=".txt";
>
> my @hets=qw{AT AC AG TA TC TG GA GC GT CA CT CG};
> my @homos=qw{AA TT CC GG};

Put the above values in a hash as keys to avoid searching an array:

my( %hets, %homos );
$hets{$_} = 1 for @hets;
$homos{$_} = 1 for @homos;

>
> #my ($homo_count_same,$homo_count_diff);
> #my ($het_count_same,$het_count_diff);
> my %geno;
>
> my $line;
>
> open( my $FH , '<' , $file ) or die( $! );

Good use of: a) lexical file handle, b) three-argument open, c) error
checking.

> open(OUT, ">>Test.txt");

Not so good on all three counts!

>
> while( <$FH> ) {
> my @Snps = split /\t/;

You can just split on whitespace.

> foreach my $Snp (@Snps) {
> $line++;
> if ($Snp =~/^NN/)
> { next; }

That can be written as:

next if $Snp =~ /^NN/;

> foreach my $het (@hets){ #test if $snp is heterozygous
> if ($Snp=~/$het/){
> print OUT
> $line,"\t",$Snp,"\t","heterozygous","\t",$het,"\n"; #print the genotype het
>
> $geno{$het}++;#count het in hashe
> }
> }
> foreach my $hom (@homos){#test if $snp is homo
> if ($Snp=~/$hom/){
>
> print OUT $line,"\t",$Snp,"\t","homozygous","\t",$hom,"\n";
> #print the genotype homo
> $geno{$hom}++;#count homo in hashe
> }
>
> }
> }
> #I guess I need to calculate my $homo_same and Homo_diff from here, but
> I am not sure how

Compare the genotypes from the left and right columns and see if they are:
1) heterozygous or homozygous, and 2) the same or different.

>
>
> }
>
> foreach my $gen (keys %geno){
> print "$gen is present ",$geno{$gen}," times","\n";
> }
>
> many thanks for any suggestions,

I suggest you extract the actual genotype fields from the column data and
use hashes and the eq operator to make the appropriate tests. Here is one
example:

#!/usr/local/bin/perl
use strict;
use warnings;

my @hets=qw{AT AC AG TA TC TG GA GC GT CA CT CG};
my @homos=qw{AA TT CC GG};

my( %hets, %homos );
$hets{$_} = 1 for @hets;
$homos{$_} = 1 for @homos;

my( $homo_count_same, $homo_count_diff, $het_count_same, $het_count_diff ) =
(0)x4;
my %geno;

while( ) {
my( $snp1, $snp2 ) = split;
my( $gt1 ) = split(/;/,$snp1);
my( $gt2 ) = split(/;/,$snp2);

if( $homos{$gt1} && $homos{$gt2} ) {
if( $gt1 eq $gt2 ) {
$homo_count_same++;
}else{
$homo_count_diff++;
}
}elsif( $hets{$gt1} && $hets{$gt2} ) {
if( $gt1 eq $gt2 ) {
$het_count_same++;
}else{
$het_count_diff++;
}
}elsif( $gt1 ne 'NN' || $gt2 eq 'NN' ) {
# don't compare left and right columns
}else{
print "Mixed homogeneous and heterogeneous genotypes ($gt1,$gt2)\n";
}

$geno{$gt1}++ unless $gt1 eq 'NN';
$geno{$gt2}++ unless $gt2 eq 'NN';
}

foreach my $gen (keys %geno){
print "$gen is present ",$geno{$gen}," times","\n";
}

print "Homozygous same: $homo_count_same\n";
print "Homozygous diff: $homo_count_diff\n";
print "Heterozygous same: $het_count_same\n";
print "Heterozygous diff: $het_count_diff\n";

__DATA__
AG;1.000 AG;1.000
NN;0.775 NN;0.805
AC;0.999 TC;0.998
AA;0.998 TT;0.998
GG;0.997 GG;0.997
AG;1.000 AG;1.000
AG;0.979 NN;0.661
GG;0.996 GG;0.989

--
To unsubscribe, e-mail: beginners-unsubscribe@perl.org
For additional commands, e-mail: beginners-help@perl.org
http://learn.perl.org/