parsing script duplication of lines issue, please advise

On 21.07.2011 13:00:57, Natalie Conte wrote:


Hi,
I want to create a simple script that parses a file and writes only the
lines containing a certain value to a new output file.
This is my infile format (a workable example is attached).
I want to keep only the lines where there is a 1, not the ones with -1.
There are 10 such lines in this example, but when I produce my outfile it
is 20 lines long! They are duplicated and I am not sure why. I would
appreciate any advice. The attached example infile contains 50 and
produces an outfile of 100...
18 3016088 3016288 -1
18 3035364 3035564 -1
18 3163934 3164134 -1
18 3167351 3167551 1
18 3176373 3176573 1
18 3198845 3199045 -1
18 3215936 3216136 1
18 3275482 3275682 -1
18 3281089 3281289 -1
18 3388675 3388875 -1
18 3517500 3517700 -1
18 3588447 3588647 1
18 3667294 3667494 -1
18 3746503 3746703 -1
18 3771167 3771367 -1
18 3779418 3779618 -1
18 3916005 3916205 -1
18 3933642 3933842 1
18 3975635 3975835 1
18 3992344 3992544 -1
18 4084642 4084842 1
18 4127586 4127786 -1
18 4149689 4149889 -1
18 4158287 4158487 -1
18 4189973 4190173 1
18 4402882 4403082 -1
18 4441582 4441782 1
18 4454914 4455114 -1
18 4549176 4549376 1
18 4557665 4557865 -1
18 4557697 4557897 -1
18 4600101 4600301 -1


####this is my script
#!/software/bin/perl
use warnings;
use strict;




my $file="./infile.txt";

open( IN , '<' , $file ) or die( $! );
open(OUT, ">>outfile.txt");


while (<IN>){

my @line=split(/\t/);

if($line[3]==-1) {
print OUT $line[0],"\t",$line[1],"\t",$line[2],"\t",$line[3],"\n";
}

}
close OUT;
close IN;

thanks a lot
Nat



Attachment: Infile.txt

18 3016088 3016288 -1
18 3035364 3035564 -1
18 3163934 3164134 -1
18 3167351 3167551 1
18 3176373 3176573 1
18 3198845 3199045 -1
18 3215936 3216136 1
18 3275482 3275682 -1
18 3281089 3281289 -1
18 3388675 3388875 -1
18 3517500 3517700 -1
18 3588447 3588647 1
18 3667294 3667494 -1
18 3746503 3746703 -1
18 3771167 3771367 -1
18 3779418 3779618 -1
18 3916005 3916205 -1
18 3933642 3933842 1
18 3975635 3975835 1
18 3992344 3992544 -1
18 4084642 4084842 1
18 4127586 4127786 -1
18 4149689 4149889 -1
18 4158287 4158487 -1
18 4189973 4190173 1
18 4402882 4403082 -1
18 4441582 4441782 1
18 4454914 4455114 -1
18 4549176 4549376 1
18 4557665 4557865 -1
18 4557697 4557897 -1
18 4600101 4600301 -1
18 4655821 4656021 1
18 4823384 4823584 1
18 4926583 4926783 1
18 5014539 5014739 -1
18 5016121 5016321 1
18 5259109 5259309 -1
18 5410893 5411093 -1
18 5569191 5569391 1
18 5712820 5713020 -1
18 5779451 5779651 1
18 5833552 5833752 1
18 5857140 5857340 1
18 5881514 5881714 1
18 6086628 6086828 -1
18 6136835 6137035 1
18 6150173 6150373 -1
18 6162670 6162870 -1
18 6285079 6285279 1
18 6313885 6314085 -1
18 6330508 6330708 -1
18 6474218 6474418 -1
18 6480666 6480866 -1
18 6492824 6493024 1
18 6564120 6564320 1
18 6615712 6615912 1
18 6639668 6639868 -1
18 6672462 6672662 1
18 6748872 6749072 1
18 6798372 6798572 1
18 6893386 6893586 1
18 6958909 6959109 1
18 6959109 6959309 -1
18 7064521 7064721 1
18 7144521 7144721 1
18 7247001 7247201 1
18 7247380 7247580 1
18 7341855 7342055 -1
18 7364831 7365031 1
18 7607559 7607759 -1
18 8193879 8194079 1
18 8238292 8238492 1
18 8263636 8263836 -1
18 8369450 8369650 -1
18 8375540 8375740 -1
18 8451208 8451408 1
18 8532328 8532528 1
18 8549356 8549556 1
18 8741513 8741713 1
18 8875256 8875456 1
18 8909214 8909414 -1
18 8936872 8937072 1
18 8952466 8952666 -1
18 9102418 9102618 1
18 9109811 9110011 -1
18 9144187 9144387 -1
18 9150741 9150941 -1
18 9177129 9177329 -1
18 9192311 9192511 -1
18 9227813 9228013 -1
18 9317676 9317876 1
18 9425447 9425647 -1
18 9522253 9522453 -1
18 9683323 9683523 -1
18 9691794 9691994 -1
18 9733220 9733420 -1
18 9856687 9856887 1
18 9924738 9924938 -1
18 9951261 9951461 -1
18 9969697 9969897 -1
18 9989764 9989964 -1
18 10013428 10013628 1
18 10277437 10277637 1
18 10287118 10287318 1
18 10290645 10290845 1
18 10331372 10331572 1
18 10381115 10381315 -1



Re: parsing script duplication of lines issue, please advise

On 21.07.2011 13:33:12, Brian Fraser wrote:


On Thu, Jul 21, 2011 at 8:17 AM, Nathalie Conte wrote:

> I forgot to say that the extra lines are empty... but I don't understand
> why they are there :)
>
>
That's rather simpler, since there's nothing in the program that could cause
double output (unless you ran it twice :P).
The issue is that each line you read in ends with a \n. So, assuming there's
no tab after each -1, $line[3] will actually contain "-1\n", and your print
adds another \n. There's your empty line. Either add chomp; before the line
with the split, or remove the newline from the print.
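
To make the cause concrete, here is a minimal sketch that works on a single
in-memory line instead of the real file. The variable names ($raw,
@unchomped, @chomped and so on) are made up for illustration and are not
from the original script. It assumes the fields are tab-separated and that
each line ends in a single "\n".

```perl
use strict;
use warnings;

# One line as it would arrive from the input filehandle (assumed format).
my $raw = "18\t3167351\t3167551\t1\n";

# Without chomp, the newline stays glued to the last field.
my @unchomped  = split /\t/, $raw;
my $with_extra = join("\t", @unchomped) . "\n";   # now ends in "\n\n"

# With chomp, the fields are clean and exactly one "\n" is added back.
chomp(my $clean = $raw);
my @chomped = split /\t/, $clean;
my $fixed   = join("\t", @chomped) . "\n";

print "last field still has a newline\n" if $unchomped[3] eq "1\n";
print "unchomped output ends with a blank line\n" if $with_extra =~ /\n\n\z/;
print "chomped output matches the input line\n"   if $fixed eq $raw;
```

All three messages should print, which is the blank-line effect described
above followed by the chomp fix.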

Also, a comment on your code:
print OUT $line[0],"\t",$line[1],"\t",$line[2],"\t",$line[3],"\n";

That's rather unwieldy, isn't it? You could simplify it a bit by taking
advantage of interpolation, like this:

print OUT "$line[0]\t$line[1]\t$line[2]\t$line[3]";

But that's still a mouthful, so you could use join[0] instead:

print OUT join "\t", @line;

[0] http://perldoc.perl.org/functions/join.html
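
Putting chomp and join together, the filtering step might look roughly like
the sketch below. This is only an illustration, not a drop-in replacement:
filter_rows and @sample are hypothetical names, the rows are copied from the
example data in the question, and the data is assumed to be tab-separated
with the value of interest in the fourth column.

```perl
use strict;
use warnings;

# Hypothetical helper: return the rows whose fourth tab-separated field
# equals $wanted, with trailing newlines removed.
sub filter_rows {
    my ($wanted, @rows) = @_;
    my @kept;
    for my $row (@rows) {
        chomp $row;                      # strip the trailing "\n" first
        my @field = split /\t/, $row;
        next unless @field >= 4;         # skip blank or short lines
        push @kept, join("\t", @field) if $field[3] == $wanted;
    }
    return @kept;
}

# Three rows taken from the example data, as they would be read from a file.
my @sample = (
    "18\t3163934\t3164134\t-1\n",
    "18\t3167351\t3167551\t1\n",
    "18\t3176373\t3176573\t1\n",
);

# Prints the two rows ending in 1, one per line, with no blank lines.
print "$_\n" for filter_rows(1, @sample);
```

In a real script the rows would come from the input filehandle rather than
from @sample, and the print would go to OUT.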
