Help Parsing a Tab delimited file

on 13.07.2011 18:59:40 by Tiago Hori

Hi All,

I work with microarrays and get huge tab-delimited files as output from the
software that analyzes these microarrays. The result is a tab-delimited,
Excel-type file that has 160000 rows and about 20 columns.

Every 44K rows make up one unit within the data. These units are identified by
the second data column, called Meta Row. So the first 44K rows have the
value 1 in Meta Row, the next 44K have the value 2, and so forth.

I would like to be able to separate these files into 4 different files, each
one containing one unit of data. So all the rows that have Meta Row 1 would
go to one file, the ones with Meta Row 2 would go to another file, and so
forth.

I have been reading Beginning Perl to try to figure this out, but I
haven't been able to come up with anything.

I have many questions: I know I can use a filehandle to connect to the file,
but how would I store the data to begin with?

Is there a way to iteratively read through the rows and then copy them to a
variable as long as their Meta Row column reads, let's say, 1? And then output
that as a new file?

Any help would be greatly appreciated, even if it is just hints on how to get
started.

Cheers,

Tiago

Tiago S. F. Hori
PhD Candidate - Ocean Science Center-Memorial University of Newfoundland

Re: Help Parsing a Tab delimited file

on 13.07.2011 20:06:00 by Shlomi Fish

Hi Tiago,

On Wed, 13 Jul 2011 14:29:40 -0230
Tiago Hori wrote:

> Hi All,
>
> I work with microarrays and get huge tab-delimited files as output from the
> software that analyzes these microarrays. The result is a tab-delimited,
> Excel-type file that has 160000 rows and about 20 columns.
>
> Every 44K rows make up one unit within the data. These units are identified by
> the second data column, called Meta Row. So the first 44K rows have the
> value 1 in Meta Row, the next 44K have the value 2, and so forth.
>
> I would like to be able to separate these files into 4 different files, each
> one containing one unit of data. So all the rows that have Meta Row 1 would
> go to one file, the ones with Meta Row 2 would go to another file, and so
> forth.
>
> I have been reading Beginning Perl to try to figure this out, but I
> haven't been able to come up with anything.
>
> I have many questions: I know I can use a filehandle to connect to the file,
> but how would I store the data to begin with?

You can write the data directly to the four filehandles as you go over the rows.
You can have filehandles as the values of arrays or hashes.

Just make sure you are using a CSV parsing and output module:

http://beta.metacpan.org/release/Text-CSV

>
> Is there a way to iteratively read through the rows and then copy them to a
> variable as long as their Meta Row column reads, let's say, 1?

Yes, there is, use a conditional or a hash for that. But you shouldn't keep
everything in memory - use multiple file-handles.
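
A minimal sketch of this approach, for illustration only. It assumes the first line of the input is a header that should be repeated in every output file and that the Meta Row value sits in the second column; the sub name split_by_column and the "<prefix>_<value>.txt" naming are made up here. It uses a plain split on tabs rather than a CSV module, which is only safe when no field contains an embedded tab or newline.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a tab-delimited file into one output file per distinct value of a
# column, writing each row as soon as it is read instead of storing rows.
# Assumptions (not from the original post): the first line is a header, and
# output files are named "<prefix>_<value>.txt".
sub split_by_column {
    my ($in_file, $col_index, $prefix) = @_;

    open my $in, '<', $in_file or die "Cannot open $in_file: $!";
    my $header = <$in>;

    my %out_for;      # column value => output filehandle
    my %count_for;    # column value => number of rows written
    return \%count_for unless defined $header;    # empty input

    while (my $line = <$in>) {
        chomp(my $row = $line);
        my @fields = split /\t/, $row, -1;    # -1 keeps trailing empty fields
        my $key = $fields[$col_index];
        next unless defined $key && length $key;

        # Open the output file the first time a value is seen.
        if (!$out_for{$key}) {
            my $out_file = "${prefix}_$key.txt";
            open my $out, '>', $out_file or die "Cannot open $out_file: $!";
            print {$out} $header;
            $out_for{$key} = $out;
        }
        print { $out_for{$key} } $line;
        $count_for{$key}++;
    }
    close $in;
    for my $key (keys %out_for) {
        close $out_for{$key} or die "Cannot close output for $key: $!";
    }
    return \%count_for;
}

# Run only when a file name is given on the command line.
if (@ARGV) {
    my $counts = split_by_column($ARGV[0], 1, 'meta_row');
    print "$_: $counts->{$_} rows\n" for sort keys %$counts;
}
```

Only one filehandle per distinct value is open at any time (four in the case described), so memory use does not grow with the size of the input.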

> And then output
> that as a new file?
>
> Any help would be greatly appreciated, even if it is just hints on how to get
> started.
>
> Cheers,
>
> Tiago
>

Regards,

Shlomi Fish

--
-----------------------------------------------------------------
Shlomi Fish http://www.shlomifish.org/
Freecell Solver - http://fc-solve.berlios.de/

mplayer 0.9.999.2010.03.11-rc5-adc83b19e793491b1c6ea0fd8b46cd9f32e592fc now
available for download.
-- Shlomi Fish and d3x.

Please reply to list if it's a mailing list post - http://shlom.in/reply .

--
To unsubscribe, e-mail: beginners-unsubscribe@perl.org
For additional commands, e-mail: beginners-help@perl.org
http://learn.perl.org/

Re: Help Parsing a Tab delimited file

on 13.07.2011 20:15:55 by Leo Susanto

On Wed, Jul 13, 2011 at 11:06 AM, Shlomi Fish wrote:
> Hi Tiago,
>
>
> You can write the data directly to the four filehandles as you go over them.
> You can have filehandles as the values of arrays or hashes.
>
> Just make sure you are using a CSV parsing and output module:
>
> http://beta.metacpan.org/release/Text-CSV
>

I usually use Text::CSV_XS, stemming from the ancient belief that it
was faster than Text::CSV.

Has anyone benchmarked Text::CSV against Text::CSV_XS?

Re: Help Parsing a Tab delimited file

on 13.07.2011 20:27:47 by Shlomi Fish

Hi Tiago,

You sent this message to me in private, so I'm CCing the list. Next time please
hit "Reply to all" instead of "Reply" (see my signature for more information.)

On Wed, 13 Jul 2011 15:51:00 -0230
Tiago Hori wrote:

> Hi Shlomi,
>
> Thanks a LOT!
>

You're welcome.

> This may be a silly question BUT:
>
> My input is Tab Delimited and I would like the output to be Tab Delimited,
> would that module still work?
>

Yes, you can configure Text::CSV to process tab-separated files instead of
CSV ones.
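
A rough sketch of what that configuration could look like. It assumes Text::CSV is installed from CPAN, that the first line is a header, and that the Meta Row value is in the second column; the sub name and the "<prefix>_<value>.tsv" naming are invented for the example. The constructor options used (sep_char, binary, eol, quote_space, auto_diag) are documented Text::CSV attributes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch: split a tab-separated file into one file per value of a column,
# using Text::CSV (a CPAN module) for both parsing and output.
# The sub name and the "<prefix>_<value>.tsv" naming are illustrative only.
sub split_tsv_with_text_csv {
    my ($in_file, $col_index, $prefix) = @_;

    # Loaded at run time so that the script still compiles on machines
    # where the module has not been installed yet.
    require Text::CSV;
    my $csv = Text::CSV->new({
        sep_char    => "\t",    # tab instead of the default comma
        binary      => 1,       # accept non-ASCII bytes inside fields
        eol         => "\n",    # line ending added by print()
        quote_space => 0,       # do not quote fields just for having spaces
        auto_diag   => 1,       # report parse errors instead of hiding them
    });

    open my $in, '<', $in_file or die "Cannot open $in_file: $!";
    my (%out_for, %count_for);
    my $header = $csv->getline($in);    # array ref of column names
    return \%count_for unless $header;

    while (my $row = $csv->getline($in)) {
        my $key = $row->[$col_index];
        next unless defined $key && length $key;

        # First time this value is seen: open its file and write the header.
        if (!$out_for{$key}) {
            my $out_file = "${prefix}_$key.tsv";
            open my $out, '>', $out_file or die "Cannot open $out_file: $!";
            $csv->print($out, $header);
            $out_for{$key} = $out;
        }
        $csv->print($out_for{$key}, $row);
        $count_for{$key}++;
    }
    close $in;
    close $_ for values %out_for;
    return \%count_for;
}

# Run only when a file name is given on the command line.
if (@ARGV) {
    my $counts = split_tsv_with_text_csv($ARGV[0], 1, 'meta_row');
    print "$_: $counts->{$_} rows\n" for sort keys %$counts;
}
```

The same object handles input and output, so fields that contain a tab or a double quote are quoted on the way out and can be read back by the same parser settings.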

Regards,

Shlomi Fish

> Cheers,
>
> Tiago
>

[SNIPPED]

--
-----------------------------------------------------------------
Shlomi Fish http://www.shlomifish.org/
Escape from GNU Autohell - http://www.shlomifish.org/open-source/anti/autohell/

Chuck Norris is the greatest man in history. He killed all the great men who
could ever pose a competition.

Please reply to list if it's a mailing list post - http://shlom.in/reply .

Re: Help Parsing a Tab delimited file

on 13.07.2011 20:29:56 by Shlomi Fish

On Wed, 13 Jul 2011 11:15:55 -0700
Leo Susanto wrote:

> On Wed, Jul 13, 2011 at 11:06 AM, Shlomi Fish wrote:
> > Hi Tiago,
> >
> >
> > You can write the data directly to the four filehandles as you go over them.
> > You can have filehandles as the values of arrays or hashes.
> >
> > Just make sure you are using a CSV parsing and output module:
> >
> > http://beta.metacpan.org/release/Text-CSV
> >
>
> I usually use Text::CSV_XS, stemming from the ancient belief that it
> was faster than Text::CSV.
>
> has anyone bench-marked Text::CSV vs Text::CSV_XS?

Reading from:

http://beta.metacpan.org/module/Text::CSV

Text::CSV provides facilities for the composition and decomposition of
comma-separated values using Text::CSV_XS or its pure Perl version.

So if it detects you have Text::CSV_XS it will use that.

Regards,

Shlomi Fish

--
-----------------------------------------------------------------
Shlomi Fish http://www.shlomifish.org/
The Case for File Swapping - http://shlom.in/file-swap

Chuck Norris is the ghost author of the entire Debian GNU/Linux distribution.
And he wrote it in 24 hours, while taking snack breaks.

Please reply to list if it's a mailing list post - http://shlom.in/reply .

Re: Help Parsing a Tab delimited file

on 13.07.2011 21:18:44 by derykus

There's already been a very good recommendation. But, if
you know your file has no irregularities, is surprise-free as
far as formatting, you may be tempted to just try a 1-liner
since Perl does make "easy things easy...":

perl -lane 'if ($F[1] ne $old ) {open($fh,">",$F[1]) or die $!};
print $fh $_;$old = $F[1]' file

--
Charles DeRykus

Re: Help Parsing a Tab delimited file

on 14.07.2011 02:42:59 by Tiago Hori

Hi Charles,

Thanks a LOT.

I am trying to learn the CSV module for the future, but this software will
always spit the same file format at me, so your solution may be the way to go
for now. Would you mind giving me a quick explanation of what that one-liner
does? It would be useful for me to learn it in more depth and be able to adapt
it to future problems.

Cheers,

Tiago

Re: Help Parsing a Tab delimited file

on 14.07.2011 02:51:13 by Jim Gibson

On 7/13/11 Wed Jul 13, 2011 5:42 PM, "Tiago Hori"
scribbled:

> Hi Charles,
>
> Thanks a LOT.
>
> I am trying to learn the CSV module for the future, but this software will
> always spit the same file format at me, so your solution may be the way to go
> for now. Would you mind giving me a quick explanation of what that one-liner
> does? It would be useful for me to learn it in more depth and be able to adapt
> it to future problems.

Have you read 'perldoc perlrun'? Do that, and then ask any questions about
what you do not understand. Try to re-write Charles' one-liner as a regular,
multi-line program in a stored file and see if you can get the same output.

Re: Help Parsing a Tab delimited file

on 14.07.2011 05:54:41 by derykus

On Jul 13, 5:42 pm, tiago.h...@gmail.com (Tiago Hori) wrote:
> ...
> > C.DeRykus wrote:
> > There's already been a very good recommendation. But, if
> > you know your file has no irregularities, is surprise-free as
> > far as formatting, you may be tempted to just try a 1-liner
> > since Perl does make "easy things easy...":
>
> > perl -lane 'if ($F[1] ne $old ) {open($fh,">",$F[1]) or die $!};
> >             print $fh $_;$old = $F[1]' file

> I am trying to learn the CSV module for the future, but this software will
> always spit the same file format at me, so your solution may be the way
> to go for now. Would you mind giving me a quick explanation of what that
> one-liner does? It would be useful for me to learn it in more depth and be
> able to adapt it to future problems.

You're welcome. As Jim says though, you're better off running a
multi-line program to learn basics if you're just beginning and only
then trying a simpler solution. Or, if you want to jump ahead, see
the doc (perldoc perlrun) to see what the switches mean.

Basically, the one-liner reads the tab-delimited file line by line
(-n), autosplits each line into fields on whitespace (-a),
and populates @F with those fields. If the 2nd column, $F[1],
hasn't been seen or differs from the previous line's 2nd column,
then a new output file is opened with a name matching $F[1].
The entire current line $_ is then written to the file. Lastly, $F[1]
is saved to $old so that when the next line is read, $F[1] can be
compared with $old to see if a new file should be opened.

Note on switches: -l    * unnecessary, so it can be omitted
                  -F\t  * could be added to split on tab instead of whitespace
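
For comparison, here is roughly what the one-liner amounts to when written out as a stored, multi-line program. This is a sketch under the same assumption as the one-liner, namely that all rows of one unit are contiguous; the sub name split_on_change and the optional output directory parameter are additions for illustration, and unlike the one-liner it skips lines that have fewer than two fields.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Multi-line equivalent of the one-liner: a new output file, named after the
# value in the 2nd column, is opened whenever that value differs from the
# one on the previous line.
sub split_on_change {
    my ($in_file, $out_dir) = @_;
    $out_dir = '.' unless defined $out_dir;    # hypothetical extra parameter

    open my $in, '<', $in_file or die "Cannot open $in_file: $!";

    my ($out, $old);
    my @files;                        # names of the files created, in order
    while (my $line = <$in>) {        # -n: loop over the input line by line
        chomp $line;                  # -l: strip the newline on input
        my @F = split ' ', $line;     # -a: split on whitespace into @F
        next unless defined $F[1];    # deviation: skip short or empty lines

        if (!defined $old || $F[1] ne $old) {
            close $out if $out;
            my $name = "$out_dir/$F[1]";
            open $out, '>', $name or die "Cannot open $name: $!";
            push @files, $name;
        }
        print {$out} "$line\n";       # -l: add the newline back on output
        $old = $F[1];
    }
    close $out if $out;
    close $in;
    return @files;
}

# Run only when a file name is given on the command line.
split_on_change($ARGV[0]) if @ARGV;
```

Because each file is opened with '>', a value that reappears after a different one would truncate the file written earlier, which is why the contiguity assumption matters.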

--
Charles DeRykus

Re: Help Parsing a Tab delimited file

on 14.07.2011 08:17:57 by jwkrahn

C.DeRykus wrote:
>
> Basically, the one-liner reads the tab-delimited file line by line
> (-n), autosplits each line into fields on whitespace (-a),
> and populates @F with those fields. If the 2nd column, $F[1],
> hasn't been seen or differs from the previous line's 2nd column,
> then a new output file is opened with a name matching $F[1].
> The entire current line $_ is then written to the file. Lastly, $F[1]
> is saved to $old so that when the next line is read, $F[1] can be
> compared with $old to see if a new file should be opened.
>
> Note on switches: -l    * unnecessary, so it can be omitted
>                   -F\t  * could be added to split on tab instead of whitespace

That won't work as the shell will interpolate away the backslash:

$ echo -e "one\ttwo\tthree\tfour" | perl -F\t -lane 'print "$_: $F[$_]" for 0 .. $#F'
0: one
1: wo
2: hree four

You have to quote it:

$ echo -e "one\ttwo\tthree\tfour" | perl -F'\t' -lane 'print "$_: $F[$_]" for 0 .. $#F'
0: one
1: two
2: three
3: four
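
The difference can also be reproduced in plain Perl, without a shell in between. The sketch below (the sample string is made up) applies the two patterns that the -F switch ends up with in each case:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = "one\ttwo\tthree\tfour";

# Unquoted -F\t reaches Perl as -Ft, so the line is split on the letter "t".
my @on_letter_t = split /t/, $line;

# Quoted -F'\t' reaches Perl as \t, so the line is split on tab characters.
my @on_tab = split /\t/, $line;

printf "split /t/  gives %d fields\n", scalar @on_letter_t;    # 3 fields
printf "split /\\t/ gives %d fields\n", scalar @on_tab;        # 4 fields
```

The three fields from the first split are "one" and "wo", each with a trailing tab, and "hree" followed by a tab and "four", which matches the output of the unquoted command.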

John
--
Any intelligent fool can make things bigger and
more complex... It takes a touch of genius -
and a lot of courage to move in the opposite
direction. -- Albert Einstein

Re: Help Parsing a Tab delimited file

on 14.07.2011 13:23:11 by rvtol+usenet

On 2011-07-14 08:17, John W. Krahn wrote:

> You have to quote it:
>
> $ echo -e "one\ttwo\tthree\tfour" | perl -F'\t' -lane 'print "$_: $F[$_]" for 0 .. $#F'
> 0: one
> 1: two
> 2: three
> 3: four

Or double it:

echo -e "a\tb\tc\td" |perl -F\\t -lane'print"$_: $F[$_]"for 0..$#F'

0: a
1: b
2: c
3: d

--
Ruud

Re: Help Parsing a Tab delimited file

on 14.07.2011 20:12:42 by derykus

On Jul 13, 11:17 pm, jwkr...@shaw.ca ("John W. Krahn") wrote:
> C.DeRykus wrote:
>
>...
>
> That won't work as the shell will interpolate away the backslash:

Not necessarily... it works on Win32's idea of a
"shell" for instance :)

But quoting doesn't hurt even there and is a good habit.

>
> $ echo -e "one\ttwo\tthree\tfour" | perl -F\t -lane 'print "$_: $F[$_]" for 0 .. $#F'
> 0: one
> 1: wo
> 2: hree four
>
> You have to quote it:
>

Yes, particularly on Unix.

--
Charles DeRykus
