File sorting question

on 17.08.2011 23:59:31 by Eric Krause

Hello all,
I am beating my head against the wall, any help would be appreciated.

I have a file:
/ / / / m / cvfbcbf/ A123/ / / /// ////
/ / / / m / cvfbcbf/ A234/ / / /// ////
/ / / / m / cvfbcbf/ B123/ / / /// ////

There are spaces at the beginning and the end of each line, and the
lines are all very similar. I'm trying to count how many unique A#'s
and B#'s there are, as well as the total number of A#'s and B#'s.

The problem for me is the line endings, I think. When I open the file
and read in one line, I get the whole file. I think the line endings are
^p (MS paragraph markers), but I can't open the file to view them. The
files are huge, 150M or bigger. MS Word chokes on them.

Each line does end with 30 spaces.

Is there a way for me to search the entire 150M single line and get the
metrics I'm looking for, or is it possible to open the file, search for
the 30 spaces and replace them with \n?

Thanks again,
Eric

--
To unsubscribe, e-mail: beginners-unsubscribe@perl.org
For additional commands, e-mail: beginners-help@perl.org
http://learn.perl.org/

Re: File sorting question

on 18.08.2011 00:25:36 by Jim Gibson

On 8/17/11 Wed Aug 17, 2011 2:59 PM, "ERIC KRAUSE" scribbled:

> Hello all,
> I am beating my head against the wall, any help would be appreciated.
>
> I have a file:
> / / / / m / cvfbcbf/ A123/ / / /// ////
> / / / / m / cvfbcbf/ A234/ / / /// ////
> / / / / m / cvfbcbf/ B123/ / / /// ////
>
> There is spaces in the beginning and the end of each line and each line is
> very similar. I'm trying to count how many unique A#'s and B#'s as well as
> total A#'s and B#'s.

A hash would be suitable for that task.
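
As a rough illustration of the hash approach, here is a minimal sketch. It assumes the records have already been separated, and that an ID is a capital A or B followed by digits, which is only a guess based on the three sample lines. The sub name count_ids and the duplicated fourth sample record are made up for the example.

```perl
use strict;
use warnings;

# Count total and unique IDs such as A123 or B123 in a list of records.
# ASSUMPTION: an ID is "A" or "B" followed by digits, as in the sample lines.
sub count_ids {
    my @records = @_;
    my (%total, %seen);
    for my $record (@records) {
        while ($record =~ /\b([AB])(\d+)\b/g) {
            my ($prefix, $id) = ($1, "$1$2");
            $total{$prefix}++;          # every occurrence
            $seen{$prefix}{$id}++;      # one hash key per distinct ID
        }
    }
    # The number of keys per prefix is the number of unique IDs.
    my %unique = map { $_ => scalar keys %{ $seen{$_} } } keys %seen;
    return (\%total, \%unique);
}

# Sample records from the question, plus one repeated record (added here
# only so that "total" and "unique" differ).
my @sample = (
    ' / / / / m / cvfbcbf/ A123/ / / /// //// ',
    ' / / / / m / cvfbcbf/ A234/ / / /// //// ',
    ' / / / / m / cvfbcbf/ B123/ / / /// //// ',
    ' / / / / m / cvfbcbf/ A123/ / / /// //// ',
);
my ($total, $unique) = count_ids(@sample);
for my $prefix (sort keys %$total) {
    print "$prefix: total=$total->{$prefix} unique=$unique->{$prefix}\n";
}
# prints:
# A: total=3 unique=2
# B: total=1 unique=1
```

A hash of hashes keeps the per-prefix bookkeeping in one place; if the real IDs follow a different pattern, only the regular expression needs to change.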

>
> The problem for me is the line endings I think. When I open the file and read
> in one line, I get the whole file. I think the line endings are ^p (MS
> paragraph markers), but I can't open the file to view them. The files are
> huge, 150M or bigger. MS Word chokes on them.

Try Wordpad or Notepad to open the file. It sounds like the file is not a
regular text file with normal Windows (or Unix) line endings such as "\r\n",
"\n", "\r", etc. Where did the file come from?

>
> Each line does end with 30 spaces.
>
> Is there a way for me to search the entire 150M single line and get the
> metrics I'm looking for, or is it possible to open the file, search for the 30
> spaces and replace with \n?

Yes:

$file_contents =~ s/\s{30,}/\n/g;

which will substitute any consecutive substring of 30 or more whitespace
characters with a newline character.

You can also split the file on the 30 spaces:

my @lines = split(/\s{30,}/,$file_contents);

If you can figure out how the paragraph markers are stored in the file, you
can split on those, instead. The above statement will likely leave those
markers at the beginning of each line, except possibly the first.
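
To show where $file_contents might come from, here is a minimal sketch that reads a whole file into one string and then splits it on runs of 30 or more whitespace characters. The sub name read_records and the temporary demo file are hypothetical. Note that \s also matches "\r" and "\n", so any such characters next to the spaces are swallowed by the split, and that slurping a 150M file needs at least that much memory.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Read a whole file into one string ("slurp" mode) and split it into
# records wherever 30 or more whitespace characters occur in a row.
sub read_records {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;
    local $/;                        # undefined separator: read everything
    my $file_contents = <$fh>;
    close $fh;
    # Drop empty pieces caused by a leading separator run.
    return grep { length } split /\s{30,}/, $file_contents;
}

# Demo on a small temporary file that imitates the described layout:
# three records, each followed by 30 spaces and no newline.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
print {$tmp_fh} join(' ' x 30, ' / m / A123/ //', ' / m / A234/ //', ' / m / B123/ //'),
    ' ' x 30;
close $tmp_fh;

my @lines = read_records($tmp_path);
print scalar(@lines), " records\n";
```

The leading space of the second and third records is consumed by the split, because it is part of the same whitespace run as the 30 trailing spaces.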

You can use substr to print parts of the file:

print substr($file_contents,0,80), "\n";

to see what you really have.
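
Separator characters such as "\r" are often invisible when printed directly. The following sketch makes them visible by printing everything outside printable ASCII as a hex escape. The sub name show_invisible and the sample string are made up for the example; the real file may contain something else entirely.

```perl
use strict;
use warnings;

# Return a copy of $text in which every character outside printable ASCII,
# and the backslash itself, is shown as a hex escape such as \x0D.
# Spaces are left as they are.
sub show_invisible {
    my ($text) = @_;
    (my $visible = $text) =~ s/([^\x20-\x5B\x5D-\x7E])/sprintf('\\x%02X', ord $1)/ge;
    return $visible;
}

# ASSUMPTION: a sample with a carriage return between two records.
my $sample = " / m / A123/ //   \r / m / A234/ //";
print show_invisible(substr($sample, 0, 80)), "\n";
```

In this sample the carriage return shows up as \x0D, which is the kind of clue needed to pick the right separator.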




Re: File sorting question

on 18.08.2011 01:22:21 by Brandon McCaig

On Wed, Aug 17, 2011 at 5:59 PM, ERIC KRAUSE wrote:
> The problem for me is the line endings I think. When I open the
> file and read in one line, I get the whole file. I think the
> line endings are ^p (MS paragraph markers), but I can't open
> the file to view them. The files are huge, 150M or bigger. MS
> Word chokes on them.
*snip*
> Is there a way for me to search the entire 150M single line and
> get the metrics I'm looking for, or is it possible to open the
> file, search for the 30 spaces and replace with \n?

150M single line? Do you mean a single line is 150 megabytes or
did you mean something else?

Assuming sensible line lengths, you could start by opening the
file as a binary file and reading a specific amount of data (a
reasonable length, like a few kilobytes or megabytes). Write
that to a new file and examine it, either with a text editor or
a hex editor (or whatever application you choose). Once you
know the line/record separator character(s), you should be able to
easily process the file line by line or record by record.
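
The two steps above could look roughly like this. It is only a sketch: the sub names copy_head and for_each_record are hypothetical, and the "\r" separator in the demo is an assumption, since the real separator is exactly what the first step is meant to find out.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Step 1: copy the first $bytes bytes of $in_path to $out_path in binary
# mode, so the sample can be opened in a text or hex editor.
sub copy_head {
    my ($in_path, $out_path, $bytes) = @_;
    open my $in,  '<', $in_path  or die "Cannot open $in_path: $!";
    open my $out, '>', $out_path or die "Cannot open $out_path: $!";
    binmode $_ for $in, $out;
    my $got = read($in, my $buffer, $bytes);
    die "Read failed on $in_path: $!" unless defined $got;
    print {$out} $buffer;
    close $out or die "Cannot close $out_path: $!";
    close $in;
    return $got;                     # number of bytes actually copied
}

# Step 2: once the separator is known, read one record at a time by
# setting the input record separator $/ to it.
sub for_each_record {
    my ($path, $separator, $callback) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;
    local $/ = $separator;
    while (my $record = <$fh>) {
        chomp $record;               # chomp removes the current value of $/
        $callback->($record);
    }
    close $fh;
}

# Demo on a temporary file. ASSUMPTION: records are separated by "\r".
my ($src_fh, $src_path) = tempfile(UNLINK => 1);
binmode $src_fh;
print {$src_fh} " / m / A123/ //\r / m / A234/ //\r / m / B123/ //\r";
close $src_fh;

my ($head_fh, $head_path) = tempfile(UNLINK => 1);
close $head_fh;
copy_head($src_path, $head_path, 20);

my $count = 0;
for_each_record($src_path, "\r", sub { $count++ });
print "$count records\n";
```

Reading record by record keeps memory use small, which matters for files of 150M or more. Note that $/ is a plain string and not a regular expression, so a separator like "30 or more spaces" would have to be given as a fixed string such as ' ' x 30.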


--
Brandon McCaig
V zrna gur orfg jvgu jung V fnl. Vg qbrfa'g nyjnlf fbhaq gung jnl.
Castopulence Software


Re: File sorting question

on 18.08.2011 23:59:30 by Eric Krause

Brandon and Jim,
Thank you for the replies. They were very helpful. I have gotten past my
blockage.

Eric



