RFC: The future of Text::CSV_XS

on 25.05.2007 20:48:21 by h.m.brand

I have been digging a bit to find what people consider loose ends in
Text::CSV_XS, and tried to summarize that (in no particular order) in
the new TODO list. Here "TODO" gives no guarantee that it will be done,
nor on any implementation or API that it might suggest; it is there now
just so I/we do not forget to think about these issues.

I'd like to get thoughts/feedback/suggestions about items on this list,
and how valuable you consider adding these features to a module so
heavily used by other applications.

Jochen asked me to also post to this list, because many DBI users (have
to) deal with CSV data. So start shooting ...

=head1 TODO

=over 2

=item eol

Discuss an option to make the eol honor the $/ setting. Maybe

my $csv = Text::CSV_XS->new ({ eol => $/ });

is already enough, and new options only make things more opaque.

=item setting meta info

Future extensions might include extending the C<meta_info ()>,
C<is_quoted ()>, and C<is_binary ()> to accept setting these flags
for fields, so you can specify which fields are quoted in the
combine ()/string () combination.

$csv->meta_info (0, 1, 1, 3, 0, 0);
$csv->is_quoted (3, 1);

=item parse returning undefined fields

Adding an option that enables the parser to distinguish between
empty fields and undefined fields, like

$csv->quote_always (1);
$csv->allow_undef (1);
$csv->parse (qq{,"",1,"2",,""});
my @fld = $csv->fields ();

That would then return (undef, "", "1", "2", undef, "") in @fld, instead
of the current ("", "", "1", "2", "", "").
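
To make the distinction concrete, here is a minimal pure-Perl sketch. It is
not how Text::CSV_XS works internally and not a proposed API: the helper name
parse_with_undef is made up for this illustration, and it only handles a
single line without embedded line breaks.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::CSV_XS: split one simple CSV line,
# returning undef for unquoted empty fields and "" for quoted empty fields.
sub parse_with_undef {
    my ($line) = @_;
    my @fields;
    while (1) {
        if ($line =~ /\G"((?:[^"]|"")*)"/gc) {     # quoted field
            (my $value = $1) =~ s/""/"/g;          # unescape doubled quotes
            push @fields, $value;
        }
        elsif ($line =~ /\G([^,"]*)/gc) {          # unquoted field, may be empty
            push @fields, length ($1) ? $1 : undef;
        }
        last unless $line =~ /\G,/gc;              # no separator left: done
    }
    return @fields;
}

my @fld = parse_with_undef (q{,"",1,"2",,""});
```

With the example line from above, @fld holds six elements, of which the
first and the fifth are undefined.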

=item combined methods

Adding means (methods) that combine C<combine ()> and C<string ()> in
a single call. Likewise for C<parse ()> and C<fields ()>.

=item Unicode

Make C<parse ()> and C<combine ()> do the right thing for Unicode
(UTF-8) if requested. See t/50_utf8.t.

=item Space delimited separators

Discuss if and how C<Text::CSV_XS> should/could support formats like

1 , "foo" , "bar" , 3.19 ,
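
One possible reading of "support" here is to ignore blanks around the
separator. The sketch below is only an illustration in plain Perl under that
assumption; parse_padded is a made-up helper name, not a Text::CSV_XS method,
and it ignores embedded line breaks.

```perl
use strict;
use warnings;

# Hypothetical helper: parse one line whose fields may be padded with
# spaces or tabs around the separator.
sub parse_padded {
    my ($line) = @_;
    my @fields;
    while (1) {
        if ($line =~ /\G[ \t]*"((?:[^"]|"")*)"[ \t]*/gc) {   # quoted field
            (my $value = $1) =~ s/""/"/g;                    # unescape ""
            push @fields, $value;
        }
        elsif ($line =~ /\G[ \t]*([^,"]*)/gc) {              # unquoted field
            (my $value = $1) =~ s/[ \t]+\z//;                # drop trailing padding
            push @fields, $value;
        }
        last unless $line =~ /\G,/gc;                        # no separator left
    }
    return @fields;
}

my @fld = parse_padded (q{1 , "foo" , "bar" , 3.19 ,});
```

Note that the trailing separator in the example line yields a fifth, empty
field. Whether that is the desired result is part of what needs discussing.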

=item Double double quotes

There seem to be applications around that write their dates like

1,4,""12/11/2004"",4,1

If we would support that, in what way?

=item Parse the whole file at once

Implement a new method that enables the parsing of a complete file
at once, returning a list of hashes. A possible extension to this could
be to enable a column selection on the call:

my @AoH = $csv->parse_file ($filename, { cols => [ 1, 4..8, 12 ]});

Returning something like

[ { fields => [ 1, 2, "foo", 4.5, undef, "", 8 ],
flags => [ ... ],
errors => [ ... ],
},
{ fields => [ ... ],
.
.
},
]
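
The column selection itself could be as simple as an array slice. A minimal
sketch on already-parsed rows follows; select_cols is a hypothetical helper,
not an existing method, and it assumes that the cols numbers are counted
from 1, which the proposal above does not state.

```perl
use strict;
use warnings;

# Hypothetical helper, not a Text::CSV_XS method: keep only the requested
# columns of already-parsed rows and wrap each row in a { fields => [...] }
# hash, as in the structure above.
sub select_cols {
    my ($cols, @rows) = @_;
    my @idx = map { $_ - 1 } @$cols;    # assumption: cols are counted from 1
    return map { my @f = @{$_}[@idx]; +{ fields => \@f } } @rows;
}

my @rows = ([ "a" .. "l" ], [ 1 .. 12 ]);
my @AoH  = select_cols ([ 1, 4 .. 8, 12 ], @rows);
```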

=back

I'll try to keep the most current snapshot available as

http://www.xs4all.nl/~hmbrand/Text-CSV_XS-0.27.tar.gz

Until I release 0.27, after which I'll use 0.28 for the snapshot :)

--
H.Merijn Brand Amsterdam Perl Mongers (http://amsterdam.pm.org/)
using & porting perl 5.6.2, 5.8.x, 5.9.x on HP-UX 10.20, 11.00, 11.11,
& 11.23, SuSE 10.0 & 10.2, AIX 4.3 & 5.2, and Cygwin. http://qa.perl.org
http://mirrors.develooper.com/hpux/ http://www.test-smoke.org
http://www.goldmark.org/jeff/stupid-disclaimers/

Re: RFC: The future of Text::CSV_XS

on 25.05.2007 21:53:04 by h.m.brand

On Fri, 25 May 2007 15:22:02 -0400, "Richard Dice" wrote:

> Merijn,

Why no Cc: to the list?

> Thanks for asking, and for your work on this. (Looks like you just took
> over maintainership recently...?)

Yes.

> My recent "I wish Text::CSV_XS could handle X..." experience was -
>
> - Save an Excel spreadsheet to CSV format

My Spreadsheet::Read module on CPAN includes a utility that does just that:

# xlscat -c file.xls >file.csv

> - But some of the cells in the Excel spreadsheet contained line breaks

Shouldn't matter

> - So iterating line-by-line through the file in order to have lines to
> parse with Text::CSV_XS meant that any line derived from a row in Excel
> containing a cell containing a line break would fail
>
> That new feature idea regarding reading the whole file at once might be a
> good place to address this.

Don't think so, but feel free to enlighten me on the reasoning you have
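
For illustration, the reason plain line-by-line reading trips here is that
one record spans several physical lines. A rough pure-Perl sketch of gluing
such lines back together follows; logical_records is a made-up name, and
counting quotes is a simplification, not what the module does.

```perl
use strict;
use warnings;

# Hypothetical helper: group physical lines into logical CSV records by
# joining lines until the number of double quotes seen so far is even.
sub logical_records {
    my @lines = @_;
    my (@records, $pending);
    for my $line (@lines) {
        $pending = defined $pending ? "$pending\n$line" : $line;
        my $quotes = () = $pending =~ /"/g;     # count the quotes so far
        next if $quotes % 2;                    # inside a quoted field: keep reading
        push @records, $pending;
        undef $pending;
    }
    push @records, $pending if defined $pending;    # unterminated last record
    return @records;
}

my @records = logical_records (
    q{1,"first line},
    q{second line",3},
    q{4,"plain",6},
);
```

The three physical lines come out as two records, the first of which still
contains the line break inside its quoted field.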

> Other features that could be nice -
>
> - given a file, tell it whether it has a header row and if so provide
> a hash-key-style interface on each row per the names in columns of the
> header row

Could be one of the options to the suggested

parse_file ($file, { cols => [ ... ], has_header_row => 1 });

causing the construct of

{ fields => [ .... ],

to change to

{ fields => { Name => "...", Address => "...", ... },

but I think that would be a huge impact on memory use and also be
quite easy to create yourself in a map {} construct;
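
A minimal sketch of such a map {} construct, assuming the rows are already
parsed into array refs and the first row is the header. The sample data and
the variable names are made up for the example.

```perl
use strict;
use warnings;

# Already-parsed rows; the first one holds the column names.
my @rows = (
    [ "Name",  "Address"   ],
    [ "Alice", "1 Main St" ],
    [ "Bob",   "2 Side Rd" ],
);

my ($header, @data) = @rows;

# One hash per data row, keyed by the names from the header row.
my @AoH = map { my %rec; @rec{@$header} = @$_; \%rec } @data;
```

After this, $AoH[0]{Name} is "Alice" and $AoH[1]{Address} is "2 Side Rd".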

> - have it return how many rows and columns there are in the file

# xlscat -i file.csv

I don't think that kind of functionality should be in the low level
that this module lives in. Consider that reading CSV has no defined
way to jump back in the data stream, so once you've read the data,
you cannot go back. It has no random access structure like Excel.

> - ability to automatically ignore trailing (and perhaps leading) empty
> rows

Also an option in xlscat

> - provide a "best guess" count of how many columns there _should_ be
> in a row, based on the header row (if present) and/or general agreement
> amongst the other rows in the file (if 99 have 14 columns in a row and 1 has
> 10 columns, that 1 is likely an outlier)

Nice example. I like that. Should not be in the module itself, but could
be a file in the examples/ folder.
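
Such an example could be little more than picking the most frequent column
count. A sketch, with guess_column_count as a hypothetical name and the rows
assumed to be parsed into array refs already:

```perl
use strict;
use warnings;

# Hypothetical helper: return the most common number of columns among the
# rows; ties are resolved in favour of the larger count.
sub guess_column_count {
    my @rows = @_;
    my %seen;
    $seen{ scalar @$_ }++ for @rows;
    my ($best) = sort { $seen{$b} <=> $seen{$a} || $b <=> $a } keys %seen;
    return $best;
}

# 99 rows with 14 columns and a single row with 10 columns.
my @rows  = ((map { [ (1) x 14 ] } 1 .. 99), [ (1) x 10 ]);
my $guess = guess_column_count (@rows);
```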

> - In the event of rows with fewer columns than the best-guess (or a
> user-defined number of how many columns there should be) then provide
> extra undef column (array) values

I would say you use Spreadsheet::Read and do it in that framework.
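
If you do want it by hand, padding short rows takes only a few lines of
plain Perl. A sketch, where pad_rows is a made-up helper name and the rows
are array refs that get extended in place:

```perl
use strict;
use warnings;

# Hypothetical helper: extend every row that has fewer than $want columns
# with undef values. The rows are modified in place.
sub pad_rows {
    my ($want, @rows) = @_;
    for my $row (@rows) {
        push @$row, (undef) x ($want - @$row) if @$row < $want;
    }
    return @rows;
}

my @rows = pad_rows (3, [ 1, 2, 3 ], [ 4 ]);
```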

> - ability to extract a row/column range, e.g. columns 2 through 7 in
> rows 3 through 13

You definitely want xlscat :) Both supported as options

/home/merijn 101 > xlscat --help
usage: xlscat [-s <sep>] [-L] [-u] [ Selection ] file.xls
              [-c | -m]        [-u] [ Selection ] file.xls
               -i              [ -S sheets ]     file.xls
    Generic options:
       -v[#]       Set verbose level (xlscat)
       -d[#]       Set debug   level (Spreadsheet::Read)
       -u          Use unformatted values
       --noclip    Do not strip empty sheets and
                   trailing empty rows and columns
    Input CSV:
       --in-sep=c  Set input sep_char for CSV
    Output Text (default):
       -s <sep>    Use separator <sep>. Default '|', \n allowed
       -L          Line up the columns
    Output Index only:
       -i          Show sheet names and size only
    Output CSV:
       -c          Output CSV, separator = ','
       -m          Output CSV, separator = ';'
    Selection:
       -S <set>    Only print sheets <set>. 'all' is a valid set
                   Default only prints the first sheet
       -R <set>    Only print rows    <set>. Default is 'all'
       -C <set>    Only print columns <set>. Default is 'all'
       -F <flds>   Only fields <flds> e.g. -FA3,B16
/home/merijn 102 >

> You planning on being at YAPC::EU? Maybe I'll run into you there.

Yes, and I am planning to talk about another (new) module. I have
already registered.


Re: RFC: The future of Text::CSV_XS

on 26.05.2007 00:16:52 by Sam

On Fri, 25 May 2007, H.Merijn Brand wrote:

> I have been digging a bit to find what people consider loose ends in
> Text::CSV_XS, and tried to summarize that (in no particular order) in
> the new TODO list. Here "TODO" gives no guarantee that it will be done,
> nor on any implementation or API that it might suggest, it is there now
> just so I/we do not forget to think about these issues.
>
> I'd like to get thoughts/feedback/suggestions about items on this list,
> and how valuable you consider adding these features to a module so
> heavily used by other applications.

Oooo, how exciting! Text::CSV_XS is a great module with some rather
unfortunate problems.

My number one problem is that binary-mode deals exceptionally badly
with the \r character. You can read all about it here, including a
patch for part of the problem:

http://use.perl.org/~samtregar/journal/31443

In my fantasy world Text::CSV_XS would automatically accept "\r", "\n"
and "\r\n" as line-ending characters with no user-interaction
necessary. I went into the source hoping to do it but I came away
empty-handed.
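
Outside the module, the simple case can be approximated by splitting on all
three endings before parsing. This is only a sketch with made-up sample
data, and it assumes there are no line breaks inside quoted fields, which is
exactly where it would go wrong.

```perl
use strict;
use warnings;

# Sample data mixing all three line endings.
my $data = "a,b\r\nc,d\re,f\n";

# Try "\r\n" first so it is not split into two separate line breaks.
my @lines = split /\r\n|\r|\n/, $data;
```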

> =item parse returning undefined fields

I like this one too.

Good luck!

-sam

Re: RFC: The future of Text::CSV_XS

on 26.05.2007 07:52:21 by ron

H.Merijn Brand wrote:

Hi

> =item Space delimited separators
>
> Discuss if and how C<Text::CSV_XS> should/could support formats like
>
> 1 , "foo" , "bar" , 3.19 ,

If? Definitely. How? Errr...Is it so hard? When? Soon, so we can delete
specialized code written to deal with this. If you want to see the code
(which I did /not/ write), email me.
--
Ron Savage
ron@savage.net.au
http://savage.net.au/

Re: RFC: The future of Text::CSV_XS

on 26.05.2007 08:05:08 by ron

H.Merijn Brand wrote:

Hi

>> - given a file, tell it whether it has a header row and if so provide
>> a hash-key-style interface on each row per the names in columns of the
>> header row

Check the change log for Tie::Handle::CSV. I've recently suggested a
couple of changes to do with handling heading lines.


Re: RFC: The future of Text::CSV_XS

on 26.05.2007 10:44:51 by ron

H.Merijn Brand wrote:

Hi

> { fields => { Name => "...", Address => "...", ... },
> but I think that would be a huge impact on memory use and also be
> quite easy to create yourself in a map {} construct;

Rejecting this option based on memory usage is a specious argument.

Every method call on every object in every program has the potential to
use memory. So what? The person who makes any such call has to accept
the consequences of that call.

The other argument in favour of this option is: Why should every
programmer who needs it have to reinvent such code, potentially
thousands of times, when you could do it once for all of us in the
module. And it's not as though the method has to be called - as always,
no call, no memory usage.
