Regex problem
on 08.10.2007 14:48:11 by Hendrik Maryns
(This is in Java, but the regex is general, therefore x-post to
c.l.p.m., f-up to c.l.j.h.)
Hi all,
I want to discard the header of some file. The header is everything
before a line beginning with "#BOS". However, I do not want #BOS to be
part of the match, since I need it later on.
I thought of using a regex to do that. I came up with
.*(?s)(?=#BOS)
However, this gave me nothing.
(To be precise, I have
Scanner corpus = new Scanner(inFile);
Pattern header = Pattern.compile(".*(?s)(?=#BOS)", Pattern.MULTILINE);
corpus.skip(header);
and it gives me
java.util.NoSuchElementException
at java.util.Scanner.skip(Scanner.java:1706)
at de.uni_tuebingen.sfb.lichtenstein.binarytrees.Converter2.main(Converter2.java:61)
so if any of the Java people see a problem there, please point it out.)
So to pinpoint my problem: I want a regex which matches any number of
lines until it finds a line beginning with #BOS, but does not include
#BOS in the match.
Other tries looked like this:
.*?(?s)(?=#BOS)
(.|\n)*?(?=#BOS) (this freezes the program)
.*(?=#BOS) with the MULTILINE option to Pattern.compile
.*(?s)^(?=#BOS)
and several others, but I find no solution. So my last resort is asking
here.
TIA, H.
--
Hendrik Maryns
http://tcl.sfs.uni-tuebingen.de/~hendrik/
==================
http://aouw.org
Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html
Re: Regex problem
on 08.10.2007 18:08:52 by Abigail
_
Hendrik Maryns (hendrik_maryns@despammed.com) wrote on VCLI September
MCMXCIII in :
** (This is in Java, but the regex is general, therefore x-post to
** c.l.p.m., f-up to c.l.j.h.)
I don't read the latter, so I won't post just there. Followups set to
clpm though.
** I want to discard the header of some file. The header is everything
** before a line beginning with "#BOS". However, I do not want #BOS to be
** part of the match, since I need it later on.
**
** I thought of using a regex to do that. I came up with
**
** .*(?s)(?=#BOS)
That changes the meaning of . *after* matching .*
/(?s).*(?=#BOS)/
would do, although I would write it as:
/^.*(?=#BOS)/s
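A minimal Java sketch of the flag-position point, for readers following the Scanner code from the original post. The sample input and the helper name restAfterSkip are made up for illustration; the real corpus file is not shown in the thread. It relies on Scanner.skip needing a match anchored at the current position, and on an embedded (?s) only affecting what comes after it in the pattern.

```java
import java.util.NoSuchElementException;
import java.util.Scanner;
import java.util.regex.Pattern;

class HeaderSkipDemo {
    // Hypothetical sample input: two header lines, then the first #BOS line.
    static final String SAMPLE =
        "header line 1\nheader line 2\n#BOS 1\nsentence data\n";

    /**
     * Skips the header with the given regex and returns the next line,
     * or null if skip() finds no match at the start of the input.
     */
    static String restAfterSkip(String regex, String input) {
        try (Scanner corpus = new Scanner(input)) {
            corpus.skip(Pattern.compile(regex));
            return corpus.nextLine();
        } catch (NoSuchElementException e) {
            // Same exception as in the original stack trace.
            return null;
        }
    }

    public static void main(String[] args) {
        // (?s) comes after .* here, so the dot stops at the first newline
        // and the lookahead never sees #BOS: the helper returns null.
        System.out.println(restAfterSkip(".*(?s)(?=#BOS)", SAMPLE));

        // (?s) first: the dot crosses newlines, the header is skipped and
        // the #BOS line is still available to the caller.
        System.out.println(restAfterSkip("(?s).*(?=#BOS)", SAMPLE));
    }
}
```

Pattern.MULTILINE is left out on purpose: it only changes where ^ and $ match, not what the dot matches.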
Note that due to the .*, it will match everything up to the *last* occurrence
of #BOS. You might want to write that differently if you want to remove things
up to the first #BOS, for instance (untested):
/^[^#]*(?:#(?!BOS)[^#]*)*#BOS/
which does some loop unrolling, avoids the usage of .*? (which can be
costly), and doesn't need (?s) because there's no . in the pattern.
Note that I anchored the pattern to the beginning of the string. This
should speed up the case where no #BOS is present in the string matched
against.
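The difference between "last #BOS" and "first #BOS" can be checked with a small Java sketch. The sample string and the class and method names are assumptions for illustration only. The third pattern, which ends in a lookahead instead of consuming #BOS, is a variation that is not in the post above; it is included because the original question asked for #BOS to stay out of the match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstVsLastBos {
    // Hypothetical input: a stray '#' in the header and two #BOS lines.
    static final String SAMPLE = "head # note\n#BOS 1\nbody\n#BOS 2\nmore\n";

    // Greedy dot with DOTALL: backtracks from the end of the input.
    static final String GREEDY = "(?s)^.*(?=#BOS)";
    // The unrolled pattern as posted: consumes the first #BOS.
    static final String UNROLLED = "^[^#]*(?:#(?!BOS)[^#]*)*#BOS";
    // Assumed variation: same loop, but #BOS is only looked at, not consumed.
    static final String UNROLLED_LOOKAHEAD = "^[^#]*(?:#(?!BOS)[^#]*)*(?=#BOS)";

    /** End offset of a match anchored at the start of input, or -1 if none. */
    static int headerEnd(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.end() : -1;
    }

    public static void main(String[] args) {
        System.out.println("first #BOS at " + SAMPLE.indexOf("#BOS"));
        System.out.println("last  #BOS at " + SAMPLE.lastIndexOf("#BOS"));
        System.out.println("greedy ends at             " + headerEnd(GREEDY, SAMPLE));
        System.out.println("unrolled ends at           " + headerEnd(UNROLLED, SAMPLE));
        System.out.println("unrolled+lookahead ends at " + headerEnd(UNROLLED_LOOKAHEAD, SAMPLE));
    }
}
```

The greedy pattern stops right before the last #BOS, the posted unrolled pattern stops right after the first one, and the lookahead variation stops right before the first one.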
Abigail
--
perl5.004 -wMMath::BigInt -e'$^V=Math::BigInt->new(qq]$^F$^W783$[$%9889$^F47]
..qq]$|88768$^W596577669$%$^W5$^F3364$[$^W$^F$|838747$[88897 39$%$|$^F673$%$^W]
..qq]98$^F76777$=56]);$^U=substr($]=>$|=>5)*(q.25..($^W=@^V) )=>do{print+chr$^V
%$^U;$^V/=$^U}while$^V!=$^W'
Re: Regex problem
on 08.10.2007 19:43:24 by Lew
> Hendrik Maryns (hendrik_maryns@despammed.com) wrote on VCLI September
> MCMXCIII in :
> ** (This is in Java, but the regex is general, therefore x-post to
> ** c.l.p.m., f-up to c.l.j.h.)
Abigail wrote:
> I don't read the latter, so I won't post just there. Followups set to
> clpm though.
But the OP /does/ read clj.help, and pointed out that his problem is in Java,
so redirecting the answers away from clj.help is pure arrogance.
--
Lew
Re: Regex problem
on 09.10.2007 00:45:10 by Tad McClellan
[ f-up set to a newsgroup that I participate in. ]
Lew wrote:
>> Hendrik Maryns (hendrik_maryns@despammed.com) wrote on VCLI September
>> MCMXCIII in :
>> ** (This is in Java, but the regex is general, therefore x-post to
>> ** c.l.p.m., f-up to c.l.j.h.)
^^^^^^^^^^^^^^^^
> Abigail wrote:
>> I don't read the latter, so I won't post just there. Followups set to
^^^^^^^^^^
>> clpm though.
>
> But the OP /does/ read clj.help, and pointed out that his problem is in Java,
And he will see Abigail's helpful answer there.
So what's the problem?
> so redirecting the answers away from clj.help is pure arrogance.
He did not redirect answers away!
His post containing an answer was posted to the newsgroup that
the OP asked for.
Abigail does not participate in clj.help, and so won't
be able to see any followups to his post.
Dumping stuff into a newsgroup you do not read is arrogance.
--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"
Re: Regex problem
on 10.10.2007 16:42:34 by Hendrik Maryns
Abigail wrote:
> _
> Hendrik Maryns (hendrik_maryns@despammed.com) wrote on VCLI September
> MCMXCIII in :
> ** (This is in Java, but the regex is general, therefore x-post to
> ** c.l.p.m., f-up to c.l.j.h.)
>
> I don't read the latter, so I won't post just there. Followups set to
> clpm though.
Due to the quibbling, now cross-posting.
It seems it was a good idea to ask in c.l.p.m., though, since two of
three helpful answers came from there!
> ** I want to discard the header of some file. The header is everything
> ** before a line beginning with "#BOS". However, I do not want #BOS to be
> ** part of the match, since I need it later on.
> **
> ** I thought of using a regex to do that. I came up with
> **
> ** .*(?s)(?=#BOS)
>
> That changes the meaning of . *after* matching .*
Ah, I thought that was a global thing.
> /(?s).*(?=#BOS)/
>
> would do, although I would write it as:
>
> /^.*(?=#BOS)/s
>
> Note that due to the .*, it will match everything up to the *last* occurrence
> of #BOS. You might want to write that differently if you want to remove things
> up to the first #BOS, for instance (untested):
>
> /^[^#]*(?:#(?!BOS)[^#]*)*#BOS/
>
> which does some loop unrolling, avoids the usage of .*? (which can be
> costly), and doesn't need (?s) because there's no . in the pattern.
The version with .*? works fine. Why would that be costly?
Would you mind explaining a bit what the above does? My hunch:
-look for anything except # (this matches \n as well, I suppose), as
often as possible
-if you see a #, check that it is not followed by BOS, and is then again
followed by anything except #; and this whole thing as often as
possible, until #BOS is effectively seen
What I do not understand is why the first non-capturing group is
necessary, and did you forget to make the last #BOS a positive
lookahead, or is that on purpose?
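The reading above can be tried out with a small Java sketch that assembles the pattern piece by piece. The class name, the helper matchEnd, the test strings, and the two comparison patterns (NO_GROUP and LAZY) are all assumptions added for illustration; they do not come from any post in this thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnrolledPatternParts {
    // The posted pattern, split into its parts.
    static final String UNROLLED =
          "^"           // start of the input
        + "[^#]*"       // a run of anything except '#', newlines included
        + "(?:"         // non-capturing group, repeated zero or more times:
        +   "#(?!BOS)"  //   a '#' that is not the start of "#BOS"
        +   "[^#]*"     //   followed by another run of non-'#' characters
        + ")*"
        + "#BOS";       // the marker itself, consumed by the match

    // Hypothetical comparison: the same pattern with the group dropped.
    static final String NO_GROUP = "^[^#]*#BOS";
    // Hypothetical comparison: a lazy-dot pattern that also consumes #BOS.
    static final String LAZY = "(?s)^.*?#BOS";

    /** End offset of a match anchored at the start of input, or -1 if none. */
    static int matchEnd(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.end() : -1;
    }

    public static void main(String[] args) {
        String withStrayHash = "a # b\n#BOS x\n"; // header contains a lone '#'
        System.out.println(matchEnd(UNROLLED, withStrayHash));
        System.out.println(matchEnd(LAZY, withStrayHash));
        System.out.println(matchEnd(NO_GROUP, withStrayHash));
    }
}
```

On this input the unrolled and lazy patterns end at the same offset, just after the first #BOS, while the version without the group finds no match: the group is what lets the pattern step over a '#' in the header that does not start #BOS.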
> Note that I anchored the pattern to the beginning of the string. This
> should speed up the case where no #BOS is present in the string matched
> against.
Hm, seems like there is still a lot to regular expressions to be explored…
Thanks, H.
--
Hendrik Maryns
http://tcl.sfs.uni-tuebingen.de/~hendrik/
==================
http://aouw.org
Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html