Regular expression capture group in shell

on 16.04.2008 13:25:00 by juhanay

Hi,
This should be an easy problem for you guys. I have a file which,
for example, contains the following data:

00 XXX112233445566.QQQZZZ

Now I just need to parse that last string,
XXX112233445566.QQQZZZ, into the
pieces XXX, 11, 22, 33, 44, 55, 66, QQQ and ZZZ and append that data to the
end of the file with some labels. In the above example, letters stand for any
character and digits for any digit. Each run of the same letter or digit
is one capture group.

A nice output file would look something like this:
00 XXX112233445566.QQQZZZ
Label1: XXX
Label2: 11
Label3: 22
Label4: 33
Label5: 44
Label6: 55
Label7: 66
Label8: QQQ
Label9: ZZZ

This would be easy with Perl regular expressions and capture groups (),
but I don't know how to use capture groups with egrep. I cannot assume
that a Perl interpreter is always present, so I would prefer to use
shell tools if possible. The same goes for other tools as well.

Re: Regular expression capture group in shell

on 16.04.2008 13:42:03 by Janis Papanagnou

On 16 Apr., 13:25, juha...@gmail.com wrote:
> Hi,
> This should be an easy problem for you guys. I have a file which,
> for example, contains the following data:
>
> 00    XXX112233445566.QQQZZZ
>
> Now I just need to parse that last string,
> XXX112233445566.QQQZZZ, into the
> pieces XXX, 11, 22, 33, 44, 55, 66, QQQ and ZZZ and append that data to the
> end of the file with some labels. In the above example, letters stand for any
> character and digits for any digit. Each run of the same letter or digit
> is one capture group.
>
> A nice output file would look something like this:
> 00    XXX112233445566.QQQZZZ
> Label1:    XXX
> Label2:    11
> Label3:    22
> Label4:    33
> Label5:    44
> Label6:    55
> Label7:    66
> Label8:    QQQ
> Label9:    ZZZ
>
> This would be easy with Perl regular expressions and capture groups (),
> but I don't know how to use capture groups with egrep. I cannot assume
> that a Perl interpreter is always present, so I would prefer to use
> shell tools if possible. The same goes for other tools as well.

Do you have GNU awk available?

echo "00    XXX112233445566.QQQZZZ" |
gawk 'BEGIN {FIELDWIDTHS="6 3 2 2 2 2 2 2 1 3 3"}
{ print $0
  print "Label1: " $2
  print "Label2: " $3
  print "Label3: " $4
  print "Label4: " $5
  print "Label5: " $6
  print "Label6: " $7
  print "Label7: " $8
  print "Label8: " $10
  print "Label9: " $11
}'

Or, with standard awk, use the substr(string,start,len) function, as in:

{
print $0
print "Label1: " substr($2,1,3)
print "Label2: " substr($2,4,2)
... etc.
}
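
Spelled out in full, the substr() approach might look like the sketch below. It assumes the original layout (XXX, six two-digit groups, a dot, QQQ, ZZZ) and that the name is the second whitespace-separated field. The function name label_fields is made up for illustration.

```shell
# label_fields [FILE...]: print each line, then its nine labelled pieces.
# Assumes the name is field 2 and follows the original XXX112233445566.QQQZZZ layout.
label_fields() {
    awk '{
        print $0
        print "Label1: " substr($2, 1, 3)
        print "Label2: " substr($2, 4, 2)
        print "Label3: " substr($2, 6, 2)
        print "Label4: " substr($2, 8, 2)
        print "Label5: " substr($2, 10, 2)
        print "Label6: " substr($2, 12, 2)
        print "Label7: " substr($2, 14, 2)
        print "Label8: " substr($2, 17, 3)   # position 16 is the dot
        print "Label9: " substr($2, 20, 3)
    }' "$@"
}

echo "00 XXX112233445566.QQQZZZ" | label_fields
```

Because awk splits on any run of blanks or tabs by default, the width of the leading number does not matter here.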

Janis

Re: Regular expression capture group in shell

on 16.04.2008 14:03:09 by PK

On Wednesday 16 April 2008 13:25, juhanay@gmail.com wrote:

> Hi,
> This should be an easy problem for you guys. I have a file which,
> for example, contains the following data:
>
> 00 XXX112233445566.QQQZZZ
>
> Now I just need to parse that last string,
> XXX112233445566.QQQZZZ, into the
> pieces XXX, 11, 22, 33, 44, 55, 66, QQQ and ZZZ and append that data to the
> end of the file with some labels. In the above example, letters stand for any
> character and digits for any digit. Each run of the same letter or digit
> is one capture group.
>
> A nice output file would look something like this:
> 00 XXX112233445566.QQQZZZ
> Label1: XXX
> Label2: 11
> Label3: 22
> Label4: 33
> Label5: 44
> Label6: 55
> Label7: 66
> Label8: QQQ
> Label9: ZZZ
>
> This would be easy with Perl regular expressions and capture groups (),
> but I don't know how to use capture groups with egrep. I cannot assume
> that a Perl interpreter is always present, so I would prefer to use
> shell tools if possible. The same goes for other tools as well.

You don't even need egrep. Assuming the file has the fixed structure you
show, this is with sed:

$ sed -n 'p;s/^[[:digit:]]*[[:space:]]*\(...\)\(..\)\(..\)\(..\)\(..\)\(..\)\(..\)\.\(...\)\(...\)/Label1: \1\nLabel2: \2\nLabel3: \3\nLabel4: \4\nLabel5: \5\nLabel6: \6\nLabel7: \7\nLabel8: \8\nLabel9: \9/;p' file.txt

Or with GNU awk:

awk '{print; print gensub(/^[[:digit:]]*[[:space:]]*(...)(..)(..)(..)(..)(..)(..)\.(...)(...)/,"Label1: \\1\nLabel2: \\2\nLabel3: \\3\nLabel4: \\4\nLabel5: \\5\nLabel6: \\6\nLabel7: \\7\nLabel8: \\8\nLabel9: \\9","g")}'
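
A note on portability: \n in the replacement part of s/// is a GNU sed extension, whereas a backslash followed by a real newline is the form POSIX describes. A sketch of the same substitution written that way follows. The function name label_with_sed is hypothetical, and it uses the p flag on s/// rather than a trailing ;p, so a line that does not match is printed only once.

```shell
# label_with_sed [FILE...]: print each line, then the labelled pieces if it matches.
# Newlines in the replacement are written as backslash + newline (POSIX form).
label_with_sed() {
    sed -n 'p
s/^[[:digit:]]*[[:space:]]*\(...\)\(..\)\(..\)\(..\)\(..\)\(..\)\(..\)\.\(...\)\(...\)/Label1: \1\
Label2: \2\
Label3: \3\
Label4: \4\
Label5: \5\
Label6: \6\
Label7: \7\
Label8: \8\
Label9: \9/p' "$@"
}

printf '00\tXXX112233445566.QQQZZZ\n' | label_with_sed
```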

--
All the commands are tested with bash and GNU tools, so they may use
nonstandard features. I try to mention when something is nonstandard (if
I'm aware of that), but I may miss something. Corrections are welcome.

Re: Regular expression capture group in shell

on 16.04.2008 14:17:36 by juhanay

Thanks for the replies.

I noticed that my original specification contained two small
mistakes. First of all, the 00 section can contain any number (e.g. 1,
12, 1234), not just two digits, but it will contain at least one digit.
The other mistake was in the last section, ZZZ, which actually contains four
characters: ZZZZ. Other than that, the structure is fixed.

Revised specification:

0+ XXX112233445566.QQQZZZZ

I would like your help in choosing the best, most universal way of
doing this. We might deploy this code to many Linux machines and we
would like to make it as portable as possible. So which one is better,
awk or sed? Or is there a third alternative?

Re: Regular expression capture group in shell

on 16.04.2008 14:19:54 by Maxwell Lol

juhanay@gmail.com writes:

> 00 XXX112233445566.QQQZZZ

> This would be easy with Perl regular expressions and capture groups (),
> but I don't know how to use capture groups with egrep.

Well, since you are editing it, you need sed - not egrep.

In general, you mark the regex with \( ......... \) and refer to it using \1, \2, \3, \4, etc.

So you can test this out using

sed 's/^.. \(...\)\(..\)\(..\)\(..\)\(..\)\(..\)\(..\)\.\(...\)\(...\)/ 1: \1 2: \2 3: \3 4: \4 5: \5 6: \6 7: \7 8: \8 9: \9/'

this outputs

1: XXX 2: 11 3: 22 4: 33 5: 44 6: 55 7: 66 8: QQQ 9: ZZZ

Perhaps we can make it look closer to your desired output with some tweaking.
Like this:

sed '
#duplicate the line first
/^.. \(...\)\(..\)\(..\)\(..\)\(..\)\(..\)\(..\)\.\(...\)\(...\)/ p
# now print out the modified line:
s/^.. \(...\)\(..\)\(..\)\(..\)\(..\)\(..\)\(..\)\.\(...\)\(...\)/ Label1: \1 \
Label2: \2 \
Label3: \3 \
Label4: \4 \
Label5: \5 \
Label6: \6 \
Label7: \7 \
Label8: \8 \
Label9: \9/
'

However, if you need a 10th field, you may have problems as \9 is the max AFAIK.
You can add "error checking" by replacing a "." with a "[a-zA-Z]" or "[0-9]".
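
As a sketch of that kind of error checking, the version below replaces each "." with a bracket expression and anchors the end of the line, so a malformed name produces no output at all. It assumes the letter groups really are ASCII letters (the original post only says "any character"). The function name strict_split and the helper variables a and d are made up for illustration.

```shell
# strict_split [FILE...]: print the nine pieces only for lines that match exactly.
strict_split() {
    a='\([A-Za-z][A-Za-z][A-Za-z]\)'   # one group of three letters (assumption)
    d='\([0-9][0-9]\)'                 # one group of two digits
    sed -n "s/^[0-9][0-9]*[[:space:]][[:space:]]*$a$d$d$d$d$d$d\\.$a$a\$/1: \\1 2: \\2 3: \\3 4: \\4 5: \\5 6: \\6 7: \\7 8: \\8 9: \\9/p" "$@"
}

echo "00 XXX112233445566.QQQZZZ" | strict_split
echo "00 XXX11223344556X.QQQZZZ" | strict_split   # malformed: prints nothing
```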

>I cannot assume
> that a perl-interpreter is always present so I would prefer to use
> shell tools if possible. Same goes for other tools as well.

There are also cut(1) and awk(1). Awk might be easier to read/maintain.
You can use the substr function to extract the fields.
Perhaps you can combine them - this might be easier.

cut --output-delimiter=" " -b 7-9,10-11,12-13,14-15,16-17,18-19,20-21,23-25,26-28 | \
awk ' {printf("Label1: %s\nLabel2: %s\nLabel3: %s\nLabel4: %s\nLabel5: %s\nLabel6: %s\nLabel7: %s\nLabel8: %s\nLabel9: %s\n", $1, $2, $3, $4, $5, $6, $7, $8, $9)}'

Or - with less word wrap

cut --output-delimiter=" " -b 7-9,10-11,12-13,14-15,16-17,18-19,20-21,23-25,26-28 | awk '
{

printf("Label1: %s \n", $1);
printf("Label2: %s \n", $2);
printf("Label3: %s \n", $3);
printf("Label4: %s \n", $4);
printf("Label5: %s \n", $5);
printf("Label6: %s \n", $6);
printf("Label7: %s \n", $7);
printf("Label8: %s \n", $8);
printf("Label9: %s \n", $9);
}'

This doesn't duplicate the input line however. You can do this with the shell.
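
One way to do that duplication in the shell is sketched below: read each line, echo it, then feed the same line through the cut and awk pipeline. It keeps the GNU-only --output-delimiter option from above and the fixed byte positions, so it assumes the name starts at byte 7. The function name print_then_label is hypothetical.

```shell
# print_then_label: read lines on stdin; print each one, then its labelled pieces.
# Relies on GNU cut (--output-delimiter) and on the name starting at byte 7.
print_then_label() {
    while IFS= read -r line; do
        printf '%s\n' "$line"        # the duplicate of the input line
        printf '%s\n' "$line" |
        cut --output-delimiter=" " -b 7-9,10-11,12-13,14-15,16-17,18-19,20-21,23-25,26-28 |
        awk '{ for (i = 1; i <= 9; i++) printf("Label%d: %s\n", i, $i) }'
    done
}

echo "00    XXX112233445566.QQQZZZ" | print_then_label
```

Starting two extra processes per line is slow for large inputs, but the file here holds a single line.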

Re: Regular expression capture group in shell

on 16.04.2008 14:49:39 by PK

On Wednesday 16 April 2008 14:17, juhanay@gmail.com wrote:

> Thanks for the replies.
>
> I noticed that my original specification contained two small
> mistakes. First of all, the 00 section can contain any number (e.g. 1,
> 12, 1234), not just two digits, but it will contain at least one digit.

I think my solutions take care of that. However, will there always be a
fixed number of characters between the start of the line and the first "label"
(i.e., XXX in your example)? If the answer is yes, FIELDWIDTHS-based
solutions (like Janis' first one) will probably be the most efficient ones,
although nonstandard.

> Another mistake was in the last section ZZZ, which contains four
> characters ZZZZ.

This is no big deal, just change the last parenthesized group to \(....\)
(if using sed) or to (....) (if using awk).

> Other than that, the structure is fixed.
>
> Revised specification:
>
> 0+ XXX112233445566.QQQZZZZ
>
> I would like your help in choosing the best, most universal way of
> doing this. We might deploy this code to many Linux machines and we
> would like to make it as portable as possible. So which one is better,
> awk or sed? Or is there a third alternative?

I think my sed solution and Janis' second awk solution (substr()-based) use
only standard features, so they should be quite portable. If you'll be
using Linux exclusively, however, it's very likely that you'll always have
GNU tools available, so non-standard solutions would work too.
Other solutions are of course possible, e.g. using cut, but they are probably
less efficient and more verbose.


Re: Regular expression capture group in shell

on 16.04.2008 15:04:09 by juhanay

Hi again
The file is created by output from the command du (the exact command
is given below). The idea is that there are many files with similar
names and we just pick the latest one (generated dynamically by
another program). Then we are interested in the file size (0+ field)
and filename of that latest file, which we parse into different
sections. QQQ-field is static and should always be the same.

du -sk `ls *.QQQ???? -t | head -1` >file.txt

After we have formed the file (with structure given above), then we
need to parse it.

So what is the best way then?
Thanks again for the help.

Re: Regular expression capture group in shell

on 16.04.2008 15:11:54 by Dave B

On Wednesday 16 April 2008 14:19, Maxwell Lol wrote:

> sed '
> #duplicate the line first
> /^.. \(...\)\(..\)\(..\)\(..\)\(..\)\(..\)\(..\)\.\(...\)\(...\)/ p

p

--
D.

Re: Regular expression capture group in shell

on 16.04.2008 15:41:26 by PK

On Wednesday 16 April 2008 15:04, juhanay@gmail.com wrote:

> Hi again
> The file is created by output from the command du (the exact command
> is given below). The idea is that there are many files with similar
> names and we just pick the latest one (generated dynamically by
> another program). Then we are interested in the file size (0+ field)
> and filename of that latest file, which we parse into different
> sections. QQQ-field is static and should always be the same.
>
> du -sk `ls *.QQQ???? -t | head -1` >file.txt
>
> After we have formed the file (with structure given above), then we
> need to parse it.
>
> So what is the best way then?

I assume the filesize would never exceed 9GB (which is probably the case).

For maximum portability, I'd use sed and awk substr-based solutions.

If you'll be using only Linux, and file.txt will be small, then any of the
proposed solutions would work just fine.

If file.txt is going to be very large, and you'll be using Linux, then
the FIELDWIDTHS awk solution will probably be the most efficient.


Re: Regular expression capture group in shell

on 16.04.2008 15:42:41 by mop2

Hi
I don't know the best way, but using the shell only, one way could be:

while IFS= read -r L; do
echo "$L"
echo "\
Label1: ${L:6:3}
Label2: ${L:9:2}
Label3: etc.
Label4:
Label5:
Label6:
Label7:
Label8:
Label9:
"
done
PS: I don't know which characters sit between 00 and XXX (a tab or n spaces),
and I assumed the line length is constant.

Is it feasible to do the parsing at the time that original file is
written?



juha...@gmail.com wrote:
> The file is created by output from the command du (the exact command
> is given below). The idea is that there are many files with similar
> names and we just pick the latest one (generated dynamically by
> another program). Then we are interested in the file size (0+ field)
> and filename of that latest file, which we parse into different
> sections. QQQ-field is static and should always be the same.
>
> du -sk `ls *.QQQ???? -t | head -1` >file.txt
>
> After we have formed the file (with structure given above), then we
> need to parse it.
>
> So what is the best way then?

Re: Regular expression capture group in shell

on 16.04.2008 15:48:44 by PK

On Wednesday 16 April 2008 15:41, pk wrote:

> I assume the filesize would never exceed 9GB (which is probably the case).

I mean the size of the *.QQQ???? file here.

Re: Regular expression capture group in shell

on 16.04.2008 15:49:35 by juhanay

Hi again and thanks a lot.
The typical size for *.QQQ???? is about 1-5 MB. What happens if the file size
exceeds 9GB? Anyway, we are mainly interested in small files, because
they are created by some error. If the file size is above some threshold,
then that file is OK.

One more question about the awk-based methods: how can I determine the
length of the first field (0+) for substr or for FIELDWIDTHS if I
do not know the file size in advance? In sed I could always define a
[[:digit:]]* field.


On 16 Apr, 16:48, pk wrote:
> On Wednesday 16 April 2008 15:41, pk wrote:
>
> > I assume the filesize would never exceed 9GB (which is probably the case).
>
> I mean the size of the *.QQQ???? file here.

Re: Regular expression capture group in shell

on 16.04.2008 16:08:06 by mop2

Ok, TAB is the separator:
$ du -sk `ls * -t | head -1` |xxd
0000000: 3130 3132 0930 3830 3332 3131 3533 3235 1012.08032115325
0000010: 322e 7761 760a 2.wav.
$

Now for field 1 with variable length.
Building on my previous message, I think this works with bash or ksh:

while IFS= read -r L; do
echo "$L"
L=${L#*$'\t'}
#L=${L##*$'\t'} # or this, for any number of TABs
echo "\
Label1: ${L:0:3}
Label2: ${L:3:2}
Label3: etc.
Label4:
Label5:
Label6:
Label7:
Label8:
Label9:
"
done
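
The ${L:0:3} substring form needs bash or ksh. If only a plain POSIX sh can be assumed, the pieces can be peeled off with the # and % expansions instead. The sketch below is one way to do that, assuming the revised layout (four characters in the last piece) and a TAB after the size. The function name label_line and its helper variables are invented for illustration.

```shell
# label_line LINE: print LINE, then the nine labelled pieces of the name after the TAB.
# Uses only POSIX parameter expansion; each pattern holds one ? per character to take.
label_line() {
    line=$1
    tab=$(printf '\t')
    rest=${line#*"$tab"}            # drop the size and the first TAB
    printf '%s\n' "$line"
    n=1
    for pat in '???' '??' '??' '??' '??' '??' '??' . '???' '????'; do
        after=${rest#$pat}          # unquoted on purpose: $pat is used as a glob
        piece=${rest%"$after"}      # the part the glob removed
        rest=$after
        [ "$pat" = . ] && continue  # the dot is skipped, not labelled
        printf 'Label%d: %s\n' "$n" "$piece"
        n=$((n + 1))
    done
}

printf '1012\tXXX112233445566.QQQZZZZ\n' |
while IFS= read -r L; do label_line "$L"; done
```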


juha...@gmail.com wrote:
> Hi again
> The file is created by output from the command du (the exact command
> is given below). The idea is that there are many files with similar
>
> du -sk `ls *.QQQ???? -t | head -1` >file.txt
>

Re: Regular expression capture group in shell

on 16.04.2008 16:27:00 by PK

On Wednesday 16 April 2008 15:49, juhanay@gmail.com wrote:

> Hi again and thanks a lot.
> The typical size for *.QQQ???? is about 1-5 MB. What happens if the file size
> exceeds 9GB?

I noticed that, on my system, du -sk reserves 7 digits for the file size, so,
given that it's the size in KB, if it exceeded 9999999 (approx. 10GB)
then the filename would need to be moved to the right accordingly. If the
file size doesn't exceed 10GB, then we can assume the filename always starts
at position 9 (after 7 for the size + 1 space).

> Anyway, we are mainly interested in small files, because
> they are created by some error. If the file size is above some threshold,
> then that file is OK.
>
> One more question about the awk-based methods: how can I determine the
> length of the first field (0+) for substr or for FIELDWIDTHS if I
> do not know the file size in advance? In sed I could always define a
> [[:digit:]]* field.

That's why I assumed filesize would never exceed 10GB, see above. If that is
the case, you can just use a field width of 8 for the first field (or
whatever value makes sense for your version of du; just try and see).


Re: Regular expression capture group in shell

on 16.04.2008 16:39:04 by PK

On Wednesday 16 April 2008 16:27, pk wrote:

> That's why I assumed filesize would never exceed 10GB, see above. If that
> is the case, you can just use a field width of 8 for the first field (or
> whatever value makes sense for your version of du; just try and see).

Argh, that is not true (thanks mop2).

du -sk inserts a tab, so you can't use FIELDWIDTHS, unless you "sanitize"
du's output before giving it to awk, eg

du -sk .... | expand | awk ...
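
To illustrate what expand does here: it pads the TAB with spaces up to the next multiple of 8, so for any size of up to 7 digits the name starts in column 9. The sketch below shows only that effect, using standard substr() instead of FIELDWIDTHS. The function name name_after_expand is hypothetical, and the printf lines stand in for real du output.

```shell
# name_after_expand: turn the TAB into spaces, then print everything from column 9.
# Holds only while the size has at most 7 digits; a longer size moves the name to column 17.
name_after_expand() {
    expand | awk '{ print substr($0, 9) }'
}

printf '1012\tXXX112233445566.QQQZZZZ\n' | name_after_expand
printf '7\tXXX112233445566.QQQZZZZ\n' | name_after_expand
```

With gawk, the same fixed start column is what would let a FIELDWIDTHS list begin with 8.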


Re: Regular expression capture group in shell

on 16.04.2008 17:25:07 by Janis Papanagnou

On 16 Apr., 14:17, juha...@gmail.com wrote:
> Thanks for the replies.
>
> I noticed that my original specification contained two small
> mistakes. First of all, the 00 section can contain any number (e.g. 1,
> 12, 1234), not just two digits, but it will contain at least one digit.
> The other mistake was in the last section, ZZZ, which actually contains four
> characters: ZZZZ. Other than that, the structure is fixed.
>
> Revised specification:
>
> 0+    XXX112233445566.QQQZZZZ
>
> I would like your help in choosing the best, most universal way of
> doing this. We might deploy this code to many Linux machines and we
> would like to make it as portable as possible. So which one is better,
> awk or sed? Or is there a third alternative?

Standard awk and standard sed are both acceptable; for non-trivial
tasks, sed is generally a lot less legible than awk.

For your new spec I suggest using my second (standard) awk solution
upthread...

awk '{ print $0
print "Label1: " substr($2,1,3)
print "Label2: " substr($2,4,2)
... etc.
}'
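
For the revised spec, one possible way to avoid nine near-identical print lines is to drive substr() from a list of widths. This is only a sketch: the width list encodes the revised layout (last piece four characters), the name is taken to be field 2, and the function name label_revised is made up.

```shell
# label_revised [FILE...]: print each line, then nine labelled pieces of field 2.
# Widths follow the revised layout XXX 11 22 33 44 55 66 . QQQ ZZZZ.
label_revised() {
    awk '
    BEGIN { n = split("3 2 2 2 2 2 2 3 4", width, " ") }
    {
        print $0
        pos = 1
        for (i = 1; i <= n; i++) {
            print "Label" i ": " substr($2, pos, width[i])
            pos += width[i]
            if (i == 7) pos++     # step over the "." before QQQ
        }
    }' "$@"
}

printf '1012\tXXX112233445566.QQQZZZZ\n' | label_revised
```

Since the default field splitting treats the TAB from du like any other blank, the length of the size field never enters into it.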


Janis