Send one stream down two pipes

on 05.04.2008 02:54:23 by jak

It's been a while since I read c.u.s ... I puzzled for a while over
this problem, maybe the solution will be useful to someone besides me.

Say you want to send one stream down two pipes, and process each one
independently. Like having two independent pointers on the same input
stream. I could not discover any such thing in bash, so I worked it
out using FIFOs and tee.
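As a minimal, self-contained sketch of that pattern (generic consumers standing in for sa-learn, made-up FIFO names), tee can feed two FIFOs whose readers run concurrently:

```shell
# Hypothetical demo: one stream, two pipes, two concurrent consumers.
td=$(mktemp -d)
mkfifo "$td/p1" "$td/p2"

wc -l < "$td/p1" > "$td/c1" &   # consumer 1: counts lines
wc -c < "$td/p2" > "$td/c2" &   # consumer 2: counts bytes

# tee copies stdin into the first FIFO while its stdout goes to the second.
seq 1 5 | tee "$td/p1" > "$td/p2"
wait

lines=$(cat "$td/c1")           # 5 lines
bytes=$(cat "$td/c2")           # 10 bytes ("1\n" .. "5\n")
rm -rf "$td"
```

Note that both readers here consume at the same time; the sequencing problem discussed below only appears when one reader must wait for the other to finish.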

In the example below, I use find to build a list of mailbox files for
input to sa-learn, and then I truncate the files. I want to process
all the files with a single call of sa-learn (using xargs), so the
first stream has to be completely processed before the second starts.

That was tricky, but I solved it using nested subshells and wait.

Stephane probably has some better way to do this, but here is what I
worked out. :-)


#!/bin/bash

set -B -e +h -u -o pipefail; shopt -s extglob nullglob

pushd . > /dev/null
cd ~/temp

td=`mktemp -d`
mkfifo "$td/3"
mkfifo "$td/4"

( (
    exec < "$td/3"
    xargs -0 -r sa-learn --spam --mbox
  ) &
  exec < "$td/4"
  wait
  while read -d $'\0'; do
    cp /dev/null "$REPLY"
  done
) &

find . -type f -size +0c \( \
-wholename './UCE/*' -o \
-name 'uce.*' \
\) -printf '%P\0' |
sort -z | tee "$td/3" "$td/4" > /dev/null

wait
rm -rf "$td"

popd > /dev/null


--
Internet service
http://www.isp2dial.com/

Re: Send one stream down two pipes

on 05.04.2008 07:33:49 by Dan Stromberg

On Sat, 05 Apr 2008 00:54:23 +0000, www.isp2dial.com wrote:

> It's been a while since I read c.u.s ... I puzzled for a while over
> this problem, maybe the solution will be useful to someone besides me.
>
> Say you want to send one stream down two pipes, and process each one
> independently. Like having two independent pointers on the same input
> stream. I could not discover any such thing in bash, so I worked it out
> using FIFOs and tee.
>
> In the example below, I use find to build a list of mailbox files for
> input to sa-learn, and then I truncate the files. I want to process all
> the files with a single call of sa-learn (using xargs), so the first
> stream has to be completely processed before the second starts.
>
> That was tricky, but I solved it using nested subshells and wait.
>
> Stephane probably has some better way to do this, but here is what I
> worked out. :-)
>
> [script snipped]

IMO, mtee is a lot simpler (at least in one's shell code):

http://stromberg.dnsalias.org/~strombrg/mtee.html

...but ptee probably would've been a better name than mtee. :)

Re: Send one stream down two pipes

on 05.04.2008 11:40:25 by PK

(sorry to reply to Dan, but the original message did not arrive at my NNTP
service)

Dan Stromberg wrote:

>> #!/bin/bash
>>
>> set -B -e +h -u -o pipefail; shopt -s extglob nullglob
>>
>> pushd . > /dev/null
>> cd ~/temp
>>
>> td=`mktemp -d`
>> mkfifo "$td/3"
>> mkfifo "$td/4"
>>
>> ( (
>> exec < "$td/3"
>> xargs -0 -r sa-learn --spam --mbox
>> ) &
>> exec < "$td/4"
>> wait
>> while read -d $'\0'; do
>> cp /dev/null "$REPLY"
>> done
>> ) &
>>
>> find . -type f -size +0c \( \
>> -wholename './UCE/*' -o \
>> -name 'uce.*' \
>> \) -printf '%P\0' |
>> sort -z | tee "$td/3" "$td/4" > /dev/null
>>
>> wait
>> rm -rf "$td"
>>
>> popd > /dev/null

But since you need the first pipeline to finish anyway before starting the
second, why not just do something like

find .... | sort -z | tee tmpfile | xargs -0 -r sa-learn --spam --mbox

and then:

while read -d $'\0'; do
cp /dev/null "$REPLY"
done < tmpfile

Are there specific reasons you need to do things the way you did?
Just curious...
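For concreteness, the two-pass tmpfile approach can be sketched with generic stand-ins (wc in place of sa-learn, made-up data):

```shell
tmp=$(mktemp)
# Pass 1: consume the NUL-delimited stream while tee saves a copy.
n1=$(printf 'one\0two\0three\0' | tee "$tmp" | tr '\0' '\n' | wc -l)
# Pass 2 starts only after the first pipeline has fully finished.
n2=$(tr '\0' '\n' < "$tmp" | wc -l)
rm -f "$tmp"
```

Both passes see the same three items; the tmpfile is the unbounded buffer that the FIFO version lacks.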

--
All the commands are tested with bash and GNU tools, so they may use
nonstandard features. I try to mention when something is nonstandard (if
I'm aware of that), but I may miss something. Corrections are welcome.

Re: Send one stream down two pipes

on 05.04.2008 13:19:40 by jak

On Sat, 05 Apr 2008 11:40:25 +0200, pk wrote:

>(sorry to reply to Dan, but the original message did not arrive at my NNTP
>service)

>But since you need the first pipeline to finish anyway before starting the
>second, why not just do something like
>
>find .... | sort -z | tee tmpfile | xargs -0 -r sa-learn --spam --mbox
>
>and then:
>
>while read -d $'\0'; do
> cp /dev/null "$REPLY"
>done < tmpfile
>
>Are there specific reasons you need to do things the way you did?
>Just curious...

Yours is a good solution, if you don't mind using a tmpfile. But I
wanted to avoid writing any data into the filesystem, since it's only
transient data.

I first tried using internal pipes, but could not work it out that
way, so FIFOs were my next idea.

I was also trying to discover some general method of writing one
stream to multiple pipes, and processing them in a synchronized
sequence. There have been other times when I needed to do something
like that, but I've never known how, until now.



Re: Send one stream down two pipes

on 05.04.2008 13:34:44 by jak

On Sat, 05 Apr 2008 05:33:49 GMT, Dan Stromberg
wrote:

>IMO, mtee is a lot simpler (at least in one's shell code) :
>
>http://stromberg.dnsalias.org/~strombrg/mtee.html
>
>...but ptee probably would've been a better name than mtee. :)

I looked at your web page, but when I saw "Python" I ran away.

Not that I think there's anything wrong with Python. I'm just too old
to learn any big tricks. Small tricks are all I can do. ;-)



Re: Send one stream down two pipes

on 05.04.2008 17:41:51 by spcecdt

In article <67idv3p5ljacecm4cu2k5mpolumh8sfhe9@4ax.com>,
www.isp2dial.com wrote:
>It's been a while since I read c.u.s ... I puzzled for a while over
>this problem, maybe the solution will be useful to someone besides me.
>
>Say you want to send one stream down two pipes, and process each one
>independently. Like having two independent pointers on the same input
>stream. I could not discover any such thing in bash, so I worked it
>out using FIFOs and tee.
>
>In the example below, I use find to build a list of mailbox files for
>input to sa-learn, and then I truncate the files. I want to process
>all the files with a single call of sa-learn (using xargs), so the
>first stream has to be completely processed before the second starts.
>
>That was tricky, but I solved it using nested subshells and wait.
>
>Stephane probably has some better way to do this, but here is what I
>worked out. :-)
>
>
>#!/bin/bash
>
>set -B -e +h -u -o pipefail; shopt -s extglob nullglob
>
>pushd . > /dev/null
>cd ~/temp
>
>td=`mktemp -d`
>mkfifo "$td/3"
>mkfifo "$td/4"
>
>( (
> exec < "$td/3"
> xargs -0 -r sa-learn --spam --mbox
>) &
> exec < "$td/4"
> wait
> while read -d $'\0'; do
> cp /dev/null "$REPLY"
> done
>) &
>
>find . -type f -size +0c \( \
> -wholename './UCE/*' -o \
> -name 'uce.*' \
> \) -printf '%P\0' |
>sort -z | tee "$td/3" "$td/4" > /dev/null
>
>wait
>rm -rf "$td"
>
>popd > /dev/null

If the list of filenames comes to more text than will fit in the combined
buffers of tee and the named pipe, it looks to me like the above will stall
with tee trying to write to the second pipe while the second reader is
deferring reading until the first completes.
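The flip side is that deferred reading works fine as long as the data fits in the kernel's pipe buffer. A hypothetical demo (the read-write open of a FIFO relies on Linux behavior):

```shell
td=$(mktemp -d)
mkfifo "$td/p"
exec 3<>"$td/p"        # hold the FIFO open at both ends
seq 1 100 >&3          # ~292 bytes: fits in the buffer, so the write returns
n=$(head -n 100 <&3 | wc -l)   # read it all back afterwards, sequentially
exec 3>&-
rm -rf "$td"
```

Here n is 100; scale the seq up past the buffer size and the write would block instead, which is exactly the stall described above.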

John
--
John DuBois spcecdt@armory.com KC6QKZ/AE http://www.armory.com/~spcecdt/

Re: Send one stream down two pipes

on 05.04.2008 17:56:58 by PK

www.isp2dial.com wrote:

> >Are there specific reasons you need to do things the way you did?
> >Just curious...
>
> Yours is a good solution, if you don't mind using a tmpfile. But I
> wanted to avoid writing any data into the filesystem, since it's only
> transient data.
>
> I first tried using internal pipes, but could not work it out that
> way, so FIFOs were my next idea.
>
> I was also trying to discover some general method of writing one
> stream to multiple pipes, and processing them in a synchronized
> sequence. There have been other times when I needed to do something
> like that, but I've never known how, until now.

Ok, your last sentence got me curious, and I tried writing something to do
what you want in a general way. The goal: synchronize multiple subshells
(i.e., make them run one after another) without using temporary files,
using FIFOs instead, as you did (no doubt something that someone has
already done better than me, but anyway... trying to learn something
new).
Here was my first try (with bash, if that matters):

$ cat sync.sh
#!/bin/bash

# FIFOs for synchronization
mkfifo f1 f2 f3
# FIFOs for data
mkfifo d1 d2 d3

(read < f1
# ...do something reading data from d1...
while read n; do printf "Job1, reading %d\n" "$n"; done < d1
echo "a" > f2 ) &

(read < f2
# ...do something reading data from d2...
while read n; do printf "Job2, reading %d\n" "$n"; done < d2
echo "a" > f3 ) &

(read < f3
# ...do something reading data from d3...
while read n; do printf "Job3, reading %d\n" "$n"; done < d3
) &

# main program

# start the 1st job
echo "a" > f1
# produce data
seq 1 10000 | tee d1 d2 d3 > /dev/null

wait
rm {f,d}{1,2,3}
-----------------

This hangs as soon as it's started, I suppose because tee blocks when
opening d2 and d3, which no one has open for reading.
So I then changed the line that produces data as follows:

# produce data
seq 1 10000 | tee d1 | tee d2 | tee d3 > /dev/null

This actually seems to work (parts of output omitted):

$ ./sync.sh
Job1, reading 1
Job1, reading 2
....
Job1, reading 9999
Job1, reading 10000
Job2, reading 1
Job2, reading 2
....
Job2, reading 9999
Job2, reading 10000
Job3, reading 1
Job3, reading 2
....
Job3, reading 9999
Job3, reading 10000


The only caveat is that the "tee"s after the first one block until the first
job has completed, so all the data accumulates in the shell pipeline
between "tee d1" and "tee d2", and if you have more data than that shell
pipe can contain, the script will hang. To prove this, try producing a lot
more data, ie

# produce data
seq 1 100000 | tee d1 | tee d2 | tee d3 > /dev/null

This starts, but hangs somewhere around "Job1, reading 12773".
"seq 1 12773 | wc -c" gives 65532 bytes, suggesting that, on my
system, a pipe can only hold about 64 KB, which seems to make sense
for the above.
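The capacity can also be measured directly rather than inferred. A sketch, assuming GNU dd (for oflag=nonblock) and Linux FIFO semantics:

```shell
td=$(mktemp -d)
mkfifo "$td/p"
exec 3<>"$td/p"   # read-write open: gives the FIFO a reader, so the
                  # non-blocking write open below succeeds immediately
# Write 1 byte at a time until the kernel buffer is full (EAGAIN);
# dd then stops and reports how many 1-byte writes succeeded.
filled=$(dd if=/dev/zero of="$td/p" bs=1 oflag=nonblock 2>&1 |
         awk -F'[+ ]' '/records out/ { print $1 }')
exec 3>&-
rm -rf "$td"
echo "$filled"    # 65536 on a typical Linux system
```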

Thanks for any insight (and also suggestions/corrections about anything in
the above are welcome).


Re: Send one stream down two pipes

on 05.04.2008 22:19:24 by jak

On Sat, 05 Apr 2008 10:41:51 -0500, spcecdt@armory.com (John DuBois)
wrote:

>>sort -z | tee "$td/3" "$td/4" > /dev/null

>If the list of filenames comes to more text than will fit in the combined
>buffers of tee and the named pipe, it looks to me like the above will stall
>with tee trying to write to the second pipe while the second reader is
>deferring reading until the first completes.

Good observation, I didn't think of that.

Testing a large dataset shows that it deadlocks at about 64k.

:-(



Re: Send one stream down two pipes

on 05.04.2008 22:34:22 by jak

On Sat, 05 Apr 2008 17:56:58 +0200, pk wrote:

># produce data
>seq 1 100000 | tee d1 | tee d2 | tee d3 > /dev/null
>
>This starts, but hangs somewhere around "Job1, reading 12773".
>"seq 1 12773 | wc -c" gives 65532 bytes, something suggesting that, on my
>system, a bash pipe can only hold 65535 bytes, which seems to make sense
>for the above.

Yes, as John observed, buffer limits and deadlock. Maybe writing a
tmpfile into the filesystem is not so bad after all. I need to think
harder about this ...



Re: Send one stream down two pipes

on 07.04.2008 08:50:42 by jak

On Sat, 05 Apr 2008 10:41:51 -0500, spcecdt@armory.com (John DuBois)
wrote:

>If the list of filenames comes to more text than will fit in the combined
>buffers of tee and the named pipe, it looks to me like the above will stall
>with tee trying to write to the second pipe while the second reader is
>deferring reading until the first completes.

I fixed it with an extra sort between the two pipes. I tested this on
my /usr directory and it produced a 2.6 meg output. I guess sort will
take as much memory as it needs (plus temp files).

I also fixed bugs. Testing revealed that the outer subshell FIFO open
would return with an interrupted system call when I had zero input, if
the inner subshell finished before the writer to the second pipe could
open it for output. An obscure race condition.

So now FIFO 4 has a non-blocking open, and always falls through to
the wait. That's the reason for all the redirections on FIFO 4. This
relies on exec 4<> on a FIFO succeeding even in blocking mode. I hear
that's a Linux-only trick; maybe someone else knows about other OSes ...


#!/bin/bash

set -B -e +h -u -o pipefail; shopt -s extglob nullglob

pushd . > /dev/null
cd ~/temp

td=`mktemp -d`
mkfifo "$td/3"
mkfifo "$td/4"

exec 4<>"$td/4" 4>"$td/4"

( (
    exec <"$td/3"
    xargs -0 -r sa-learn --spam --mbox
  ) &
  exec <"$td/4" 4>&-
  wait $!
  while read -d $'\0'; do
    cp /dev/null "$REPLY"
  done
) &

find . -type f -size +0c \( \
-wholename './UCE/*' -o -name 'uce.*' \
\) -printf '%P\0' |
sort -z | tee "$td/3" | sort -z >&4

exec 4>&-
wait $!
rm -rf "$td"

popd > /dev/null
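The exec 4<> trick can be seen in isolation (Linux behavior; POSIX leaves a read-write open of a FIFO unspecified):

```shell
td=$(mktemp -d)
mkfifo "$td/f"
exec 4<>"$td/f"   # returns at once, unlike exec 4>"$td/f", which would
                  # block until some other process opened the FIFO to read
echo hello >&4    # the write succeeds: this process is its own reader
read -r line <&4  # and the data can be read back from the same fd
exec 4>&-
rm -rf "$td"
```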



Re: Send one stream down two pipes

on 07.04.2008 18:18:28 by jak

On Mon, 07 Apr 2008 06:50:42 +0000, www.isp2dial.com
wrote:

>So now FIFO 4 now has a non blocking open

And if you don't want to feed an unbounded sequence to xargs, the
version below will split your stream into substreams of 10 items each
((++tx > 9)).

The downside is spawning a new set of subshells for each batch. Of
course you could make the batch larger, say 100 instead of 10.

I thought of eliminating the extra spawns by using coprocesses that
synchronize each other via signals, or maybe a pair of control-channel
FIFOs. I started coding it with signals, but I don't see how to
eliminate all race conditions, given the limited facilities of bash.

The main problem is where pid A sends its signal to pid B, and then
sleeps in a wait, waiting for a response from B. But if B responds
before A can sleep in the wait, pid A will never wake up. I need an
atomic { send-signal; wait; } where incoming signals are blocked. It
seems unlikely that putting them inside shell braces would make them
atomic in terms of blocking signals.

Thoughts?
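One race-free alternative along the lines of the control-channel FIFOs mentioned above: a FIFO read, unlike a signal, cannot be "missed", because the message sits in the pipe until the peer reads it. A hypothetical handshake sketch:

```shell
td=$(mktemp -d)
mkfifo "$td/ab" "$td/ba"

(
  read -r msg < "$td/ab"    # B: block until A says go
  echo "done" > "$td/ba"    # B: reply; blocks until A opens for reading
) &

echo "go" > "$td/ab"        # A: tell B to start
read -r reply < "$td/ba"    # A: wait for B; the reply cannot be lost
wait
rm -rf "$td"
```

Unlike kill/wait, there is no window where the reply arrives before the waiter is ready: the open and read on the FIFO are themselves the rendezvous.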



#!/bin/bash

set -B -e +h -u -o pipefail; shopt -s extglob nullglob

pushd . > /dev/null
cd ~/temp

td=`mktemp -d`
mkfifo "$td/3"
mkfifo "$td/4"

pipes () {

  exec 4<>"$td/4" 4>"$td/4"

  ( (
      exec <"$td/3"
      xargs -0 -r sa-learn --spam --mbox
    ) &
    exec <"$td/4" 4>&-
    wait $!
    while read -d $'\0'; do
      cp /dev/null "$REPLY"
    done
  ) &

  tee "$td/3" | sort -z >&4

  exec 4>&-
  wait $!

}

find . -type f -size +0c \( \
-wholename './UCE/*' -o -name 'uce.*' \
\) -printf '%P\0' |
sort -z | {

  ts=''; declare -i tx=0
  while read -d $'\0'; do
    test -z "$REPLY" && continue
    ts="$ts$REPLY\0"
    if ((++tx > 9)); then
      echo -en "$ts" | pipes
      ts=''; tx=0
    fi
  done
  ((tx)) && { echo -en "$ts" | pipes; } || :

}

rm -rf "$td"

popd > /dev/null



Re: Send one stream down two pipes

on 12.04.2008 22:10:09 by Dan Stromberg

On Sat, 05 Apr 2008 11:34:44 +0000, www.isp2dial.com wrote:

> On Sat, 05 Apr 2008 05:33:49 GMT, Dan Stromberg
> wrote:
>
>>IMO, mtee is a lot simpler (at least in one's shell code) :
>>
>>http://stromberg.dnsalias.org/~strombrg/mtee.html
>>
>>...but ptee probably would've been a better name than mtee. :)
>
> I looked at your web page, but when I saw "Python" I ran away.
>
> Not that I think there's anything wrong with Python. I'm just too old
> to learn any big tricks. Small tricks are all I can do. ;-)

FWIW, you wouldn't need to learn python to use the script from shell, any
more than you'd need to learn C to use cut.