big table / hadoop / map reduce

on 22.10.2010 16:14:04 by Artur Ejsmont

Hi there guys and girls,

Has anyone come across any reasonable explanations / articles on how
Hadoop and MapReduce work in practice?

I have read a few articles now and then, and I must say I am puzzled
... am I stupid, or can they just not find an easy way to explain it? :P

What I would hope for is an explanation based on a simple example
application, preferably with some code samples.

Anyone good at it here?

Cheers

--
PHP Database Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php

Re: big table / hadoop / map reduce

on 22.10.2010 16:29:12 by andresmontanez

Hi Artur,

Here is an article on Wikipedia: http://en.wikipedia.org/wiki/MapReduce

And here are the native implementations in PHP:
http://www.php.net/manual/en/function.array-map.php
http://www.php.net/manual/en/function.array-reduce.php
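
For readers who want to see the two building blocks on their own, here is a minimal single-machine sketch in Python, whose built-in `map` and `functools.reduce` play roughly the same role as the PHP functions linked above. The word list is made up for illustration, and nothing here is distributed; as far as I know, a framework like Hadoop adds a grouping-by-key step between the two phases and spreads the work over many machines.

```python
from functools import reduce

# Hypothetical input data, chosen only for illustration.
words = ["big", "table", "hadoop", "map", "reduce"]

# "map": apply the same function to every element independently.
lengths = list(map(len, words))

# "reduce": fold the mapped values into a single result.
total = reduce(lambda acc, n: acc + n, lengths, 0)

print(lengths, total)
```

Because each element is mapped independently, the map step is the part that can be split across workers without coordination.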

The basic idea is to gather a lot of data from several nodes and
"map" it together;
then, assuming a lot of this data is repeated across the dataset, we
"reduce" it.

Cheers.

--
Andrés G. Montañez
Zend Certified Engineer
Montevideo - Uruguay

Re: big table / hadoop / map reduce

on 22.10.2010 17:27:30 by Artur Ejsmont

Hehe ... sorry, but this does not help :-) I can google for Wikipedia
definitions.

I was hoping for some really good articles/examples that would put it
into enough context. I would like to have a good idea of when it could
be useful.

So far I have had no luck with that. It's like with design patterns ...
people who don't understand them should not write articles trying to
explain them to others :P

Art

--
Visit me at:
http://artur.ejsmont.org/blog/

Re: big table / hadoop / map reduce

on 22.10.2010 17:49:22 by andresmontanez

Imagine you have to keep track of some kind of traffic, for example
"ad impressions".
Let's suppose that you have millions of those hits; you will need a few
servers to receive the impression notifications.

At the end of the day, you will have that info spread across a bunch of
servers; typically you will have a record for each impression
indicating the identifier (id) of the ad.

For this info to become useful, you will have to aggregate it, for
example to find out which ad has the most impressions.
You will have to iterate over all servers and MAP the info into one
place; now that you have all the info, you will have to REDUCE it, so
that you end up with one record per ad identifier indicating the TOTAL
impressions for that day.

That's the basic idea. It's like the aftermath of "Divide and Conquer".
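
The ad-impression scenario can be sketched in a few lines of Python. This is only an illustration under assumptions: the server names, ad ids and the helper names `map_impression` and `reduce_counts` are all invented, and everything runs in one process instead of on several machines. Note that in the usual MapReduce formulation the map step turns each record into a (key, value) pair and a separate grouping step brings equal keys together before reduce runs.

```python
from collections import defaultdict

# Hypothetical per-server logs: each entry is the id of the ad that was shown.
server_logs = {
    "server-a": ["ad-1", "ad-2", "ad-1"],
    "server-b": ["ad-2", "ad-1", "ad-3"],
}

def map_impression(ad_id):
    """Map step: turn one log record into a (key, value) pair."""
    return (ad_id, 1)

def reduce_counts(ad_id, counts):
    """Reduce step: collapse all values for one key into a total."""
    return (ad_id, sum(counts))

# Map every record from every server.
pairs = [map_impression(ad_id)
         for log in server_logs.values()
         for ad_id in log]

# Group the pairs by key so that each ad id is reduced exactly once.
grouped = defaultdict(list)
for ad_id, count in pairs:
    grouped[ad_id].append(count)

# One record per ad id with the total impressions of the day.
totals = dict(reduce_counts(ad_id, counts) for ad_id, counts in grouped.items())
top_ad = max(totals, key=totals.get)

print(totals, top_ad)
```

With this sample data the ad with the most impressions is `ad-1`, shown three times across both servers.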

Hope this will be useful.

Cheers.

Re: big table / hadoop / map reduce

on 30.10.2010 19:51:53 by Artur Ejsmont

Sure, that was a bit more helpful, thanks :)

I was still wondering what other use cases it would apply to. This
is a good article (best so far, I guess):
http://code.google.com/edu/parallel/mapreduce-tutorial.html

The thing is that reduce has to aggregate data or it would be
impractical. So I am trying to see more examples to fully understand
the limitations of the method.

Let's say I want to find the top 10 IP addresses in an access log:
- split the log into small files
- I take one fragment (one file)
- a worker maps it to a list of <ip, count> pairs
- before reduce is called, the data is sorted by IP
- reduce makes <ip, totalCountPerFile>

So I have a bunch of files with aggregated lists of
<ip, totalCountPerFile>. But then would it not have to be merged across all
results again, with another sort/reduce call? Or, to avoid that, do I
need the initial data to be already clustered so that one IP appears
only in one chunk file?
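
One common answer to the question above is a partitioning ("shuffle") step between map and reduce: every mapped pair is routed by its key, so all counts for one IP land at the same reducer no matter which chunk they came from, and the input does not need to be clustered beforehand. Below is a rough single-process sketch of that idea. The log lines, `NUM_REDUCERS`, `CHUNK_SIZE` and the helper names are assumptions made up for illustration, and the IP is assumed to be the first field of each line.

```python
from collections import Counter, defaultdict
import zlib

# Hypothetical access-log lines; the IP is assumed to be the first field.
log_lines = [
    "10.0.0.1 GET /", "10.0.0.2 GET /a", "10.0.0.1 GET /b",
    "10.0.0.3 GET /", "10.0.0.1 GET /c", "10.0.0.2 GET /",
]
NUM_REDUCERS = 2  # assumed number of reduce workers
CHUNK_SIZE = 2    # assumed number of lines per input chunk

def split(lines, size):
    """Split the log into small chunks, one per map task."""
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def map_chunk(chunk):
    """Map: emit (ip, 1) for every line in one chunk."""
    return [(line.split()[0], 1) for line in chunk]

def partition(ip):
    """Shuffle: a stable hash decides which reducer owns an IP."""
    return zlib.crc32(ip.encode()) % NUM_REDUCERS

# Route every mapped pair to the reducer that owns its key.
buckets = defaultdict(list)
for chunk in split(log_lines, CHUNK_SIZE):
    for ip, one in map_chunk(chunk):
        buckets[partition(ip)].append((ip, one))

# Reduce: each reducer sums its own keys; no IP appears in two buckets.
totals = Counter()
for pairs in buckets.values():
    local = Counter()
    for ip, one in pairs:
        local[ip] += one
    totals.update(local)

top_10 = totals.most_common(10)
print(top_10)
```

The final top-10 selection here runs over the already aggregated totals, which are much smaller than the raw log; in a real cluster that last step could be a second, small pass.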

Does it make sense?

As I said, I am still trying to figure out how it should be applied and
when ... also how to transform problems to make it still work :)

I want to write some simple map reduce like the one above just to see
it working and play around a bit :)

Cheers

Art

Re: big table / hadoop / map reduce

on 30.10.2010 19:58:35 by andresmontanez

Hi Artur,
in your IP example, let's suppose you have ten access log files (from
ten different servers);
there you already have the mapping part done.

Then you reduce each log into another new file, listing each IP
address and the number of times it is repeated.
At this stage you have a reduced version of each log file; then you
need to map them into a single new file,
which will be the merge of all the reduced versions of the log files.
You will need to reduce this single file again, and then you
will have one file with all the
IP addresses and the number of times they appear.

There is no limit on the number of times you can call map and reduce.
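
The two-pass approach described above could look roughly like this in Python. It is a sketch under assumptions: the three logs and their contents are invented, each log is already reduced to a bare list of IPs, and `collections.Counter` stands in for the per-file result files.

```python
from collections import Counter
from functools import reduce

# Hypothetical access logs from different servers, already cut down to IPs.
logs = [
    ["10.0.0.1", "10.0.0.2", "10.0.0.1"],
    ["10.0.0.2", "10.0.0.3"],
    ["10.0.0.1", "10.0.0.2", "10.0.0.1"],
]

# First pass: reduce each log to its own {ip: count} table.
per_file = [Counter(log) for log in logs]

# Second pass: merge the reduced tables and reduce them again into one.
merged = reduce(lambda a, b: a + b, per_file, Counter())

top_10 = merged.most_common(10)
print(top_10)
```

Reducing in stages gives the same totals as counting everything at once because addition is associative and commutative; an aggregate without that property would need more care.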

Cheers.

Re: big table / hadoop / map reduce

on 30.10.2010 20:14:49 by Artur Ejsmont

Yeah, I think that would make sense.

If you find more good examples from different areas, let me know ... I
think I get the basic idea ... will try to apply it some time :)

Cheers :)

Art
