problem w/ duplicates

on 08.07.2006 02:53:26 by mark

Let's say I have a table of users, and each user has a list of
categories. I could store each user's categories as TEXT with
delimiters like "cat1|cat2|cat3"

But then I need to be able to get a full list of everyone's categories,
without duplicates. Retrieving all the categories, exploding them, and
then removing the duplicates is a bit slow. Is there a better method?
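For concreteness, the explode-and-dedupe approach described there looks roughly like this (plain Python; the inlined rows are just an assumption about the stored format). The cost is that every request has to fetch and split every row before it can start removing duplicates:

```python
# Hypothetical rows as they'd come back from a TEXT column of
# delimiter-joined categories -- one blob per user.
rows = [
    "life|work|web",        # user 0's categories
    "life|work|starcraft",  # user 1's categories
]

# "Explode" each blob on the delimiter and collect into a set,
# which discards duplicates as a side effect.
unique = set()
for blob in rows:
    unique.update(blob.split("|"))

print(sorted(unique))  # ['life', 'starcraft', 'web', 'work']
```

The work grows with the total amount of text stored, not with the number of distinct categories, which is why it feels slow.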

Re: problem w/ duplicates

on 08.07.2006 15:15:43 by Rich R

"Mark" wrote in message
news:1152320006.538460.135940@s13g2000cwa.googlegroups.com...
> Let's say I have a table of users, and each user has a list of
> categories. I could store each user's categories as TEXT with
> delimiters like "cat1|cat2|cat3"
>
> But then I need to be able to get a full list of everyone's categories,
> without duplicates. Retrieving all the categories, exploding them, and
> then removing the duplicates is a bit slow. Is there a better method?
>

Your solution violates 1st normal form and leads to many problems. Create a
table with 2 columns: userid, categoryid, and make the these columns the
key.
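A rough sketch of that two-column link table, using Python's `sqlite3` purely as a stand-in for whatever engine you're on (table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_category (
        userid     INTEGER NOT NULL,
        categoryid INTEGER NOT NULL,
        PRIMARY KEY (userid, categoryid)  -- composite key: both columns together
    )
""")
conn.executemany(
    "INSERT INTO user_category VALUES (?, ?)",
    [(0, 1), (0, 2), (1, 1), (1, 3)],
)

# The composite key rejects a repeated (userid, categoryid) pair outright,
# and the "everyone's categories without duplicates" question becomes one query:
cats = [row[0] for row in conn.execute(
    "SELECT DISTINCT categoryid FROM user_category ORDER BY categoryid")]
print(cats)  # [1, 2, 3]
```

The de-duplication now happens inside the database rather than in application code.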

Rich

Re: problem w/ duplicates

on 09.07.2006 09:33:35 by mark

Rich Ryan wrote:
> "Mark" wrote in message
> news:1152320006.538460.135940@s13g2000cwa.googlegroups.com...
> > [original question snipped]
>
> Your solution violates 1st normal form and leads to many problems. Create a
> table with 2 columns: userid, categoryid, and make the these columns the
> key.
>
> Rich

make what the key? the "these" columns?? i don't understand what you
mean.

and if i did it that way, wouldn't there be a lot of redundant data?

userid | categories
0 | life
0 | work
0 | web
1 | life
1 | work
1 | starcraft
2 | work
2 | starcraft
2 | programming

something like that? i mean, i guess it works, but it seems like
wasted space. i thought there might be a way to group everyone who has
the same category.
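Even with exactly that two-column layout, both halves of the question stop needing any application-side string work. A sketch (again `sqlite3` as a stand-in, data copied from the table above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_category (userid INTEGER, category TEXT)")
conn.executemany("INSERT INTO user_category VALUES (?, ?)", [
    (0, "life"), (0, "work"), (0, "web"),
    (1, "life"), (1, "work"), (1, "starcraft"),
    (2, "work"), (2, "starcraft"), (2, "programming"),
])

# The database removes the duplicates; no exploding in application code.
all_cats = [row[0] for row in conn.execute(
    "SELECT DISTINCT category FROM user_category ORDER BY category")]
print(all_cats)  # ['life', 'programming', 'starcraft', 'web', 'work']

# And "everyone who has the same category" is a simple WHERE:
workers = [row[0] for row in conn.execute(
    "SELECT userid FROM user_category WHERE category = 'work' ORDER BY userid")]
print(workers)  # [0, 1, 2]
```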

Re: problem w/ duplicates

on 09.07.2006 15:36:48 by zac.carey

users
| userID | user | (userID is Primary)
| 01 | john |
| 02 | paul |
| 03 | george |
| 04 | ringo |

categories
| categoryID | category | (categoryID is Primary)
| 01 | lead vocals |
| 02 | lead guitar |
| 03 | keyboard |
| 04 | harmonica |
| 05 | backing vocals|
| 06 | drums |
| 07 | rhythm guitar |
| 08 | bass guitar |

userID_categoryID
| userID | categoryID | (the primary is made from both together!)
| 01 | 01 |
| 01 | 03 |
| 01 | 04 |
| 01 | 05 |
| 01 | 07 |
| 02 | 01 |
| 02 | 05 |
| 02 | 07 |
| 02 | 08 |
| 03 | 02 |
| 03 | 05 |
| etc | etc |


Mark wrote:
> [previous discussion snipped]

Re: problem w/ duplicates

on 11.07.2006 08:33:54 by mark

zac.carey@gmail.com wrote:
> [three-table layout snipped]

hm. thank you for clearing that up. is having 3 tables faster / more
efficient / more compact than having two? i guess fewer strings get
stored, but twice as many rows are needed..

Re: problem w/ duplicates

on 11.07.2006 14:10:19 by zac.carey

Mark wrote:
> [earlier discussion snipped]
>
> hm. thank you for clearing that up. is having 3 tables faster/more
> efficient/save more space than having two tables? i guess there are
> less strings stored, but there are twice as many rows necessary..

Well, I'm definitely not qualified to comment on efficiency but I'm
sure there must be lots out there on the performance comparisons of
flat tables vs normalized dbs.

In this simple example, there's probably not a lot in it. By putting
the categories in a separate table, I'm reducing the risk of errors in
user input - or at least making those errors more consistent! If the
categories also had descriptions, for instance, then the performance
benefits would become more apparent.

But the db I've suggested is a poor example for demonstrating the real
benefits of (at least some degree of) normalization. I'd write a better
one - but there are SO many well-written tutorials on db construction
and normalization already out there that it hardly seems worth it :-)

Re: problem w/ duplicates

on 11.07.2006 16:22:43 by Shawn Hamzee

Indeed, in the example provided the three-table design is faster. Overall,
though, the performance of your tables depends on the number of rows. If the
number of rows in any of the tables is never going to surpass maybe 400,000
or 500,000, the denormalized solution is the way to go; beyond that there is
a point of diminishing returns that you need to watch for by examining the
logs and the performance of your database.

Hope this helps.


On 7/11/06 02:33, in article
1152599634.347968.24610@b28g2000cwb.googlegroups.com, "Mark"
wrote:

> [earlier discussion snipped]
>
> hm. thank you for clearing that up. is having 3 tables faster/more
> efficient/save more space than having two tables? i guess there are
> less strings stored, but there are twice as many rows necessary..

Re: problem w/ duplicates

on 11.07.2006 17:32:15 by Skarjune

Mark wrote:
> hm. thank you for clearing that up. is having 3 tables faster/more
> efficient/save more space than having two tables? i guess there are
> less strings stored, but there are twice as many rows necessary..

Mark,

As Rich first pointed out, normalization is important for data quality
and performance. The number of tables and rows alone does not determine
performance. Proper primary and foreign keys between related tables,
along with indexes on the columns used in WHERE clauses, have a greater
effect, since they control how the query is parsed and how the engine
decides to fetch the data. Hacking delimited values into a single
column, by contrast, will tend to slow things down.
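A sketch of that last point about indexing the WHERE column, again with `sqlite3` standing in for the real engine (the index name is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_category (userid INTEGER, category TEXT)")
conn.executemany("INSERT INTO user_category VALUES (?, ?)",
                 [(0, "life"), (1, "work"), (2, "work")])

# Index the column used in the WHERE clause so the engine can seek
# straight to matching rows instead of scanning the whole table.
conn.execute("CREATE INDEX idx_category ON user_category (category)")

result = [row[0] for row in conn.execute(
    "SELECT userid FROM user_category WHERE category = 'work' ORDER BY userid")]
print(result)  # [1, 2]

# EXPLAIN QUERY PLAN shows whether the index is actually being used:
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT userid FROM user_category WHERE category = 'work'").fetchall()
print(plan)
```

MySQL's `EXPLAIN` serves the same diagnostic role there.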

Imagine that you went to the library and the books were simply stacked
on the shelves in whatever order crammed the most books onto the fewest
shelves using the least staff. That would make filing easy for the
library, but finding a book a hassle for the patrons...

-DHS-

Re: problem w/ duplicates

on 17.07.2006 20:54:06 by mark

Skarjune wrote:
> [explanation snipped]

Thanks for explaining all this to me guys :) I knew there was a proper
or better method to approach this problem, but I guess I'm still sort
of new to databases, and wasn't sure what it was. Perhaps I'll google
some stuff on normalization and find out more. Anyways, this should
solve my problem. Thanks a ton!