I've got a parent table of movies and a child table of ratings, where people give the film a score on a scale of 1 to 10. In my display, I want to show an average rating for the film.
I don't want to recalculate the average of the child rows' scores for every display. With a large ratio of 'child' to 'parent' rows, that could be very expensive. I want to denormalise, but only just enough.
Say "Thor" has a current average score of 7 and a new review adds a rating of 2, what does that do to my average ? You can't tell from those figures, because an average is a ratio. Whenever you derive one value from multiple values, you are going to lose some functionality.
What I should do is take a step back and store on my movie table a TOTAL of people's ratings, plus a COUNT of the number of ratings. Now, rather than an average of 7, I've got 10 reviews with a total rating of 70. Add in my new row and I've got 11 reviews with a total of 72 (giving an average of a bit more than 6.5). I don't need to re-read all those individual rows to recalculate the average.
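To make that concrete, here is a minimal sketch in Oracle-flavoured SQL. The MOVIES/RATINGS tables and column names are my own invention, and the direct UPDATE shown here is exactly the contention point tackled below.

    -- Denormalised running figures on the parent table
    ALTER TABLE movies ADD (
      rating_total NUMBER DEFAULT 0 NOT NULL,
      rating_count NUMBER DEFAULT 0 NOT NULL
    );

    -- Recording a rating maintains the running figures in the same transaction
    INSERT INTO ratings (movie_id, score) VALUES (:movie_id, :score);

    UPDATE movies
    SET    rating_total = rating_total + :score,
           rating_count = rating_count + 1
    WHERE  movie_id = :movie_id;

    -- The average is derived on demand; no re-scan of RATINGS is needed
    SELECT rating_total / NULLIF(rating_count, 0) AS avg_rating
    FROM   movies
    WHERE  movie_id = :movie_id;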
Say "Thor" has a current average score of 7 and a new review adds a rating of 2, what does that do to my average ? You can't tell from those figures, because an average is a ratio. Whenever you derive one value from multiple values, you are going to lose some functionality.
What I should do is take a step back and store on my movie table a TOTAL of people's rating, plus a COUNT of the number of ratings. Now, rather than an average of 7, I've got 10 reviews with a total rating of 70. Add in my new row and I've got 11 reviews with a total of 72 (giving an average of a bit more than 6.5). I don't need to re-read all those individual rows to recalculate the average.
This is why we learn maths!
The next issue will be contention. Fifty people coming out of a screening, all adding their ratings at the same time. Scary? INSERTs are good at concurrency. There might be a few brief latches/mutexes on sequences and index blocks, but no actual locking problems. UPDATEs are a different matter: fifty sessions trying to update the same total/count values at the same time will get ugly, because each one must take a lock on the same movie row and wait its turn.
This is where queues come in. Have those fifty inserted rows join a queue, and have a single process dequeue them to calculate and update the movie's total and count.
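Here is a home-grown sketch of that pattern, using a 'pending' flag on the ratings table as the queue. Oracle AQ would be the more industrial choice; the table, column and procedure names here are all illustrative.

    -- Ratings arrive flagged as pending; concurrent INSERTs never block each other
    INSERT INTO ratings (movie_id, score, processed)
    VALUES (:movie_id, :score, 'N');

    -- A single consumer drains the queue and applies the aggregates serially
    CREATE OR REPLACE PROCEDURE process_pending_ratings IS
    BEGIN
      -- Claim the current batch, so rows arriving mid-run wait for the next run
      UPDATE ratings SET processed = 'P' WHERE processed = 'N';

      FOR r IN (SELECT movie_id,
                       SUM(score) AS batch_total,
                       COUNT(*)   AS batch_count
                FROM   ratings
                WHERE  processed = 'P'
                GROUP BY movie_id)
      LOOP
        UPDATE movies
        SET    rating_total = rating_total + r.batch_total,
               rating_count = rating_count + r.batch_count
        WHERE  movie_id = r.movie_id;
      END LOOP;

      UPDATE ratings SET processed = 'Y' WHERE processed = 'P';
      COMMIT;
    END;
    /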
This is why we learn computer science. Of course similar things happen in the real world too, and queuing theory is... a part of maths.
The queue/dequeue model may show inconsistencies while ratings wait to be processed off the queue. Is 'eventually consistent' good enough? If you need transactional consistency, then leave the ratings in a 'pending' state until after they've been dequeued. If you need immediacy AND transactional consistency, then you'll end up with a potential bottleneck. Recognising immovable objects is a vital lesson to learn too.
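If eventual consistency is acceptable, a scheduler job keeps the queue drained, and its interval bounds how stale the figures can get. A sketch, assuming the procedure above (the 30-second interval is arbitrary):

    BEGIN
      DBMS_SCHEDULER.CREATE_JOB(
        job_name        => 'drain_ratings_queue',
        job_type        => 'STORED_PROCEDURE',
        job_action      => 'process_pending_ratings',
        repeat_interval => 'FREQ=SECONDLY;INTERVAL=30',
        enabled         => TRUE);
    END;
    /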
8 comments:
I've done something similar to the movie ratings, only it's financial transactions.
On the account record, store the Current Balance and any other pertinent figures.
From a purist perspective, what if performance were never a consideration? Maybe a small data set, or you have something like Exadata? Would you then go the parent/child way and roll it up each time it was looked up?
"I've got 10 reviews with a total rating of 70" (for "Thor")
But what if "Fast and Furious" has a total rating of 40 but only 5 reviewers?
Would people look at "70 and 10" versus "40 and 5" (i.e. a pair of figures for each) when comparing the ratings for two movies? Or would it be easier to look at "7" and "8"? If you present two figures ("70" and "10") for the same movie, most people will read only one figure ("70").
Then again, if only one person has given a rating of "10" to "Source Code", should we accept that this film has an average rating of "10"?
So, an average makes sense only after a minimum (threshold) number of ratings have been entered.
If a movie has not been rated by at least 10 people, I would not present the average (or total) rating score at all.
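In SQL, that threshold might look something like this (the cutoff of 10 and the column names follow the earlier sketch and are illustrative):

    SELECT title,
           CASE WHEN rating_count >= 10
                THEN ROUND(rating_total / rating_count, 1)
           END AS avg_rating   -- NULL until enough ratings exist
    FROM   movies;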
Good points, folks.
One of the things I had to get my head around during the Statistics and Probability semester at uni was the notion of "p-values". Google it.
In a nutshell: they show us the degree of probability that a given sample is significant before we publish any ratios based on that sample.
Rarely calculated nowadays, when newspapers consider a simple ratio to be a "statistic".
Excellent post. Especially the point about the queues; I've been trying to get people to do this forever.
Since it's a bit of a non-DBA or noob-friendly post, I think more of an explanation of why inserts are not a concurrency issue, but updates are, would help some readers. I know why, but I'm not sure it's completely intuitive.
And noons:
A p-value is the probability that, if what you want to say about your data is wrong, you would still have got the results you did in your experiment by pure chance.
The lower the probability of getting your results by chance even if you are wrong, the more significant your results are.
And as far as I know, p-values are calculated on the analyzed data (averages and variances), not on the sample as a whole.
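In symbols, the textbook formulation is roughly: p = P(a result at least as extreme as the one observed | the null hypothesis is true).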
@Chet. Performance should ALWAYS be a consideration. A general rule of thumb would be to denormalise where the work involved in maintaining the summary is outweighed by the work saved by not recalculating it for each query.
But that is pretty vague, as the former includes work in development/maintenance rather than just work by the DB engine.
Volume of data, rapidity of change, and consistency all come into play.
@Hemant
Happy for the UI (or a mid-tier layer) to calculate an average as TOTAL/COUNT and maybe show "AVG 6.5 from 12 reviews" or whatever.
Agree that a small number of samples has little meaning. More so if people are volunteering opinions rather than forming a truly random sample.
DBMS_STATS determining averages/min/max from sampling would make an interesting post. But I don't think my maths is up to it. Perhaps Craig Shallahamer might take it up.
@Gwen,
Thanks. I've done a follow-up about the death of DELETE. I'll have to stretch my maths to come up with a more extensive description of where update concurrency starts to break things.
Chen:
I did not say that p-values are calculated on the sample as a whole.
Like I said: google the term. It's worth it.