5

I trying to write in SQL the following requirement:

Each user can rate as much films as they like. (1= I don't like, 5=I love it) And the system should provide to the current user a list of films that he doesn't rate yet and that other users like too.

To simplify, I have this table and the following data:

create table wich (
  uid int,
  film varchar(50),
  rate int
);

insert into wich values 
(1, 'usual suspect', 5), (1, 'gataca', 4), (1, 'goldeneye', 2),
(2, 'usual suspect', 4), (2, 'gataca', 5), (2, 'i am a legend', 4), (2, 'the hobbit', 1),
(3, 'usual suspect', 1), (3, 'gataca', 5), (3, 'goldeneye', 5),
(4, 'usual suspect', 5), (4, 'goldeneye', 5),
(5, 'usual suspect', 5), (5, 'gataca', 4), (5, 'goldeneye', 5),
(6, 'usual suspect', 4), (6, 'gataca', 3), (6, 'goldeneye', 3), (6, 'shrek', 4);

This can be read as: User 1 like 'usual suspect' and 'gataca' but not 'goldeneye'.

So what I want know is to find the user(s) that have the closest tastes of the current user and get a list of films that the current user could probably like.

Following, what I have done so far:

Step 1: for every films rated by user 1, if user1 and user2 rate the film in the same way, compute a score (should be 0 if the difference is greater than 2) and it's great if both like a film or both doesn't like a film.

select w2.uid, 1 as nb, case when abs(w1.rate - w2.rate) <= 2 THEN 5-abs(w1.rate - w2.rate) ELSE 0 END as score
        from wich w1
        inner join wich w2 on w1.uid<>w2.uid and w1.film=w2.film
        where w1.uid=1

Step 2: compute a score by user

select scores.uid, SUM(scores.score) * 100 / sum(scores.nb) as score
from (
    select w2.uid, 1 as nb, case when abs(w1.rate - w2.rate) <= 2 THEN 5-abs(w1.rate - w2.rate) ELSE 0 END as score
    from wich w1
    inner join wich w2 on w1.uid<>w2.uid and w1.film=w2.film
    where w1.uid=1
) scores
group by uid
ORDER BY score desc

step 3: find films that user1 has not yet rated that he could like

select w3.uid, w3.film,w3.rate, matches.score
from (
    select scores.uid, SUM(scores.score) * 100 / sum(scores.nb) as score
    from (
        select w2.uid, 1 as nb, case when abs(w1.rate - w2.rate) <= 2 THEN 5-abs(w1.rate - w2.rate) ELSE 0 END as score
        from wich w1
        inner join wich w2 on w1.uid<>w2.uid and w1.film=w2.film
        where w1.uid=1
    ) scores
    group by uid
    ORDER BY score desc
) as matches
inner join wich w3 on matches.uid=w3.uid and w3.rate >= 3
left join wich w4 on w4.uid=1 and w3.film=w4.film
where w4.uid is null
order by w3.rate desc, matches.score desc;

This seems to work, but I am not sure that this will still respond in a small amount of time if each user rate of lot of films and the table become become bigger.

What do you think ?

Is there a better way to accomplish such a thing ?

A working fiddle: http://sqlfiddle.com/#!2/89c57/1/0


Edit:

In my real table, the film column is an integer and I have a second table with films data (title, description, year, ...)

create table wich (
  uid int,
  film int,
  rate int
);

The example given here is just to simplify the problem with just one table.

fluminis
  • 151
  • 3

3 Answers3

1

You can make a table which holds the users with similar taste for every user.

CREATE TABLE `similar_users` (
    `uid1` INT,
    `uid2` INT,
    `score` INT,
    PRIMARY KEY (`uid1`, `uid2`)
)

The algorithm for calculating the score can be anything, as long as a higher score means a closer taste.

You you should limit the amount of similar users to a fixed number and also exclude the ones with a score less than a certain value (the ones with a not so similar tastes).

The contents of the table will be calculated periodically (e.g. every day).

Then step 3 from your solution becomes something like this:

SELECT w.film
FROM similar_users su
JOIN wich w ON w.uid = su.uid2 AND w.rate > 2
LEFT JOIN wich wn ON wn.film = w.film AND wn.uid = {current_user}
WHERE su.uid1 = {current_user}
ORDER BY su.score * w.rate DESC
Vatev
  • 189
  • 1
  • 5
  • Good point here, I do not have to get the similar users dynamically in every query. Computing a similarity table periodically is a good idea. – fluminis Jul 30 '14 at 14:16
-1

Normalize it

The biggest thing that stands out to me is that you need to normalize the relationship here. You can find tons of stuff online about normalizing your database and ER diagrams. But to cut right to the code, it means pushing out the film titles to what you could consider a "lookup table". Then your wich table will just be all int codes. More difficult for humans to read, but much better for machines to process, and avoids a lot of data duplication and potential data anomalies.

CREATE TABLE wich (
  uid int,
  film_id int REFERENCES film(id),
  rate int,
  PRIMARY KEY (uid, film_id)
);

CREATE TABLE film (  
  id int PRIMARY KEY,
  title varchar(50)
);

To get the film's title to show up in your queries, you'll just want to JOIN the film table to get the title.

On performance

Your database will probably slow down as you data size grows UNLESS you add some indexes. The indexes to make depend largely on your queries that you will run against the database. A good place to start is to create indexes on:

  • columns that you are using as part of a JOIN condition
  • columns that you reference in your WHERE clauses
Joshua Huber
  • 1,757
  • 11
  • 15
  • As I said the example is there to "simplify" the problem, of course the tables are normalized and there are a lot more tables in question here. Good hint for the indexes I already added a few, I will verify that point. – fluminis Jun 10 '14 at 04:15
  • 1
    Assumig the (obvious) key (uid,film) (which he should declare), this table is in 5NF. It has no normalization problems. – philipxy Jul 17 '14 at 11:10
  • @philipxy. Good point. OP can confirm/disconfirm whether the film title is part of the key. Anyhow, the original question was more about scalability & performance than NF. I should have better explained this: indexes are most important here, but the my other suggestion to push out the film titles to a reference table is intended to keep the PK index nodes small for speed: two int columns = 8 bytes; vs. an int + varchar(50) = 55 bytes. The result is 7x key cache density, so if the DB gets big enough to saturate the key cache, you'd see reduced physical I/O with the smaller PK => faster queries. – Joshua Huber Jul 17 '14 at 13:05
  • The sample data means the only possible keys are (uid,film) or (uid,film,rate). We can expect there to be only one rating per user & film, ie the former to be key. 2. Regardless of what is good for optimization, normalization is not involved & the only part of your first section that is relevant is string "duplication" [ie bytes per value] and "anomalies" [ie misspellings], with neither word being used in a normalization sense.
  • – philipxy Jul 17 '14 at 21:01
  • Academically, yes, the table is Normalized under all NF flavors. However in the workplace, it wouldn't be uncommon to hear a developer ask "how far should I normalize this?" If we want to be grammatical, we would say that this use of the term normalize is incorrect, and should more accurately be phrased as "to what extent does one implement lookup tables" (been beaten up pretty well here: http://dba.stackexchange.com/q/9011/38772). Anyhow @philipxy, your comments have been stimulating and made me think about 3NF, BCNF, 5NF, DNF, etc. as I haven't since grad school. – Joshua Huber Jul 17 '14 at 21:24
  • Despite being fully normalized in the textbook sense, the sample data is vulnerable to data modification anomalies. Example: the programmer discovers that the list of movies has a typo in a movie title "usual suspect" when meaning "usual suspects". To get this fixed, one has to update potentially a lot of rows, and would involve updating values that are part of a composite key. It is unarguably simpler to just update 1 non-key attribute in 1 record in the films lookup table. Besides, if there is a frontend app for users to rate films, wouldn't you want a table that enumerates your films? – Joshua Huber Jul 17 '14 at 21:34
  • I think that people who use "normalization" to mean other things not only don't know what the word "normalization" means but are extremely likely to not know what normalization is and when and how to use it as a tool. Eg all the issues you raise are reasonable, they just don't have anything to do with normalization. But I repeat myself. (I don't understand your "beaten up" but the very link you gave explains how your last comment confuses introducing FKs to canonical values with introducing surrogate ids.) – philipxy Jul 17 '14 at 22:43
  • I edited my question: in my real database, the film column is a numeric value with a foreign key on the films table. – fluminis Jul 30 '14 at 13:09