Movie IMDb Rating Predictor
EECS 349, Machine Learning
Zhilin Chen, zcj456, Email: zhilinchen2015@u.northwestern.edu
This is an individual project; I worked on all parts of it.
EECS 349 Machine Learning, Northwestern University
Course Website: http://www.cs.northwestern.edu/~ddowney/courses/349_Spring2016/
Instructor: Doug Downey
Our task is to determine which attributes decide the IMDb rating of a movie, and how. For example, how do a movie's genres, directors, stars, or production company affect its IMDb rating? This is interesting and meaningful: a business person could use such a predictor to estimate whether a movie will be popular and profitable before its release. In addition, the predictor gives us insight into how public preferences in movies change over time.
In order to collect the dataset for our task, I wrote a Python program that fetches movie information in JSON format through the OMDb API. The dataset contains 13,427 examples; I randomly chose 10,000 for training and kept the rest for testing. The attributes used in this task are 'Year', 'Genres', 'Writers', 'Directors', 'Actors', and 'MetaScores'. Since some attributes can have multiple values, I split each of them into several attributes (for example, 'Genres' becomes 'Genres_1', 'Genres_2', 'Genres_3').
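The sketch below illustrates how such data collection and attribute splitting might look. The report does not include its script, so the helper names (fetch_movie, split_field), the example IMDb ID, and the optional API key are assumptions rather than the actual code; newer OMDb deployments require an API key.

```python
import requests

OMDB_URL = "http://www.omdbapi.com/"

def fetch_movie(imdb_id, api_key=None):
    """Fetch one movie's metadata from the OMDb API as a JSON dict."""
    params = {"i": imdb_id, "plot": "short", "r": "json"}
    if api_key:                        # current OMDb deployments require a key
        params["apikey"] = api_key
    resp = requests.get(OMDB_URL, params=params, timeout=10)
    resp.raise_for_status()
    return resp.json()

def split_field(value, n, prefix):
    """Split a comma-separated field into n single-valued attributes."""
    parts = [p.strip() for p in value.split(",")] if value else []
    parts += ["?"] * n                 # pad missing slots with '?' placeholders
    return {f"{prefix}_{i + 1}": parts[i] for i in range(n)}

if __name__ == "__main__":
    movie = fetch_movie("tt0111161")   # example IMDb ID, illustrative only
    record = {"Year": movie.get("Year"),
              "MetaScores": movie.get("Metascore"),
              "Rating": movie.get("imdbRating")}
    record.update(split_field(movie.get("Genre", ""), 3, "Genres"))
    record.update(split_field(movie.get("Director", ""), 2, "Directors"))
    print(record)
```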
We first analyze the dataset without any learning techniques, since that alone gives useful insight into the movie industry (for example, which genres, directors, writers, or actors are most popular with audiences in terms of IMDb ratings?). After this plain analysis, we try different models (such as KNN, a decision tree, and so on). Based on their 10-fold cross-validation accuracy, we find that BayesNet is the most suitable model for our problem. We then look deeper into BayesNet; for example, we examine its learning curve for our task, i.e., accuracy on the training set, the test set, and 10-fold cross-validation as a function of training-set size.
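The classifier names above (BayesNet, KNN, decision tree) suggest the experiments were run in Weka. The sketch below is only a rough scikit-learn analogue of the 10-fold comparison: the file name movies.csv, the column names, the RatingBin target, and the use of naive Bayes in place of a Bayesian network are all assumptions, not details from the report.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical flat export of the scraped attributes plus a discretized rating bin.
df = pd.read_csv("movies.csv")
features = ["Year", "Genres_1", "Genres_2", "Genres_3",
            "Directors_1", "Writers_1", "Actors_1", "MetaScores"]
X = df[features].astype(str)           # treat every attribute as categorical
y = df["RatingBin"]                    # e.g. IMDb rating rounded to the nearest integer

models = {
    "NaiveBayes":   make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                                  BernoulliNB()),
    "KNN":          make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                                  KNeighborsClassifier(n_neighbors=5)),
    "DecisionTree": make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                                  DecisionTreeClassifier()),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)   # 10-fold cross-validation accuracy
    print(f"{name:>12}: {scores.mean():.3f} +/- {scores.std():.3f}")
```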
Overall, the project was quite successful: we obtain over 80% accuracy on the test set and under 10-fold cross-validation. Moreover, our model reaches relatively high accuracy even on a small training set, and over-fitting does not appear to be a concern as the training set grows.
Complete Version of the Detailed Report
1) Background: Motivation and background for our task
2) Dataset: Source and details of our dataset (number of examples, attributes used, and so on)
3) Plain Analysis: Selected results obtained by directly analyzing the data, without any learning
4) Learning: How we apply machine learning techniques to the dataset, and the key results of this project
5) Conclusion: Conclusions, results, and suggestions for future work on this project