Movie IMDb Rating Predictor

EECS 349, Machine Learning

Zhilin Chen

Project Title

    Movie IMDb Ratings Predictor

    Github link

Group Member

     Zhilin Chen, zcj456, Email: zhilinchen2015@u.northwestern.edu

     This is an individual project; I worked on all parts.

Course and University

     EECS 349 Machine Learning, Northwestern University

     Course Website: http://www.cs.northwestern.edu/~ddowney/courses/349_Spring2016/

     Instructor: Doug Downey

Abstract

     Our task is to determine which attributes decide the IMDb rating of a movie, and how. For example, how do a movie's genres, directors, stars, or production company affect its IMDb rating? This is interesting and meaningful: a businessperson could use such a predictor to estimate, before release, whether a movie will be popular and profitable. In addition, the predictor can provide insight into how public preferences in movies change over time.

     To collect the dataset for our task, I wrote a Python program that grabs movie information in JSON format through the OMDb API. There are 13,427 examples in the dataset; I randomly chose 10,000 for training and used the remainder for testing. The attributes used in this task are 'Year', 'Genres', 'Writers', 'Directors', 'Actors', and 'MetaScores'. Since some attributes can have multiple values, I split them into several attributes (for example, 'Genres' is split into 'Genres_1', 'Genres_2', and 'Genres_3').
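     The sketch below is a minimal illustration (not the exact script used) of how fetching a record from the OMDb API and splitting a multi-valued field might look in Python. The helper names, the field name 'Genre', and the example title are assumptions for illustration, and the API key is a placeholder.

     import requests

     OMDB_URL = "http://www.omdbapi.com/"

     def fetch_movie(title, api_key):
         # Query the OMDb API by title and return the movie record as a dict.
         resp = requests.get(OMDB_URL, params={"t": title, "apikey": api_key})
         resp.raise_for_status()
         return resp.json()

     def split_multivalued(record, field, n=3):
         # Split a comma-separated field (e.g. "Genre": "Action, Drama")
         # into fixed columns field_1 .. field_n, padding with None.
         values = [v.strip() for v in record.get(field, "").split(",")]
         return {"%s_%d" % (field, i + 1): (values[i] if i < len(values) else None)
                 for i in range(n)}

     # Example usage (hypothetical title, placeholder key):
     # movie = fetch_movie("Inception", api_key="YOUR_OMDB_KEY")
     # genre_features = split_multivalued(movie, "Genre")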

     We first analyze the dataset without any learning techniques, since this alone provides insight into the movie industry (for example, which genres, directors, writers, and actors are popular among audiences in terms of IMDb ratings?). After this plain analysis, we try different models (such as KNN, decision trees, and so on). Based on their accuracy under 10-fold cross-validation, we find that BayesNet is the most suitable model for our problem. We then examine BayesNet in more depth; for example, we explore its learning curve (accuracy on the training set, the test set, and 10-fold cross-validation as a function of training-set size) for our task.
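     As a rough illustration of the model-comparison step, here is a minimal scikit-learn sketch. This is an assumption about tooling: the project's BayesNet model likely came from a different toolkit, so Gaussian naive Bayes stands in for it here, and X and y denote the (already encoded) feature matrix and rating labels.

     from sklearn.model_selection import cross_val_score
     from sklearn.neighbors import KNeighborsClassifier
     from sklearn.tree import DecisionTreeClassifier
     from sklearn.naive_bayes import GaussianNB

     def compare_models(X, y):
         # Mean accuracy of each candidate model under 10-fold cross-validation.
         models = {
             "KNN": KNeighborsClassifier(n_neighbors=5),
             "Decision tree": DecisionTreeClassifier(random_state=0),
             "Naive Bayes (stand-in for BayesNet)": GaussianNB(),
         }
         for name, model in models.items():
             scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
             print("%s: %.3f" % (name, scores.mean()))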

     Generally speaking, our results are quite successful: we obtained over 80% accuracy on both the test set and 10-fold cross-validation. What's more, our model achieves relatively high accuracy even with a small training set, and overfitting does not appear to be a concern as the training set grows larger.
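     To make the learning-curve analysis concrete, a sketch like the following could compute training and 10-fold cross-validation accuracy at increasing training-set sizes. Again, the naive Bayes estimator and the specific size grid are assumptions for illustration, not the project's exact setup.

     import numpy as np
     from sklearn.model_selection import learning_curve
     from sklearn.naive_bayes import GaussianNB

     def learning_curve_table(X, y):
         # Training and cross-validation accuracy as the training set grows.
         sizes, train_scores, cv_scores = learning_curve(
             GaussianNB(), X, y, cv=10,
             train_sizes=np.linspace(0.1, 1.0, 10), scoring="accuracy")
         for n, tr, cv in zip(sizes,
                              train_scores.mean(axis=1),
                              cv_scores.mean(axis=1)):
             print("n=%d  train=%.3f  cv=%.3f" % (n, tr, cv))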

Menu of Detailed Report

     Complete Version of Detailed Report

     1) Background: Motivation and related background of our task

     2) Dataset: Source and details (number of examples, utilized attributes and so on) of our dataset

     3) Plain Analysis: Part of the results we obtained by simply analyzing the data

     4) Learning: How we applied machine learning techniques to our dataset, and the key results of this project

     5) Conclusion: Conclusions, results, and some suggestions for future work on this project