Movie IMDb Rating Predictor

EECS 349, Machine Learning

Zhilin Chen

Where and how do we grab data?

    1)The Open Movie Database(OMDb, http://www.omdbapi.com) API is a free web service to obtain movie. It provides us an API to access to the information of movies. Its search engine would return the information of target movie in JSON format like following:

{"Title":"Zootopia","Year":"2016","Rated":"PG","Released":"04 Mar 2016","Runtime":"108 min","Genre":"Animation, Action,
Adventure","Director":"Byron Howard, Rich Moore, Jared Bush","Writer":"Byron Howard (story by), Jared Bush (story by),
Rich Moore (story by), Josie Trinidad (story by), Jim Reardon (story by), Phil Johnston (story by),
Jennifer Lee (story by), Jared Bush (screenplay), Phil Johnston (screenplay)","Actors":"Ginnifer Goodwin, Jason Bateman,
Idris Elba, Jenny Slate", "Plot":"In a city of anthropomorphic animals, a rookie bunny cop and a cynical con artist
fox must work together to uncover a conspiracy.","Language":"English, Norwegian","Country":"USA","Awards":"1 nomination.",
"Poster":"http://ia.media-imdb.com/images/M/MV5BOTMyMjEyNzIzMV5BMl5BanBnXkFtZTgwNzIyNjU0NzE@._V1_SX300.jpg",
"Metascore":"78","imdbRating":"8.2","imdbVotes":"101,917","imdbID":"tt2948356","Type":"movie","Response":"True"}

    2) Movie lists from wikipedia (https://en.wikipedia.org/wiki/Lists_of_films) and IMDb (ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/)

    3) A self-written python program (code) that will automatically grab the data from OMDb API of the movies in the given list and then store these datas in .csv files.

Description of dataset

    1) Number of examples: 13427

    2) Training and testing set: We randomly choose 10000 examples as training set and the remaining 3427 examples as testing set.

    3) Attributes (since each movie might have multiple genres, writers, actors and so on, these attributes with multiple values would be split into multiple attributes, such as Genre_1, Genre_2, Genre_3...): 'Title', 'Year', 'Genre_1', 'Genre_2', 'Genre_3', 'Director', 'Writer_1', 'Writer_2', 'Writer_3', 'Actors_1', 'Actors_2', 'Actors_3', 'Actors_4', 'Metascore', 'imdbVotes', 'Language', 'Country' and 'imdbRatings'. And what should we notice is that NOT ALL the attributes have been used in learning (irrelevant attributes, such as 'Title' and 'imdbVotes' would be discarded during the learning).

    4) Screenshot of raw data in .csv file:

    5) Screenshot of processed data in .csv file: