IMDb Movie Database Analysis

The sample of movies is a subset of the Kaggle IMDB 5000 Movie Dataset.

Import and Inspect

library(tidyverse)  # provides read_csv(), the dplyr verbs and ggplot2

movies <- read_csv(here::here("data", "movies.csv"))
glimpse(movies)

## Rows: 2,961
## Columns: 11
## $ title               <chr> "Avatar", "Titanic", "Jurassic World", "The Avenge…
## $ genre               <chr> "Action", "Drama", "Action", "Action", "Action", "…
## $ director            <chr> "James Cameron", "James Cameron", "Colin Trevorrow…
## $ year                <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 20…
## $ duration            <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, 1…
## $ gross               <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+08, …
## $ budget              <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+08, …
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 920…
## $ votes               <dbl> 886204, 793059, 418214, 995415, 1676169, 534658, 9…
## $ reviews             <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 35…
## $ rating              <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2, …

Besides the self-explanatory variables title, genre, director, year, and duration, the remaining variables are as follows:

gross: the gross earnings at the US box office, not adjusted for inflation
budget: the movie’s budget
cast_facebook_likes: the number of Facebook likes the cast members received
votes: the number of people who voted for (or rated) the movie on IMDb
reviews: the number of reviews for the movie
rating: the average IMDb rating

Check for missing values (NAs) and whether all entries are distinct

skimr::skim(movies)

Data summary
Name	movies
Number of rows	2961
Number of columns	11
_______________________
Column type frequency:
character	3
numeric	8
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
title	1	1	83	2907
genre	1	5	11	17
director	1	3	32	1366

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	1	2.00e+03	9.95e+00	1920.0	2.00e+03	2.00e+03	2.01e+03	2.02e+03	▁▁▁▂▇
duration	1	1.10e+02	2.22e+01	37.0	9.50e+01	1.06e+02	1.19e+02	3.30e+02	▃▇▁▁▁
gross	1	5.81e+07	7.25e+07	703.0	1.23e+07	3.47e+07	7.56e+07	7.61e+08	▇▁▁▁▁
budget	1	4.06e+07	4.37e+07	218.0	1.10e+07	2.60e+07	5.50e+07	3.00e+08	▇▂▁▁▁
cast_facebook_likes	1	1.24e+04	2.05e+04	0.0	2.24e+03	4.60e+03	1.69e+04	6.57e+05	▇▁▁▁▁
votes	1	1.09e+05	1.58e+05	5.0	1.99e+04	5.57e+04	1.33e+05	1.69e+06	▇▁▁▁▁
reviews	1	5.03e+02	4.94e+02	2.0	1.99e+02	3.64e+02	6.31e+02	5.31e+03	▇▁▁▁▁
rating	1	6.39e+00	1.05e+00	1.6	5.80e+00	6.50e+00	7.10e+00	9.30e+00	▁▁▆▇▁

There are no missing values: every variable has a complete_rate of 1, i.e. n_missing is 0. Repeated values are expected for most variables (many movies share a genre or a director), but repeated titles are a concern. Looking at n_unique, only 2907 of the 2961 records have a unique title, so 54 rows repeat a title that already appears, either as duplicate entries or as different movies sharing a name.
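To see which titles repeat, one option is to count the rows per title and then drop rows that are identical in every column. The sketch below runs on a small made-up tibble (`toy_movies` and its values are illustrative, not taken from the dataset); the same verbs should carry over to `movies`.

```r
library(dplyr)

# Toy stand-in for `movies`; the rows are made up for illustration
toy_movies <- tibble(
  title = c("A", "B", "B", "C", "C"),
  year  = c(2001, 2005, 2005, 1990, 2010),
  gross = c(10, 20, 20, 5, 7)
)

# Titles that appear more than once
dup_titles <- toy_movies %>%
  count(title, name = "n_rows") %>%
  filter(n_rows > 1)

# Drop rows that are identical in every column; rows that share a
# title but differ elsewhere (e.g. a remake) are kept
deduped <- toy_movies %>% distinct()
```

Whether same-title rows with different values should also be collapsed is a judgment call that depends on what those rows turn out to be.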

Number of movies in each genre

count_movies_genre <- movies %>% 
  group_by(genre) %>% 
  count(sort=TRUE) %>% 
  rename("number of movies" = n)

count_movies_genre

## # A tibble: 17 × 2
## # Groups:   genre [17]
##    genre       `number of movies`
##    <chr>                    <int>
##  1 Comedy                     848
##  2 Action                     738
##  3 Drama                      498
##  4 Adventure                  288
##  5 Crime                      202
##  6 Biography                  135
##  7 Horror                     131
##  8 Animation                   35
##  9 Fantasy                     28
## 10 Documentary                 25
## 11 Mystery                     16
## 12 Sci-Fi                       7
## 13 Family                       3
## 14 Musical                      2
## 15 Romance                      2
## 16 Western                      2
## 17 Thriller                     1

The genres are very unevenly represented: Comedy has 848 movies, while Thriller has just one. Comedy, Action and Drama together account for 2,084 of the 2,961 movies (about 70%).

Return on budget - how much $ did a movie make at the box office for each $ of its budget

library(scales)

genre_returns <- movies %>% 
  group_by(genre) %>% 
  summarise(average_gross = mean(gross), 
            average_budget = mean(budget)) %>% 
  mutate(return_on_budget = round(average_gross / average_budget, 2),
         # dollar() from the scales package makes the amounts more readable
         average_gross = dollar(average_gross),
         average_budget = dollar(average_budget)) %>% 
  arrange(desc(return_on_budget))

genre_returns

## # A tibble: 17 × 4
##    genre       average_gross average_budget return_on_budget
##    <chr>       <chr>         <chr>                     <dbl>
##  1 Musical     $92,084,000   $3,189,500                28.9 
##  2 Family      $149,160,478  $14,833,333               10.1 
##  3 Western     $20,821,884   $3,465,000                 6.01
##  4 Documentary $17,353,973   $5,887,852                 2.95
##  5 Horror      $37,713,738   $13,504,916                2.79
##  6 Fantasy     $42,408,841   $17,582,143                2.41
##  7 Comedy      $42,630,552   $24,446,319                1.74
##  8 Mystery     $67,533,021   $39,218,750                1.72
##  9 Animation   $98,433,792   $61,701,429                1.6 
## 10 Biography   $45,201,805   $28,543,696                1.58
## 11 Adventure   $95,794,257   $66,290,069                1.45
## 12 Drama       $37,465,371   $26,242,933                1.43
## 13 Crime       $37,502,397   $26,596,169                1.41
## 14 Romance     $31,264,848   $25,107,500                1.25
## 15 Action      $86,583,860   $71,354,888                1.21
## 16 Sci-Fi      $29,788,371   $27,607,143                1.08
## 17 Thriller    $2,468        $300,000                   0.01
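The return above is the genre's average gross divided by its average budget, which is not the same as averaging each movie's own return. A minimal sketch of the difference, using two made-up movies in a hypothetical genre "X":

```r
library(dplyr)

# Made-up figures for one hypothetical genre
toy <- tibble(
  genre  = c("X", "X"),
  gross  = c(100, 10),
  budget = c(10, 100)
)

toy_returns <- toy %>%
  group_by(genre) %>%
  summarise(
    ratio_of_means = mean(gross) / mean(budget),  # approach used above
    mean_of_ratios = mean(gross / budget)         # per-movie return, then averaged
  )
```

Here the ratio of means is 1 while the mean of ratios is 5.05, because one small-budget hit dominates the per-movie average. Which measure is more appropriate depends on whether the question is about the genre's total money or about a typical movie.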

Top 15 directors who have created the highest gross revenue in the box office

top_directors <- movies %>% 
  group_by(director) %>% 
  summarise(total_gross = sum(gross),
            avg_gross = mean(gross),
            median_gross = median(gross),
            sd_gross = sd(gross)) %>% 
  slice_max(order_by = total_gross, n = 15) %>% 
  mutate(total_gross = dollar(total_gross),
         avg_gross = dollar(avg_gross), 
         median_gross = dollar(median_gross),
         sd_gross = dollar(sd_gross))

top_directors

## # A tibble: 15 × 5
##    director          total_gross    avg_gross    median_gross sd_gross    
##    <chr>             <chr>          <chr>        <chr>        <chr>       
##  1 Steven Spielberg  $4,014,061,704 $174,524,422 $164,435,221 $101,421,051
##  2 Michael Bay       $2,231,242,537 $171,634,041 $138,396,624 $127,161,579
##  3 Tim Burton        $2,071,275,480 $129,454,718 $76,519,172  $108,726,924
##  4 Sam Raimi         $2,014,600,898 $201,460,090 $234,903,076 $162,126,632
##  5 James Cameron     $1,909,725,910 $318,287,652 $175,562,880 $309,171,337
##  6 Christopher Nolan $1,813,227,576 $226,653,447 $196,667,606 $187,224,133
##  7 George Lucas      $1,741,418,480 $348,283,696 $380,262,555 $146,193,880
##  8 Robert Zemeckis   $1,619,309,108 $124,562,239 $100,853,835 $91,300,279 
##  9 Clint Eastwood    $1,378,321,100 $72,543,216  $46,700,000  $75,487,408 
## 10 Francis Lawrence  $1,358,501,971 $271,700,394 $281,666,058 $135,437,020
## 11 Ron Howard        $1,335,988,092 $111,332,341 $101,587,923 $81,933,761 
## 12 Gore Verbinski    $1,329,600,995 $189,942,999 $123,207,194 $154,473,822
## 13 Andrew Adamson    $1,137,446,920 $284,361,730 $279,680,930 $120,895,765
## 14 Shawn Levy        $1,129,750,988 $102,704,635 $85,463,309  $65,484,773 
## 15 Ridley Scott      $1,128,857,598 $80,632,686  $47,775,715  $68,812,285

Ratings distribution

ratings_by_genre <- movies %>% 
  group_by(genre) %>% 
  summarise(avg_rating = mean(rating),
           min_rating = min(rating),
           max_rating = max(rating),
           sd_rating = sd(rating)) %>% 
  arrange(desc(avg_rating))

ratings_by_genre

## # A tibble: 17 × 5
##    genre       avg_rating min_rating max_rating sd_rating
##    <chr>            <dbl>      <dbl>      <dbl>     <dbl>
##  1 Biography         7.11        4.5        8.9     0.760
##  2 Crime             6.92        4.8        9.3     0.849
##  3 Mystery           6.86        4.6        8.5     0.882
##  4 Musical           6.75        6.3        7.2     0.636
##  5 Drama             6.73        2.1        8.8     0.917
##  6 Documentary       6.66        1.6        8.5     1.77 
##  7 Sci-Fi            6.66        5          8.2     1.09 
##  8 Animation         6.65        4.5        8       0.968
##  9 Romance           6.65        6.2        7.1     0.636
## 10 Adventure         6.51        2.3        8.6     1.09 
## 11 Family            6.5         5.7        7.9     1.22 
## 12 Action            6.23        2.1        9       1.03 
## 13 Fantasy           6.15        4.3        7.9     0.959
## 14 Comedy            6.11        1.9        8.8     1.02 
## 15 Horror            5.83        3.6        8.5     1.01 
## 16 Western           5.7         4.1        7.3     2.26 
## 17 Thriller          4.8         4.8        4.8    NA
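The NA for Thriller is expected rather than a data problem: the sample standard deviation divides by n - 1, so it is undefined for a single movie. A short check, using the Thriller rating and the Musical minimum and maximum from the table above:

```r
# One observation: sd() returns NA because n - 1 is zero
single_rating <- c(4.8)
sd_single <- sd(single_rating)

# Two observations: the result is defined (about 0.636)
sd_pair <- sd(c(6.3, 7.2))
```

The second value matches the sd_rating of 0.636 reported for Musical, which has exactly two movies.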

# Plotting the graph that shows how ratings are distributed (all genres)
ggplot(movies, aes(x=rating)) +
  geom_histogram(color="black", fill = "grey") +
  geom_vline(aes(xintercept=mean(rating)), color = "darkred", size = 1, linetype = "dashed") +
  labs(title = "Ratings distribution", 
       subtitle = "A histogram on overall ratings across genres", 
       x = "Rating",
       y = "count") +
  annotate("text",
           label = "Mean",
           color = "darkred", 
           y = -10, 
           x = 7,
           size = 5) +
  theme_bw()

# Plotting the graph that shows how ratings are distributed by genre
ggplot(movies, aes(x=rating)) +
  geom_histogram(color="black", fill = "grey") +
  facet_wrap(vars(genre), scales = "free_y") +
  labs(title = "Ratings distribution by genre", 
         x = "Rating") +
  theme_bw()

The relationship between `gross` and `cast_facebook_likes`

We would like to find out whether the number of Facebook likes the cast members have received is likely to be a good predictor of how much money a movie will make at the box office.

library(ggpubr)  # provides stat_regline_equation()

ggplot(movies, aes(x=cast_facebook_likes, y=gross)) +
  geom_point(aes(color=genre)) +
  geom_smooth(method = "lm") +
  labs(title = "Casts' Facebook likes and gross revenue",
       subtitle = "The relationship between casts' Facebook popularity and the movie's gross revenue (Unadjusted)",
       x = "Casts' likes on facebook",
       y = "Gross revenue") +
  stat_regline_equation(label.x = 50000, label.y = 760000000, aes(label = ..eq.label..), size = 3) +
  stat_regline_equation(label.x = 50000, label.y = 730000000, aes(label = ..rr.label..), size = 3) +
  scale_x_continuous(labels = number) +
  scale_y_continuous(labels = dollar_format(prefix = "$", suffix = "m", scale = 1/1000000))+
  theme_bw()

# Removing outliers to improve visualisation

ggplot(movies, aes(x=cast_facebook_likes, y=gross)) +
  geom_point(aes(color=genre)) +
  geom_smooth(method = "lm") +
  xlim(0, 150000) + # Drops movies above 150,000 likes before the line is fitted
  # Zoom the y axis without dropping data; a plain ylim() would be
  # replaced by the scale_y_continuous() call further down
  coord_cartesian(ylim = c(0, 600000000)) +
  labs(title = "Casts' Facebook likes and gross revenue",
       subtitle = "The relationship between casts' Facebook popularity and the movie's gross revenue (Adjusted)",
       x = "Casts' likes on facebook",
       y = "Gross revenue") +
  # Labels placed inside the visible y range
  stat_regline_equation(label.x = 15000, label.y = 580000000, aes(label = ..eq.label..), size = 3) +
  stat_regline_equation(label.x = 15000, label.y = 550000000, aes(label = ..rr.label..), size = 3) +
  scale_y_continuous(labels = dollar_format(prefix = "$", suffix = "m", scale = 1/1000000))+
  theme_bw()

An R² of 0.081 means that the cast’s Facebook likes explain only about 8% of the variation in gross revenue, so the linear relationship between the two variables is weak. On its own, the cast’s Facebook popularity is a poor predictor of a movie’s gross revenue.
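For a regression with a single predictor, R² is the square of the correlation coefficient, which is why a small R² also implies a weak correlation. A sketch on simulated data (the variables `likes` and `gross` below are generated, not read from the dataset):

```r
# Simulated data; the numbers are illustrative only
set.seed(1)
likes <- runif(200, 0, 1e5)
gross <- 3e7 + 500 * likes + rnorm(200, sd = 6e7)

fit <- lm(gross ~ likes)
r_squared <- summary(fit)$r.squared
r <- cor(likes, gross)

# For a one-predictor linear model, R^2 equals the squared correlation
same <- isTRUE(all.equal(r_squared, r^2))

# The correlation implied by an R^2 of 0.081 (about 0.28 in magnitude)
implied_r <- sqrt(0.081)
```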

The relationship between `gross` and `budget`

We would like to find out if the budget for the movie is likely to be a good predictor of how much money the movie will make at the box office.

ggplot(movies, aes(x=budget, y=gross)) +
  geom_point(aes(color=genre)) +
  geom_smooth(method = "lm") +
  labs(title = "Movie budget and gross revenue",
       subtitle = "The relationship between the movie's budget and its gross revenue",
       x = "Budget",
       y = "Gross revenue") +
  stat_regline_equation(label.y = 700000000, aes(label = ..eq.label..), size = 3) +
  stat_regline_equation(label.y = 670000000, aes(label = ..rr.label..), size = 3) +
  scale_x_continuous(labels = dollar_format(prefix = "$", suffix = "m", scale = 1/1000000))+
  scale_y_continuous(labels = dollar_format(prefix = "$", suffix = "m", scale = 1/1000000))+
  theme_bw()

Budget appears to be a stronger predictor of a movie’s gross revenue than the cast’s Facebook popularity, as the R² value shown on the plot is closer to 1.
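The comparison can also be made numerically instead of reading R² off each plot. The sketch below uses a hypothetical helper `r2()` and simulated columns; on the real data one would pass `movies$budget`, `movies$cast_facebook_likes` and `movies$gross`.

```r
# Hypothetical helper: R^2 of a simple linear regression of y on x
r2 <- function(x, y) summary(lm(y ~ x))$r.squared

# Simulated stand-ins for the real columns
set.seed(42)
n <- 300
budget <- runif(n, 1e6, 2e8)
likes  <- runif(n, 0, 1e5)
gross  <- 1.0 * budget + 100 * likes + rnorm(n, sd = 4e7)

# One R^2 per candidate predictor, side by side
r2_by_predictor <- c(budget = r2(budget, gross), likes = r2(likes, gross))
```

In this simulation budget is built to dominate, so its R² comes out higher by construction; the point is only the side-by-side layout.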

The relationship between `gross` and `rating`

We would like to find out whether IMDb ratings are likely to be a good predictor of how much money a movie will make at the box office.

ggplot(movies, aes(x=rating, y=gross)) +
  geom_point(aes(color=genre), size=1) +
  geom_smooth(method = "lm") +
  scale_x_continuous(labels = number) +
  labs(title = "Movie rating and gross revenue",
       subtitle = "The relationship between the movie's rating and its gross revenue",
       x = "Rating",
       y = "Gross revenue") +
  stat_regline_equation(label.y = 700000000, aes(label = ..eq.label..), size = 3) +
  stat_regline_equation(label.y = 670000000, aes(label = ..rr.label..), size = 3) +
  # A single y scale; a second scale_y_continuous() would replace the first
  scale_y_continuous(labels = dollar_format(prefix = "$", suffix = "m", scale = 1/1000000)) +
  theme_bw()


# Faceted by genre
ggplot(movies, aes(x=rating, y=gross)) +
  geom_point(aes(color=genre), size=0.5) +
  geom_smooth(method = "lm") +
  # One panel per genre, each with its own y-axis range
  facet_wrap(vars(genre), scales = "free_y") +
  scale_x_continuous(labels = number) +
  scale_y_continuous(labels = dollar_format(prefix = "$", suffix = "m", scale = 1/1000000)) +
  labs(title = "Movie rating and gross revenue",
       subtitle = "The relationship between the movie's rating and its gross revenue",
       x = "Rating",
       y = "Gross revenue") +
  theme_bw()

In general, there is a positive relationship between a movie’s rating and its gross revenue. However, as with the cast’s Facebook likes, the relationship is weak. In addition, some genres have very few movies (e.g. Thriller with 1, Musical, Romance and Western with 2 each, and Sci-Fi with 7), so their trendlines are not representative.
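One way to make the small-sample caveat concrete is to report the number of movies next to each genre's correlation. A minimal sketch on made-up data (the genres "Big" and "Tiny" and all values are hypothetical):

```r
library(dplyr)

# Made-up data: one genre with six movies, one with only two
toy <- tibble(
  genre  = c(rep("Big", 6), rep("Tiny", 2)),
  rating = c(5, 6, 6.5, 7, 7.5, 8, 6.3, 7.2),
  gross  = c(20, 35, 30, 50, 45, 70, 90, 10)
)

# Sample size and rating-gross correlation per genre
genre_trends <- toy %>%
  group_by(genre) %>%
  summarise(n_movies = n(), r = cor(rating, gross))
```

With only two movies, a straight line passes through both points exactly, so "Tiny" gets a correlation of -1 here even though two points say very little about the genre.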