
Netflix movies and TV shows


An exploration of Netflix movies and TV shows, carried out during the pandemic while I was a student at Bournemouth University

# Netflix-movies-and-TVs

The summary dashboard is shown below.


![dashboard](https://user-images.githubusercontent.com/100246099/155524505-e6b69bac-1e98-4eda-a901-3517b921ce27.jpg)

1. Exploration and Visualization

During the pandemic, as a student, I spent a lot of time watching movies and TV shows on Netflix. That gave me an idea: what if I analyzed Netflix data to see how Netflix is doing? I wasn't sure I could find a dataset, but luckily there is a well-known one on Kaggle, so I could use it to practice what I had learned.

This dataset contains more than 8,500 Netflix movies and TV shows, including cast members, duration, and genre. It contains titles added as recently as late September 2021.

*Calling the libraries*

First, I load all the libraries used in this analysis.

library

```r
library(ggplot2)
library(tidyverse)
library(lubridate)
library(dplyr)
library(tibble)
library(purrr)
library(tidyr)
library(forcats)
```

import file

```r
# Read the Kaggle CSV and preview the first five rows
data <- readr::read_csv('D:/Giang/studying/project/Netflix/Netflix movies and TVs/netflix_titles.csv')
head(data, 5)
```

Data Dictionary

| Variable      | Class     | Description                                                   |
|:--------------|:----------|:--------------------------------------------------------------|
| type          | character | Either 'TV Show' or 'Movie'                                   |
| title         | character | The title of the movie or TV show                             |
| director      | character | The director of the movie or TV show                          |
| cast          | character | The actors playing in the movie or TV show                    |
| country       | character | The country in which the movie or TV show was directed        |
| date_added    | character | The date on which the movie or TV show was added to Netflix   |
| release_year  | numeric   | The year the movie or TV show was released                    |
| rating        | character | The kid-friendly rating the movie or TV show received         |
| duration      | character | The length of the movie or TV show                            |
| listed_in     | character | The genre of the movie or TV show                             |
| description   | character | The description/short summary of the movie or TV show         |

Source of dataset

Here is a quick summary of the dataset:

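```r
# Summary statistics for every column
summary(data)
# Check for duplicated show IDs (an empty result means every ID is unique)
data %>%
  group_by(show_id) %>%
  count() %>%
  filter(n > 1)
# Column names, types and a preview of the values
glimpse(data)
```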
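
The duplicate check works by counting rows per ID and keeping only the IDs that occur more than once. As a minimal sketch of the idea, here it is on a small made-up table (the `toy` rows are hypothetical and not taken from the Netflix dataset):

```r
library(dplyr)
library(tibble)

# Hypothetical data: "s2" appears twice on purpose
toy <- tibble(
  show_id = c("s1", "s2", "s2", "s3"),
  title   = c("A", "B", "B", "C")
)

# Count the rows per ID and keep only the IDs that occur more than once
dupes <- toy %>%
  count(show_id) %>%
  filter(n > 1)

dupes
```

If the result has zero rows, as one would hope for the real data, every ID is unique.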

The following code only sets the theme and figure size I want; it is a matter of personal preference.

theme and size

```r
# Shared text sizes and colours for axes and legends
fill_theme <- theme(axis.text.x = element_text(size = 16, color = "#1B4F72"),
                    axis.text.y = element_text(size = 16, color = "#34495E"),
                    axis.title.x = element_text(size = 16),
                    axis.title.y = element_text(size = 16, color = "#34495E")) +
  theme(legend.key.size = unit(x = 2, units = 'line'),
        legend.text = element_text(size = 14, color = "#1B4F72"),
        legend.title = element_text(size = 14, color = "#34495E"))
```

```r
# Helper to set the plot size in notebook output
fig <- function(width, height){
  options(repr.plot.width = width, repr.plot.height = height)
}
```

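Drop NA values

```r
# Keep only the columns needed for the country and genre analysis
countries <- data %>%
  select(country, type, title, listed_in)
# Share of rows with a missing country
sum(is.na(countries$country)) / nrow(countries)
# Drop the rows with a missing country
countries <- countries %>%
  filter(!is.na(country))
```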
2. Show types

The following bar chart compares the number of movies and TV shows on Netflix. There are about twice as many movies as TV shows.

```r
countries %>%
  count(type) %>%
  ggplot() + geom_col(aes(x = type, y = n, fill = type)) +
  labs(title = "Show Types",
       subtitle = "Netflix Data",
       caption = 'Data Source: Kaggle') +
  theme_minimal()
```


3. Which countries have produced the most movies on Netflix

This part looks at the countries of origin of the movies and TV shows on Netflix. The hidden part holds the preparation code, which extracts a data frame containing only the country names and the number of titles for each country.

preparation code

```r
### number of titles for each country
max(str_count(countries$country, ',')) # max = 11 ',' => maximum = 12 countries
### split the combined countries into single ones
ctr <- countries %>%
  separate(country, into = c('a','b','c','d','e','f','g','h','i','j','k','l'),
           ", ", convert = TRUE)
ctr <- ctr[, 1:12]
ctr_list <- ctr %>%
  unlist()
ctr_tibble <- tibble(country_name = ctr_list)
# Which country has the most movies
ctr <- ctr_tibble %>%
  group_by(country_name) %>%
  count() %>%
  filter(!is.na(country_name))
```
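
The same split-and-count step can also be written without naming twelve placeholder columns. The sketch below is one possible alternative, not the code used for the charts in this post: it uses `tidyr::separate_rows()` to put each country on its own row, and the `toy` data frame is made up for illustration.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical rows standing in for the `countries` data frame
toy <- tibble(
  title   = c("A", "B", "C"),
  country = c("United States, India", "India", "United Kingdom, United States")
)

# One row per (title, country) pair; the pattern also tolerates a missing
# space after the comma
country_counts <- toy %>%
  separate_rows(country, sep = ",\\s*") %>%
  filter(country != "") %>%
  count(country, name = "n", sort = TRUE)

country_counts
```

Because each country ends up on its own row, the result does not depend on the maximum number of countries per title.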

And here is the code for the chart:

Top countries

```r
fig(6, 20)
ctr %>%
  # `&` is the vectorised "and"; `&&` only compares single values
  filter(n > 100 & country_name != '') %>%
  ggplot(aes(reorder(country_name, n), n, fill = n > 800)) +
  geom_bar(stat = 'identity', show.legend = F) +
  labs(
    y = "Number of movies on Netflix",
    x = "Country name",
    title = "The outstanding number of movies in US and India") +
  coord_flip() +
  fill_theme
```


4. Understanding the categories on Netflix

This part shows the categories on Netflix. The first hidden code block prepares the data for the chart.

preparation

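```r
# Drop the first row and the count column
ctr <- ctr[-1, -2]
# Maximum number of commas in listed_in, i.e. genres per title minus one
max(str_count(countries$listed_in, ','))
# Split the combined genres into single ones
List_in <- countries %>%
  select(listed_in) %>%
  separate(listed_in, into = c('a','b','c'), ", ", convert = TRUE)
List_in <- List_in %>% unlist()
list_in <- tibble(
  list_in = List_in
)
```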

And this is the code for the chart:

```r
list_in %>%
  group_by(list_in) %>%
  count() %>%
  # `&` is the vectorised "and"; `&&` only compares single values
  filter(!is.na(list_in) & n >= 100) %>%
  ggplot(aes(reorder(list_in, n), n, fill = n > 1000)) +
  geom_col(show.legend = F) +
  labs(
    y = 'Number of titles on Netflix',
    x = 'Types',
    title = 'International movies and Dramas are the most common types on Netflix') +
  coord_flip() + fill_theme
```


5. Show ratings

There are many types of movies and TV shows aimed at different audiences, so the ratings show which audiences Netflix focuses on.

colorset

```r
colorset = c("#105738","#407442","#6e914c","#a1ad57",
             "#dac767","#ca9b43","#b77028","#a04417","#850b10")
```

```r
data %>%
  count(rating) %>%
  group_by(rating) %>%
  # Keep only the ratings with more than 100 titles
  filter(n > 100) %>%
  ggplot(aes(rating, n, fill = rating)) + scale_fill_manual(values = colorset) +
  geom_bar(stat = 'identity') +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.y = element_blank()) +
  scale_y_continuous(breaks = seq(from = 0, to = 3000, by = 200)) +
  labs(x = '', y = '') + fill_theme
```


6. Shows added quarterly

To see how many shows have been added to Netflix over time, the data is visualized by quarter.

```r
fig(17, 20)
# Parse dates such as "September 25, 2021"
data$date_added <- as.Date(data$date_added, format = '%B %d, %Y')
data %>%
  filter(date_added > '2015-01-01' & date_added < '2021-12-31') %>%
  # Round each date down to the first day of its quarter
  mutate(date_added = as.Date(floor_date(date_added, unit = 'quarter'))) %>%
  count(date_added) %>%
  ggplot(aes(date_added, n)) +
  geom_line(size = 1.3, alpha = 1, color = "#CD5C5C") +
  scale_x_date(date_breaks = '3 months', date_labels = '%b %y') +
  scale_y_continuous(breaks = seq(from = 0, to = 800, by = 100)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 12),
        panel.grid.major.x = element_blank()) +
  labs(y = 'shows added per quarter', x = '') + fill_theme
```
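
The quarterly grouping relies on `lubridate::floor_date()`, which rounds each date down to the first day of its quarter so that all titles added in the same quarter share one value. A minimal sketch with three made-up dates (written in ISO format here so the example does not depend on the locale's month names):

```r
library(lubridate)

# Hypothetical dates, one in each of three different quarters
dates <- as.Date(c("2021-09-25", "2020-01-01", "2019-05-17"))

# Round each date down to the start of its quarter
quarter_start <- floor_date(dates, unit = "quarter")

format(quarter_start)
# "2021-07-01" "2020-01-01" "2019-04-01"
```

Counting rows per `quarter_start` value then gives the number of titles added in each quarter.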


7. Top TV show/movie types in the UK

This part shows the top 10 TV show/movie types in the UK. Why the UK? According to part 3, three countries have the most movies and TV shows on Netflix; however, I currently live in the UK, so this part only covers the UK data.

preparation

```r
list_in <- list_in %>%
  group_by(list_in) %>%
  filter(!is.na(list_in)) %>%
  count()
# Drop the count column
list_in <- list_in[, -2]
# Keep the titles whose country field mentions the United Kingdom
UK.movie <- countries[str_which(countries$country, 'United Kingdom'), ]
max(str_count(UK.movie$listed_in, ','))
# Split the combined genres into single ones
UK.movie <- UK.movie %>%
  separate(
    listed_in,
    into = c('type1', 'type2', 'type3'),
    ', ',
    convert = T)
UK.list <- UK.movie %>%
  select(type1, type2, type3) %>%
  unlist()
UK.list <- tibble(
  type = UK.list
)
```

```r
UK.list %>%
  group_by(type) %>%
  count() %>%
  ungroup() %>%
  filter(!is.na(type)) %>%
  # Share of each type among all UK genre entries
  mutate(proportion = n / sum(n)) %>%
  # Choose the top 10 types without hard-coding a rank threshold
  filter(min_rank(desc(n)) <= 10) %>%
  ggplot(aes('', n, fill = type)) +
  geom_col(position = 'stack', color = 'white', show.legend = F) +
  geom_text(aes(label = paste(type, '\n', round(proportion, 2))),
            position = position_stack(vjust = 0.5), size = 2.8) +
  coord_polar('y', start = 0) +
  theme_bw() +
  labs(
    x = '',
    y = '',
    title = 'Netflix: Top 10 movie types in UK',
    subtitle = 'Dramas and Comedies are the most common types of movies in UK'
  )
```
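
Selecting the top N categories can also be expressed with `dplyr::slice_max()`, which does not depend on how many categories exist in total. The following is only a sketch on made-up counts (the `toy_counts` values are hypothetical), keeping the top 3 rather than the top 10 to stay short:

```r
library(dplyr)
library(tibble)

# Hypothetical genre counts
toy_counts <- tibble(
  type = c("Dramas", "Comedies", "Documentaries", "Kids' TV", "Thrillers"),
  n    = c(50, 40, 30, 20, 10)
)

top3 <- toy_counts %>%
  # Compute the share over all types before filtering, as in the chart code
  mutate(proportion = n / sum(n)) %>%
  # Keep the three rows with the largest counts
  slice_max(order_by = n, n = 3)

top3
```

Here the proportions are taken over all types, so the shares of the kept rows do not sum to 1.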

