A Statistical Look at the Features of Top Movies (According to IMDB)¶

Importing all needed packages, then fetching IMDB's top movies chart and storing the movie titles in a list.

In [ ]:
from bs4 import BeautifulSoup as bS
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

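URL = "https://www.imdb.com/chart/top"
webpage = requests.get(URL)
webpage.raise_for_status()  # stop early if the request was rejected
soup = bS(webpage.content, 'html.parser')

# The chart is taken to be the first <table> on the page
table = soup.find_all('table')[0]
body = table.find('tbody')

# Each row holds its title in <td class="titleColumn"><a>...</a></td>
movies = []
for row in body.find_all('tr'):
    movies.append(row.find('td', class_='titleColumn').find('a').get_text(strip=True))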

By First Letter¶

Creating a pandas data frame that maps each letter to a count. Then taking the first character of each movie title and incrementing the matching letter in the data frame.

Note: All titles that start with a number (or any other non-letter) are grouped under "#".

In [ ]:
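movies.sort()

# One row per bucket: '#' for titles that do not start with A-Z, then the 26 letters
pdMovies = pd.DataFrame(np.zeros(27), index=list('#ABCDEFGHIJKLMNOPQRSTUVWXYZ'))

for title in movies:
    first = title[0].upper()
    # Only A-Z have their own row; digits, symbols, and accented letters go to '#'
    if first.isalpha() and first in pdMovies.index:
        pdMovies.at[first, 0] += 1
    else:
        pdMovies.at['#', 0] += 1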
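
As a cross-check on the tally described above, the same counting rule can be sketched with only the standard library. The titles in `sample_titles` and the helper name `first_letter_bucket` are illustrative assumptions, not part of the scraped data or the original notebook.

```python
from collections import Counter
import string

LETTERS = set(string.ascii_uppercase)

# Hypothetical sample titles, used only for illustration (not scraped data)
sample_titles = ["The Godfather", "12 Angry Men", "Alien", "Amadeus", "Se7en"]

def first_letter_bucket(title):
    """Return the uppercase A-Z first letter of a title, or '#' for anything else."""
    first = title[:1].upper()
    return first if first in LETTERS else '#'

counts = Counter(first_letter_bucket(t) for t in sample_titles)

# Report the buckets in the same order as the data frame index
ordered = {key: counts[key] for key in '#' + string.ascii_uppercase}
print({key: value for key, value in ordered.items() if value})
```

A `Counter` returns 0 for buckets it has never seen, so no pre-filled table of zeros is needed in this variant.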

Finally, creating a bar graph from the gathered data.

In [ ]:
x = list('#ABCDEFGHIJKLMNOPQRSTUVWXYZ')
y = pdMovies[0].to_numpy()

fig, ax = plt.subplots()
ax.bar(x, y)
ax.set_title('Top %d Movies (from IMDB), Organized by First Letter' % len(movies))
font = {
    'family': 'serif',
    'color' : 'C9',
    'weight': 'normal',
    'size': 16
}
# Label the T bar with its count (the x and y offsets are tuned by hand for this figure)
ax.text(17, pdMovies.at['T', 0] - 3, int(pdMovies.at['T', 0]), fontdict=font)
plt.show()

The bar graph shows that, going by raw movie titles, T is clearly the most common first letter. However, T is also the first letter of the article "the," which begins many titles. It may therefore be more accurate to record the first letters of titles after skipping a leading "the."
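
One way to gauge how much of the T count comes from the article is to compare the number of titles that begin with "The " against all titles that begin with T. The sketch below does this on a small hypothetical list; `sample_titles` is an assumption for illustration, and on real data the list of scraped titles would take its place.

```python
# Hypothetical sample titles, used only for illustration (not scraped data)
sample_titles = ["The Godfather", "The Dark Knight", "Taxi Driver", "Alien", "Them!"]

# Titles whose first word is the article "The"
with_article = [t for t in sample_titles if t.startswith("The ")]

# All titles whose first character is T, article or not
starts_with_t = [t for t in sample_titles if t[:1].upper() == "T"]

share = len(with_article) / len(starts_with_t)
print(f"{len(with_article)} of {len(starts_with_t)} T titles start with 'The ' ({share:.0%})")
```

Note that "Them!" starts with the letters "The" but not with the word, which is why the check includes the trailing space.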

By First Letter, Ignoring "The"¶

Resetting the pdMovies data frame, then tallying letters with a new rule. If a title has at least 5 characters and its first 4 characters are "The " (that is, "The X..."), the 5th character is used in place of the first: if it is a letter, that letter is incremented in the data frame, and otherwise "#" is incremented. All other titles are counted exactly as in the previous experiment.

Finally, creating the bar graph.

In [ ]:
# Reset the counts before tallying again
pdMovies = pd.DataFrame(np.zeros(27), index=list('#ABCDEFGHIJKLMNOPQRSTUVWXYZ'))

for title in movies:
    # Skip a leading "The " when at least one character follows it
    if len(title) >= 5 and title.startswith('The '):
        first = title[4].upper()
    else:
        first = title[0].upper()

    # Only A-Z have their own row; digits, symbols, and accented letters go to '#'
    if first.isalpha() and first in pdMovies.index:
        pdMovies.at[first, 0] += 1
    else:
        pdMovies.at['#', 0] += 1

x = list('#ABCDEFGHIJKLMNOPQRSTUVWXYZ')
y = pdMovies[0].to_numpy()

fig, ax = plt.subplots()
ax.bar(x, y)
ax.set_title('Top %d Movies (from IMDB), Organized by First Letter, Ignoring "The"' % len(movies))
font = {
    'family': 'serif',
    'color' : 'C9',
    'weight': 'normal',
    'size': 16
}
# Label the S bar with its count (the x and y offsets are tuned by hand for this figure)
ax.text(16, pdMovies.at['S', 0] - 1, int(pdMovies.at['S', 0]), fontdict=font)
plt.show()
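
The "The "-skipping rule from this section can also be isolated in a small helper, which makes it easy to try on individual titles. This is a minimal sketch using only the standard library; the function name `bucket_ignoring_the`, its `article` parameter, and the sample titles are assumptions made for illustration.

```python
import string

LETTERS = set(string.ascii_uppercase)

def bucket_ignoring_the(title, article="The "):
    """Return the A-Z bucket of a title after skipping a leading article, else '#'."""
    # Skip the article only when at least one character follows it
    if title.startswith(article) and len(title) > len(article):
        rest = title[len(article):]
    else:
        rest = title
    first = rest[:1].upper()
    return first if first in LETTERS else '#'

# Hypothetical sample titles, used only for illustration (not scraped data)
for title in ["The Godfather", "The 400 Blows", "Them!", "Alien"]:
    print(title, "->", bucket_ignoring_the(title))
```

The match is case-sensitive, as in the cell above, so a title written as "THE ..." or "the ..." would still be counted under T.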