Importing all needed packages, then fetching IMDb's top movies chart and storing the movie titles in a list.
from bs4 import BeautifulSoup as bS
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
URL = "https://www.imdb.com/chart/top"
webpage = requests.get(URL)
soup = bS(webpage.content, 'html.parser')
table = soup.find_all('table')[0]
body = table.find('tbody')
movies = []
for row in body.find_all('tr'):
    # The title is the link text inside each row's 'titleColumn' cell.
    movies.append(row.find('td', class_='titleColumn').find('a').contents[0])
Creating a pandas data frame with one row per letter, each holding a count that starts at zero. Then taking the first character of each movie title and incrementing the matching row in the data frame.
Note: Titles whose first character is not a letter (for example, a digit) are all counted under "#".
movies.sort()
pdMovies = pd.DataFrame(np.zeros(27), index=list('#ABCDEFGHIJKLMNOPQRSTUVWXYZ'))
for title in movies:
    if title[0].isalpha():
        pdMovies.at[title[0].upper(), 0] += 1
    else:
        pdMovies.at['#', 0] += 1
Finally, creating a bar graph from the gathered data.
x = list('#ABCDEFGHIJKLMNOPQRSTUVWXYZ')
y = pdMovies[0].to_numpy()
fig, ax = plt.subplots()
ax.bar(x, y)
ax.set_title('Top %d Movies (from IMDB), Organized by First Letter' % len(movies))
font = {
    'family': 'serif',
    'color': 'C9',
    'weight': 'normal',
    'size': 16
}
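# Label the T bar with its count; the text starts at x=17, to the left of the T bar (position 20).
ax.text(17, pdMovies.at['T', 0] - 3, int(pdMovies.at['T', 0]), fontdict=font)
plt.show()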
The bar graph shows that, counting raw movie titles, T is clearly the most common first letter. However, T is also the first letter of the article "The," which begins many titles. It may therefore be more accurate to skip a leading "The" and record the first letter of the word that follows.
Resetting the pdMovies data frame, then counting again with a new rule. If a title has at least 5 characters and its first 4 characters are "The " (so it has the form "The X..."), the 5th character is used instead of the first: the matching letter is incremented if it is a letter, and "#" is incremented otherwise. Every other title is counted exactly as in the first experiment.
Finally, creating the bar graph.
pdMovies = pd.DataFrame(np.zeros(27), index=list('#ABCDEFGHIJKLMNOPQRSTUVWXYZ'))
for title in movies:
    if len(title) >= 5 and title[0:4] == 'The ':
        if title[4].isalpha():
            pdMovies.at[title[4].upper(), 0] += 1
        else:
            pdMovies.at['#', 0] += 1
    elif title[0].isalpha():
        pdMovies.at[title[0].upper(), 0] += 1
    else:
        pdMovies.at['#', 0] += 1
x = list('#ABCDEFGHIJKLMNOPQRSTUVWXYZ')
y = pdMovies[0].to_numpy()
fig, ax = plt.subplots()
ax.bar(x, y)
ax.set_title('Top %d Movies (from IMDB), Organized by First Letter, Ignoring "The"' % len(movies))
font = {
    'family': 'serif',
    'color': 'C9',
    'weight': 'normal',
    'size': 16
}
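# Label the S bar with its count; the text starts at x=16, to the left of the S bar (position 19).
ax.text(16, pdMovies.at['S', 0] - 1, int(pdMovies.at['S', 0]), fontdict=font)
plt.show()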
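As a cross-check on the two counting rules above, here is a minimal standalone sketch using only the standard library. The names sample_titles, leading_letter, and tally are hypothetical and not part of the notebook, and the sample titles are hand-picked for illustration rather than scraped. One assumption differs from the loops above: any first character outside A-Z (including accented letters) is counted under "#", so every title lands in one of the 27 buckets.

```python
from collections import Counter
import string

# Hypothetical sample titles for illustration only, not scraped data.
sample_titles = [
    "The Godfather",
    "12 Angry Men",
    "Se7en",
    "The Matrix",
    "Goodfellas",
    "The 400 Blows",
]

BUCKETS = "#" + string.ascii_uppercase


def leading_letter(title, skip_the=False):
    """Return the bucket ('#' or 'A'-'Z') for a title's leading character."""
    # Same test as the second experiment: at least 5 characters, starting with "The ".
    if skip_the and len(title) >= 5 and title.startswith("The "):
        title = title[4:]
    if not title:
        return "#"
    first = title[0].upper()
    # Assumption: anything outside A-Z goes to '#'.
    return first if first in string.ascii_uppercase else "#"


def tally(titles, skip_the=False):
    """Count titles per bucket, keeping buckets with zero titles."""
    counts = Counter(leading_letter(t, skip_the) for t in titles)
    return {bucket: counts.get(bucket, 0) for bucket in BUCKETS}


raw = tally(sample_titles)
no_the = tally(sample_titles, skip_the=True)
print("T, raw titles:", raw["T"])
print("T, ignoring 'The':", no_the["T"])
```

The values of either dictionary can be passed to ax.bar in place of pdMovies[0].to_numpy(), since the keys follow the same "#ABC...Z" order used for x.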