Cinemorgue
This project all started because I watched the same actor die back-to-back in separate films, and I wanted to know what the high scores would look like for the under-reported metric of "film deaths". I couldn't trust any of the conflicting click-bait articles I found online, and figured if anyone should be trusted, it's an obscure crowd-sourced wiki called Cinemorgue.
Unfortunately, wikis are generally just big text files that lack a way to find this kind of info natively. So in order to get to the bottom of this, I had to parse a wiki dump myself.
Before I get into the semi-ironic gory details of manual text cleanup, and partially because I wanted to doodle in Illustrator, here's a summary of the top 8 below!
Let's Get Into It
Fortunately, Cinemorgue, the wiki containing the wealth of information we need, has an XML export. It uses MediaWiki export format 0.11, which is immensely helpful! There are a lot of wiki-parser libraries out there, but being able to follow the exact structure ourselves will be much simpler.
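For reference, a MediaWiki 0.11 export looks roughly like this (abridged; the element names come from the export schema, while the page content here is just illustrative):

```xml
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/" version="0.11">
  <page>
    <title>Joseph Cotten</title>
    <ns>0</ns>
    <revision>
      <text>...wikitext for the page...</text>
    </revision>
  </page>
</mediawiki>
```

Every element lives in that `xmlns` namespace, which is why the parser below has to prefix each tag lookup with it.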
```python
import xml.etree.ElementTree as ET
import pandas as pd

NS = 'http://www.mediawiki.org/xml/export-0.11/'

def parse_wikimedia_xml(filepath):
    tree = ET.parse(filepath)
    root = tree.getroot()
    data = []
    for page in root.findall('{%s}page' % NS):
        # ns 0 is the main article namespace; skip talk/help/etc. pages.
        ns = page.find('{%s}ns' % NS).text
        if ns != "0":
            continue
        title = page.find('{%s}title' % NS).text
        revision = page.find('{%s}revision' % NS)
        text = revision.find('{%s}text' % NS).text
        data.append({'title': title, 'text': text})
    return pd.DataFrame(data)

df = parse_wikimedia_xml('input/cinemorgue_pages_current.xml')
```
Doing so will generate the initial output below.
Section | Content |
---|---|
Main Page | Cinemorgue Wiki with navigation gallery containing Actor Index and Actress Index sections |
Redirect | Main Page redirects to Cinemorgue Wiki |
Marilyn Monroe | • Birth/Death: 1926 - 1962 • Notable as: First Playboy Playmate (December 1953) • Film Death: In "Niagara" (1953) as Rose Loomis - Strangled with white scarf by Joseph Cotten in bell tower • Notable Connections: Foster sister of Jody Lawrance, Ex-wife of Joe DiMaggio and Arthur Miller, Mistress of JFK, Ex-girlfriend of Jorge Guinle |
Page Structure
Laying some ground rules: we will only be looking at film deaths. TV deaths include a lot of voice actors from animated shows, which feels like cheating and ruins the spirit of finding out which actor put it all on the line.
The structure of each wikimedia page, while not perfect, is relatively consistent.
- Overview, Film Deaths, Television Deaths/TV Deaths, Video Game Deaths, Music Video Deaths, Notable Connections, Other General Page Formatting Failures
Unfortunately, there is no consistent label for when the "Film Deaths" section ends, so we need to find and drop every category below it.
```python
# Delete everything before the "Film Deaths" heading.
df['text'] = df['text'].str.split("Film Deaths", n=1, expand=True)[1]
# Delete everything from each later section heading onward.
df['text'] = df['text'].str.split("Television Deaths", n=1, expand=True)[0]
df['text'] = df['text'].str.split("Video Game Deaths", n=1, expand=True)[0]
df['text'] = df['text'].str.split("Music Video Deaths", n=1, expand=True)[0]
df['text'] = df['text'].str.split("Notable Connections", n=1, expand=True)[0]
df['text'] = df['text'].str.split("Category", n=1, expand=True)[0]
# Drop all recently nulled rows. Ready to begin string splitting.
df = df.dropna(subset=['text'])
```
Actor/Actress Names
Every movie death is (generally) annotated by a line break and an asterisk. For each instance of this, we'll separate the entry out into a new row.
```python
# New df to store the split rows
new_rows = {'title': [], 'text': []}

# Iterate through the original df
for idx, row in df.iterrows():
    title = row['title']
    text_parts = row['text'].split('\n*')
    # Append the new rows, skipping the first element,
    # which usually contains gibberish before the first entry.
    for part in text_parts[1:]:
        new_rows['title'].append(title)
        new_rows['text'].append(part)

# Create the new df
new_df = pd.DataFrame(new_rows)
```
title | text |
---|---|
Joseph Cotten | Shadow of a Doubt (1943) [Uncle Charlie]: Falls out of a train and into the path of another train during a struggle with Teresa Wright. |
Joseph Cotten | Niagara (1953) [George Loomis]: Drowned when his boat sinks while going over Niagara Falls. |
Joseph Cotten | The Last Sunset (1961) [John Breckenridge]: Shot in the back by Adam Williams as he leaves the cantina, as he is flanked by Rock Hudson and Kirk Douglas. (Thanks to Brian) |
Film Year
Now we need to find the year the film was released. This helps differentiate common/repeated movie titles that have been remade over the years.
We are looking for the first instance of 4 digits between parentheses.
```python
import re

# Creating year column.
def extract_year(text):
    match = re.search(r'\((\d{4})\)', text)
    if match:
        return match.group(1)
    return None

# Apply the function to create the "year" column
new_df['year'] = new_df['text'].apply(extract_year)
```
title | year | text |
---|---|---|
Joseph Cotten | 1943 | Shadow of a Doubt (1943) [Uncle Charlie]: Falls out of a train and into the path of another train during a struggle with Teresa Wright. |
Joseph Cotten | 1953 | Niagara (1953) [George Loomis]: Drowned when his boat sinks while going over Niagara Falls. |
Joseph Cotten | 1961 | The Last Sunset (1961) [John Breckenridge]: Shot in the back by Adam Williams as he leaves the cantina, as he is flanked by Rock Hudson and Kirk Douglas. (Thanks to Brian) |
Film Title
Now begins one of the worst parts: finding the film title. Titles come in two forms, links and non-links.
- Links are formatted so that the title is listed twice between brackets. (e.g. [The Shining (1980) | The Shining (1980)])
- Non-links bow to no god, and are structured however someone decided to make the wiki entry. Our best hope is to just capture everything leading up to the first instance of a year between parentheses. (e.g. ________ (1980) )
```python
# Preserve the raw entry before overwriting 'text' with just the title.
new_df['raw_text'] = new_df['text']

# If a string starts with a link [], grab the contained string.
# If a string is not a link, grab all text up until the first date ().
def extract_text(row):
    if row.startswith("["):
        # Remove parentheses and their contents from inside the square brackets
        cleaned_text = re.sub(r'\([^()]*\)', '', row)
        match = re.search(r'\[(.*?)\]', cleaned_text)
        if match:
            return match.group(1)
    else:
        match = re.search(r'^([^()]*)', row)
        if match:
            return match.group(1).strip()
    return ''  # Return an empty string if no match is found

new_df['text'] = new_df['text'].apply(extract_text)
# For links, delete everything after the first instance of a |.
new_df['text'] = new_df['text'].str.split("|", n=1, expand=True)[0]
# Remove all remaining [[]].
new_df['text'] = new_df['text'].str.replace(r'\[|\]', '', regex=True)
# Remove all white space at end of string.
new_df['text'] = new_df['text'].str.strip()
# Remove all null rows that don't contain a year.
new_df = new_df.dropna(subset=['year'])
# Remove all blank rows.
new_df = new_df[new_df['year'] != '']
```
title | text | year | raw_text |
---|---|---|---|
Joseph Cotten | Shadow of a Doubt | 1943 | Shadow of a Doubt (1943) [Uncle Charlie]: Falls out of a train and into the path of another train during a struggle with Teresa Wright. |
Joseph Cotten | Niagara | 1953 | Niagara (1953) [George Loomis]: Drowned when his boat sinks while going over Niagara Falls. |
Joseph Cotten | The Last Sunset | 1961 | The Last Sunset (1961) [John Breckenridge]: Shot in the back by Adam Williams as he leaves the cantina, as he is flanked by Rock Hudson and Kirk Douglas. (Thanks to Brian) |
We now have names, the movie title, and the year a death occurred! But... this output is boring on its own. We've come this far, so might as well go the extra yard.
Cause of Death
The least over-kill (pun kind of intended) way to go about this is tokenizing the text and hard-coding classifiers.
To define cause of death, we're going to use the FBI's boilerplate for homicide methodology and sprinkle in a few other things I think need distinctions.
- Firearms, Cutting Instrument, Blunt Objects, Personal Weapons, Strangulation, Drowning, Impact, Vehicular, Supernatural, Weather, Animals, Fire, Explosion, Poison, Narcotics, Ailment
To begin, we are going to remove all stop words.
```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Remove stop words to help the lemmatizer.
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    words = word_tokenize(text)
    filtered = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered)

new_df['processed_text'] = new_df['raw_text'].apply(remove_stopwords)
```
title | text | year | raw_text | processed_text |
---|---|---|---|---|
Joseph Cotten | Shadow of a Doubt | 1943 | Shadow of a Doubt (1943) [Uncle Charlie]: Falls out of a train and into the path of another train during a struggle with Teresa Wright. | Shadow Doubt Shadow Doubt Uncle Charlie Falls train path another train struggle Teresa Wright |
Joseph Cotten | Niagara | 1953 | Niagara (1953) [George Loomis]: Drowned when his boat sinks while going over Niagara Falls. | Niagara George Loomis Drowned boat sinks going Niagara Falls |
Joseph Cotten | The Last Sunset | 1961 | The Last Sunset (1961) [John Breckenridge]: Shot in the back by Adam Williams as he leaves the cantina, as he is flanked by Rock Hudson and Kirk Douglas. (Thanks to Brian) | Last Sunset Last Sunset John Breckenridge Shot back Adam Williams leaves cantina flanked Rock Hudson Kirk Douglas Thanks Brian |
Doing so has drastically reduced the time/resources needed to run lemmatization on our processed_text.
```python
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# Having done the above makes creating a dictionary of classifiers less insane.
lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(word):
    """Map POS tag to the first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {
        "J": wordnet.ADJ,
        "N": wordnet.NOUN,
        "V": wordnet.VERB,
        "R": wordnet.ADV
    }
    return tag_dict.get(tag, wordnet.VERB)

def lemmatize_text(text):
    text = text.lower()
    words = word_tokenize(text)
    lemmatized = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in words]
    return ' '.join(lemmatized)

new_df['processed_text'] = new_df['processed_text'].apply(lemmatize_text)
```
title | text | year | raw_text | processed_text |
---|---|---|---|---|
Joseph Cotten | Shadow of a Doubt | 1943 | Shadow of a Doubt (1943) [Uncle Charlie]: Falls out of a train and into the path of another train during a struggle with Teresa Wright. | shadow doubt shadow doubt uncle charlie fall train path another train struggle teresa wright |
Joseph Cotten | Niagara | 1953 | Niagara (1953) [George Loomis]: Drowned when his boat sinks while going over Niagara Falls. | niagara george loomis drown boat sink go niagara fall |
Joseph Cotten | The Last Sunset | 1961 | The Last Sunset (1961) [John Breckenridge]: Shot in the back by Adam Williams as he leaves the cantina, as he is flanked by Rock Hudson and Kirk Douglas. (Thanks to Brian) | last sunset last sunset john breckenridge shot back adam williams leaf cantina flank rock hudson kirk douglas thanks brian |
With this done, we can now run a dictionary of classifiers against the text. We classify based on the first matching word in the string, since the cause of death is usually stated in the opener. This method will definitely misfire every now and then, but that level of loss is acceptable considering the volume.
We start with earliest_index equal to infinity so that any valid index will be smaller. For each keyword in the dictionary that appears in the text, we grab its first index and update the running values whenever a smaller index is found; otherwise we skip it.
```python
# Wireframe of classifiers to keep it brief.
keywords = {
    'Firearms': ['gun'],
    'Cutting Instrument': ['knife'],
    'Blunt Objects': ['club'],
    'Personal Weapons': ['punch']
}

def identify_cause(text):
    text = word_tokenize(text.lower())
    first_instance = None
    earliest_index = float('inf')
    for cause, keys in keywords.items():
        for key in keys:
            if ' ' in key:
                # Handling for compound keywords
                compound_words = key.split(' ')
                for i in range(len(text) - len(compound_words) + 1):
                    if all(text[i + j] == word for j, word in enumerate(compound_words)):
                        if i < earliest_index:
                            earliest_index = i
                            first_instance = cause
            else:
                # Single-word handling
                if key in text:
                    index = text.index(key)
                    if index < earliest_index:
                        earliest_index = index
                        first_instance = cause
    return first_instance if first_instance else 'Other'

new_df['cause_of_death'] = new_df['processed_text'].apply(identify_cause)
```
title | text | year | cause_of_death | processed_text |
---|---|---|---|---|
Joseph Cotten | Shadow of a Doubt | 1943 | Impact | shadow doubt shadow doubt uncle charlie fall train path another train struggle teresa wright |
Joseph Cotten | Niagara | 1953 | Drowning | niagara george loomis drown boat sink go niagara fall |
Joseph Cotten | The Last Sunset | 1961 | Firearms | last sunset last sunset john breckenridge shot back adam williams leaf cantina flank rock hudson kirk douglas thanks brian |
We now have a dataframe with an actor/actress, film name, year, and cause of death!
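And with that, the high scores are one groupby away. A minimal sketch, using a few hypothetical rows standing in for the parsed data (the column names match the walkthrough above, but the counts here are only illustrative):

```python
import pandas as pd

# Hypothetical rows standing in for the fully parsed wiki data.
new_df = pd.DataFrame({
    'title': ['Joseph Cotten', 'Joseph Cotten', 'Joseph Cotten', 'Marilyn Monroe'],
    'text': ['Shadow of a Doubt', 'Niagara', 'The Last Sunset', 'Niagara'],
    'year': ['1943', '1953', '1961', '1953'],
    'cause_of_death': ['Impact', 'Drowning', 'Firearms', 'Strangulation'],
})

# Count film deaths per actor/actress to get the leaderboard.
leaderboard = (new_df.groupby('title')
                     .size()
                     .sort_values(ascending=False)
                     .rename('film_deaths'))
print(leaderboard.head(8))
```

Running the same groupby over the real dataframe produces the top-8 summary illustrated at the start of the post.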
If you want to review the notebook this was accomplished with, as a handful of things were left out, you can view the repository on my GitHub.