Cinemorgue

This project all started because I watched the same actor die back-to-back in separate films, and I wanted to know what the high scores would look like for the under-reported metric of "film deaths". I couldn't trust any of the conflicting click-bait articles I read online, and figured if anyone should be trusted, it's an obscure crowd-sourced wiki called Cinemorgue.

Unfortunately, wikis are generally just big text files that lack a way to find this kind of info natively. So in order to get to the bottom of this, I had to parse a wiki dump myself.

Before I get into the semi-ironic gory details of manual-entry text cleanup, and partially because I wanted to doodle in Illustrator, I have a summary of the top 8 below!

[Illustration: Cinemorgue top 8 summary]

Let's Get Into It

Fortunately, Cinemorgue, the wiki containing the wealth of information we need, has an XML export. It uses MediaWiki export format 0.11, which is immensely helpful! There are a lot of wiki-parser libraries out there, but being able to follow the exact structure ourselves will be much simpler.
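As a warm-up, note that the dump declares a default XML namespace, so ElementTree only matches tags written in the '{namespace}tag' form; plain tag names find nothing. A minimal sketch against a made-up, heavily abridged page element:

```python
import xml.etree.ElementTree as ET

# Toy stand-in for the dump: same default namespace, abridged fields.
SAMPLE = (
    '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">'
    '<page><title>Marilyn Monroe</title><ns>0</ns>'
    '<revision><text>...wikitext...</text></revision></page>'
    '</mediawiki>'
)

NS = 'http://www.mediawiki.org/xml/export-0.11/'
root = ET.fromstring(SAMPLE)

# A bare tag name misses, because every element lives inside the namespace.
assert root.find('page') is None

# The '{namespace}tag' form is what actually matches.
page = root.find('{%s}page' % NS)
title = page.find('{%s}title' % NS).text
```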

import pandas as pd
import xml.etree.ElementTree as ET

# Namespace declared by the MediaWiki 0.11 export format.
NS = 'http://www.mediawiki.org/xml/export-0.11/'

def parse_wikimedia_xml(filepath):
    tree = ET.parse(filepath)
    root = tree.getroot()
    data = []
    for page in root.findall('{%s}page' % NS):
        # ns 0 is the main article namespace; skip talk, user, and file pages.
        ns = page.find('{%s}ns' % NS).text
        if ns != "0":
            continue
        title = page.find('{%s}title' % NS).text
        revision = page.find('{%s}revision' % NS)
        text = revision.find('{%s}text' % NS).text
        data.append({'title': title, 'text': text})
    df = pd.DataFrame(data)
    return df

df = parse_wikimedia_xml('input/cinemorgue_pages_current.xml')

Doing so will generate the initial output below.

| Section | Content |
| --- | --- |
| Main Page | Cinemorgue Wiki with navigation gallery containing Actor Index and Actress Index sections |
| Redirect | Main Page redirects to Cinemorgue Wiki |
| Marilyn Monroe | • Birth/Death: 1926 - 1962 • Notable as: First Playboy Playmate (December 1953) • Film Death: In "Niagara" (1953) as Rose Loomis - Strangled with white scarf by Joseph Cotton in bell tower • Notable Connections: Foster sister of Jody Lawrance, Ex-wife of Joe DiMaggio and Arthur Miller, Mistress of JFK, Ex-girlfriend of Jorge Guinile |

Page Structure

Laying some ground rules: we will only be looking for film deaths. TV Deaths include a lot of voice actors from animated shows, which feels like cheating and ruins the spirit of finding out which actor put it all on the line.

The structure of each wikimedia page, while not perfect, is relatively consistent.

  • Overview
  • Film Deaths
  • Television Deaths/TV Deaths
  • Video Game Deaths
  • Music Video Deaths
  • Notable Connections
  • Other General Page Formatting Failures

Unfortunately, there is no consistent label for when the "Film Deaths" section ends, so we need to find and drop every category below it.

# Delete everything before the "Film Deaths" header.
df['text'] = df['text'].str.split("Film Deaths", n=1, expand=True)[1]

# Delete everything from each later section header onward.
df['text'] = df['text'].str.split("Television Deaths", n=1, expand=True)[0]
df['text'] = df['text'].str.split("Video Game Deaths", n=1, expand=True)[0]
df['text'] = df['text'].str.split("Music Video Deaths", n=1, expand=True)[0]
df['text'] = df['text'].str.split("Notable Connections", n=1, expand=True)[0]
df['text'] = df['text'].str.split("Category", n=1, expand=True)[0]

# Drop rows nulled by the first split (pages with no "Film Deaths" section).
# Ready to begin string splitting.
df = df.dropna(subset=['text'])
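Incidentally, the five trailing splits can be collapsed into one regex that cuts at whichever section header appears first. A sketch of the idea on a plain string (the same pattern works on the column via .str.split with regex=True in pandas 1.4+):

```python
import re

# Any of these headers ends the Film Deaths section.
SECTION_BREAKS = r'Television Deaths|Video Game Deaths|Music Video Deaths|Notable Connections|Category'

sample = 'shot by someone\nVideo Game Deaths\nnot wanted\nCategory: junk'

# Split once at the earliest header, keep everything before it.
film_only = re.split(SECTION_BREAKS, sample, maxsplit=1)[0]
```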

Actor/Actress Names

Every movie death is (generally) annotated by a line break and an asterisk. For each instance of this, we'll separate the entry out into a new row.

# New df to store the split rows
new_rows = {'title': [], 'text': []}

# Iterate through the original df
for idx, row in df.iterrows():
    title = row['title']
    text_parts = row['text'].split('\n*')
    
    # Append the new rows to the new df
    # Skip the first element, usually contains gibberish before first line.
    for part in text_parts[1:]:
        new_rows['title'].append(title)
        new_rows['text'].append(part)

# Create the new df
new_df = pd.DataFrame(new_rows)
| title | text |
| --- | --- |
| Joseph Cotten | Shadow of a Doubt (1943) [Uncle Charlie]: Falls out of a train and into the path of another train during a struggle with Teresa Wright. |
| Joseph Cotten | Niagara (1953) [George Loomis]: Drowned when his boat sinks while going over Niagara Falls. |
| Joseph Cotten | The Last Sunset (1961) [John Breckenridge]: Shot in the back by Adam Williams as he leaves the cantina, as he is flanked by Rock Hudson and Kirk Douglas. (Thanks to Brian) |
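For what it's worth, the same row-splitting can be done without an explicit loop using pandas' explode. A sketch with toy data (the regex=False flag needs pandas 1.4+):

```python
import pandas as pd

# Toy frame shaped like df after the section trimming above.
toy = pd.DataFrame({
    'title': ['Joseph Cotten'],
    'text': ['preamble\n*Niagara (1953)...\n*The Last Sunset (1961)...'],
})

# Split each page into entries, then fan the lists out into rows.
exploded = (
    toy.assign(text=toy['text'].str.split('\n*', regex=False))
       .explode('text')
)

# Drop the pre-asterisk preamble, mirroring text_parts[1:] in the loop.
exploded = exploded[exploded.groupby(level=0).cumcount() > 0]
exploded = exploded.reset_index(drop=True)
```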

Film Year

Now we need to find the year the film was released. This is to help differentiate common/repeated movie titles that have been made over the years.

We are looking for the first instance of four digits between parentheses.

import re

# Creating the year column.
def extract_year(text):
    match = re.search(r'\((\d{4})\)', text)
    if match:
        return match.group(1)
    else:
        return None

# Apply the function to create the "year" column
new_df['year'] = new_df['text'].apply(extract_year)
| title | year | text |
| --- | --- | --- |
| Joseph Cotten | 1943 | Shadow of a Doubt (1943) [Uncle Charlie]: Falls out of a train and into the path of another train during a struggle with Teresa Wright. |
| Joseph Cotten | 1953 | Niagara (1953) [George Loomis]: Drowned when his boat sinks while going over Niagara Falls. |
| Joseph Cotten | 1961 | The Last Sunset (1961) [John Breckenridge]: Shot in the back by Adam Williams as he leaves the cantina, as he is flanked by Rock Hudson and Kirk Douglas. (Thanks to Brian) |
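The apply above works fine; for the record, pandas can also do this in one vectorized line with str.extract, which returns the first match of the capture group (NaN when a row has no year):

```python
import pandas as pd

entries = pd.Series([
    'Niagara (1953) [George Loomis]: Drowned when his boat sinks.',
    'no year anywhere in this one',
])

# One capture group plus expand=False gives back a Series of first matches.
years = entries.str.extract(r'\((\d{4})\)', expand=False)
```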

Film Title

Now begins one of the worst parts: finding the film title. Titles come in two forms: links and non-links.

  • Links are formatted so the title is listed twice between brackets. (e.g. [The Shining (1980) | The Shining (1980)])

  • Non-Links bow to no god, and are structured however someone decided to make the wiki entry. Our best hope is to just capture everything leading up to the first instance of a year between parentheses. (e.g. ________ (1980) )

# If a string starts with a link [], grab the contained string.
# If a string is not a link, grab all text up until the first date ().
def extract_text(row):
    if row.startswith("["):
        # Remove parenthesis and their contents from inside the square brackets
        cleaned_text = re.sub(r'\([^()]*\)', '', row)
        match = re.search(r'\[(.*?)\]', cleaned_text)
        if match:
            return match.group(1)
    else:
        match = re.search(r'^([^()]*)', row)
        if match:
            return match.group(1).strip()
    return ''  # Return an empty string if no match is found

# Keep the full entry around; the cause-of-death step later reads raw_text.
new_df['raw_text'] = new_df['text']

new_df['text'] = new_df['text'].apply(extract_text)

# For links, delete everything after the first instance of a |.
new_df['text'] = new_df['text'].str.split("|", n=1, expand=True)[0]

# Remove all remaining [[]].
new_df['text'] = new_df['text'].str.replace(r'\[|\]', '', regex=True)

# Remove all white space at end of string.
new_df['text'] = new_df['text'].str.strip()

# Remove all null rows that don't contain a year.
new_df = new_df.dropna(subset=['year'])

# Remove all blank rows.
new_df = new_df[new_df['year'] != '']
| title | text | year | raw_text |
| --- | --- | --- | --- |
| Joseph Cotten | Shadow of a Doubt | 1943 | Shadow of a Doubt (1943) [Uncle Charlie]: Falls out of a train and into the path of another train during a struggle with Teresa Wright. |
| Joseph Cotten | Niagara | 1953 | Niagara (1953) [George Loomis]: Drowned when his boat sinks while going over Niagara Falls. |
| Joseph Cotten | The Last Sunset | 1961 | The Last Sunset (1961) [John Breckenridge]: Shot in the back by Adam Williams as he leaves the cantina, as he is flanked by Rock Hudson and Kirk Douglas. (Thanks to Brian) |
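As a sanity check that the link and non-link branches both land on a clean title, here is the whole title pipeline above folded into one standalone helper (same regexes, same order of operations):

```python
import re

def title_from_entry(entry):
    """Mirror the steps above: link handling, pipe split, bracket strip."""
    if entry.startswith('['):
        # Drop parenthesized years so they don't pollute the bracket match.
        cleaned = re.sub(r'\([^()]*\)', '', entry)
        m = re.search(r'\[(.*?)\]', cleaned)
        entry = m.group(1) if m else ''
    else:
        # Non-link: everything up to the first parenthesis.
        m = re.search(r'^([^()]*)', entry)
        entry = m.group(1).strip() if m else ''
    entry = entry.split('|', 1)[0]          # keep the left half of a link
    entry = re.sub(r'\[|\]', '', entry)     # strip leftover brackets
    return entry.strip()

link = '[[The Shining (1980)|The Shining (1980)]]: Frozen to death.'
plain = 'Niagara (1953) [George Loomis]: Drowned.'
```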

We now have names, the movie title, and the year a death occurred! But... this output is boring on its own. We've come this far, so might as well go the extra yard.

Cause of Death

The least overkill (pun kind of intended) way to go about this is tokenizing the text and hard-coding classifiers.

To define cause of death, we're going to use the FBI's boilerplate for homicide methodology and sprinkle in a few other things I think need distinctions.

  • Firearms, Cutting Instrument, Blunt Objects, Personal Weapons, Strangulation, Drowning, Impact, Vehicular, Supernatural, Weather, Animals, Fire, Explosion, Poison, Narcotics, Ailment

To begin, we are going to remove all stop words.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Remove stop words to help the lemmatizer.
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    words = word_tokenize(text)
    filtered = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered)

new_df['processed_text'] = new_df['raw_text'].apply(remove_stopwords)
| title | text | year | raw_text | processed_text |
| --- | --- | --- | --- | --- |
| Joseph Cotten | Shadow of a Doubt | 1943 | Shadow of a Doubt (1943) [Uncle Charlie]: Falls out of a train and into the path of another train during a struggle with Teresa Wright. | Shadow Doubt Shadow Doubt Uncle Charlie Falls train path another train struggle Teresa Wright |
| Joseph Cotten | Niagara | 1953 | Niagara (1953) [George Loomis]: Drowned when his boat sinks while going over Niagara Falls. | Niagara George Loomis Drowned boat sinks going Niagara Falls |
| Joseph Cotten | The Last Sunset | 1961 | The Last Sunset (1961) [John Breckenridge]: Shot in the back by Adam Williams as he leaves the cantina, as he is flanked by Rock Hudson and Kirk Douglas. (Thanks to Brian) | Last Sunset Last Sunset John Breckenridge Shot back Adam Williams leaves cantina flanked Rock Hudson Kirk Douglas Thanks Brian |

Doing so has drastically reduced the time/resources needed to run lemmatization on our processed_text.

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# Having done the above makes creating a dictionary for classifiers less insane.
lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {
        "J": wordnet.ADJ,
        "N": wordnet.NOUN,
        "V": wordnet.VERB,
        "R": wordnet.ADV
    }
    return tag_dict.get(tag, wordnet.VERB)

def lemmatize_text(text):
    text = text.lower()
    words = word_tokenize(text)
    lemmatized = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in words]
    return ' '.join(lemmatized)

new_df['processed_text'] = new_df['processed_text'].apply(lemmatize_text)
| title | text | year | raw_text | processed_text |
| --- | --- | --- | --- | --- |
| Joseph Cotten | Shadow of a Doubt | 1943 | Shadow of a Doubt (1943) [Uncle Charlie]: Falls out of a train and into the path of another train during a struggle with Teresa Wright. | shadow doubt shadow doubt uncle charlie fall train path another train struggle teresa wright |
| Joseph Cotten | Niagara | 1953 | Niagara (1953) [George Loomis]: Drowned when his boat sinks while going over Niagara Falls. | niagara george loomis drown boat sink go niagara fall |
| Joseph Cotten | The Last Sunset | 1961 | The Last Sunset (1961) [John Breckenridge]: Shot in the back by Adam Williams as he leaves the cantina, as he is flanked by Rock Hudson and Kirk Douglas. (Thanks to Brian) | last sunset last sunset john breckenridge shot back adam williams leaf cantina flank rock hudson kirk douglas thanks brian |

With this done, we can now run a dictionary of classifiers against the text. We classify based on the earliest keyword that appears in the string, since the cause of death is usually stated in the opener. This method will definitely have some misfires every now and then, but that level of loss is acceptable considering the volume.

We start with earliest_index equal to infinity so any valid index will be smaller. The loop records the index of each keyword found in the text, and updates earliest_index (and the cause) whenever a smaller index turns up; otherwise it skips.

# Wireframe of classifiers to keep it brief.
keywords = {
    'Firearms': [
        'gun'
    ],
    'Cutting Instrument': [
        'knife'
    ],
    'Blunt Objects': [
        'club'
    ],
    'Personal Weapons': [
        'punch'
    ]
}

def identify_cause(text):
    text = word_tokenize(text.lower())
    first_instance = None
    earliest_index = float('inf')
    
    for cause, keys in keywords.items():
        for key in keys:
            # Handling for compound words
            if ' ' in key:
                compound_words = key.split(' ')
                for i in range(len(text) - len(compound_words) + 1):
                    if all(text[i+j] == compound_word for j, compound_word in enumerate(compound_words)):
                        index = i
                        if index < earliest_index:
                            earliest_index = index
                            first_instance = cause
            else:
                # Original single-word handling
                if key in text:
                    index = text.index(key)
                    if index < earliest_index:
                        earliest_index = index
                        first_instance = cause
                    
    if first_instance:
        return first_instance
    return 'Other'

new_df['cause_of_death'] = new_df['processed_text'].apply(identify_cause)
| title | text | year | cause_of_death | processed_text |
| --- | --- | --- | --- | --- |
| Joseph Cotten | Shadow of a Doubt | 1943 | Impact | shadow doubt shadow doubt uncle charlie fall train path another train struggle teresa wright |
| Joseph Cotten | Niagara | 1953 | Drowning | niagara george loomis drown boat sink go niagara fall |
| Joseph Cotten | The Last Sunset | 1961 | Firearms | last sunset last sunset john breckenridge shot back adam williams leaf cantina flank rock hudson kirk douglas thanks brian |

We now have a dataframe with an actor/actress, film name, year, and cause of death!
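From here, the leaderboard that kicked this whole thing off is a value_counts away. A toy sketch assuming the final columns (the real frame being new_df from above):

```python
import pandas as pd

# Toy stand-in: one row per on-screen death, actor name in 'title'.
deaths = pd.DataFrame({
    'title': ['Actor A', 'Actor A', 'Actor B'],
    'cause_of_death': ['Firearms', 'Drowning', 'Firearms'],
})

# Deaths per performer, descending: the "high score" table.
leaderboard = deaths['title'].value_counts()
top_eight = leaderboard.head(8)

# Most common on-screen fate overall.
top_cause = deaths['cause_of_death'].value_counts().idxmax()
```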

If you want to review the notebook this was accomplished with (a handful of things were left out here), you can view the repository on my GitHub.