I have been programming in Python since 2006 and writing about Python almost as long on my blog. I also enjoy apologetics, reading, and photography. Mike is a DZone MVB, is not an employee of DZone, and has posted 59 posts at DZone. You can read more from him at his website.

Python 101: How to Grab Data from RottenTomatoes

11.11.2013

Today we’ll be looking at how to acquire data from the popular movie site Rotten Tomatoes. To follow along, you’ll want to sign up for an API key. When you get your key, make a note of your usage limit, if there is one. If you make too many calls to the API, you may get your key revoked. Finally, it’s always a good idea to read the documentation of the API you will be using.

Once you’ve perused that or decided that you’ll save it for later, we’ll continue our journey.

Starting the Show

Rotten Tomatoes’ API provides a set of JSON feeds that we can extract data from. We’ll be using the requests and simplejson packages to pull the data down and process it. Let’s write a little script that gets the movies currently playing in theaters.

import requests
import simplejson
 
#----------------------------------------------------------------------
def getInTheaterMovies():
    """
    Get a list of movies in theaters. 
    """
    key = "YOUR API KEY"
    url = "http://api.rottentomatoes.com/api/public/v1.0/lists/movies/in_theaters.json?apikey=%s"
    res = requests.get(url % key)
 
    data = res.content
 
    js = simplejson.loads(data)
 
    movies = js["movies"]
    for movie in movies:
        print movie["title"]
 
#----------------------------------------------------------------------
if __name__ == "__main__":
    getInTheaterMovies()

If you run this code, you’ll see a list of movies printed to stdout. When this script was run at the time of this writing, I got the following output:

Free Birds
Gravity
Ender's Game
Jackass Presents: Bad Grandpa
Last Vegas
The Counselor
Cloudy with a Chance of Meatballs 2
Captain Phillips
Carrie
Escape Plan
Enough Said
Insidious: Chapter 2
12 Years a Slave
We're The Millers
Prisoners
Baggage Claim

In the code above, we build a URL using our API key and use requests to download the feed. Then we load the data with simplejson, which returns a nested Python dictionary. Next we loop over the list of movies stored under its "movies" key and print out each movie’s title. Now we’re ready to create a function to extract additional information from Rotten Tomatoes about each of these movies.
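As a quick offline illustration of the parsing step described above, here is a minimal sketch that needs neither an API key nor a network connection. It is written for Python 3 and uses the standard library's json module in place of simplejson. The SAMPLE_FEED string and the extract_titles helper are hypothetical stand-ins, and the real feed contains many more keys than this trimmed sample.

```python
import json

# Hypothetical, heavily trimmed stand-in for the in_theaters feed.
SAMPLE_FEED = """
{"total": 2,
 "movies": [{"title": "Gravity"}, {"title": "Captain Phillips"}]}
"""


def extract_titles(raw_json):
    """Return the title of every entry under the top-level "movies" key."""
    feed = json.loads(raw_json)  # a dictionary holding nested dicts and lists
    # "movies" maps to a list of dictionaries, one per movie
    return [movie["title"] for movie in feed.get("movies", [])]


if __name__ == "__main__":
    for title in extract_titles(SAMPLE_FEED):
        print(title)
```

Using feed.get("movies", []) means a response without a "movies" key yields an empty list instead of raising a KeyError, which may be handy if the API returns an error document.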

import requests
import simplejson
import urllib
 
#----------------------------------------------------------------------
def getMovieDetails(key, title):
    """
    Get additional movie details
    """
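    # URL-encode the title: spaces become "+" and characters
    # such as "&" or ":" are escaped so the query string stays valid
    title = urllib.quote_plus(title.encode("utf-8"))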
 
    link = "http://api.rottentomatoes.com/api/public/v1.0/movies.json"
    url = "%s?apikey=%s&q=%s&page_limit=1"
    url = url % (link, key, title)
    res = requests.get(url)
    js = simplejson.loads(res.content)
 
    for movie in js["movies"]:
        print "rated: %s" % movie["mpaa_rating"]
        print "movie synopsis: " + movie["synopsis"]
        print "critics_consensus: " + movie["critics_consensus"]
 
        print "Major cast:"
        for actor in movie["abridged_cast"]:
            print "%s as %s" % (actor["name"], actor["characters"][0])
 
        ratings = movie["ratings"]
        print "runtime: %s"  % movie["runtime"]
        print "critics score: %s" % ratings["critics_score"]
        print "audience score: %s" % ratings["audience_score"]
        print "for more information: %s" % movie["links"]["alternate"]
    print "-" * 40
    print
 
#----------------------------------------------------------------------
def getInTheaterMovies():
    """
    Get a list of movies in theaters. 
    """
    key = "YOUR API KEY"
    url = "http://api.rottentomatoes.com/api/public/v1.0/lists/movies/in_theaters.json?apikey=%s"
    res = requests.get(url % key)
 
    data = res.content
 
    js = simplejson.loads(data)
 
    movies = js["movies"]
    for movie in movies:
        print movie["title"]
        getMovieDetails(key, movie["title"]) 
    print
 
#----------------------------------------------------------------------
if __name__ == "__main__":
    getInTheaterMovies()

This new code pulls out a lot of data about each movie, but the JSON feeds contain quite a bit more than this example shows. You can see what you’re missing by printing the js dictionary to stdout, or by looking at an example JSON feed on the Rotten Tomatoes docs page. If you’ve been paying close attention, you’ll notice that the Rotten Tomatoes API doesn’t cover a lot of the data on their website. For example, there is no way to pull information about an actor: if we wanted to know what movies Jim Carrey was in, there is no URL endpoint to query against. You also cannot look up anyone else involved in a movie, such as the director or producer. The information is on the website, but it is not exposed by the API. For that, we would have to turn to the Internet Movie Database (IMDb), but that will be the topic of a different article.
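If dumping the whole js dictionary to stdout is too noisy, one option is to list only the keys and the type of value stored under each. The sketch below is written for Python 3 and runs against a made-up record. SAMPLE_MOVIE and the describe helper are hypothetical, the values are placeholders, and the key names are simply the ones used in the code above.

```python
import json

# Hypothetical record with placeholder values; only the key names
# mirror the fields read by the article's code.
SAMPLE_MOVIE = {
    "title": "Example Movie",
    "mpaa_rating": "PG",
    "runtime": 100,
    "ratings": {"critics_score": 0, "audience_score": 0},
    "abridged_cast": [{"name": "Actor One", "characters": ["Hero"]}],
}


def describe(record, indent=0):
    """Return one line per key, naming the type of value stored under it."""
    lines = []
    for key in sorted(record):
        value = record[key]
        lines.append("%s%s: %s" % (" " * indent, key, type(value).__name__))
        if isinstance(value, dict):
            # Recurse so nested dictionaries are listed, indented by two spaces
            lines.extend(describe(value, indent + 2))
    return lines


if __name__ == "__main__":
    # Full pretty-printed dump, followed by the compact key/type summary
    print(json.dumps(SAMPLE_MOVIE, indent=2, sort_keys=True))
    print("\n".join(describe(SAMPLE_MOVIE)))
```

The same describe call should work on any dictionary returned by simplejson.loads, so it can be pointed at a real movie record once you have one.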

Let’s spend some time improving this example. One simple improvement would be to put the API key into a config file. Another would be to actually store the information we’re downloading into a database. A third improvement would be to add some code that checks if we’ve already downloaded today’s current movies because there really isn’t a good reason to download today’s releases more than once a day. Let’s add those features!

Adding a Config File

I prefer and recommend ConfigObj for dealing with config files. Let’s create a simple “config.ini” file with the following contents:

[Settings]
api_key = API KEY
last_downloaded =

Now let’s change our code to import ConfigObj and change the getInTheaterMovies function to use it:

import requests
import simplejson
import urllib
 
from configobj import ConfigObj
 
#----------------------------------------------------------------------
def getInTheaterMovies():
    """
    Get a list of movies in theaters. 
    """
    config = ConfigObj("config.ini")
    key = config["Settings"]["api_key"]
    url = "http://api.rottentomatoes.com/api/public/v1.0/lists/movies/in_theaters.json?apikey=%s"
    res = requests.get(url % key)
 
    data = res.content
 
    js = simplejson.loads(data)
 
    movies = js["movies"]
    for movie in movies:
        print movie["title"]
        getMovieDetails(key, movie["title"]) 
    print
 
#----------------------------------------------------------------------
if __name__ == "__main__":
    getInTheaterMovies()

As you can see, we import ConfigObj and pass it our filename. You could also pass it the fully qualified path. Next we pull the value of api_key out of the Settings section and use it in our URL. Since we have a last_downloaded value in our config, we should go ahead and add that to our code so we can prevent downloading the data multiple times a day.

import datetime
import requests
import simplejson
import urllib
 
from configobj import ConfigObj
 
#----------------------------------------------------------------------
def getInTheaterMovies():
    """
    Get a list of movies in theaters. 
    """
    today = datetime.datetime.today().strftime("%Y%m%d")
    config = ConfigObj("config.ini")
 
    if today != config["Settings"]["last_downloaded"]:
        config["Settings"]["last_downloaded"] = today
 
        try: 
            with open("config.ini", "w") as cfg:
                config.write(cfg)
        except IOError:
            print "Error writing file!"
            return
 
        key = config["Settings"]["api_key"]
        url = "http://api.rottentomatoes.com/api/public/v1.0/lists/movies/in_theaters.json?apikey=%s"
        res = requests.get(url % key)
 
        data = res.content
 
        js = simplejson.loads(data)
 
        movies = js["movies"]
        for movie in movies:
            print movie["title"]
            getMovieDetails(key, movie["title"]) 
        print
 
#----------------------------------------------------------------------
if __name__ == "__main__":
    getInTheaterMovies()

Here we import Python’s datetime module and use it to get today’s date in the following format: YYYYMMDD. Next we check if the config file’s last_downloaded value equals today’s date. If it does, we do nothing. However, if they don’t match, we set last_downloaded to today’s date and then we download the movie data. Now we’re ready to learn how to save the data to a database.
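The once-a-day check can be tried in isolation, without touching the API. The following is a minimal sketch for Python 3 that swaps ConfigObj for the standard library's configparser so that it has no third-party dependencies. The should_download helper is hypothetical, and the sketch assumes the config file has a Settings section with a last_downloaded key.

```python
import configparser
import datetime
import os
import tempfile


def should_download(config_path, today=None):
    """Return True (and record the date) if nothing was downloaded today."""
    today = today or datetime.date.today().strftime("%Y%m%d")
    parser = configparser.ConfigParser()
    parser.read(config_path)
    if not parser.has_section("Settings"):
        parser.add_section("Settings")

    if parser.get("Settings", "last_downloaded", fallback="") == today:
        return False  # already fetched today, nothing to do

    # Record today's date before the caller starts downloading
    parser.set("Settings", "last_downloaded", today)
    with open(config_path, "w") as cfg:
        parser.write(cfg)
    return True


if __name__ == "__main__":
    # Work on a throwaway copy so a real config.ini is left alone
    path = os.path.join(tempfile.mkdtemp(), "config.ini")
    with open(path, "w") as cfg:
        cfg.write("[Settings]\napi_key = API KEY\nlast_downloaded =\n")
    print(should_download(path, "20131111"))  # first call of the day
    print(should_download(path, "20131111"))  # same day again
```

The first call returns True and stores the date, so the second call with the same date returns False. Like the code above, this records the date before any download happens, so a failed download would not be retried until the next day. Moving the write to after a successful download is one way to change that.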

Saving the Data with SQLite

Python has supported SQLite natively since version 2.5, so unless you’re using a really old version of Python, you should be able to follow along with this part of the article without any problems. Basically, we just need to add a function that can create a database and save our data into it. Here is the function:

#----------------------------------------------------------------------
def saveData(movie):
    """
    Save the data to a SQLite database
    """
    if not os.path.exists("movies.db"):
        # create the database
        conn = sqlite3.connect("movies.db")
 
        cursor = conn.cursor()
 
        cursor.execute("""CREATE TABLE movies 
        (id integer PRIMARY KEY, title text, rated text,
        movie_synopsis text, critics_consensus text, runtime integer,
        critics_score integer, audience_score integer)""")
 
        cursor.execute("""
        CREATE TABLE cast
        (id integer PRIMARY KEY,
        actor text, 
        character text)
        """)
 
        cursor.execute("""
        CREATE TABLE movie_cast
        (movie_id integer, 
        cast_id integer,
        FOREIGN KEY(movie_id) REFERENCES movies(id),
        FOREIGN KEY(cast_id) REFERENCES cast(id)
        )
        """)
    else:
        conn = sqlite3.connect("movies.db")
        cursor = conn.cursor()
 
    # insert the data; passing NULL for id lets SQLite assign the next row id
    print
    sql = "INSERT INTO movies VALUES(NULL, ?, ?, ?, ?, ?, ?, ?)"
    cursor.execute(sql, (movie["title"],
                         movie["mpaa_rating"],
                         movie["synopsis"],
                         movie["critics_consensus"],
                         movie["runtime"],
                         movie["ratings"]["critics_score"],
                         movie["ratings"]["audience_score"]
                         )
                   )
    movie_id = cursor.lastrowid
 
    for actor in movie["abridged_cast"]:
        print "%s as %s" % (actor["name"], actor["characters"][0])
        sql = "INSERT INTO cast VALUES(NULL, ?, ?)"
        cursor.execute(sql, (actor["name"],
                             actor["characters"][0]
                             )
                       )
        cast_id = cursor.lastrowid
 
        # link the actor to the movie
        sql = "INSERT INTO movie_cast VALUES(?, ?)"
        cursor.execute(sql, (movie_id, cast_id) )
 
    conn.commit()
    conn.close()

This code first checks to see if the database file already exists. If it does not, then it will create the database along with 3 tables. Otherwise the saveData function will create a connection and a cursor object. Next it will insert the data using the movie dictionary that is passed to it. We’ll call this function and pass the movie dictionary from the getMovieDetails function. Finally, we will commit the data to the database and close the connection.
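To see how the three tables fit together, it helps to read the data back out with a join. The sketch below is written for Python 3 and builds a tiny in-memory database with the same table names, then lists the cast of one movie. The build_sample_db and cast_for helpers and the placeholder rows are hypothetical. The query joins on SQLite's implicit rowid, which is the value cursor.lastrowid returns, and it quotes the cast table name because CAST is also an SQL keyword.

```python
import sqlite3


def build_sample_db():
    """Create an in-memory database shaped like the three tables above."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE movies (title text);
        CREATE TABLE "cast" (actor text, character text);
        CREATE TABLE movie_cast (movie_id integer, cast_id integer);
    """)
    movie_id = conn.execute(
        "INSERT INTO movies VALUES (?)", ("Example Movie",)).lastrowid
    for actor, role in [("Actor One", "Hero"), ("Actor Two", "Villain")]:
        cast_id = conn.execute(
            'INSERT INTO "cast" VALUES (?, ?)', (actor, role)).lastrowid
        # movie_cast links each cast row to its movie
        conn.execute("INSERT INTO movie_cast VALUES (?, ?)", (movie_id, cast_id))
    return conn


def cast_for(conn, title):
    """Return (actor, character) pairs for the movie with the given title."""
    query = """
        SELECT "cast".actor, "cast".character
        FROM movies
        JOIN movie_cast ON movie_cast.movie_id = movies.rowid
        JOIN "cast" ON "cast".rowid = movie_cast.cast_id
        WHERE movies.title = ?
        ORDER BY "cast".rowid
    """
    return conn.execute(query, (title,)).fetchall()


if __name__ == "__main__":
    for actor, role in cast_for(build_sample_db(), "Example Movie"):
        print("%s as %s" % (actor, role))
```

Pointing cast_for at a connection to movies.db instead of the in-memory sample should return the cast saved by saveData, assuming the table layout matches.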

You’re probably wondering what the complete code looks like. Well, here it is:

import datetime
import os
import requests
import simplejson
import sqlite3
import urllib
 
from configobj import ConfigObj
 
#----------------------------------------------------------------------
def getMovieDetails(key, title):
    """
    Get additional movie details
    """
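    # URL-encode the title: spaces become "+" and characters
    # such as "&" or ":" are escaped so the query string stays valid
    title = urllib.quote_plus(title.encode("utf-8"))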
 
    link = "http://api.rottentomatoes.com/api/public/v1.0/movies.json"
    url = "%s?apikey=%s&q=%s&page_limit=1"
    url = url % (link, key, title)
    res = requests.get(url)
    js = simplejson.loads(res.content)
 
    for movie in js["movies"]:
        print "rated: %s" % movie["mpaa_rating"]
        print "movie synopsis: " + movie["synopsis"]
        print "critics_consensus: " + movie["critics_consensus"]
 
        print "Major cast:"
        for actor in movie["abridged_cast"]:
            print "%s as %s" % (actor["name"], actor["characters"][0])
 
        ratings = movie["ratings"]
        print "runtime: %s"  % movie["runtime"]
        print "critics score: %s" % ratings["critics_score"]
        print "audience score: %s" % ratings["audience_score"]
        print "for more information: %s" % movie["links"]["alternate"]
        saveData(movie)
    print "-" * 40
    print
 
#----------------------------------------------------------------------
def getInTheaterMovies():
    """
    Get a list of movies in theaters. 
    """
    today = datetime.datetime.today().strftime("%Y%m%d")
    config = ConfigObj("config.ini")
 
    if today != config["Settings"]["last_downloaded"]:
        config["Settings"]["last_downloaded"] = today
 
        try: 
            with open("config.ini", "w") as cfg:
                config.write(cfg)
        except IOError:
            print "Error writing file!"
            return
 
        key = config["Settings"]["api_key"]
        url = "http://api.rottentomatoes.com/api/public/v1.0/lists/movies/in_theaters.json?apikey=%s"
        res = requests.get(url % key)
 
        data = res.content
 
        js = simplejson.loads(data)
 
        movies = js["movies"]
        for movie in movies:
            print movie["title"]
            getMovieDetails(key, movie["title"]) 
        print
 
#----------------------------------------------------------------------
def saveData(movie):
    """
    Save the data to a SQLite database
    """
    if not os.path.exists("movies.db"):
        # create the database
        conn = sqlite3.connect("movies.db")
 
        cursor = conn.cursor()
 
        cursor.execute("""CREATE TABLE movies 
        (id integer PRIMARY KEY, title text, rated text,
        movie_synopsis text, critics_consensus text, runtime integer,
        critics_score integer, audience_score integer)""")
 
        cursor.execute("""
        CREATE TABLE cast
        (id integer PRIMARY KEY,
        actor text, 
        character text)
        """)
 
        cursor.execute("""
        CREATE TABLE movie_cast
        (movie_id integer, 
        cast_id integer,
        FOREIGN KEY(movie_id) REFERENCES movies(id),
        FOREIGN KEY(cast_id) REFERENCES cast(id)
        )
        """)
    else:
        conn = sqlite3.connect("movies.db")
        cursor = conn.cursor()
 
    # insert the data; passing NULL for id lets SQLite assign the next row id
    print
    sql = "INSERT INTO movies VALUES(NULL, ?, ?, ?, ?, ?, ?, ?)"
    cursor.execute(sql, (movie["title"],
                         movie["mpaa_rating"],
                         movie["synopsis"],
                         movie["critics_consensus"],
                         movie["runtime"],
                         movie["ratings"]["critics_score"],
                         movie["ratings"]["audience_score"]
                         )
                   )
    movie_id = cursor.lastrowid
 
    for actor in movie["abridged_cast"]:
        print "%s as %s" % (actor["name"], actor["characters"][0])
        sql = "INSERT INTO cast VALUES(NULL, ?, ?)"
        cursor.execute(sql, (actor["name"],
                             actor["characters"][0]
                             )
                       )
        cast_id = cursor.lastrowid
 
        # link the actor to the movie
        sql = "INSERT INTO movie_cast VALUES(?, ?)"
        cursor.execute(sql, (movie_id, cast_id) )
 
    conn.commit()
    conn.close()
 
#----------------------------------------------------------------------
if __name__ == "__main__":
    getInTheaterMovies()

If you use Firefox, there’s a fun plugin called SQLite Manager that you can use to visualize the database that we’ve created. Here is a screenshot of what was produced at the time of writing:

[Screenshot: rotten_tomatoes_db]

Wrapping Up

There are still lots of things that should be added. For example, we need some code in the getInTheaterMovies function that will load the details from the database if we’ve already got the current data. We also need to add some logic to the database to prevent us from adding the same actor or movie multiple times. It would be nice if we had some kind of GUI or web interface as well. These are all things you can add as a fun little exercise.
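As a starting point for the duplicate check, here is one possible approach, sketched for Python 3 against an in-memory database. It assumes a simplified movies table in which the title alone identifies a movie. That is a simplification, since remakes can share a title. The get_or_create_movie helper is hypothetical. A UNIQUE constraint combined with INSERT OR IGNORE lets SQLite silently skip rows that are already stored.

```python
import sqlite3


def get_or_create_movie(conn, title):
    """Insert the title once and return its row id on every call."""
    # OR IGNORE turns a UNIQUE violation into a no-op instead of an error
    conn.execute("INSERT OR IGNORE INTO movies (title) VALUES (?)", (title,))
    row = conn.execute(
        "SELECT id FROM movies WHERE title = ?", (title,)).fetchone()
    return row[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Assumed layout: the UNIQUE constraint on title is what blocks duplicates
    conn.execute(
        "CREATE TABLE movies (id integer PRIMARY KEY, title text UNIQUE)")
    first = get_or_create_movie(conn, "Example Movie")
    second = get_or_create_movie(conn, "Example Movie")
    count = conn.execute("SELECT COUNT(*) FROM movies").fetchone()[0]
    print("ids match: %s, rows stored: %d" % (first == second, count))
```

The same pattern could be applied to the cast table, for instance with a UNIQUE constraint spanning the actor and character columns.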

By the way, this article was inspired by the Real Python for the Web book by Michael Herman. It has lots of neat ideas and samples in it. You can check it out here.


Published at DZone with permission of Mike Driscoll, author and DZone MVB. (source)
