Day 3 - Ingesting NBA Player Data into BigQuery using Python, Cloud Storage, and the Sportsdata.io API

In this article, we will learn how to fetch data from the Sportsdata.io API, store it in Google Cloud Storage, and load it into a BigQuery table. The code for this article can be found in this GitHub repo under the Day_3 folder.

Prerequisites:

  1. Python 3.x

  2. Google Cloud Project with a Service Account having Storage Admin and BigQuery User privileges

  3. Sportsdata.io account and its NBA API Key

Setting up:

  1. Create a Google Cloud project and enable the Google Cloud Storage API if not already enabled.

  2. Create a Service Account with Storage Admin and BigQuery User privileges and download the JSON key file

  3. Obtain an NBA API key from the Sportsdata.io sign-up form.

  4. Create a .env file with the following environment variables and fill in their respective values (a quick way to verify they are loading is shown after this list):

     GCP_STORAGE_BUCKET_NAME = <your_bucket_name>
    
     GCP_REGION = <the_gcp_region_for_this_task>
    
     SPORTSDATA_API_KEY = <api_key_here>
    
     GCP_PROJECT_ID = <your_gcp_project_id>
    
  5. Create and activate a virtual environment that you will use to install dependencies and run the project:

     $ python -m venv venv
    
     $ venv\Scripts\activate # For Windows
    
     $ source venv/bin/activate # For MacOS/UNIX systems
    
  6. Install the dependencies that will be used in this project:

     $ pip install google-cloud-storage google-cloud-bigquery python-dotenv requests
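
Before moving on, you can optionally confirm that the values from your .env file are being picked up. A minimal check (assuming the .env file sits next to the script) might look like this:

import os
from dotenv import load_dotenv

load_dotenv()
for var in ("GCP_STORAGE_BUCKET_NAME", "GCP_REGION", "SPORTSDATA_API_KEY", "GCP_PROJECT_ID"):
    print(var, "->", "set" if os.getenv(var) else "MISSING")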
    

Writing the Script

Create a Python file called nba_datalake.py and paste the following code:

import csv
import requests
import os
from google.cloud import storage, bigquery
from dotenv import load_dotenv

# Load environment variables
load_dotenv()
file_name = '<your-gcp-service-account-credentials.json>' # Replace this with the actual filename of your service account key
the_abs_path = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
parent_path = os.path.abspath(os.path.join(the_abs_path, os.pardir))
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = os.path.join(parent_path, file_name) # point the client libraries at the service account key, two directories above the working directory

# Create the Cloud Storage and BigQuery clients (after the credentials have been set)
storage_client = storage.Client()
bigquery_client = bigquery.Client()

bucket_name = os.getenv('GCP_STORAGE_BUCKET_NAME')
region = os.getenv('GCP_REGION')
project_id = os.getenv('GCP_PROJECT_ID')
api_key = os.getenv('SPORTSDATA_API_KEY')
nba_endpoint = "https://api.sportsdata.io/v3/nba/scores/json/Players"

# The name for your BigQuery table
bigquery_table_name = "nba_players"

def create_gcs_bucket():
    """Create the GCS bucket if it does not already exist."""
    try:
        storage_client.get_bucket(bucket_name)
        print(f"Bucket {bucket_name} exists")
    except Exception:
        try:
            bucket = storage_client.create_bucket(
                bucket_name,
                location=region
                )
            print(f"Successfully created bucket {bucket.name}")
        except Exception as e:
            print(f"Error creating bucket: {e}")

def fetch_nba_data():
    """Fetch NBA player data from sportsdata.io."""
    try:
        headers = {"Ocp-Apim-Subscription-Key": api_key}
        response = requests.get(nba_endpoint, headers=headers)
        response.raise_for_status()  
        print("Fetched NBA data successfully.")

        # We will be storing the data in a CSV format
        fields = ['PlayerID', 'FirstName', 'LastName', 'Team', 'Position']
        out_file = "./players.csv"
        rows = []
        for player in response.json():
            the_dict = {
                "PlayerID": player["PlayerID"],
                "FirstName": player["FirstName"],
                "LastName": player["LastName"],
                "Team": player["Team"],
                "Position": player["Position"],
            }
            rows.append(list(the_dict.values()))

        # Write the data to a CSV file
        with open(out_file, 'w', newline='', encoding='utf-8') as csvfile:
            csvwriter = csv.writer(csvfile)
            csvwriter.writerow(fields)
            csvwriter.writerows(rows) 
    except Exception as e:
        print(f"Error fetching NBA data: {e}")
        return []

def upload_data_to_gcs():
    """Upload NBA data to the GCS bucket."""
    try:
        # Upload the CSV data to GCS
        file_name = "raw-data/nba_player_data.csv"
        bucket = storage_client.bucket(bucket_name)
        blob = bucket.blob(file_name)
        blob.upload_from_filename(os.path.join(os.getcwd(), "players.csv"))
        print(f"Uploaded data to GCS: {file_name}")
    except Exception as e:
        print(f"Error uploading data to GCS: {e}")

def create_bigquery_table():
    """Create the BigQuery dataset and table."""
    try:
        # Create a BigQuery dataset (exists_ok avoids an error on re-runs)
        dataset = bigquery.Dataset(f'{project_id}.my_nba_dataset')
        dataset = bigquery_client.create_dataset(dataset, exists_ok=True)

        # Define the BigQuery schema as an ordered list of fields
        schema = [
            bigquery.SchemaField("PlayerID", "INTEGER"),
            bigquery.SchemaField("FirstName", "STRING"),
            bigquery.SchemaField("LastName", "STRING"),
            bigquery.SchemaField("Team", "STRING"),
            bigquery.SchemaField("Position", "STRING"),
        ]

        # Create the BigQuery table in the dataset
        table_ref = dataset.table(bigquery_table_name)
        table = bigquery.Table(table_ref, schema=schema)
        bigquery_client.create_table(table, exists_ok=True)

        print(f"BigQuery dataset 'my_nba_dataset' and table '{bigquery_table_name}' created successfully.")
    except Exception as e:
        print(f"Error creating BigQuery dataset or table: {e}")

def load_data_to_bigquery():
    """Load data from GCS to BigQuery."""
    bigquery_dataset_name = f'{project_id}.my_nba_dataset'
    try:
        uri = f"gs://{bucket_name}/raw-data/nba_player_data.csv"
        table_ref = f"{bigquery_dataset_name}.{bigquery_table_name}"

        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            autodetect=True,
        )
        load_job = bigquery_client.load_table_from_uri(
            uri, table_ref, job_config=job_config
        )
        load_job.result()  # Wait for the job to complete
        print(f"Data loaded to BigQuery table '{bigquery_table_name}'.")
    except Exception as e:
        print(f"Error loading data to BigQuery: {e}")

def main():
    print("Setting up data lake for NBA sports analytics...")
    create_gcs_bucket()
    fetch_nba_data()
    upload_data_to_gcs()
    create_bigquery_table()
    load_data_to_bigquery()
    print("Data lake setup complete.")

if __name__ == "__main__":
    main()

Script explanation:

We start by importing the modules we'll be using: csv, requests, os, the storage and bigquery clients from google.cloud, and load_dotenv from dotenv.

We then load the environment variables using load_dotenv(). It's good practice to store sensitive credentials like API keys in a .env file rather than hardcoding them in the script. We also point the GOOGLE_APPLICATION_CREDENTIALS environment variable at the service account key file so the Google Cloud client libraries can authenticate.

We then create instances of storage_client and bigquery_client to interact with the Cloud Storage and BigQuery services.

Creating a Cloud Storage bucket:

The create_gcs_bucket function checks whether a bucket with the given name already exists in our Cloud environment. If it exists, we get a print message informing us that it does. If it doesn't exist, the function tries to create the bucket in the specified region using the storage_client.create_bucket method.
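
The broad except Exception used above works, but if you want to distinguish a genuinely missing bucket from other failures (permissions, networking), you could catch the client library's NotFound error instead. A minimal sketch of that variant:

from google.api_core.exceptions import NotFound

def bucket_exists(client, name):
    """Return True if the bucket exists, False if Cloud Storage reports it as not found."""
    try:
        client.get_bucket(name)
        return True
    except NotFound:
        return False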

Fetching the NBA Player data:

The fetch_nba_data function fetches the player data from the endpoint using the requests library, passing the API key in the Ocp-Apim-Subscription-Key header. We then define the column names (fields) that will go into the CSV file, loop through the JSON response to extract those fields for each player, and write the data to a CSV file called players.csv in the current working directory.
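
An equivalent, slightly more compact way to build the CSV is csv.DictWriter, which picks the wanted fields straight out of each player dict. A sketch, assuming response holds the API response from the code above:

fields = ['PlayerID', 'FirstName', 'LastName', 'Team', 'Position']
with open('players.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=fields, extrasaction='ignore')
    writer.writeheader()                # header row with the column names
    writer.writerows(response.json())   # keep only the listed fields, ignore the rest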

Uploading the data to Cloud Storage:

The upload_data_to_gcs function uploads players.csv to the Cloud Storage bucket created earlier. It defines the path to the local CSV file and the destination object name in the bucket (raw-data/nba_player_data.csv).

It then uploads the CSV file to the bucket using the blob object's upload_from_filename method.
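
If you want to confirm the upload from Python rather than the Cloud Console, listing the objects under the raw-data/ prefix is a quick check, reusing the same storage_client:

for blob in storage_client.list_blobs(bucket_name, prefix="raw-data/"):
    print(blob.name, blob.size)  # e.g. raw-data/nba_player_data.csv and its size in bytes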

Creating the BigQuery dataset and table:

The create_bigquery_table function creates a BigQuery dataset called my_nba_dataset (if it doesn't already exist), defines a schema matching the CSV columns, and creates the nba_players table inside the dataset.

Loading the data into BigQuery:

The load_data_to_bigquery function loads the data from the Cloud Storage bucket into the BigQuery table. It builds the table reference by specifying the project, dataset, and table names.

It creates a LoadJobConfig object specifying the source format as CSV and enabling autodetection of schema.

It initiates a load job using the bigquery_client.load_table_from_uri method, specifying the source URI, destination table reference, and load job configuration.

It then waits for the load job to complete using the load_job.result() method.
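
If you would rather not rely on schema autodetection, the load job can reuse the explicit schema defined in create_bigquery_table; in that case you also need to skip the CSV header row yourself. A sketch of that variant of the job configuration:

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # the first CSV row holds the column names
    schema=[
        bigquery.SchemaField("PlayerID", "INTEGER"),
        bigquery.SchemaField("FirstName", "STRING"),
        bigquery.SchemaField("LastName", "STRING"),
        bigquery.SchemaField("Team", "STRING"),
        bigquery.SchemaField("Position", "STRING"),
    ],
)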

Running the script:

The main function is the entry point of the script. It calls create_gcs_bucket, fetch_nba_data, upload_data_to_gcs, create_bigquery_table, and load_data_to_bigquery in sequence to execute the steps described above. Run the script with:

$ python nba_datalake.py
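
If everything is configured correctly, the console output should look roughly like this (on subsequent runs the bucket line will instead report that the bucket already exists):

Setting up data lake for NBA sports analytics...
Successfully created bucket <your_bucket_name>
Fetched NBA data successfully.
Uploaded data to GCS: raw-data/nba_player_data.csv
BigQuery dataset 'my_nba_dataset' and table 'nba_players' created successfully.
Data loaded to BigQuery table 'nba_players'.
Data lake setup complete.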

Screenshots:

Cloud Storage CSV file object:

BigQuery dataset & table:

You can now run queries of your choosing directly against the dataset in BigQuery, as in the example below.
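
For example, a quick aggregation over the new table could count players per team. A small sketch using the same BigQuery client and the dataset and table names created by the script:

query = f"""
    SELECT Team, COUNT(*) AS player_count
    FROM `{project_id}.my_nba_dataset.{bigquery_table_name}`
    GROUP BY Team
    ORDER BY player_count DESC
"""
for row in bigquery_client.query(query).result():
    print(row.Team, row.player_count)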

Conclusion

In this project, we successfully demonstrated the creation of a data pipeline for ingesting NBA player data. Utilizing Python, we extracted data from the Sportsdata.io API, transformed it into a suitable format, and loaded it into a Cloud Storage bucket. Subsequently, we leveraged BigQuery to efficiently store and analyze the data.

The architecture implemented in this project serves as a scalable and adaptable model for ingesting and analyzing data from various sports and other data sources.