Day 3 - Ingesting NBA Player Data into BigQuery using Python, Cloud Storage, and the Sportsdata.io API
In this article, we will learn how to fetch data from the Sportsdata.io API, store it in Google Cloud Storage, and load it into a BigQuery table. The code for this article can be found in this GitHub repo under the Day_3 folder.
Prerequisites:
Python 3.x
Google Cloud Project with a Service Account having Storage Admin and BigQuery User privileges
Sportsdata.io account and its NBA API key
Setting up:
Create a Google Cloud project and enable the Google Cloud Storage API if not already enabled.
Create a Service Account with Storage Admin and BigQuery User privileges and download the JSON key file.
Obtain an NBA API key from the Sportsdata.io sign-up form.
Create an .env file with the following environment variables and fill in their respective values:
GCP_STORAGE_BUCKET_NAME = <your_bucket_name>
GCP_REGION = <the_gcp_region_for_this_task>
SPORTSDATA_API_KEY = <api_key_here>
GCP_PROJECT_ID = <your_gcp_project_id>
Create and activate a virtual environment that you will use to install dependencies and run the project:
$ python -m venv venv
$ venv\Scripts\activate       # For Windows
$ source venv/bin/activate    # For MacOS/UNIX systems
Install the dependencies that will be used in this project:
$ pip install google-cloud-storage google-cloud-bigquery python-dotenv requests
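Before writing the full script, you can optionally sanity-check that the environment variables load correctly. Below is a minimal sketch that assumes the .env file sits in the directory you run it from:
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current directory

# Print whether each expected variable is present (values are not shown).
for var in ("GCP_STORAGE_BUCKET_NAME", "GCP_REGION", "SPORTSDATA_API_KEY", "GCP_PROJECT_ID"):
    print(f"{var}: {'set' if os.getenv(var) else 'MISSING'}")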
Writing the Script
Create a Python file called nba_datalake.py and paste the following code:
import csv
import requests
import os
from google.cloud import storage, bigquery
from dotenv import load_dotenv
# Load BigQuery and Cloud Storage Clients
storage_client = storage.Client()
bigquery_client = bigquery.Client()
# Load environment variables
load_dotenv()
file_name = '<your-gcp-service-account-credentials>.json'  # Replace this with the actual filename
the_abs_path = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
parent_path = os.path.abspath(os.path.join(the_abs_path, os.pardir))
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = os.path.join(parent_path, file_name) # load the service account json file to the environment
bucket_name = os.getenv('GCP_STORAGE_BUCKET_NAME')
region = os.getenv('GCP_REGION')
project_id = os.getenv('GCP_PROJECT_ID')
api_key = os.getenv('SPORTSDATA_API_KEY')
nba_endpoint = "https://api.sportsdata.io/v3/nba/scores/json/Players"
# The name for your BigQuery table
bigquery_table_name = "nba_players"
def create_gcs_bucket():
    # Create a GCS bucket
    try:
        storage_client.get_bucket(bucket_name)
        print(f"Bucket {bucket_name} exists")
    except:
        try:
            bucket = storage_client.create_bucket(
                bucket_name,
                location=region
            )
            print(f"Successfully created bucket {bucket.name}")
        except Exception as e:
            print(f"Error creating bucket: {e}")
def fetch_nba_data():
    """Fetch NBA player data from sportsdata.io."""
    try:
        headers = {"Ocp-Apim-Subscription-Key": api_key}
        response = requests.get(nba_endpoint, headers=headers)
        response.raise_for_status()
        print("Fetched NBA data successfully.")
        # We will be storing the data in a CSV format
        fields = ['PlayerID', 'FirstName', 'LastName', 'Team', 'Position']
        out_file = "./players.csv"
        rows = []
        for player in response.json():
            the_dict = {"PlayerID": player['PlayerID'],
                        "FirstName": player["FirstName"], "LastName": player["LastName"],
                        "Team": player["Team"], "Position": player["Position"]}
            rows.append(the_dict.values())
        # Write the data to a CSV file
        with open(out_file, 'w', newline='', encoding='utf-8') as csvfile:
            csvwriter = csv.writer(csvfile)
            csvwriter.writerow(fields)
            csvwriter.writerows(rows)
    except Exception as e:
        print(f"Error fetching NBA data: {e}")
        return []
def upload_data_to_gcs():
    """Upload NBA data to the GCS bucket."""
    try:
        # Upload the CSV data to GCS
        file_name = "raw-data/nba_player_data.csv"
        bucket = storage_client.bucket(bucket_name)
        blob = bucket.blob(file_name)
        blob.upload_from_filename(os.path.join(os.getcwd(), "players.csv"))
        print(f"Uploaded data to GCS: {file_name}")
    except Exception as e:
        print(f"Error uploading data to GCS: {e}")
def create_bigquery_table():
    try:
        # Create a BigQuery dataset
        dataset_name = bigquery.Dataset(f'{project_id}.my_nba_dataset')
        dataset = bigquery_client.create_dataset(dataset_name, exists_ok=True)
        # Define the BigQuery Schema (a list keeps the column order stable)
        schema = [
            bigquery.SchemaField("PlayerID", "INTEGER"),
            bigquery.SchemaField("FirstName", "STRING"),
            bigquery.SchemaField("LastName", "STRING"),
            bigquery.SchemaField("Team", "STRING"),
            bigquery.SchemaField("Position", "STRING"),
        ]
        # Create the BigQuery table in the dataset
        table_ref = dataset.table(bigquery_table_name)
        table = bigquery.Table(table_ref, schema=schema)
        bigquery_client.create_table(table, exists_ok=True)
        print(f"BigQuery dataset {dataset.dataset_id} created successfully.")
    except Exception as e:
        print(f"Error creating BigQuery dataset: {e}")
def load_data_to_bigquery():
    """Load data from GCS to BigQuery."""
    bigquery_dataset_name = f'{project_id}.my_nba_dataset'
    try:
        uri = f"gs://{bucket_name}/raw-data/nba_player_data.csv"
        table_ref = f"{bigquery_dataset_name}.{bigquery_table_name}"
        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            autodetect=True,
        )
        load_job = bigquery_client.load_table_from_uri(
            uri, table_ref, job_config=job_config
        )
        load_job.result()  # Wait for the job to complete
        print(f"Data loaded to BigQuery table '{bigquery_table_name}'.")
    except Exception as e:
        print(f"Error loading data to BigQuery: {e}")
def main():
    print("Setting up data lake for NBA sports analytics...")
    create_gcs_bucket()
    nba_data = fetch_nba_data()
    upload_data_to_gcs()
    create_bigquery_table()
    load_data_to_bigquery()
    print("Data lake setup complete.")

if __name__ == "__main__":
    main()
Script explanation:
We start by importing the necessary modules that we'll be using: csv, requests, os, bigquery, storage and dotenv.
We then load the environment variables using load_dotenv(). It's good practice to store sensitive credentials like API keys in an .env file rather than hardcoding them in the working file.
We then create instances of storage_client and bigquery_client to interact with the Cloud Storage and BigQuery services.
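As a quick aside, both clients pick up credentials from the GOOGLE_APPLICATION_CREDENTIALS environment variable by default; you can also pass the project explicitly if you prefer. A minimal sketch (the project ID shown is a placeholder):
from google.cloud import storage, bigquery

# Both clients accept an explicit project ID; otherwise they infer it
# from the credentials pointed to by GOOGLE_APPLICATION_CREDENTIALS.
storage_client = storage.Client(project="<your_gcp_project_id>")
bigquery_client = bigquery.Client(project="<your_gcp_project_id>")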
Creating a Cloud Storage bucket:
The create_gcs_bucket function checks if a bucket with the same name exists in our Cloud environment. If it exists, we get a print message informing us that it does. If it doesn't exist, it tries to create the bucket in the specified region using the storage_client.create_bucket method.
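As a side note, instead of a bare try/except around get_bucket, you could catch the client library's NotFound error explicitly, so that only a missing bucket triggers creation. A rough sketch of that variation (the function name is illustrative):
from google.api_core.exceptions import NotFound
from google.cloud import storage

def create_bucket_if_missing(client: storage.Client, name: str, region: str):
    try:
        client.get_bucket(name)
        print(f"Bucket {name} exists")
    except NotFound:
        # Only a missing bucket reaches this branch; permission or network
        # errors from get_bucket still surface normally.
        bucket = client.create_bucket(name, location=region)
        print(f"Successfully created bucket {bucket.name}")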
Fetching the NBA Player data:
The fetch_nba_data function fetches the player data from the given endpoint using the requests library. We then define the column names (fields) that will be created in the CSV file. A for loop iterates through the JSON response and the collected rows are written to a CSV file called players.csv in the current working directory.
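If you prefer, the same file could be written with csv.DictWriter, which keeps the field mapping in one place. A rough sketch, assuming the same fields list and the response object returned by requests:
import csv

fields = ['PlayerID', 'FirstName', 'LastName', 'Team', 'Position']
with open("./players.csv", 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=fields, extrasaction='ignore')
    writer.writeheader()
    # extrasaction='ignore' drops every key in each player dict except
    # the five columns listed in fields.
    writer.writerows(response.json())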
Uploading the data to Cloud Storage:
The upload_data_to_gcs function uploads players.csv to the Cloud Storage bucket created earlier. It defines the path to the local CSV file and the destination object name in the bucket (raw-data/nba_player_data.csv), then uploads the file using the upload_from_filename method of the blob object.
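To confirm the object landed where you expect, you can list the blobs under the raw-data/ prefix; a small sketch reusing the same storage_client and bucket_name:
# List objects under the raw-data/ prefix to verify the upload.
for blob in storage_client.list_blobs(bucket_name, prefix="raw-data/"):
    print(blob.name, blob.size)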
Creating the BigQuery dataset and table:
The create_bigquery_table function creates a dataset called my_nba_dataset in our project (if it doesn't already exist), defines the schema for the five player columns, and creates the nba_players table in that dataset using the bigquery_client.create_table method.
Loading the data into BigQuery:
The load_data_to_bigquery function loads the data from the Cloud Storage bucket into the BigQuery table. It defines the BigQuery table reference by specifying the dataset and table name, and creates a LoadJobConfig object specifying the source format as CSV and enabling schema autodetection.
It then initiates a load job using the bigquery_client.load_table_from_uri method, specifying the source URI, destination table reference, and load job configuration, and waits for the job to complete using load_job.result().
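Once the load job finishes, you can optionally confirm the row count with get_table; a minimal sketch reusing the same client and the dataset and table names above:
# Fetch table metadata to confirm the load populated rows.
table = bigquery_client.get_table(f"{project_id}.my_nba_dataset.{bigquery_table_name}")
print(f"Loaded {table.num_rows} rows into {table.full_table_id}")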
Running the script:
The main function is the entry point of the script. It calls the create_gcs_bucket, fetch_nba_data, upload_data_to_gcs, create_bigquery_table and load_data_to_bigquery functions to execute the steps above, in that order.
$ python nba_datalake.py
Screenshots:
Cloud Storage CSV file object:
BigQuery dataset & table:
You can proceed to run queries of your choosing directly against your dataset in BigQuery.
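For example, a quick aggregation over the new table from the Python client might look like the sketch below (assuming the dataset and table names used in this article):
# Count players per team in the newly loaded table.
query = f"""
    SELECT Team, COUNT(*) AS player_count
    FROM `{project_id}.my_nba_dataset.nba_players`
    GROUP BY Team
    ORDER BY player_count DESC
"""
for row in bigquery_client.query(query).result():
    print(row.Team, row.player_count)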
Conclusion
In this project, we successfully demonstrated the creation of a data pipeline for ingesting NBA player data. Utilizing Python, we extracted data from the Sportsdata.io API, transformed it into a suitable format, and loaded it into a Cloud Storage bucket. Subsequently, we leveraged BigQuery to efficiently store and analyze the data.
The architecture implemented in this project serves as a scalable and adaptable model for ingesting and analyzing data from various sports and other data sources.