How to Back up Your Universal Analytics Data to Google BigQuery (with Python Code)

09 May 2024 / Updated 21 November 2024 / By Antonio

As the July 2024 deadline for the deletion of all Universal Analytics data approaches, businesses must secure their historical web data. Storing this data not only ensures compliance and continuity but also preserves invaluable insights into customer behaviour over time. Backing up your Universal Analytics data with Python and Google BigQuery is a robust, scalable, and cost-effective way for companies to safeguard their digital legacy.

Prerequisites

  • Google Cloud Platform (GCP) account.
  • Access to Google Analytics and BigQuery.
  • Basic knowledge of Python (you can run the code in Google Colab or a Jupyter notebook).

Step 1: Setting Up Your Environment

  1. Create a Service Account in GCP: Navigate to your Google Cloud Console, go to IAM & Admin -> Service Accounts, and create a new service account. Download the JSON key file, which will be used to authenticate your API requests (a quick check that the key loads correctly is sketched just after this list).
  2. Enable the API: Enable the Google Analytics Reporting API in your GCP project (APIs & Services -> Library).
  3. Install Required Libraries: Install the google-api-python-client, oauth2client, and google-cloud-bigquery libraries in your Python environment using pip (the leading ! in the commands below is Google Colab syntax; omit it in a regular terminal):
!pip install --upgrade google-api-python-client
!pip install --upgrade google-auth google-auth-oauthlib google-auth-httplib2
!pip install --upgrade google-cloud-bigquery
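
Before moving on, it is worth confirming that the downloaded key file actually loads, and noting the service account's email address, because that address must be given Read & Analyse access on the Universal Analytics view you want to export. The short sketch below assumes the key has been uploaded under the same placeholder filename used in the main script; adjust the path to match your own file.

# A minimal sketch to confirm the service-account key loads before running the full script.
from oauth2client.service_account import ServiceAccountCredentials

SCOPES = ['https://www.googleapis.com/auth/analytics.readonly']
KEY_FILE_LOCATION = '/content/your_ua_api_keys_file_name.json'  # Placeholder path; point it at your own JSON key.

credentials = ServiceAccountCredentials.from_json_keyfile_name(KEY_FILE_LOCATION, SCOPES)
print(credentials.service_account_email)  # Grant this email Read & Analyse access on your UA view.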

Step 2: Python Script to Pull Data

Check out the handy Python script below! It authenticates with the Google Analytics Reporting API, pulls your Universal Analytics data, and then neatly tucks it into BigQuery. It's a smooth way to handle your data backup needs!

# Import necessary libraries for API access, data handling, and exceptions handling.
from googleapiclient.discovery import build
from oauth2client.service_account import ServiceAccountCredentials
from google.cloud import bigquery
from google.cloud.exceptions import NotFound
import pandas as pd
import os

# Set the scopes and the path to your service account key file for authentication.
SCOPES = ['https://www.googleapis.com/auth/analytics.readonly']
KEY_FILE_LOCATION = '/content/your_ua_api_keys_file_name.json'

# Specify your Google Analytics View ID, BigQuery Project ID, Dataset name, and Table name.
VIEW_ID = ''  
BIGQUERY_PROJECT = ''  
BIGQUERY_DATASET = ''  
BIGQUERY_TABLE = ''  

# Set the environment variable for Google Application Credentials to authenticate API calls.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = KEY_FILE_LOCATION

def initialize_analyticsreporting():
    """ Initializes the Google Analytics Reporting API client. """
    credentials = ServiceAccountCredentials.from_json_keyfile_name(KEY_FILE_LOCATION, SCOPES)
    analytics = build('analyticsreporting', 'v4', credentials=credentials)
    return analytics

def get_report(analytics):
    """ Fetches the report data from Google Analytics. """
    return analytics.reports().batchGet(
        body={
            'reportRequests': [
                {
                    'viewId': VIEW_ID,
                    'dateRanges': [{'startDate': 'YYYY-MM-DD', 'endDate': 'YYYY-MM-DD'}],  # Replace the placeholders with your actual date range.
                    'metrics': [
                        {'expression': 'ga:sessions'},
                        {'expression': 'ga:pageviews'},
                        {'expression': 'ga:users'},
                        {'expression': 'ga:newUsers'},
                        {'expression': 'ga:bounceRate'},
                        {'expression': 'ga:sessionDuration'},
                        {'expression': 'ga:avgSessionDuration'},
                        {'expression': 'ga:pageviewsPerSession'}
                    ],
                    'dimensions': [
                        {'name': 'ga:date'},
                        {'name': 'ga:country'},
                        {'name': 'ga:pageTitle'},
                        {'name': 'ga:browser'},
                        {'name': 'ga:channelGrouping'},
                        {'name': 'ga:source'},
                        {'name': 'ga:pagePath'},
                        {'name': 'ga:deviceCategory'}
                    ],
                    'pageSize': 100000  # Maximum rows per request (API limit); larger views need pagination (see the note after the script).
                }
            ]
        }
    ).execute()

def response_to_dataframe(response):
    """ Converts the API response into a pandas DataFrame. """
    list_rows = []
    for report in response.get('reports', []):
        columnHeader = report.get('columnHeader', {})
        dimensionHeaders = columnHeader.get('dimensions', [])
        metricHeaders = columnHeader.get('metricHeader', {}).get('metricHeaderEntries', [])
        for row in report.get('data', {}).get('rows', []):
            row_data = {header: dimension for header, dimension in zip(dimensionHeaders, row.get('dimensions', []))}
            for values in row.get('metrics', []):
                for metricHeader, value in zip(metricHeaders, values.get('values')):
                    row_data[metricHeader.get('name')] = value
            list_rows.append(row_data)
    return pd.DataFrame(list_rows)

def upload_to_bigquery(df, project_id, dataset_id, table_id):
    """ Uploads the DataFrame to Google BigQuery. """
    df.columns = [col.replace('ga:', 'gs_') for col in df.columns]  # BigQuery column names can't contain ':', so rename the ga: prefix.
    bigquery_client = bigquery.Client(project=project_id)
    dataset_ref = bigquery_client.dataset(dataset_id)
    table_ref = dataset_ref.table(table_id)
    try:
        bigquery_client.get_table(table_ref)  # Check if table exists.
    except NotFound:
        schema = [bigquery.SchemaField(col, 'STRING' if df[col].dtype == 'object' else 'FLOAT') for col in df.columns]
        table = bigquery.Table(table_ref, schema=schema)
        bigquery_client.create_table(table)  # Create table if not exists.
        print(f"Created table {table_id}")
    load_job = bigquery_client.load_table_from_dataframe(df, table_ref)  # Load data into BigQuery.
    load_job.result()  # Wait for the job to complete.
    print(f"Data uploaded to {project_id}.{dataset_id}.{table_id}")  # Confirm upload.

def main():
    """ Main function to execute the script. """
    try:
        analytics = initialize_analyticsreporting()
        response = get_report(analytics)
        df = response_to_dataframe(response)
        upload_to_bigquery(df, BIGQUERY_PROJECT, BIGQUERY_DATASET, BIGQUERY_TABLE)
    except Exception as e:
        print(f"Error occurred: {e}")  # Print any errors that occur during execution.

if __name__ == '__main__':
    main()  # Run the script if this file is executed as the main program.
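
One thing to keep in mind: the script above fetches a single page of up to 100,000 rows per request. If your view returns more rows than that for the chosen date range and dimensions, the Reporting API v4 provides a nextPageToken in each report, which you pass back as pageToken on the next request. The sketch below is one possible way to extend the report fetch for that case; it is not part of the original script, and the metrics and dimensions are trimmed for brevity.

# A hedged sketch of pagination for larger exports: keep requesting pages while the
# API returns a nextPageToken, and collect all rows into a single list.
def get_report_all_pages(analytics, page_size=100000):
    """Fetches every page of the report and returns the column header plus the combined rows."""
    all_rows = []
    page_token = None
    while True:
        request = {
            'viewId': VIEW_ID,
            'dateRanges': [{'startDate': 'YYYY-MM-DD', 'endDate': 'YYYY-MM-DD'}],
            'metrics': [{'expression': 'ga:sessions'}],  # Use the same metrics and dimensions as get_report().
            'dimensions': [{'name': 'ga:date'}],
            'pageSize': page_size
        }
        if page_token:
            request['pageToken'] = page_token
        response = analytics.reports().batchGet(body={'reportRequests': [request]}).execute()
        report = response['reports'][0]
        all_rows.extend(report.get('data', {}).get('rows', []))
        page_token = report.get('nextPageToken')
        if not page_token:
            return report.get('columnHeader', {}), all_rows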

Tip: When working in Google Colab, remember to upload your API key file each session. Colab does not retain files after the session ends, so you must re-upload the JSON key whenever you restart your notebook to keep the code running without interruption. Two common ways to make the key available are sketched below.
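
For example, in a Colab cell you can either upload the key interactively or mount Google Drive and point KEY_FILE_LOCATION at a copy stored there (the Drive path below is only illustrative):

# Option 1: upload the JSON key into the current Colab session.
from google.colab import files
files.upload()  # Select your service-account key file in the dialog.

# Option 2: mount Google Drive and reference a key stored there (illustrative path).
from google.colab import drive
drive.mount('/content/drive')
KEY_FILE_LOCATION = '/content/drive/MyDrive/your_ua_api_keys_file_name.json'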

Once you have backed up your GA3 data to BigQuery, a confirmation message appears at the bottom of the Python script's output. The message ‘Data uploaded to {Your_BigQuery_Created_Table}’ indicates that your data has been successfully moved into the designated BQ table. If you see this message, everything went according to plan, and your historical data is now securely backed up in BigQuery and available for review or analysis later.
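
If you want an extra sanity check beyond the confirmation message, a quick row count against the new table is enough. The snippet below reuses the same project, dataset, and table names configured in the script.

# Optional sanity check: count the rows that landed in the BigQuery table.
from google.cloud import bigquery

client = bigquery.Client(project=BIGQUERY_PROJECT)
query = f"SELECT COUNT(*) AS row_count FROM `{BIGQUERY_PROJECT}.{BIGQUERY_DATASET}.{BIGQUERY_TABLE}`"
for row in client.query(query).result():  # Runs the query and waits for the result.
    print(f"Rows backed up: {row.row_count}")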

Costs of Running This Setup

  • API Costs: The Google Analytics Reporting API is free under usage limits (50,000 requests per day per project).
  • BigQuery Costs: The first 10 GB of storage each month is free; beyond that, BigQuery charges $0.02 per GB per month. Given the data size from our example (17.97 MB per table), the cost is negligible (see the quick calculation after this list).
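
As a quick back-of-the-envelope check using the 17.97 MB example table:

# Rough monthly storage cost for a 17.97 MB table at $0.02 per GB per month.
table_size_gb = 17.97 / 1024           # about 0.0175 GB
monthly_cost = table_size_gb * 0.02    # about $0.00035 per month
print(f"Estimated storage cost: ${monthly_cost:.5f} per month")
# In practice this sits well inside the 10 GB free tier, so the effective cost is $0.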

Benefits of This Approach

  • Data Security and Longevity: Safely back up your UA data as GA4 takes precedence.
  • Cost-Effective: With minimal storage costs and free data loading, this approach handles large datasets on a very small budget.

Conclusion

As the deadline to delete Universal Analytics data approaches, it is crucial to back up your GA3 historical data securely. This guide showed how to use Python and Google BigQuery to protect Universal Analytics data, which matters for strategic decisions and business growth. The process is beginner-friendly, from setting up Google Cloud through to the confirmation of a successful data transfer. Storing the data in Google BigQuery is cost-effective thanks to low storage prices and free use of the Google Analytics Reporting API within its limits. This proactive approach secures your investment in digital analytics and keeps a robust analytics framework in place for future innovation and success, since access to historical data is vital for maintaining consistent, aligned digital strategies over time.

Tags
export universal analytics data to bigquery, ga3 to bq, ua to bq, universal analytics back up