Hi, I would like to split a large BigQuery table (10 billion event records) into multiple tables based on the event_type column in the large table.

Note that the events table is partitioned by day on the event_time column. Further assume that it has a year of data (365 days).

Let's assume event_type = ['sign-up', 'page-view'].

My approach:

  1. Create a new table for each event type.
  2. Run an insert job for each event type for each day (I will be using DML inside a Python script); see the sketch after this list.
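
For reference, a rough sketch of what I have in mind (project, dataset, and table names below are just placeholders):

from google.cloud import bigquery

client = bigquery.Client()

EVENT_TYPES = ["sign-up", "page-view"]
SOURCE_TABLE = "my-project.my_dataset.events"       # placeholder source table
DEST_TEMPLATE = "my-project.my_dataset.events_{}"   # one destination table per event type

# One INSERT ... SELECT per event type per day (shown here for a single day).
day = "2022-01-01"
for event_type in EVENT_TYPES:
    dest = DEST_TEMPLATE.format(event_type.replace("-", "_"))
    sql = f"""
        INSERT INTO `{dest}`
        SELECT *
        FROM `{SOURCE_TABLE}`
        WHERE event_type = '{event_type}'
          AND DATE(event_time) = '{day}'
    """
    client.query(sql).result()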

My questions:

  1. What job type should I use: a copy job or a load job?
  2. Can I queue the jobs to Google BigQuery? (Would it work asynchronously?)
  3. Would Google BigQuery process these jobs in parallel?
  4. Is there anything I need to do in terms of multiprocessing in order to speed up the process? (The jobs are handled by BigQuery, so if I can queue them up I shouldn't need any multiprocessing on the client side.)

Any pointers to an efficient solution are highly appreciated.

1 Answer

You should use query jobs for your requirement. Load jobs are used to ingest data into BigQuery from Cloud Storage buckets or local files, and copy jobs only copy whole tables without any filtering, so neither fits a split by event_type.

The quotas and limits for query jobs can be found here. They apply to query jobs created automatically by running interactive queries and scheduled queries, and to jobs submitted programmatically through the jobs.query and query-type jobs.insert API methods. A project can make up to 300 concurrent API requests per second per user.

Query jobs submitted through the jobs.insert method are executed asynchronously. You can do the same with the Python client library (as you intended), as shown below. Please refer to this doc for more information.

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the destination table.
# table_id = "your-project.your_dataset.your_table_name"

job_config = bigquery.QueryJobConfig(destination=table_id)

sql = """
    SELECT corpus
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY corpus;
"""

# Start the query, passing in the extra configuration.
query_job = client.query(sql, job_config=job_config)  # Make an API request.
query_job.result()  # Wait for the job to complete.

print("Query results loaded to the table {}".format(table_id))

Since the jobs will be running concurrently, there is no need to implement explicit multiprocessing.
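
To queue many such jobs (one per event type, or per event type per day), you can submit them all first and wait for the results at the end. A minimal sketch, assuming placeholder project, dataset, and table names:

from google.cloud import bigquery

client = bigquery.Client()

# Submit one asynchronous query job per event type.
# client.query() returns immediately; BigQuery runs the jobs server-side.
jobs = []
for event_type in ["sign-up", "page-view"]:
    destination = f"my-project.my_dataset.events_{event_type.replace('-', '_')}"
    job_config = bigquery.QueryJobConfig(
        destination=destination,
        write_disposition="WRITE_APPEND",  # append to the per-event-type table
    )
    sql = f"""
        SELECT *
        FROM `my-project.my_dataset.events`
        WHERE event_type = '{event_type}'
    """
    jobs.append(client.query(sql, job_config=job_config))

# Wait for all the jobs to finish; no client-side multiprocessing is needed.
for job in jobs:
    job.result()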
