
Context: I have an app that serves interactive graphs and data analysis. To calculate plots and data summaries, it uses a dataset that is loaded upon app initialization by querying Google BigQuery. The data is then kept as a global variable (in memory) and used in all data calculations and plots that might be run by different users (each user saves their own filters/mask in their session).
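
Roughly, the setup looks like this (a simplified sketch; I'm assuming Flask here, and the query, table and function names are illustrative):

    from flask import Flask
    from google.cloud import bigquery

    app = Flask(__name__)

    def load_dataset():
        # Illustrative query; the real table and post-processing are more involved.
        client = bigquery.Client()
        return client.query("SELECT * FROM `project.dataset.table`").to_dataframe()

    # Loaded once at instance startup and shared by all user requests.
    DATASET = load_dataset()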

This dataset changes in BigQuery once per day, during the night (I know the exact datetime of the refresh). Once the data is refreshed in BigQuery, I want the global variable holding the dataset to be refreshed as well.

I know that the proper solution would be to query a database on each user request, but BigQuery's high request latency doesn't make this a good option, and I can't use another DB.

The only solution I've come across so far is to restart the Google App Engine service (all instances) after the BigQuery data refresh. Please note that this should be a scheduled action, done programmatically.

My questions:

  • In case restarting the service is the best possible solution, how should I be restarting the service?
  • In case there is another way to accomplish what I want, please let me know

3 Answers


It's likely good practice to cache your dataset as you're doing; if you know the data hasn't changed, then there's no need to re-query BigQuery for it.

However, your dataset does change, just once per day.

So, I think your approach should be to revise your app so that it refreshes the cached copy of your BigQuery dataset every day and blocks your users from querying the dataset while it changes.

You actually need only refresh the dataset if a user requests it (there's no need to refresh it on days when no users need it), so, depending on the time the refresh takes and your users' expectations on latency, you could trigger the refresh from a user request: has the dataset changed? If so, block this request, refresh the data and then respond to the user.
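
A rough sketch of that request-time refresh (load_dataset and dataset_changed_since here are placeholders for your own query and staleness check):

    import threading
    import time

    _lock = threading.Lock()
    _loaded_at = None  # when the in-memory copy was loaded
    DATASET = None

    def get_dataset():
        global DATASET, _loaded_at
        with _lock:  # concurrent requests block here while a refresh is running
            if DATASET is None or dataset_changed_since(_loaded_at):  # placeholder check
                DATASET = load_dataset()  # placeholder: re-query BigQuery
                _loaded_at = time.time()
        return DATASET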

I assume you've already solved the problem that your users' data plots and calculations will differ for different datasets.

DazWilkin
  • Waiting for a user request is not possible, as querying and processing the dataset takes 20 seconds (something that would not be acceptable). However, your answer made me think about refreshing the global variable holding the dataset via a specific request made to the server. Unfortunately this would not be possible either: if there are multiple instances, only one of them would refresh its dataset. – David Olmo Pérez Mar 17 '19 at 19:42
  • You could take advantage of a cache shared by all instances, in the form of App Engine Memcache, or more simply create a 'cache' service in your app. Per-instance caching requires you to make multiple calls to BQ (one per instance) to refresh the dataset each day, and exposes you to potential inconsistency between instances. However, I feel my advice stands: your app (manifest per instance) should expire the cached dataset daily, suspend user access, refresh its copy and then resume. – DazWilkin Mar 17 '19 at 20:30
  • I just checked Memcache and it seems it is only available in Python 2 (not 3), because the google.appengine module (`from google.appengine.api import memcache`) is not available in 3, as seen [here](https://cloud.google.com/appengine/docs/standard/python/memcache/using) – David Olmo Pérez Mar 17 '19 at 22:50

One possible approach would be to trigger the running instances to exit (by themselves, i.e. commit suicide) once the BQ dataset is updated, and let GAE start new/replacement instances, which will load the updated dataset.

The trigger can be based on memcache, datastore or Cloud Storage/GCS (all much faster than BQ, so there's less of a penalty for checking them on every request). You want to be certain that the trigger doesn't also affect the freshly started instances:

  • make the trigger be, for example, the timestamp of the most recent BQ dataset update
  • keep a global variable holding the timestamp of when the dataset was loaded into memory
  • the trigger fires when the memcache/datastore timestamp is ~24h (or just "a lot") newer than the one in memory

For the action causing the exit I'd try (a combined sketch follows below):

  • a regular sys.exit(0) call (not quite sure if/how this works on GAE)
  • raising an exception (not as nice, as it'll leave nasty traces in the logs). If you use it, try to make the exception as clear as possible, to minimize the chances of it being accidentally interpreted as a real failure. Maybe something like:

    assert False, "Intentional crash to force an instance restart"
    
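Putting the trigger and the exit together, a rough per-request sketch (read_refresh_timestamp is a placeholder for reading the shared memcache/datastore/GCS value):

    import sys
    import time

    DATASET_LOADED_AT = time.time()  # set when the dataset was loaded into memory

    def check_for_restart():
        # read_refresh_timestamp() is a placeholder: it reads the timestamp of
        # the most recent BQ dataset update from memcache/datastore/GCS.
        if read_refresh_timestamp() > DATASET_LOADED_AT:
            # The in-memory copy predates the latest refresh: exit so GAE
            # starts a replacement instance, which loads the fresh dataset.
            sys.exit(0)
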

Another possible approach would be to force an instance restart from the outside, by re-deploying the application using the same version string. The outage associated with the instance restarts caused by re-deploying the same version is actually why I dislike the service-version-based environment implementations; see Continuous integration/deployment/delivery on Google App Engine, too risky?

But for this to work you need some other environment to trigger and execute the deployment. It could be some other GAE service, or even a Cloud Function (in which case using a Storage event trigger would eliminate the need to explicitly poll for the dataset-updated condition).
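
For example, a Cloud Function sketch along those lines, assuming the nightly job writes/overwrites a marker object in GCS when it finishes (the Datastore kind and names are illustrative); it records the refresh timestamp somewhere cheap for the instances to poll:

    from google.cloud import datastore

    def on_dataset_refreshed(event, context):
        # Background function triggered by a GCS object change, which signals
        # that the nightly BigQuery refresh has finished.
        client = datastore.Client()
        entity = datastore.Entity(client.key('Meta', 'dataset_refresh'))
        entity['updated'] = context.timestamp  # event time, as an RFC3339 string
        client.put(entity)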

Dan Cornilescu
  • sys.exit seems a good hack and would completely work in my case. Anyway, in case there are multiple instances running, wouldn't this only terminate 1? I would like all of them to terminate... – David Olmo Pérez Mar 17 '19 at 19:48
  • Regarding the re-deployment option, it seems a good alternative as well. Anyway, I would like this process to be done automatically and programmatically. How could I re-deploy without using my local SDK? The only answer that comes to mind is installing the SDK on a Google Cloud Compute Engine instance and then creating a cron job... although this seems excessive – David Olmo Pérez Mar 17 '19 at 19:49
  • for `sys.exit()` (and the exception raising) you'd run the logic in each instance, so each of them would independently commit suicide. – Dan Cornilescu Mar 17 '19 at 21:28
  • Indeed, the re-deploy is more tedious, consider it only if the suicide approach doesn't work. – Dan Cornilescu Mar 17 '19 at 21:32
  • Thanks Dan for your answer. Maybe my understanding of Google App Engine is not correct, but I thought that one user only makes requests to one instance. In other words, if I request a sys.exit, only the instance I am making requests to would restart. I would not be able to pick each instance one by one and exit them specifically – David Olmo Pérez Mar 17 '19 at 21:58
  • Requests from one user can hit any instance (unless you have [cookie-based traffic split](https://cloud.google.com/appengine/docs/standard/python/splitting-traffic#cookie_splitting) enabled). But if an instance doesn't get a request it doesn't matter if it's not restarted. But this reminds me: you will get an error for that particular request during which you discover you need the instance to commit suicide. Unless you tolerate one "outdated" reply before the suicide. – Dan Cornilescu Mar 17 '19 at 22:05
  • Then this means that `sys.exit` would not work, right? – David Olmo Pérez Mar 17 '19 at 22:56
  • No, I mean that you wouldn't reply to the request since you know the reply would contain obsolete info, so the user would have to retry it. Maybe even several times, if subsequent requests keep hitting instances that didn't yet commit suicide. – Dan Cornilescu Mar 17 '19 at 23:00

I finally found a way to restart all instances programmatically, by using the Python API Discovery Client and a service account. It first gets the list of active instances and deletes all of them. Then it performs a simple request to start one of them back up.

    import requests
    from googleapiclient.discovery import build  # 'apiclient' is a legacy alias of this package
    from google.oauth2 import service_account

    # A service account with App Engine Admin permissions.
    credentials = service_account.Credentials.from_service_account_file('credentials.json')
    scoped_credentials = credentials.with_scopes([
        'https://www.googleapis.com/auth/appengine.admin',
        'https://www.googleapis.com/auth/cloud-platform',
    ])
    appengine = build(serviceName='appengine', version='v1', credentials=scoped_credentials)

    VERSION_ID = 'version_id'
    PROJECT_ID = 'project_id'
    SERVICE_ID = 'appengine_service_name'
    APP_URL = 'http://some_url.com'

    # List the active instances of this service version.
    active_instances_dict = appengine.apps().services().versions().instances().list(
        appsId=PROJECT_ID, servicesId=SERVICE_ID, versionsId=VERSION_ID).execute()
    list_of_instances = active_instances_dict['instances']

    # Delete every running instance; App Engine will start fresh ones on demand.
    for instance in list_of_instances:
        appengine.apps().services().versions().instances().delete(
            appsId=PROJECT_ID, servicesId=SERVICE_ID,
            versionsId=VERSION_ID, instancesId=instance['id']).execute()

    # Make one request to the app so a new instance starts up and loads the fresh dataset.
    requests.get(url=APP_URL)