I have a set of notebooks for teaching Spark's Structured Streaming API. I don't mind that so many logging messages appear in the JupyterLab output cell, since I can always configure log4j. But when I use the "console" sink, the streaming output also shows up in the notebook, and Jupyter doesn't auto-scroll to track the tail end of the output, which makes it annoying to work with, especially for fast streams.
Related solutions I have found:
- There is an auto-scroll extension for JupyterLab, but it does not work with my version (v3.4.2). This would be ideal if it worked, since tracking the tail end of the console sink output is what I am really after.
- There is a compromise solution on Stack Overflow: use IPython's display functions to repeatedly query the in-memory result table and manually refresh the output cell. The problem is that there is no easy way to show the latest N rows of the result without adding an extra id column for ordering; by default, only the first N = 20 rows are shown.
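For reference, the memory-sink workaround looks roughly like this. This is only a sketch: the table name `console_out` and the polling helper `show_memory_sink` are placeholders I chose, and it assumes an active `SparkSession` bound to `spark` with a streaming query already writing to the `memory` sink.

```python
# Sketch of the memory-sink + manual-refresh workaround.
# Assumes an active SparkSession (`spark`) and a streaming query started
# with the "memory" sink under the query name "console_out", e.g.:
#
#   query = (df.writeStream
#              .format("memory")
#              .queryName("console_out")
#              .outputMode("append")
#              .start())

import time
from IPython.display import clear_output

def show_memory_sink(spark, table_name="console_out",
                     refresh_secs=2, iterations=10):
    """Poll the in-memory result table and redraw the output cell."""
    for _ in range(iterations):
        clear_output(wait=True)  # replace the cell's previous output
        # show() prints only the first 20 rows by default; there is no
        # built-in way to tail the table without an ordering column.
        spark.sql(f"SELECT * FROM {table_name}").show()
        time.sleep(refresh_secs)
```

Running `show_memory_sink(spark)` in its own cell gives a crude live view, but it shows the head of the table, not the tail, which is exactly the limitation described above.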
My question: Is it possible to configure the console sink so that its output goes to the controlling terminal from which I ran the `jupyter lab` command?
My software versions (jupyterlab and pyspark installed via pip):
- Python 3.10.4
- jupyterlab 3.4.2
- pyspark 3.2.1