Problems with Jupyter

Goals:

  • Provide the other side of Jupyter so users know what to look out for.



Not everyone loves Jupyter!

shouting-development.gif


Why is this?

Jupyter notebooks can encourage bad software development practices.

  • Can discourage writing software for modularity and reproducibility.

  • Linters (tools that check code for correctness) are difficult to use in Jupyter's disjointed cell interface.

Jupyter is counterintuitive to object-oriented programming (OOP) and can discourage OOP practices.

  • As part of that, it can discourage modularity.

  • Classes cannot be defined across multiple cells (see the sketch below).
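
A minimal sketch of that constraint (the class and method names here are purely illustrative): the whole class body has to fit in one cell, and "extending" it later means monkey-patching, which works against modular, readable code.

    # Cell 1: the entire class definition must live in this single cell.
    class Analysis:
        def __init__(self, data):
            self.data = data

    # Cell 2: this does NOT continue the class above; the usual workaround is to
    # attach the method after the fact (monkey-patching), which linters and
    # readers tend to dislike.
    def summarize(self):
        return sum(self.data) / len(self.data)

    Analysis.summarize = summarize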

Most Jupyter notebooks and their output are not easily reproducible.

  • You can't necessarily tell what order the cells of a notebook were run in, or how that order changed specific variables. This can result in “hidden state” in a cell and its variables (see the illustration after this list).

    • Skipping cells, or running them out of order, results in different cell states across individual instances of the same notebook.

    • Changes to run order are not retained once the kernel is destroyed.

    • Therefore, the next person who runs the notebook from scratch can't reproduce the results.

  • Overriding the linear run of cells is one of Jupyter's best characteristics. It is also one of its biggest weaknesses.

  • Without knowing the original notebook creator's environment and installed software stack, it's difficult to reproduce their results.
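
As a minimal illustration of hidden state, consider three cells whose result depends purely on how many times, and in what order, they were run:

    # Cell 1
    x = 1

    # Cell 2
    x = x + 1

    # Cell 3
    print(x)

Run the cells top to bottom once and Cell 3 prints 2; re-run Cell 2 a second time first and it prints 3. Nothing in the saved notebook records which of those happened.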

Jupyter notebooks don't play well with version control

  • There's no easy way to determine whether a cell has been edited and when. Notebooks are stored as JSON that mixes code, outputs, and execution counts, so plain-text diffs of a notebook are noisy and hard to review (see the sketch below).
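
A small sketch of why this is awkward (the notebook file name is hypothetical): a .ipynb file is a single JSON document in which code, outputs, and execution counts sit side by side.

    import json

    # A notebook is one JSON file: code, outputs, and execution counts all live
    # together, so a one-line code change can produce a large, noisy diff.
    with open("analysis.ipynb") as f:   # hypothetical notebook file
        nb = json.load(f)

    for cell in nb["cells"]:
        # Code cells carry an execution_count that changes every time you re-run them.
        print(cell["cell_type"], cell.get("execution_count"))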

Jupyter can be slowwwwwwww….

  • Jupyter is an interactive tool. It must therefore load the entire notebook into memory in order to provide its interactive features.

  • If you're working with extremely large data sets or large notebooks, this can become a problem.

  • Jupyter is not designed to be used with extremely large data sets.


We can’t tell you what to use, but can offer some best practices

While we may recommend best practices, and provide reasoning, the tools you use for your research are entirely up to you.

Over the long term, this boils down to discipline and using the right tools for the right jobs.
Generally, these rules will often apply, but one cannot predict every use case:

  • Jupyter is probably not the right tool to share application code.

  • Jupyter was built for collaboration, communication, and interactivity. It is not meant for running critical code.

  • It's convenient for showing ideas to others, experimenting with your code, or drafting code.

    • Once you're done experimenting, write the code properly as a standalone application if you plan to use it in production.

  • If the code needs to run for a long time, it's likely not a good idea to write it or run it from a Jupyter notebook.

  • Jupyter is not intended for asynchronous tasks.

    • It's designed to keep all cells in a notebook running in the same kernel.

    • This means that if one cell is running a long, asynchronous task, it will block the execution of other cells (a minimal sketch follows this list).

    • This can be a major problem when you're working with data that takes a long time to process, or when you're working with real-time data that needs to be updated regularly.

    • In these cases, it can be much better to use a tool designed for parallel computing.
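
A minimal sketch of the blocking behaviour; the sleep is just a stand-in for a long computation.

    import time

    # Cell 1: while this runs, no other cell in the notebook can execute,
    # because every cell is queued on the same single kernel.
    time.sleep(600)          # stand-in for a long-running computation

    # Cell 2: submitted while Cell 1 runs, but it will not start
    # until the sleep above finishes.
    print("still waiting for the kernel")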


Jupyter wasn’t originally intended for use on an HPC

In many ways, HPC computations and Jupyter notebooks don't suit each other's strengths. Their use cases and original intentions are very different. Jupyter notebooks can be powerful development and collaboration tools, but they often aren't suitable for long-running, computationally intensive workflows. Classic HPC work runs in batches, with long-running jobs submitted through terminal access.

You can however use them together, and tools are available if you want to do this:

  • Open OnDemand - makes it easier to launch Jupyter on an HPC system with requested resources.

  • IPython Parallel (designed to integrate with MPI libraries)

  • Dask

  • Spark

In some cases these tools end up being more of a “workaround” and don't really allow your computation to run as one job inside the notebook. Instead, you usually have classic HPC jobs spawned from a Jupyter session; those jobs run simultaneously with Jupyter and communicate information back and forth, as in the sketch below.
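
As a rough sketch of that pattern, this is what spawning batch jobs from a notebook might look like with Dask's dask_jobqueue on a SLURM cluster; the queue name, core count, memory, and walltime are placeholder values, not recommendations.

    from dask_jobqueue import SLURMCluster
    from dask.distributed import Client

    # Describe what each batch job should request from the scheduler
    # (all values below are placeholders for an assumed SLURM setup).
    cluster = SLURMCluster(queue="general", cores=4, memory="16GB",
                           walltime="01:00:00")
    cluster.scale(jobs=2)        # submits two batch jobs that run alongside the notebook
    client = Client(cluster)     # the notebook communicates with those workers

    # The actual computation runs in the batch jobs, not in the notebook kernel.
    futures = client.map(lambda x: x ** 2, range(100))
    results = client.gather(futures)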

 


Next Steps