
Not everyone loves Jupyter!


...

Why is this?

  • If you only learn to program in Jupyter notebooks, it’s possible that you’ll develop bad coding practices.

  • Jupyter notebooks can encourage bad software development practices.

    • They can discourage writing software for modularity and reproducibility.

    • Linters (tools that check code for correctness) are difficult to use in Jupyter’s disjointed cell interface.

  • Jupyter is somewhat counterintuitive to object-oriented programming and can discourage OOP practices.

    • As part of that, it can discourage modularity.

    • Classes cannot be defined across multiple cells.

  • Most Jupyter notebooks and their output are not easily reproducible.

    • Even if you don’t have cells that use randomization, the original creator may have run some cell a few times and then kept that state without running it again in subsequent runs, so the saved output depends on hidden state rather than on the notebook as written.


    • You can’t necessarily tell what order cells were run in, and run order determines how variables change. This can result in “hidden state” in a cell and its variables.

      • Skipping cells, or running them out of order, results in different cell states in individual instances of the same notebook.

      • Changes to run order are not retained once the kernel is destroyed.

      • Therefore the next person who runs the notebook can’t reproduce the results when running it from scratch.

    • You may always skip a certain cell when you run a notebook, while the next person to run it doesn’t skip that cell. You’ll end up with different output.

    • Cells executed in different orders give you different output. You can override the linear run of cells in Jupyter (see the sketch just below this list).
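    As a minimal sketch of that hidden state (the cells below are hypothetical, not taken from the workshop notebook), imagine a notebook containing these three Python cells:

    # Cell 1
    total = 0

    # Cell 2 - meant to be run once, but nothing stops you re-running it
    total = total + 10

    # Cell 3
    print(total)

    Run the cells top to bottom once and Cell 3 prints 10; re-run Cell 2 a couple of extra times before Cell 3 and it prints 20 or 30, yet the saved notebook looks the same in every case. The .ipynb file stores each cell’s last execution count, not the full history of how many times or in what order cells were run.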

    There are great things about Jupyter

    ...

    Encourages well documented/commented code

    ...

    Great visualization

    ...

    • Overriding the linear run of cells in Jupyter is one of its best characteristics. It is also one of its biggest weaknesses.

    • Without knowing the original notebook creator’s environment and installed software stack, it’s difficult to reproduce their results.
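    One partial mitigation, sketched here rather than taken from the workshop materials, is to record the interpreter and package versions inside the notebook itself, for example in its first cell (assumes Python 3.8+ for importlib.metadata):

    # Capture the Python version and installed packages so a reader can
    # rebuild a similar environment later.
    import sys
    from importlib import metadata

    print(sys.version)
    for dist in sorted(metadata.distributions(), key=lambda d: (d.metadata["Name"] or "").lower()):
        print(f"{dist.metadata['Name']}=={dist.version}")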

    Jupyter lacks version control

    • There’s no easy way to determine whether a cell has been edited and when, and the underlying .ipynb JSON (code, metadata, and embedded output all mixed together) produces diffs that are hard to review in tools like git.
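    A common workaround, shown here as a sketch with the jupytext package and made-up filenames, is to keep a plain Python script paired with the notebook, since scripts diff and review cleanly under version control:

    # Convert a notebook to a Python script in the "percent" cell format.
    # Requires the jupytext package; the filenames are just examples.
    import jupytext

    nb = jupytext.read("analysis.ipynb")
    jupytext.write(nb, "analysis.py", fmt="py:percent")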

    Jupyter can be slowwwwwwww….

    • Jupyter is an interactive tool. It must therefore load the entire notebook into memory in order to provide its interactive features.

    • If you're working with extremely large data sets or large notebooks, this can become a problem.

    • Jupyter is not designed to be used with extremely large data sets.

    ...

    We can’t tell you what to use, but can offer some best practices

    While we may recommend best practices and provide reasoning, the tools you use for your research are entirely up to you.

    Over the long term, this boils down to discipline and using the right tools for the right jobs. These rules will generally apply, but one cannot predict every use case:

    • Jupyter is probably not the right tool for sharing application code.

    • Jupyter was built for collaboration, communication, and interactivity. It is not meant for running critical code.

    • It’s convenient for showing ideas to others, experimenting with your code, or drafting code.

      • Once you’re done experimenting, rewrite the code properly as a standalone application if you plan to use it in production.

    • If the code needs to run for a long time, it’s likely not a good idea to write it or run it in a Jupyter notebook.

    • Jupyter is not intended for asynchronous tasks.

      • It's designed to keep all cells in a notebook running in the same kernel.

      • This means that if one cell is running a long, asynchronous task, it will block the execution of other cells.

      • This can be a major problem when you're working with data that takes a long time to process, or when you're working with real-time data that needs to be updated regularly.

      • In these cases, it can be much better to use a tool designed for parallel computing.
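    As one hedged illustration of “a tool designed for parallel computing” (the standard library’s process pool here, with a made-up slow_task function standing in for real work), a plain Python script can hand long-running work to worker processes instead of tying up a single notebook kernel:

    # run_tasks.py - a minimal sketch using the standard library's process pool.
    # slow_task and its inputs are placeholders for a real computation.
    from concurrent.futures import ProcessPoolExecutor
    import time

    def slow_task(n):
        """Stand-in for a long-running computation."""
        time.sleep(1)
        return n * n

    if __name__ == "__main__":
        with ProcessPoolExecutor(max_workers=4) as pool:
            # Tasks run in parallel worker processes; nothing here blocks
            # an interactive kernel the way a long-running cell would.
            results = list(pool.map(slow_task, range(8)))
        print(results)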

    ...

    Jupyter wasn’t originally intended for use on an HPC

    In many ways, HPC computations and Jupyter notebooks don’t play to each other’s strengths; their use cases and original intentions are very different. Jupyter notebooks can be powerful development and collaboration tools, but they often aren’t suitable for long-running, computationally intensive workflows. Classic HPC work runs in batches, with long-running jobs submitted through terminal access.

    You can, however, use them together, and tools are available if you want to do this:

    • Open OnDemand - makes it easier to launch Jupyter on the HPC with the resources you request.

    • IPython Parallel (ipyparallel, designed to integrate with MPI libraries)

    • Dask

    • Spark

    In some cases these tools end up being more of a “workaround” and don’t really allow your computation to run as a single job inside the notebook. Instead, you usually have classic HPC jobs spawned from a Jupyter session; those jobs run simultaneously with Jupyter, and information is communicated between them.
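    For a concrete flavour of that pattern, here is a sketch using Dask’s dask-jobqueue package on a Slurm cluster. The queue name, resource sizes, and scaling numbers are invented and would need to match your site’s configuration:

    # Driving HPC batch jobs from a notebook with Dask.
    # Requires the dask and dask-jobqueue packages; the queue name and
    # resource values below are placeholders, not site defaults.
    from dask_jobqueue import SLURMCluster
    from dask.distributed import Client

    cluster = SLURMCluster(
        queue="general",       # hypothetical Slurm partition
        cores=8,
        memory="16GB",
        walltime="01:00:00",
    )
    cluster.scale(jobs=2)      # submits two batch jobs that start Dask workers

    client = Client(cluster)   # the notebook session talks to those workers

    # Work submitted here runs inside the batch jobs, not in the notebook kernel.
    futures = client.map(lambda x: x ** 2, range(100))
    print(sum(client.gather(futures)))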

    ...

    Previous: Dive into Jupyter Notebooks | Workshop Home | Starting Jupyter with OnDemand

    Use the following link to provide feedback on this training: https://forms.gle/qBBwXpKeTNqSR5516

    ...