Data science is exploding thanks to colossal increases in the amount of data available to businesses, and more and more organisations are using this data to power, well, everything. The de facto notebook framework in the data science community is the Jupyter Notebook, which has been widely adopted for data exploration, visualisation and narration. But what is the Jupyter notebook?
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. Uses include data cleaning and transformation, numerical simulation, statistical modelling, data visualisation, machine learning, and much more.
Jupyter notebooks (along with R Markdown) are by far the most popular type of data science notebook. It is an open-source format created by the community, and there are over 7 million Jupyter notebooks on GitHub alone right now, with usage continuing to grow rapidly.
And these notebooks are already widely used in many large organisations.
But why has this tool become more popular than all the others?
1. Exploratory Analysis and Data Visualisation
Jupyter notebooks have risen rapidly in popularity to become the go-to standard for quick prototyping and exploratory data analysis. Within the Jupyter interface, users can observe the results of their code by executing cells one at a time, independently of the rest of the notebook.
This means one can interactively run code, explore output and visualise data seamlessly. Jupyter also supports the rendering of interactive plotting libraries like Plotly, Bokeh and Altair, alongside the sharing of code and data sets, enabling others to better interact with the generated insights.
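To make this concrete, here is a minimal sketch of the kind of cell-by-cell exploration described above (the data and column names are invented for illustration, and pandas is assumed to be installed):

```python
# A minimal exploratory-analysis sketch of the kind you might run
# cell-by-cell in a notebook. The data here is made up for illustration.
import pandas as pd

# In a real notebook this would usually be pd.read_csv("sales.csv")
df = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "revenue": [120, 80, 140, 100],
})

# Each of these steps would typically live in its own cell, so you
# can inspect the output before deciding what to compute next.
summary = df.groupby("region")["revenue"].sum()
print(summary)

# In Jupyter, summary.plot(kind="bar") would render the chart inline.
```

Because each cell runs independently, you can tweak the aggregation or the plot and re-run just that cell, without re-loading the data.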
2. Collaborative Code Sharing
Version control systems like GitHub and Bitbucket allow you to share and collaborate on code. One issue with this, however - particularly in data science, where the intended use of any given script or code snippet is to generate some form of shareable output - is that the results are non-interactive.
Jupyter notebooks allow collaborators not only to view code, but also to execute it (in whatever order the end user wishes) and display its results in-line, directly in the web browser. The data being analysed can be shared as attached files, or imported programmatically by downloading or fetching it from public or private sources via internet repositories or database connections.
3. Computational Narration
The final goal of any data exploration & analysis is not to generate interesting but arbitrary information from the company's data - it is to uncover actionable insights, communicated effectively in technical reports that are ready to be used around the business to make more data-informed decisions.
Jupyter notebooks are, by default, well suited to generating data-based reports. It is all about storytelling - in a notebook you can narrate the visualisations you create, so that the data, and the computations that process and visualise it, are all embedded in a narrative - a computational narrative - that tells a story for its intended audience and context. This makes presenting and communicating the generated insights, and the business value behind them, effortless.
4. Task Automation
So far we've seen the importance of Jupyter notebooks in the process of generating and presenting insights from data. The community is also continuously working on new add-ons to the Jupyter ecosystem that make the lives of data scientists and engineers that much easier.
Much of the work carried out by data scientists is repeated on a periodic basis. Think of financial reports that need to be generated and shared at the end of each month. A lot of the time, this requires someone to manually pull in the data and re-run the notebook every month. What if there were a way to parameterise and schedule these kinds of notebooks, so that such tasks were completely automated?
In this regard, papermill has been a big game changer. Papermill is a library for executing notebooks at scale, with configurable parameters and production ecosystems in mind. Parameterised notebooks let you tag a cell as containing parameters with default values, and inject new values into the notebook at execution time.
The opportunities this opens up are endless. The data insights, visualisations and reports referred to above can now be automated and scheduled. Coming back to our example, this same data scientist can use papermill, parameterising the values that change month-to-month, so that the report can be scheduled to execute automatically on the 1st of every month.
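As a rough sketch of what that looks like (the notebook name, parameter names and wrapper script here are hypothetical), papermill's command-line interface injects values into the notebook's tagged parameters cell, and a cron entry can then schedule the run:

```shell
# Execute the notebook, injecting this month's parameter values,
# and write the executed result to a dated output file.
papermill monthly_report.ipynb output/report_2024_01.ipynb \
  -p month "2024-01" \
  -p region "all"

# Example crontab entry: at 02:00 on the 1st of every month, run a
# wrapper script that computes the current month and calls papermill:
# 0 2 1 * * /usr/local/bin/run_monthly_report.sh
```

The same execution can be driven from Python via `papermill.execute_notebook(...)`, which is how papermill is typically embedded in larger pipelines.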
With this and other tools, data teams can begin to set up full pipelines for generating reports in production, based on Jupyter notebooks.
With the introduction of tools like papermill, as well as others - for example, those aimed at improving version control for notebooks (a persistent and difficult problem to solve) - Jupyter notebooks are no longer confined to local development, but are also used in production. In addition, Jupyter notebooks are increasingly used as reports and dashboards to be presented to business teams and executives. This means that, for many organisations, Jupyter notebooks are used throughout entire data pipelines - from data collection and transformation, to analysis and modelling, right up to the direct publication of results.
And this is where Kyso comes in.
Internal data teams use tools like GitHub for project management, access to which is typically restricted to the data scientists & engineers. This means their work is not shared with the non-technical people in the company. They use technical documents like these Jupyter notebooks for data exploration, analysis and scheduling, but the presence of code, terminal output, etc. means that, in their default rendered format, they are not the best communication tool for non-technical audiences. All of this locks away business value, because not everyone in the company can learn from the generated insights.
Kyso is compatible with the technical tools used by data teams, meaning it integrates seamlessly with their existing workflows: teams can continue to generate reports as usual, and Kyso automatically renders these technical reports in a way a non-technical audience can read and interact with. Kyso consequently becomes the company's central knowledge-management system for data-based reporting and collaboration.
We created Kyso to optimise the computation-to-communication workflow, and to bring non-technical stakeholders into data-based conversations that previously happened only within siloed groups across organisations.
We look at huge companies like Netflix and Stripe already betting big on Jupyter notebooks, making them integral parts of their data science infrastructure, and we predict a future in which notebooks become the single interface companies use for all data-based actions.
For a business to scale these actions across a wide range of daily operations - from data collection, exploration and preparation, through productionisation tasks such as model training and workflow scheduling, all the way to reporting - the number of tools required to maintain such a system should ideally be minimised.
Stripe and Netflix (in particular) will be the trailblazers in this regard. It is because of their efforts that Jupyter notebooks are becoming the new Excel. In order for businesses to reap the full benefit of this phenomenon, the knowledge within these documents needs to be communicated effectively across entire organisations, so everyone — I mean everyone — can learn & apply data insights to their respective roles to drive business value. At Kyso we want to help you do just that.