If you are a Data Scientist, or if you manage a team of them, this blog will try to convince you that standardization is good for you, eventually.
Full disclosure: as I work for the team responsible for the recently released Red Hat OpenShift Data Science cloud service, I am obviously biased. Even so, I hope the information shared here is useful.
In this blog, I will:
- Review why data scientists often like running Jupyter Notebooks on their laptop (Part 1)
- Cover some of the implications of this choice (Part 2)
- Discuss some of the OpenShift-based alternatives (Part 3)
Doing Data Science work on your laptop might be the easiest and simplest solution in the short term. That does not mean it will be the best strategy for the long term.
Part 1: Why do Data Scientists tend to run Jupyter Notebooks on their laptops?
Let's first look at the various rationales you may (or may not) hear.
Because it's easy enough and gives complete freedom
There is no shortage of resources online that will walk you through setting up a Jupyter Notebook on your laptop. For example, this page lists a few options. And most of those methods are fairly agnostic as to what Operating System your laptop is running (Windows, Linux, Mac OS). So, yes, setting this up on your laptop is fairly easy to do and you probably don't need to call your IT support.
Even if you do reach out to your IT team, it could be that they are not familiar enough with Jupyter to be able to help you much. At best, they might say "Go for it, good luck!" and at worst, they might say "You're not allowed to do that."
Because sometimes, it's all that is available
If your employer or university is not providing you with a specific environment in which to perform Data Science, and all your colleagues or classmates are using their laptops, then it's pretty likely you will simply do the same.
If you are still in school and the data you use is both small and non-sensitive, using your own laptop probably makes the most sense.
But in most other situations, you might want to rethink this choice. Part 2 of this blog will give you many of the reasons for this statement.
Because that's the way they started
Many Data Scientists come either from academia, or come from another (non-IT) field, and "picked up" Data Science on the go.
In both cases, they would have rarely had to deal with enterprise-grade IT requirements and support. Since they had no IT support to rely on and build tools for them, they did their own IT support and stopped as soon as they got to an environment that was good enough.
However, this says nothing about how performant, secure, maintainable and standardized that environment might be. These criteria become a lot more critical as your career in Data Science progresses!
Because they can work offline
Having your entire working environment on your laptop can give you a feeling of control. You feel better knowing that you can keep on working even when the Wifi is rebooting for no reason, even when you are on a plane, or even at the cabin where the cell reception only works by the creek, on sunny days. Theoretically, you could keep working a few hours during a power outage! That is one way to demonstrate your dedication to your craft!
Part 2: Why should Data Scientists consider the alternatives
Because applications are moving to the cloud
I still remember back in 1999 when a colleague explained the concept of "hotmail" to me: "No man! That's the thing. There is no email client on your desktop. You don't need to download the emails to Outlook Express or Thunderbird. It's just a website, you log in, and your emails are there! So you can check your emails from any computer, anywhere in the world!"
At the time, that blew my mind.
Since then, most applications have migrated to the web, and are accessed from a browser. In fact, 5 years ago, I might have typed this draft in a local copy of MS Word 2016. But instead, today, I'm typing the draft directly in a Web Browser.
Now, just because everyone's doing it does not mean that you should too. But when there is a pattern, there are often good reasons behind it. This XKCD illustrates my point: If everyone is doing it, you should at least ask yourself why they are.
Because laptops sometimes break
Hardware and software failures happen. When those events happen, you are confronted with unexpected questions:
- How quickly will you be able to get back to work after you receive your replacement laptop?
- Did you have good, recent backups of everything you need?
If you don't have good backups, you might have lost some things forever.
Or, your local Jupyter Notebook server was lovingly crafted over the years, and you just can't remember all the things you had done to get there.
If any of this rings true, your dead laptop was your "perfect snowflake," and you won't be able to rebuild it the same way. And even if your laptop survives unscathed for a few years, your employer will eventually replace it with a newer, faster and shinier one. So it makes sense to keep as little content and as few tools as possible on it.
Because "SECURITY" (aka, laptops can get stolen)
This is the big one, in my opinion.
Many years ago, I bought my parents a USB drive. I explained to them that if their laptop were to stop working, they would lose all the photos stored on it, and therefore how important a good backup was. About a year later, someone broke into their house while they were away. You guessed it, they stole both the laptop and the USB drive.
Thieves: 2, Erwan: 0
I ignored the 3-2-1 rule of backups, and my parents' photos paid the ultimate price.
But then, my parents are not Data Scientists.
With risk assessment, you have to look at both probability and impact of something bad happening. I'd rather have a 50% chance of catching a cold than a 1% chance of dying.
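To make the "probability and impact" idea concrete, here is a minimal sketch in Python. The impact scores are hypothetical numbers on an arbitrary scale that I picked purely for illustration; only the two probabilities come from the sentence above.

```python
def expected_loss(probability: float, impact: float) -> float:
    """Score a risk as probability multiplied by impact."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return probability * impact


# Hypothetical impact scores on an arbitrary scale (assumptions, not real data).
COLD_IMPACT = 1.0
DEATH_IMPACT = 1000.0

cold_risk = expected_loss(0.50, COLD_IMPACT)
death_risk = expected_loss(0.01, DEATH_IMPACT)

# The unlikely event still dominates because its impact is so much larger.
print(f"cold: {cold_risk}, death: {death_risk}")
```

With these made-up scores, the unlikely event carries the larger risk, which is the reasoning I apply to a stolen laptop full of data.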
If you are running your Jupyter Notebooks locally on your laptop, there is a good chance that a lot of important, and potentially sensitive data is on it. Ideally, all of this data would have been anonymized before you received it, so that even if it were to fall into the wrong hands, it would be useless. Ideally also, your laptop's hard drive would be encrypted, which would make its content completely unreadable. Oh, and by the way, if you have been diligent about backups, maybe onto an external hard drive, I would also hope that this device too would ideally be encrypted.
I used the word "ideally" quite a lot in the above paragraph. So let's talk about the "realistically" for a minute.
Ask any IT professional you know if they have ever ended up seeing data they should not have been able to see. Ask them if they have ever received something that should not have been sent over email.
I know I have. These things do happen.
It's true that in recent years, many of the higher-profile data breaches have been linked to leaky S3 buckets (Feel free to google "leaky S3 buckets". It's a fun read.).
But don't underestimate the risks of simply getting your laptop stolen.
Of course, in most cases, the thieves will just want to sell your laptop. They may or may not wipe and reinstall.
But since you can't be certain, imagine the awkwardness of the following...
"Hey Boss... so, someone broke into my car and stole my laptop. Yes, I did have the customer database with all the information in clear text, and no, it was not encrypted. Sorry, eh!" That's probably a career-ending discussion right there. One you want to avoid at all costs.
Even if it's not your fault your laptop was not encrypted, you'll still be the one who lost the data.
So, if you are doing data science work locally on your laptop, make sure that you understand the risks and can live with the consequences of your laptop getting lost or stolen. Having any data on your laptop automatically increases the burden of security that applies to it.
Because laptops can hinder collaboration
Imagine that you've just finished creating a really cool Jupyter notebook, and you need me to review it for you.
How would you proceed?
I'm guessing most people would simply email me the file. Great! I hope it does not contain any of that sensitive data I've mentioned earlier, because email really isn't the most secure medium. Let's assume it does not.
What would you estimate is the likelihood that I'll be able to run your Jupyter notebook on my laptop? Am I running the same version of the software as you? Do I have the same packages? Will I be able to pull in the same data that you are? Does my laptop have as much memory as yours? If your laptop is your perfect snowflake, chances are that mine is a different snowflake.
If your IT team has provided you with a consistent build of laptops and of Jupyter on it, and has made sure that you cannot modify it too much, you have a fighting chance. And if you have a shared drive where you store all the shared Jupyter Notebooks, you might also not need to email me the files.
But for many people, this level of consistency might not be there, and the amount of friction described above will really hinder collaboration.
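If you are curious how different two "snowflakes" really are, a rough sketch like the one below can help. It only uses the Python standard library; the function names and the shape of the returned dictionaries are my own choices for illustration, not part of any Jupyter tooling.

```python
import platform
from importlib import metadata


def snapshot_environment() -> dict:
    """Record the Python version and installed package versions of this interpreter."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that have no name
            packages[name.lower()] = dist.version
    return {"python": platform.python_version(), "packages": packages}


def diff_environments(mine: dict, yours: dict) -> dict:
    """Report packages missing on one side or installed at different versions."""
    a, b = mine["packages"], yours["packages"]
    return {
        "python_differs": mine["python"] != yours["python"],
        "only_mine": sorted(set(a) - set(b)),
        "only_yours": sorted(set(b) - set(a)),
        "version_mismatch": sorted(n for n in set(a) & set(b) if a[n] != b[n]),
    }
```

Each of us could run snapshot_environment() on our own laptop, exchange the results, and pass both to diff_environments(). Every non-empty list in the output is a reason your notebook might not run the same way for me. Note that this says nothing about data access or available memory, which are separate sources of friction.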
Because laptops don't scale
As a data scientist, you have to tackle a variety of problems. The size of the data you have to deal with will definitely vary. You might ask your IT admin for a laptop with 32 GB of RAM, because you have to deal with a 10 GB problem. In some cases, a GPU will help, but in others, it just won't.
We have come to expect a flexibility in life. We have cars that can go from 0 to 150 km/h. And if we are going too far, we'll take a plane. But if we need to move furniture, we'll rent a truck.
How can one laptop be expected to deal with so many different requirements? And even if you are lucky enough to score a sweet laptop with a GPU and 32 GB of RAM... what will happen when you are asked to solve a 40 GB problem?
This is not a case of if it happens, but of when it will happen.
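The arithmetic behind the 32 GB request can be sketched as follows. The overhead factor of 3 is a rule of thumb I am assuming for illustration (working copies and intermediate results often take several times the size of the raw data); your workloads may need more or less.

```python
def required_ram_gb(data_gb: float, overhead_factor: float = 3.0) -> float:
    """Estimate working memory as raw data size times an assumed overhead factor."""
    return data_gb * overhead_factor


def fits_in_memory(data_gb: float, ram_gb: float, overhead_factor: float = 3.0) -> bool:
    """Return True if the estimated working memory fits in the available RAM."""
    return required_ram_gb(data_gb, overhead_factor) <= ram_gb


# Sizes taken from the text: a 10 GB and a 40 GB problem on a 32 GB laptop.
small_fits = fits_in_memory(10, 32)
large_fits = fits_in_memory(40, 32)
print(small_fits, large_fits)
```

Under that assumption, the 10 GB problem fits on the 32 GB laptop and the 40 GB problem does not, no matter how nice the laptop is.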
Instead, an ideal working environment for a data scientist is one that can grow and shrink on demand. Maybe one day you need 4 GPUs to yourself for 6 hours straight. And then you go on vacation for a couple weeks and you need zero GPUs during that time.
One of the main tenets of the cloud is to rent out what you need, when you need it.
Data Science work is clearly a great candidate for this. So it's no surprise that we see more and more Kubernetes-based environments underpinning Data Science platforms. (Hint: more on that in Part 3)
Because laptops are not integrated into the larger ecosystem
"Well, I'm done with my notebook, so that's the end of this project!" ... said no Data Scientist, ever!
Finalizing some work in a Jupyter Notebook is just one of the many steps of data science. After that, you might still need to:
- package your model
- deploy your model
- validate your model
- monitor your model
Things that worked perfectly on your laptop may or may not work the same way in the environment where they are supposed to run. And you can't phone your IT friend to ask that your laptop be moved to production (in spite of what the memes suggest). As practices improve and you move into the world of MLOps and related automation, it will be critical to have the ability to work within the bounds of the AI/ML platform that your company is using.
A cronjob on a laptop is just not going to cut it in the long run.
Because "SECURITY", the other kind. (aka Keeping an eye on vulnerabilities)
Every piece of software ever written (except maybe this one) will contain flaws. That comes with the territory.
If that flaw annoys the users, we call it a bug. If that flaw lets someone misuse the software, we call that a vulnerability.
So, if you deployed your own Jupyter Notebook Server and then added a number of packages to it, are you staying on top of what is in it, and whether there are vulnerabilities?
But if you're a Data Scientist, you're probably not getting paid to do this sort of work. And even if you are willing to spend the time and effort, it's very inefficient to have each member of a team duplicating this effort.
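Cloud-hosted notebooks are not immune either. To quote one article about a vulnerability found in a Jupyter Notebook feature hosted on Azure: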
A series of misconfigurations in the Jupyter Notebook feature opened up a new attack vector we were able to exploit. In short, the Jupyter Notebook container allowed for a privilege escalation into other customer Jupyter Notebooks.
My point here is that security is both important and complicated. Just because you use a Cloud-Based Jupyter Notebook does not inherently make it secure. However, chances are that the team in charge of cloud-based Jupyter Notebooks is a lot more knowledgeable than you or your IT team can be, when it comes to security. (And in the case of Azure, actions were taken very quickly to address the problem.)
Part 3: Some OpenShift-based alternatives for Jupyter Notebooks
If you are on this website, there is a good chance that you have heard of OpenShift. If not, you can just click this link discreetly. We can pretend like we never had this conversation and you knew all along that OpenShift is Red Hat's enterprise-ready Kubernetes container platform.
Open Data Hub (ODH)
Open Data Hub is a community-driven project that allows you to use your OpenShift Cluster as an AI/ML-as-a-Service platform. It includes many tools and technologies that can be assembled to build an environment in which Data Scientists can be productive. Jupyter Notebooks are one of the available tools.
Although Red Hat is very involved in this project, Red Hat does not provide support for it. It is fully up to you or your OpenShift admin to deploy, configure, adjust, monitor and maintain your Open Data Hub environment. In this scenario, while OpenShift itself is supported by Red Hat, the various tools you deploy on it through ODH do not benefit from the same support.
Red Hat OpenShift Data Science (RHODS)
Red Hat OpenShift Data Science is a Red Hat-provided cloud service that offers Data Scientists a stable environment in which to work. It is supported by Red Hat teams, and offers a curated list of Jupyter Notebook images for Data Scientists to use.
Curated list of Jupyter Notebooks
RHODS includes multiple Jupyter Notebook images that let you easily and painlessly use frameworks such as PyTorch, TensorFlow, etc. These Jupyter Notebook images are kept up to date so that data scientists can always have the latest and greatest software available.
Partner ecosystem for AI/ML
Beyond the Red Hat-supported tools that come with RHODS, software from many partners will also be available. At the time of this writing, this includes:
- IBM Watson
And the list will keep growing in the coming months.
Wider Red Hat Marketplace
Beyond the RHODS partner ecosystem, there is also a lot of software available on the Red Hat Marketplace.
Another advantage of using Jupyter Notebooks on OpenShift is that you can easily resize your working environment. In the cloud, the sky's the limit (pun intended). If you have very large nodes in your cluster, you can launch correspondingly large Jupyter Notebook environments.
Horizontal Scalability and Auto-Scaling
Whether your Red Hat OpenShift Data Science is deployed on OpenShift Dedicated on AWS, or on Red Hat OpenShift Service on AWS, you can configure MachinePool AutoScaling in order to easily and automatically meet the demand from your end users.
This will be a topic for a future blog post, but once enabled, the number of nodes on which your notebooks are launched will grow and shrink automatically, based on how many users are active and how big a notebook each of them requested.
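To get a feel for why the node count depends on both the number of users and the size of their notebooks, here is a back-of-the-envelope sketch. To be clear, this is not how the OpenShift autoscaler is implemented: it is a simple first-fit-decreasing packing on memory alone, and the function name and parameters are hypothetical.

```python
from typing import Sequence


def nodes_needed(notebook_requests_gb: Sequence[float], node_capacity_gb: float) -> int:
    """Estimate node count by packing memory requests with first-fit decreasing."""
    free = []  # remaining memory on each node opened so far
    for request in sorted(notebook_requests_gb, reverse=True):
        if request > node_capacity_gb:
            raise ValueError("a single notebook does not fit on any node")
        for i, remaining in enumerate(free):
            if request <= remaining:
                free[i] -= request  # place the notebook on an existing node
                break
        else:
            free.append(node_capacity_gb - request)  # open a new node
    return len(free)
```

For example, four users each asking for an 8 GB notebook fit on one 32 GB node, while three users asking for 16 GB each need two. When everyone logs off, the estimate drops back to zero nodes, which is the "grow and shrink" behavior described above.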
Limitations of OpenShift-based Jupyter Notebooks
Unable to work without Internet access
For all the promises of Cloud Computing, you still have to abide by the laws of Physics: There has to be some hardware somewhere, and you'll need to be connected to it somehow. If the network is down, and what you need is not on your local computer, you can't access it. Period.
I would point out however that this is not specific to your OpenShift-hosted data science platform! No network means no e-mail, no Zoom calls, no StackOverflow, no access to most documentation, etc...
And in the era of smartphone hotspots, 5G, and plane-based Wi-Fi... the amount of time you are truly without Internet access is quickly dwindling. For better or for worse.
More limited control over the Jupyter Notebooks
If you are using a set of Jupyter Notebook images that are built and supported by Red Hat, you will not be able to modify those images. However, you will still have the freedom to pip install anything you want on top of the provided images, if you need to do so.
With that said, RHODS now also supports the ability to add your own Jupyter Notebook images, in order to go beyond the default ones provided by Red Hat and ISVs. Check out the documentation on how to configure a custom image for more details.
In this post, I've covered some of the downsides and risks associated with running your data science workloads on your own laptop. I did so because, as the saying goes, "an ounce of prevention is worth a pound of cure". I shared all this information here because, as Data Scientists very well know, having more information can lead to making better decisions.
I imagine that many will have overriding reasons to keep using their laptop the same way, and that is totally fine by me.
If you want to take our Red Hat OpenShift Data Science Jupyter Notebooks for a spin, you should head over to https://red.ht/rhods-sandbox for a simple, single-user, 30-day trial. Completely free of charge!