Want to learn data science in 2021? Here’s the internet’s best curriculum

Curated by David Venturi for the #NotARealDegree community

Note: Not a Real Degree is learner-supported. Some of the resources I recommend may be affiliate links, meaning I receive a commission (at no extra cost to you) if you use that link to make a purchase.

In a previous post, I announced Not a Real Data Science Degree, a community of learners following a curated curriculum made up of the internet’s best resources. I believe it’s the best bang-for-your-buck method for learning data skills in the digital era of education.

In this post, I list the courses, books, and other resources included in the 2021 version of the curriculum, plus the rationale behind those picks. Here’s what you’ll read:

If you want to get started now, follow the steps in this post.

The process you’ll use to build your new data skills.

Before we begin, here’s a quick TL;DR of the announcement post, which provides a general overview of the curriculum and the community.

Let’s get to the 2021 edition of the curriculum.

Curriculum overview

Term 1: Data Analysis

Term 2: Machine Learning & More

Learn: courses, books, and tutorials

Below, I’ll link to each course, book, and tutorial I selected in the order they appear in the curriculum. I’ll also briefly explain why I put it there.

I don’t include individual explanations for DataCamp courses because one explanation can be applied to the 39 courses I recommend. In my opinion, DataCamp’s combination of product and content creates the most beginner-friendly experience for learning data skills online.

I filtered through their 300+ course catalog and identified the courses that I think are the best options for their specific subject. These courses represent 30% of the curriculum. The other courses, books, and tutorials I recommend either 1) are unique in some way that outweighs this product/content combo or 2) teach a subject/tool for which DataCamp does not have a course.

Term 1: Data Analysis

Introduction to Data Science

First, you’ll acquire a framework for understanding the data science industry in a theory-only course. Then, you’ll do a little data science using Python. You’ll make your first coding errors under the instructor’s guidance, so no need to be intimidated!

A video from Data Science for Everyone on DataCamp.

Introduction to Python Programming

Next, you’ll learn Python programming and the fundamentals of computer science, which are foundational to the data skills you’ll learn next.

My research suggests that Dr. David Joyner’s Introduction to Python Programming series on edX is the clear winner for this subject area. This series is identical to Georgia Tech’s first class in undergraduate computer science:

Over 400 students on campus have completed this version of the course, and our analysis shows that they exit the course with the same learning outcomes as students taking the traditional on-campus version. This Professional Certificate uses the same instructional material and assessments as learning Python on campus, giving you a Georgia Tech-caliber introduction into the field of computing at your own pace.

You’ll take the first three courses in Term 1. These courses and a fourth in Term 2 make up 25% of the curriculum.

Dr. Joyner teaching in Georgia Tech’s Introduction to Python Programming series on edX.

Setting Up Your Computer

Next, you’ll set up your computer and learn how to work in your own computing environment (as opposed to the environment set up for you by DataCamp or edX, for example).

First, you’ll learn how to interact with the command line:

Then you’ll learn how to set up and manage data science software using conda:

Then you’ll learn how to use JupyterLab, a popular web-based user interface for data science:

And finally, you’ll learn how to keep track of your work and collaborate on projects in team environments with Git:

The JupyterLab interface, where you can interact with the command line, conda, and Git, as well as do fancy data science as displayed in the notebooks.

For the command line and Git courses, I like Dataquest’s offerings because of their depth and because they teach those skills in the context of the data science workflow. Note: use this referral link for $15 off.

I like DataCamp’s conda course because you learn conda without having to install it first, which is a stumbling block for many beginners (myself included a few years ago). You’ll install conda on your computer next by following my recommended installation process for this curriculum.

The JupyterLab blog post I compiled contains various online resources (documentation, videos, and tutorials) that together resemble a course.

Python Data Science Toolbox

Next, you’ll add some more programming tools to your Python toolbox. These tools will come in handy later in the curriculum.

Chapter 1 of Part 1 is skippable because you already learned how to write functions in Introduction to Python Programming. Chapter 3 of Part 2 is skippable because it is a review of the topics you just learned (i.e., a case study) and this curriculum uses adaptive tests and self-directed projects instead.

Importing Data

As you’ll have learned in Data Science for Everyone, importing data is part of the first step of the data science workflow. You’ll learn how to import data using pandas, the most popular analytics library in Python.

I suggest you skip Chapter 3 because you will learn the “Importing Data from Databases” skill in a later course after you have learned some SQL.

Preparing Data

You’ll then learn how to prepare your data for analysis. The following courses teach the skills you’ll use most often.

You will learn more advanced data preparation skills later in the curriculum.

Exploratory Data Analysis

Exploratory data analysis (EDA) is the process of exploring data to summarize their main characteristics. You’ll learn EDA next, and you’ll do it often in your data career.
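To make "summarizing main characteristics" concrete, here is a minimal sketch of a first EDA pass. It assumes a made-up list of daily step counts and uses only Python's standard library. The courses in this section use pandas for this kind of summary, so treat the `summarize` helper as a hypothetical stand-in, not the curriculum's method.

```python
import statistics

# Hypothetical sample data: one week of daily step counts.
steps = [4200, 8100, 7600, 3900, 10400, 12100, 6800]


def summarize(values):
    """Return basic summary statistics for a list of numbers."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


summary = summarize(steps)
for name, value in summary.items():
    print(f"{name:>6}: {value:,.1f}")
```

A mean that sits far from the median, or a large spread, is usually the cue to plot the data and look for outliers before doing anything else.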

Statistics

Statistics is the study of how to collect, analyze, and draw conclusions from data. You’ll learn how to do that in Python, while also learning some of the probability theory that underlies statistical inference.

An exercise from Introduction to Statistics in Python on DataCamp.

Data Visualization

You’ll then learn how to visualize your data. First, you’ll learn the theory behind data visualization, then how to use the most popular data viz libraries in Python.

More Statistics

You’ll then dive a little deeper into statistics, which will prepare you for machine learning in Term 2. You’ll also learn the basics of R, a programming language that is optimized for statistics.

ModernDive is an online book that is frequently recommended for learning statistical inference. I read it in 2020 and was amazed by how intuitively the authors teach the subject, which is why I placed it first in this section. I think the content more than makes up for the book’s lack of software features (e.g., interactive grading) and choice of language (it uses R, which is not the language of focus for this curriculum).

The ModernDive website.

The Bayesian course is in R because 1) Rasmus Bååth is a great teacher and 2) (as far as I am aware) there isn’t anything comparable in quality and length in Python right now.

Databases & SQL

Nearly every data role requires the basics of databases and SQL, and you’ll acquire them next.

Chapter 5 of the last course is skippable because it is a case study.
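As a rough illustration of the kind of query these courses build toward, here is a minimal sketch of a join with aggregation. It assumes two made-up tables (`customers` and `orders`) and uses SQLite through Python's built-in `sqlite3` module, purely so it runs without a database server. The courses themselves use PostgreSQL, and some syntax differs between the two.

```python
import sqlite3

# In-memory database with two hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
    INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 35.5), (3, 2, 12.0);
""")

# LEFT JOIN keeps customers with no orders; COALESCE turns their NULL total into 0.
query = """
    SELECT c.name, COUNT(o.id) AS n_orders, COALESCE(SUM(o.amount), 0) AS total
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.name
    ORDER BY c.name;
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
conn.close()
```

Swapping `LEFT JOIN` for `INNER JOIN` would drop the customer with no orders, which is the sort of difference a joins course drills.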

An exercise from Joining Data in PostgreSQL on DataCamp.

Data Engineering

You’ll then play data engineer so you can understand how data analysts (and later, analytics engineers) interact with data engineers. You’ll also dive deeper into modeling data.

Data Warehouses & Cloud Computing

You learned the basics of storing and querying data. Now you’ll scale these skills out to data warehouses in the cloud. First, you’ll learn about data warehouses and how they’re different from regular databases. Next, you’ll learn the basics of cloud computing. You’ll then learn how to use Snowflake, a cloud-based data warehouse that is rapidly conquering its industry.

Snowflake is a cloud-based data warehousing company.

Analytics Engineering

An analytics engineer exists somewhere between the data engineers and the data analysts. You’ll learn how the role came to exist and how to use the hottest tool for analytics engineering in 2021 — dbt.

dbt is pioneering modern analytics engineering.

Term 2: Machine Learning & More

Objects & Algorithms

To kick off Term 2, you’ll wrap up Dr. David Joyner’s Introduction to Python Programming series. Object-oriented programming and algorithms are concepts that beginners can struggle with, so I delayed this course until now. These topics are important foundations for machine learning.

Introduction to Machine Learning

Next, you’ll learn the basics of machine learning. First, you’ll acquire a framework for understanding it from Karolis Urbonas (Head of Machine Learning and Science at Amazon) in a theory-only course. Then, you’ll start doing machine learning in Python with Aurélien Géron’s Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow book.

I prefer having one instructor teach me machine learning with a unified narrative, so I prefer Aurélien’s book (and accompanying notebooks) to DataCamp’s bite-sized courses from various instructors. I also like Aurélien’s choice of machine learning libraries (more on this shortly), the book’s rave reviews, and its recent release (late 2019). These factors make up for the lack of video and interactive grading in the book and notebooks.

More Python Programming

These advanced Python skills will round out your programming for data science toolbox. With them, you’ll feel fully in command of the code you write.

Writing Functions in Python covers advanced concepts like context managers and decorators. You’ll build on the functions skills you learned early on in Term 1, allowing you to write “complex and beautiful functions so that you can contribute research and engineering skills to your team.”
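If context managers and decorators are new to you, here is a minimal sketch of each, using only the standard library. The names `timer`, `count_calls`, and `square` are hypothetical and chosen for illustration. They are not taken from the course.

```python
import time
from contextlib import contextmanager
from functools import wraps


@contextmanager
def timer(label, log):
    """Context manager: append (label, elapsed seconds) to log when the block exits."""
    start = time.perf_counter()
    try:
        yield
    finally:
        log.append((label, time.perf_counter() - start))


def count_calls(func):
    """Decorator: count how many times func is called."""
    @wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper


@count_calls
def square(x):
    """Return x squared."""
    return x * x


timings = []
with timer("squares", timings):
    results = [square(n) for n in range(5)]

print(results, square.calls, timings[0][0])
```

The context manager wraps a block of code with setup and cleanup, and the decorator wraps a function with extra behavior. Both let you add that behavior without touching the code being wrapped.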

Supervised Learning

You’ll then continue with the supervised learning chapters of Aurélien Géron’s Hands-On Machine Learning book. These chapters teach the Scikit-Learn library, which is described by the author in the following fashion:

Scikit-Learn is very easy to use, yet it implements many Machine Learning algorithms efficiently, so it makes for a great entry point to learn Machine Learning.

More Data Visualization

To break up the machine learning content, you’ll up your data visualization game. First, more Seaborn. Second, tips to make your visualizations more compelling. Third, interactive data visualization with plotly.

An exercise from Intermediate Data Visualization with Seaborn on DataCamp.

Unsupervised Learning

You’ll then hop back into Aurélien Géron’s Hands-On Machine Learning, where you’ll learn unsupervised learning techniques in Scikit-Learn.

Note that dimensionality reduction can be supervised (e.g., LDA), but the dimensionality reduction techniques you’ll learn in this chapter are unsupervised.

Introduction to Neural Networks

You’ll then learn how to train neural networks (i.e., do deep learning) with my final recommended chapters of Aurélien Géron’s Hands-On Machine Learning. You’ll pick up Keras, which is a high-level deep learning library that makes it simple to train and run neural networks.

Why stop at Keras in this curriculum? In Chapter 12 (i.e., the chapter after these), Aurélien writes:

Up until now, we’ve used only TensorFlow’s high-level API, tf.keras, but it already got us pretty far: we built various neural network architectures, including regression and classification nets, Wide & Deep nets, and self-normalizing nets, using all sorts of techniques, such as Batch Normalization, dropout, and learning rate schedules. In fact, 95% of the use cases you will encounter will not require anything other than tf.keras.

If you’re interested, you’re welcome to continue with Chapter 12 onwards, where you’ll learn how to build more advanced models with TensorFlow 2.0.

Data Science Ethics

Next, you’ll learn how to navigate the ethical dilemmas that arise when exercising your new data skills. The main resource you’ll use is H.V. Jagadish’s University of Michigan course, where you’ll learn about informed consent, data ownership, privacy, anonymity, data validity, and algorithmic fairness. You’ll then learn how to use deon, a command line tool that allows you to add an ethics checklist to your data science projects. You’ll then learn about ethics in AI at a deeper level.

Scalable Data Science

You’ll then learn how to scale up your work to “big data” using parallel computing, GPUs, and the cloud. I selected Dask, BlazingSQL, and Coiled for this curriculum because they are the easiest to learn given the Python skills you’ve acquired thus far, they have strong development teams, and they are gaining industry adoption.

Dask scales up the existing Python ecosystem to multi-core machines and distributed clusters. It allows you to use your NumPy, Pandas, and Scikit-Learn skills on big data, instead of having to learn a new programming style, as you would with big data tools like Scala or Spark.

BlazingSQL provides a high-performance distributed SQL engine in Python. Like Dask, it will feel natural for Python users. A quote from Dask co-creator Matthew Rocklin:

One of the common requests we get for Dask is, “Hey, do you support SQL? I love that [with Dask] I can do some custom Python manipulation, but then I want to hand it off to a SQL engine.” And my answer has always been, “No, there is no good SQL system in Python.” But now there is — if you have GPUs.

Built with the PyData ecosystem in mind, Dask and BlazingSQL work nicely together.

Coiled is a startup that aims to make parallel computing and cloud computing easy for Python and Dask users. Per their website, “Dask scales Python for data science and machine learning, Coiled makes it easy to scale on the cloud.”

Note that “Coiled runs on AWS today, with Azure support coming soon.” Google Cloud is on the roadmap, and Google’s Head of Decision Intelligence is excited about that:

Learning a tool that is still being built out shows the benefits of an opinionated curriculum curated by an individual. I can be a little more agile than a school or company — compiling online resources only takes a few hours. I can also take a little more “tool risk.” In this case, I believe the risk is worth the reward. Plus, you’ll still learn the basic mechanics of scaling to the cloud with Coiled.

I helped curate the learning materials for each of the resources above, which is why they have the same naming convention.

Time Series Analysis

Next, you’ll hop back into developing your analyst skills. A time series is a series of data points indexed in time order. This type of data is ubiquitous, particularly in finance and applied science disciplines. First, you’ll learn how to handle time series data, then you’ll learn how to forecast based on that data.
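Here is a minimal sketch of those two steps, handling and then forecasting. It assumes a made-up set of daily sales figures and uses a simple moving average as the forecast, which is a deliberately naive baseline and not necessarily the method the courses teach. The function name `moving_average_forecast` is hypothetical.

```python
from datetime import date, timedelta

# Hypothetical daily sales, deliberately listed out of order.
observations = {
    date(2021, 1, 3): 14,
    date(2021, 1, 1): 10,
    date(2021, 1, 2): 12,
    date(2021, 1, 5): 18,
    date(2021, 1, 4): 16,
}

# Handling: put the data points in time order.
series = sorted(observations.items())


def moving_average_forecast(series, window=3):
    """Forecast the next day's value as the mean of the last `window` values."""
    recent = [value for _, value in series[-window:]]
    next_day = series[-1][0] + timedelta(days=1)
    return next_day, sum(recent) / len(recent)


next_day, forecast = moving_average_forecast(series)
print(next_day.isoformat(), forecast)
```

Notice that this forecast lags behind the upward trend in the data. Shortcomings like that are the reason dedicated forecasting models exist.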

Text Analysis

You’ll then develop your text analysis skills, learning the basics of regular expressions and natural language processing.
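For a taste of both, here is a minimal sketch using Python's built-in `re` module. The sample strings are made up, and the email pattern is a simplified assumption for illustration, not a complete validator. Counting word frequencies is one of the most basic text-processing steps.

```python
import re
from collections import Counter

# Hypothetical sample text.
message = "Contact ada@example.com or grace@example.org for details."
review = "Data, data, DATA everywhere, but not a drop of insight!"

# Regular expression: local part, "@", then one or more dot-separated domain labels.
email_pattern = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"
emails = re.findall(email_pattern, message)

# Basic text processing: lowercase, tokenize into words, count frequencies.
tokens = re.findall(r"[a-z']+", review.lower())
counts = Counter(tokens)

print(emails)
print(counts.most_common(1))
```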

Other Fun Stuff

You’ll wrap up the program by learning skills that don’t have obvious curriculum categories. First, you’ll experience common machine learning pitfalls and how to fix them in real-life workflows. Then A/B testing, a critical skill for successful online experiments. Then, web scraping, which is a hacky but effective way of importing data on the internet. Next, you’ll learn how to analyze data that has a geographic component to it. Finally, you’ll learn an exciting new data analysis tool.

Siuba, born in 2019, is a new library that emulates an R library called dplyr that you’ll learn in ModernDive. Though Siuba doesn’t have much adoption yet, I’m including it because doing EDA in a dplyr-like way in Python would be a massive addition to an analyst’s toolbox, and early feedback on it is positive. Plus, the creator of the package built an online course (and the software to deliver that course) to promote adoption.
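Of the skills in this section, A/B testing is the easiest to show in a few lines. Here is a minimal sketch of one common way to analyze a test, a two-proportion z-test. It assumes made-up conversion counts for two page variants and large enough samples for the normal approximation to hold. The function name `two_proportion_z_test` is hypothetical, and the course may teach a different approach.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: both variants convert at the same rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # conversion rate under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: 200 of 2000 visitors convert on A, 250 of 2000 on B.
z, p_value = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=250, n_b=2000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference between variants is unlikely to be chance alone, but only if the sample size and stopping rule were fixed before the experiment started.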

Frame: blog posts and YouTube videos

Interspersed between the resources above are blog posts and YouTube videos. These high-level resources frame your new skills in the context of the data industry in real life.

For example, after your introduction to data science, you’ll read a piece called, “Is data science a bubble?” You’ll gain an appreciation for where the industry is today and where the author thinks it is going.

That piece is a Cassie Kozyrkov creation, as are most of the Frame resources I selected. She’s an excellent communicator, and her pieces strike a nice balance between informative and humorous. She also has major industry experience, so her opinions carry weight.

Other examples of blog posts of hers that you’ll read include:

An excerpt from the last linked piece to get a sense of her writing:

Don’t build tools for their own sake, build them to fulfill your users’ needs and make your users happy. Focus on integration — it’s important to make these tools play well with the rest of the ecosystem, because no one wants to stop what they’re doing to give your tool special treatment unless it’s a cure-all.

I’ve personally learned a lot from Cassie’s pieces. They’ve also shaped many of my decisions for this curriculum. I think you’ll find them valuable, too.

Assess: adaptive tests

After you learn and frame a new skill, you’ll assess how proficient you are at it. You’ll use DataCamp Signal, a new adaptive testing tool launched in 2019. You’ll mainly use this tool to:

Here’s how your score is presented:

From the DataCamp Signal white paper: “Assessment results include a score (0–200), a percentile (0%-100%), and an associated knowledge level (Novice, Intermediate, Advanced).”

Each assessment is a series of 15 challenges. The difficulty of your next challenge changes based on how well you’ve scored up until that point. The entire assessment takes 5–10 minutes total.

The screen before you start DataCamp’s Python Programming assessment.

I strategically interspersed the following assessments within the curriculum to leverage a memory phenomenon called the spacing effect, which describes how our brains learn more effectively when we space out our learning over time.

At points in these adaptive tests, you’ll encounter some skills that you haven’t learned yet, and that’s okay. Again, these tests are designed to adapt to your skill level. Skip those questions or give your best guess. You’ll come back to that assessment later and you’ll be able to visualize your progress.

At the end of Term 1 and Term 2, you’ll revisit all of the skill assessments you’ve completed up until that point. These scores will provide a quantitative gauge for how prepared you are for the analyst role (Term 1) and the analyst-ML expert hybrid role (Term 2).

My Python skills measured over time. June 8th: A little rusty. June 9th: After refreshing my skills, I scored 149 (95th percentile). December 24th: Rusty again (plus a little tired). Just like any skill, your data skills can erode over time if you don’t keep them sharp!

Note that I didn’t include any R assessments because learners are unlikely to score well on those even if they master the R resources I recommend. Learners should, however, be able to score well on the SQL assessment.

Create: self-directed projects

Here’s where you’ll set yourself apart from the crowd.

A self-directed project is a project with no defined end goal, no starter code or dataset, and no templated grading. These projects, in my opinion, are the only kind of projects that employers and clients truly want to see.

You’ll use your newly acquired skills to create something unique on a subject that you’re passionate about. There are eight projects spread throughout the curriculum (four in each term) and a capstone project at the end. I recommend spending two days on each regular project, and four days on the capstone.

You’ll feature some or all of the skills you learned in the courses immediately preceding each project. Here’s one potential outcome:

Not a Real Degree community members get access to my guides for creating self-directed projects, plus recordings of me creating my self-directed projects.

You can tailor the curriculum to an industry you’d like to target. Interested in Bitcoin? Interested in fashion? Interested in healthcare? Find a dataset (using Google’s Dataset Search, for example) and create a project on it. You can dedicate all of your projects to that industry if you’d like!

You’ll include all of these projects in your digital transcript with your skill assessment scores to prove what you learned.

The main drawbacks of self-directed projects are:

For the first one, DataCamp’s adaptive tests will help. Right before you start a project, you’ll get quizzed on your new skills. You’ll receive a score and a diagnosis of your skill gaps. You can revisit learning materials if necessary. If you score well, these will serve as 10-minute skill refreshers that will make starting your project a little less daunting.

DataCamp Signal telling me my current strengths and skill gaps for Python programming.

The community will also help mitigate these concerns.

First, I’ve set up a Circle community with dedicated spaces for each project.

I’ve also set up a Deepnote team. Deepnote (the tool) is a new kind of data science notebook with real-time collaboration. Think Google Docs, but for data science.

How we’ll collaborate in Deepnote.

If you get stuck, post in the community and someone (me, a fellow learner, or a community mentor) can help you debug in Circle and/or Deepnote. In 2021, my main priority will be solving the grading problem with these tools.

I’m excited to see the projects that you create.

Career services

The final piece that weaves the curriculum together is Build a Career in Data Science by Emily Robinson and Jacqueline Nolis. Published in March 2020, it’s comprehensive and up-to-date. It even has an accompanying podcast.

The book is divided into four parts, with the parts spread equally throughout this curriculum.

I will also experiment with additional career services, such as resume reviews and interview coaching, as a part of the paid community throughout 2021.

That’s the curriculum! Read on to learn about the curation process, and your next steps.

How I created the curriculum

My process for curating this curriculum started with two questions:

First, I consulted my list of subjects from when I curated my personal data science curriculum in 2015. I then compared this list to the subjects covered in data science curricula from colleges, universities, bootcamps, and EdTech companies in 2020. I also asked for feedback from some data analyst and data scientist friends to ensure I was covering the latest tools used in industry.

I made some adjustments to my subject list, then got to selecting individual resources based on the explanations I provide above. This process took fifty hours, spread over multiple weeks.

Next steps

Interested in taking the program? Here are your next steps:

