Want to learn data science in 2021? Here’s the internet’s best curriculum

Curated by David Venturi for the #NotARealDegree community

  • Learn: courses, books, and tutorials
  • Frame: blog posts and YouTube videos
  • Assess: adaptive tests
  • Create: self-directed projects
  • Career services
  • How I created the curriculum

The process you’ll use to build your new data skills.

  • Just show me your picks: Sure! Curriculum.
  • Prerequisites: Basic arithmetic and high school algebra.
  • Target role: The analyst-machine learning expert hybrid.
  • Languages: 85% Python, 10% SQL, 5% R.
  • Two terms: Term 1 covers basic data analysis. Term 2 covers machine learning and advanced data analysis topics.
  • Time commitment: Each term is roughly 65 days, where one day contains 4–5 hours of focused learning. You set your schedule and location.
  • Price: Varies since some resources require a subscription. As little as $249 for those who complete the curriculum in six months, and $375 for a year.
  • Community: I’m creating a community so you don’t have to learn alone. Parts of the community will be paid ($8/month or $67/year) so members are invested and engaged. Join the waitlist!
  • Why I’m doing this: To help democratize education. I chose an affiliate revenue model blended with a paid community to make maintaining the curriculum and building the community my full-time job.

Curriculum overview

Term 1: Data Analysis

  • Introduction to Data Science
  • Introduction to Python Programming
  • Setting Up Your Computer
  • Python Data Science Toolbox
  • Importing Data
  • Preparing Data
  • Exploratory Data Analysis
  • Statistics
  • Data Visualization
  • More Statistics
  • Databases & SQL
  • Data Engineering
  • Data Warehouses & Cloud Computing
  • Analytics Engineering

Term 2: Machine Learning & More

  • Objects & Algorithms
  • Introduction to Machine Learning
  • More Python Programming
  • Supervised Learning
  • More Data Visualization
  • Unsupervised Learning
  • Introduction to Neural Networks
  • Data Science Ethics
  • Scalable Data Science
  • Time Series Analysis
  • Text Analysis
  • Other Fun Stuff

Learn: courses, books, and tutorials

Below, I’ll link to each course, book, and tutorial I selected in the order they appear in the curriculum. I’ll also briefly explain why I put it there.

  • By content, I am referring to how they’ve unbundled the multi-week course into four-hour mini-courses taught by subject matter experts.

Term 1: Data Analysis

Introduction to Data Science

First, you’ll acquire a framework for understanding the data science industry in a theory-only course. Then, you’ll do a little data science using Python. You’ll make your first coding errors under the instructor’s guidance, so no need to be intimidated!

A video from Data Science for Everyone on DataCamp.

Introduction to Python Programming

Next, you’ll learn Python programming and the fundamentals of computer science, which are foundational to the data skills you’ll learn next.

Dr. Joyner teaching in Georgia Tech’s Introduction to Python Programming series on edX.

Setting Up Your Computer

Next, you’ll set up your computer and learn how to work in your own computing environment (as opposed to the environment set up for you by DataCamp or edX, for example).

The JupyterLab interface, where you can interact with the command line, conda, and Git, as well as do fancy data science as displayed in the notebooks.

Python Data Science Toolbox

Next, you’ll add some more programming tools to your Python toolbox. These tools will come in handy later in the curriculum.

Importing Data

As you’ll have learned in Data Science for Everyone, importing data is part of the first step of the data science workflow. You’ll learn how to import data using pandas, the most popular analytics library in Python.

Preparing Data

You’ll then learn how to prepare your data for analysis. The following courses teach the skills you’ll use most often.

Exploratory Data Analysis

Exploratory data analysis (EDA) is the process of exploring data to summarize their main characteristics. You’ll learn EDA next, and you’ll do it often in your data career.
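
As a rough illustration of what "summarizing main characteristics" can look like, here is a minimal sketch that uses only Python's standard library rather than pandas. The records, the column names, and the `summarize` helper are all hypothetical and invented for this example.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical records, invented purely for illustration.
rows = [
    {"city": "Toronto", "temp_c": 21.0},
    {"city": "Toronto", "temp_c": 25.0},
    {"city": "Ottawa", "temp_c": 18.0},
    {"city": "Ottawa", "temp_c": None},  # a missing value
    {"city": "Toronto", "temp_c": 23.0},
]


def summarize(records, column):
    """Return basic summary statistics for one numeric column."""
    # Drop missing values before computing anything.
    values = [r[column] for r in records if r[column] is not None]
    return {
        "count": len(values),
        "missing": len(records) - len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
    }


summary = summarize(rows, "temp_c")
city_counts = Counter(r["city"] for r in rows)

print(summary)
print(city_counts.most_common())
```

In practice you would likely reach for pandas for this kind of summary; the point here is only to show the questions EDA asks: how many values, how many missing, and what the typical and extreme values are.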

Statistics

Statistics is the study of how to collect, analyze, and draw conclusions from data. You’ll learn how to do that in Python, while also learning some of the probability theory that underlies statistical inference.
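
To make "drawing conclusions from data" a little more concrete, here is a minimal sketch of estimating a mean and an approximate 95% confidence interval. The sample values are hypothetical, and the interval uses a normal approximation (the 1.96 multiplier), which is a simplifying assumption; for a sample this small, a t-distribution would usually be more appropriate.

```python
import math
import statistics

# Hypothetical sample, invented for illustration.
sample = [4.0, 6.0, 5.0, 7.0, 3.0, 5.0, 6.0, 4.0]

n = len(sample)
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)

# Standard error of the mean: how much the sample mean would vary
# from sample to sample.
standard_error = sample_sd / math.sqrt(n)

# Approximate 95% interval under a normal approximation (an assumption).
z = 1.96
ci_low = sample_mean - z * standard_error
ci_high = sample_mean + z * standard_error

print(f"mean = {sample_mean:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```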

An exercise from Introduction to Statistics in Python on DataCamp.

Data Visualization

You’ll then learn how to visualize your data. First, you’ll learn the theory behind data visualization, then how to use the most popular data viz libraries in Python.

More Statistics

You’ll then dive a little deeper into statistics, which will prepare you for machine learning in Term 2. You’ll also learn the basics of R, a programming language that is optimized for statistics.

The ModernDive website.

Databases & SQL

Nearly every data role requires the basics of databases and SQL, and you’ll acquire them next.
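
For a taste of what the basics look like, here is a minimal sketch of creating two tables and joining them. It uses Python's built-in `sqlite3` module with an in-memory SQLite database as a stand-in (the curriculum's courses use PostgreSQL), and the `customers` and `orders` tables are hypothetical.

```python
import sqlite3

# An in-memory SQLite database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical tables and rows, invented for illustration.
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
    INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 15.0), (3, 2, 40.0);
    """
)

# LEFT JOIN keeps customers with no orders; COALESCE turns their NULL total into 0.
query = """
    SELECT c.name, COUNT(o.id) AS n_orders, COALESCE(SUM(o.amount), 0) AS total
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.name
    ORDER BY c.name;
"""
rows = conn.execute(query).fetchall()
conn.close()

for name, n_orders, total in rows:
    print(name, n_orders, total)
```

The same `SELECT ... JOIN ... GROUP BY` pattern should carry over to PostgreSQL with little or no change, though dialects differ in details.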

An exercise from Joining Data in PostgreSQL on DataCamp.

Data Engineering

You’ll then play the role of a data engineer so you can understand how data analysts (and later, analytics engineers) interact with data engineers. You’ll also dive deeper into modeling data.

Data Warehouses & Cloud Computing

You’ve learned the basics of storing and querying data. Now you’ll scale these skills out to data warehouses in the cloud. First, you’ll learn about data warehouses and how they differ from regular databases. Next, you’ll learn the basics of cloud computing. You’ll then learn how to use Snowflake, a cloud-based data warehouse that is rapidly conquering its industry.

Snowflake is a cloud-based data warehousing company.

Analytics Engineering

An analytics engineer exists somewhere between the data engineers and the data analysts. You’ll learn how the role came to exist and how to use the hottest tool for analytics engineering in 2021 — dbt.

dbt is pioneering modern analytics engineering.

Term 2: Machine Learning & More

Objects & Algorithms

To kick off Term 2, you’ll wrap up Dr. David Joyner’s Introduction to Python Programming series. Object-oriented programming and algorithms are concepts that beginners can struggle with, so I delayed this course until now. These topics are important foundations for machine learning.

Introduction to Machine Learning

Next, you’ll learn the basics of machine learning. First, you’ll acquire a framework for understanding it from Karolis Urbonas (Head of Machine Learning and Science at Amazon) in a theory-only course. Then, you’ll start doing machine learning in Python with Aurélien Géron’s Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow book.

More Python Programming

These advanced Python skills will round out your toolbox for data science programming. With them, you’ll feel fully in command of the code you write.

Supervised Learning

You’ll then continue with the supervised learning chapters of Aurélien Géron’s Hands-On Machine Learning book. These chapters teach the Scikit-Learn library.

More Data Visualization

To break up the machine learning content, you’ll up your data visualization game. First, more Seaborn. Second, tips to make your visualizations more compelling. Third, interactive data visualization with plotly.

An exercise from Intermediate Data Visualization with Seaborn on DataCamp.

Unsupervised Learning

You’ll then hop back into Aurélien Géron’s Hands-On Machine Learning, where you’ll learn unsupervised learning techniques in Scikit-Learn.

Introduction to Neural Networks

You’ll then learn how to train neural networks (i.e., do deep learning) with my final recommended chapters of Aurélien Géron’s Hands-On Machine Learning. You’ll pick up Keras, which is a high-level deep learning library that makes it simple to train and run neural networks.

Data Science Ethics

Next, you’ll learn how to navigate the ethical dilemmas that arise when you exercise your new data skills. The main resource you’ll use is H.V. Jagadish’s University of Michigan course, where you’ll learn about informed consent, data ownership, privacy, anonymity, data validity, and algorithmic fairness. You’ll then learn how to use deon, a command line tool that allows you to add an ethics checklist to your data science projects. Finally, you’ll learn about ethics in AI at a deeper level.

Scalable Data Science

You’ll then learn how to scale up your work to “big data” using parallel computing, GPUs, and the cloud. I selected Dask, BlazingSQL, and Coiled for this curriculum because they are the easiest to learn given the Python skills you’ve acquired thus far, they have strong development teams, and they are gaining industry adoption.

Built with the PyData ecosystem in mind, Dask and BlazingSQL work nicely together.

Time Series Analysis

Next, you’ll hop back into developing your analyst skills. A time series is a series of data points indexed in time order. This type of data is ubiquitous, particularly in finance and applied science disciplines. First, you’ll learn how to handle time series data, then you’ll learn how to forecast based on that data.
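
Here is a minimal sketch of both ideas, handling and then forecasting, using only the standard library. The observations are hypothetical, and the "forecast" is deliberately naive (it simply reuses the latest moving average); the courses in this section cover proper forecasting models.

```python
from datetime import date

# Hypothetical daily observations, deliberately out of order.
observations = [
    (date(2021, 1, 3), 12.0),
    (date(2021, 1, 1), 10.0),
    (date(2021, 1, 2), 11.0),
    (date(2021, 1, 5), 16.0),
    (date(2021, 1, 4), 13.0),
]

# Step 1: handle the data by putting it in time order.
series = sorted(observations, key=lambda pair: pair[0])
values = [value for _, value in series]


def moving_average(xs, window):
    """Average each run of `window` consecutive values."""
    return [
        sum(xs[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(xs))
    ]


# Step 2: smooth the series, then use the latest smoothed value
# as a naive one-step-ahead forecast (an illustrative assumption).
smoothed = moving_average(values, window=3)
forecast = smoothed[-1]

print(smoothed)
print(forecast)
```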

Text Analysis

You’ll then develop your text analysis skills, learning the basics of regular expressions and natural language processing.
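
As a small illustration of how regular expressions support text analysis, here is a minimal sketch that tokenizes a sentence and counts words with Python's built-in `re` module. The sample text and both patterns are hypothetical choices for this example; real tokenization for natural language processing is usually more involved.

```python
import re
from collections import Counter

# Hypothetical sample text, invented for illustration.
text = "Data science is fun. Data cleaning is 80% of data science!"

# Lowercase the text, then pull out runs of letters as crude word tokens.
tokens = re.findall(r"[a-z]+", text.lower())
counts = Counter(tokens)

# A second pattern: one or more digits followed by a percent sign.
percentages = re.findall(r"\d+%", text)

print(counts.most_common(3))
print(percentages)
```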

Other Fun Stuff

You’ll wrap up the program by learning skills that don’t fit obvious curriculum categories. First, you’ll encounter common machine learning pitfalls and learn how to fix them in real-life workflows. Then you’ll learn A/B testing, a critical skill for running successful online experiments, followed by web scraping, a hacky but effective way of importing data from the internet. Next, you’ll learn how to analyze data that has a geographic component. Finally, you’ll learn an exciting new data analysis tool.
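
To give a flavor of the A/B testing topic, here is a minimal sketch of one common way to compare two conversion rates: a two-proportion z-test. The visitor and conversion counts are hypothetical, the `two_proportion_z_test` helper is invented for this example, and this is only one of several approaches an A/B testing course might teach.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical experiment: 200 of 2,000 control visitors converted,
# versus 250 of 2,000 visitors who saw the variant.
z, p_value = two_proportion_z_test(200, 2000, 250, 2000)

print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference is unlikely to be chance alone, but deciding the sample size and the significance threshold before the experiment starts matters just as much as the arithmetic.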

Frame: blog posts and YouTube videos

Interspersed between the resources above are blog posts and YouTube videos. These high-level resources frame your new skills in the context of the data industry in real life.

Assess: adaptive tests

After you learn and frame a new skill, you’ll assess how proficient you are at it. You’ll use DataCamp Signal, an adaptive testing tool launched in 2019. You’ll mainly use this tool to:

  • Create a digital transcript using your test scores to prove what you learned.
  • Track your scores throughout the curriculum to visualize and gamify your progress.

From the DataCamp Signal white paper: “Assessment results include a score (0–200), a percentile (0%-100%), and an associated knowledge level (Novice, Intermediate, Advanced).”

The screen before you start DataCamp’s Python Programming assessment.

My Python skills measured over time. June 8th: A little rusty. June 9th: After refreshing my skills, I scored 149 (95th percentile). December 24th: Rusty again (plus a little tired). Just like any skill, your data skills can erode over time if you don’t keep them sharp!

Create: self-directed projects

Here’s where you’ll set yourself apart from the crowd.

  • Grading is hard. How do I know if my work is correct?

DataCamp Signal telling me my current strengths and skill gaps for Python programming.

How we’ll collaborate in Deepnote.

Career services

The final piece that weaves the curriculum together is Build a Career in Data Science by Emily Robinson and Jacqueline Nolis. Published in March 2020, it’s comprehensive and up-to-date. It even has an accompanying podcast.

  • Part 2: Finding Your Data Science Job
  • Part 3: Settling Into Data Science
  • Part 4: Growing in Your Data Science Role

How I created the curriculum

My process for curating this curriculum started with two questions:

  1. What are the best resources for learning those subjects?

Next steps

Interested in taking the program? Here are your next steps:

  • Follow the steps in this post to set up your learning experience.
  • Start learning.

Curating the internet’s best data science program.
