'We're at the beginning': Teaching Data Science at Illinois

It’s the “sexiest” job of this century. The future looks bright for data scientists. Companies large and small want these skills. And they’re paying for it.

Salary ranges vary across companies, of course, but a quick Google shows definitive mad money. Worcester Polytechnic Institute estimates the median base salary for data scientist at $116,000. Boosters aren’t the best indication of just how comfortable data scientists find themselves in this marketplace, however. We can turn to Fortune Magazine’s gleeful schadenfreude for that. “Enjoy your fat salaries while you can data scientists, because the rising tide of new talent and—gasp—automation will take their toll,” declares a recent article.

No wonder, then, that the students in Professor Robert Brunner’s INFO 490: Introduction to Data Science would rather shell out for a new computer than let technical difficulties stand in their way.

Brunner started teaching INFO 490 a few years ago. His primary appointment is in Astronomy, but he saw that students were looking for courses to formally teach them lucrative data analysis skills. So Brunner created INFO 490, to great success. He raises the enrollment cap every time he teaches it.

But there’s a catch.

“You can’t teach cloud computing without cloud computers,” Brunner says. But that’s exactly what he’s been doing.

Until this semester, Brunner’s students have been using their own laptops to access and analyze large data sets. It’s been hard, to say the least.

“If you try to teach programming to students and you don’t provide hardware, it’s very difficult. Each student had to install the course software on their own computer. They might use a Windows, Mac, or Linux-based computer, and there are different versions of each operating system,” Brunner says.

Brunner estimates that in Fall 2015, 50% or more of office hours were devoted to fixing tech problems rather than teaching content.

Some students decided to solve the problem on their own. They asked Brunner for the ideal “specs” – how fast the processor should be, how much RAM the machine should have. What, essentially, would they need to buy in order to have a laptop that worked in Intro to Data Science?

“It was great in one sense because they wanted to do well. But buying a new machine – that’s not the sort of thing we want to require to take the class,” Brunner says.

Instead, he’s looking to the skies.

On the Ground

A geek sits in front of three computer screens. Over her turquoise ponytail, you see lines and lines of lizard-green text against a black background, raining endlessly down beyond a monitor’s frame. These are the secrets of the universe.

Data science doesn’t look like this. At least it doesn’t when you’re just starting out.

Introduction to Data Science is a relatively new class at Illinois. Brunner started teaching it five years ago, in 2011, to meet the growing demand industry has for people with skills in data collection and analysis. He’s taught it through the Department of Astronomy, the Research Park, and the Informatics program. Brunner first offered the course as INFO 490 in the summer of 2014.

INFO 490 is a deeply practical course. It’s meant to teach solid skills that can be immediately applied.

And it’s not for pros. The class currently spans two semesters, with the first semester providing foundational skills like basic programming in Python, databases, and visualization. In the second semester, students go on to learn more advanced concepts like machine learning, text and social media analysis, and cloud computing. To take the second course, you have to take the first, or have instructor approval.

Screenshot of INFO 490 course assignment

Skill levels vary across students. About one-third of students in the course are in Statistics or Computer Science. Another third are graduate students from science or engineering departments. The remaining students are from all over campus.

And without cloud computing, learning data science is a test of endurance for many students, regardless of their major or background.

Brunner’s tried to provide a standard experience for all students by using Docker containers. This worked. Sort of.

Docker is a tool that bundles all of the parts that go into an application (code, libraries, tools) into a “container.” You can take your container from, say, a Windows computer to a Linux computer to a Mac and then back to Windows again, and the app will continue to run the same way.

Inside of Brunner’s Docker container was a JupyterHub server. Students needed to use this server to develop and test their code.

But some students couldn’t get Docker to work. And even those who could had to follow a multistep process just to submit their work.

“I use peer assessment in my classes. For students to submit work, they had to go to Docker to test their code, upload that code to Moodle, and then wait for their peers to download their code from Moodle and run it inside their Docker container. A significant portion of class time and office hours was given over just to answering Docker questions,” Brunner says.

But students kept coming to office hours to ask their questions. With every answer, they were one step closer to becoming that magical unicorn known as the data scientist.

In the Sky

In Spring 2016, however, they stopped coming.

Brunner used to have 10 or more students at each office hour. Now, office hours are generally empty.

What happened? The Computer Science Department rode in on a white cloud and saved them.

“The Head of CS is very supportive and gave us access to the CS cluster for this semester. We’re not trying to teach CS; we’re about application, or how to run existing techniques. But he helped us,” Brunner says.

With access to the CS cluster, students are now able to run everything in the cloud. To see assignments and run code, a student just needs a modern web browser. Students can access course materials from a computer at the library, from an iPad (Brunner doesn’t recommend this, though), or from a phone. No expensive new computer necessary.

And no need for students to take their tech problems to office hours. They’re not experiencing any.

“It’s been a game changer for us. If we have to go back to not having [access to the cluster], we couldn’t teach this course,” Brunner says.

At the Beginning

Losing INFO 490 would affect the campus for years to come.

Data science is not just a way for students to make money when they graduate. It’s a way of understanding how information works in the twenty-first century.

Teaching students to understand their world is the responsibility of the University, Brunner contends. “Universities prepare people for careers, true, but they do so by broadening people’s horizons. We want students to understand the world.”

Students don’t really understand data until they work with it themselves.

“Data science asks students, ‘Are you aware of all of the data that’s collected about you?’ Students don’t understand just how much until they see how easy it is to run machine learning algorithms and make predictions,” Brunner says.

And the rate and sophistication of data collection and use are only going to increase. This leads Brunner to make some predictions of his own.

“In terms of data science education, we’re at the beginning. Eventually, there will be some sort of data science requirement for all students on campus. It will penetrate pretty much every area on campus. If we do this right and do it right now, it will be a game changer.”