The pedagogy of computer programming for data science

Published in ConnectED Newsletter - Volume 7 - Issue 3 - March 2024

Nicole Lorenzetti

Nicole L. Lorenzetti, Lecturer in Educational Foundations and Special Education, recently published an article entitled “Some pedagogical elements of computer programming for data science: A comparison of three approaches to teaching the R language.” Co-authored with David Shilane (Columbia University) and Nicole Di Crecchio (Rutgers University), the article appeared in the November 2023 issue of Teaching Statistics: An International Journal for Statistics and Data Science Teaching.

Coursework in data science and data analysis is a fundamental part of many academic disciplines. It covers skills such as reading data, cleaning data, and implementing statistical methods for analysis and discussion of evidence. This work can be done either through point-and-click statistical software, such as SPSS, or through systems that require computer programming skills, namely the R language and the RStudio environment. R is particularly popular because it is open source (and free!), which allows users either to work with popular packages that already exist, such as dplyr or data.table, or to write new packages, such as DTwrappers, written by the first author of this paper.

The article investigates three R packages for use in graduate-level coursework: dplyr, data.table, and DTwrappers. The authors discuss the pedagogical elements of computer programming that are inherent in each package, including the functions, operators, general knowledge, and specialized knowledge needed to use each package to teach students data science. They found that, relative to dplyr and data.table, the DTwrappers package is designed more for simplicity and standalone analyses. This may be beneficial for students who plan to take only a small number of statistically oriented courses or for whom data analysis will be only a portion of their work rather than the primary component.

By contrast, dplyr and data.table require a greater degree of structuring in the coding syntax. Creating more complex structures can be beneficial as students take on more challenging projects, so these two packages can be of greater benefit to students who will write code on a more regular basis in their future courses and projects. dplyr is widely known for its consistent and beautiful syntax, while data.table's significant advantages in computational speed and efficiency can be compelling.

Each package has a specific use in a data science curriculum, depending on the needs of the student population. For students seeking coursework or a degree in statistics or data science, the introductory curriculum must balance learning new skills with building a foundation for progressively more demanding programming skills. Meanwhile, students in other disciplines who need statistical coursework, but for whom it is not a major focus of their program, may take only a single class in data science, so simple coding syntax and methods can help them develop concrete skills in a short time. A range of methods and programming styles that can be customized to different degree programs is a good tool to help teachers design curricula that use specific packages to best meet the needs of their students. This study is also a precursor to a randomized controlled trial, currently in the data collection phase, that the researchers are conducting to compare learning across all three packages.

Last Updated: 03/15/2024 13:59