CSCI 453
Large-Scale Data Analytics and Visualization
Coordinator: Jingnan Xie
Credits: 4.0
Description
A practical introduction to data analytics, visualization, and blending theory. Students will learn about and apply various clustering algorithms and techniques for dealing with noisy data, use a distributed data analytics framework, complete laboratory assignments using version control, and practice reproducibility by making all of their work easily shareable. Students will become familiar with modern data analytics methods and explore real-world data sets. Visualization of results, using both interactive and static frameworks, will be a large component of the course. Offered Periodically.
Prerequisites
CSCI 366 AND (MATH 235 OR MATH 333 OR MATH 335).
Course Outcomes
At the end of this course, a student will:
- Create reproducible, explainable data science workflows
- Use a modern distributed MapReduce framework, such as Apache Spark, to analyze data
- Implement parallel clustering methods
- Develop strategies for overcoming common imperfections in real-world datasets
- Apply visualization techniques to multi-dimensional data
- Apply gained skills to extract insights from multi-dimensional, real-world datasets
These goals will be accomplished through the content of the lectures and textbook, as well as hands-on experience. This hands-on experience includes writing programs (both in the lab and in project assignments). There will also be a significant course project in which you identify an analysis topic, discover data, model the data using data mining techniques, analyze the results, and report outcomes. The achievement of the goals will be measured through your performance on approximately 7 lab assignments, the project, and two exams (midterm and final).
Tentative Semester Schedule
Week 1: Introductory materials on experimental design and data
Week 2: Data operations: filtering, transforming, reducing
Week 3: Distributed computing
Week 4: Distributed regression
Week 5: Visualization of one-dimensional data
Week 6: Visualization of two-dimensional data
Week 7: Exam
Week 8: Case Study: K-Means Clustering
Week 9: Distributed Graph Algorithms
Week 10: Case Study: PageRank
Week 11: Distributed Regression
Week 12: Distributed Machine Learning + Cross Validation
Week 13: Distributed SQL
Week 14: Presentations