Resources
Course Topics
In this project-based course, we have projects and conceptual learning objectives.
Conceptual Learning Objectives
The course content will be structured into the following units:
Unit # | Title | Modules and Description |
---|---|---|
1 | From Problem Identification to Ethical Data Solutions | Problem Identification Distilling the Analytic Objective Data Science Lifecycle Data Collection Ethics of Data Science |
2 | Structures, Algorithms, and Systems | Data Structures Sparse Matrix Recommender System Matrix Factorization Linear Algebra with NumPy |
3 | Data Exploration and Preparation | Performing Exploratory Data Analysis Statistical Inference & Hypothesis Testing Feature Engineering |
4 | Analytic Algorithms and Model Building | Supervised and Unsupervised Techniques Bias/Variance Tradeoff [Research paper] Conceptual Complexity and the Bias/Variance Tradeoff |
5 | Text Data & Natural Language Processing | Introduction to Natural Language Processing Language Representation and Modeling NLP Tasks, Applications, and Practical Implementation |
6 | Model Lifecycle: Selection, Evaluation, and Deployment | Model Selection Metrics and Interpretation [Research paper] Manipulating and Measuring Model Interpretability Model Deployment |
7 | Deep Learning | CPU vs. GPU Deep Learning Computer Vision |
8 | Advanced Natural Language Processing | Language Representation and Transformers BERT [Research paper] Attention Is All You Need |
Projects
The programming projects are geared towards providing hands-on experience with the following topics:
Project 1: Data Structure Selection and Optimization for Large Data Sets
The growth of the home-sharing and short-term rental markets has presented opportunities and challenges for communities globally. For some, it encourages tourism and provides additional income streams; for others, it exacerbates the shortage of affordable housing. Your goal is to scrape a data set from the Airbnb website and build a helper website with a responsive, user-friendly interface that computes and shares the relevant analytic results on the fly. In this project, you will get hands-on experience with different data types and data structures, and you will decide which data structure to apply by analyzing the performance differences between approaches. You will explore different caching strategies and understand the time differences in read/write operations across different data storage implementations.
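As a rough illustration of the caching idea mentioned above, the sketch below memoizes an analytic query with Python's standard `functools.lru_cache`. The `LISTINGS` data and the `average_price` function are hypothetical and are not part of the project handout; the actual project may use a different caching strategy or storage backend.

```python
from functools import lru_cache
from statistics import mean

# Hypothetical listings: (neighbourhood, nightly price). Not the course dataset.
LISTINGS = [
    ("Downtown", 120.0),
    ("Downtown", 80.0),
    ("Riverside", 95.0),
]

@lru_cache(maxsize=128)
def average_price(neighbourhood: str) -> float:
    """Scan all listings on a cache miss; repeated queries reuse the stored result."""
    prices = [price for name, price in LISTINGS if name == neighbourhood]
    return mean(prices)

average_price("Downtown")  # cache miss: scans LISTINGS
average_price("Downtown")  # cache hit: no scan
info = average_price.cache_info()  # hit/miss counters kept by lru_cache
```

A cache like this trades memory for speed, and it returns stale results if the underlying data changes, so it would need to be cleared (for example with `average_price.cache_clear()`) after each new scrape.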
Project 2: Problem Representation
Product reviews are a rich source of data for any business and can be used to improve customer experience, for example, by recommending new products that cater to the customer’s preferences (think Amazon or Netflix). In this project, students will use a dataset of product review ratings to implement a simple recommender system using the collaborative filtering approach. The project will help students understand different representations of the collaborative filtering problem as well as the runtime and storage tradeoff in each representation.
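To make the storage tradeoff concrete, here is a minimal sketch that holds the same ratings in a dense matrix and in a sparse dictionary-of-dictionaries. The ratings, sizes, and the `common_items` helper are all hypothetical, chosen only for illustration; the project itself may rely on NumPy arrays and a sparse matrix library instead of plain Python containers.

```python
# Hypothetical ratings: (user_id, item_id, rating). Illustrative values only.
RATINGS = [(0, 1, 5.0), (0, 3, 3.0), (2, 1, 4.0)]
N_USERS, N_ITEMS = 3, 4

# Dense representation: one cell per (user, item) pair, 0.0 meaning "not rated".
dense = [[0.0] * N_ITEMS for _ in range(N_USERS)]
for user, item, rating in RATINGS:
    dense[user][item] = rating

# Sparse representation: store only the observed ratings, keyed by user.
sparse = {}
for user, item, rating in RATINGS:
    sparse.setdefault(user, {})[item] = rating

def common_items(u, v):
    """Items rated by both users, found without scanning unrated cells."""
    return sparse.get(u, {}).keys() & sparse.get(v, {}).keys()

dense_cells = N_USERS * N_ITEMS                          # grows with users x items
sparse_cells = sum(len(row) for row in sparse.values())  # grows with observed ratings
```

The dense form gives constant-time lookup for any cell but stores every unrated pair, while the sparse form stores only what was observed at the cost of dictionary lookups.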
Project 3: Domain Analysis and Exploration
Exploratory data analysis and visualization are critical steps in analyzing data, understanding feature correlations, and ultimately preparing the data for predictive modeling. In this project, students will utilize the Climate Change and Food Production dataset to preprocess data and generate visuals such as bar charts, heatmaps, and time series analysis graphs. By merging two different data sets, students can visually observe how climate change has influenced food production over the years. This project will help with understanding how to preprocess data and generate meaningful visuals to draw inferences.
Project 4: Domain Data Preparation
Advances in the internet and in search engines have provided us with unparalleled access to data; with a simple search query, you can acquire countless data points from several sources, including social media, news articles, and even research papers. An experienced data scientist should be able to build their own dataset on any given topic by performing the necessary data collection and processing tasks. This project will introduce students to a variety of toolkits for retrieving text data and for transforming it into a format suitable for numerical computations.
Project 5: Machine Learning and Model Performance
Machine learning has become a very popular field in recent years due to its transformative applications coupled with the availability of good datasets in various domains. A large number of machine learning libraries and tutorials have also greatly reduced the entrance barriers to the field. On the other hand, one still needs a solid understanding of the underlying data and mathematical principles to properly utilize existing machine learning tools. For example, machine learning methods typically work best when the hyperparameters are tuned based on the dataset at hand, yet care must also be taken to preserve the boundary between training and testing data. This project will provide students with hands-on experience in some of these areas.
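One common way to respect the boundary between training and testing data while tuning is to hold out a separate validation split. The sketch below assumes a toy 1-D k-nearest-neighbours classifier and made-up data; the function names and values are hypothetical and are not the project's actual models or dataset.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Majority label among the k training points nearest to x (1-D features)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, evaluation, k):
    """Fraction of evaluation points whose predicted label matches the true label."""
    correct = sum(knn_predict(train, x, k) == y for x, y in evaluation)
    return correct / len(evaluation)

# Hypothetical toy data: (feature, label). The point (1.4, "b") is deliberate noise.
train = [(0.0, "a"), (1.0, "a"), (2.0, "a"), (1.4, "b"),
         (3.0, "b"), (4.0, "b"), (5.0, "b")]
validation = [(1.5, "a"), (0.5, "a"), (4.5, "b")]
test = [(1.3, "a"), (3.6, "b")]

# Tune the hyperparameter k on the validation split only.
best_k = max([1, 3, 5], key=lambda k: accuracy(train, validation, k))

# Touch the test split once, after k has been fixed.
test_accuracy = accuracy(train, test, best_k)
```

Because the test split plays no part in choosing `k`, the final accuracy remains an honest estimate of performance on unseen data.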
Project 6: Model Deployment and Comparison
Deep learning has transformed many modern technologies, from language translation to self-driving cars. At the same time, the abundance of deep learning libraries and frameworks now allows us to become not just users but also developers of deep learning applications. This project will provide you with hands-on experience in building and deploying convolutional network models in a practical setting, using PyTorch and Azure cloud resources.
Project 7: Evaluation Optimization
One of the major phases of any machine learning pipeline is the evaluation and optimization of models based on the results of successive evaluations. In this project, you will use the SQuAD dataset to build a question answering system, utilizing various techniques for data preprocessing and question answering tasks while understanding the strengths and weaknesses of each technique. This project will help you understand the various perspectives on how to optimize solutions in machine learning as you apply different techniques, as well as the caveats for each technique you consider.
Project Learning Objectives
The project learning objectives (LOs) are designated below. Students will be able to:
Item | Description |
---|---|
Computer Systems and Data Structures | - Implement a variety of basic data structures and algorithms in pure Python - Consider the differences between different data structures, and decide on the best one given performance limitations. - Optimize Python code to hit performance benchmarks. |
Problem Representation | - Read data and perform basic table operations in Pandas. - Use NumPy operations and sparse matrices to perform efficient computations on large datasets. - Acquire a basic understanding of recommender systems in general and collaborative filtering in particular. |
Domain Analysis and Exploration | - Formulate functional and non-functional requirements for an envisioned data-driven solution to a business/research problem. |
Domain Data Preparation | - Use HTTP requests, web scrapers, and pdfminer to retrieve data from a variety of sources, leveraging structured, semi-structured, and unstructured data to build holistic views of user experience and deliver targeted analytic solutions. - Perform data cleaning and preprocessing using appropriate APIs to organize the data. |
Machine Learning and Model Performance | - Build and deploy a machine learning model using the appropriate analytic algorithms (such as linear regression, logistic regression, and SVM) to gain an understanding from data, make predictions to solve business problems, and inform decision making. - Experiment with different corpus models to perform multi-class classification on datasets. - Interpret domain problems as instances of data science task patterns, including classification, regression, ranking, and clustering. |
Model Deployment and Comparison | - Compare the performance (training or deployment) of a subset of solutions on CPUs vs. GPUs. - Use model evaluation metrics to assess the goodness of fit between a model and data and cross-validation frameworks to evaluate predictive models. - Select appropriate visualization techniques to facilitate understanding of model performance and support error analysis. - Gain familiarity with machine learning on Microsoft Azure. |
Optimization of Model Performance | - Develop a QA system using the SQuAD dataset. - Develop closed-domain QA systems using a range of techniques, from unsupervised methods (Jaccard overlap, tf-idf vectors, and syntactic information extracted from sentences through abstract syntax trees) to supervised methods, including simple linear models such as logistic regression and state-of-the-art language models such as BERT. - Gain familiarity with machine learning on Microsoft Azure. |
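
As a minimal sketch of the Jaccard overlap baseline named in the last objective, the snippet below picks the passage sentence whose token set overlaps most with the question. The tokenizer, the `best_sentence` helper, and the example passage are hypothetical simplifications and are not taken from SQuAD or from the project starter code.

```python
import re

def tokens(text):
    """Lowercased alphanumeric word tokens, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Size of the intersection over size of the union; 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_sentence(question, sentences):
    """Return the sentence with the highest Jaccard overlap with the question."""
    q = tokens(question)
    return max(sentences, key=lambda s: jaccard(q, tokens(s)))

# Hypothetical passage, made up for illustration.
passage = [
    "The library opened in 1895.",
    "It holds over two million books.",
]
answer_sentence = best_sentence("When did the library open?", passage)
```

This baseline ignores word order and treats "open" and "opened" as unrelated tokens, which is one of the weaknesses that the tf-idf and supervised approaches are meant to address.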
Getting help
Piazza
The best communication portal to inquire about coursework-related matters is Piazza (see the Overview for the Piazza link). For urgent communication with the teaching staff, it is best to post on Piazza and then send an email for a timely response.
Office Hours (OH)
The teaching staff holds weekly office hours to assist students with any course-related matters. Students can find the office hours schedule in the Google Calendar provided in the Overview. Before joining a Zoom meeting room, students must join the OH queue, regardless of the queue’s current status, as only those on the queue will be invited to the meetings. Doing so allows TAs to better prepare for students’ questions, helps maintain an orderly queue, and ensures students can track their place in line without concern.
Designated TA / Extra-help OH
Students will be assigned a designated TA who will be their primary contact for course-related matters. It is important to note that students are not limited to attending only their designated TA’s office hours; they are encouraged to interact with all course staff as needed.
Designated TAs also provide students with special extra-help OH if additional assistance is required. These extra-help sessions are separate from the regular office hours and follow a separate schedule. This offers students a valuable opportunity to receive extended support whenever needed.
Accommodations
For Students with Disabilities
If you have a disability and have an accommodations letter from the Disability Resources office, I encourage you to discuss your accommodations and needs with me as early in the semester as possible. I will work with you to ensure that accommodations are provided as appropriate. If you suspect that you may have a disability and would benefit from accommodations but are not yet registered with the Office of Disability Resources, I encourage you to contact them at access@andrew.cmu.edu.
Medical Accommodations
If you require accommodations for medical reasons, please contact Carnegie Mellon University’s Disability Resources to initiate the process. It is important to engage with Disability Resources directly for all accommodation requests, as they are equipped to assess and provide the necessary support in line with university procedures. Please refrain from sending any medical documentation directly to course instructors. As instructors, we are committed to implementing the accommodations approved by Disability Resources to ensure equitable access to our course.
Take Care of Yourself
Do your best to maintain a healthy lifestyle this semester by eating well, exercising, avoiding drugs and alcohol, getting enough sleep, and taking some time to relax. This will help you achieve your goals and cope with stress.
All of us benefit from support during times of struggle. You are not alone. There are many helpful resources available on campus and an important part of the college experience is learning how to ask for help. Asking for support sooner rather than later is often helpful.
If you or anyone you know experiences any academic stress, difficult life events, or feelings like anxiety or depression, we strongly encourage you to seek support. Counseling and Psychological Services (CaPS) is here to help: call 412-268-2922 and visit their website at http://www.cmu.edu/counseling/. Consider reaching out to a friend, faculty, or family member you trust for help getting connected to the support that can help.
If you or someone you know is feeling suicidal or in danger of self-harm, call someone immediately, day or night:
CaPS: 412-268-2922
Resolve Crisis Network: 888-796-8226
If the situation is life-threatening, call the police:
- On-campus: CMU Police: 412-268-2323
- Off-campus: 911
Please let us know if you have questions about this or your coursework.