The project work involves three main components: data preprocessing, the application of statistical methods, and the visualization of major findings. The work is carried out in teams of at most 5 students. To ensure diversity in the datasets used, each team must share a link to the dataset it is working on, and no two teams may select the same dataset. Because the statistical analysis requires numeric data, confirm that some columns in the dataset are numeric. The project must be completed in an interactive notebook that combines Markdown and Python code cells.
- You may not use the Titanic dataset that was used in class.
- You may acquire the dataset for your team from Kaggle (https://www.kaggle.com/datasets) or any other open data source.
- Make sure your dataset has enough records to work with: a minimum of 2000 rows.
- Each team is required to present their findings.
- The demonstration date is specified in the progression plan.
- It is mandatory for every team member to participate in the demonstration.
- One team member submits both the dataset and the notebook file in Moodle.
- You can utilize Google Colab to create a collaborative notebook, with each team member added as a contributor.
- Submit the link to the collaborative notebook and the link to the dataset.
Employ your imagination and creativity when it comes to preprocessing and analyzing your dataset. While you are encouraged to explore beyond the listed tasks below, you can consider the following as a starting point:
- Select a dataset to work on and understand your dataset. Your activities may include tasks such as viewing a random sample of data, getting the total number of rows and columns.
- Check to see if your dataset contains any missing values and get the percentage of the missing data. Within the context of your dataset, decide what to do with the missing values and take necessary steps.
- Identify and drop duplicate values from the dataset.
- Separate one column of your dataset that contains continuous numeric data into appropriate bins. You may use the pandas cut or qcut function.
- Identify any outliers within your dataset. If the dataset does not include any outliers, you can deliberately perturb a random portion of your data to create some.
- Decide what to do with the outliers.
- Use your dataset and print the column names that represent nominal attributes.
- Use your dataset and print the column names that represent binary attributes.
- Use your dataset and print the column names that represent ordinal attributes.
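The preprocessing tasks above can be sketched with pandas as follows. This is only an illustration under stated assumptions: the toy DataFrame, its column names, the bin edges, the median fill, and the 1.5 × IQR outlier rule are all hypothetical choices, not requirements of the assignment. Ordinal attributes in particular cannot be detected automatically; here they are assumed to be the columns that were explicitly given an ordered categorical type.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a team's dataset (all names are made up).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=200).astype(float),
    "income": rng.normal(50_000, 10_000, size=200),
    "city": rng.choice(["Oulu", "Turku", "Espoo"], size=200),
    "is_member": rng.choice([0, 1], size=200),
    "satisfaction": rng.choice(["low", "medium", "high"], size=200),
})
df.loc[::25, "age"] = np.nan                           # inject missing values
df.loc[0, "income"] = 500_000                          # inject one obvious outlier
df = pd.concat([df, df.iloc[:5]], ignore_index=True)   # inject 5 duplicate rows

# 1. Understand the dataset: random sample and shape.
print(df.sample(5, random_state=0))
n_rows, n_cols = df.shape

# 2. Missing values: percentage per column, then one possible strategy (median fill).
missing_pct = df.isna().mean() * 100
df["age"] = df["age"].fillna(df["age"].median())

# 3. Duplicates: count them, then drop them.
n_duplicates = int(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)

# 4. Binning: cut uses fixed edges, qcut uses quantiles (roughly equal-sized bins).
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 50, 120],
                         labels=["young", "middle", "senior"])
df["income_quartile"] = pd.qcut(df["income"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# 5. Outliers: flag values outside 1.5 * IQR, then (as one option) filter them out.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)
df_clean = df[~is_outlier]

# 6. Attribute types. Assumption: ordinal columns are declared by hand as ordered
#    categoricals; binary means exactly two distinct values; nominal means
#    non-numeric and neither binary nor ordinal.
df["satisfaction"] = pd.Categorical(df["satisfaction"],
                                    categories=["low", "medium", "high"], ordered=True)
binary_cols = [c for c in df.columns if df[c].nunique(dropna=True) == 2]
ordinal_cols = [c for c in df.columns
                if isinstance(df[c].dtype, pd.CategoricalDtype) and df[c].dtype.ordered]
nominal_cols = [c for c in df.columns
                if not pd.api.types.is_numeric_dtype(df[c])
                and c not in binary_cols and c not in ordinal_cols]

print("Missing (%):", missing_pct.round(2).to_dict())
print("Duplicates removed:", n_duplicates)
print("Nominal:", nominal_cols)
print("Binary:", binary_cols)
print("Ordinal:", ordinal_cols)
```

Whether to fill, drop, or keep missing values and outliers depends on the dataset, so the choices shown here should be justified in the notebook rather than copied.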
In this task, you will utilize NumPy to perform fundamental statistical operations on your dataset. You can explore the following calculations as a guideline:
- Mean (calculate the mean, or average)
- Median (find the median value)
- Standard Deviation (compute the standard deviation of the data)
- Variance (calculate the variance of the data)
- Minimum and Maximum (find the minimum and maximum values)
- Sum and Product (compute the sum and product)
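A minimal sketch of these calculations with NumPy is shown below. The array is a hypothetical stand-in for one numeric column; in the project it would typically come from the dataset (for example via a DataFrame column's to_numpy() method).

```python
import numpy as np

# Hypothetical numeric column used only for illustration.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

stats = {
    "mean": np.mean(values),      # 5.0
    "median": np.median(values),  # 4.5
    "std": np.std(values),        # 2.0 (population std, ddof=0 is NumPy's default)
    "var": np.var(values),        # 4.0
    "min": np.min(values),        # 2.0
    "max": np.max(values),        # 9.0
    "sum": np.sum(values),        # 40.0
    "prod": np.prod(values),      # 201600.0
}

# Sample standard deviation (divides by n - 1 instead of n).
sample_std = np.std(values, ddof=1)

for name, value in stats.items():
    print(f"{name:>6}: {value}")
print(f"sample std: {sample_std:.3f}")
```

Two practical points: if the column still contains missing values, the plain functions return nan, and the NaN-aware variants (np.nanmean, np.nanmedian, np.nanstd, and so on) can be used instead; and the product of a long integer column can overflow, so converting to float first is usually safer.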
You will utilize Matplotlib to create a diverse range of plots and charts. Depending on the dataset and your analysis, you may generate the following visuals as needed:
- Line Plot
- Scatter Plot
- Bar Chart
- Histogram
- Pie Chart
- Heatmap
- 3D Plot
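The plot types listed above can be sketched in a single Matplotlib figure as follows. The data is randomly generated and the variable names are hypothetical; in the project each plot would use columns that make sense for the team's dataset. The heatmap here shows a correlation matrix drawn with imshow, which is one common choice rather than the only one.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data used only for illustration.
rng = np.random.default_rng(1)
x = np.arange(50)
y = np.cumsum(rng.normal(size=50))
z = rng.normal(size=50)
categories = ["A", "B", "C"]
counts = [12, 30, 8]

fig = plt.figure(figsize=(16, 7))

ax = fig.add_subplot(2, 4, 1)
ax.plot(x, y)
ax.set(title="Line plot", xlabel="x", ylabel="y")

ax = fig.add_subplot(2, 4, 2)
ax.scatter(y, z)
ax.set(title="Scatter plot", xlabel="y", ylabel="z")

ax = fig.add_subplot(2, 4, 3)
ax.bar(categories, counts)
ax.set(title="Bar chart", xlabel="category", ylabel="count")

ax = fig.add_subplot(2, 4, 4)
ax.hist(z, bins=10)
ax.set(title="Histogram", xlabel="z", ylabel="frequency")

ax = fig.add_subplot(2, 4, 5)
ax.pie(counts, labels=categories, autopct="%1.0f%%")
ax.set_title("Pie chart")

# Heatmap of the correlation matrix between x, y and z.
corr = np.corrcoef(np.vstack([x, y, z]))
ax = fig.add_subplot(2, 4, 6)
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(3))
ax.set_xticklabels(["x", "y", "z"])
ax.set_yticks(range(3))
ax.set_yticklabels(["x", "y", "z"])
ax.set_title("Heatmap (correlation)")
fig.colorbar(im, ax=ax)

# 3D scatter plot; projection="3d" creates a three-dimensional axes.
ax3d = fig.add_subplot(2, 4, 7, projection="3d")
ax3d.scatter(x, y, z)
ax3d.set(title="3D plot", xlabel="x", ylabel="y", zlabel="z")

fig.tight_layout()
```

In a notebook the figure is rendered when the cell finishes; in a plain script, a call to plt.show() or fig.savefig(...) would be needed to see it.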
For this project work, the grading is conducted on a scale of 0 to 5 and encompasses three main components: data preprocessing, the application of statistical methods, and the visualization of major findings. Here are the grading criteria:
- Understanding the Dataset (1 point): Demonstrates a basic understanding of the selected dataset by performing tasks like viewing a random sample and determining the total number of rows and columns.
- Handling Missing Values (1 point): Identifies and addresses missing values effectively, providing a clear strategy for handling them.
- Handling Duplicates (1 point): Detects and removes duplicate values from the dataset, ensuring data cleanliness.
- Nominal Attributes (0.5 points): Accurately identifies and prints the column names representing nominal attributes.
- Binary Attributes (0.25 points): Accurately identifies and prints the column names representing binary attributes.
- Ordinal Attributes (0.25 points): Accurately identifies and prints the column names representing ordinal attributes.
- Mean and Median (0.5 points): Correctly calculates the mean and median values for appropriate columns.
- Standard Deviation and Variance (0.5 points): Accurately computes the standard deviation and variance of data.
- Minimum and Maximum (0.5 points): Finds the minimum and maximum values.
- Sum and Product (0.5 points): Accurately calculates the sum and product of data.
- Line Plot and Scatter Plot (0.5 points): Creates both a line plot and a scatter plot with appropriate labels and titles.
- Bar Chart and Histogram (0.5 points): Successfully generates a bar chart and a histogram.
- Pie Chart and Heatmap (0.5 points): Produces a pie chart and a heatmap.
- 3D Plot (0.5 points): Creates a 3D plot if applicable to the dataset.
- Presentation Quality (1 point): The presentation is well-structured, informative, and effectively conveys the findings.
- Team Participation (1 point): All team members actively participate in the presentation.
- Work Distribution and Contribution (1 point): The distribution of tasks is transparent, and each student has a comprehensive understanding of their role within the project.
- Submission of Dataset and Notebook (1 point): Both the dataset and the notebook are correctly submitted through Moodle. The collaborative notebook in Google Colab is properly shared with team members.
- Innovative Approach (2 points): Demonstrates creativity and imaginative problem-solving throughout the project, going beyond the listed tasks.
Total Points (Max 14 points):
Grading will be based on the total points achieved according to the above criteria. The maximum achievable score for this project is 14 points.
- The team grade is 5 if the total points are greater than 12.
- The team grade is 4 if the total points are greater than 10.
- The team grade is 3 if the total points are greater than 9.
- The team grade is 2 if the total points are greater than 7.
- The team grade is 0 if the total is less than 7 points.
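As a quick check of the arithmetic, the criteria and thresholds above can be written as a small Python helper. This is a hypothetical sketch, not an official grading tool: the dictionary keys and the function name team_grade are made up, and because the rules above do not say what happens at exactly 7 points, the function returns None for that case instead of guessing.

```python
# Points per criterion, copied from the grading criteria above.
CRITERIA_POINTS = {
    "Understanding the Dataset": 1,
    "Handling Missing Values": 1,
    "Handling Duplicates": 1,
    "Nominal Attributes": 0.5,
    "Binary Attributes": 0.25,
    "Ordinal Attributes": 0.25,
    "Mean and Median": 0.5,
    "Standard Deviation and Variance": 0.5,
    "Minimum and Maximum": 0.5,
    "Sum and Product": 0.5,
    "Line Plot and Scatter Plot": 0.5,
    "Bar Chart and Histogram": 0.5,
    "Pie Chart and Heatmap": 0.5,
    "3D Plot": 0.5,
    "Presentation Quality": 1,
    "Team Participation": 1,
    "Work Distribution and Contribution": 1,
    "Submission of Dataset and Notebook": 1,
    "Innovative Approach": 2,
}

MAX_POINTS = sum(CRITERIA_POINTS.values())  # 14.0


def team_grade(total_points):
    """Map total points to a team grade using the stated thresholds.

    Assumption: exactly 7 points is not covered by the stated rules,
    so None is returned for that case.
    """
    if total_points > 12:
        return 5
    if total_points > 10:
        return 4
    if total_points > 9:
        return 3
    if total_points > 7:
        return 2
    if total_points < 7:
        return 0
    return None


print(MAX_POINTS)        # 14.0
print(team_grade(12.5))  # 5
print(team_grade(10))    # 3
print(team_grade(6.5))   # 0
```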
Please note that missing the demonstration will have a negative impact on your final grade.