This course requires knowledge of Python and SQL (the requirement is listed in the course description). If you do not know Python, you will not do well and the course will be that much harder.
For the Fall 2023 Semester, there are three sections of Data Science being offered. There are different Primary/Secondary Instructors and Chat Platforms for each Primary Instructor:
Section | Primary | Secondary | Chat Platform |
---|---|---|---|
81 | Butcher | Stewart | Slack |
83 | TBD | TBD | TBD |
- If you are in Section 83, you should still install everything as described below. At this time, we are unsure of who will be the primary instructor.
- You are very strongly encouraged to use a computer upon which you have administrator/superuser privileges. I cannot help you with problems associated with the installation of software and libraries.
- You are encouraged to use a Unix-style operating system (macOS or a Linux flavor), either directly or in a virtual environment (Docker or VirtualBox). This is not required to excel in the class, but it is helpful: you should be comfortable on more than one OS, and the examples of command-line utilities will be Unix-style. Many platforms are built on Linux and you should learn to use it.
- Ideally, you should have your environment up and running before the semester starts but no later than the 2nd day of class (that first Friday). There is a test assignment due that day.
- Install the latest version of Anaconda for Python for your operating system.
- Set `libmamba` to be the default solver: `conda config --set solver libmamba`. Note: If this fails, update Conda.
- Create a directory/folder for data science and move into it.
- Download environment.yml into your directory (or just copy the Raw content, paste it into a file named `environment.yml`, and save it).
- Execute `conda env create -f environment.yml`.
- You now have all the libraries needed for the course (as of now).
- Execute `conda activate en685648` (whenever you work in that environment for any reason, activate it!).
- Set up Jupyter notebook to use this environment: `python -m ipykernel install --user --name en685648 --display-name "Python (en685648)"`
For now, the only thing in this directory will be the `environment.yml` file.
NB: you must install the specified version of `python-duckdb`. Database formats between versions are not compatible.
If you have an error setting the solver, you have an older version of Conda. Please update.
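Once the environment is created and activated, a quick check like the one below can confirm that the libraries are visible to Python. This is a minimal sketch, not part of the official setup: the package list is an assumption based only on the libraries this README names (the `python-duckdb` package is imported as `duckdb`), and the authoritative list is whatever `environment.yml` specifies. The helper name `missing_packages` is hypothetical.

```python
import importlib.util

# Assumed package list: only libraries this README mentions by name.
# The authoritative list is whatever environment.yml specifies.
EXPECTED = ("duckdb", "pandas", "matplotlib", "tabulate")


def missing_packages(names=EXPECTED):
    """Return the names that cannot be found by the current interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Not importable:", ", ".join(missing))
else:
    print("All expected packages are importable.")
```

If anything is reported as not importable, the most likely cause is that the `en685648` environment is not the active one.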
Once the class has started, you will be able to download the Jupyter notebooks for each module. In the interim, you may want to get a feel for the environment in which you'll be working. Use the following commands:
- `conda activate en685648` - this will activate the environment. (Use `conda env list` to see your installed environments).
- `jupyter notebook` - this will start the Jupyter notebook environment with the current directory as the root.
- When you create a new Jupyter notebook, you can select "Python (en685648)" as the kernel.
Note - `jupyter notebook` will eventually be "sunsetted" in favor of `jupyter lab`. If you want to use `jupyter lab`, that's fine.
When you're done, you can invoke `conda deactivate`.
NB: You must use this Anaconda environment and the "en685648" kernel for this class. Failure to do so has consequences that are your responsibility. These consequences may include getting a zero on assignments.
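To see which environment a notebook is actually running in, you can inspect the interpreter's prefix from a code cell. The sketch below assumes the environment was created by name with the default Conda layout, where the environment's directory is named after the environment; the helper name `env_name` is hypothetical.

```python
import os
import sys

EXPECTED_ENV = "en685648"  # environment name used throughout this README


def env_name(prefix=sys.prefix):
    """Return the last path component of an interpreter prefix.

    Assumes the default Conda layout, where a named environment lives in
    a directory with the same name (for example .../envs/en685648).
    """
    return os.path.basename(os.path.normpath(prefix))


if env_name() == EXPECTED_ENV:
    print(f"Running inside the {EXPECTED_ENV} environment.")
else:
    print(f"Running from {sys.prefix}; expected an environment named {EXPECTED_ENV}.")
```

If the second message appears, switch the notebook's kernel to "Python (en685648)" and run the cell again.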
Do not use a regular code editor for your assignments. Instead, use something that can correctly edit and display Jupyter notebooks (e.g., Jupyter Notebook or VS Code).
The students taking this course come from a variety of backgrounds. While the course itself covers a lot of topics, you will have an easier time of it during the semester if you do a bit of preparation on your own. These links will get you started but feel free to explore other resources using Google.
- Python 3
- Jupyter notebooks (YouTube)
- Jupyter Notebooks for Beginners (blog)
- Advanced Jupyter Notebooks (blog)
- Markdown
- Pandas
- Matplotlib
Codewars is a great way to increase fluency with Python. Look for idiomatic Python solutions.
This course is not primarily a coding course and Data Science is not primarily about running code. Data Science is about analysis and communication. Style, usage, and organization matter. You must be equally adept at using the Markdown and Code cells in the Jupyter notebook. If nothing else, learn to use Markdown effectively. Additionally, `tabulate` has been included to help with the creation of tables.
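As an illustration of how `tabulate` can help, the sketch below turns a list of rows into a Markdown table string. It assumes `tabulate` is installed in the active environment; the helper name `markdown_table` is hypothetical, and the import sits inside the function only so that the module still loads where `tabulate` is absent.

```python
def markdown_table(rows, headers):
    """Format rows as a GitHub-flavored Markdown table.

    Imported here rather than at module level so a missing tabulate
    raises ImportError only when the function is called.
    """
    from tabulate import tabulate

    return tabulate(rows, headers=headers, tablefmt="github")
```

In a notebook, one way to render the result is `display(Markdown(markdown_table(rows, headers)))`, using `display` and `Markdown` from `IPython.display`. Called with `print`, the same string shows the raw Markdown source.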