Sean Smith [email protected] sean-smith
Ann Ming Samborski [email protected] asamborski
We are going to scrape github.com for publicly accessible GitHub repositories and collect the following information:
- stars
- forks
- followers
- owner_id
- org_name
- primary_language
- secondary_language
- lines_of_code
- num_contributors
- first_commit
- num_pull_requests
- speaking_language
- num_commits
- commit_times
- loc_per_commit
- type_of_computer (.DS_Store ftw)
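The fields above could be gathered into one record per repository. The following is only a minimal sketch: `RepoRecord` and `record_from_payload` are hypothetical names, and it assumes the input is a dictionary shaped like the repository JSON returned by the GitHub REST API (keys such as `stargazers_count`, `forks_count`, `language`, and `owner`) rather than scraped HTML. Fields that need extra requests or a local clone are left empty here.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class RepoRecord:
    """One row per repository; names follow the field list above."""
    stars: int
    forks: int
    owner_id: int
    org_name: Optional[str] = None
    primary_language: Optional[str] = None
    # The fields below are not filled by record_from_payload; they would
    # need further requests or a clone of the repository.
    followers: Optional[int] = None
    secondary_language: Optional[str] = None
    lines_of_code: Optional[int] = None
    num_contributors: Optional[int] = None
    first_commit: Optional[str] = None
    num_pull_requests: Optional[int] = None
    speaking_language: Optional[str] = None
    num_commits: Optional[int] = None
    commit_times: List[str] = field(default_factory=list)
    loc_per_commit: Optional[float] = None
    type_of_computer: Optional[str] = None


def record_from_payload(payload: Dict[str, Any]) -> RepoRecord:
    """Map a repository payload (assumed REST-API-shaped) to a RepoRecord."""
    owner = payload.get("owner") or {}
    # Assumption: org_name is only set when the owner is an organization.
    is_org = owner.get("type") == "Organization"
    return RepoRecord(
        stars=payload.get("stargazers_count", 0),
        forks=payload.get("forks_count", 0),
        owner_id=owner.get("id", -1),
        org_name=owner.get("login") if is_org else None,
        primary_language=payload.get("language"),
    )
```

Keeping the unfilled fields as `None` (or an empty list) makes it easy to tell "not collected yet" apart from a real zero when the data is analyzed later.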
With this data, we want to answer the following questions:
- What are the most popular coding languages? How has this changed over time?
- Given commit times, can we predict number of stars?
- Does lines of code correlate with number of stars?
- Does the number of contributors correlate with the number of stars?
- What are the most popular repositories to contribute to?
- How do contributions vary by country? By time zone?
- Does number of commits correlate with lines of code?
- Which repository has the most commits?
- Do organizations have fewer or more contributors? Fewer or more lines of code?
- Do Chinese-speaking people use GitHub? Do they use it in China? (See the Great Cannon for context.)
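Several of the questions above ask whether one quantity correlates with another (lines of code vs. stars, contributors vs. stars, commits vs. lines of code). One way to start, sketched here as an assumption rather than a settled method, is the Pearson correlation coefficient computed over paired values. The function name `pearson` is our own.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of co-deviation and squared deviation from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)
```

For example, `pearson(lines_of_code, stars)` would take one list of `lines_of_code` values and one list of `stars` values in the same repository order. Star counts tend to be heavily skewed, so it may be worth also trying the same calculation on log-transformed or rank-transformed values.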