- Name: Pradyun S. Gedam
- Email: [email protected]
- Github: pradyunsg
- University: VIT University, Vellore, India
- Course: Bachelor of Technology in Computer Science and Engineering
- Course Term: 2016/17 - 2019/20 (4 Year)
- Timezone: IST (GMT +5:30)
- GSoC Blog RSS Feed URL: https://pradyunsg.github.io/gsoc-2017/feed.xml
I was introduced to Python about five years ago, through "Core Python Programming" by Wesley J Chun. After the initial struggle with Python 2/3, the ball was set rolling and hasn't stopped since. I have fiddled around with Game Programming (PyGame), Computer Vision (OpenCV), Data Analytics (Pandas, SciPy, NumPy), transcompilers (Py2C) and more.
As a high school student, I did my internship at Enthought in 2013 and 2014. The two summers that I spent at Enthought were a great learning experience and I thoroughly enjoyed the environment there.
Other than Python, I have also used C, C++, Web Technologies (JavaScript, HTML, CSS) and preprocessors (Pug, TypeScript, LESS/SASS/SCSS), Java and Bash/Zsh for some other projects.
Curious to understand how pip works, I began looking into pip's source code. I started contributing to pip in May 2016. I am now fairly familiar with the design of pip and have a adequate understanding of how it works, due to the various contributions I have made to pip in the past year.
- Donald Stufft (@dstufft on GitHub)
- Justin Cappos (@JustinCappos on GitHub)
Communication with the mentors will be mostly done on issues and pull requests on pip's GitHub repository. If at any point in time, a real time discussion is to be done with the mentors, it would be conducted over IRC or Skype. Email can also be used if needed.
This project is regarding improving dependency resolution performed within pip by implementing a dependency resolver within it.
Currently, pip does not resolve dependencies correctly when there are conflicting requirements. The lack of dependency resolution has caused hard-to-diagnose bugs/failures due to the installation of incompatible packages. The lack of a dependency resolver is also a blocker for various other features - adding an upgrade-all functionality to pip and properly determining build-time dependencies for packages are two such features.
At the end of this project, pip will have the ability to:
- detect requirement conflicts
- resolve conflicting dependency requirements where possible
The implementation language will be Python. The code will maintain the compatibility requirements of pip - the same source code will support the multiple Python implementations and versions, including but not limited to, CPython 2.7, CPython 3.3+, PyPy 2 and PyPy 3.
New Tests for the functionality introduced will be added to pip's current test suite.
User documentation would not be a major part of this project. The only change needed is to mention, in certain sections, that pip can now resolve dependencies properly even when there are conflicts and update the same sections accordingly.
The project will be composed of the following stages:
- Refactor the dependency resolution logic of pip into a separate module.
- Implement dependency information caching in pip.
- Implement a backtracking dependency resolver to resolve the dependencies.
Every stage depends on the previous ones being completed. This step-wise approach would make incremental improvements so that there is a constant feedback on the work being done as well as making it easy to change course without losing existing work; if needed for some unforeseen reason.
There is a class in pip - RequirementSet
, which is currently a God class. It
is responsible for the following:
- Holding a list of things to be installed
- Downloading Files
- Dependency Resolution
- Unpacking downloaded files (preparation for installation)
- Uninstalling packages
- Installing packages
This is clearly a bad situation since this is most of the heavy lifting
involved in pip. These responsibilities need to be separated and moved out into
their independent modules/classes, to allow for simplification of
RequirementSet
while providing a clean platform for the remaining work.
This is the most tricky portion of this project, given the complexity of
RequirementSet
as it stands today.
There are two kinds of distributions that may be used to install a package - wheels (binary) and sdists (source). When installing a package, pip builds a wheel from an sdist and then proceeds to install the wheel. The difference between the two formats of distribution relevant to this project is - wheels store the information about dependencies within them statically; sdists do not.
Determining the dependencies of a wheel distribution is merely the matter of
fetching the information from a METADATA
file within the .whl
file. The
dependency information for an sdist, on the other hand, can only be determined
after running its setup.py
file on the target system. This means that
dependencies of an sdist depend on how its setup.py
behaves; which can change
due to variations on target systems or could even contain through random
choices.
Computation of an sdist's dependencies on the target system is a time-consuming task since it potentially involves fetching a package from PyPI and executing it's setup.py to get the dependency information. In order to improve performance, once an sdist's dependencies are computed, they would be stored in a cache so that during dependency resolution, the dependencies of an sdist are not computed every time they are needed. Further, pip caches wheels it downloads or builds. This means that any built or downloaded wheel's dependency information would be available statically, from the wheel that has been cached, without the need to fetch from the sdist dependency cache.
Like the wheel cache, sdist-dependency-cache will be a file system based cache. The sdist-dependency-cache would only be used if the corresponding sdist is being used.
Since sdist dependencies are non-deterministic, the cached dependency information is potentially incorrect - in certain corner cases such as using random choices in setup.py files. Such uses are not seen as important enough to cater to, compared the benefits of having a cache. Further, this is already the case with the information in the wheel cache.
SAT solver based resolution is not feasible for pip since a SAT solver needs the entire set of packages and their dependencies to compute a solution, which cannot be generated due to the aforementioned non-deterministic behaviour of setup.py file. Computing dependency information for all of PyPI on a target system for installing a package is simply not feasible.
The most reasonable approach is using a backtracking solver. Such a solver can be incrementally provided information about the dependencies of a package and would only need dependency information about packages in the dependency graph of the current system.
There is a need to keep a cache of visited packages during dependency resolution. A certain package-version combination may be reached via multiple paths and it is an inefficient use of computation time to re-compute that whether it is indeed going to satisfy the requirements or not. By storing information about which package-version combinations have been visited and do (or do not) satisfy the constraints, it is possible to speedup the resolution.
Consider the following example:
A-1.0 (depends: B)
A-2.0 (depends: B and E-1.0)
B-1.0 (depends: C and D)
C-1.0 (depends: D)
D-1.0
D-1.1 (depends: E-2.0)
E-1.0
E-2.0
If an installation of A is required, either A-2.0 or D-1.1 should not be installed because they have a conflicting requirement - E. While there are multiple possible solutions to this situation, the "most obvious" one us to use the D-1.0, instead of not installing A-2.0. Further, as multiple packages depend on D, the algorithm would "reach it" multiple times. By maintaining a cache for the visited packages, it is possible to achieve a speedup in such a scenario.
Pull requests would be made on a regular basis during the project to ensure that the feedback loop is quick. This also reduces the possibilities of conflicts due to unrelated changes in pip.
All the code will be tested within pip's existing testing infrastructure. It has everything needed to write tests related to all the changes to be made. Every PR made to pip as a part of this project will contain new tests or modifications to existing ones as needed.
Initially, some abstractions will be introduced to the pip codebase to improve
the reuse of certain common patterns within the codebase. This includes cleaner
temporary directory management through a TempDirectory
class.
RequirementSet.prepare_files
and RequirementSet._prepare_file
are
downloading, unpacking packages as well as doing what pip does as dependency
resolution. Taking these functions apart neatly is going to be a tricky task.
The following is a listing of the final modules that will be responsible for
the various tasks that are currently being performed by RequirementSet
:
pip.resolve
- Dependency Resolutionpip.download
- Downloading Files & Unpacking downloaded filespip.req.req_set
- Holding a list of things to be installed / uninstalledpip.operations.install
- Installing Packages (preparation for installation)pip.operations.uninstall
- Uninstalling Packages
To be able to proceed to the next step, only the dependency resolution related
code needs to be refactored into a separate module. Other portions of
RequirementSet
are not required to be refactored but the same will be tackled
as optional deliverables. In other words, only pip.resolve
needs to be
completed to be able to proceed to the next stage in this project. This is
needed since in Stage 3, the resolver would be written in pip.resolve
,
independent of the rest of the codebase.
A new module pip.cache
will be created. Within this module, all the caching
will be handled. Thus, the code for the current wheel cache would be moved.
The new code for a dependency cache for sdists would also be written here.
The new cache would hold all of an sdist's egg-info. The information will be
stored on the file-system in a sub directory structure much like that of the
wheel cache, in a directory structure based on hash of the sdist file holding
the egg-info at the end. This will be done in a class named EggInfoCache
.
EggInfoCache
cache will be used only if a corresponding wheel for an sdist is
not available. Installing an sdist results in the creation of a wheel which
contains the dependency information, which would be used over the information
available in the EggInfoCache
.
To be able to proceed to the next step, it is required that EggInfoCache
is
implemented.
The module pip.resolve
will be modified and a class named
BacktrackingResolver
will be added to it. The class does what you expect it
to do - it would resolve dependencies with recursive backtracking. As described
above, there will be some global state stored about the packages that have been
explored. Other than the maintenance of global state, in the form of the cache,
the rest of the algorithm will essentially follow the same structure as any
backtracking algorithm.
The project would be complete when the aforementioned resolver is implemented.
There is existing work directly related to dependency resolution in pip, done by multiple individuals.
-
Robert Collins (un-merged closed pull request on pypa/pip)
This has an incomplete backtracking dependency resolver and dependency caching.
-
Sebastien Awwad (branch on a separate fork)
This was used for the depresolve project, to investigate the state of Dependency Resolution in PyPI/pip ecosystem.
-
pip-compile
(separate project)This has a backtracking dependency resolver implemented to overcome pip's inablity to resolve dependencies.
Their work would be used for reference, where appropriate, during the course of the project. Further, there are many package managers which implement dependency resolution, which would also be looked into.
-
Community Bonding Period: 5 May - 29 May
- Clean up and finalize my existing pull requests.
- Read existing literature regarding dependency resolution.
- Inspect existing implementations of dependency resolvers.
GOAL: Finish existing work to proceed for the work coming up.
-
Week 1: 30 May - 5 June
- Introduce abstractions across pip's codebase to make refactoring
RequirementSet
easier. - Begin breaking down
RequirementSet.prepare_file
.
- Introduce abstractions across pip's codebase to make refactoring
-
Week 2: 6 June - 12 June
- Continue working on the refactor of
RequirementSet
.
- Continue working on the refactor of
-
Week 3: 13 June - 19 June
- Continue working on the refactor of
RequirementSet
. - Finish and polish
pip.resolve
.
GOAL:
pip.resolve
module will be ready, using the current resolution strategy. - Continue working on the refactor of
-
Week 4: 20 June - 26 June
- Finish and polish all work on the refactor of
RequirementSet
.
- Finish and polish all work on the refactor of
-
Week 5: 27 June - 3 July
- Move code for
WheelCache
into a newpip.cache
module. - Write tests for
pip.cache.EggInfoCache
, based onWheelCache
. - Begin implementation of
pip.cache.EggInfoCache
.
- Move code for
-
Week 6: 4 July - 10 July
- Finish and polish
pip.cache.EggInfoCache
.
GOAL: A cache for storing dependency information of sdists would be ready to add to pip.
- Finish and polish
-
Week 7: 10 July - 16 July
- Create a comprehensive set of tests for the dependency resolver.
(in
tests/unit/test_resolve.py
) - Begin implementation on the backtracking algorithm.
GOAL: A comprehensive test bed is ready for testing the dependency resolver.
- Create a comprehensive set of tests for the dependency resolver.
(in
-
Week 8: 17 July - 24 July
- Complete a rough implementation of the backtracking algorithm
GOAL: An implementation of a dependency resolver to begin running tests on and work on improving.
-
Week 9: 25 July - 31 July
- Fixing bugs in dependency resolver
-
Week 10: 1 August - 6 August
- Finish and polish work on dependency resolver
GOAL: A ready-to-merge PR adding a backtracking dependency resolver for pip.
-
Week 11: 6 August - 13 August
Buffer Week.
-
Week 12: 13 August - 21 August
Buffer Week. Finalization of project for evaluation.
If the deliverable is achieved ahead of schedule, the remaining time will be utilized to resolve open issues on pip's repository in the order of priority as determined under the guidance of the mentors.
I believe I will not be able to put in 40 hour/week for at most 3 weeks throughout the working period, due to the schedule of my university.
I will have semester-end examinations - from 10th May 2017 to 24th May 2017 - during the Community Bonding Period. My university will re-open for my third semester on 12 July 2017. I expect mid-semester examinations to be held in my University around 20th August 2017. During these times, I would not be able to put in full 40 hour weeks due to the academic workload.
I might take a 3-4 day break during this period, regarding which I would be informing my mentor around a week in advance.
I will be completely free from 30th May 2017 to 11 July 2017.
There seems to be some interest in being able to reuse the above dependency resolution algorithm in other packaging related tools, specifically from the buildout project. I intend to eventually move the dependency resolution code that would come out of this project into a separate library to allow for reuse by installer projects - pip, buildout and other tools.
(As on 12th March 2017)
Authored:
- #3785 - Prefering wheel-based installation over source-based installation (Open)
- #3786 - Make install command upgrade packages by default (Closed)
- #3787 - Check if pip broke the dependency graph and warn the user (Open)
- #3807 - Tests fail since change on PyPI (Closed)
- #3809 - Switch to TOML for configuration files (Open)
- #3871 - Provide a way to perform non-eager upgrades (Closed)
- #4198 - Travis CI - pypy broken dues to dependency change in pycrypto (Closed)
- #4282 - What's the release schedule? (Closed)
Participated:
- #59 - Add "upgrade" and "upgrade-all" commands (Open)
- #988 - Pip needs a dependency resolver (Open)
- #1056 - pip install -U does not remember whether a package was installed with --user (Open)
- #1511 - ssl certificate hostname mismatch errors presented badly (Open)
- #1668 - Default to --user (Open)
- #1736 - Create a command to make it easy to access the configuration file (Open)
- #1737 - Don't tell the user what they meant, just do what they meant (Open)
- #2313 - Automated the Creation and Upload of Release Artifacts (Open)
- #2732 - pip install hangs with interactive setup.py setups (Open)
- #3549 - pip -U pip fails (Open)
- #3580 - Update requests/urllib3 (Closed)
- #3610 - pip install from package from github, with github dependencies (Open)
- #3788 -
pip
version suggested is older than the version which is installed (Closed) - #3789 - Error installing Mayavi in Mac OS X (Closed)
- #3798 - On using python -m pip install -upgrade pip.. its throwing an error like the one below (Closed)
- #3811 - no matching distribution found for install (Closed)
- #3814 - pip could not find a version that satisfies the requirement oslo.context (Closed)
- #3876 - support git refs in @ syntax (Open)
- #4021 - Will you accept PRs with pep484 type hints? (Open)
- #4087 - pip list produces error (Closed)
- #4149 - Exception thrown when binary is already linked to /usr/local/bin (Open)
- #4160 - Pip does not seem to be handling deep requirements correctly (Open)
- #4162 - Let --find-links be context aware to support github, gitlab, etc. links (Open)
- #4170 - pip list |head raises BrokenPipeError (Open)
- #4182 - pip install should install packages in order to avoid ABI incompatibilities in compiled (Open)
- #4186 - IOError: [Errno 13] Permission denied: '/usr/local/bin/pip' (Open)
- #4206 - Where on Windows 10 is pip.conf or pip.ini located? (Closed)
- #4221 - Feature request: Check if user has permissions before downloading files (Closed)
- #4229 - "pip uninstall" is too noisy (Open)
Authored:
- #3806 - Change install command's default behaviour to upgrade packages by default (Closed, superseded by #3972)
- #3808 - Fix Tests (Merged)
- #3818 - Improve UX and tests of check command (Merged)
- #3972 - Add an upgrade-strategy option (Merged)
- #3974 - [minor] An aesthetic change to wheel command source (Merged)
- #4192 - Move out all the config code to a separate module (Merged)
- #4193 - Add the ability to autocorrect a user's command (Open)
- #4199 - Fix Tests for Travis CI (Merged)
- #4200 - Reduce the API exposed by the configuration class (Merged)
- #4232 - Update documentation to mention upgrade-strategy (Merged)
- #4233 - Nicer permissions error message (Open)
- #4240 - [WIP] Add a configuration command (Open)
Participated:
- #2716 - Issue #988: new resolver for pip (Closed)
- #2975 - Different mock dependencies based on Python version (Merged)
- #3744 - Add a "Upgrade all local installed packages" (Open)
- #3750 - Add a
pip check
command. (Merged) - #3751 - tox.ini: Add "cover" target (Open)
- #3794 - Use the canonicalize_name function for finding .dist-info (Merged)
- #4142 - Optionally load C dependencies based on platform (Open)
- #4144 - Install build dependencies as specified in PEP 518 (Open)
- #4150 - Clarify default for --upgrade-strategy is eager (Merged)
- #4194 - Allow passing a --no-progress-bar to the install script to surpress progress bar (Merged)
- #4201 - Register req_to_install for cleanup sooner (Merged)
- #4202 - Switch to 3.6.0 final as our latest 3.x release (Merged)
- #4211 - improve message when installing requirements file (#4127) (Merged)
- #4241 - Python 3.6 is supported (Merged)
-
Tracking issue for adding a proper dependency resolver in pip. Contains links to various useful resources.
-
Prior work by Robert Collins for adding a proper dependency resolver in pip.
-
A writeup by Sebastian Awwad on the current state of dependency resolution in pip and PyPI in general.
-
For guidance on how to write the application and what information is needed.
-
Stork: Secure Package Management for VM Environments
A Paper by Justin Cappos about Stork, used for reference regarding Backtracking Resolution
Can't wait :D