@simon-mo
Last active July 31, 2019 18:12
Release note draft

Ray 0.7.3 Release Note

Highlights

  • The RLlib ModelV2 API is ready to use (see the sketch below). It improves support for Keras and RNN models and allows object-oriented reuse of variables. The ModelV1 API is deprecated, but no migration is needed.
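
A minimal sketch of a custom ModelV2 model (this follows the custom-model pattern in the RLlib docs around this release; the layer sizes and the "my_modelv2" registration name are illustrative):

import tensorflow as tf
from ray.rllib.models import ModelCatalog
from ray.rllib.models.tf.tf_modelv2 import TFModelV2

class MyModelV2(TFModelV2):
    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        super(MyModelV2, self).__init__(obs_space, action_space, num_outputs,
                                        model_config, name)
        # Variables are created once here and reused across calls
        # (the object-oriented reuse mentioned above).
        inputs = tf.keras.layers.Input(shape=obs_space.shape)
        hidden = tf.keras.layers.Dense(64, activation="relu")(inputs)
        logits = tf.keras.layers.Dense(num_outputs)(hidden)
        value = tf.keras.layers.Dense(1)(hidden)
        self.base_model = tf.keras.Model(inputs, [logits, value])
        self.register_variables(self.base_model.variables)

    def forward(self, input_dict, state, seq_lens):
        logits, self._value_out = self.base_model(input_dict["obs"])
        return logits, state

    def value_function(self):
        return tf.reshape(self._value_out, [-1])

ModelCatalog.register_custom_model("my_modelv2", MyModelV2)

The registered model is then selected with config={"model": {"custom_model": "my_modelv2"}}.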

  • ray.experimental.sgd.pytorch.PyTorchTrainer is ready for early adopters. Check out the doc here, and we welcome your feedback!

from ray.experimental.sgd.pytorch import PyTorchTrainer

model_creator = lambda config: YourPyTorchModel()
# The data creator must return a (train_set, validation_set) pair.
data_creator = lambda config: (YourTrainingSet(), YourValidationSet())

trainer = PyTorchTrainer(
    model_creator,
    data_creator,
    optimizer_creator=utils.sgd_mse_optimizer,  # provided with the SGD utilities
    config={"lr": 1e-4},
    num_replicas=2,
    resources_per_replica=Resources(num_gpus=1),
    batch_size=16,
    backend="auto")

for i in range(NUM_EPOCHS):
    trainer.train()

  • A jobs table was added to the state API. You can query all the clients that have called ray.init to connect to the current cluster. #5076
>>> ray.state.jobs()
[{'JobID': '02000000',
  'NodeManagerAddress': '10.99.88.77',
  'DriverPid': 74949,
  'StartTime': 1564168784,
  'StopTime': 1564168798},
 {'JobID': '01000000',
  'NodeManagerAddress': '10.99.88.77',
  'DriverPid': 74871,
  'StartTime': 1564168742}]

Core

  • Improved memory store handling. #5143, #5216, #4893
  • Improved workflow:
    • The local_mode debugging tool now behaves more consistently with Ray's default mode (see the sketch after this list). #5060
    • Improved KeyboardInterrupt exception handling; the resulting stack trace was reduced from 115 lines to 22 lines. #5237
  • Ray core:
    • Experimental direct actor calls. #5140, #5184
    • Raylet communication now uses gRPC. #5120, #5054, #5121
    • Improvements to the core worker, the module shared between Python and Java. #5079, #5034, #5062
    • Refactored the GCS (Global Control Store). #5058, #5050
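
A minimal sketch of the local_mode workflow referenced above (the remote function is illustrative):

import ray

# local_mode runs tasks serially in the driver process, so pdb and
# print-based debugging behave like ordinary Python.
ray.init(local_mode=True)

@ray.remote
def square(x):
    return x * x

assert ray.get(square.remote(4)) == 16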

RLlib

  • Finished porting all major RLlib algorithms to the trainer builder pattern. #5277, #5258, #5249
  • learner_queue_timeout can be configured for the async sample optimizer. #5270
  • reproducible_seed can be used for reproducible experiments. #5197
  • Added entropy coefficient decay to IMPALA, APPO, and PPO (see the sketch after this list). #5043
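
A minimal sketch of how these options could be passed through a trainer config (learner_queue_timeout is named above; the entropy_coeff_schedule key and all values are assumptions, and CartPole-v0 is a placeholder environment):

from ray import tune

tune.run(
    "IMPALA",
    config={
        "env": "CartPole-v0",
        # Seconds to wait on the learner queue before timing out (#5270).
        "learner_queue_timeout": 300,
        # Decay the entropy bonus from 0.01 to 0 over 1M timesteps (#5043);
        # the exact key name here is an assumption.
        "entropy_coeff_schedule": [(0, 0.01), (1000000, 0.0)],
    },
)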

Tune

  • Support nested dictionaries in CSVLogger, so your Trainable._train function can return arbitrarily nested dictionaries (see the sketch after this list). #5295
  • Added system performance tracking for GPU, RAM, VRAM, and CPU usage statistics. #4924
  • Faster node recovery. #5053
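
A minimal sketch of a Trainable returning a nested result dictionary (the metric names are illustrative):

from ray import tune

class MyTrainable(tune.Trainable):
    def _setup(self, config):
        self.iter = 0

    def _train(self):
        self.iter += 1
        # Nested dictionaries are flattened by CSVLogger (#5295).
        return {
            "loss": 1.0 / self.iter,
            "stats": {"grad_norm": 0.1, "timing": {"forward_s": 0.02}},
        }

tune.run(MyTrainable, stop={"training_iteration": 3})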

Autoscaler

  • Added a request_cores function for manual autoscaling. You can now manually request resources from the autoscaler. #4754
  • Local cluster:
    • More readable example YAML with comments. #5290
    • Multiple cluster names are supported. #4864
  • Improved logging with the AWS NodeProvider. create_instance calls are now logged. #4998

Other Libraries

  • SGD:
    • Added a training example. #5292
    • Deprecated the old distributed SGD implementation. #5160
  • Kubernetes: A Ray namespace was added for Kubernetes. #4111
  • Dev experience: Added a linting pre-push hook. #5154

Thanks

We thank the following contributors for their amazing contributions:

@joneswong, @1beb, @richardliaw, @pcmoritz, @raulchen, @stephanie-wang, @jiangzihao2009, @LorenzoCevolani, @kfstorm, @pschafhalter, @micafan, @simon-mo, @vipulharsh, @haje01, @ls-daniel, @hartikainen, @stefanpantic, @edoakes, @llan-ml, @alex-petrenko, @ztangent, @gravitywp, @MQQ, @dulex123, @morgangiraud, @antoine-galataud, @robertnishihara, @qxcv, @vakker, @jovany-wang, @zhijunfu, @ericl

@simon-mo
Author

Please leave comments here! Note that we are still blocked on ray-project/ray#5310

@robertnishihara

  • "ray.init" -> ray.init
  • use Python syntax highlighting

@ericl

ericl commented Jul 30, 2019

  • ModelV2 API for RLlib, which improves support for Keras and RNN models, as well as allowing object-oriented reuse of variables
  • Finished port of all major RLlib algorithms to builder pattern

@simon-mo
Author

@ericl would you say ModelV2 API is still experimental at this point?

@ericl

ericl commented Jul 30, 2019

No, it's intended for production use. ModelV1 is deprecated at this point.

@richardliaw

richardliaw commented Jul 31, 2019

  • Syncing behavior between head and workers can now be customized (sync_to_driver). Syncing behavior (upload_dir) between cluster and cloud is now separately customizable (sync_to_cloud). This changes the structure of the uploaded directory - now local_dir is synced with upload_dir. #4450
  • BREAKING: ExperimentAnalysis is now returned by default from tune.run. To obtain a list of trials, use analysis.trials. (#5115)
  • The Analysis object will now return all trials in a folder; ExperimentAnalysis is a subclass that returns all trials of an experiment. (#5115)
  • Bug fix: Tune CLI sorting is fixed
  • Added the missing keep_checkpoints_num argument to Tune (#5117)
  • Trials on failed nodes will be prioritized in processing (#5053)
  • Trial Checkpointing is now more flexible (#4728)
  • Add system performance tracking for gpu, ram, vram, cpu usage statistics - toggle with tune.run(log_sys_usage=True) (#4924)
  • Experiment checkpointing frequency is now less frequent and can be controlled with tune.run(global_checkpoint_period=...). (#4859)
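
A minimal sketch of the new return value of tune.run described above (the function trainable is illustrative, and the dataframe() accessor is assumed to be available on ExperimentAnalysis):

from ray import tune

def train_fn(config, reporter):
    for i in range(3):
        reporter(timesteps_total=i, mean_accuracy=i / 3.0)

analysis = tune.run(train_fn)   # now returns an ExperimentAnalysis (#5115)
trials = analysis.trials        # list of Trial objects
df = analysis.dataframe()       # per-trial results; availability assumed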
