by Jeff Smith
- introducing the components of machine learning systems
- understanding the reactive systems design paradigm
- the reactive approach to building machine learning systems
summary
- even simple machine learning systems can fail
- machine learning should be viewed as an application, not as a technique
- a machine learning system is composed of five components, or phases:
- the data-collection component ingests data from the outside world into the machine learning system
- the data-transformation component transforms raw data into useful derived representations of the data: features and concepts
- the model-learning component learns models from the features and concepts
- the model-publishing component makes a model available to make predictions
- the model-serving component connects models to requests for predictions
- the reactive systems design paradigm is a coherent approach to building better systems
- reactive systems are responsive, resilient, elastic, and message-driven
- reactive systems use the strategies of replication, containment, and supervision as concrete approaches to maintaining the reactive traits
- reactive machine learning is an extension of the reactive systems approach that addresses the specific challenges of building machine learning systems
- data in a machine learning system is effectively infinite. laziness, or delay of execution, is a way of conceiving of data as infinite flows rather than finite batches. pure functions without side effects help manage infinite data by ensuring that functions behave predictably, regardless of context (see the sketch after this summary)
- uncertainty is intrinsic and pervasive in the data of a machine learning system. writing all data in the form of immutable facts makes it easier to reason about views of uncertain data at points in time. different views of uncertain data can be thought of as possible worlds that can be queried across.
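To make the laziness and pure-function points above concrete, here is a minimal sketch assuming Scala 2.13 (for LazyList); the SensorReading type and the anomaly threshold are invented for illustration and are not from the book's example system.

```scala
// treat data as an effectively infinite, lazily evaluated flow
final case class SensorReading(sensorId: String, value: Double, timestamp: Long)

object LazyFlows {
  // an effectively infinite stream of readings; nothing is computed until demanded
  val readings: LazyList[SensorReading] =
    LazyList.from(0).map { i =>
      SensorReading(s"sensor-$i", math.random(), System.currentTimeMillis())
    }

  // a pure function: same input, same output, no side effects
  def isAnomalous(reading: SensorReading): Boolean = reading.value > 0.99

  // only the first 100 readings are ever materialized
  val firstAnomalies: List[SensorReading] =
    readings.take(100).filter(isAnomalous).toList
}
```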
- managing uncertainty using Scala
- implementing supervision and fault tolerance with Akka
- using Spark and MLlib as frameworks for distributed machine learning pipelines
summary
- Scala gives you constructs to help you reason about uncertainty:
- options abstract over the uncertainty of something being present or not
- futures abstract over the uncertainty of actions that take time
- futures give you the ability to implement timeouts, which help ensure responsiveness by bounding response times (sketched after this summary)
- with Akka, you can build protections against failure into the structure of your application using the power of the actor model:
- communication via message passing helps you keep system components contained
- supervisory hierarchies can help ensure resilience of components
- one of the best ways to use the power of the actor model is through libraries that use it behind the scenes, rather than defining actor systems directly in your own code
- Spark gives you reasonable components to build data-processing pipelines:
- Spark pipelines are constructed using pure functions and immutable transformations (see the pipeline sketch after this summary)
- Spark uses laziness to ensure efficient, reliable execution
- MLlib provides useful tools for building and evaluating models with a minimum of code
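A minimal sketch of the Option, Future, and timeout points above; lookupUser and scoreUser are hypothetical stand-ins for real operations, and the 100-millisecond bound is arbitrary.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.Try

object UncertaintyExamples {
  // Option abstracts over the uncertainty of something being present or not
  def lookupUser(id: Long): Option[String] =
    if (id == 42L) Some("user-42") else None

  // Future abstracts over the uncertainty of actions that take time
  def scoreUser(name: String): Future[Double] =
    Future { name.length * 0.1 } // placeholder for an expensive computation

  // bounding response time keeps the caller responsive; Await is used here
  // only to keep the example short
  val boundedScore: Try[Double] =
    Try(Await.result(scoreUser("user-42"), 100.milliseconds))
}
```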
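The Spark and MLlib points can be sketched just as briefly; the tiny dataset, column names, and local[*] master below are illustrative and assume the DataFrame-based spark.ml API.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object PipelineSketch extends App {
  val spark = SparkSession.builder()
    .appName("pipeline-sketch")
    .master("local[*]")
    .getOrCreate()

  // a tiny, made-up training set: (label, feature vector)
  val raw = spark.createDataFrame(Seq(
    (1.0, Vectors.dense(0.2, 0.7)),
    (0.0, Vectors.dense(0.9, 0.1)),
    (1.0, Vectors.dense(0.3, 0.8))
  )).toDF("label", "features")

  // each transformation returns a new, immutable DataFrame; thanks to laziness,
  // nothing is executed until an action (here, fitting the model) requires it
  val training = raw.filter("label >= 0.0")

  // MLlib learns a model with very little code
  val model = new LogisticRegression().setMaxIter(10).fit(training)
  println(model.coefficients)

  spark.stop()
}
```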
- collecting inherently uncertain data
- handling data collection at scale
- querying aggregates of uncertain data
- avoiding updating data after it's been written to a database
summary
- facts are immutable records of something that happened and the time that it happened:
- transforming facts during data collection results in information loss and should never be done
- facts should encode any uncertainty about that information
- data collection can't work at scale with shared mutable state and locks
- fact databases solve the problems of collecting data at scale:
- facts can always be written without blocking or using locks
- facts can be written in any order
- futures-based programming handles the possibility that operations can take time and even fail
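As a minimal sketch of the fact-based approach summarized above: the LocationFact record and the writeFact operation are invented for illustration, and a real implementation would append to a fact database rather than return a stubbed result.

```scala
import java.util.UUID
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// an immutable record of something that happened, when it happened,
// and how certain the system is about it
final case class LocationFact(
  factId: UUID,
  subjectId: Long,
  latitude: Double,
  longitude: Double,
  confidence: Double, // uncertainty is encoded in the fact itself
  timestamp: Long
)

object FactCollection {
  // facts are only ever appended, never updated, so writes need no locks;
  // the Future acknowledges an operation that may take time or even fail
  def writeFact(fact: LocationFact): Future[Boolean] =
    Future {
      // a real implementation would append to a fact database; this stub
      // just pretends the write succeeded
      true
    }
}
```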
- extracting features from raw data
- transforming features to make them more useful
- selecting among the features you've created
- how to organize feature-generation code
summary
- like chicks cracking through eggs and entering the world of real birds, features are our entry points into the process of building intelligence into a machine learning system. although they haven't always gotten the attention they deserve, features are a large and crucial part of a machine learning system
- it's easy to begin writing feature-generation functionality. but that doesn't mean your feature-generation pipeline should be implemented with anything less than the same rigor you'd apply to your real-time predictive application. feature-generation pipelines can and should be awesome applications that live up to all the reactive traits
- feature extraction is the process of producing semantically meaningful, derived representations of raw data
- features can be transformed in various ways to make them easier to learn from
- you can select among all the features you have to make the model-learning process easier and more successful
- feature extractors and transformers should be well structured for composition and reuse
- feature-generation pipelines should be assembled into a series of immutable transformations (pure functions) that can easily be serialized and reused, as sketched after this summary
- features that rely on external resources should be built with resilience in mind
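A small sketch of feature extraction and transformation as composable pure functions; the Tweet type and the particular features are made up for illustration.

```scala
final case class Tweet(text: String, followers: Int)

object FeaturePipeline {
  // extraction: derive a semantically meaningful value from raw data
  val extractWordCount: Tweet => Int =
    tweet => tweet.text.split("\\s+").count(_.nonEmpty)

  // transformation: make a feature easier to learn from (here, simple binarization)
  val binarizeAboveTen: Int => Double =
    count => if (count > 10) 1.0 else 0.0

  // pure functions compose into a reusable pipeline stage
  val wordCountFeature: Tweet => Double = extractWordCount andThen binarizeAboveTen
}
```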
- implementing model-learning algorithms
- using Spark's model-learning capabilities
- handling third-party code
summary
- a model is a program that can make predictions about the future
- model learning consists of processing features and returning a model
- model learning must be implemented with an expectation of failure modes (for example, timeouts)
- containment, using the facade pattern, is a crucial technique for integrating third-party code
- contained code wrapped in a facade can be integrated with the rest of your data pipeline using standard reactive-programming techniques
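A hedged sketch of containment via a facade: ThirdPartyLearner stands in for an arbitrary external library whose training call may be slow or throw, and the five-second timeout is an arbitrary choice.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.Try

object ThirdPartyLearner {
  // imagine this is external code you don't control: slow, and it may throw
  def train(features: Array[Array[Double]], labels: Array[Double]): String = {
    Thread.sleep(50)
    "opaque-model-handle"
  }
}

object LearningFacade {
  // the facade exposes a narrow, asynchronous interface to the rest of the system
  def learn(features: Array[Array[Double]], labels: Array[Double]): Future[String] =
    Future(ThirdPartyLearner.train(features, labels))

  // callers bound how long they are willing to wait for a model
  def learnOrGiveUp(features: Array[Array[Double]], labels: Array[Double]): Try[String] =
    Try(Await.result(learn(features, labels), 5.seconds))
}
```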
- calculating model metrics
- training versus testing data
- recording model metrics as messages
summary
- models can be evaluated over hold-out data to assess their performance
- statistics like accuracy, precision, recall, f-measure, and area under the curve can quantify model performance
- failing to separate data used in training from testing can result in models that lack predictive capability
- recording the provenance of models allows you to pass messages to other systems about their performance
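A minimal sketch of computing these statistics over hold-out predictions; the (prediction, label) pairs are invented purely for illustration, and the formulas are the standard definitions.

```scala
object ModelMetrics {
  // pairs of (predicted label, true label) from data the model never trained on
  val holdOut: Seq[(Double, Double)] =
    Seq((1.0, 1.0), (1.0, 0.0), (0.0, 0.0), (0.0, 1.0), (1.0, 1.0))

  val truePositives  = holdOut.count { case (p, l) => p == 1.0 && l == 1.0 }
  val falsePositives = holdOut.count { case (p, l) => p == 1.0 && l == 0.0 }
  val falseNegatives = holdOut.count { case (p, l) => p == 0.0 && l == 1.0 }
  val trueNegatives  = holdOut.count { case (p, l) => p == 0.0 && l == 0.0 }

  val accuracy  = (truePositives + trueNegatives).toDouble / holdOut.size
  val precision = truePositives.toDouble / (truePositives + falsePositives)
  val recall    = truePositives.toDouble / (truePositives + falseNegatives)
  val fMeasure  = 2 * precision * recall / (precision + recall)
}
```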
- persisting learned models
- building model microservices using Akka HTTP
- containerization of services using Docker
summary
- models, and even entire training pipelines, can be persisted for later use
- microservices are simple services that have very narrow responsibilities
- models, as pure functions, can be encapsulated into microservices (a service sketch follows this summary)
- you can contain failure of a predictive service by communicating only via message passing
- you can use an actor hierarchy to ensure resilience within a service
- applications can be containerized using tools like Docker
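A hedged sketch of a model microservice, assuming Akka HTTP 10.2+ and a model that is just a pure function; the route, port, and decision threshold are illustrative.

```scala
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.server.Directives._

object PredictiveService extends App {
  implicit val system: ActorSystem = ActorSystem("predictive-service")

  // a persisted model could be loaded here; this stand-in is a pure function
  def model(feature: Double): Double = if (feature > 0.5) 1.0 else 0.0

  // a narrow responsibility: answer prediction requests, nothing else
  val route =
    path("predict") {
      get {
        parameters("feature".as[Double]) { feature =>
          complete(model(feature).toString)
        }
      }
    }

  Http().newServerAt("localhost", 8080).bind(route)
}
```

The resulting application can then be packaged and containerized with a tool like Docker.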
- using models to respond to user requests
- managing containerized services
- designing for failure
summary
- tasks are useful lazy primitives for structuring expensive computations
- structuring models as services makes elastic architectures easier to build
- failing model services can be handled by a model supervisor
- the principles of containment and supervision can be applied at several levels of systems design to ensure reactive properties
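A hedged sketch of supervision using Akka's classic actor API; the ModelService and ModelSupervisor names, the Score message, and the restart policy are all illustrative choices.

```scala
import akka.actor.{Actor, ActorSystem, OneForOneStrategy, Props}
import akka.actor.SupervisorStrategy.Restart
import scala.concurrent.duration._

final case class Score(features: Vector[Double])

// wraps calls to a model; a broken model or bad request makes it throw
class ModelService extends Actor {
  def receive = {
    case Score(features) => sender() ! features.sum // placeholder scoring
  }
}

// the supervisor contains failure: a crashing child is restarted, not the system
class ModelSupervisor extends Actor {
  override val supervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = 3, withinTimeRange = 1.minute) {
      case _: Exception => Restart
    }

  private val modelService = context.actorOf(Props[ModelService], "model-service")

  def receive = {
    case request => modelService.forward(request)
  }
}

object Serving extends App {
  val system = ActorSystem("serving")
  system.actorOf(Props[ModelSupervisor], "model-supervisor")
}
```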
- building Scala code using sbt
- evaluating applications for deployment
- strategies for deployments
summary
- Scala applications can be packaged into archives called JARs using sbt
- build pipelines can be used to execute evaluations of machine learning functionality, like models
- the decision to deploy a model can be made based on comparisons with meaningful values, like the performance of a random model, previous models' performance, or some known parameter
- deploying applications continuously can allow a team to deliver new functionality quickly
- using metrics to determine whether new applications are deployable can make a deployment system fully autonomous
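A small sketch of a metric-based deployment gate along these lines; the f-measure comparison and the 0.5 random baseline are illustrative choices, not a prescribed policy.

```scala
object DeploymentGate {
  final case class ModelReport(modelId: String, fMeasure: Double)

  // deploy only if the candidate beats both a random baseline and the current model
  def shouldDeploy(candidate: ModelReport,
                   current: Option[ModelReport],
                   randomBaseline: Double = 0.5): Boolean = {
    val beatsBaseline = candidate.fMeasure > randomBaseline
    val beatsCurrent  = current.forall(candidate.fMeasure > _.fMeasure)
    beatsBaseline && beatsCurrent
  }
}
```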
- understanding artificial intelligence
- working with agents
- evolving the complexity of agents
summary
- an agent is a software application that can act on its own
- a reflex agent acts according to statically defined behavior
- an intelligent agent acts according to knowledge that it has
- a learning agent is capable of learning: it can improve its performance on a task given exposure to more data
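A minimal sketch of this progression of agents as plain Scala traits; the percepts, actions, and knowledge representation are toy choices for illustration.

```scala
trait Agent {
  // an agent can act on its own, given a percept of the world
  def act(percept: String): String
}

// a reflex agent acts according to statically defined behavior
class ReflexAgent extends Agent {
  def act(percept: String): String =
    if (percept.contains("obstacle")) "turn" else "go forward"
}

// an intelligent agent acts according to knowledge that it has
class IntelligentAgent(knowledge: Map[String, String]) extends Agent {
  def act(percept: String): String =
    knowledge.getOrElse(percept, "do nothing")
}

// a learning agent improves with exposure to more data
class LearningAgent(initialKnowledge: Map[String, String]) extends Agent {
  private var knowledge = initialKnowledge
  def act(percept: String): String = knowledge.getOrElse(percept, "explore")
  def learn(percept: String, betterAction: String): Unit =
    knowledge += (percept -> betterAction)
}
```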