SFU's Canvas LMS infrastructure and upgrade

πŸŽ“ Canvas at SFU

Simon Fraser University is a mid-sized comprehensive university with three campuses in the Greater Vancouver area of British Columbia, Canada. We are a trimester school, with a Fall, Spring and Summer term. We have approximately 25,000 undergraduate FTEs.

SFU chose Canvas as its new LMS during a selection process in 2011/2012. We went into production in 2012. As of this writing, our enrollment counts in Canvas are:

  • Students: 25250
  • Teachers: 1070
  • TAs: 865

(plus a handful of observer, designer and student viewer enrollments)

It's worth noting that during the pre-production phase of our rollout, our Canvas usage exceeded the peak usage of its predecessor system, WebCT. It's fair to say that SFU ❀️s Canvas.

SFU has been running the open-source version of Canvas since 2012. We can't use the hosted version of Canvas because of British Columbia's privacy legislation; also, we ❀️ open-source and have built a number of customizations that aren't possible in the hosted world.

πŸ†š Open Source vs. hosted

Instructure dual-licenses Canvas; AGPL for the OSS bits, and a proprietary license for the non-open source bits. Those bits are mostly:

  • multi-tenancy
  • custom reports for individual schools
  • small per-environment customizations

There are several Canvas features that are not available to OSS installations. Off the top of my head:

  • Instructure's iOS and Android apps
  • chat (coming soonβ„’, since 2013)
  • SCORM
  • Canvas Data (may be technically available; we've never explored it because it runs on AWS and we can't use that)

πŸ’» Physical Infrastructure

SFU has three main Canvas environments: test, stage, and production. These environments are configured similarly (separate DB, app, and job servers, etc.); the main difference is the number of servers. We also have several one-off single-server environments: snapshot, a weekly recovery clone of production, and edge, an initial integration-test server used as part of our upgrade cycle. We also spin up these single-server environments for limited testing runs; we have one being built now for testing the new UI in isolation from our other environments.

Our infrastructure is all virtualized (VMware) Linux VMs running Red Hat Enterprise Linux 6. All machines are more-or-less identical in terms of specs (quad-core Xeon, 12 GB RAM).

Environment | Database     | Management | App | File | Hot Spare | Redis | Cassandra
----------- | ------------ | ---------- | --- | ---- | --------- | ----- | ---------
Production  | 1 (PSQL 9.5) | 2          | 8   | 7    | 5         | 3     | 1
Staging     | 1 (PSQL 9.5) | 1          | 2   | 1    | 0         | 1     | 1
Testing     | 1 (PSQL 9.5) | 1          | 2   | 1    | 0         | 1     | 1

File storage

Because we can't use AWS, we can't use S3 for file storage. We therefore use local file storage.

The file store (uploaded files, etc.) is on a NetApp filer and mounted at a common mount point (e.g. /mnt/canvasfiles) on all machines in the cluster, regardless of their function. This is per environment: our test, stage, and prod clusters each have their own file store, and we clone it when we refresh a non-prod instance from production. The path_prefix in file_store.yml is set to that mount point.
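
For reference, a local-storage file_store.yml is only a couple of lines per environment; a minimal sketch, using the example mount point from above:

```yaml
# config/file_store.yml – minimal local-storage sketch; the path is illustrative
production:
  storage: local
  path_prefix: /mnt/canvasfiles
```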

We also have a second common mount point that is mounted on every single Canvas machine, regardless of cluster; in other words, the same share is mounted on every machine in our test/stage/prod clusters. This mount point contains common scripts, a file that describes how many application nodes are in each environment, and a set of template config YAML files that get copied in during the deploy. We don't commit these config files to our repo as they may contain credentials (e.g. database passwords, encryption keys, etc.) that we wouldn't want to leak. We have a copy of each file for each environment (e.g. database.yml.{prod,stage,test}) and a script that determines the environment based on the machine's hostname (we use a standard pattern) and copies the appropriate files in during the deploy.
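
The copy-in step is simple; a rough sketch of the idea (the share path, hostname pattern, and file suffixes here are illustrative, not our exact script):

```sh
#!/bin/sh
# Sketch of the config copy-in step during deploy; paths and hostname pattern are assumptions.
set -e

CONFIG_SHARE=/mnt/canvas-common/config        # shared mount with per-environment templates
CANVAS_CONFIG=/var/rails/canvas/current/config

# Derive the environment from the hostname, e.g. canvas-prod-app-03 -> prod
case "$(hostname -s)" in
  *prod*)  ENV=prod  ;;
  *stage*) ENV=stage ;;
  *test*)  ENV=test  ;;
  *) echo "unrecognized hostname pattern" >&2; exit 1 ;;
esac

# Copy each template into place, dropping the environment suffix
# (e.g. database.yml.prod -> database.yml)
for f in "$CONFIG_SHARE"/*."$ENV"; do
  cp "$f" "$CANVAS_CONFIG/$(basename "$f" ."$ENV")"
done
```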

Beware: Instructure doesn't use local file storage, so it isn't well-tested. Be prepared to find bugs! When we find them, we try to fix them and submit them back.

Database

We run a single Postgres 9.5 server per environment. This is definitely our weakest link; it's our main single point of failure. As our infrastructure is all virtualized, we do have the ability to bring up a new VM in an alternate datacentre if disaster were to strike; between our nightly backups and the archive logs, we're mostly covered.
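
For the archive-log half of that, the relevant postgresql.conf settings look roughly like this (a generic 9.5 WAL-archiving sketch; the archive destination is an assumption, not our actual path):

```
# postgresql.conf (9.5) – WAL archiving sketch; paths are illustrative
wal_level = hot_standby
archive_mode = on
archive_command = 'cp %p /mnt/pg_archive/%f'
```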

Management nodes

Canvas makes heavy use of delayed jobs. We run these jobs on separate servers we refer to as "management nodes". These servers don't process any user-facing web requests; they are solely for running the delayed job processor and other management tasks (e.g. the nightly batch SIS import).
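
On those nodes the job daemon runs out of the current release; roughly like this (paths are assumed, and worker pool sizing is omitted):

```sh
# On a management node – start the Canvas delayed jobs daemon (sketch)
cd /var/rails/canvas/current
RAILS_ENV=production script/delayed_job start    # `run` keeps it in the foreground instead
```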

App nodes

"App nodes" are the machines that run the Canvas application and handle user requests. When a user visits https://canvas.sfu.ca, they hit one of those machines. Nothing special going on there.

File nodes

When we first started running Canvas, we had some performance issues with courses that had a large number of file uploads/downloads (e.g. media-intensive design courses). Canvas requires that you serve your files from a separate files domain (e.g. files.canvas.sfu.ca); we point that hostname at a separate load balancer pool containing the file nodes. We could probably do away with that now, but it isn't broken, so we don't plan on fixing it.

These nodes are physically identical to the app nodes; the only difference is what pool they're in.

As mentioned above, we use local storage for files.
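
The files-domain split itself is just configuration; Canvas learns the files hostname from config/domain.yml. A minimal sketch, with the structure hedged from the shipped example file and our hostnames from above:

```yaml
# config/domain.yml – sketch; keys follow the shipped domain.yml.example
production:
  domain: canvas.sfu.ca
  files_domain: files.canvas.sfu.ca
  ssl: true
```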

Hot Spares

These are simply Canvas nodes that aren't in any load balancer pool. If we need to take a bad node out of production, or quickly add capacity, we can add one of these to one of the pools.

Redis

Canvas uses Redis for caching and session storage. The docs suggest it's optional; it really isn't.
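
Both pieces are plain YAML config; a minimal sketch with placeholder hostnames (cache_store.yml then just points cache_store at redis_store):

```yaml
# config/redis.yml – sketch; server hostnames are placeholders
production:
  servers:
    - redis://canvas-redis-1:6379
    - redis://canvas-redis-2:6379
```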

Cassandra

Cassandra is required to use Canvas Analytics. If you aren't running Analytics, then you don't need it.

Load balancer

Everything is on a private network, fronted by an F5 BIG-IP load balancer. We terminate SSL at the load balancer.
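
On the app nodes themselves that means a plain-HTTP Passenger vhost; a hypothetical minimal example (our real config has more in it, and the load balancer is assumed to set X-Forwarded-Proto so Rails generates https URLs):

```apache
# Hypothetical app-node vhost – SSL terminates at the F5, so Apache only speaks plain HTTP
<VirtualHost *:80>
  ServerName canvas.sfu.ca
  DocumentRoot /var/rails/canvas/current/public
  PassengerAppRoot /var/rails/canvas/current
  <Directory /var/rails/canvas/current/public>
    Require all granted
    Options -MultiViews
  </Directory>
</VirtualHost>
```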

πŸ’Ύ Software Infrastructure

  • RHEL 6
  • Ruby 2.3
  • Apache httpd 2.4
  • Phusion Passenger 5.1.2 (Enterprise)

πŸ“¦ Upgrade Process

Instructure pushes a new release of Canvas to production every three weeks. We follow that pattern; however, we keep ourselves one release behind.

We maintain our own fork of the Canvas repo (https://github.com/sfu/canvas-lms) as we've developed our own plugins and have modified some of the core code to work around bugs that have yet to be fixed. We also have several newer plugins contained in their own GitHub repositories.

We use Atlassian Bamboo as our build and deployment server.

At the start of a new deployment cycle, we update our fork with the latest Instructure (upstream) changes. We merge our chosen Instructure release (typically, the stable branch, although we usually pick an explicit treesame commit and merge that) into our repo, and do an initial round of technical tests on our modifications and plugins. If those pass, we release it to our test cluster for functional testing. When that passes, we release it to production.
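
In git terms, the merge step looks roughly like this (remote and branch names are illustrative; as noted, we usually pin an explicit commit on stable rather than the branch tip):

```sh
# Sketch of the upstream merge; branch names and dates are placeholders
git remote add upstream https://github.com/instructure/canvas-lms.git   # one-time setup
git fetch upstream
git checkout -b release/2016-02-03 master
git merge upstream/stable            # or an explicit commit on that branch
# resolve any conflicts with our local modifications and plugins, then push for testing
```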

When we trigger a build & deploy from Bamboo, the build phase does the following (a rough sketch follows the list):

  • Clones the SFU canvas repo at the right branch
  • Clones the ancillary repos (instructure/analytics, instructure/QTIMigrationTool, and our own plugins)
  • Runs bundle install and npm install
  • Runs bundle exec rake canvas:compile_assets
  • Tars the built Canvas into a release package
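
Put together, the build phase is roughly this (the branch variable, bundler path, and tarball name are illustrative, not our exact Bamboo plan):

```sh
# Approximate shape of the Bamboo build phase; names are placeholders
git clone -b "$RELEASE_BRANCH" https://github.com/sfu/canvas-lms.git canvas
cd canvas
# ancillary repos (analytics, QTIMigrationTool, our own plugins) get cloned
# into the appropriate plugin directories here
bundle install --path vendor/bundle
npm install
bundle exec rake canvas:compile_assets
cd .. && tar czf canvas-release.tar.gz canvas
```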

When the 'build' phase is complete, it moves on to the 'deploy' process. This uses Capistrano to push the build out to all the servers in the target environment.

Capistrano uses a tree structure like this for apps:

/var/rails/canvas
β”œβ”€β”€ current -> /var/rails/canvas/releases/20160203055144
β”œβ”€β”€ releases
β”‚   β”œβ”€β”€ 20160112175734
β”‚   β”œβ”€β”€ 20160114222620
β”‚   β”œβ”€β”€ 20160202181554
β”‚   └── 20160203055144
└── shared
    β”œβ”€β”€ bin
    β”œβ”€β”€ cached-copy
    β”œβ”€β”€ log
    β”œβ”€β”€ pids
    β”œβ”€β”€ public
    β”œβ”€β”€ system
    └── tmp

Individual deploys are installed into a timestamped folder in the releases directory. current is a symlink to the currently-running instance of Canvas. Everything that needs to know where Canvas is installed (passenger, apache, scripts, whatever) looks at /var/rails/canvas/current. shared is stuff that should persist across deploys (logs, etc), and some Capistrano-specific stuff (cached-copy).

When cap deploy runs, it takes the tarball that Bamboo built and scp's it to every server in the cluster. It then untars it into a date-stamped release directory under /var/rails/canvas/releases. For good measure we run bundle install again – we probably don't need to do this; I think we do it just in case there are any issues with gems using native code. Capistrano then runs database migrations, and moves the current symlink to point at the new release. Finally, it restarts the Passenger app server. We use Passenger Enterprise, which has a rolling restart capability. This lets us restart Canvas without our users noticing; although, recently, we've been seeing some issues with assets being broken on the first load (e.g. CSS not being served correctly).
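
Per server, the deploy boils down to something like this (a sketch only: Capistrano orchestrates it, the timestamp and tarball path are placeholders, and migrations run once per deploy rather than on every node):

```sh
# Sketch of the per-server deploy steps; paths and timestamps are placeholders
RELEASE=/var/rails/canvas/releases/20160203055144
mkdir -p "$RELEASE"
tar xzf /tmp/canvas-release.tar.gz -C "$RELEASE" --strip-components=1
cd "$RELEASE"
bundle install --path vendor/bundle                     # belt and braces for native gems
RAILS_ENV=production bundle exec rake db:migrate        # run once per deploy, not per node
ln -sfn "$RELEASE" /var/rails/canvas/current             # repoint the current symlink
passenger-config restart-app /var/rails/canvas/current  # rolling restart with Enterprise
```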

If anything goes wrong, and as long as database migrations permit, we can roll back a release because we keep the old ones around. We had to do this last week; we had a bad deploy and were able to roll it back.

πŸš€ To the future

We have some short-term goals for refining this process:

  • One of our developers has come up with a set of automated Selenium tests for our customizations. Right now she runs them manually on her desktop; I want to get them moved onto a server, parallelize them, and make them part of our upgrade process.
  • New release tooling: we're working on some scripts to create dated release branches in our Canvas and Canvas-adjacent repositories, automatically open a pull request with the target Instructure release, and, if mergeable, merge it and deploy to our testing stack.
  • Right now, Capistrano scp's the release tarball to all of our production machines: 22 of them. We have a shared mount point mounted on all of those machines; I plan on changing the process so that it scp's the tarball there, and then each server copies it from the shared mount and untars into the release directory.