Potential Ingest Improvement

@leonid_andreev:

Ingest thoughts for another release.

Current Scenario: User uploads files to a dataset (before save)

Tracking lots of non-persisted objects

  • We currently create non-persisted DataFile and FileMetadata objects linked to a non-persisted DatasetVersion
  • Non-persisted objects tracked in memory:
    1. DataFile objects
    2. FileMetadata objects
    3. DatasetVersion with references to:
      • Existing (persisted) files
      • Newly added, non-persisted files
  • More code is needed to track these in-memory objects
    • Much more for developers to think about, especially when modifying or adding features to this code.

UX issues: Inability to create accurate error messages

  • Error messages only appear for the last file ingested.
  • Example: a dataset already has published file 07.txt
    • The user adds 3 files, including accidentally re-adding file 07.txt (or another file with the same content as 07.txt)
    • Good: the user will not see 07.txt in the list of newly uploaded files
    • Bad: the user may not see an error message explaining why 07.txt was not included in the list of newly uploaded files
  • If multiple files have error messages, only one error message will appear. This is due to:
    • File handling being done file by file
    • The asynchronous nature of file upload
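
One way to make the messages accurate would be to collect errors per file and report them all once the uploads finish, instead of keeping a single message that each async handler overwrites. A minimal sketch, assuming a thread-safe collector shared by the upload handlers; the class and method names are hypothetical, not existing Dataverse code:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class UploadErrorCollector {
    // filename -> error messages for that file (insertion order preserved)
    private final Map<String, List<String>> errorsByFile = new LinkedHashMap<>();

    // Called from each async upload handler as problems are found.
    public synchronized void addError(String fileName, String message) {
        errorsByFile.computeIfAbsent(fileName, k -> new ArrayList<>()).add(message);
    }

    // Called once, after all uploads complete, to build the UI message list.
    public synchronized List<String> summarize() {
        List<String> lines = new ArrayList<>();
        errorsByFile.forEach((file, msgs) ->
                msgs.forEach(msg -> lines.add(file + ": " + msg)));
        return lines;
    }
}
```

With something like this, the 07.txt example above would produce a visible line such as "07.txt: duplicate of an existing file" no matter how many other files also failed.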

Implications (So what?)

  • Lots of files
    • Usually there are no user-facing issues, BUT with lots of files, or when adding files to a dataset that already has lots of files, the process gets "expensive" and slower.
    • Checking duplicates:
      • Each single file added must be checked against every existing and new file (see the checksum sketch after this list).
        • Not insurmountable in the current configuration, but it adds even more tracking of non-persisted files.
  • Commands
    • We can't break the process of (1) Upload + (2) Save into separate commands, because state must be kept between (1) and (2)
  • Stateful vs. stateless
    • The process of (1) Upload + (2) Save is always "stateful", while the web is inherently "stateless"
    • e.g., we can't make an API that splits steps (1) + (2)
  • Can't use the API when working with the UI
    • For consistency, conciseness, shared logic, etc., most web projects now use their own API within their UI
  • When programming, there's always more to think about, e.g., always tracking the "working version" of the Dataset.
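
The duplicate check itself is cheap once the checksums of already-persisted files can be pulled from the database rather than rescanned from in-memory lists. A minimal sketch, assuming MD5 content checksums; the class is hypothetical, not existing Dataverse code:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

public class DuplicateChecker {
    // Checksums of persisted files plus files added so far in this upload.
    private final Set<String> knownChecksums = new HashSet<>();

    public DuplicateChecker(Set<String> existingChecksums) {
        knownChecksums.addAll(existingChecksums);
    }

    /** Returns true if the file's content duplicates a known file. */
    public boolean isDuplicate(Path file) throws IOException, NoSuchAlgorithmException {
        // add() returns false if the checksum was already present
        return !knownChecksums.add(md5Hex(file));
    }

    private static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), md)) {
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) { /* stream updates the digest */ }
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }
}
```

With a persisted set of checksums, each new file costs one hash plus one set lookup; the current design instead rescans the in-memory collection of non-persisted files on every addition.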

Potential update - Make a PotentialDataFile object

  • Note: current ingest has 3 major steps; this "update" would touch the first two:
    1. createDataFiles - creates the non-persisted DataFile objects, which leads to tracking lots of non-persisted objects as described above
    2. addFiles - does some final checks and then persists the objects, moving files from temp to permanent directories
    3. startIngestJobs - runs the full ingest

Basic update idea

  1. Make a new and persistent intermediary object: PotentialDataFile*
    • (the name is for conceptual purposes, not necessarily the best name)
    • Contains:
      • DataFile attributes
      • FileMetadata attributes
      • Link to the temp directory holding the actual file
      • Link to a Dataset (not a DatasetVersion)
    • (A hedged entity sketch appears after this list.)
  2. 1st ingest step:
    • Create and persist PotentialDataFile objects
    • The process becomes "stateless" instead of "stateful"
      • Can query the database to get the PotentialDataFile objects for a Dataset
    • No need to track lots of non-persisted objects
      • The DatasetPage editing UI & backing bean become easier: less to track
      • Can always reconstruct the working DatasetVersion
    • The UI could even use the API
      • Tests against the API would then be closer to UI behavior
  3. 2nd ingest step:
    • Transition PotentialDataFile objects to DataFile objects (see the promotion sketch below)
    • Delete the PotentialDataFile objects
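
A minimal sketch of what PotentialDataFile might look like as a JPA entity, assuming standard javax.persistence annotations; every field shown is an illustrative placeholder, not a proposed schema:

```java
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.ManyToOne;

/**
 * Hypothetical intermediary entity: persisted as soon as the user uploads
 * a file, before the dataset edit itself is saved.
 */
@Entity
public class PotentialDataFile {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    // DataFile attributes (illustrative subset)
    private String contentType;
    private String checksumValue;
    private Long filesize;

    // FileMetadata attributes (illustrative subset)
    private String label;        // user-visible file name
    private String description;

    // Location of the uploaded bytes in the temp directory
    private String tempFilePath;

    // Linked to the Dataset, not to a (possibly unsaved) DatasetVersion
    @ManyToOne
    private Dataset dataset;

    // getters/setters omitted for brevity
}
```

Because these rows live in the database, nothing about a half-finished upload needs to survive in the backing bean between requests.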
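And a sketch of the 2nd ingest step under the same assumptions: query the staged rows, promote each to a real DataFile, move the bytes out of the temp area, and delete the staging row. All names and calls are illustrative, not actual Dataverse code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import javax.persistence.EntityManager;

public class PotentialDataFilePromoter {

    private final EntityManager em;

    public PotentialDataFilePromoter(EntityManager em) {
        this.em = em;
    }

    /** Step 2 (sketch): turn staged PotentialDataFiles into real DataFiles. */
    public void promote(Dataset dataset, DatasetVersion editVersion, Path permanentDir)
            throws IOException {
        // Statelessness: the upload state comes from the DB, not a backing bean.
        List<PotentialDataFile> staged = em.createQuery(
                "SELECT p FROM PotentialDataFile p WHERE p.dataset = :ds",
                PotentialDataFile.class)
            .setParameter("ds", dataset)
            .getResultList();

        for (PotentialDataFile p : staged) {
            // Final checks (duplicates, etc.) would happen here.

            DataFile df = new DataFile();   // real entity; construction simplified
            // ... copy attributes from p onto df and a new FileMetadata,
            //     attaching that FileMetadata to editVersion ...

            // Move the bytes from the temp area to permanent storage
            Path src = Paths.get(p.getTempFilePath());
            Files.move(src, permanentDir.resolve(src.getFileName()));

            em.persist(df);
            em.remove(p);   // the staging row is no longer needed
        }
    }
}
```

Note that this step could be exposed as an API endpoint, since everything it needs is either persisted or passed in, which is what makes splitting (1) Upload and (2) Save possible.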