Potential Ingest Improvement

@leonid_andreev:

Ingest thoughts for another release.

Current Scenario: User uploads files to a dataset (before save)

Tracking lots of non-persisted objects

  • We currently create non-persisted DataFile and FileMetadata objects linked to a non-persisted DatasetVersion
  • Non-persisted objects tracked in memory:
    1. DataFile objects
    2. FileMetadata objects
    3. DatasetVersion with references to:
      • Existing (persisted) files
      • Newly added, non-persisted files
  • More code is needed to track these in-memory objects
    • Much more for developers to think about, especially when modifying or adding features to this code.

UX issues: Inability to create accurate error messages

  • Error messages only appear for the last file ingested.
  • Example: a dataset already has published file 07.txt
    • The user adds 3 files, including accidentally re-adding file 07.txt (or another file with the same content as 07.txt)
    • Good: the user will not see 07.txt in the list of newly uploaded files
    • Bad: the user may not see an error message explaining why 07.txt was not included in the list of newly uploaded files
  • If multiple files have error messages, only one error message will appear. This is due to:
    • File handling being done file by file
    • The asynchronous nature of file upload
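
One way to make the messages accurate would be to collect errors per file and report them all once the uploads finish, instead of keeping a single message that each async handler overwrites. A minimal sketch, assuming a thread-safe collector shared by the upload handlers; the class and method names are hypothetical, not existing Dataverse code:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class UploadErrorCollector {
    // filename -> error messages for that file (insertion order preserved)
    private final Map<String, List<String>> errorsByFile = new LinkedHashMap<>();

    // Called from each async upload handler as problems are found.
    public synchronized void addError(String fileName, String message) {
        errorsByFile.computeIfAbsent(fileName, k -> new ArrayList<>()).add(message);
    }

    // Called once, after all uploads complete, to build the UI message list.
    public synchronized List<String> summarize() {
        List<String> lines = new ArrayList<>();
        errorsByFile.forEach((file, msgs) ->
                msgs.forEach(msg -> lines.add(file + ": " + msg)));
        return lines;
    }
}
```

With something like this, the 07.txt example above would produce a visible line such as "07.txt: duplicate of an existing file" no matter how many other files also failed.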

Implications (So what?)

  • Lots of files
    • Usually there are no user-facing issues, BUT with lots of files, or when adding files to a dataset that already has lots of files, the process gets "expensive" and slower.
    • Checking duplicates:
      • Each single file added must be checked against every existing and new file (see the checksum sketch after this list).
        • Not insurmountable in the current configuration, but it adds even more tracking of non-persisted files.
  • Commands
    • We can't break the process of (1) Upload + (2) Save into separate commands, because state must be kept between (1) and (2)
  • Stateful vs. stateless
    • The process of (1) Upload + (2) Save is always "stateful", while the web is inherently "stateless"
    • e.g., we can't make an API that splits steps (1) + (2)
  • Can't use the API when working with the UI
    • For consistency, conciseness, shared logic, etc., most web projects now use their own API within their UI
  • When programming, there's always more to think about, e.g., always tracking the "working version" of the Dataset.
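
The duplicate check itself is cheap once the checksums of already-persisted files can be pulled from the database rather than rescanned from in-memory lists. A minimal sketch, assuming MD5 content checksums; the class is hypothetical, not existing Dataverse code:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

public class DuplicateChecker {
    // Checksums of persisted files plus files added so far in this upload.
    private final Set<String> knownChecksums = new HashSet<>();

    public DuplicateChecker(Set<String> existingChecksums) {
        knownChecksums.addAll(existingChecksums);
    }

    /** Returns true if the file's content duplicates a known file. */
    public boolean isDuplicate(Path file) throws IOException, NoSuchAlgorithmException {
        // add() returns false if the checksum was already present
        return !knownChecksums.add(md5Hex(file));
    }

    private static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), md)) {
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) { /* stream updates the digest */ }
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }
}
```

With a persisted set of checksums, each new file costs one hash plus one set lookup; the current design instead rescans the in-memory collection of non-persisted files on every addition.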

Potential update - Make a PotentialDataFile object

  • Note: current ingest has 3 major steps; this "update" would touch the first two:
    1. createDataFiles - creates the non-persisted DataFile objects, which leads to tracking lots of non-persisted objects as described above
    2. addFiles - does some final checks and then persists the objects, moving files from temp to permanent directories
    3. startIngestJobs - runs the full ingest

Basic update idea

  1. Make a new and persistent intermediary object: PotentialDataFile*
    • (the name is for conceptual purposes, not necessarily the best name)
    • Contains:
      • DataFile attributes
      • FileMetadata attributes
      • Link to the temp directory holding the actual file
      • Link to a Dataset (not a DatasetVersion)
    • (A hedged entity sketch appears after this list.)
  2. 1st ingest step:
    • Create and persist PotentialDataFile objects
    • The process becomes "stateless" instead of "stateful"
      • Can query the database to get the PotentialDataFile objects for a Dataset
    • No need to track lots of non-persisted objects
      • The DatasetPage editing UI & backing bean become easier: less to track
      • Can always reconstruct the working DatasetVersion
    • The UI could even use the API
      • Tests against the API would then be closer to UI behavior
  3. 2nd ingest step:
    • Transition PotentialDataFile objects to DataFile objects (see the promotion sketch below)
    • Delete the PotentialDataFile objects
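
A minimal sketch of what PotentialDataFile might look like as a JPA entity, assuming standard javax.persistence annotations; every field shown is an illustrative placeholder, not a proposed schema:

```java
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.ManyToOne;

/**
 * Hypothetical intermediary entity: persisted as soon as the user uploads
 * a file, before the dataset edit itself is saved.
 */
@Entity
public class PotentialDataFile {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    // DataFile attributes (illustrative subset)
    private String contentType;
    private String checksumValue;
    private Long filesize;

    // FileMetadata attributes (illustrative subset)
    private String label;        // user-visible file name
    private String description;

    // Location of the uploaded bytes in the temp directory
    private String tempFilePath;

    // Linked to the Dataset, not to a (possibly unsaved) DatasetVersion
    @ManyToOne
    private Dataset dataset;

    // getters/setters omitted for brevity
}
```

Because these rows live in the database, nothing about a half-finished upload needs to survive in the backing bean between requests.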
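And a sketch of the 2nd ingest step under the same assumptions: query the staged rows, promote each to a real DataFile, move the bytes out of the temp area, and delete the staging row. All names and calls are illustrative, not actual Dataverse code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import javax.persistence.EntityManager;

public class PotentialDataFilePromoter {

    private final EntityManager em;

    public PotentialDataFilePromoter(EntityManager em) {
        this.em = em;
    }

    /** Step 2 (sketch): turn staged PotentialDataFiles into real DataFiles. */
    public void promote(Dataset dataset, DatasetVersion editVersion, Path permanentDir)
            throws IOException {
        // Statelessness: the upload state comes from the DB, not a backing bean.
        List<PotentialDataFile> staged = em.createQuery(
                "SELECT p FROM PotentialDataFile p WHERE p.dataset = :ds",
                PotentialDataFile.class)
            .setParameter("ds", dataset)
            .getResultList();

        for (PotentialDataFile p : staged) {
            // Final checks (duplicates, etc.) would happen here.

            DataFile df = new DataFile();   // real entity; construction simplified
            // ... copy attributes from p onto df and a new FileMetadata,
            //     attaching that FileMetadata to editVersion ...

            // Move the bytes from the temp area to permanent storage
            Path src = Paths.get(p.getTempFilePath());
            Files.move(src, permanentDir.resolve(src.getFileName()));

            em.persist(df);
            em.remove(p);   // the staging row is no longer needed
        }
    }
}
```

Note that this step could be exposed as an API endpoint, since everything it needs is either persisted or passed in, which is what makes splitting (1) Upload and (2) Save possible.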