@leonid_andreev :
Ingest thoughts for another release.
- We currently create non-persisted DataFile and FileMetadata objects linked to a non-persisted DatasetVersion
- Non-persisted objects tracked in memory:
- DataFile objects
- FileMetadata objects
- DatasetVersion with references to:
- Existing (persisted) files +
- Non-persisted files
- More code to track these in-memory objects
- Much more for developers to think about--especially when modifying/adding features to this code.
- Error messages only appear for the last file ingested.
- Dataset already has published file `07.txt`
- User adds 3 files, including accidentally re-adding file `07.txt` (or another file with the same content as `07.txt`)
- Good: User will not see `07.txt` in the list of newly uploaded files
- Bad: User may not see an error message explaining why `07.txt` was not included in the list of newly uploaded files
- If multiple files have error messages, only one error message will appear.
- This is due to:
- File handling is done file by file
- The async. nature of file upload
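One way to picture the single-message problem is a shared error slot that each file's handler overwrites, so only the last write survives. The sketch below assumes that model (it is not confirmed against the actual upload code) and contrasts it with keeping one message per file. All class and method names here are hypothetical.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; not the actual Dataverse upload handler.
final class UploadErrorsSketch {

    // Single shared slot: every failing file overwrites the previous message.
    private volatile String lastError;

    // Per-file messages in upload order; synchronized because uploads
    // may report errors from different threads.
    private final Map<String, String> errorsByFile =
            Collections.synchronizedMap(new LinkedHashMap<String, String>());

    void reportError(String fileName, String message) {
        lastError = fileName + ": " + message; // only the latest message survives
        errorsByFile.put(fileName, message);   // every message survives
    }

    String getLastError() {
        return lastError;
    }

    Map<String, String> getErrorsByFile() {
        return errorsByFile;
    }
}
```

After two failing files, `getLastError()` only knows about the second one, while `getErrorsByFile()` still holds both.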
- Lots of files
- Usually there are no user issues BUT with lots of files or adding files to a dataset with lots of existing files, it starts getting "expensive" and slower.
- Checking duplicates:
- For each single file added, need to check every existing and new file.
- Not insurmountable in the current setup, but fixing it there means adding even more tracking of non-persisted files.
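A minimal sketch of the duplicate check described above, assuming duplicates are detected by comparing content checksums (suggested by the "same content as `07.txt`" case). The names `DuplicateCheckSketch`, `isDuplicatePairwise` and `ChecksumIndex` are hypothetical and not taken from the Dataverse code. The first method compares one incoming file against every known file, which is the per-file cost noted above. The second is a swapped-in alternative that keeps the known checksums in a set so each check is a single lookup.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; names are illustrative only.
final class DuplicateCheckSketch {

    // Pairwise check: the incoming file is compared with every known file
    // (existing persisted files plus files already added in this session).
    static boolean isDuplicatePairwise(String incomingChecksum, List<String> knownChecksums) {
        for (String known : knownChecksums) {
            if (known.equals(incomingChecksum)) {
                return true;
            }
        }
        return false;
    }

    // Alternative: keep known checksums in a set so each check is one lookup.
    static final class ChecksumIndex {
        private final Set<String> seen = new HashSet<>();

        ChecksumIndex(List<String> existingChecksums) {
            seen.addAll(existingChecksums);
        }

        // Returns true if the checksum is new (and records it), false if it is a duplicate.
        boolean addIfNew(String checksum) {
            return seen.add(checksum);
        }
    }
}
```

With the pairwise version, adding many files to a dataset that already has many files repeats the full scan for each one; the set-based version still needs the known checksums loaded once per session.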
- Commands
- We can't break the process of (1) Upload + (2) Save into separate commands, because we need to keep state between (1) and (2)
- Stateful vs stateless
- The process of (1) Upload + (2) Save is always "stateful", while the web is inherently "stateless"
- e.g. We can't make an API to split steps (1) + (2)
- Can't use the API when working with the UI
- For consistency, conciseness, shared logic, etc. most web projects now use the API within their UI
- When programming, there's always more to think about. e.g., Always tracking the "working version" of the Dataset.
- Note: Current ingest has 3 major steps; this "update" would touch the first two steps
- `createDataFiles`: Create non-persisted DataFile objects, where we're tracking lots of non-persisted objects as described above
- `addFiles`: Do some final checks and then persist the objects, moving files from temp to permanent directories.
- `startIngestJobs`: Run full ingest
- Make a new and persistent intermediary object: `PotentialDataFile`
- (name is for conceptual purposes, not necessarily the best name)
- Contains:
- DataFile attributes
- FileMetadata attributes
- Link to temp directory with actual file
- Link to a Dataset (not a DatasetVersion)
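To make the "Contains" list concrete, here is a rough sketch of what such an object might hold. This is only an illustration: the specific fields (`contentType`, `checksum`, `label`, `description`) are assumed examples of DataFile and FileMetadata attributes, and a real version would presumably be a persisted entity rather than a plain class.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical shape of the proposed intermediary object; field names are examples only.
final class PotentialDataFile {
    // DataFile attributes (assumed examples)
    private final String contentType;
    private final String checksum;
    // FileMetadata attributes (assumed examples)
    private final String label;
    private final String description;
    // Location of the uploaded file in the temp directory
    private final Path tempPath;
    // Link to the Dataset, not to a DatasetVersion
    private final long datasetId;

    PotentialDataFile(String contentType, String checksum, String label,
                      String description, String tempPath, long datasetId) {
        this.contentType = contentType;
        this.checksum = checksum;
        this.label = label;
        this.description = description;
        this.tempPath = Paths.get(tempPath);
        this.datasetId = datasetId;
    }

    String getContentType() { return contentType; }
    String getChecksum() { return checksum; }
    String getLabel() { return label; }
    String getDescription() { return description; }
    Path getTempPath() { return tempPath; }
    long getDatasetId() { return datasetId; }
}
```

Because the object points at a Dataset rather than a DatasetVersion, it does not depend on a non-persisted working version existing in memory.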
- 1st ingest step:
- Create and persist `PotentialDataFile` objects
- The process becomes "stateless" instead of "stateful"
- Can query database to get `PotentialDataFile` objects for a Dataset
- Don't need to track lots of non-persisted objects
- The DatasetPage editing UI & backing bean become easier -- less to track
- Can always reconstruct the working DatasetVersion
- UI could even use the API
- Tests against the API would then be closer to what the UI actually does
- 2nd ingest step:
- Transition `PotentialDataFile` objects to `DataFile` objects.
- Delete the `PotentialDataFile` objects
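The two proposed steps could look roughly like the sketch below. It is only a conceptual outline: in-memory maps stand in for database tables, moving a file from temp to permanent storage is represented by a path string, and every name (`StatelessIngestSketch`, `stage`, `findPotential`, `finalizeFiles`) is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: in-memory maps stand in for the database tables.
final class StatelessIngestSketch {

    static final class PotentialDataFile {
        final long datasetId;
        final String label;
        final String tempPath;
        PotentialDataFile(long datasetId, String label, String tempPath) {
            this.datasetId = datasetId;
            this.label = label;
            this.tempPath = tempPath;
        }
    }

    static final class DataFile {
        final long datasetId;
        final String label;
        final String storagePath;
        DataFile(long datasetId, String label, String storagePath) {
            this.datasetId = datasetId;
            this.label = label;
            this.storagePath = storagePath;
        }
    }

    private final Map<Long, List<PotentialDataFile>> potentialByDataset = new HashMap<>();
    private final Map<Long, List<DataFile>> filesByDataset = new HashMap<>();

    // 1st step: persist a PotentialDataFile; nothing is held in session state.
    void stage(long datasetId, String label, String tempPath) {
        potentialByDataset.computeIfAbsent(datasetId, k -> new ArrayList<>())
                .add(new PotentialDataFile(datasetId, label, tempPath));
    }

    // Any request can look up what is pending for a dataset.
    List<PotentialDataFile> findPotential(long datasetId) {
        List<PotentialDataFile> pending = potentialByDataset.get(datasetId);
        if (pending == null) {
            return Collections.emptyList();
        }
        return new ArrayList<>(pending);
    }

    // 2nd step: turn each PotentialDataFile into a DataFile, then delete the potential rows.
    List<DataFile> finalizeFiles(long datasetId) {
        List<DataFile> created = new ArrayList<>();
        for (PotentialDataFile p : findPotential(datasetId)) {
            created.add(new DataFile(p.datasetId, p.label, "permanent/" + p.label));
        }
        filesByDataset.computeIfAbsent(datasetId, k -> new ArrayList<>()).addAll(created);
        potentialByDataset.remove(datasetId);
        return created;
    }
}
```

Since each step reads what it needs from the store by Dataset id, the two steps could be separate commands or separate API calls without carrying state between them.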