Project: Data Normalization and BigQuery Integration for Gaming Compliance
Context: I am developing a cloud-based system for normalizing and processing gaming compliance data from various operators, specifically starting with RSI BetRivers. The system needs to handle Excel file inputs, process them according to a predefined schema, and output the normalized data to BigQuery tables.
Key Components:
- Cloud Function: Triggered by file uploads to a Google Cloud Storage bucket (entry point sketched after this list).
- Data Processing: Normalizing Excel data based on a JSON schema.
- BigQuery Integration: Loading processed data into specified BigQuery tables.
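
The sketch below ties these three components together, assuming a first-generation GCS-triggered Cloud Function. Only process_excel_file is named in this brief; handle_file_upload, load_dataframe_to_bigquery, and the table ID are illustrative placeholders, not the project's actual names.

```python
# main.py -- a minimal sketch, assuming a gen-1 GCS-triggered Cloud Function.
# process_excel_file comes from this brief; load_dataframe_to_bigquery and the
# table ID are hypothetical stand-ins for the project's own utilities.
import os
import tempfile

from google.cloud import storage

from operators.operator_rsi_betrivers import process_excel_file
from utils.bigquery_utils.bigquery_utils import load_dataframe_to_bigquery  # hypothetical name


def handle_file_upload(event, context):
    """Triggered by a finalized object in the GCS bucket."""
    bucket_name = event["bucket"]
    blob_name = event["name"]

    # Only react to Excel uploads; ignore everything else in the bucket.
    if not blob_name.endswith(".xlsx"):
        return

    # Download to the only writable path in Cloud Functions (/tmp).
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    local_path = os.path.join(tempfile.gettempdir(), os.path.basename(blob_name))
    blob.download_to_filename(local_path)

    # Normalize per the schema, then load the result into BigQuery.
    df = process_excel_file(local_path)
    load_dataframe_to_bigquery(df, table_id="my_project.cdor.rsi_betrivers")  # hypothetical table
```

Keeping the handler thin, with normalization and loading delegated to the operator and utility modules, makes the pipeline testable without deploying to GCP.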
Project Structure:
```
/cdor-dev
├── config/
│   ├── cdor_schema_v2.json
│   ├── config.py
│   └── __init__.py
├── dev/
│   ├── import_checker.py
│   ├── import_fixer.py
│   ├── __init__.py
│   ├── map_functions.py
│   ├── project_analyzer.py
│   └── project_fixer.py
├── __init__.py
├── local_bucket/
│   └── D_Recon_CO_2024-08-04_V2.xlsx
├── local_test.py
├── logs.txt
├── main.py
├── my_bucket/
├── operators/
│   ├── __init__.py
│   └── operator_rsi_betrivers.py
├── project_analysis.json
├── README.md
├── requirements.txt
├── scripts/
│   └── run_etl.sh
├── tests/
│   ├── __init__.py
│   ├── test_data_normalization.py
│   ├── test_imports.py
│   ├── test_local.py
│   └── test_process_excel_file.py
└── utils/
    ├── bigquery_utils/
    │   ├── bigquery_utils.py
    │   └── __init__.py
    ├── error_handling_utils/
    │   └── __init__.py
    ├── etl_utils/
    │   ├── data_normalization.py
    │   ├── etl_utils.py
    │   └── __init__.py
    ├── __init__.py
    └── logging_utils/
        └── __init__.py
```
Packages in Use:
(List all packages from requirements.txt here. If the file is not current, it can be regenerated by running pip freeze > requirements.txt in the project's virtual environment.)
Current Focus:
- Implementing and testing the process_excel_file function in operator_rsi_betrivers.py.
- Ensuring correct schema loading and application during data processing (a hypothetical schema excerpt follows this list).
- Setting up a robust testing framework using pytest.
- Developing a local testing workflow that simulates the Cloud Function environment.
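
cdor_schema_v2.json itself is not shown in this brief, so the excerpt below is purely illustrative; the column names and the columns/source/target/type shape are assumptions that the later sketches build on.

```python
# Hypothetical shape for config/cdor_schema_v2.json -- the real schema is not
# shown in this brief; the sketches below assume this structure.
EXAMPLE_SCHEMA = {
    "columns": [
        {"source": "Patron ID",    "target": "patron_id",    "type": "string"},
        {"source": "Wager Amt",    "target": "wager_amount", "type": "float"},
        {"source": "Settled Date", "target": "settled_date", "type": "datetime"},
    ]
}
```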
Key Requirements:
- The system should handle both file path and DataFrame inputs for flexibility in testing and production (see the sketch after this list).
- Data processing should include cleaning, normalization, and type conversion based on the schema.
- The implementation should be modular and follow Python best practices.
- Comprehensive unit and integration tests should be developed alongside the main code.
- The system should be easily extendable to handle data from other gaming operators in the future.
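
A sketch of process_excel_file that satisfies the path-or-DataFrame requirement, assuming the hypothetical schema shape shown above; the real implementation in operator_rsi_betrivers.py may differ.

```python
# operators/operator_rsi_betrivers.py -- a sketch under the assumptions above.
import json
from pathlib import Path
from typing import Union

import pandas as pd

# Schema "type" values mapped to cleaning/conversion steps (assumed vocabulary).
_TYPE_CASTS = {
    "string": lambda s: s.astype("string").str.strip(),
    "float": lambda s: pd.to_numeric(s, errors="coerce"),
    "datetime": lambda s: pd.to_datetime(s, errors="coerce"),
}


def process_excel_file(source: Union[str, Path, pd.DataFrame],
                       schema_path: str = "config/cdor_schema_v2.json") -> pd.DataFrame:
    """Accept a file path (production) or a DataFrame (tests) and normalize it."""
    # Requirement: accept either a path or an already-loaded DataFrame.
    df = source.copy() if isinstance(source, pd.DataFrame) else pd.read_excel(source)

    with open(schema_path) as fh:
        schema = json.load(fh)

    # Rename source columns to target names, then cast each per the schema.
    renames = {c["source"]: c["target"] for c in schema["columns"]}
    df = df.rename(columns=renames)
    for col in schema["columns"]:
        target, cast = col["target"], _TYPE_CASTS[col["type"]]
        if target in df.columns:
            df[target] = cast(df[target])

    # Keep only schema-defined columns, in schema order.
    return df[[c["target"] for c in schema["columns"] if c["target"] in df.columns]]
```

Accepting a DataFrame directly keeps unit tests fast and free of file I/O, while production code passes the path of the downloaded file.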
Development Approach:
- A Test-Driven Development (TDD) approach, writing tests before implementing features (example below).
- Iterative development, focusing on one component at a time.
- Regular refactoring to maintain code quality and readability.
- Continuous integration practices, running tests automatically on code changes.
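
As an example of the test-first style, the cases below are written against the hypothetical sketch above; the embedded schema mirrors the illustrative excerpt so the tests do not depend on the real config/cdor_schema_v2.json.

```python
# tests/test_process_excel_file.py -- example tests against the sketch above.
import json

import pandas as pd
import pytest

from operators.operator_rsi_betrivers import process_excel_file

# Mirrors the hypothetical schema excerpt, not the real project schema.
SCHEMA = {"columns": [
    {"source": "Patron ID", "target": "patron_id", "type": "string"},
    {"source": "Wager Amt", "target": "wager_amount", "type": "float"},
]}


@pytest.fixture
def schema_file(tmp_path):
    path = tmp_path / "schema.json"
    path.write_text(json.dumps(SCHEMA))
    return str(path)


def test_accepts_dataframe_input(schema_file):
    raw = pd.DataFrame({"Patron ID": [" 123 "], "Wager Amt": ["10.50"]})
    result = process_excel_file(raw, schema_path=schema_file)
    assert list(result.columns) == ["patron_id", "wager_amount"]
    assert result.loc[0, "patron_id"] == "123"
    assert result.loc[0, "wager_amount"] == pytest.approx(10.5)


def test_bad_numeric_values_become_null(schema_file):
    raw = pd.DataFrame({"Patron ID": ["123"], "Wager Amt": ["oops"]})
    result = process_excel_file(raw, schema_path=schema_file)
    assert pd.isna(result.loc[0, "wager_amount"])
```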
Next Steps:
- Review and refine the current process_excel_file function implementation.
- Develop comprehensive tests for the data processing pipeline.
- Implement local testing utilities to simulate the Cloud Function environment (sketched below).
- Gradually build out the BigQuery integration components with appropriate tests.
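
For the local workflow, a thin driver like the one below can stand in for the Cloud Function trigger: it builds a fake event for the sample file in local_bucket/ and runs the same processing step the deployed function would, without touching GCP. The event shape mirrors the handler sketch above; this is not necessarily the project's actual local_test.py.

```python
# local_test.py -- a sketch: simulate the GCS trigger locally using the sample
# file checked into local_bucket/, so no GCP credentials are required.
from pathlib import Path

from operators.operator_rsi_betrivers import process_excel_file

SAMPLE = Path("local_bucket") / "D_Recon_CO_2024-08-04_V2.xlsx"


def main() -> None:
    # Mimic the payload a GCS-triggered function receives, so any dispatch
    # logic keyed on bucket/name can be exercised locally.
    fake_event = {"bucket": "local_bucket", "name": SAMPLE.name}
    print(f"Simulating upload event: {fake_event}")

    # Skip the download step entirely: the "bucket" is a local directory.
    df = process_excel_file(SAMPLE)
    print(df.dtypes)
    print(df.head())


if __name__ == "__main__":
    main()
```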
Please assist in developing this system, focusing on best practices, test-driven development, and creating a robust, maintainable codebase.