@iTrauco
Created September 26, 2024 17:07

Project: Data Normalization and BigQuery Integration for Gaming Compliance

Context: I am developing a cloud-based system for normalizing and processing gaming compliance data from various operators, specifically starting with RSI BetRivers. The system needs to handle Excel file inputs, process them according to a predefined schema, and output the normalized data to BigQuery tables.

Key Components:

  1. Cloud Function: Triggered by file uploads to a Google Cloud Storage bucket.
  2. Data Processing: Normalizing Excel data based on a JSON schema.
  3. BigQuery Integration: Loading processed data into specified BigQuery tables.
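The three components above could be wired together in a Cloud Function entry point along these lines. This is only a sketch: the helper behavior, the event-field names, and the eventual calls into `process_excel_file` and BigQuery are assumptions, not the project's actual implementation.

```python
# Sketch of a GCS-triggered Cloud Function (assumed structure, not the
# project's actual code). The real function would use google.cloud.storage
# to fetch the object and google.cloud.bigquery to load the result.

def parse_gcs_event(event: dict) -> tuple:
    """Extract the bucket and object name from a GCS finalize event."""
    return event["bucket"], event["name"]

def process_file(event: dict, context=None) -> str:
    """Entry point: runs when a file lands in the input bucket."""
    bucket, name = parse_gcs_event(event)
    if not name.lower().endswith(".xlsx"):
        # Only Excel workbooks are expected from the operators.
        return f"skipped non-Excel object: {name}"
    # Real flow (omitted here): download gs://{bucket}/{name}, normalize
    # it via process_excel_file() against cdor_schema_v2.json, then load
    # the normalized rows into the target BigQuery table.
    return f"processing gs://{bucket}/{name}"
```

The `parse_gcs_event` helper is split out so the event-handling logic can be unit-tested without any cloud dependencies.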

Project Structure:

/cdor-dev
├── config/
│   ├── cdor_schema_v2.json
│   ├── config.py
│   └── __init__.py
├── dev/
│   ├── import_checker.py
│   ├── import_fixer.py
│   ├── __init__.py
│   ├── map_functions.py
│   ├── project_analyzer.py
│   └── project_fixer.py
├── __init__.py
├── local_bucket/
│   └── D_Recon_CO_2024-08-04_V2.xlsx
├── local_test.py
├── logs.txt
├── main.py
├── my_bucket/
├── operators/
│   ├── __init__.py
│   └── operator_rsi_betrivers.py
├── project_analysis.json
├── README.md
├── requirements.txt
├── scripts/
│   └── run_etl.sh
├── tests/
│   ├── __init__.py
│   ├── test_data_normalization.py
│   ├── test_imports.py
│   ├── test_local.py
│   └── test_process_excel_file.py
└── utils/
    ├── bigquery_utils/
    │   ├── bigquery_utils.py
    │   └── __init__.py
    ├── error_handling_utils/
    │   └── __init__.py
    ├── etl_utils/
    │   ├── data_normalization.py
    │   ├── etl_utils.py
    │   └── __init__.py
    ├── __init__.py
    └── logging_utils/
        └── __init__.py

Packages in Use: (List the packages from your requirements.txt file here. If that file is not current, you can regenerate it by running pip freeze > requirements.txt in the project's virtual environment.)

Current Focus:

  • Implementing and testing the process_excel_file function in operator_rsi_betrivers.py.
  • Ensuring correct schema loading and application during data processing.
  • Setting up a robust testing framework using pytest.
  • Developing a local testing workflow that simulates the Cloud Function environment.
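To make the schema-driven step concrete, here is a simplified, pure-Python sketch of normalizing rows against a JSON schema. The schema layout used here (a `fields` list with `name` and `type` keys) is an assumption, since cdor_schema_v2.json is not shown:

```python
# Simplified normalization against a JSON-style schema. The schema shape
# below is assumed; the real cdor_schema_v2.json may differ.
CASTS = {"STRING": str, "INTEGER": int, "FLOAT": float}

def normalize_rows(rows, schema):
    """Clean column keys and cast values per the schema's field types."""
    types = {f["name"]: f["type"] for f in schema["fields"]}
    out = []
    for row in rows:
        clean = {}
        for key, value in row.items():
            # Normalize the header: trim, lowercase, snake_case.
            name = key.strip().lower().replace(" ", "_")
            if name in types and value is not None:
                clean[name] = CASTS[types[name]](value)
        out.append(clean)
    return out
```

Keeping the normalization logic free of cloud dependencies like this is what makes the local pytest workflow straightforward.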

Key Requirements:

  1. The system should handle both file path and DataFrame inputs for flexibility in testing and production.
  2. Data processing should include cleaning, normalization, and type conversion based on the schema.
  3. The implementation should be modular and follow Python best practices.
  4. Comprehensive unit and integration tests should be developed alongside the main code.
  5. The system should be easily extendable to handle data from other gaming operators in the future.
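Requirement 1 (accepting either a file path or a DataFrame) is commonly handled with an isinstance dispatch. The sketch below assumes pandas and stands in for the real schema-driven processing with a simple column cleanup; it is illustrative, not the project's implementation:

```python
import pandas as pd

def process_excel_file(source, schema=None):
    """Accept either a path to an Excel workbook or an in-memory DataFrame.

    Sketch of the dual-input pattern; the schema-driven normalization
    (the `schema` argument) is represented here by column cleanup only.
    """
    if isinstance(source, pd.DataFrame):
        # Copy so test fixtures passed in are never mutated.
        df = source.copy()
    else:
        # Path-like input: read the workbook (needs openpyxl for .xlsx).
        df = pd.read_excel(source)
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    return df
```

In production the Cloud Function passes a downloaded file path; in tests, a small hand-built DataFrame avoids any file I/O.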

Development Approach:

  • Test-Driven Development (TDD): writing tests before implementing features.
  • Iterative development, focusing on one component at a time.
  • Regular refactoring to maintain code quality and readability.
  • Continuous integration practices, running tests automatically on code changes.
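In the TDD spirit above, a first pytest test for the header-normalization behavior could look like this. The function under test is stubbed inline so the example is self-contained; in the project it would be imported from operators.operator_rsi_betrivers, and the expected names are illustrative:

```python
# Illustrative pytest-style tests (the helper is stubbed here so the
# example runs standalone; real tests would import it from the project).
def clean_column(name: str) -> str:
    return name.strip().lower().replace(" ", "_")

def test_clean_column_normalizes_whitespace_and_case():
    assert clean_column(" Total Wagers ") == "total_wagers"

def test_clean_column_is_idempotent():
    # Re-cleaning an already-clean name must not change it.
    assert clean_column(clean_column("Bet Amount")) == "bet_amount"
```

pytest discovers these by the `test_` prefix, so they slot directly into the tests/ directory shown in the project structure.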

Next Steps:

  1. Review and refine the current process_excel_file function implementation.
  2. Develop comprehensive tests for the data processing pipeline.
  3. Implement local testing utilities to simulate the Cloud Function environment.
  4. Gradually build out the BigQuery integration components with appropriate tests.

Please assist in developing this system, focusing on best practices, test-driven development, and creating a robust, maintainable codebase.
