Building A New File Based Data Connector/Processor

In order to process files of a new type, such as results from a test we've never processed before, you'll have to define how to process that file within this module. The steps below will help you accomplish this. The majority of this work occurs in the file ~/SzfdData/config/module.config.php and other areas of the SzfdData module.

It is a prerequisite that you have access to our szf code repository to perform these steps. Please get in touch with Jake Alves or another lead engineer to get access to this code repo.

1: Create a schema file by looking at the file you are trying to upload

Some academic tests provide help documentation for the test reports they issue, and some do not. Here is an example of the California State Test's documentation for their testing reports.

If at all possible, try to locate documentation like this to help with defining your schema file. If you do not have access to such documentation, you will need to look at an example of the file you are trying to process and infer the data type of each field as you define the schema file.

Heidi Williamson has written an excellent guide for defining the actual schema file here. Once you have created your schema file by following this guide, it should be saved in the following directory in this module: ~/SzfdData/config/schemas/
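
To make this concrete: a schema file is itself a CSV describing each field in the source file (hence names like 'mme_13_schema.csv'). The sketch below is purely illustrative -- the column layout and field names here are invented, and the authoritative format is the one described in the guide above:

    field_name,data_type,length
    student_id,varchar,20
    test_date,datetime,
    scale_score,int,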

2: Determine the file type for the new data source

The ~/SzfdData/config/module.config.php file provides all of the parameters our file processor needs to process files of a given type. First, determine what type of file you are trying to process by looking at the 'types' array.

As of this writing, here is what the 'types' array contains:

    'types' => array(
        'csv' => array('adapter' => '\SzfdData\Adapter\File\Csv',
            'chunk_size' => 500),
        'xls' => array('adapter' => '\SzfdData\Adapter\File\Excel',
            'options'   => array(
                'reader'    => 'Excel5'
            ),
            'chunk_size' => 500),
        'xlsx' => array('adapter' => '\SzfdData\Adapter\File\Excel',
            'options'   => array(
                'reader'    => 'Excel2007'
            ),
            'chunk_size' => 500),
        'fixed-width' => array('adapter' => '\SzfdData\Adapter\File\FixedWidth',
            'chunk_size' => 500),
        'tab-delimited' => array('adapter' => '\SzfdData\Adapter\File\Tab',
            'chunk_size' => 500),
        'json' => array('adapter' => '\SzfdData\Adapter\Json',
            'chunk_size' => 500),
    ),

As we can see from this array, we currently support Excel files in both formats (.xls and .xlsx), JSON data, fixed-width and tab-delimited text files, and CSV files. If the type of file you are trying to process is not listed here, you or someone on our engineering team will have to write a new file adapter.
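
For example, if we needed to process pipe-delimited files, a new entry would be added to this array. The adapter class below is hypothetical -- it does not exist in the module yet and would have to be written before this entry could work:

        'pipe-delimited' => array('adapter' => '\SzfdData\Adapter\File\Pipe', // hypothetical adapter class
            'chunk_size' => 500),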

3: Create a new definition within module.config.php's 'file_templates' array

In the same file (~/SzfdData/config/module.config.php), find the 'file_templates' array. Here you will see several examples of how other file processors have been defined, such as:

            'MME' => array('type' => 'csv',
                'schema' => 'mme_13_schema.csv',
                'parser' => '\SzfdData\Filter\Csv',
                'options'   => array(
                    'header_row'    => true
                ),
                'process' => array(
                    'meap/create_mme_rawdata'
                ),
                'delete' => array()
            ),

Below is the same configuration definition, annotated.

	'MME' => array('type' => 'csv',

'MME' is the label for the file and will appear in the File Uploader options. 'csv' is the file type -- this processor can only work with files ending in .csv.

		'schema' => 'mme_13_schema.csv',

'schema' should point to the schema file we defined in step 1 of this guide. You should not specify the file path, just the file name.

		'parser' => '\SzfdData\Filter\Csv',

'parser' usually points to a class of the same type as the file -- you may look for available parsers in ~/SzfdData/src/SzfdData/Filter.

        'options'   => array(
            'header_row'    => true
        ),

'options' allows you to specify certain behaviors of the parser. Currently the only available parameter is 'header_row', which indicates whether the file contains column headers (true) or not (false).

        'process' => array(
            'meap/create_mme_rawdata'
        ),

'process' lists all of the sprocs, or stored procedures, that should be run on a file. This file processor currently defines only one process, which copies raw data from the temporary table (created when a file is uploaded) into the data warehouse.

        'delete' => array(),

'delete' lists all of the sprocs that should be run when a file is deleted. Currently this file processor has no delete processes, so uploaded data will never be deleted. However, the following sproc definitions for CST processing...

                'process' => array(
                    'cst/insert_StateTest_CA_STAR_2011',
                    'cst/insert_CSTScores_2011',
                    'cst/load_dimschool',
                    'cst/load_dimschoolcalendar',
                    'cst/load_dimsecurity_admin',
                    'cst/load_dimstudent_2011',
                    'cst/load_dimtest',
                    'cst/load_dimtestprevious',
                    'cst/drop_csixs',
                    'cst/load_factstudentachievement',
                    'cst/update_testgrowth',
                    'cst/create_csixs',
                    'cst/update_report_permissions',
                    'cst/load_bridgesecurity_teacher',
                    'cst/complete_file_processing'
                ),
                'delete' => array(
                    'delete/delete_bridgesecurity_teacher',
                    'delete/delete_dimstudent',
                    'delete/delete_dimschoolcalendar',
                    'delete/delete_dimschool',
                    'delete/delete_cstscores_from_template',
                    'delete/drop_CSIXs',
                    'delete/delete_factstudentachievement',
                    'delete/create_CSIXs'
                )

... indicate that file processing can involve several sprocs impacting several data tables, and that the 'process' and 'delete' arrays do not necessarily have a 1:1 relationship -- that is, there does not have to be one deletion sproc for every processing sproc.
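
Putting these pieces together, a complete definition for a hypothetical new data source might look like the sketch below. Every name here -- the 'NYSA' label, the schema file name, and the sproc paths -- is illustrative, so substitute the names you actually created in the earlier steps:

    'NYSA' => array('type' => 'csv',
        'schema' => 'nysa_14_schema.csv',        // the schema file saved in ~/SzfdData/config/schemas/
        'parser' => '\SzfdData\Filter\Csv',
        'options' => array(
            'header_row' => true                 // the source file's first row contains column names
        ),
        'process' => array(
            'statetest_ny/create_nysa_rawdata'   // hypothetical source-to-raw sproc (see step 5)
        ),
        'delete' => array(
            'delete/delete_nysa_rawdata'         // hypothetical cleanup sproc
        )
    ),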

4: Test your new file processor

Once you have saved the schema in ~/SzfdData/config/schemas and the file processor's configuration in ~/SzfdData/config/module.config.php, you are ready to test your work. Testing locally on your machine before committing and deploying your work is always preferable to testing on a live deployment.

5: Write new sprocs as needed

Sprocs are basically stored SQL scripts that are executed in the order they are defined within the 'process' and 'delete' arrays. Here's an example of what a typical source-to-raw sproc looks like:

INSERT INTO [Schoolzilla_DW].[statetest_ny].[statetest_ny_NYSA_RawData_L2RPT]
    ([sz_CustID]
    ,[sz_DataSetID]
    ,[sz_LastUpdated]
    ,[sz_guid]
    ,[sz_row]
    ,[DATA_COL_1]
    ,[DATA_COL_2])
  SELECT ds.customer_id as [sz_CustID]
    ,ds.id as [sz_DataSetID]
    ,GETDATE() as [sz_LastUpdated]
    ,[sz_guid]
    ,[sz_row]
    ,[DATA_COL_1]
    ,[DATA_COL_2]
  FROM {table_name}
    LEFT JOIN [Schoolzilla_Web_Data].[dbo].[data_sets] ds
    ON ds.id = {data_set_id}

All this sproc does is read data from the temporary table, indicated by the following part of the sproc:

  FROM {table_name}
    LEFT JOIN [Schoolzilla_Web_Data].[dbo].[data_sets] ds
    ON ds.id = {data_set_id}

These SQL variables {table_name} and {data_set_id} are available to every sproc defined and saved within the following directories.

	~/SzfdData/sql/SqlServer
	~/SzfdData/sql/Mysql
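
A deletion sproc has access to the same variables. As a minimal sketch -- the table name is taken from the insert example above, and the sproc itself is hypothetical -- a cleanup sproc for that raw data table could be as simple as:

    DELETE FROM [Schoolzilla_DW].[statetest_ny].[statetest_ny_NYSA_RawData_L2RPT]
    WHERE [sz_DataSetID] = {data_set_id}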

Having two folders does not mean that you have to define every sproc in its entirety twice. Generally we define the sproc once for SqlServer and then load that sproc definition into Mysql. Here is an example of a Mysql sproc:

    SELECT 'create_meap_access_rawdata'

That's it -- one line. Essentially, all it does is select the stored procedure already defined for SqlServer and run it with Mysql syntax.
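
Following the same pattern, the Mysql counterpart to the hypothetical 'create_nysa_rawdata' sproc sketched in step 3 would also be a single line:

    SELECT 'create_nysa_rawdata'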
