Chapter 2 - Using Data

This tutorial is based on example code which can be found in the TRAC GitHub Repository under examples/models/python.

Wrap existing code

In the previous tutorial, model code was written directly in the run_model() method of the model class. An alternative approach is to put the model code in a separate class or function, which can be called by run_model(). This can be useful if you have a library of existing model code that you want to wrap with the TRAC model API.

If you are wrapping code in this way, it is important that all the required inputs are passed to the top-level class or function as parameters, as shown in this example.

src/tutorial/using_data.py
def calculate_profit_by_region(
        customer_loans: pd.DataFrame,
        eur_usd_rate: float,
        default_weighting: float,
        filter_defaults: bool):

    """
    Aggregate expected profit by region on a book of customer loans
    Use a weighting factor for bad loans and report results in USD
    Optionally, bad loans can be filtered from the results
    """

    if filter_defaults:
        customer_loans = customer_loans[customer_loans["loan_condition_cat"] == 0]

    # Build a weighting vector, use default_weighting for bad loans and 1.0 for good loans
    condition_weighting = customer_loans["loan_condition_cat"] \
        .apply(lambda c: decimal.Decimal(default_weighting) if c > 0 else decimal.Decimal(1))

    customer_loans["gross_profit_unweighted"] = customer_loans["total_pymnt"] - customer_loans["loan_amount"]
    customer_loans["gross_profit_weighted"] = customer_loans["gross_profit_unweighted"] * condition_weighting
    customer_loans["gross_profit"] = customer_loans["gross_profit_weighted"] * decimal.Decimal(eur_usd_rate)

    profit_by_region = customer_loans \
        .groupby("region", as_index=False) \
        .aggregate({"gross_profit": "sum"})

    return profit_by_region

Defining model requirements

Now let’s write the TRAC model wrapper class. The previous tutorial showed how to define parameters, so we can use the same syntax here. We’ll define the three parameters needed by the model function:

class UsingDataModel(trac.TracModel):

    def define_parameters(self) -> tp.Dict[str, trac.ModelParameter]:

        return trac.define_parameters(

            trac.P("eur_usd_rate", trac.FLOAT,
                   label="EUR/USD spot rate for reporting"),

            trac.P("default_weighting", trac.FLOAT,
                   label="Weighting factor applied to the profit/loss of a defaulted loan"),

            trac.P("filter_defaults", trac.BOOLEAN,
                   label="Exclude defaulted loans from the calculation",
                   default_value=False))

The example model function has one data input, which is a table called customer_loans. The function define_input_table() in the TRAC API allows us to define a tabular dataset for use as a model input, which is exactly what is needed. Each field is defined using the shorthand function trac.F(). This approach works well for small models with simple schemas (the Schema files section below shows how to manage more complex schemas using schema files).

Every field must have a name, type and label. Only scalar types are allowed for fields in table schemas; it is not possible to define a field with a compound type such as MAP or ARRAY.

In this example the dataset has a natural business key, so we can mark this in the schema. Business key fields cannot contain nulls or duplicate records. Defining a business key is optional; if the dataset doesn’t have a natural business key, there is no need to create one. There are also two categorical fields in this dataset, which can be marked in the schema as well. Setting the business key and categorical flags allows for more meaningful outputs, for example by making information available to a UI for sorting and filtering, and TRAC may also perform some optimisations using these flags. As a general rule, define business key or categorical fields where they are a natural expression of the data.

When the customer_loans dataset is accessed at runtime, TRAC guarantees it is supplied with exactly this arrangement of columns: the order, casing and data types will be exactly as defined. Order and case are treated leniently: if the incoming dataset has a different field order or casing, the fields will be reordered and renamed, and any extra fields will be dropped. Data types are also guaranteed to match the schema.

For models running locally, the --dev-mode option enables more lenient handling of data types. In this mode, TRAC will attempt to convert data to the specified field types, for example by parsing dates stored as strings or casting integers to floats. Conversions that fail or lose data are not allowed; if the conversion succeeds, the dataset presented to the model is guaranteed to match the schema. This option can be very useful for local development when data is held in CSV files. Models launched using launch_model() run in dev mode by default, so they will use lenient type handling for input files.
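
As an illustration of the kind of conversion dev mode performs, here is a short sketch using plain Pandas and hypothetical data (this is not TRAC's internal implementation):

import pandas as pd

# Hypothetical raw CSV data: dates arrive as strings, amounts arrive as integers
raw = pd.DataFrame({
    "id": ["A001", "A002"],
    "loan_date": ["2021-01-15", "2021-02-20"],
    "loan_amount": [1000, 2500]})

# Dev mode attempts conversions of this kind to line the data up with the schema,
# e.g. parsing dates stored as strings and casting integers to floats
converted = raw.assign(
    loan_date=pd.to_datetime(raw["loan_date"]),
    loan_amount=raw["loan_amount"].astype("float64"))

# Conversions that fail or would lose data are rejected with an error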

    def define_inputs(self) -> tp.Dict[str, trac.ModelInputSchema]:

        customer_loans = trac.define_input_table(
            trac.F("id", trac.STRING, label="Customer account ID", business_key=True),
            trac.F("loan_amount", trac.DECIMAL, label="Principal loan amount"),
            trac.F("total_pymnt", trac.DECIMAL, label="Total amount repaid"),
            trac.F("region", trac.STRING, label="Customer home region", categorical=True),
            trac.F("loan_condition_cat", trac.INTEGER, label="Loan condition category"))

        return {"customer_loans": customer_loans}

To define the model outputs we can use define_output_table(), which is identical to define_input_table() except that it returns an output schema. There are a few special cases where input and output schemas need to be treated differently, but in the majority of cases they are the same.

Models are free to define multiple outputs if required, but this example only has one.

    def define_outputs(self) -> tp.Dict[str, trac.ModelOutputSchema]:

        profit_by_region = trac.define_output_table(
            trac.F("region", trac.STRING, label="Customer home region", categorical=True),
            trac.F("gross_profit", trac.DECIMAL, label="Total gross profit"))

        return {"profit_by_region": profit_by_region}

Now that the parameters, inputs and outputs of the model are defined, we can implement the run_model() method.

Running the model

To implement the run_model() method, first we get the three model parameters, which come back with the correct Python types: eur_usd_rate and default_weighting will be floats, and filter_defaults will be a bool.

To get the input dataset we use the method get_pandas_table(). The dataset name is the same name we used in define_inputs(). This will create a Pandas dataframe whose column layout and data types match what we defined in the schema for this input.

    def run_model(self, ctx: trac.TracContext):

        eur_usd_rate = ctx.get_parameter("eur_usd_rate")
        default_weighting = ctx.get_parameter("default_weighting")
        filter_defaults = ctx.get_parameter("filter_defaults")

        customer_loans = ctx.get_pandas_table("customer_loans")

Once all the inputs and parameters are available, we can call the model function. Since the inputs and parameters are supplied using the correct native types, no further conversion is necessary; they can be passed straight into the model code.

        profit_by_region = calculate_profit_by_region(
            customer_loans, eur_usd_rate,
            default_weighting, filter_defaults)

The model code has produced a Pandas dataframe that we want to record as an output. To do this, we can use put_pandas_table(). The dataframe should match exactly what is defined in the output schema. If any columns are missing or have the wrong data type, TRAC will report a runtime validation error. TRAC does provide some leniency with data types for outputs: for example, if a timestamp field is supplied with the wrong precision, or an integer column is supplied in place of decimals, TRAC will perform conversions. Any conversion that would result in loss of data (e.g. values outside the allowed range) will result in an error. The output dataset passed on to the platform is guaranteed to have the correct data types as specified in define_outputs().

If the column order or casing is wrong, or if there are extra columns, the output will be allowed but a warning will appear in the logs. Columns will be reordered and renamed to the correct case, and any extra columns will be dropped.

        ctx.put_pandas_table("profit_by_region", profit_by_region)

The model can be launched locally using launch_model().

if __name__ == "__main__":
    import tracdap.rt.launch as launch
    launch.launch_model(UsingDataModel, "config/using_data.yaml", "config/sys_config.yaml")

Configure local data

To pass data into the local model, a little more config is needed in the sys_config file to define a storage bucket. In TRAC, a storage bucket can be any storage location that can hold files. Typically this would be bucket storage on a cloud platform, but you can also use local disks or other storage protocols such as network storage or HDFS, so long as the right storage plugins are available.

This example sets up one storage bucket called example_data. Since we are going to use a local disk, the storage protocol is LOCAL. The rootPath property says where this storage bucket will be on disk - a relative path is taken relative to the sys_config file by default, or you can specify an absolute path here to avoid confusion.

The default bucket is also where output data will be saved. In this example we have only one storage bucket configured, which is used for both inputs and outputs, so we mark that as the default.

config/sys_config.yaml
storage:

  defaultBucket: example_data
  defaultFormat: CSV

  buckets:

    example_data:
      protocol: LOCAL
      properties:
        rootPath: ../data

In the job_config file we need to specify what data to use for the model inputs and outputs. Each input named in the model must have an entry in the inputs section, and each output in the outputs section. In this example we are using CSV files and just specify a simple path for each input and output.

Input and output paths are always relative to the data storage location; it is not possible to use absolute paths for model inputs and outputs in a job config. This is part of how the TRAC framework operates: data is always accessed from a storage location, with locations defined in the system config.

The model parameters are also set in the job config, in the same way as the previous tutorial.

config/using_data.yaml
job:
  runModel:

    parameters:
      eur_usd_rate: 1.2071
      default_weighting: 1.5
      filter_defaults: false

    inputs:
      customer_loans: "inputs/loan_final313_100.csv"

    outputs:
      profit_by_region: "outputs/using_data/profit_by_region.csv"

These simple config files are enough to run a model locally using sample data in CSV files. Output files will be created when the model runs; if you run the model multiple times, the outputs will be suffixed with a number.

See also

Full source code is available for the Using Data example on GitHub

Schema files

For small models like this example, defining schemas in code is simple. However, for more complex models in real-world situations, schemas are often quite large and can be reused across a set of related models. To cater for more complex schemas, TRAC allows schemas to be defined in schema files.

A schema file is just a CSV file that lists the field names, types and labels for a dataset, as well as any optional flags. Here are the schema files for the input and output datasets of this model; as you can see, they provide the same information that was defined in code earlier.

customer_loans.csv

field_name          field_type  label                    categorical  business_key  not_null  format_code
id                  STRING      Customer account ID                   true
loan_amount         DECIMAL     Principal loan amount
total_pymnt         DECIMAL     Total amount repaid
region              STRING      Customer home region     true
loan_condition_cat  INTEGER     Loan condition category

profit_by_region.csv

field_name    field_type  label                 categorical  business_key  not_null  format_code
region        STRING      Customer home region  true
gross_profit  DECIMAL     Total gross profit

The default values for the field flags are categorical = false and business_key = false; not_null defaults to true when business_key = true and to false otherwise. The TRAC platform ignores the format_code field, but it can be used to describe how data is displayed in client applications.

To use schema files, they must be included as part of your Python package structure. That means they must be in the source tree with your Python code, in a package with an __init__.py file. If you are building your model packages as Python wheels or Conda packages, the schema files must be included as part of the build.

To add the schema files into the example project we can create a sub-package called “tutorial.schemas”, which would look like this:

examples-project
├── config
│   ├── sys_config.yaml
│   ├── using_data.yaml
│   └── ...
├── src
│   └── tutorial
│       ├── __init__.py
│       ├── using_data.py
│       └── schemas
│           ├── __init__.py
│           ├── customer_loans.csv
│           └── profit_by_region.csv
├── test
│   ├── test_using_data_model.py
│   └── ...
├── requirements.txt
├── setup.py
└── ...
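
If the project is built with setuptools, one way to make sure the schema CSV files end up in the built wheel is to declare them as package data. This is a minimal sketch assuming the src layout shown above, not the actual build configuration of the example project:

# setup.py (sketch) - include the schema CSV files in built packages
from setuptools import setup, find_packages

setup(
    name="tutorial",
    version="1.0.0",
    package_dir={"": "src"},
    packages=find_packages(where="src"),
    package_data={"tutorial.schemas": ["*.csv"]})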

Now we can re-write our model to use the new schema files. First we need to import the schemas package:

src/tutorial/schema_files.py
import tutorial.schemas as schemas

Then we can load schemas from the schemas package in the define_inputs() and define_outputs() methods:

    def define_inputs(self) -> tp.Dict[str, trac.ModelInputSchema]:

        customer_loans = trac.load_schema(schemas, "customer_loans.csv")

        return {"customer_loans": trac.ModelInputSchema(customer_loans)}

    def define_outputs(self) -> tp.Dict[str, trac.ModelOutputSchema]:

        profit_by_region = trac.load_schema(schemas, "profit_by_region.csv")

        return {"profit_by_region": trac.ModelOutputSchema(profit_by_region)}

Notice that load_schema() is the same for input and output schemas, so we need to wrap the loaded schema explicitly using ModelInputSchema or ModelOutputSchema.

See also

Full source code is available for the Schema Files example on GitHub