Chapter 2 - Using Data¶
This tutorial is based on example code which can be found in the TRAC GitHub Repository under examples/models/python.
Wrap existing code¶
In the previous tutorial, model code was written directly in the run_model() method of the model class. An alternative approach is to put the model code in a separate class or function, which can be called by run_model(). This can be useful if you have a library of existing model code that you want to wrap with the TRAC model API.
If you are wrapping code in this way, it is important that all the required inputs are passed to the top-level class or function as parameters, as shown in this example.
def calculate_profit_by_region(
        customer_loans: pd.DataFrame,
        eur_usd_rate: float,
        default_weighting: float,
        filter_defaults: bool):

    """
    Aggregate expected profit by region on a book of customer loans
    Use a weighting factor for bad loans and report results in USD
    Optionally, bad loans can be filtered from the results
    """

    if filter_defaults:
        customer_loans = customer_loans[customer_loans["loan_condition_cat"] == 0]

    # Build a weighting vector, use default_weighting for bad loans and 1.0 for good loans
    condition_weighting = customer_loans["loan_condition_cat"] \
        .apply(lambda c: decimal.Decimal(default_weighting) if c > 0 else decimal.Decimal(1))

    customer_loans["gross_profit_unweighted"] = customer_loans["total_pymnt"] - customer_loans["loan_amount"]
    customer_loans["gross_profit_weighted"] = customer_loans["gross_profit_unweighted"] * condition_weighting
    customer_loans["gross_profit"] = customer_loans["gross_profit_weighted"] * decimal.Decimal(eur_usd_rate)

    profit_by_region = customer_loans \
        .groupby("region", as_index=False) \
        .aggregate({"gross_profit": "sum"})

    return profit_by_region
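Because the function depends only on its arguments, it can be exercised outside the TRAC runtime, for example in a quick script or unit test. The snippet below is a minimal sketch with made-up sample values (the data is purely illustrative and is not part of the example project):

import decimal
import pandas as pd

# Hypothetical sample data, for illustration only - column names match the customer_loans schema
sample_loans = pd.DataFrame({
    "id": ["ACC001", "ACC002", "ACC003"],
    "loan_amount": [decimal.Decimal("1000.00"), decimal.Decimal("2500.00"), decimal.Decimal("1800.00")],
    "total_pymnt": [decimal.Decimal("1150.00"), decimal.Decimal("2300.00"), decimal.Decimal("2000.00")],
    "region": ["north", "south", "north"],
    "loan_condition_cat": [0, 1, 0]})

# Call the model function directly, no TRAC runtime involved
result = calculate_profit_by_region(
    sample_loans, eur_usd_rate=1.2071,
    default_weighting=1.5, filter_defaults=False)

print(result)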
Defining model requirements¶
Now let’s write the TRAC model wrapper class. The previous tutorial showed how to define parameters, so we can use the same syntax here. We’ll define the three parameters needed by the model function:
class UsingDataModel(trac.TracModel):

    def define_parameters(self) -> tp.Dict[str, trac.ModelParameter]:

        return trac.define_parameters(

            trac.P("eur_usd_rate", trac.FLOAT,
                   label="EUR/USD spot rate for reporting"),

            trac.P("default_weighting", trac.FLOAT,
                   label="Weighting factor applied to the profit/loss of a defaulted loan"),

            trac.P("filter_defaults", trac.BOOLEAN,
                   label="Exclude defaulted loans from the calculation",
                   default_value=False))
The example model function has one data input, which is a table called customer_loans. The function define_input_table() in the TRAC API allows us to define a tabular dataset for use as a model input, which is exactly what is needed. Each field is defined using the shorthand function trac.F(). This approach works well for small models with simple schemas (the next tutorial discusses managing more complex models using schema files).
Every field must have a name, type and label. Only scalar types are allowed for fields in table schemas - it is not possible to define a field with a compound type such as MAP or ARRAY.
In this example the dataset has a natural business key, so we can mark this in the schema. Business key fields cannot contain nulls or duplicate records. Defining a business key is optional; if the dataset doesn’t have a natural business key there is no need to create one. There are two categorical fields in this dataset which can be marked in the schema as well. Setting the business key and categorical flags allows for more meaningful outputs, for example by making information available to a UI for sorting and filtering, and TRAC may also perform some optimisations using these flags. As a general rule, define business key or categorical fields where they are a natural expression of the data.
When the customer_loans dataset is accessed at runtime, TRAC will guarantee it is supplied with exactly this arrangement of columns: the order, case and data types will be as defined in the schema. Order and case in the incoming data are treated leniently - if the incoming dataset has a different field order or casing, the fields will be reordered and renamed, and any extra fields will be dropped. Data types are guaranteed to match the schema exactly.
For models running locally, the --dev-mode option enables more lenient handling of data types. In this mode, TRAC will attempt to convert data to the specified field types, for example by parsing dates stored as strings or casting integers to floats. Conversions that fail or lose data will not be allowed; if the conversion succeeds, the dataset presented to the model is guaranteed to match the schema. This option can be very useful for local development if data is held in CSV files. Models launched using launch_model() run in dev mode by default and will use lenient type handling for input files.
    def define_inputs(self) -> tp.Dict[str, trac.ModelInputSchema]:

        customer_loans = trac.define_input_table(
            trac.F("id", trac.STRING, label="Customer account ID", business_key=True),
            trac.F("loan_amount", trac.DECIMAL, label="Principal loan amount"),
            trac.F("total_pymnt", trac.DECIMAL, label="Total amount repaid"),
            trac.F("region", trac.STRING, label="Customer home region", categorical=True),
            trac.F("loan_condition_cat", trac.INTEGER, label="Loan condition category"))

        return {"customer_loans": customer_loans}
To define the model outputs we can use define_output_table(), which is identical to define_input_table() save for the fact that it returns an output schema. There are a few special cases where input and output schemas need to be treated differently, but in the majority of cases they are the same. Models are free to define multiple outputs if required, but this example only has one.
    def define_outputs(self) -> tp.Dict[str, trac.ModelOutputSchema]:

        profit_by_region = trac.define_output_table(
            trac.F("region", trac.STRING, label="Customer home region", categorical=True),
            trac.F("gross_profit", trac.DECIMAL, label="Total gross profit"))

        return {"profit_by_region": profit_by_region}
Now that the parameters, inputs and outputs of the model are defined, we can implement the run_model() method.
Running the model¶
To implement the run_model() method, first we get the three model parameters, which will come back with the correct Python types - eur_usd_rate and default_weighting will be floats, filter_defaults will have type bool.
To get the input dataset we use the method get_pandas_table(). The dataset name is the same name we used in define_inputs(). This will create a Pandas dataframe, with a column layout and data types that match what we defined in the schema for this input.
    def run_model(self, ctx: trac.TracContext):

        eur_usd_rate = ctx.get_parameter("eur_usd_rate")
        default_weighting = ctx.get_parameter("default_weighting")
        filter_defaults = ctx.get_parameter("filter_defaults")

        customer_loans = ctx.get_pandas_table("customer_loans")
Once all the inputs and parameters are available, we can call the model function. Since all the inputs and parameters are supplied using the correct native types, no further conversion is necessary; they can be passed straight into the model code.
        profit_by_region = calculate_profit_by_region(
            customer_loans, eur_usd_rate,
            default_weighting, filter_defaults)
The model code has produced a Pandas dataframe that we want to record as an output. To do this, we can use put_pandas_table(). The dataframe should match exactly what is defined in the output schema. If any columns are missing or have the wrong data type, TRAC will report a runtime validation error. When considering data types for outputs, TRAC does provide some leniency: for example, if a timestamp field is supplied with the wrong precision, or an integer column is supplied in place of decimals, TRAC will perform the conversion. Any conversion that would result in loss of data (e.g. values outside the allowed range) will result in an error. The output dataset passed on to the platform is guaranteed to have the correct data types as specified in define_outputs().
If the column order or casing is wrong, or if there are extra columns, the output will be allowed but a warning will appear in the logs. Columns will be reordered and converted to the correct case, and any extra columns will be dropped.
        ctx.put_pandas_table("profit_by_region", profit_by_region)
The model can be launched locally using launch_model().
if __name__ == "__main__":
    import tracdap.rt.launch as launch
    launch.launch_model(UsingDataModel, "config/using_data.yaml", "config/sys_config.yaml")
Configure local data¶
To pass data into the local model, a little bit more config is needed in the sys_config file to define a storage bucket. In TRAC, storage buckets can be any storage location that can hold files. Typically this would be bucket storage on a cloud platform, but you can also use local disks or other storage protocols such as network storage or HDFS, so long as the right storage plugins are available.
This example sets up one storage bucket called example_data. Since we are going to use a local disk, the storage protocol is LOCAL. The rootPath property says where this storage bucket will be on disk - a relative path is resolved relative to the sys_config file by default, or you can specify an absolute path here to avoid confusion.
The default bucket is also where output data will be saved. In this example we have only one storage bucket configured, which is used for both inputs and outputs, so we mark that as the default.
storage:

  defaultBucket: example_data
  defaultFormat: CSV

  buckets:

    example_data:
      protocol: LOCAL
      properties:
        rootPath: ../data
In the job_config file we need to specify what data to use for the model inputs and outputs. Each input named in the model must have an entry in the inputs section, and each output in the outputs section. In this example we are using CSV files and just specify a simple path for each input and output.
Input and output paths are always relative to the data storage location; it is not possible to use absolute paths for model inputs and outputs in a job config. This is part of how the TRAC framework operates: data is always accessed from a storage location, with locations defined in the system config.
The model parameters are also set in the job config, in the same way as the previous tutorial.
job:
  runModel:

    parameters:
      eur_usd_rate: 1.2071
      default_weighting: 1.5
      filter_defaults: false

    inputs:
      customer_loans: "inputs/loan_final313_100.csv"

    outputs:
      profit_by_region: "outputs/using_data/profit_by_region.csv"
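For reference, the input file is just a plain CSV whose header row matches the field names in the customer_loans schema. The rows below are purely illustrative - the values are made up and are not taken from the real sample file:

id,loan_amount,total_pymnt,region,loan_condition_cat
ACC001,1000.00,1150.00,north,0
ACC002,2500.00,2300.00,south,1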
These simple config files are enough to run a model locally using sample data in CSV files. Output files will be created when the model runs; if you run the model multiple times, the outputs will be suffixed with a number.
See also
Full source code is available for the Using Data example on GitHub
Schema files¶
For small models like this example, defining schemas in code is simple; however, for more complex models in real-world situations the schemas are often quite large and can be reused across a set of related models. To cater for more complex schemas, TRAC allows schemas to be defined in schema files.
A schema file is just a CSV file that lists the field names, types and labels for a dataset, as well as any other optional flags. Here are the schema files for the input and output datasets of this model; as you can see, they provide the same information that was defined in code earlier.
customer_loans.csv (input schema):

| field_name | field_type | label | categorical | business_key | not_null | format_code |
|---|---|---|---|---|---|---|
| id | STRING | Customer account ID | | true | | |
| loan_amount | DECIMAL | Principal loan amount | | | | |
| total_pymnt | DECIMAL | Total amount repaid | | | | |
| region | STRING | Customer home region | true | | | |
| loan_condition_cat | INTEGER | Loan condition category | | | | |

profit_by_region.csv (output schema):

| field_name | field_type | label | categorical | business_key | not_null | format_code |
|---|---|---|---|---|---|---|
| region | STRING | Customer home region | true | | | |
| gross_profit | DECIMAL | Total gross profit | | | | |
The default values for the field flags are categorical = false and business_key = false; not_null defaults to true if business_key = true, otherwise false. The TRAC platform ignores the format_code field, but it can be used to describe how data is displayed in client applications.
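On disk, each schema file is simply this table saved as CSV, with empty cells where a flag is not set. As a sketch (assuming the same column order as shown above), customer_loans.csv would look something like this:

field_name,field_type,label,categorical,business_key,not_null,format_code
id,STRING,Customer account ID,,true,,
loan_amount,DECIMAL,Principal loan amount,,,,
total_pymnt,DECIMAL,Total amount repaid,,,,
region,STRING,Customer home region,true,,,
loan_condition_cat,INTEGER,Loan condition category,,,,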
To use schema files, they must be included as part of your Python package structure. That means they must be in the source tree with your Python code, in a package with an __init__.py file. If you are building your model packages as Python wheels or Conda packages, the schema files must be included as part of the build.
To add the schema files into the example project we can create a sub-package called “tutorial.schemas”, which would look like this:
examples-project
├── config
│ ├── sys_config.yaml
│ ├── using_data.yaml
│ └── ...
├── src
│ └── tutorial
│ ├── __init__.py
│ ├── using_data.py
│ └── schemas
│ ├── __init__.py
│ ├── customer_loans.csv
│ └── profit_by_region.csv
├── test
│ ├── test_using_data_model.py
│ └── ...
├── requirements.txt
├── setup.py
└── ...
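If the project is built as a Python wheel, the schema CSVs also need to be declared as package data so they ship alongside the code. The following is a minimal sketch of how that might look with a setuptools-based setup.py; the package name and exact layout are assumptions, so adjust it to match your own build:

# setup.py - illustrative sketch only, adapt to your own packaging setup
from setuptools import setup, find_packages

setup(
    name="tutorial-models",                          # hypothetical package name
    version="0.1.0",
    package_dir={"": "src"},
    packages=find_packages(where="src"),
    include_package_data=True,
    package_data={"tutorial.schemas": ["*.csv"]})    # include the schema CSV files in the built package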
Now we can re-write our model to use the new schema files. First we need to import the schemas package:
import tutorial.schemas as schemas
Then we can load schemas from the schemas package in the define_inputs() and define_outputs() methods:
    def define_inputs(self) -> tp.Dict[str, trac.ModelInputSchema]:

        customer_loans = trac.load_schema(schemas, "customer_loans.csv")

        return {"customer_loans": trac.ModelInputSchema(customer_loans)}

    def define_outputs(self) -> tp.Dict[str, trac.ModelOutputSchema]:

        profit_by_region = trac.load_schema(schemas, "profit_by_region.csv")

        return {"profit_by_region": trac.ModelOutputSchema(profit_by_region)}
Notice that the load_schema() method is the same for input and output schemas, so we need to use ModelInputSchema and ModelOutputSchema explicitly.
See also
Full source code is available for the Schema Files example on GitHub