Training Machine Learning Models on the AAW¶
Info
Training machine learning models involves using algorithms to learn patterns and relationships in data. This process involves identifying features or variables that are relevant to the problem at hand and using these features to make predictions or classifications.
Why train with us?¶
Training machine learning models on the Advanced Analytics Workspace (AAW) has several advantages.
- Open Source: The AAW is an open source data platform hosted by Statistics Canada that provides secure (Protected B) access to a variety of data sources, including census data, surveys, and administrative records. This data can be used to train machine learning models and generate insights that can inform policy decisions and improve business processes.
- Versatile: The AAW is designed to handle large and complex datasets. It provides access to a range of advanced analytics tools, in any language you like, including Python, R, and SAS, which can be used to preprocess data, train machine learning models, and generate visualizations. Because the AAW leverages cloud technologies, users can scale up their computing power as needed.
- Secure: The AAW is a secure platform (Protected B) that adheres to the highest standards of data privacy and security. Data can be stored and processed on the platform without risk of unauthorized access or data breaches.
MLOps and Data Pipelines¶
Optimize Data Workflows
MLOps and data pipelines are important tools used in the field of data science to manage and optimize data workflows.
MLOps¶
MLOps refers to the set of practices and tools used to manage the entire lifecycle of a machine learning model. This includes everything from developing and training the model to deploying it in production and maintaining it over time. MLOps ensures that machine learning models are reliable, accurate, and scalable, and that they can be updated and improved as needed.
Data Pipelines¶
Data pipelines are a series of steps that help move data from one system or application to another. This includes collecting, cleaning, transforming, and storing data, as well as retrieving it when needed. Data pipelines are important for ensuring that data is accurate, reliable, and accessible to those who need it.
Automation and Reliability
MLOps and data pipelines help organizations manage the complex process of working with large amounts of data and developing machine learning models. By automating these processes and ensuring that data is accurate and reliable, organizations can save time and resources while making better decisions based on data-driven insights.
Why Containerized MLOps?¶
The advantages of using a containerized approach for training machine learning models with Argo Workflows include:
- Reproducibility: Containerizing the machine learning model and its dependencies ensures that the environment remains consistent across different runs, making it easy to reproduce results.
- Scalability: Argo Workflows can orchestrate parallel jobs and complex workflows, making it easy to scale up the training process as needed.
- Portability: Containers can be run on any platform that supports containerization, making it easy to move the training process to different environments or cloud providers.
- Collaboration: By pushing the container to a container registry, other users can easily download and use the container for their own purposes, making it easy to collaborate on machine learning projects.
Argo Workflows and containerization provide a powerful and flexible approach for training machine learning models. By leveraging these tools, data scientists and machine learning engineers can build, deploy, and manage machine learning workflows with ease and reproducibility.
How to Train Models¶
There are multiple ways to train machine learning models, and it is not our place to tell anyone how to do it. That being said, we have provided below a couple of guides on how to train machine learning models using the tools available on the AAW. The first tutorial shows how to train a simple model directly in a JupyterLab notebook. The second tutorial assumes the user is more advanced and is interested in defining an MLOps pipeline for training models using Argo Workflows.
Create a Notebook Server on the AAW¶
Notebook Servers
Regardless of whether you plan on working in JupyterLab, R Studio or something more advanced with Argo Workflows, you'll need the appropriate notebook server. Follow the instructions found here to get started.
Using JupyterLab¶
JupyterLab is Popular
Training machine learning models with JupyterLab is a popular approach among data scientists and machine learning engineers.
Here you will find the steps required to train a machine learning model with JupyterLab on the AAW. Because we are a multi-lingual environment, we've done our best to provide code examples in our most popular languages: Python, R and SAS.
1. Import the required libraries¶
Once you have a JupyterLab session running, you need to import the required libraries for your machine learning model. This could include libraries such as NumPy, Pandas, Scikit-learn, TensorFlow, or PyTorch. If you are using R, you'll want tidyverse, caret and janitor.
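As an illustration, a minimal Python setup for a scikit-learn workflow might look like the following; the specific libraries are assumptions, so substitute whatever your project needs:

```python
# Illustrative imports for a simple scikit-learn workflow.
# Swap in tensorflow or torch here if you are using a deep learning framework.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
```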
About the Code
The code examples you see in this document and throughout the documentation are for illustrative purposes to get you started on your projects. Depending on the specific task or project, other libraries and steps may be required.
2. Load and preprocess the data¶
Next, you need to load and preprocess the data you'll be using to train your machine learning model. This could include data cleaning, feature extraction, and normalization. The exact preprocessing steps you'll need to perform will depend on the specific dataset you're working with, the requirements of your machine learning model and the job to be done.
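For example, with pandas you might load a CSV file and do some light cleaning. The file path, column names, and cleaning steps below are hypothetical placeholders, not a prescribed workflow:

```python
import pandas as pd

# Load a dataset from a CSV file (hypothetical path and column names).
df = pd.read_csv("data/my_dataset.csv")

# Basic cleaning: drop duplicate rows and rows with missing values.
df = df.drop_duplicates().dropna()

# Separate the features from the target column we want to predict.
X = df.drop(columns=["target"])
y = df["target"]
```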
3. Split the data into training and testing sets¶
Once the data is preprocessed, you need to split it into training and testing sets. The training set will be used to train the machine learning model, while the testing set will be used to evaluate its performance.
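A minimal sketch using scikit-learn's train_test_split, assuming the X and y objects from the previous step:

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the data for testing; fix the random seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```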
Note
We split the data into training and testing sets using the train_test_split function from scikit-learn, which randomly splits the data into two sets based on the specified test size and random seed.
4. Define and train the machine learning model¶
With the data split, you can now define and train your machine learning model using the training set. This could involve selecting the appropriate algorithm, hyperparameter tuning, and cross-validation.
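For example, you might fit a random forest with a small hyperparameter grid searched by cross-validation. The model and grid below are illustrative choices, not a recommendation:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative hyperparameter grid; tailor it to your data and model.
param_grid = {"n_estimators": [100, 200], "max_depth": [None, 10]}

# 5-fold cross-validated grid search, fit on the training set only.
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)

model = search.best_estimator_
```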
5. Evaluate the model¶
After training the model, you need to evaluate its performance on the testing set. This will give you an idea of how well the model will perform on new, unseen data.
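Continuing the sketch above, evaluation on the held-out test set might look like this:

```python
from sklearn.metrics import accuracy_score, classification_report

# Score the fitted model on data it has never seen.
y_pred = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```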
6. Deploy the model¶
Finally, you can deploy the trained machine learning model in a production environment.
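How you deploy depends on your use case. One common first step is simply persisting the trained model so a serving application or pipeline step can load it; joblib is an assumed choice here:

```python
import joblib

# Persist the trained model to disk.
joblib.dump(model, "model.joblib")

# Later, in the serving environment, reload it for predictions.
loaded_model = joblib.load("model.joblib")
```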
Using Argo Workflows¶
MLOps Best Practices
Argo Workflows is an excellent tool for anyone looking to implement MLOps practices and streamline the process of training and deploying machine learning models or other data science tasks such as ETL.
Argo Workflows is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Argo Workflows is implemented as a Kubernetes CRD (Custom Resource Definition). It is particularly well-suited for use in machine learning and data science workflows.
Argo Workflows allows you to:
- Define workflows where each step in the workflow is a container.
- Model multi-step workflows as a sequence of tasks or capture the dependencies between tasks using a directed acyclic graph (DAG).
- Easily run compute intensive jobs for machine learning or data processing in a fraction of the time using Argo Workflows on Kubernetes.
- Run CI/CD pipelines natively on Kubernetes without configuring complex software development products.
These capabilities make it easy to manage the entire end-to-end machine learning pipeline. With Argo Workflows, you can build workflows that incorporate tasks such as data preprocessing, model training, and model deployment, all within a Kubernetes environment; a minimal example is sketched below.
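As a rough sketch (not a complete, production-ready manifest), a single-step training Workflow could look like the following. The image name and command are placeholders for your own containerized training code:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: train-model-    # Argo appends a random suffix to each run
spec:
  entrypoint: train
  templates:
    - name: train
      container:
        image: registry.example.com/my-project/train:latest  # hypothetical image
        command: ["python", "train.py"]
```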
See the Argo Workflows section for more details.