PACE is a framework developed at Google for data analysis and machine learning projects, created by the team behind the Coursera course Google Advanced Data Analytics. PACE is an acronym for Plan, Analyze, Construct, and Execute, and each letter represents an actionable stage of a project. The PACE workflow shows how each stage of the process can help guide data analysis. The framework is meant to be flexible: at any time you may revisit an earlier stage to make needed changes without interrupting the overall process. Communication is vital and encouraged throughout the entire process.
- Plan
- Analyze
- Construct
- Execute
Plan
In this stage you define the project, its purpose, and its scope; determine what data you will need; and develop the steps of the project. You will identify the stakeholders and assess their needs. What are the key questions that need to be answered? Focus first on the business need, then evaluate alternative models. If you are considering a machine learning model, or any model, does it meet the business need?
As an example, suppose you are trying to predict housing prices. You have the sale price of some houses in a given area, along with the square footage, number of bedrooms, number of bathrooms, and location. This calls for a supervised regression model. It is a regression problem because the target, price, is a continuous variable, not a categorical one. (Predicting customer churn, by contrast, would be a classification problem.) It is supervised because we have data for our target variable, namely the price.
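The framing above can be sketched in code. This is a minimal illustration using scikit-learn with made-up numbers; the column names and values are hypothetical, not from a real dataset.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical housing data: predictors plus a continuous sale price.
df = pd.DataFrame({
    "sqft": [1500, 2100, 1200, 1800],
    "bedrooms": [3, 4, 2, 3],
    "bathrooms": [2, 3, 1, 2],
    "price": [300_000, 450_000, 220_000, 360_000],
})

X = df[["sqft", "bedrooms", "bathrooms"]]  # predictor variables
y = df["price"]                            # continuous target -> regression

# Supervised: the model learns from labeled (X, y) pairs.
model = LinearRegression().fit(X, y)

# Predict the price of a new, unseen house.
new_house = pd.DataFrame({"sqft": [1600], "bedrooms": [3], "bathrooms": [2]})
print(model.predict(new_house))
```

Because the target is continuous, a regression estimator is used; a churn problem would instead use a classifier such as `LogisticRegression`.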
Ask yourself if you have any ethical concerns. For example, suppose you've been asked to find customers who don't tip well. If the model can identify poor tippers, would staff treat those customers worse, knowing the profile of a poor tipper in advance? It may be better to identify the customers who are good tippers instead.
It's great if you have access to a data dictionary, but if not, you'll want to explore the data to see what you are working with. Write down any questions you have. As a data analyst, this is where you may need the help of a data engineer.
Analyze
This is where you interact with the data for the first time. You will acquire the data, perhaps working with a data engineer to do so. Data may come from primary or secondary sources. This stage is where you engage in exploratory data analysis (EDA): cleaning, reorganizing, and analyzing your data.
Your response variable is a good place to start; then you can move on to your predictor variables. In our house prices example, the response variable is price, and square footage is an example of a predictor variable. You might need to change the format of some of your variables.
If you are working in Python with pandas, you might use methods such as head(), info(), describe(), and value_counts(). You may also want to use visualizations to see your data. For example, have a look at the Titanic Logistic Regression post for examples of using seaborn to visualize your data, and check out the post Histograms in Matplotlib.
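A quick sketch of those pandas EDA methods, using a small made-up DataFrame (in practice you would load your own data, e.g. with `pd.read_csv`):

```python
import pandas as pd

# Small illustrative DataFrame standing in for real data.
df = pd.DataFrame({
    "price": [300_000, 450_000, 220_000, 360_000],
    "bedrooms": [3, 4, 2, 3],
})

print(df.head())                       # first rows of the data
df.info()                              # column dtypes and non-null counts
print(df.describe())                   # summary statistics for numeric columns
print(df["bedrooms"].value_counts())   # frequency of each distinct value
```

These four calls give a fast first look at shape, types, distributions, and category frequencies before any cleaning begins.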
Feature engineering is the next step, though you may or may not have much to do here. Suppose you are trying to predict generous tippers; "generous" will be your target variable. You may need to engineer this column by dividing the tip by the total paid minus the tip to get a percentage, then creating another column called "generous".
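That feature could be built as follows. This is a minimal sketch with hypothetical column names (`total`, `tip`) and an assumed 20% threshold for "generous":

```python
import pandas as pd

# Hypothetical data: total amount paid (including tip) and the tip itself.
df = pd.DataFrame({"total": [11.0, 24.0, 30.0], "tip": [1.0, 4.0, 9.0]})

# Tip percentage = tip divided by (total paid minus the tip).
df["tip_pct"] = df["tip"] / (df["total"] - df["tip"])

# Binary target: 1 if the tip is at least 20% (the threshold is an assumption).
df["generous"] = (df["tip_pct"] >= 0.20).astype(int)
```

The engineered `generous` column then serves as the target variable for a classification model.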
In your feature engineering, pay close attention to your data types. For example, in pandas you may need to convert a date column stored as a string to a datetime column, and you might then want to create separate columns for year, month, and day.
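That date conversion looks like this in pandas (the column name `pickup_date` is hypothetical):

```python
import pandas as pd

# A date column stored as strings.
df = pd.DataFrame({"pickup_date": ["2023-01-15", "2023-02-03"]})

# Convert the string column to a proper datetime dtype.
df["pickup_date"] = pd.to_datetime(df["pickup_date"])

# Derive year, month, and day columns via the .dt accessor.
df["year"] = df["pickup_date"].dt.year
df["month"] = df["pickup_date"].dt.month
df["day"] = df["pickup_date"].dt.day
```

Once the column is a true datetime, the `.dt` accessor exposes the components directly, which is far more reliable than slicing strings.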
Construct
In this stage you will build, interpret, and revise your models. Your project may require a machine learning model, or it may require a regression model. You will want to think about correlations among your variables and uncover any hidden relationships in the data. Below are some things you will do in this phase.
- Determine which models are most appropriate
- Construct the model
- Confirm model assumptions
- Evaluate model results to determine how well your model fits the data
Here we need to develop an even deeper understanding of the data. At this point you will split the data into features and a target variable, and into training and testing datasets. Features and target are split by columns; training and testing sets are split by rows.
- Define a variable y that isolates the target variable (generous).
- Define a variable X that isolates all of the features (columns).
- Split the data into training and testing sets. Put about 20% of the samples into the test set, stratify the data, and set the random state.
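The three steps above can be sketched with scikit-learn's `train_test_split`. The DataFrame here is made up, reusing the hypothetical "generous" target:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data with a binary "generous" target.
df = pd.DataFrame({
    "tip_pct":  [0.10, 0.25, 0.18, 0.30, 0.05, 0.22, 0.15, 0.28, 0.12, 0.35],
    "total":    [10, 20, 15, 30, 8, 25, 12, 40, 9, 50],
    "generous": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

y = df["generous"]               # target: split off by column
X = df.drop(columns="generous")  # features: all remaining columns

# ~20% test set, stratified on y so class proportions are preserved,
# with random_state set for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```

Stratifying matters when the target classes are imbalanced; without it, a small test set could end up with almost no examples of the minority class.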
Execute
Both internal and external stakeholders are involved in this stage. You will communicate your findings to people both inside and outside of your organization. In fact, you will be communicating throughout the entire PACE workflow. Below are some of the things you will do in this phase.
- Interpret model performance and results
- Evaluate model performance using metrics
- Prepare results, visualizations, and actionable steps to share with stakeholders
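Evaluating model performance with metrics might look like the sketch below, using scikit-learn's classification metrics on made-up labels for the hypothetical "generous" classifier:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical true labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Common classification metrics to report to stakeholders.
print(f"accuracy:  {accuracy_score(y_true, y_pred):.2f}")
print(f"precision: {precision_score(y_true, y_pred):.2f}")
print(f"recall:    {recall_score(y_true, y_pred):.2f}")
print(f"f1:        {f1_score(y_true, y_pred):.2f}")
```

Which metric to lead with depends on the business need identified back in the Plan stage; for a regression model you would report metrics such as R-squared or mean squared error instead.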
Adaptability
Although the PACE framework is presented as four stages in order, you are free to move to any stage at any time; the framework is flexible. For example, during the planning stage you may need to jump ahead to the execute stage and present your initial findings to stakeholders to get feedback. During the analyze stage you might discover that you need more data than you thought, and you'll need to go back to the planning stage to add that requirement. The PACE framework allows for an Agile approach.
An Example
Suppose you have a dataset that you want to investigate to gather insights. The PACE framework has four stages, but our project plan has seven steps.
- Imports: understand the business scenario and define the problem – Plan
- EDA, checking model assumptions, and data cleaning – Plan & Analyze
- Determine which models are appropriate – Analyze & Construct
- Model building – Construct
- Confirm the model assumptions – Analyze & Construct
- Evaluate the model results – Analyze
- Interpret the model results and share with stakeholders – Execute