Best practices when troubleshooting decision strategy performance
|Description||Best practices when troubleshooting decision strategy performance|
|Version as of||8.6|
|Application||Pega Customer Decision Hub|
|Capability/Industry Area||Next Best Action|
Note: Information from this article has been included in the official Pega documentation. For more information on optimizing strategy performance, see Tips for optimizing decision strategies on Pega Documentation.
Best practices for troubleshooting
When a decision strategy performance issue occurs, it is normally caused by a combination of strategy design and the customer data used at run time. This article details some best practices for troubleshooting decision strategy performance.
Data flow metrics
When a decision strategy performance issue is encountered, it is normally visible on the Data Flow run page. The strategy component in a data flow is expensive. How expensive it is can be interpreted through the following two metrics:
Time% taken among overall data flow
Be aware that in a typical decisioning scenario, the strategy is normally the most time-consuming (CPU-intensive) component. This metric can reach 90%-95% of total data flow execution time. A high percentage therefore means strategy execution shows up as the performance bottleneck, as it should; it does not necessarily mean there is a performance problem with the application. Conversely, a relatively low percentage here may indicate that other parts of the system (e.g. DB or DDS) could be tuned better.
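As a rough illustration of how to read this metric (the function name and thresholds below are assumptions for the sketch, not Pega defaults or APIs):

```python
def interpret_strategy_time_share(strategy_ms: float, total_ms: float) -> str:
    """Illustrative heuristic for interpreting the strategy's share of
    total data flow execution time. Thresholds are examples only."""
    share = strategy_ms / total_ms * 100
    if share >= 90:
        # Expected in a typical decisioning scenario: the strategy
        # dominates, which by itself is not a problem.
        return f"{share:.0f}%: strategy dominates as expected"
    # A relatively low share hints that other parts of the system
    # (e.g. DB or DDS access) may benefit from tuning.
    return f"{share:.0f}%: investigate non-strategy components (DB, DDS)"
```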
Average time taken by Strategy shape per record
This metric records the time spent on the Strategy shape in a Data Flow per record, which is also precisely what the PEGA0063 alert reports. Diving a bit deeper, this metric mainly consists of three sub-parts (only visible after enabling detailed metrics for a Data Flow run):
- Pre-processing, such as loading IH or IH Summary caches for a Batch run
- Strategy execution, which should account for most of this metric in a typically healthy scenario
- Post-processing, which synchronously invokes the pxDelayedLearningFlow Data Flow to save Strategy Results and Monitoring Info to the pxDecisionResults dataset; depending on the configuration, a few other built-in destinations may also be invoked
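The per-record metric can be thought of as the sum of these three sub-parts. A minimal sketch (field names here are illustrative, not the actual detailed-metrics schema):

```python
from dataclasses import dataclass

@dataclass
class StrategyShapeMetrics:
    # Illustrative breakdown of the per-record Strategy shape time;
    # field names are assumptions, not actual Pega metric names.
    pre_processing_ms: float   # e.g. loading IH / IH Summary caches
    execution_ms: float        # the strategy execution itself
    post_processing_ms: float  # e.g. pxDelayedLearningFlow saving results

    def per_record_total(self) -> float:
        # The PEGA0063-style per-record figure is the sum of the parts.
        return self.pre_processing_ms + self.execution_ms + self.post_processing_ms

m = StrategyShapeMetrics(pre_processing_ms=2.0, execution_ms=15.0, post_processing_ms=3.0)
# In a healthy run, execution_ms should dominate the total.
```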
Guide to troubleshooting
There are essentially two stages to troubleshooting. The first is to look at the data flow metrics and alerts in PDC for guidance and direction on the performance challenge. The second is to deep-dive into the strategy execution.
- First, explore the two Data Flow metrics mentioned in the data flow metrics section. In case of a sudden performance degradation in a staging or production environment, which metric(s) are affected?
- Alerts. Some typical alerts to watch for strategy-related issues that can be seen in PDC:
- PEGA0063 strategy execution time
- PEGA0064 max number of strategy results processed per strategy component
- PEGA0075 DDS interaction time; various built-in storage (e.g. the pxDecisionResults dataset) relies on DDS
- PEGA0058 | PEGA0059 IH reading/writing time
- Enable detailed Data Flow metrics. For performance issues related to Strategy rule design, strategy execution time should be the most expensive item in the detailed metric report.
Diving into the details of Strategy rule design, there are two built-in tools that can be leveraged to inspect the details of strategy execution:
- Strategy Test Run panel (batch test): this provides a way to leverage existing Data Flow / Data Set definitions for batch-testing a Strategy rule over a batch of customer records. It provides the most accurate view and the most extensive detailed performance metrics for the strategy currently on the canvas. However, the current batch test capability has two limitations:
- It only shows the result for the current strategy; a performance test would need to be run on each sub-strategy to collect its metrics
- It only has detailed metrics for legacy components when running in optimized mode
- Strategy Profiler: this is accessible via "Actions → Run" from the Strategy rule form. It is the traditional PRPC test run page, which accepts a Data Transform to initialize the customer page for executing the strategy. In CDH, the data transform rules used for Persona testing can be used directly here. At the end of execution, it generates a downloadable report.
Strategy Profiler is a built-in tool that allows a strategy designer to test-execute a strategy rule for a given primary page and generate an overview report of how each component performs in terms of input/output and time spent. This is helpful when troubleshooting performance issues and, to some extent, correctness issues.
How to use the Strategy Profiler
The tool is built into the PRPC standard "Action → Run" dialog of the Strategy rule. To generate the overview report for offline analysis:
- Open the "Action → Run" of the strategy rule under test (normally this would be the top NBA strategy)
- Initialize the Primary page context with either a Data Transform or copy it from another page, whichever is appropriate based on your setup
- Run & Download the Strategy Execution Profile report (Excel file)
Example Strategy Profiler report
- Total strategy execution time
- When this is executed with the new SSA engine, only the non-optimized components are measured directly with pages in, pages out, and execution time, as indicated by a proper component name
- The row with component name <All> represents the total time spent within that particular strategy execution, including substrategy executions if applicable
- The row with component name <Optimized> equals <All> minus the sum of all non-optimized components
- For a Substrategy component, the time is cumulative
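The relationship between the rows can be sketched as follows (the dictionary layout is an illustration, not the actual report format):

```python
def optimized_time_us(report_rows: dict) -> float:
    """Derive the <Optimized> row of a profiler report:
    <Optimized> = <All> minus the sum of all directly measured
    (non-optimized) components. Row structure is illustrative."""
    total = report_rows["<All>"]
    non_optimized = sum(t for name, t in report_rows.items()
                        if name not in ("<All>", "<Optimized>"))
    return total - non_optimized

# Hypothetical report rows (times in microseconds):
rows = {"<All>": 5000, "Global Control Parameters": 370, "Import Webshop Points": 1480}
# optimized_time_us(rows) -> 5000 - (370 + 1480) = 3150
```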
|Strategy applies to||DMOrg-DMSample-Data-Customer|
|Strategy execution time (ms)||673|
|Strategy / Component breakdown|
|Strategy name||Component name||Time (μs)||Time per page (μs)||Pages in||Pages out|
|NextBestAction||Global Control Parameters||370||370.00||0||1|
|CalculateMonthsLeftInContract||Global Control Parameters||744||744.00||0||1|
|Retention||Global Control Parameters||867||867.00||0||1|
|WebshopPointOffers||Import Webshop Points||1,480||493.33||0||3|
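The "Time per page" column is simply the component time divided by the number of output pages, as the sample rows illustrate:

```python
def time_per_page(total_time_us: float, pages_out: int) -> float:
    # Time per page = component time / output pages, rounded as in the report.
    return round(total_time_us / pages_out, 2)

# Matches the "Import Webshop Points" row: 1,480 μs over 3 output pages.
assert time_per_page(1480, 3) == 493.33
```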
Decision Profiler is an advanced tool that is hidden from Pega users by default but allows you to take a detailed snapshot of the decision profile.
It is not available to customers directly but can be switched on in consultation with Pega support, who can guide you in its usage and interpretation.
Simulations and batch performance checks
Running a simulation or a batch run to check the differences between versions of strategies is a simple yet effective way of understanding the impact of strategy changes on the overall strategy execution time.
Starting in 8.5, a performance check has been introduced into Revision Management, Pega's change management tool. This simulation runs on the same audience and top-level strategy, collecting the average processing speed per record for each revision. This can then be compared with the previous revision to report on the performance trendline: has performance become better or worse than the last revision?
This approach can be replicated in a batch run, or manually via a simulation test from the landing page in the CDH portal. You can monitor the performance of your strategy through the data flow metrics. Running the same strategy with the same audience means you can track changes in strategy performance by comparing metrics from run to run.
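The run-to-run comparison is straightforward arithmetic; a hedged sketch of the trend calculation (not a Pega API, just the idea behind the revision performance check):

```python
def performance_trend(prev_ms_per_record: float, curr_ms_per_record: float) -> str:
    """Compare average processing time per record between two runs on the
    same audience and top-level strategy. Illustrative only."""
    delta_pct = (curr_ms_per_record - prev_ms_per_record) / prev_ms_per_record * 100
    if delta_pct > 0:
        return f"slower by {delta_pct:.1f}% than the previous revision"
    return f"faster by {-delta_pct:.1f}% than the previous revision"
```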
Frequently asked questions when troubleshooting strategy performance
Below are some frequently asked questions about best practices when troubleshooting strategy performance issues. This list will continue to be updated as relevant questions are asked of Pega.
Which node type does the strategy get executed on when run through a data flow? Can we control it to run on specific node types?
There is no dedicated node type for strategy execution. Because a strategy normally executes within a data flow, it runs on the data flow node type that corresponds to the type of workload.
Based on past experience, which components, if used in strategy design, are most likely to lead to performance degradation? Are there suggestions or best practices to follow to avoid such situations?
Un-optimized components are typically the ones that need more attention when debugging. Typical components that might run into performance issues:
- Adaptive Model - this by nature is a relatively expensive component. If using IH Predictors, this can be extremely expensive if there is an issue with DDS (e.g. PEGA0075)
- Interaction History - if there are many records to be loaded
- Data Import or Decision Data - when importing a large list of pages
- Embedded Strategy - when iterating over a large page list
- Data Join - when the component is misconfigured in a way that leads to an explosion in the number of result pages (e.g. a Cartesian product)
- MarkerContainer - the internal technical representation of all data that needs to be propagated along with the SR page, for example ADM model results and monitoring data. This is transparent to the strategy designer, but if there are too many SR pages or the strategy logic is (mis-)configured, it can cause long GC pauses. In this case, select "exclude model results" on Data Join shapes where applicable.
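To see why a misconfigured Data Join is dangerous: without a selective join condition, every source page pairs with every joined page, so the output grows multiplicatively. A small illustration (the function and numbers are hypothetical):

```python
def joined_page_count(source_pages: int, join_pages: int,
                      match_fraction: float = 1.0) -> int:
    """Worst case for an unconstrained join is the Cartesian product:
    source_pages * join_pages result pages. match_fraction models how
    selective the join condition is (1.0 = no filtering at all)."""
    return int(source_pages * join_pages * match_fraction)

# 200 offers joined against 50 records with no join condition:
assert joined_page_count(200, 50) == 10_000   # result-page explosion
# A selective condition keeping 2% of pairs stays manageable:
assert joined_page_count(200, 50, 0.02) == 200
```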
What are the important things to look at while tracing strategies?
Tracing with the built-in Tracer is not recommended with the optimized decision engine, because the optimizations mean the order of execution cannot be guaranteed.
Following the guidance in this article should give you good insight into your strategy performance issue and allow you to initiate the appropriate resolution. If you still cannot identify where the bottleneck in the strategy is, third-party JVM profiling tools (such as YourKit) can be utilized.