Performance Checklist
Introduction
This document provides a checklist for development engineers and teams who want to build high-performing solutions on Pega Platform™. The checklist focuses on solutions developed with Pega Platform™ and on the tools and features provided out of the box to ensure production readiness.
The following categories list the performance checks and general best practices to follow during the development phase and before product release.
PDC - Right Place to tackle Performance Alerts
Whether your system runs on premises or on Pega Cloud, start monitoring your Pega-based solutions in Pega Predictive Diagnostic Cloud™ (PDC) as a daily routine, even during the development phase, to diagnose, troubleshoot, and resolve performance issues.
PDC provides you with tools for closely monitoring and precisely assessing your Pega Platform™ performance. By using the knowledge of the areas that need improvement, you can thoroughly investigate and effectively deal with unexpected or unwanted behaviour of your system.
The data that PDC presents gives you an in-depth view of various issues and events in the system, which increases your control over the way Pega Platform operates, and helps you eliminate errors. Sensitive data is safe and secure because PDC receives only diagnostic data, filtering out all personally identifying information (PII).
With detailed insight into your system’s operations, you can promptly identify and resolve issues to optimize features and maximize performance.
Use the information that you gather to decide on the best way to proceed. Choose the Improvement Plan report or enable continuous notifications about specific events, and then use the findings to inform users about the system health.
PDC monitors and reports a variety of performance-related alerts in the range PEGA0001 to PEGA0110.
Typical Performance Alerts Captured in PDC
Alert | Category |
---|---|
PEGA0001 - HTTP interaction time exceeds limit | Browser Time |
PEGA0002 - Commit operation time exceeds limit | DB Commit Time |
PEGA0003 - Rollback operation time exceeds limit | DB Rollback Time |
PEGA0004 - Quantity of data received by database query exceeds limit | DB Bytes Read |
PEGA0005 - Query time exceeds limit | DB Time |
How to start monitoring your systems with PDC?
Getting started with Pega Predictive Diagnostic Cloud.
Various Types of Performance Alerts:
https://community.pega.com/knowledgebase/articles/pega-predictive-diagnostic-cloud/list-performance-and-security-alerts-pega-platform
To configure PDC on premise:
https://community.pega.com/knowledgebase/articles/configuring-premises-systems-monitoring-pdc
If you are still unable to configure PDC, try using the PegaRULES Log Analyzer:
https://community.pega.com/knowledgebase/articles/performance/how-use-pegarules-log-analyzer
DB Performance and Top queries
When it comes to performance bottlenecks, the database is one of the usual suspects. To avoid issues with database and query performance, monitor your database and its queries regularly.
Check the following:
· Based on your application's data table design and anticipated growth patterns, ensure that you have created indexes on key data columns.
· Make sure that database query response times fall within the thresholds defined by the Pega Platform™ alerts, tune the queries if required, and look for other database-related alerts in PDC.
· Make sure that each query retrieves only the columns that it needs.
· Make sure that the same query is not run more often than required. You can check this by tracking the execution count in PDC or in PAL readings.
· Find the top queries by response time and execution count, and address them.
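The last two checks above can be sketched in a few lines of Python. This is a minimal illustration, not a Pega or PDC API: the `(query_text, elapsed_ms)` sample shape, the function name `top_queries`, and the sample data are all hypothetical stand-ins for whatever you export from PDC, PAL readings, or the alert log.

```python
from collections import defaultdict

def top_queries(samples, limit=10):
    """Rank queries by total elapsed time, then by execution count.

    `samples` is an iterable of (query_text, elapsed_ms) pairs. This shape is
    a hypothetical export format, not something Pega produces as-is.
    """
    stats = defaultdict(lambda: {"calls": 0, "total_ms": 0.0})
    for query, elapsed_ms in samples:
        stats[query]["calls"] += 1
        stats[query]["total_ms"] += elapsed_ms

    ranked = [
        {
            "query": query,
            "calls": s["calls"],
            "total_ms": s["total_ms"],
            "avg_ms": s["total_ms"] / s["calls"],
        }
        for query, s in stats.items()
    ]
    # Highest total time first; ties broken by number of executions.
    ranked.sort(key=lambda r: (r["total_ms"], r["calls"]), reverse=True)
    return ranked[:limit]

# Illustrative data only; not measured on a real system.
samples = [
    ("query A", 100.0),
    ("query B", 900.0),
    ("query A", 300.0),
    ("query C", 50.0),
    ("query A", 200.0),
]

for row in top_queries(samples, limit=2):
    print(row["query"], row["calls"], row["total_ms"], row["avg_ms"])
```

A query that appears with a high call count but a modest average time (like "query A" here) is a candidate for the "run more often than required" check, while a query with a high average time is a tuning candidate.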
For solutions hosted on Pega Cloud™ environments, PDC is available by default and can be used to monitor database metrics and top query statistics.
For non-Pega Cloud environments that use a PostgreSQL database, you can do the following to list the top queries on the system.
· Enable the pg_stat_statements extension in the PostgreSQL database.
· Use the following query to list the top 10 queries by average execution time. On PostgreSQL 13 and later, the total_time column is named total_exec_time.
select query, total_time, calls, total_time/calls avg_time,
  total_time*100/(select sum(total_time) from pg_stat_statements) percent,
  rows, shared_blks_hit, shared_blks_read, shared_blks_dirtied, shared_blks_written
from pg_stat_statements
order by avg_time desc, total_time desc
limit 10
Apart from the above, you can also set a few database alerts.
Setting Database alerts Thresholds
One such scenario is identifying database queries that return large amounts of data and are therefore candidates for tuning. To do this, set the byte threshold, which is off by default. The warning threshold, warnMB, writes a stack trace to the alert log. The error threshold, errorMB, writes a stack trace and additionally halts the requestor. For example, setting warnMB to 10 shows which queries request 10 MB or more of data, while setting errorMB to 50 halts the requestor only if a database query returns more than 50 MB of data. The sample entries below use 15 MB and 500 MB. Based on the alert log results, adjust these settings periodically according to your requirements.
<env name="alerts/database/interactionByteThreshold/enabled" value="true" />
<env name="alerts/database/interactionByteThreshold/warnMB" value="15" />
<env name="alerts/database/interactionByteThreshold/errorMB" value="500" />
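To make the two-level behaviour of these settings concrete, here is a small Python sketch of the decision they describe. It is an illustration only, not Pega's implementation: the function name, the reading of MB as 1024 × 1024 bytes, and the handling of the exact boundary values are assumptions.

```python
MB = 1024 * 1024  # assumption: thresholds are read as mebibytes

def classify_query_result(size_bytes, warn_mb=15, error_mb=500):
    """Return 'ok', 'warn', or 'error' for a query result size.

    Mirrors the warnMB / errorMB idea: 'warn' corresponds to a stack trace in
    the alert log, 'error' to a stack trace plus a halted requestor.
    Defaults copy the sample entries above; boundary handling is assumed.
    """
    if size_bytes >= error_mb * MB:
        return "error"
    if size_bytes >= warn_mb * MB:
        return "warn"
    return "ok"

# Illustrative sizes only.
for size in (5 * MB, 20 * MB, 600 * MB):
    print(size // MB, "MB ->", classify_query_result(size))
```

Starting with a low warning threshold and a high error threshold, as in the sample entries, lets you collect evidence in the alert log before any requestor is halted.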
Measure Clipboard Size and Requestor sessions
The clipboard display shows the contents of the clipboard, but not its size in bytes. Large clipboards can affect performance because memory in the Java Virtual Machine (JVM) supporting the Pega Platform holds the clipboards of all requestors.
You can use the Performance tool to see the size of your clipboard in bytes, or to track the growth and contraction of your clipboard over time.
· Make sure that the requestor size for an end user remains under an acceptable limit.
· Make sure that obsolete and unused data pages are removed and that memory is cleared regularly, and check for memory leaks.
· Check for heavy data pages in your requestors and threads, and reduce their footprint where possible.
Also monitor and adjust the number of requestors in the batch requestor pool.
To change the number of requestors in the pool, use the agent/threadpoolsize setting in the prconfig.xml file or in a dynamic system setting (DSS). Monitor thread-level pages as well, to limit clipboard usage to the required data only.
Ways to measure Clipboard size:
https://community.pega.com/knowledgebase/articles/application-development/85/measuring-clipboard-size
Clipboard tool:
https://community.pega.com/knowledgebase/articles/application-development/85/using-clipboard-tool
Define and Set Target SLAs
It is always better to have a target performance number in mind, and setting SLAs for your solution can help you achieve this goal.
· You can set SLAs in terms of response times for the critical HTML pages or screens that end users load as part of your application.
These can be validated by running performance tests with tools such as JMeter, Gatling, or Fiddler. The platform alerts PEGA0001 - HTTP interaction time exceeds limit and PEGA0069 - Client page load time are useful here with their default thresholds.
· Other key SLAs can be set for DB query execution times and connect total time.
These can also be validated by running the performance tests mentioned above or through manual runs. The platform alerts to look for are PEGA0005 - Query time exceeds limit and PEGA0020 - Total connect interaction time exceeds limit.
If the SLAs are not met, you can debug the alerts for these transactions with the Pega performance diagnostic tools, such as PAL readings, Tracer, Profiler, and DB Tracer. These tools are available from the Tracer and Performance tabs in Dev Studio for a SysAdmin user.
You can also set thresholds for alerts, for example:
HTTP interaction time threshold
The default threshold for HTTP interaction time is one second. If an interaction takes longer than one second, the system writes the PEGA0001 alert to the alert log. The excludeAssembly setting is included here so that initial rule assembly does not trigger alerts.
<env name="alerts/browser/interactionTimeThreshold/enabled" value="true" />
<env name="alerts/browser/interactionTimeThreshold/excludeAssembly" value="true" />
<env name="alerts/browser/interactionTimeThreshold/warnMS" value="1000" />
Other thresholds and system settings to look at:
https://community.pega.com/knowledgebase/articles/performance/performance-guidance-production-systems-system-settings
Guardrail Scores - Check your Score
The guardrail score is a good indicator to watch when you want to develop performant solutions on Pega Platform. It helps you gauge your application's current functional issues and also helps you identify serious performance problems.
Schedule a recurring check of the compliance score and make sure that it remains above a threshold such as 90. Specific performance risks and their counts are shown on the Compliance details tab. Their urgency is categorized as follows:
Resolve Now: Severe warnings that need to be addressed immediately
Resolve before Production: Moderate warnings that need to be resolved before production
Selecting the number takes you to the current risk areas. Addressing these can improve overall performance or stop it from degrading.
As part of your scheduled checks, also make sure that system performance metrics and average response times stay under control and do not degrade over time.
How to check Scores for your app:
https://community.pega.com/knowledgebase/articles/devops/85/viewing-application-quality-metrics
Metrics Details:
https://community.pega.com/knowledgebase/articles/devops/85/application-quality-metrics
Additional things to consider
Regularly monitor heap memory and set the right JVM configuration.
Heap memory trends can help you diagnose and troubleshoot memory-related performance bottlenecks. Heap usage that keeps rising and is not reclaimed by garbage collection (GC) can indicate a performance issue or a memory leak in your application.
Starting with Pega Platform 8.5.1, you can use PDC to monitor heap health under System Resources > JVM Monitoring; GC activity is available in events.
Thread dumps, if created, are also available in the logs for analysis.
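One way to turn "heap that keeps rising after GC" into a number is to fit a straight line through heap-after-GC samples and look at the slope. The Python sketch below does this with a plain least-squares fit. The function name, the `(hours, used_mb)` sample shape, and the sample values are hypothetical; in practice the samples would come from GC logs, PDC, or a JMX tool.

```python
def heap_growth_per_hour(samples):
    """Least-squares slope of heap usage measured right after GC.

    `samples` is a list of (hours_since_start, used_mb_after_gc) pairs, a
    hypothetical shape for data taken from GC logs or a monitoring tool.
    A clearly positive slope over a long run suggests memory that is never
    reclaimed; a slope near zero suggests a stable heap.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_h = sum(h for _, h in samples) / n
    numerator = sum((t - mean_t) * (h - mean_h) for t, h in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    if denominator == 0:
        raise ValueError("samples must cover more than one point in time")
    return numerator / denominator

# Illustrative values only: heap after GC grows by about 100 MB every hour.
rising = [(0, 1000), (1, 1100), (2, 1200), (3, 1300)]
stable = [(0, 1000), (1, 1000), (2, 1000)]

print(heap_growth_per_hour(rising))
print(heap_growth_per_hour(stable))
```

The slope alone is not proof of a leak, since caches warming up can look similar early in a run, so it is most useful over long-duration tests.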
How to Monitor Heap memory using JMX
If PDC is not available, you can still use JMX monitoring through open source tools such as JVisualVM for insights into the JVM heap and thread details.
Use the following JVM arguments to enable JMX monitoring. Port 9099 can then be used to connect through JVisualVM. These arguments disable authentication and SSL, so use them only on a trusted network.
-Dcom.sun.management.jmxremote.port=9099 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
Typical Heap Memory Usage issue using JVisualVM
Pega Platform JVM Configurations Best Practises:
https://community.pega.com/knowledgebase/articles/performance/jvm-configuration-best-practices
Setting up & Configuring Hazelcast for On Prem:
https://community.pega.com/knowledgebase/articles/configuring-client-server-mode-hazelcast-pega-platform
As the number of work items in the database grows, older or inactive work items and their related data need to be archived or purged. For guidance, see link below:
https://community.pega.com/knowledgebase/articles/system-administration/85/trimming-purging-and-archiving-tables
Production-level settings
Set the system's production level according to whether it is a test or production environment. On Pega Cloud production environments this setting is taken care of by default, but for on-premises environments use a production level of 2 for development systems and 5 for production systems. This setting also helps you manage the logging level accordingly.
Also, regularly check background data flows that were set up to create data. If they are no longer in use, stop them along with their queue processors, because they might be quietly creating data in the background according to their schedules.
How to Set Production level:
https://community.pega.com/knowledgebase/articles/system-administration/85/specifying-production-level
Design & Run Load test to validate the business use
While there are many tools available for load testing, you can start with JMeter for running performance tests. You can also consider reusing functional test cases written by QA for performance testing, for example with Karate.
Design the load test to match the business use of the solution. This means executing a test that is as close as feasible to the real anticipated use of the solution, so that your performance tests mimic real-world production use. To ensure this, identify the right volume and the right mix of work across a business day. Always do the math so that you understand the throughput of the tests and can say that in any n minutes the test had a throughput of y items, which represents a full daily rate of x items, which is A% of the current volume of V per day.
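The throughput arithmetic described above can be written down as a short Python sketch. The function names and the sample numbers are hypothetical, and the projection assumes an 8-hour (480-minute) business day with evenly spread load; adjust both to your own workload profile.

```python
def projected_daily_rate(items_completed, test_minutes, business_minutes_per_day=480):
    """Scale the y items observed in n test minutes to a full-day rate x.

    Assumes load is spread evenly over the business day (480 minutes here,
    which is an assumption, not a Pega default).
    """
    items_per_minute = items_completed / test_minutes
    return items_per_minute * business_minutes_per_day

def percent_of_current_volume(daily_rate, current_daily_volume):
    """Express the projected daily rate x as A% of the current volume V/day."""
    return 100.0 * daily_rate / current_daily_volume

# Illustrative numbers only: y = 1500 items in n = 30 minutes, V = 20000/day.
x = projected_daily_rate(items_completed=1500, test_minutes=30)
a = percent_of_current_volume(x, current_daily_volume=20000)
print(f"daily rate x = {x:.0f} items, A = {a:.0f}% of current volume")
```

With these numbers the test ran at 50 items per minute, which projects to 24,000 items per day, or 120% of a current volume of 20,000 per day.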
Things to remember while carrying out Load Testing:
Ensure adequate data loads
Make sure that loads are realistic and that enough data is available to complete the tests in the time period. Many performance issues first become evident in applications that have been in production for a certain period. Often this is because load testing was performed with insufficient data loads, so the response-time performance of the data paths appeared satisfactory during testing.
For example, on a table with a small number of records, a database table scan can perform as well as a selection through an index. However, if the table grows significantly in production and a needed index is not in place, performance degrades seriously.
Measure results appropriately
Do not use average response times for transactions as the absolute unit of measure for test results. Always consider Service Level Agreements (SLAs) in percentile terms. Load testing is not a precise science; consider the top percentile user or requestor experience. Review results in this light.
· For transaction-intensive solutions or applications ("heads-down" use), a recommended value is the 80th percentile.
· For mixed-type use applications, use the 90th percentile.
· For ad hoc, infrequent use, the 95th percentile provides a more statistically relevant result set than an average over 100 percent of the results.
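The difference between an average and a percentile view is easy to see on a small sample. The Python sketch below uses the nearest-rank percentile definition, which is one common choice and an assumption here; load testing tools may interpolate differently. The response times are made-up values.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value such that at least
    pct percent of the samples are less than or equal to it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]

# Illustrative response times in milliseconds; one slow outlier at the end.
response_ms = [200, 220, 250, 260, 300, 310, 400, 450, 900, 3000]

mean_ms = sum(response_ms) / len(response_ms)
print("mean:", mean_ms)
for pct in (80, 90, 95):
    print(f"{pct}th percentile:", percentile(response_ms, pct))
```

Here the single 3000 ms outlier pulls the mean well above what most requests experienced, while the 80th and 90th percentiles state directly what the slower but still typical users saw.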
Once you have understood and calculated the above, start running your solution against a specified goal. You can run multiple types of performance tests, such as load tests, scalability tests, and long-duration soak tests.
Typical Load Testing Graphs
Response Times vs Virtual Users (Jmeter-Grafana)
Monitor PDC for performance metrics related to system resources and alerts, and for database metrics related to DB queries. You can also set up external monitoring tools such as Datadog to capture server health metrics. If Datadog is unavailable, use the sar or vmstat commands on Linux to track system metrics such as the following:
· system.cpu
· system.io
· system.load
· system.mem
…………
*Metrics captured in Datadog
As a best practice, periodically revisit and repeat the steps in this performance checklist throughout the development phase of your solution so that you can deliver performant and reliable solutions.