Performance Checklist
Introduction

This document provides a checklist for development engineers and teams aiming to build high-performing solutions on Pega Platform™. The checklist focuses on solutions developed with Pega Platform™ and on the tools and features provided out of the box to ensure production readiness.

The following sections list the performance checks and general best practices to work through during the development phase and before product release.

PDC - The Right Place to Tackle Performance Alerts

Whether your system runs on premises or on Pega Cloud, it is important to start monitoring your Pega-based solutions in Pega Predictive Diagnostic Cloud™ (PDC) as a daily routine, even during the development phase, to diagnose, troubleshoot, and resolve performance issues.

PDC provides you with tools for closely monitoring and precisely assessing your Pega Platform™ performance. By using the knowledge of the areas that need improvement, you can thoroughly investigate and effectively deal with unexpected or unwanted behaviour of your system.

The data that PDC presents gives you an in-depth view of various issues and events in the system, which increases your control over the way Pega Platform operates, and helps you eliminate errors. Sensitive data is safe and secure because PDC receives only diagnostic data, filtering out all personally identifying information (PII).

With detailed insight into your system’s operations, you can promptly identify and resolve issues to optimize features and maximize performance.

Use the information that you gather to decide on the best way to proceed. Choose the Improvement Plan report or enable continuous notifications about specific events, and then use the findings to keep users informed about system health.

PDC monitors the system and raises a variety of alerts, ranging from PEGA0001 to PEGA0110, many of which relate specifically to performance.

Typical Performance Alerts Captured in PDC

Alert                                                                   Category
PEGA0001 - HTTP interaction time exceeds limit                          Browser Time
PEGA0002 - Commit operation time exceeds limit                          DB Commit Time
PEGA0003 - Rollback operation time exceeds limit                        DB Rollback Time
PEGA0004 - Quantity of data received by database query exceeds limit    DB Bytes Read
PEGA0005 - Query time exceeds limit                                     DB Time

How to start monitoring your systems with PDC?

Getting started with Pega Predictive Diagnostic Cloud.

Various Types of Performance Alerts:

https://community.pega.com/knowledgebase/articles/pega-predictive-diagnostic-cloud/list-performance-and-security-alerts-pega-platform

To configure PDC for on-premises systems:

https://community.pega.com/knowledgebase/articles/configuring-premises-systems-monitoring-pdc

If you are unable to configure PDC, try using the PegaRules Log Analyzer:

https://community.pega.com/knowledgebase/articles/performance/how-use-pegarules-log-analyzer

DB Performance and Top Queries

When it comes to performance bottlenecks, the database is one of the usual suspects. To avoid issues related to database and query performance, monitor your database and your queries regularly.

The things that you need to check are:

  •  Based on your application's data table design and anticipated growth patterns, ensure that you have created indexes on key data columns (see the sketch after this list).
  •  Make sure database query response times fall within the SLAs defined by Pega Platform™ alerts; tune the queries if required, and watch for other database-related alerts in PDC.
  •  Make sure each database query retrieves data from the right columns and from no more columns than required.
  •  Make sure the same database query is not run more times than required; you can verify this by tracking the execution count in PDC or in PAL readings.
  •  Find the top queries by response time and execution count, and address them.
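For instance, a minimal sketch of creating such an index (the table and column names here are hypothetical, not actual Pega schema objects):

-- hypothetical work table with an exposed status column
CREATE INDEX idx_work_status ON pc_myco_work (pystatuswork);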

For solutions hosted on Pega Cloud™ environments, PDC is available by default and can be used to monitor database metrics and top query statistics.

[Image: ImagePDC.png - PDC database metrics and top query statistics]

For non-Pega Cloud environments with a PostgreSQL database, you can run the following to find the top queries on the system.

  •  Enable the pg_stat_statements extension in the PostgreSQL database (a sketch follows this list).
  •  Use the query below to get a list of the top 10 queries by execution time.
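A minimal sketch of enabling the extension (this assumes superuser access and requires a server restart):

# in postgresql.conf (restart required)
shared_preload_libraries = 'pg_stat_statements'

-- then, in the target database
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;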

-- Top 10 queries by average execution time (column names as in PostgreSQL 12
-- and earlier; on PostgreSQL 13+ total_time is named total_exec_time)
SELECT query,
       total_time,
       calls,
       total_time / calls AS avg_time,
       total_time * 100 / (SELECT SUM(total_time) FROM pg_stat_statements) AS percent,
       rows,
       shared_blks_hit,
       shared_blks_read,
       shared_blks_dirtied,
       shared_blks_written
FROM pg_stat_statements
ORDER BY avg_time DESC, total_time DESC
LIMIT 10;
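To start a fresh measurement window before a test run, you can clear the collected statistics:

SELECT pg_stat_statements_reset();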

Apart from the above, you can also set a few database alert thresholds.

Setting Database Alert Thresholds

One useful scenario is identifying database queries that return large amounts of data, and are therefore candidates for tuning: set the byte threshold. This feature is off by default. When the warning threshold (warnMB) is crossed, the system writes a stack trace to the alert log; when the error threshold (errorMB) is crossed, it writes a stack trace and additionally halts the requestor. For example, setting warnMB to 10 provides insight into which queries request 10 MB or more of data, while setting errorMB to 50 halts the requestor only if a database query returns over 50 MB of data. Based on the alert log results, adjust these settings periodically according to your requirements.

<env name="alerts/database/interactionByteThreshold/enabled" value="true" />
<env name="alerts/database/interactionByteThreshold/warnMB" value="15" />
<env name="alerts/database/interactionByteThreshold/errorMB" value="500" />

Measure Clipboard Size and Requestor Sessions

The clipboard display shows the contents of the clipboard, but not its size in bytes. Large clipboards can affect performance because memory in the Java Virtual Machine (JVM) supporting the Pega Platform holds the clipboards of all requestors.

You can use the Performance tool to see the size of your clipboard in bytes, or to track the growth and contraction of your clipboard over time.

  •  Make sure that, for an end-user requestor, the clipboard size remains under an acceptable limit.
  •  Make sure obsolete and dead data pages are removed and memory is cleared regularly, and check for memory leaks.
  •  Keep an eye on heavy data pages in your requestors and threads, and reduce their footprint where possible.

Also monitor and adjust the number of requestors in the batch requestor pool.

To alter the number of requestors in the pool, use the agent/threadpoolsize setting in the prconfig.xml file or as a dynamic system setting (DSS), as sketched below. Monitor thread-level pages as well, to limit clipboard usage to required data only.
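A minimal sketch of the prconfig entry, in the same format as the other settings in this checklist (the value shown is illustrative, not a recommendation):

<!-- size of the batch requestor pool; tune to your background workload -->
<env name="agent/threadpoolsize" value="10" />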

Ways to measure Clipboard size:

https://community.pega.com/knowledgebase/articles/application-development/85/measuring-clipboard-size

Clipboard tool:

https://community.pega.com/knowledgebase/articles/application-development/85/using-clipboard-tool

Define and Set Target SLAs

It is always better to have a target acceptable performance number in mind, and setting SLAs across your solution can help you achieve this goal.

  •  You can set SLAs in terms of response times for the critical HTML pages or screens that load as end users work with your application.

These can be validated by running performance tests with open-source tools such as JMeter, Gatling, or Fiddler. The platform alerts PEGA0001 (HTTP interaction time exceeds limit) and PEGA0069 (Client page load time) are handy here and can be leveraged with their default SLAs.
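For instance, a typical non-GUI JMeter run looks like the following sketch (the test plan, results file, and report folder names are illustrative):

# run a test plan headless, log samples, and generate an HTML report
jmeter -n -t checkout_flow.jmx -l results.jtl -e -o report/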

  •  Other key SLAs to set are database query execution times and connector total time.

These can again be validated by running the simple performance tests mentioned above or through manual runs. The platform alerts to watch are PEGA0005 (Query time exceeds limit) and PEGA0020 (Total connect interaction time exceeds limit).

If the SLAs are not met, you can debug the alerts for these transactions with the Pega performance diagnostic tools available on the Performance tab of the Dev Studio portal for SysAdmin users, namely PAL readings, Tracer, Profiler, and DB Tracer.

[Images: Performance1.png, PALPerformance.png - PAL and performance diagnostic tools in Dev Studio]

You can also set thresholds for alerts. For example:

HTTP interaction time threshold

The default threshold for HTTP interaction time is one second. If a particular interaction takes more than one second, the system writes alert PEGA0001 to the alert log. The excludeAssembly setting is included here so that initial rule assembly does not trigger alerts.

<env name="alerts/browser/interactionTimeThreshold/enabled" value="true" />
<env name="alerts/browser/interactionTimeThreshold/excludeAssembly" value="true" />
<env name="alerts/browser/interactionTimeThreshold/warnMS" value="1000" />

Other thresholds and system settings to look at:

https://community.pega.com/knowledgebase/articles/performance/performance-guidance-production-systems-system-settings

Guardrail Scores - Check your Score

The guardrail score is a great metric to watch when you want to develop performant solutions on Pega Platform. It not only helps you gauge your application's current functional issues but also helps you identify serious performance problems.

Schedule a recurring check on the compliance score and make sure it remains above a threshold such as 90. Specific performance-impact risks and their counts can be seen on the Compliance details tab. Their urgency is categorized as follows:

Resolve Now: Severe Warnings that need to be addressed immediately

Resolve before Production: Moderate Warnings that need to be resolved before production

Selecting the count takes you to the current risk areas. Addressing these can improve overall performance or stop its degradation.

[Images: GuardrailScore1.png, GuardrailEvents.png - guardrail compliance score and events]

As part of your scheduled checks, also make sure that system performance metrics and average response times stay under control and do not degrade over time.

How to check Scores for your app:

https://community.pega.com/knowledgebase/articles/devops/85/viewing-application-quality-metrics

Metrics Details:

https://community.pega.com/knowledgebase/articles/devops/85/application-quality-metrics

Additional things to consider

Regularly monitor heap memory and set the right JVM configuration.

Heap memory trends can help you diagnose and troubleshoot memory-related performance bottlenecks. Rising heap memory that is not being garbage collected (GC) can indicate a performance issue and a memory leak in your application.

Starting with Pega Platform 8.5.1, you can use PDC to monitor heap health under System Resources > JVM Monitoring, with GC activity available under events.

Thread dumps, when generated, are also written to the logs, where they can be analyzed; a sketch of generating one follows.
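A minimal sketch of generating a thread dump with standard JDK tooling (the process id 12345 is illustrative):

# print a thread dump for the JVM with process id 12345
jcmd 12345 Thread.print > thread-dump.txt
# or, on older JDKs
jstack 12345 > thread-dump.txt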

How to Monitor Heap memory using JMX

If PDC is not an option, you can still monitor through JMX by using open-source tools such as VisualVM for insight into JVM heap and thread details.

Use the following JVM arguments to enable JMX monitoring; port 9099 can then be used to connect through VisualVM. Note that disabling authentication and SSL, as shown here, is appropriate only on secured, non-production systems.

-Dcom.sun.management.jmxremote.port=9099 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
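To attach, add a JMX connection in the VisualVM UI, or open it directly from the command line (the host and port are illustrative, and the --openjmx option assumes a reasonably recent VisualVM build):

# connect VisualVM to the JMX endpoint exposed above
jvisualvm --openjmx localhost:9099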

Typical heap memory usage issue, as seen in VisualVM

[Image: JVMHeapMemory.png]

Pega Platform JVM configuration best practices:

https://community.pega.com/knowledgebase/articles/performance/jvm-configuration-best-practices

Setting up and configuring Hazelcast for on-premises deployments:

https://community.pega.com/knowledgebase/articles/configuring-client-server-mode-hazelcast-pega-platform

Archiving and purging work items and related data

As the number of work items in the database grows, older or inactive work items and their related data need to be archived or purged. For guidance, see the link below:

https://community.pega.com/knowledgebase/articles/system-administration/85/trimming-purging-and-archiving-tables

Production-level settings

Set the system's production level according to whether it is a test or production environment. On Pega Cloud production environments this setting is taken care of by default, but for on-premises environments, use a production level of 2 for development systems and 5 for production systems. This setting also helps you manage the logging level accordingly.

Also, regularly check any background data flows that were set up to create data; if they are not in use, stop them, along with queue processors, because they might be quietly creating data in the background on their schedules.

How to Set Production level:

https://community.pega.com/knowledgebase/articles/system-administration/85/specifying-production-level

Design and Run Load Tests to Validate the Business Use

While many tools are available for load testing, you can start with JMeter for running performance tests. You may also consider reusing functional test cases written by QA for performance testing, for example with Karate.

Design the load test to match the business use of the solution. This means executing a test that is as close as feasible to the real, anticipated use of the solution. It is important that your performance tests mimic real-world production use; to ensure this, identify the right volume and the right mix of work across a business day. Always do the math, so that you understand the throughput of the tests and can say that in any n minutes the test achieved a throughput of y items, which extrapolates to a full daily rate of x items, or A% of the current volume of V items per day; a worked example follows.
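A worked example with illustrative numbers: if a 30-minute test completes y = 500 items, the extrapolated daily rate over an 8-hour (480-minute) business day is x = 500 × (480 / 30) = 8,000 items; against a current volume of V = 10,000 items/day, the test represents A = 8,000 / 10,000 = 80% of production load.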

Things to remember while carrying out Load Testing:

Ensure adequate data loads

Make sure loads are realistic and enough data is available to complete the tests in the time period. Many performance issues first become evident in applications that have been in production for some time; often this is because load testing was performed with insufficient data loads, so the response-time performance of the data paths appeared satisfactory during testing.

For example, a database table scan on a table with a modest number of records can perform as well as a selection through an index. However, if the table grows significantly in production and a needed index is not in place, performance will degrade seriously.

Measure results appropriately

Do not use average response times for transactions as the absolute unit of measure for test results. Always consider Service Level Agreements (SLAs) in percentile terms. Load testing is not a precise science; consider the top percentile user or requestor experience. Review results in this light.

  •  For transaction-intensive solutions/applications ("heads-down" use), a recommended value is the 80th percentile.
  •  For mixed-use applications, use the 90th percentile.
  •  For ad hoc, infrequent use, the 95th percentile provides a more statistically relevant result set than the 100th percentile of the average.

Once you have understood and calculated the above, start running your solution against a specified goal. You can run multiple types of performance tests, such as load tests, scalability tests, and long-duration soak tests.

Typical Load Testing Graphs

[Image: LoadTestingTypes.png]

Response Times vs. Virtual Users (JMeter-Grafana)

[Image: LoadTestGraphs.png]

Monitor PDC for performance metrics related to system resources, alerts, and database queries. You can also set up an external monitoring tool such as Datadog to capture server health metrics. If Datadog is unavailable, use the sar and vmstat commands on Linux to track system metrics such as the following (a sketch follows the list):

  •  system.cpu
  •  system.io
  •  system.load
  •  system.mem
  •  …
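A minimal sketch of capturing the equivalent host metrics with standard Linux tools (the sampling intervals and counts are illustrative; sar is part of the sysstat package):

# CPU utilization: 12 samples at 5-second intervals
sar -u 5 12
# block device I/O statistics over the same window
sar -b 5 12
# memory, run queue (load), and swap snapshot every 5 seconds
vmstat 5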

Metrics captured in Datadog:

[Images: DataDogWeb.png, DataDogDB.png - Datadog web and database dashboards]

As a best practice, periodically revisit and repeat the steps in this performance checklist throughout the development phase of the solution, so that you deliver performant and reliable solutions.