Creating custom Ruta scripts for entity extraction

From PegaWiki


Description: Creating custom entities with Ruta scripts
Version as of: 7.4
Application: Pega Customer Service
Capability/Industry Area: Chat and Messaging



Creating custom Apache Ruta scripts for entity extraction

The following examples show Apache Ruta scripts that extract custom entities. For the full language reference, see the Apache UIMA Ruta™ Guide and Reference: https://uima.apache.org/d/ruta-current/tools.ruta.book.html

US states and abbreviations

The following Apache Ruta script extracts the US state names and abbreviations.

PACKAGE uima.ruta.example;

DECLARE VarA;
DECLARE VarB;
DECLARE VarC;

//State abbreviations - must be capitalized
W{REGEXP("(AL|AK|AZ|AR|CA|CO|CT|DE|FL|GA|HI|ID|IL|IN|IA|KS|KY|LA|ME|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|OH|OK|OR|PA|RI|SC|SD|TN|TX|UT|VT|VA|WA|WV|WI|WY)") ->MARK(EntityType,1)};


// Single word state names
W?{REGEXP("(?i)(alabama|alaska|arizona|arkansas|california|colorado|connecticut|delaware|florida|georgia|hawaii|idaho|illinois|indiana|iowa|kansas|kentucky|louisiana|maine|maryland|massachusetts|michigan|minnesota|mississippi|missouri|montana|nebraska|nevada|ohio|oklahoma|oregon|pennsylvania|tennessee|texas|utah|vermont|virginia|washington|wisconsin|wyoming)") ->MARK(EntityType,1)};


// North Carolina and North Dakota
W?{REGEXP("(?i)(north)")}
SPACE*?
W?{REGEXP("(?i)(carolina|dakota)") ->MARK(EntityType,1,2,3)};


// South Carolina and South Dakota
W?{REGEXP("(?i)(south)")}
SPACE*?
W?{REGEXP("(?i)(carolina|dakota)") ->MARK(EntityType,1,2,3)};


//West Virginia
W?{REGEXP("(?i)(west)")}
SPACE*?
W?{REGEXP("(?i)(virginia)") ->MARK(EntityType,1,2,3)};


//New York, New Jersey, New Mexico, New Hampshire
W?{REGEXP("(?i)(new)")}
SPACE*?
W?{REGEXP("(?i)(hampshire|york|jersey|mexico)") ->MARK(EntityType,1,2,3)};


//Rhode Island
W?{REGEXP("(?i)(rhode)")}
SPACE*?
W?{REGEXP("(?i)(island)") ->MARK(EntityType,1,2,3)};
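The two-word rules above can be checked outside Ruta with an ordinary regular expression. The following is a minimal Python sketch, not part of the original script: it assumes that a word, optional whitespace, and a second word behave like the W, SPACE, and W rule elements, and the names TWO_WORD_STATE and find_two_word_states are hypothetical.

```python
import re

# Hypothetical single-regex approximation of the two-word rules above:
# a prefix word, optional whitespace, then a suffix valid for that prefix.
TWO_WORD_STATE = re.compile(
    r"(?i)\b(?:"
    r"(?:north|south)\s*(?:carolina|dakota)"
    r"|west\s*virginia"
    r"|new\s*(?:hampshire|york|jersey|mexico)"
    r"|rhode\s*island"
    r")\b"
)


def find_two_word_states(text):
    """Return every two-word state name found in text, as written."""
    return [m.group(0) for m in TWO_WORD_STATE.finditer(text)]


print(find_two_word_states("Moved from new york to North Dakota."))
# ['new york', 'North Dakota']
```

Because each prefix is paired only with its own suffixes, a combination such as "north york" is not reported.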

US states

The following Apache Ruta script extracts the US state names.

PACKAGE uima.ruta.example;

DECLARE VarA;
DECLARE VarB;
DECLARE VarC;

W?{REGEXP("(?i)(north|east|south|west|new|alabama|alaska|arizona|arkansas|california|colorado|connecticut|delaware|florida|georgia|hawaii|idaho|illinois|indiana|iowa|kansas|kentucky|louisiana|maine|maryland|massachusetts|michigan|minnesota|mississippi|missouri|montana|nebraska|nevada|ohio|oklahoma|oregon|pennsylvania|rhode|tennessee|texas|utah|vermont|virginia|washington|wisconsin|wyoming)")}
SPACE*?
W?{REGEXP("(?i)(hampshire|york|jersey|island|mexico|carolina|dakota|virginia)") ->MARK(EntityType,1,2,3)};
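Because this compact rule checks the first and second word against two independent lists, it may also accept pairs that are not state names. The following Python sketch illustrates that effect under the assumption that each REGEXP condition is evaluated on its own word only; FIRST, SECOND, VALID, and the helper functions are hypothetical names, and the single-word state names are left out of FIRST for brevity.

```python
import re

# Prefix words from the first REGEXP condition (single-word state names
# omitted for brevity) and the suffix words from the second condition.
FIRST = re.compile(r"(?i)(north|east|south|west|new|rhode)")
SECOND = re.compile(
    r"(?i)(hampshire|york|jersey|island|mexico|carolina|dakota|virginia)"
)

# The actual two-word US state names, for comparison.
VALID = {
    "north carolina", "north dakota", "south carolina", "south dakota",
    "west virginia", "new hampshire", "new york", "new jersey",
    "new mexico", "rhode island",
}


def passes_both_conditions(first, second):
    """True if each word fully matches its own list, checked independently."""
    return bool(FIRST.fullmatch(first)) and bool(SECOND.fullmatch(second))


def is_false_positive(first, second):
    """True if the pair passes both lists but is not a real state name."""
    name = f"{first} {second}".lower()
    return passes_both_conditions(first, second) and name not in VALID


print(is_false_positive("North", "York"))      # True: passes both lists, not a state
print(is_false_positive("North", "Carolina"))  # False: a real state
```

If such pairs matter for your data, the longer script with one rule per prefix is the safer choice.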

US state abbreviations

The following Apache Ruta script extracts the abbreviations of the US states.

PACKAGE uima.ruta.example;

DECLARE VarA;
DECLARE VarB;
DECLARE VarC;

W{REGEXP("(AL|AK|AZ|AR|CA|CO|CT|DE|FL|GA|HI|ID|IL|IN|IA|KS|KY|LA|ME|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|OH|OK|OR|PA|RI|SC|SD|TN|TX|UT|VT|VA|WA|WV|WI|WY)") ->MARK(EntityType,1)};
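The abbreviation pattern has no (?i) flag, so it is case-sensitive. The following Python sketch shows the consequence, assuming that the REGEXP condition must match the whole token; it uses a shortened, hypothetical subset of the list, and is_state_abbreviation is an illustrative name.

```python
import re

# Shortened, hypothetical subset of the abbreviation list above.
STATE_ABBREVIATION = re.compile(r"(AL|AK|CA|IN|NY|OR|TX)")


def is_state_abbreviation(token):
    """True only if the whole token is one of the uppercase abbreviations."""
    return STATE_ABBREVIATION.fullmatch(token) is not None


print(is_state_abbreviation("NY"))   # True
print(is_state_abbreviation("ny"))   # False: lowercase is rejected
print(is_state_abbreviation("in"))   # False: the English word "in" is not matched
print(is_state_abbreviation("NYC"))  # False: the whole token must match
```

Keeping the pattern case-sensitive avoids matching common lowercase words such as "in", "or", and "me", although the same words written in capitals still match.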

Year

The following Apache Ruta script extracts a 4-digit year between 1900 and 2099.

PACKAGE uima.ruta.example;

NUM{REGEXP("19..|20..") ->  CREATE(EntityType,1,5,"entityType" = "YEAR")};
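The year pattern can be tried out with Python's re module. This is a sketch, assuming that the REGEXP condition must match the whole token; YEAR_LOOSE is the pattern from the script, while YEAR_STRICT and is_year are hypothetical additions for comparison.

```python
import re

YEAR_LOOSE = re.compile(r"19..|20..")      # pattern from the script above
YEAR_STRICT = re.compile(r"(19|20)\d{2}")  # hypothetical tighter variant


def is_year(token, pattern=YEAR_LOOSE):
    """True if the whole token matches the given year pattern."""
    return pattern.fullmatch(token) is not None


print(is_year("1999"))               # True
print(is_year("2100"))               # False: outside 1900-2099
print(is_year("19ab"))               # True: "." accepts any character
print(is_year("19ab", YEAR_STRICT))  # False: only digits are accepted
```

In the Ruta script the wildcard is harmless because the rule is applied to NUM tokens, which contain only digits; the stricter variant matters only if the pattern is reused on arbitrary text.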

Vehicle identification number (VIN)

The following Apache Ruta script extracts a vehicle identification number in the following format: one digit, four letters, two digits, four letters, and six digits. For example, 1HGBH41JXMN109186.

PACKAGE uima.ruta.example;

DECLARE VarA;
DECLARE VarB;
DECLARE VarC;

NUM {REGEXP(".") }
W{REGEXP("....") }
NUM {REGEXP("..") }
W{REGEXP("....") }
NUM{REGEXP("......") ->MARK(EntityType,1,5)};
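To see how the five rule elements line up with the sample value, the following Python sketch splits a candidate into digit runs and letter runs and checks the overall format. It only approximates Ruta's NUM and W tokens, and VIN_FORMAT, split_segments, and matches_vin_format are hypothetical names.

```python
import re

# One digit, four letters, two digits, four letters, six digits.
VIN_FORMAT = re.compile(r"\d[A-Za-z]{4}\d{2}[A-Za-z]{4}\d{6}")


def split_segments(text):
    """Split text into alternating digit runs and letter runs."""
    return re.findall(r"\d+|[A-Za-z]+", text)


def matches_vin_format(text):
    """True if the whole text follows the format described above."""
    return VIN_FORMAT.fullmatch(text) is not None


print(split_segments("1HGBH41JXMN109186"))
# ['1', 'HGBH', '41', 'JXMN', '109186']
print(matches_vin_format("1HGBH41JXMN109186"))  # True
print(matches_vin_format("1HGBH41JXMN10918"))   # False: only five trailing digits
```

The five segments correspond, in order, to the five rule elements of the script.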

9-digit account number

The following examples show three different ways to create the entity for a 9-digit account number, for example, Account:123450000.

PACKAGE uima.ruta.example;
DECLARE VarA;
NUM{REGEXP(".........") -> MARK(VarA),  MARK(EntityType,1,1), UNMARK(VarA)};

PACKAGE uima.ruta.example;
NUM{REGEXP(".........") ->  CREATE(EntityType,1,"entityType" = "AccNum")};

PACKAGE uima.ruta.example;
NUM{REGEXP(".........") -> MARK(EntityType,1)};
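All three variants rely on the same check: a numeric token that is exactly nine characters long. The following Python sketch reproduces that check on plain text, using lookarounds to stand in for Ruta's token boundaries; ACCOUNT_NUMBER and find_account_numbers are hypothetical names.

```python
import re

# Exactly nine digits, not preceded or followed by another digit.
ACCOUNT_NUMBER = re.compile(r"(?<!\d)\d{9}(?!\d)")


def find_account_numbers(text):
    """Return every standalone 9-digit number in text."""
    return ACCOUNT_NUMBER.findall(text)


print(find_account_numbers("Account:123450000"))  # ['123450000']
print(find_account_numbers("Ref 1234500001"))     # []: ten digits, not nine
```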

Phone number

The following Apache Ruta script extracts 7-digit and 10-digit phone numbers.

//US Phone number entity type
PACKAGE uima.ruta.example;

DECLARE VarA;
DECLARE VarB;
DECLARE VarC;

//With area code and delimiter
//222-555-6789, 222.555.6789, 222 555 6789
NUM{REGEXP("...")}
("-"|"."|SPACE)?
NUM{REGEXP("...")}
("-"|"."|SPACE?)?
NUM{REGEXP("....") ->CREATE(EntityType,1,5,"entityType" = "PhoneNumber")};

//Area code, no delimiter
//2225556789
NUM{REGEXP("..........") ->CREATE(EntityType,1,5,"entityType" = "PhoneNumber")};

//No area code, with delimiter
//555.6789, 555-6789, 555 6789
NUM{REGEXP("...")}
("-"|"."|SPACE?)?
NUM{REGEXP("....") ->CREATE(EntityType,1,5,"entityType" = "PhoneNumber")};
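The formats listed in the script comments can be tested against sample text with a single regular expression. This Python sketch is an approximation rather than a translation of the Ruta rules: it covers only the commented example formats, and PHONE and find_phone_numbers are hypothetical names.

```python
import re

PHONE = re.compile(
    r"(?<!\d)(?:"
    r"\d{3}[-. ]\d{3}[-. ]\d{4}"  # 222-555-6789, 222.555.6789, 222 555 6789
    r"|\d{10}"                    # 2225556789
    r"|\d{3}[-. ]\d{4}"           # 555.6789, 555-6789, 555 6789
    r")(?!\d)"
)


def find_phone_numbers(text):
    """Return every phone number in text that uses one of the formats above."""
    return [m.group(0) for m in PHONE.finditer(text)]


print(find_phone_numbers("Call 222-555-6789 or 555.6789, or 2225556789."))
# ['222-555-6789', '555.6789', '2225556789']
```

The lookarounds keep longer digit runs, such as an 11-digit number, from being reported as a phone number.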