Creating custom Apache Ruta scripts for entity extraction
Here are some examples of Apache Ruta scripts for custom entities. The full Apache UIMA Ruta™ Guide and Reference can be found here: https://uima.apache.org/d/ruta-current/tools.ruta.book.html
US states and abbreviations
The following Apache Ruta script extracts the US state names and abbreviations.
    PACKAGE uima.ruta.example;

    DECLARE VarA;
    DECLARE VarB;
    DECLARE VarC;

    // State abbreviations - must be capitalized
    W{REGEXP("(AL|AK|AZ|AR|CA|CO|CT|DE|FL|GA|HI|ID|IL|IN|IA|KS|KY|LA|ME|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|OH|OK|OR|PA|RI|SC|SD|TN|TX|UT|VT|VA|WA|WV|WI|WY)") -> MARK(EntityType,1)};

    // Single-word state names
    W?{REGEXP("(?i)(alabama|alaska|arizona|arkansas|california|colorado|connecticut|delaware|florida|georgia|hawaii|idaho|illinois|indiana|iowa|kansas|kentucky|louisiana|maine|maryland|massachusetts|michigan|minnesota|mississippi|missouri|montana|nebraska|nevada|ohio|oklahoma|oregon|pennsylvania|tennessee|texas|utah|vermont|virginia|washington|wisconsin|wyoming)") -> MARK(EntityType,1)};

    // North Carolina and North Dakota
    W?{REGEXP("(?i)(north)")} SPACE*? W?{REGEXP("(?i)(carolina|dakota)") -> MARK(EntityType,1,2,3)};

    // South Carolina and South Dakota
    W?{REGEXP("(?i)(south)")} SPACE*? W?{REGEXP("(?i)(carolina|dakota)") -> MARK(EntityType,1,2,3)};

    // West Virginia
    W?{REGEXP("(?i)(west)")} SPACE*? W?{REGEXP("(?i)(virginia)") -> MARK(EntityType,1,2,3)};

    // New York, New Jersey, New Mexico, New Hampshire
    W?{REGEXP("(?i)(new)")} SPACE*? W?{REGEXP("(?i)(hampshire|york|jersey|mexico)") -> MARK(EntityType,1,2,3)};

    // Rhode Island
    W?{REGEXP("(?i)(rhode)")} SPACE*? W?{REGEXP("(?i)(island)") -> MARK(EntityType,1,2,3)};
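Before loading a script like this, it can help to try the patterns on sample text outside Ruta. The following is a minimal Python sketch, not part of the product: it assumes Python's `re` module behaves like Ruta's Java-based regular expressions for these simple patterns, it requires at least one whitespace character between the two words, and it lists only a handful of abbreviations to stay short. The names `TWO_WORD_STATE`, `ABBREVIATION`, and `find_states` are hypothetical.

```python
import re

# Two-word state names: a first word paired only with its allowed second
# words, matched case-insensitively like the (?i) rules in the script.
TWO_WORD_STATE = re.compile(
    r"\b(?:(?:north|south)\s+(?:carolina|dakota)"
    r"|west\s+virginia"
    r"|new\s+(?:hampshire|york|jersey|mexico)"
    r"|rhode\s+island)\b",
    re.IGNORECASE,
)

# Sample subset of abbreviations (assumption: shortened for illustration).
# No IGNORECASE flag, so only capitalized abbreviations match.
ABBREVIATION = re.compile(r"\b(?:NY|NJ|NC|ND|WV|RI)\b")


def find_states(text):
    """Return (two-word state names, sample abbreviations) found in text."""
    names = [m.group(0) for m in TWO_WORD_STATE.finditer(text)]
    abbreviations = [m.group(0) for m in ABBREVIATION.finditer(text)]
    return names, abbreviations
```

Calling `find_states` on a sentence that contains "New York" and "NY" returns both, while a lowercase "ny" is ignored because the abbreviation pattern is case-sensitive.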
US states
The following Apache Ruta script extracts the US state names.
    PACKAGE uima.ruta.example;

    DECLARE VarA;
    DECLARE VarB;
    DECLARE VarC;

    W?{REGEXP("(?i)(north|east|south|west|new|alabama|alaska|arizona|arkansas|california|colorado|connecticut|delaware|florida|georgia|hawaii|idaho|illinois|indiana|iowa|kansas|kentucky|louisiana|maine|maryland|massachusetts|michigan|minnesota|mississippi|missouri|montana|nebraska|nevada|ohio|oklahoma|oregon|pennsylvania|rhode|tennessee|texas|utah|vermont|virginia|washington|wisconsin|wyoming)")} SPACE*? W?{REGEXP("(?i)(hampshire|york|jersey|island|mexico|carolina|dakota|virginia)") -> MARK(EntityType,1,2,3)};
US state abbreviations
The following Apache Ruta script extracts the abbreviations of the US states.
    PACKAGE uima.ruta.example;

    DECLARE VarA;
    DECLARE VarB;
    DECLARE VarC;

    W{REGEXP("(AL|AK|AZ|AR|CA|CO|CT|DE|FL|GA|HI|ID|IL|IN|IA|KS|KY|LA|ME|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|OH|OK|OR|PA|RI|SC|SD|TN|TX|UT|VT|VA|WA|WV|WI|WY)") -> MARK(EntityType,1)};
Year
The following Apache Ruta script extracts a 4-digit year between 1900 and 2099.
    PACKAGE uima.ruta.example;

    NUM{REGEXP("19..|20..") -> CREATE(EntityType,1,5,"entityType" = "YEAR")};
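The year pattern can be checked against sample text with a short Python sketch. This is an illustration only: it assumes the REGEXP condition has to cover the whole number token, which is modeled here with lookarounds that reject longer digit runs. The names `YEAR` and `find_years` are hypothetical.

```python
import re

# Four digits starting with 19 or 20. The lookarounds reject matches that
# are embedded in a longer run of digits, approximating a whole NUM token.
YEAR = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")


def find_years(text):
    """Return every standalone 4-digit year from 1900 to 2099 in text."""
    return [m.group(0) for m in YEAR.finditer(text)]
```

With this sketch, "1999" and "2024" are found, while "1899" and a longer number such as "120245" are not.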
Vehicle identification number (VIN)
The following Apache Ruta script extracts a vehicle identification number in the following format: one digit, four letters, two digits, four letters, and six digits. For example, 1HGBH41JXMN109186.
    PACKAGE uima.ruta.example;

    DECLARE VarA;
    DECLARE VarB;
    DECLARE VarC;

    // One digit, four letters, two digits, four letters, six digits
    NUM{REGEXP(".")} W{REGEXP("....")} NUM{REGEXP("..")} W{REGEXP("....")} NUM{REGEXP("......") -> MARK(EntityType,1,5)};
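The five-part format described above can be expressed as a single regular expression for quick testing. The following Python sketch is an approximation under stated assumptions: it treats Ruta's `W` tokens as runs of ASCII letters and `NUM` tokens as runs of digits, and it covers only the format described in this section, not every valid VIN. The names `VIN_FORMAT` and `find_vins` are hypothetical.

```python
import re

# digit, 4 letters, 2 digits, 4 letters, 6 digits (17 characters in total).
# The lookarounds keep the match from starting or ending inside a longer
# alphanumeric string.
VIN_FORMAT = re.compile(
    r"(?<![A-Za-z0-9])\d[A-Za-z]{4}\d{2}[A-Za-z]{4}\d{6}(?![A-Za-z0-9])"
)


def find_vins(text):
    """Return every substring of text that follows the 1-4-2-4-6 format."""
    return [m.group(0) for m in VIN_FORMAT.finditer(text)]
```

The example value 1HGBH41JXMN109186 splits into 1, HGBH, 41, JXMN, and 109186, so it matches; dropping the final digit leaves only five trailing digits and the match fails.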
9-digit account number
The following examples show different ways of creating the entity and recognizing a 9-digit account number. For example, Account:123450000
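    // Option 1: mark a temporary annotation, mark the entity, then remove the temporary annotation
    PACKAGE uima.ruta.example;
    DECLARE VarA;
    NUM{REGEXP(".........") -> MARK(VarA), MARK(EntityType,1,1), UNMARK(VarA)};

    // Option 2: create the entity and set its entityType feature
    PACKAGE uima.ruta.example;
    NUM{REGEXP(".........") -> CREATE(EntityType,1,"entityType" = "AccNum")};

    // Option 3: mark the entity directly
    PACKAGE uima.ruta.example;
    NUM{REGEXP(".........") -> MARK(EntityType,1)};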
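To see what an annotation for the account number would cover, the pattern can be tried in Python. This sketch is illustrative only: the dictionary keys `begin`, `end`, and `text` are hypothetical stand-ins for an annotation's span, `entityType` mirrors the feature set by the CREATE variant, and the lookarounds assume that the pattern must cover a whole 9-digit number.

```python
import re

# Exactly nine digits, not part of a longer run of digits.
ACCOUNT_NUMBER = re.compile(r"(?<!\d)\d{9}(?!\d)")


def find_account_numbers(text):
    """Return one annotation-like record per 9-digit number in text."""
    return [
        {
            "begin": m.start(),      # offset of the first digit
            "end": m.end(),          # offset just past the last digit
            "entityType": "AccNum",  # label used by the CREATE variant
            "text": m.group(0),
        }
        for m in ACCOUNT_NUMBER.finditer(text)
    ]
```

For the input `Account:123450000`, the record covers offsets 8 to 17, which is the number without the `Account:` prefix.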
Phone number
The following Apache Ruta script extracts 7-digit and 10-digit US phone numbers, with or without an area code.
    // US phone number entity type
    PACKAGE uima.ruta.example;

    DECLARE VarA;
    DECLARE VarB;
    DECLARE VarC;

    // With area code and delimiter
    // 222-555-6789, 222.555.6789, 222 555 6789
    NUM{REGEXP("...")} ("-"|"."|SPACE)? NUM{REGEXP("...")} ("-"|"."|SPACE?)? NUM{REGEXP("....") -> CREATE(EntityType,1,5,"entityType" = "PhoneNumber")};

    // With area code, no delimiter
    // 2225556789
    NUM{REGEXP("..........") -> CREATE(EntityType,1,5,"entityType" = "PhoneNumber")};

    // No area code, with delimiter
    // 555.6789, 555-6789, 555 6789
    NUM{REGEXP("...")} ("-"|"."|SPACE?)? NUM{REGEXP("....") -> CREATE(EntityType,1,5,"entityType" = "PhoneNumber")};
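The three phone number layouts can be combined into one regular expression for testing on sample text. The following Python sketch is an approximation, not the product's behavior: it assumes that a delimiter (hyphen, period, or space) is present wherever the number is split into groups, because undelimited digits would form a single number token, and it uses lookarounds to avoid matching inside longer digit runs. The names `PHONE_NUMBER` and `find_phone_numbers` are hypothetical.

```python
import re

PHONE_NUMBER = re.compile(
    r"(?<!\d)(?:"
    r"\d{3}[-. ]\d{3}[-. ]\d{4}"  # area code with delimiters: 222-555-6789
    r"|\d{10}"                    # area code, no delimiter: 2225556789
    r"|\d{3}[-. ]\d{4}"           # no area code: 555-6789
    r")(?!\d)"
)


def find_phone_numbers(text):
    """Return every 7-digit or 10-digit phone number found in text."""
    return [m.group(0) for m in PHONE_NUMBER.finditer(text)]
```

The 10-digit alternatives are listed before the 7-digit one so that a full number is preferred over its leading seven digits.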