Regexp_extract function in BigQuery with examples
Contents
Regexp_extract function in BigQuery
Regexp_extract function returns a substring from the source string that matches the regular expression pattern. If no matches are found, this function returns NULL.
Capturing group in regexp
A part of regular expression pattern can be enclosed in parentheses (…). This is called a “capturing group”.
- contains the capturing group – The function returns the substring that is matched by that capturing group.
- doesn’t contain a capturing group – The function returns the entire matching substring.
- has more than one capturing group – The function returns an error.
Syntax of regexp_extract function
1 |
REGEXP_EXTRACT(value, regexp[, position[, occurrence]]) |
- value – It is a source string.
- regexp – It is a regular expression that we want to match with the source string.
- position – It is an optional argument.
- If the position value is specified, the pattern search starts from that position in source string. Otherwise it starts from the beginning of the source string.
- The position must be a positive integer and cannot be 0. If the position value is greater than the length of the source string, NULL is returned.
- occurrence – It is an optional argument.
- If the occurrence is specified, it return a specific occurrence of regexp in source string. Otherwise it returns a first matching string.
- The occurrence must be a positive integer. If the occurrence is greater than the number of matches found, NULL is returned.
Example for regexp_extract function in BigQuery
Let’s understand the regexp_extract function with examples. Assume that we have a table called as merchant_accounts in BigQuery. It contains the merchant account details of customers who are using payment gateways such as Stripe,Square and so on.
As shown below, the table has the below records. Here rc_fin_test_tables is a dataset name in BigQuery.
Get user name from email id
The table merchant_accounts has a email id of the merchants. Our goal is to get the user name from email_id using regexp_extract function.
As we know that email id follows a pattern as user name + symbol ‘@’ + domain name. So we can write a regular expression to extract specific part of the email id. Let’s write a regular expression as ^[a-zA-Z0-9_.+-]+ which will return the user name from email id.
- ^ asserts position at start of a line
- [a-zA-Z0-9_.+-] matches a single character present in the list
- + matches the previous token between one and unlimited times, as many times as possible
- a-z matches a single character in the range between a and z (case sensitive)
- A-Z matches a single character in the range between A and Z (case sensitive)
- 0-9 matches a single character in the range between 0 and 9 (case sensitive)
- _.+- matches a single character in the list
1 2 3 4 5 6 7 |
SELECT merchant_id, merchant_name, email_id, REGEXP_EXTRACT(email_id,r'^[a-zA-Z0-9_.+-]+') AS user_name FROM rc_fin_test_tables.merchant_accounts; |
As we shown above, the function regexp_extract returned the user name from email id that matches the given regular expression pattern.
Example for capturing group : Get top level domain name from email id
In email id, the domain name is usually the name of the company or organization. Examples : deliveroo.co.uk and getjobber.com. It comes after the ‘@’ sign.
The last part of the domain name is called as top level domain name. Examples : co.uk and com. Let’s fetch the domain name using regexp_extract function. Then we will extract the top level domain name from that result.
In the regular expression, we have used the capturing group(…) to fetch domain name. The pattern is enclosed in parentheses after ‘@’ sign. So it returns only the domain name from the email id.
- ([a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)– capturing group in regexp
- \. match the dot(.) character
- $ asserts position at the end of a line
1 2 3 4 5 6 7 |
SELECT merchant_id, merchant_name, email_id, REGEXP_EXTRACT(email_id, r'^[a-zA-Z0-9_.+-]+@([a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)') AS domain_name FROM rc_fin_test_tables.merchant_accounts; |
As shown above, the function regexp_extract returned the domain name from the email id. To get the top level domain name, we need to use the capturing group(…) after dot character.
1 2 3 4 5 6 7 |
SELECT merchant_id, merchant_name, email_id, REGEXP_EXTRACT(email_id, r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.([a-zA-Z0-9-.]+$)') AS top_level_domain FROM rc_fin_test_tables.merchant_accounts; |
Position and Occurrence in regexp_extract function
In this example, we will write a regular expression with position and occurrence values. The table merchant_accounts has a column tax_id which consist of region code, region specific tax code and tax identification number.
Example: us_ein_849291.
- us – Region code as United States
- ein – Region specific tax code as Employer Identification Number.
- 849291 – tax identification number
Our goal is to get the region and region tax code from tax_id column in merchant_accounts table. Let’s write the query using regexp_extract function in BigQuery
Since the region and region tax code contains only the alphabets in lower case, we can write a regular expression as [a-z]+. Then using the position and occurrence in regexp_extract function, we can get either region code or region tax code.
Position : 1 & Occurrence : 1
1 2 3 4 5 6 7 |
SELECT merchant_id, merchant_name, tax_id, REGEXP_EXTRACT(tax_id, r'[a-z]+',1,1) AS region_code FROM rc_fin_test_tables.merchant_accounts; |
As shown above, we have used the position as 1 and occurrence as 1 in regular expression. So the function regexp_extract returned only the first occurrence of alphabets. Let’s change the occurrence to 2 which will return the region_tax_code.
Position : 1 & Occurrence : 2
1 2 3 4 5 6 7 |
SELECT merchant_id, merchant_name, tax_id, REGEXP_EXTRACT(tax_id, r'[a-z]+',1,2) AS region_tax_code FROM rc_fin_test_tables.merchant_accounts; |
As shown above, the regexp_extract function returned only the region tax code from tax_id column. We can get the same results by changing the position to 4 in regexp_extract function.
Position : 4 & Occurrence : 1
1 2 3 4 5 6 7 |
SELECT merchant_id, merchant_name, tax_id, REGEXP_EXTRACT(tax_id, r'[a-z]+',4,1) AS region_tax_code FROM rc_fin_test_tables.merchant_accounts; |
Since the position is 4 in regexp_extract function, the pattern search starts from that position in tax_id column. As a result, we got the region specific tax code as output.
Position : 4 & Occurrence : 2
As we mentioned earlier, if the occurrence is greater than number of matches found, the function returns NULL value.
1 2 3 4 5 6 7 |
SELECT merchant_id, merchant_name, tax_id, REGEXP_EXTRACT(tax_id, r'[a-z]+',4,2) AS region_tax_code FROM rc_fin_test_tables.merchant_accounts; |
Since there is no second match from position 4, the function regexp_extract returned the NULL value for tax_region_code.
Recommended Articles