Regexp_extract function in BigQuery with examples

Regexp_extract function in BigQuery

Regexp_extract function returns a substring from the source string that matches the regular expression pattern. If no matches are found, this function returns NULL.

Capturing group in regexp

A part of regular expression pattern can be enclosed in parentheses (…). This is called a “capturing group”.

  • contains the capturing group – The function returns the substring that is matched by that capturing group.
  • doesn’t contain a capturing group – The function returns the entire matching substring.
  • has more than one capturing group – The function returns an error.

Syntax of regexp_extract function

  • value – It is a source string.
  • regexp – It is a regular expression that we want to match with the source string.
  • position – It is an optional argument.
    • If the position value is specified, the pattern search starts from that position in source string. Otherwise it starts from the beginning of the source string.
    • The position must be a positive integer and cannot be 0. If the position value is greater than the length of the source string, NULL is returned.
  • occurrence – It is an optional argument.
    • If the occurrence is specified, it return a specific occurrence of regexp in source string. Otherwise it returns a first matching string.
    • The occurrence must be a positive integer. If the occurrence is greater than the number of matches found, NULL is returned.

Example for regexp_extract function in BigQuery

Let’s understand the regexp_extract function with examples. Assume that we have a table called as merchant_accounts in BigQuery. It contains the merchant account details of customers who are using payment gateways such as Stripe,Square and so on.

As shown below, the table has the below records. Here rc_fin_test_tables is a dataset name in BigQuery.

regexp_extract function in BigQuery
regexp_extract function in BigQuery

Get user name from email id

The table merchant_accounts has a email id of the merchants. Our goal is to get the user name from email_id using regexp_extract function.

As we know that email id follows a pattern as user name + symbol ‘@’ + domain name. So we can write a regular expression to extract specific part of the email id. Let’s write a regular expression as ^[a-zA-Z0-9_.+-]+ which will return the user name from email id.

Regular expression for email id
Regular expression for email id
  • ^ asserts position at start of a line
  • [a-zA-Z0-9_.+-] matches a single character present in the list
  • + matches the previous token between one and unlimited times, as many times as possible
  • a-z matches a single character in the range between a and z (case sensitive)
  • A-Z matches a single character in the range between A and Z (case sensitive)
  • 0-9 matches a single character in the range between 0 and 9 (case sensitive)
  • _.+- matches a single character in the list
Extract user name from email id using regexp_extract
Extract user name from email id using regexp_extract

As we shown above, the function regexp_extract returned the user name from email id that matches the given regular expression pattern.

Example for capturing group : Get top level domain name from email id

In email id, the domain name is usually the name of the company or organization. Examples : deliveroo.co.uk and getjobber.com. It comes after the ‘@’ sign.

The last part of the domain name is called as top level domain name. Examples : co.uk and com. Let’s fetch the domain name using regexp_extract function. Then we will extract the top level domain name from that result.

In the regular expression, we have used the capturing group(…) to fetch domain name. The pattern is enclosed in parentheses after ‘@’ sign. So it returns only the domain name from the email id.

  • ([a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)– capturing group in regexp
  • \. match the dot(.) character
  • $ asserts position at the end of a line
Extract domain name using regexp_extract function in BigQuery
Extract domain name using regexp_extract function in BigQuery

As shown above, the function regexp_extract returned the domain name from the email id. To get the top level domain name, we need to use the capturing group(…) after dot character.

Extract top level domain using regexp_extract function in BigQuery
Extract top level domain using regexp_extract function in BigQuery

Position and Occurrence in regexp_extract function

In this example, we will write a regular expression with position and occurrence values. The table merchant_accounts has a column tax_id which consist of region code, region specific tax code and tax identification number.

Example: us_ein_849291.

  • us – Region code as United States
  • ein – Region specific tax code as Employer Identification Number.
  • 849291 – tax identification number

Our goal is to get the region and region tax code from tax_id column in merchant_accounts table. Let’s write the query using regexp_extract function in BigQuery

Since the region and region tax code contains only the alphabets in lower case, we can write a regular expression as [a-z]+. Then using the position and occurrence in regexp_extract function, we can get either region code or region tax code.

Position : 1 & Occurrence : 1

Position and Occurrence example in regexp_extract function
Position and Occurrence example in regexp_extract function

As shown above, we have used the position as 1 and occurrence as 1 in regular expression. So the function regexp_extract returned only the first occurrence of alphabets. Let’s change the occurrence to 2 which will return the region_tax_code.

Position : 1 & Occurrence : 2

regexp_extract function with position and occurrence value in BigQuery
regexp_extract function with position and occurrence value in BigQuery

As shown above, the regexp_extract function returned only the region tax code from tax_id column. We can get the same results by changing the position to 4 in regexp_extract function.

Position : 4 & Occurrence : 1

Since the position is 4 in regexp_extract function, the pattern search starts from that position in tax_id column. As a result, we got the region specific tax code as output.

change position to 4 to get region_tax_code using regexp_extract function
change position to 4 to get region_tax_code using regexp_extract function

Position : 4 & Occurrence : 2

As we mentioned earlier, if the occurrence is greater than number of matches found, the function returns NULL value.

Returns NULL from regexp_extract function in BigQuery
Returns NULL from regexp_extract function in BigQuery

Since there is no second match from position 4, the function regexp_extract returned the NULL value for tax_region_code.

Recommended Articles