A user-defined function (UDF) is a means for a user to extend the native capabilities of Apache Spark™ SQL. Spark SQL has supported external user-defined functions written in Scala, Java, Python and R programming languages since 1.3.0. While external UDFs are very powerful, they also come with a few caveats:
- Security. A UDF written in an external language can execute dangerous or even malicious code. This requires tight control over who can create UDF.
- Performance. UDFs are black boxes to the Catalyst Optimizer. Given Catalyst is not aware of the inner workings of a UDF, it cannot do any work to improve the performance of the UDF within the context of a SQL query.
- SQL Usability. For a SQL user it can be cumbersome to write UDFs in a host language and then register them in Spark. Also, there is a set of extensions many users may want to make to SQL which are rather simple where developing an external UDF is overkill.
To cope with the above limitations, we are thrilled to introduce a new form of UDF: SQL UDFs. Available in DBR 9.1 LTS, the SQL UDF is completely defined with the expressive power of SQL and also completely transparent to the SQL Compiler.
Benefits of using SQL UDFs
SQL UDFs are simple yet powerful extensions to Spark SQL. As functions, they provide a layer of abstraction to simplify query construction – making SQL queries more readable and modularized. Unlike UDFs that are written in a non-SQL language, SQL UDFs are more lightweight for SQL users to create. SQL function bodies are transparent to the query optimizer thus making them more performant than external UDFs. SQL UDFs can be created as either temporary or permanent functions, be reused across multiple queries, sessions and users, and be access-controlled via Access Control Language (ACL). In this blog, we will walk you through some key use cases of SQL UDFs with examples.
SQL UDFs as constants
Let’s start with the most simplistic function imaginable: a constant. We all know we’re not supposed to use literals in our code because it harms readability and, who knows, maybe the constant doesn’t remain constant after all. So we want to be able to change it in one place only:
CREATE FUNCTION blue() RETURNS STRING COMMENT 'Blue color code' LANGUAGE SQL RETURN '0000FF'
If you are familiar with external UDFs, you can see there are some differences that stand out:
- A SQL UDF must define its parameter list, even if it’s empty. A constant takes no parameters.
- The function also declares the data type it will return. In this case that’s a STRING.
- The implementation of the function is part of the function definition.
- You specify LANGUAGE SQL to say that it’s a SQL UDF. But really, that’s not needed. The RETURN clause is enough of a give away that we decided to make this optional.
Beyond these differences there are many other things that are the same as external UDF:
- You can replace a function. More on that later.
- You can add a comment that describes the function – as shown above.
- You can even create a temporary function that you can use within the current session, only.
Let’s use the function:
SELECT blue(); 0000FF
Unsurprisingly this works. But what is happening under the hood?EXPLAIN SELECT blue(); == Physical Plan == *(1) Project [0000FF AS default.blue()#9] +- *(1) Scan OneRowRelation
This is neat! The SQL compiler replaced the function invocation with the constant itself.
That means at least this SQL UDF comes at zero cost to performance.
Now, let’s have a look at another common usage pattern.
SQL UDF encapsulating expressions
Imagine you don’t like the naming of some built-in functions. Maybe you are migrating lots of queries from another product, which has different function names and behaviors. Or perhaps you just can’t stand copy-pasting some lengthy expressions over and over again in your SQL queries. So, you want to fix that.
With SQL UDF, we can simply create a new function with the name we like:CREATE FUNCTION to_hex(x INT COMMENT 'Any number between 0 - 255') RETURNS STRING COMMENT 'Converts a decimal to a hexadecimal' CONTAINS SQL DETERMINISTIC RETURN lpad(hex(least(greatest(0, x), 255)), 2, 0)
Let’s have a look at what new syntax was used here:
- This function takes an argument, and the parameter is defined by a name, a type and an optional comment.
- The CONTAINS SQL clause is optional, but tells us the function does not read or modify any data in a table. It is the default setting, so you normally wouldn’t specify it.
- DETERMINISTIC is also optional and tells us that the function will always return the same result set given the same arguments. The clause is for documentation only at this point. But at some point in the future it may be used to block non deterministic functions in certain contexts.
- In the RETURN clause the parameter has been referred to by name. In more complex scenarios below you will see that the parameter can get disambiguated with the function name. Naturally you can use arbitrarily complex expressions as the function body.
Not only does it work …
SELECT to_hex(id) FROM range(2); 00 01
… but it works well:EXPLAIN SELECT to_hex(id) FROM range(2); == Physical Plan == *(1) Project [lpad(hex(cast(least(greatest(0, cast(id#0 as int)), 255) as bigint)), 2, 0) AS default.to_hex(id)#1] +- *(1) Range (0, 2, step=1, splits=4)
We can see that the physical plan shows a straight application of the functions lpad, hex, least and greatest. This is the same plan you get invoking the series of functions directly.
You can also compose SQL functions out of SQL functions:CREATE FUNCTION rgb_to_hex(r INT, g INT, b INT) RETURNS STRING COMMENT 'Converts an RGB color to a hex color code' RETURN CONCAT(to_hex(r), to_hex(g), to_hex(b)) SELECT rgb_to_hex(0, 0, 255); 0000FF
SQL UDF reading from tables
Another common usage of SQL UDF is to codify lookups. A simple lookup may be to decode RGB color codes into English color names:CREATE FUNCTION from_rgb(rgb STRING COMMENT 'an RGB hex color code') RETURNS STRING COMMENT 'Translates an RGB color code into a color name' RETURN DECODE(rgb, 'FF00FF', 'magenta', 'FF0080', 'rose'); SELECT from_rgb('FF0080'); rose
OK, but there are a lot more than two colors in this world. And we want this translation both ways, so these should really be in a lookup table:CREATE TABLE colors(rgb STRING NOT NULL, name STRING NOT NULL); INSERT INTO colors VALUES ('FF00FF', 'magenta'), ('FF0080', 'rose'), ('BFFF00', 'lime'), ('7DF9FF', 'electric blue'); CREATE OR REPLACE FUNCTION from_rgb(rgb STRING COMMENT 'an RGB hex color code') RETURNS STRING READS SQL DATA SQL SECURITY DEFINER COMMENT 'Translates an RGB color code into a color name' RETURN SELECT FIRST(name) FROM colors WHERE rgb = from_rgb.rgb; SELECT from_rgb(rgb) FROM VALUES('7DF9FF'), ('BFFF00') AS codes(rgb); electric blue lime
There are multiple new concepts applied here:
- You can REPLACE a SQL UDF. To be allowed to do that, the new function must match the old function’s signature. The signature of a function is defined as the number of its parameters and their types.
- This function looks up information in a table, so you can optionally document that using READS SQL DATA. If you state nothing the SQL Compiler will derive the correct value, but you must not lie and state CONTAINS SQL.
- SQL SECURITY DEFINER is another optional clause, which states that the query accessing the colors table will use the authorization of the function owner. So the function could be executed by the public without compromising the security of the table.
- Just as the function operates under the authorization of its owner it will always be parsed using the current database at time of creation.
- `rgb` is the name of the column in numbers. By qualifying the parameter as `from_rgb`.`rgb` you clarify that you mean the parameter reference, and not the column.
How does the physical plan look like now? It is easy to see that using an external UDF, which itself performs a query that would result in a nested loop join, is an awful way to burn precious resources.
EXPLAIN SELECT from_rgb(rgb) FROM VALUES ('7DF9FF'), ('BFFF00') AS codes(rgb); == Physical Plan == AdaptiveSparkPlan isFinalPlan=false +- Project [first(name)#1322268 AS default.from_rgb(rgb)#1322259] +- BroadcastHashJoin [rgb#1322261], [rgb#1322266], LeftOuter, BuildRight, false :- LocalTableScan [rgb#1322261] +- BroadcastExchange HashedRelationBroadcastMode(ArrayBuffer(input[1, string, false]),false), [id=#1437557] +- SortAggregate(key=[rgb#1322266], functions=[finalmerge_first(merge first#1322271, valueSet#1322272) AS first(name#1322267)()#1322260]) +- Sort [rgb#1322266 ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(rgb#1322266, 200), ENSURE_REQUIREMENTS, [id=#1437553] +- SortAggregate(key=[rgb#1322266], functions=[partial_first(name#1322267, false) AS (first#1322271, valueSet#1322272)]) +- Sort [rgb#1322266 ASC NULLS FIRST], false, 0 +- FileScan parquet default.colors[rgb#1322266,name#1322267]
In this case, Catalyst has chosen a broadcast hash join instead of a nested loop join. It can do this because it understands the content of the SQL UDF.
Thus far, all examples discussed used scalar-valued functions – ones that return a single value. That result may be of any type, even complex combinations of structs, arrays, and maps.There is also another type of UDF to discuss – the table-valued UDF.
SQL Table UDF
Imagine if views took arguments! You could encapsulate complex predicates even if they rely on user-provided values. A SQL Table UDF is just that: a view by any other name, except with parameters.
Let’s assume that the color mapping above is not unique. At the very least, we can assert the color names differ across languages.
Therefore the `from_rgb` function needs to be modified to return either an array of names or a relation.INSERT INTO colors VALUES ('BFFF00', 'citron vert'); CREATE OR REPLACE FUNCTION from_rgb(rgb STRING COMMENT 'an RGB hex color code') RETURNS TABLE(name STRING COMMENT 'color name') READS SQL DATA SQL SECURITY DEFINER COMMENT 'Translates an RGB color code into a color name' RETURN SELECT name FROM colors WHERE rgb = from_rgb.rgb;
As you can see, the only difference compared to a scalar function is a more complex RETURNS clause. Unlike views, SQL UDFs mandate a declaration of the returned relation’s signature:
- TABLE specifies that the function returns a relation.
- The TABLE clause must include a name for each return column and the column’s data type.
- You may optionally specify a comment for any return column.
User-defined table functions are new to DBR. Let’s have a look at how to invoke them.
SELECT * FROM from_rgb('7DF9FF'); electric blue
In its simplest form, a table function is invoked in the same way and the same places a view is referenced. The only difference are the mandatory braces, which include the function’s arguments. This function is invoked with literal arguments, but the arguments can be any expression, even scalar subqueries.
Most powerful, however, is the usage of SQL table UDF in a join, typically a correlated cross join:SELECT rgb, from_rgb.name FROM VALUES('7DF9FF'), ('BFFF00') AS codes(rgb), LATERAL from_rgb(codes.rgb); 7DF9FF electric blue BFFF00 lime BFFF00 citron vert
Here the arguments refer (correlate) to a preceding (lateral) relation in the FROM clause. The new LATERAL keyword gives Catalyst permission to resolve these columns. Also note that you can refer to the result of the table function by naming the columns as defined in the result signature and optionally qualified by the function name.
Naturally, SQL UDFs are fully supported by the existing GRANT, REVOKE, SHOW, DESCRIBE and DROP statements.
The statement worth pointing out in more detail is DESCRIBE.DESCRIBE FUNCTION from_rgb; Function: default.from_rgb Type: TABLE Input: rgb STRING Returns: name STRING
The basic describe returns what you might expect, but the extended DESCRIBE adds significantly more detail:DESCRIBE FUNCTION EXTENDED from_rgb; Function: default.from_rgb Type: TABLE Input: rgb STRING 'an RGB hex color code' Returns: name STRING 'color name' Comment: Translates an RGB color code into a color name Data Access: READS SQL DATA Configs: spark.sql.datetime.java8API.enabled=true spark.sql.hive.version=2.3.7 Owner: serge.rielau Create Time: Wed Sep 08 08:59:53 PDT 2021 Body: SELECT name FROM colors WHERE rgb = from_rgb.rgb
What we have described represents the initial functionality for SQL UDF. Future extensions we are pondering include support for:
- SQL PATH, so you can create a library of functions in a database and subscribe to them from another, just as you would do in your file system.
- Overloading of UDFs.
- UDFs with default values for parameters.
SQL UDFs are a big step forward in SQL usability and can be used in many different ways as outlined in this blog. We encourage you to think of even more creative ways to leverage SQL UDFs be it in Databricks SQL or using Photon for Data Engineering jobs. Try the notebook here and see the documentation for more information.