Parsing Hive Create table query using Apache Hive library
Contents
- 1 Apache Hive
- 1.1 Abstract Syntax Tree (AST)
- 1.2 Java program to parse the Hive Create table query
- 1.2.1 Step 1: Set the Hive dependency in pom.xml
- 1.2.2 Step 2 : Assign the input query to the variable
- 1.2.3 Step 3 : Create a ParseDriver Object
- 1.2.4 Step 4 : Parse the input Sql
- 1.2.5 Step 5 : Print the Parse tree
- 1.2.6 Step 6 : Traverse the AST node
- 1.2.7 Step 7 : Loop through each Child node
- 1.2.8 Step 8: Create a HashMap variable for data type mapping
- 1.2.9 Step 9: Complete program code and execution
Apache Hive
Apache Hive is an open source data warehousing solution build on top of Hadoop. It supports SQL-like query language called as HiveQL. The query execution happens using Apache Hadoop MapReduce or Apache Tez frameworks. The Hive project details are available in GitHub. So we can use its library to parse the HiveQL. In this tutorial, we will write a Java program to parse the Hive Create table query.
Abstract Syntax Tree (AST)
Before executing the Query,Hive parser generate a syntactic structure of the sql. It is called as parse tree. The output of the parser is usually an Abstract Syntax Tree (AST). Hive uses Antlr to generate the abstract syntax tree (AST) for the query.
For the simple create table query, Hive parser generates the AST tree like below. Basically it split the sql into tokens such as TOK_CREATETABLE, TOK_TABNAME and so on. Then it build the AST tree based on those tokens.
1 2 3 |
create table <table_name> (<column_1> <data_type>, <column_2> <data_type>) |
Java program to parse the Hive Create table query
Consider that, we have a Hive create table query. From that query, we want to get the table name, column name and data type of the columns. It is possible with the help of Apache Hive library. Let’s start writing the java program to parse the Create table statement.
Step 1: Set the Hive dependency in pom.xml
To add the Hive libraries in the program, we are creating the Maven project in IntelliJ IDEA. Then we need to add the Hive dependency in the pom.xml file. Here we are using Hive version 2.3.7. Based on the requirement, we can change the Hive version in this file. The maven tool download/include the relevant jar files in our project.
1 2 3 4 5 |
<dependency> <groupId>org.apache.hive</groupId> <artifactId>hive-exec</artifactId> <version>2.3.7</version> </dependency> |
Step 2 : Assign the input query to the variable
In the main function, we are assigning the Hive create table query to the variable. We will parse this query in the following steps.
1 |
String input ="CREATE TABLE Test_Cust (Cust_id int,Name string)"; |
Step 3 : Create a ParseDriver Object
The ParseDriver class has the parse method to process the Hive QL. Let’s create a ParseDriver object as below
1 |
ParseDriver pd = new ParseDriver(); |
Step 4 : Parse the input Sql
Using the ParserDriver object, we can call the parse method. Let’s call this method with the input of Create table Sql, It will return the Abstract Syntax Tree (AST). In Hive library, it is mapped to a class ASTNode.
1 |
ASTNode tree = pd.parse(input); |
Step 5 : Print the Parse tree
In this step, we are just printing the input and ASTNode. The node class has a function dump which is returning the below parse tree.
1 2 3 |
System.out.println("Input Hive Sql : " + input); //Printing AST tree System.out.println(tree.dump()); |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
Input Hive Sql : CREATE TABLE Test_Cust (Cust_id int,Name string) nil TOK_CREATETABLE TOK_TABNAME Test_Cust TOK_LIKETABLE TOK_TABCOLLIST TOK_TABCOL Cust_id TOK_INT TOK_TABCOL Name TOK_STRING <EOF> |
Step 6 : Traverse the AST node
As we shown in the earlier tree diagram, set of ASTNodes linked together and created a AST tree. We need to traverse the tree to get the required details.
The parent node of this tree is nil. It has two children such as TOK_CREATETABLE & <EOF>. To get the table name, column name and data type, we need to parse the child of node TOK_CREATETABLE. For that we created a separate function as traverseASTNode.
1 2 3 4 5 6 7 8 |
if(tree.getChildren() != null){ for(Node eachNode : tree.getChildren()){ ASTNode astNode = (ASTNode) eachNode; //Calling the traverseASTNode function to loop through the child nodes if(astNode.getChildren() != null) traverseASTNode(astNode.getChildren()); } } |
Step 7 : Loop through each Child node
We passed the child nodes of TOK_CREATETABLE to function traverseASTNode. To iterate each node, we write a for loop. Then we are checking the node type of the parent node. Based on the node type such as TOK_TABNAME & TOK_TABCOL, we are getting the table name and column name from the current node.
The HiveParser class maintaining each node type values. For this example, we used only HiveParser.TOK_TABNAME & HiveParser.TOK_TABCOL in the switch case statement.
Also to get the data type of the column, we used only HiveParser.TOK_INT & HiveParser.TOK_STRING. These values are mapped for int and string data types. For other data types, we need to include respective value here.
Please note that if any of the node has a child node, we are calling the function traverseASTNode again at the end. It is a recursive function in the program.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 |
public static void traverseASTNode(List<Node> astNodeList){ for(Node eachNode : astNodeList){ ASTNode astNode = (ASTNode) eachNode; //Get the node type of parent AST node int parentNodeType = astNode.getParent().getType(); String tableName,columnText; int NodeType; //Validate the node type to get the SQL details switch (parentNodeType){ case HiveParser.TOK_TABNAME: tableName = astNode.getToken().getText(); System.out.println("Table Name : " + tableName); break; case HiveParser.TOK_TABCOL: //Get the current AST node type NodeType = astNode.getType(); //Get the Token text of the AST node for the column Name columnText = astNode.getText(); // Checking the parentNodeType which corresponds to DataType if(NodeType == HiveParser.TOK_INT || NodeType == HiveParser.TOK_STRING) System.out.println("Data Type : " + dataTypeMap.get(NodeType)); else{ System.out.println("Column Name : " + columnText); } break; } //Recursive function to traverse the child node of current AST node if(astNode.getChildren() != null) traverseASTNode(astNode.getChildren()); } } |
Step 8: Create a HashMap variable for data type mapping
Also we created a Hashmap variable dataTypeMap which holds the key value pair of Hive parser value and corresponding data type. To get the data type of the column, we used this variable in the previous function traverseASTNode => dataTypeMap.get(NodeType)
1 2 3 4 5 6 7 8 |
private static Map<Integer, String> dataTypeMap; static { dataTypeMap = new HashMap<>(); dataTypeMap.put(HiveParser.TOK_INT, "int"); dataTypeMap.put(HiveParser.TOK_STRING, "string"); dataTypeMap.put(HiveParser.TOK_DATE, "date"); } |
Step 9: Complete program code and execution
We combined all the above steps and presenting the complete program to parse the Hive Create table query.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 |
import org.apache.hadoop.hive.ql.lib.Node; import org.apache.hadoop.hive.ql.parse.ASTNode; import org.apache.hadoop.hive.ql.parse.HiveParser; import org.apache.hadoop.hive.ql.parse.ParseDriver; import org.apache.hadoop.hive.ql.parse.ParseException; import java.util.HashMap; import java.util.List; import java.util.Map; public class TestParseHiveSql { // Declaring the static map for Hive Data Type private static Map<Integer, String> dataTypeMap; static { dataTypeMap = new HashMap<>(); dataTypeMap.put(HiveParser.TOK_INT, "int"); dataTypeMap.put(HiveParser.TOK_STRING, "string"); dataTypeMap.put(HiveParser.TOK_DATE, "date"); } public static void main(String[] args) { // Input Create Table Statement String input ="CREATE TABLE Test_Cust (Cust_id int,Name string)"; //Create a ParseDriver object ParseDriver pd = new ParseDriver(); try { //Get Abstract Syntax Tree(AST) by parsing the input Hive Sql ASTNode tree = pd.parse(input); System.out.println("Input Hive Sql : " + input); //Printing AST tree System.out.println(tree.dump()); // Start traverse the AST tree if it has children AST node if(tree.getChildren() != null){ for(Node eachNode : tree.getChildren()){ ASTNode astNode = (ASTNode) eachNode; //Calling the traverseEachNode function to loop through the child nodes if(astNode.getChildren() != null) traverseASTNode(astNode.getChildren()); } } } catch (ParseException e) { throw new RuntimeException(e); } } public static void traverseASTNode(List<Node> astNodeList){ for(Node eachNode : astNodeList){ ASTNode astNode = (ASTNode) eachNode; //Get the node type of parent AST node int parentNodeType = astNode.getParent().getType(); String tableName,columnText; int NodeType; //Validate the node type to get the SQL details switch (parentNodeType){ case HiveParser.TOK_TABNAME: tableName = astNode.getToken().getText(); System.out.println("Table Name : " + tableName); break; case HiveParser.TOK_TABCOL: //Get the current AST node type NodeType = astNode.getType(); //Get the Token text of the AST node for the column Name columnText = astNode.getText(); // Checking the parentNodeType which corresponds to DataType if(NodeType == HiveParser.TOK_INT || NodeType == HiveParser.TOK_STRING) System.out.println("Data Type : " + dataTypeMap.get(NodeType)); else{ System.out.println("Column Name : " + columnText); } break; } //Recursive function to traverse the child node of current AST node if(astNode.getChildren() != null) traverseASTNode(astNode.getChildren()); } } } |
Output
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
Input Hive Sql : CREATE TABLE Test_Cust (Cust_id int,Name string) nil TOK_CREATETABLE TOK_TABNAME Test_Cust TOK_LIKETABLE TOK_TABCOLLIST TOK_TABCOL Cust_id TOK_INT TOK_TABCOL Name TOK_STRING <EOF> Table Name : Test_Cust Column Name : Cust_id Data Type : int Column Name : Name Data Type : string |
After executing the program, we can see that the table name, column name and data type are fetched from the Hive CREATE TABLE query. Those details are printed in the console.
Recommended Articles