How to get the HDFS file size using WebHDFS?
WebHDFS
WebHDFS is a protocol based on an industry-standard RESTful mechanism. It provides the same functionality as HDFS, but over a REST interface, and it uses Kerberos (SPNEGO) and Hadoop delegation tokens for authentication. A few uses of WebHDFS are listed below:
- Users can connect to HDFS from outside the cluster.
- Users can interact with the data stored in HDFS from outside the cluster.
- It provides web service access to Hadoop components.
- It permits clients to access Hadoop from multiple languages without installing Hadoop; common tools such as curl and wget can be used to access HDFS (see the sketch after this list).
- File read and file write calls are redirected to the corresponding DataNodes.
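For instance, here is a minimal sketch of listing a directory over WebHDFS with curl. The host name, port, and path are placeholders for illustration, not values from a real cluster:

```
# list the status of the files under /tmp (LISTSTATUS is a standard WebHDFS operation)
curl -L --negotiate -u:test_user "http://namenode.example.com:50070/webhdfs/v1/tmp?op=LISTSTATUS"
```

The response is a JSON FileStatuses object describing each entry in the directory.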
Curl command to get HDFS file size using WebHDFS
Since WebHDFS is a REST API, we can use the curl command to access it. Let's assume that we have a Hive table in HDFS. Since Hive table data is stored as files in HDFS, we can use WebHDFS to get the size of those files.
In this example, we are going to submit an HTTP GET request to get the content summary of an HDFS file/directory.
Syntax of the curl command
```
curl -L --negotiate -u:<user_name> "<HOST>:<HTTP_PORT>/webhdfs/v1/<PATH>?op=GETCONTENTSUMMARY"
```
Here the -L option instructs curl to follow any redirects until it reaches the final destination. WebHDFS uses Kerberos SPNEGO for authentication; the --negotiate option activates SPNEGO authentication for the given user name.
After that, we pass the -u: option followed by the user name. Then we specify the host address and the port number where the NameNode is running. Finally, we append /webhdfs/v1/ followed by the HDFS file path and ?op=GETCONTENTSUMMARY.
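Note that --negotiate only works when the client already holds a valid Kerberos ticket. A minimal sketch of obtaining one is shown below; the principal and realm are assumptions for this example, not values from the cluster used in this article:

```
# obtain a Kerberos ticket so curl's --negotiate (SPNEGO) option can authenticate
kinit test_user@EXAMPLE.COM
# verify that the ticket was granted
klist
```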
Example of WebHDFS using the curl command:
The Hive table used in this example is prime_customers, which resides in the banking database. First, we get the location of that table using the DESC FORMATTED command in Hive, as below.
```
hive> desc formatted banking.prime_customers;

# Detailed Table Information
Database:           banking
Owner:              test_user
CreateTime:         Wed May 05 21:33:06 PDT 2021
LastAccessTime:     UNKNOWN
Protect Mode:       None
Retention:          0
Location:           hdfs://apps/hive/warehouse/banking.db/prime_customers
Table Type:         MANAGED_TABLE
```
We also know the host name and port number needed to call the WebHDFS API:
```
http://pnbhdc0011.pnbltd.com:50020
```
Let's write the curl command with all these details to get the HDFS file size.
```
curl -L --negotiate -u:test_user "http://pnbhdc0011.pnbltd.com:50020/webhdfs/v1/apps/hive/warehouse/banking.db/prime_customers?op=GETCONTENTSUMMARY"
```
After running this curl command on the server, we get the HDFS file details in JSON format, as below.
```
{
  "ContentSummary": {
    "directoryCount": 1,
    "fileCount": 5000,
    "length": 5410176374,
    "quota": -1,
    "spaceConsumed": 16230529122,
    "spaceQuota": -1
  }
}
```
Here, length is the number of bytes used by the content, so it serves as the size of the HDFS file. Since it is a byte value, we can apply a simple formula to convert it into KB, MB, GB, or TB; for example, 5410176374 bytes / 1024³ ≈ 5.04 GB. Instead of the curl command, we can also write a Java program to perform all these operations from outside the HDFS cluster. In addition, WebHDFS can be used to perform various other operations on HDFS files. Please check here for more details.
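As a quick illustration of that conversion on the command line, here is a minimal sketch that extracts length from the response and prints it in GB. It reuses the host and path from the example above and assumes the jq and awk tools are available on the client machine:

```
# fetch the content summary, pull out the byte length, and convert it to GB
curl -s -L --negotiate -u:test_user \
  "http://pnbhdc0011.pnbltd.com:50020/webhdfs/v1/apps/hive/warehouse/banking.db/prime_customers?op=GETCONTENTSUMMARY" \
  | jq '.ContentSummary.length' \
  | awk '{ printf "%.2f GB\n", $1 / (1024 * 1024 * 1024) }'
```

For the response shown above, this prints 5.04 GB.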