6. Hive Extensibility Features | Apache Hive Cookbook


Apache Hive Cookbook

By : Hanish Bansal, Saurabh Chauhan, Shrey Mehrotra


Overview of this book

Hive was developed by Facebook and later open sourced through the Apache community. It provides a SQL-like interface for running queries on Big Data frameworks. Its SQL-like syntax, known as HiveQL, includes SQL capabilities such as analytical functions, which are in high demand in today's Big Data world. This book provides easy installation steps for the different types of metastores supported by Hive, along with simple, easy-to-learn recipes for configuring Hive clients and services. You will also learn about Hive optimizations, including partitioning and bucketing. The book also explains the source code of the latest Hive version. HiveQL is used by other frameworks as well, including Spark; toward the end, you will cover the integration of Hive with these frameworks.



Analytics functions in Hive

Hive provides the following set of analytical functions:

RANK
DENSE_RANK
ROW_NUMBER
PERCENT_RANK
CUME_DIST
NTILE

The most commonly used analytical functions are the ranking functions, which rank the rows of a result set according to a given ordering scheme.
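As a rough sketch of the three distribution-style functions in the list above, the following query applies them to the sales dataset used in this recipe. It assumes only the fname and ip columns that appear in this recipe's queries; the alias names and the bucket count of 4 are illustrative choices.

```sql
-- Sketch only: assumes a Hive table named sales with columns fname and ip.
SELECT fname,
       ip,
       -- (rank - 1) / (total rows - 1), so the first row gets 0.0
       PERCENT_RANK() OVER (ORDER BY ip) AS pct_rank,
       -- fraction of rows whose ip is less than or equal to this row's ip
       CUME_DIST()    OVER (ORDER BY ip) AS cum_dist,
       -- splits the ordered rows into 4 buckets numbered 1 to 4
       NTILE(4)       OVER (ORDER BY ip) AS quartile
FROM sales;
```

All three functions take their ordering from the OVER clause, so changing the ORDER BY column changes how every row is scored.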

How to do it…

Let's analyze each function in detail. We will be using the same sales dataset and applying analytical functions to it:

ROW_NUMBER: This function assigns a unique sequential number to each row in the result set, following the ORDER BY clause within each partition (or across the whole result set when no PARTITION BY is given). For example, to assign a row number to each fname, ordered by IP address in the sales dataset, the query would be:
```
hive> select fname,ip,ROW_NUMBER() OVER (ORDER BY ip ) as rownum from sales;
```
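The query above numbers the whole result set in one sequence. If the numbering should instead restart for every IP address, a PARTITION BY clause can be added. This is a minimal sketch, assuming the same sales table; ordering by fname inside each partition is an illustrative choice.

```sql
-- Sketch only: assumes a Hive table named sales with columns fname and ip.
-- The row number restarts at 1 for each distinct ip value.
SELECT fname,
       ip,
       ROW_NUMBER() OVER (PARTITION BY ip ORDER BY fname) AS rownum
FROM sales;
```

With this form, the first fname (in alphabetical order) for every IP address gets rownum 1.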
RANK: This function is similar to ROW_NUMBER, but rows with equal values in the ORDER BY columns receive the same rank, and the ranks that follow skip ahead, leaving gaps. For example, if we use RANK in the previous query instead of ROW_NUMBER:
```
hive> select fname,ip,RANK() OVER (ORDER BY ip) as ranknum, RANK() OVER (PARTITION BY ip order...
```
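To see how ties are handled, it can help to run the three ranking functions side by side over the same ordering. The following is a minimal sketch, assuming the same sales table; the alias names are illustrative.

```sql
-- Sketch only: assumes a Hive table named sales with columns fname and ip.
SELECT fname,
       ip,
       -- always unique; the order among rows with the same ip is arbitrary
       ROW_NUMBER() OVER (ORDER BY ip) AS rownum,
       -- tied rows share a rank, and the next rank skips ahead (1, 1, 3)
       RANK()       OVER (ORDER BY ip) AS ranknum,
       -- tied rows share a rank, with no gaps afterwards (1, 1, 2)
       DENSE_RANK() OVER (ORDER BY ip) AS denseranknum
FROM sales;
```

For three rows where the first two share an IP address, ROW_NUMBER yields 1, 2, 3, RANK yields 1, 1, 3, and DENSE_RANK yields 1, 1, 2.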
