Sunday, January 5, 2014

HDInsight Service (Hadoop on Windows Azure) – Hive Partitions

This post will walk you through the process of creating, loading, and querying a partitioned Hive table via HDInsight.

Follow the steps here to create an HDInsight cluster and to install Windows Azure PowerShell.

Upload data to HDInsight Cluster

Download the file w3c_2MB.txt to your local drive, say to ‘C:\w3c_2MB.txt’.

Since HDInsight uses Azure Blob Storage as its distributed file storage (unlike non-cloud Hadoop clusters, which use the default HDFS implementation based on the local file system), you can choose your preferred tool to upload data. We will use Windows Azure PowerShell to upload the sample data.

Open Windows Azure PowerShell.

Run the command below; you will be prompted to log in to your Azure account:

Add-AzureAccount

Paste the following script into the Windows Azure PowerShell console window to run it.

$subscriptionName ="<your-subscription>"
$storageAccountName = "<your-storage-name>"
$containerName = "<your-cluster-name>"

$fileName ="C:\w3c_2MB.txt"
$blobName = "user/hdp/myexample/w3c_2MB.txt"

# Get the storage account key
Select-AzureSubscription $subscriptionName
$storageaccountkey = get-azurestoragekey $storageAccountName | %{$_.Primary}

# Create the storage context object
$destContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageaccountkey

# Copy the file from local workstation to the Blob container        
Set-AzureStorageBlobContent -File $fileName -Container $containerName -Blob $blobName -context $destContext
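
To confirm the upload, you can list the blobs in the container using the same storage context (a quick optional check; Get-AzureStorageBlob is part of the same Azure PowerShell module):

# Optional: verify that the blob landed in the container
Get-AzureStorageBlob -Container $containerName -Context $destContext | Select-Object Name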

Initialize Hive session in Windows Azure PowerShell

Since we are already using Windows Azure PowerShell, we will continue to use it to submit Hive jobs. Quite frankly, PowerShell is not the best tool for ad-hoc interaction with Hive. To debug Hive jobs, (as of now) it’s better to remote connect to the cluster and use the built-in Hive CLI. Even for automation, it’s simpler to submit jobs via the .NET SDK. The Hive support in Windows Azure PowerShell might come in handy if you already have an automated PowerShell script and you want to keep everything in that script.

Before you start running Hive queries via Windows Azure PowerShell (using the Invoke-Hive cmdlet), you need to select your cluster. Paste the following script into the Windows Azure PowerShell console window to run it.

$subscriptionName ="<your-subscription>" 
$clusterName = "<your-cluster-name>"

# Select the cluster
Select-AzureSubscription $subscriptionName
Use-AzureHDInsightCluster $clusterName 

Create Non-Partitioned Hive Table

The script below will create a new Hive table named ‘w3c’ and load it with the data that we uploaded to the blob in the previous section.

You should know that operations like create/delete execute faster when using New-AzureHDInsightHiveJobDefinition and Start-AzureHDInsightJob instead of Invoke-Hive (more details here). I chose to use Invoke-Hive for all the Hive queries in this post to keep things simple.
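
For reference, a minimal sketch of the job-definition approach might look like this (same cmdlets mentioned above; the DROP query is just a placeholder, and $clusterName is the cluster selected in the previous section):

# Sketch: submit a Hive job via a job definition instead of Invoke-Hive
$jobDefinition = New-AzureHDInsightHiveJobDefinition -Query "DROP TABLE IF EXISTS w3c;"
$job = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDefinition

# Block until the job finishes (timeout in seconds)
Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 3600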

# HiveQL query
$queryString = "DROP TABLE w3c;" + 
               "CREATE TABLE w3c(
        logdate string,  logtime string,  c_ip string,  
        cs_username string,  s_ip string, 
        s_port string,  cs_method string, cs_uri_stem string,  
        cs_uri_query string,  sc_status int,  
        sc_bytes int,  cs_bytes int,  time_taken int,  
        cs_agent string,  cs_Referrer string)
        row format delimited
        fields terminated by ' ';"  + 
        "LOAD DATA INPATH 'myexample/w3c_2MB.txt' OVERWRITE INTO TABLE w3c";


# Submit the job to the cluster 
Invoke-Hive -Query $queryString


Since we created an internal (managed) table, the file will be deleted from its original location and moved into the Hive warehouse.
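
If you want Hive to leave the file in place, an EXTERNAL table is the usual alternative. Here is a hedged sketch (same columns and delimiter; the table name w3c_ext and the LOCATION path are illustrative):

# Sketch: an EXTERNAL table reads the data in place instead of moving it
$queryString = "CREATE EXTERNAL TABLE w3c_ext(
        logdate string,  logtime string,  c_ip string,
        cs_username string,  s_ip string,
        s_port string,  cs_method string, cs_uri_stem string,
        cs_uri_query string,  sc_status int,
        sc_bytes int,  cs_bytes int,  time_taken int,
        cs_agent string,  cs_Referrer string)
        row format delimited
        fields terminated by ' '
        LOCATION '/user/hdp/myexample/';"
Invoke-Hive -Query $queryString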

Query the Non-Partitioned Hive Table

The script below will count the number of rows where the column cs_uri_stem equals ‘/Default.aspx’; Measure-Command wraps the call so we can time it.

$queryString = "SELECT COUNT(*) FROM w3c WHERE cs_uri_stem='/Default.aspx';"

# Submit the job to the cluster 
Measure-Command { Invoke-Hive -Query $queryString }

Create Partitioned Hive Table

The script below will create a new, partitioned Hive table named ‘w3c_partitioned’. Notice that compared to the table ‘w3c’, cs_uri_stem is not a regular column; it’s a partition column.

$queryString = "DROP TABLE w3c_partitioned;" + 
        "CREATE TABLE w3c_partitioned(
        logdate string, logtime string, c_ip string, 
        cs_username string,  s_ip string,  
        s_port string,  cs_method string,  cs_uri_query string,
        sc_status int,  sc_bytes int, cs_bytes int, time_taken int, 
        cs_agent string,  cs_Referrer string)
        PARTITIONED BY (cs_uri_stem string)
        row format delimited
        fields terminated by ' ';"

# Submit the job to the cluster 
Invoke-Hive -Query $queryString 
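
To see how Hive treats the partition column, you can describe the table; cs_uri_stem should show up in a separate partition-information section rather than with the regular columns:

# DESCRIBE FORMATTED lists cs_uri_stem under '# Partition Information'
Invoke-Hive -Query "DESCRIBE FORMATTED w3c_partitioned;"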

Insert data into the Partitioned Table

The following script will populate the partitions with data selected from the table w3c.

Notice that:

  • We’re inserting the rows where the cs_uri_stem cell equals ‘/Default.aspx’ into the partition cs_uri_stem='/Default.aspx'.
  • We’re inserting the rows where the cs_uri_stem cell equals ‘/Info.aspx’ into the partition cs_uri_stem='/Info.aspx'.
  • We’re inserting the rows where the cs_uri_stem cell equals ‘/UserService’ into the partition cs_uri_stem='/UserService'.
$queryString = "FROM w3c 
        INSERT OVERWRITE TABLE w3c_partitioned PARTITION (cs_uri_stem='/Default.aspx')
        SELECT w3c.logdate, w3c.logtime, w3c.c_ip, 
        w3c.cs_username, w3c.s_ip, 
        w3c.s_port, w3c.cs_method, w3c.cs_uri_query, 
        w3c.sc_status, w3c.sc_bytes, w3c.cs_bytes, w3c.time_taken, 
        w3c.cs_agent, w3c.cs_Referrer 
        WHERE cs_uri_stem='/Default.aspx';"


# Submit the job to the cluster 
Invoke-Hive -Query $queryString 

$queryString = "FROM w3c 
        INSERT OVERWRITE TABLE w3c_partitioned PARTITION (cs_uri_stem='/Info.aspx')
        SELECT w3c.logdate, w3c.logtime, w3c.c_ip, 
        w3c.cs_username, w3c.s_ip, 
        w3c.s_port, w3c.cs_method, w3c.cs_uri_query, 
        w3c.sc_status, w3c.sc_bytes, w3c.cs_bytes, w3c.time_taken, 
        w3c.cs_agent, w3c.cs_Referrer 
        WHERE cs_uri_stem='/Info.aspx';"


# Submit the job to the cluster 
Invoke-Hive -Query $queryString 

$queryString = "FROM w3c 
        INSERT OVERWRITE TABLE w3c_partitioned PARTITION (cs_uri_stem='/UserService')
        SELECT w3c.logdate, w3c.logtime, w3c.c_ip, 
        w3c.cs_username, w3c.s_ip, 
        w3c.s_port, w3c.cs_method, w3c.cs_uri_query, 
        w3c.sc_status, w3c.sc_bytes, w3c.cs_bytes, w3c.time_taken, 
        w3c.cs_agent, w3c.cs_Referrer 
        WHERE cs_uri_stem='/UserService';"


# Submit the job to the cluster 
Invoke-Hive -Query $queryString 
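
As an aside, Hive can also load all three partitions in a single pass with dynamic partitioning; a hedged sketch (the two SET statements are standard Hive settings that enable the feature, and the partition column moves to the end of the SELECT list):

# Sketch: dynamic partitioning derives the partition value from the last SELECT column
$queryString = "SET hive.exec.dynamic.partition=true;
        SET hive.exec.dynamic.partition.mode=nonstrict;
        FROM w3c
        INSERT OVERWRITE TABLE w3c_partitioned PARTITION (cs_uri_stem)
        SELECT w3c.logdate, w3c.logtime, w3c.c_ip,
        w3c.cs_username, w3c.s_ip,
        w3c.s_port, w3c.cs_method, w3c.cs_uri_query,
        w3c.sc_status, w3c.sc_bytes, w3c.cs_bytes, w3c.time_taken,
        w3c.cs_agent, w3c.cs_Referrer, w3c.cs_uri_stem;"

# Submit the job to the cluster
Invoke-Hive -Query $queryString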

 


Looking into the partitioned table in the Hive warehouse, you should notice that a dedicated folder has been added for each partition.
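
You can also list the partitions directly from Hive; each entry corresponds to one of those folders:

# SHOW PARTITIONS returns one entry per partition folder
Invoke-Hive -Query "SHOW PARTITIONS w3c_partitioned;"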


Query the Partitioned Hive Table

The script below will query the table w3c_partitioned for the number of rows where cs_uri_stem='/Default.aspx'.

$queryString = "SELECT COUNT(*) FROM w3c_partitioned WHERE cs_uri_stem='/Default.aspx';"

# Submit the job to the cluster 
Invoke-Hive -Query $queryString 


Since the table is partitioned by the column cs_uri_stem, instead of scanning the entire dataset, Hive scans only the partition '/Default.aspx'.
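
To make the comparison concrete, you can time the partitioned query the same way we timed the non-partitioned one:

# Time the partitioned query and compare with the earlier w3c timing
Measure-Command { Invoke-Hive -Query "SELECT COUNT(*) FROM w3c_partitioned WHERE cs_uri_stem='/Default.aspx';" }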

Delete the tables  

$queryString = "Drop table w3c;Drop table w3c_partitioned;" 

# Submit the job to the cluster 
Invoke-Hive -Query $queryString

Friday, January 3, 2014

HDInsight Emulator – Using Sqoop to import/export data from/to SQL Server Express

This post will walk you through the process of importing/exporting data from/to SQL Server Express via Sqoop.

Create dev box in the cloud (optional)

If you don’t want to install the HDInsight Emulator on your development machine, you can create a VM in the cloud via the Azure IaaS Virtual Machine service (given that you have an Azure subscription).

Open the Azure Management Portal, click on the Virtual Machines tab, click ‘New’, and select your preferred version of Visual Studio.


If you don’t need Visual Studio, you can install a VM with Windows Server.


Once your VM is ready, it will appear as ‘Running’ in the Virtual Machines tab. You can remote connect to the VM using the link in the portal.


For more details on Azure VMs, see this post, in which Scott Hanselman shows how he uses the Visual Studio image to get work done from his Surface 2 tablet.

Install HDinsight Emulator

The HDInsight Emulator is a single-node HDInsight deployment that allows developers to develop and debug jobs on their local development box. You can install the HDInsight Emulator via the Web Platform Installer from here. All the missing prerequisites will be detected and installed automatically!


Once the installation is complete, you will notice all the supported Hadoop services running as Windows services on your local machine.
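
One quick way to see them is from PowerShell (a hedged sketch; the exact service names vary between emulator versions, so the filter below is just a guess at a useful pattern):

# List Hadoop-related Windows services and their state
Get-Service | Where-Object { $_.DisplayName -match "hadoop|hive|oozie|templeton" } | Format-Table Name, Status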


Install and configure SQL Server Express

If you have a full SQL Server instance installed on your machine, you can skip to the next step. If you want to use the free SQL Express instance, some extra configuration is needed. Here’s how to install and configure SQL Express to make it work with Sqoop.

  • Open ‘Web Platform Installer’ (install from here)
  • Type ‘sql express’ in the search box
  • Click ‘Add’ next to ‘SQL Server Express 2008’ and ‘SQL Server Express 2008 Management Studio’


  • Set the password for the admin account (you will need to use it in the next step)


  • Open ‘SQL Server Configuration Manager’
  • Enable the TCP/IP protocol for SQLExpress.
    • Expand ‘SQL Server Network Configuration’
    • Click on ‘Protocols for SQLExpress’
    • Right-click on ‘TCP/IP’ and select ‘Enable’


  • Restart SQL Server service.
    • Click on ‘SQL Server Services’
    • Right-click on ‘SQL Server (SQLExpress)’ and select ‘Restart’


  • Start ‘SQL Server Browser’ service
    • Click on ‘SQL Server Services’
    • Right-click on ‘SQL Server Browser’ and select ‘Properties’
    • Change the ‘Start Mode’ to ‘Automatic’, click ‘Apply’, and close the properties window.
    • Start the service


Create Database and Table

  • Open ‘SQL Server Management Studio’ and connect to the ‘SQLExpress’ server (default)


  • Create a ‘New Query’


  • Run the following command to create a new database (sqoop) and table (w3c) schema:

CREATE DATABASE sqoop
GO

USE sqoop
GO

CREATE TABLE [dbo].[w3c](
    [logdate] nvarchar(150) NULL, [logtime] nvarchar(150) NULL,
    [c_ip] nvarchar(150) NULL, [cs_username] nvarchar(150) NULL,
    [s_ip] nvarchar(150) NULL, [s_port] nvarchar(150) NULL,
    [cs_method] nvarchar(150) NULL, [cs_uri_stem] nvarchar(150) NULL,
    [cs_uri_query] nvarchar(150) NULL, [sc_status] [bigint] NULL,
    [sc_bytes] [bigint] NULL, [cs_bytes] [bigint] NULL,
    [time_taken] [bigint] NULL, [cs_agent] nvarchar(150) NULL,
    [cs_referrer] nvarchar(150) NULL) ON [PRIMARY]

Upload data to HDFS

  • Download the file w3c_2MB.txt 
  • Upload the file from your local drive to HDFS under the folder ‘sqoop’
    • Open the ‘Hadoop Command Prompt’ window (shortcut should be available on your desktop), type:

   > hadoop fs -copyFromLocal w3c_2MB.txt ./sqoop/w3c_2MB.txt

    • Make sure that the file has been uploaded, type:

   > hadoop fs -lsr


Export from HDFS to SQL Express

  • Open the ‘Hadoop Command Prompt’ window (shortcut should be available on your desktop)
  • In the script below, replace:
    • <your-sql-password> with the password that you’ve chosen when you installed SQL Express
    • <your-user-name> with the name of your machine

If you are using a full SQL Server instance, use localhost instead of localhost\sqlexpress. Make sure that the specified user (sa) is enabled and has write permissions on your SQL Server. Notice that you must use SQL Server Authentication.

   > cd c:\Hadoop\sqoop-1.4.2\bin

   > sqoop export --connect "jdbc:sqlserver://localhost\sqlexpress;databaseName=sqoop;user=sa;password=<your-sql-password>" --table w3c --export-dir /user/<your-user-name>/sqoop -m 1 --input-fields-terminated-by " "

Note:

  • Make sure that the export completed successfully


  • Open ‘SQL Server Management Studio’ and Connect to ‘SQL Express’ server (default).
  • Create a ‘New Query’ and run:

USE sqoop
GO

SELECT * FROM w3c

Import from SQL Express back to HDFS

  • Open the ‘Hadoop Command Prompt’ window (shortcut should be available on your desktop)
  • In the script below, replace:
    • <your-sql-password> with the password that you’ve chosen when you installed SQL Express
    • <your-user-name> with the name of your machine

   > cd c:\Hadoop\sqoop-1.4.2\bin

   > sqoop import --connect "jdbc:sqlserver://localhost\sqlexpress;databaseName=sqoop;user=sa;password=<your-sql-password>" --table w3c --target-dir /user/<your-user-name>/sqoop/import --fields-terminated-by " " -m 1

  • Make sure that the import completed successfully:

    > hadoop fs -ls /user/<your-user-name>/sqoop/import


Import data from SQL Server Express directly to Hive table

  • Open the ‘Hadoop Command Prompt’ window (shortcut should be available on your desktop)
  • In the script below, replace:
    • <your-sql-password> with the password that you’ve chosen when you installed SQL Express

   > cd c:\Hadoop\sqoop-1.4.2\bin

   > sqoop import --connect "jdbc:sqlserver://localhost\sqlexpress;databaseName=sqoop;user=sa;password=<your-sql-password>" --table w3c --hive-table w3c_sqoop --create-hive-table --hive-import -m 1


  • Make sure that the import completed successfully.
  • Open the Hive CLI:

   > c:\Hadoop\hive-0.9.0\bin\hive

   hive> SELECT COUNT(*) FROM w3c_sqoop;
