Accessing External SQL Databases with JDBC (Beta)

Some of your data may already reside in an external SQL database. The PXF JDBC plug-in reads data stored in SQL databases including MySQL, ORACLE, PostgreSQL, and Hive.

This section describes how to use PXF with JDBC, including an example of creating and querying an external table that accesses data in a MySQL database table.

Prerequisites

Before accessing external SQL databases using HAWQ and PXF, ensure that:

  • The JDBC plug-in is installed on all cluster nodes. See Installing PXF Plug-ins for PXF plug-in installation information.
  • The JDBC driver JAR files for the external SQL database are installed on all cluster nodes.
  • The file locations of external SQL database JDBC JAR files are added to pxf-public.classpath. If you manage your HAWQ cluster with Ambari, add the JARS via the Ambari UI. If you managed your cluster from the command line, edit the /etc/pxf/conf/pxf-public.classpath file directly.

Querying External SQL Data

The PXF JDBC plug-in supports the single profile named Jdbc.

Use the following syntax to create a HAWQ external table representing external SQL database tables you access via JDBC: 

CREATE [READABLE | WRITABLE] EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<host>[:<port>]/<jdbc-schema-name>.<jdbc-database-name>.<jdbc-table-name>
    ?PROFILE=Jdbc[&<custom-option>=<value>[...]]')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

JDBC-plug-in-specific keywords and values used in the CREATE EXTERNAL TABLE call are described in the table below.

Keyword Value
<column-name> The name of the PXF external table column. The PXF <column-name> must exactly match the <column-name> used in the external SQL table.
<data-type> The data type of the PXF external table column. The PXF <data-type> must be equivalent to the data type used for <column-name> in the SQL table.
<host> The PXF host. While <host> may identify any PXF agent node, use the HDFS NameNode as it is guaranteed to be available in a running HDFS cluster. If HDFS High Availability is enabled, <host> must identify the HDFS NameService.
<port> The PXF port. If <port> is omitted, PXF assumes <host> identifies a High Availability HDFS Nameservice and connects to the port number designated by the pxf_service_port server configuration parameter value. Default is 51200.
<jdbc-schema-name> The schema name. The default schema name is default.
<jdbc-database-name> The database name. The default database name is determined by the external SQL server.
<jdbc-table-name> The table name.
PROFILE The PROFILE keyword must specify Jdbc.
<custom-option> The custom options supported by the Jdbc profile are discussed later in this section.
FORMAT ‘CUSTOM’ The JDBC CUSTOM FORMAT supports only the built-in 'pxfwritable_import' FORMATTER property.

Note: When creating PXF external tables, you cannot use the HEADER option in your FORMAT specification.

JDBC Custom Options

You include custom options in the LOCATION URI. Preface each option with an ampersand &.

The JDBC plug-in Jdbc profile supports the following <custom-option>s. The JDBC connection set up options are required.

Option Name Description
JDBC_DRIVER The JDBC driver class name. (Required)
DB_URL The URL to the database; includes the hostname, port, and database name. (Required)
USER The database user name. (Required)
PASS The database password for USER. (Required)
PARTITION_BY The partition column, <column-name>:<column-type>. The JDBC plug-in supports date, int, and enum <column-type>s. Use the yyyy-MM-dd format for the date <column-type>. A null PARTITION_BY defaults to a single fragment.
RANGE (Used only when PARTITION_BY is specified.) The query range, <start-value>[:<end-value>]. <end-value> may be empty for an int <column-type>. The RANGE is left closed, right open. That is, the range includes the <start-value> but does not include the <end-value>.
INTERVAL (Used only when PARTITION_BY is specified.) The interval, <interval-num>[:<interval-unit>], of one fragment. INTERVAL may be empty for an enum <column-type>. <interval-unit> may be empty for an int <column-type>.

Example JDBC <custom-option> connection string:

&JDBC_DRIVER=com.mysql.jdbc.Driver&DB_URL=jdbc:mysql://<dbhost>:<dbport>/testdb&USER=user1&PASS=changeme

When specifying the PARTITION_BY option, tune the INTERVAL value and unit based upon the optimal number of JDBC connections to the target database and the optimal distribution of fragments across HAWQ segments. The INTERVAL low boundary is driven by the number of HAWQ segments (default_hash_table_bucket_number), while the high boundary is driven by the acceptable number of JDBC connections to the target database. INTERVAL settings influence the number of fragments, and should ideally not be set too high nor too low. Testing with multiple values may help you select the optimal settings.

Example JDBC <custom-option> substrings identifying partitioning parameters:

&PARTITION_BY=year:int&RANGE=2011:2013&INTERVAL=1
&PARTITION_BY=createdate:date&RANGE=2013-01-01:2016-01-01&INTERVAL=1:month
&PARTITION_BY=color:enum&RANGE=red:yellow:blue

Example: Using the Jdbc Profile to Access a MySQL Database Table

Create a MySQL Table

Perform the following steps to create a MySQL table named mysql_table1 in the default schema of a database named mtestdb:

  1. Connect to the default MySQL database as the root user:

    $ mysql --user=root -p
    
  2. Create a database named mtestdb:

    mysql> CREATE DATABASE mtestdb;
    
  3. Assign privileges to mtestdb to a user named muser1 and then exit:

    mysql> GRANT ALL on mtestdb.* TO 'muser1'@'localhost' IDENTIFIED BY 'muser1';
    mysql> \q
    
  4. Connect to mtestdb as user muser1:

    $ mysql --user=muser1 mtestdb -pmuser1
    
  5. Create a table named mysql_table1 and insert some data into this table:

    mysql> CREATE TABLE mysql_table1(id int);
    mysql> INSERT INTO mysql_table1 VALUES (1);
    mysql> INSERT INTO mysql_table1 VALUES (2);
    mysql> INSERT INTO mysql_table1 VALUES (3);
    
  6. Determine the MySQL server hostname and port:

    mysql> SHOW VARIABLES LIKE 'hostname';
    +---------------+-----------------+
    | Variable_name | Value           |
    +---------------+-----------------+
    | hostname      | mydb.server.com |
    +---------------+-----------------+
    
    mysql> SHOW VARIABLES LIKE 'port';
    +---------------+-------+
    | Variable_name | Value |
    +---------------+-------+
    | port          | 3306  |
    +---------------+-------+
    1 row in set (0.00 sec)
    

    Make note of the hostname and port values returned.

  7. Construct the JDBC connection string, substituting your MySQL server hostname and port number:

    &JDBC_DRIVER=com.mysql.jdbc.Driver&DB_URL=jdbc:mysql://mydb.server.com:3306/mtestdb&USER=muser1&PASS=muser1
    

    Save this string for use later.

Configure PXF

If not already present on your system, download and copy the MySQL connector JAR file to your system and update pxf-public.classpath with the location of this file.

For example, if you manage your HAWQ cluster with Ambari, and the MySQL connector jar file is located in the /usr/share/java directory, add the following line:

/usr/share/java/mysql-connector*.jar

in the Ambari PXF service Configs > Advanced pxf-public.classpath pane and then use the Restart button to restart PXF.

If you manage your HAWQ cluster from the command line, directly edit the /etc/pxf/conf/pxf-public.classpath file, adding the MySQL connector JAR file path. Then restart PXF on each HAWQ node with the following command:

root@hawq-node$ sudo service pxf-service restart

Query Using the Jdbc Profile

Perform the following steps to create and query an external PXF table to access the mysql_table1 table you created in the previous section:

  1. Use the Jdbc profile to create an external table to access the MySQL mysql_table1 table. For example:

    gpadmin=# CREATE EXTERNAL TABLE pxf_jdbc_mysql_table1(id int)
                LOCATION ('pxf://<host>:51200/mtestdb.mysql_table1?PROFILE=JDBC&JDBC_DRIVER=com.mysql.jdbc.Driver&DB_URL=jdbc:mysql://<dbserver-host>:<port>/mtestdb&USER=hawquser1&PASS=hawquser1')
                FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
    

    Substitute your PXF <host>, as well as the DB_URL string you constructed in the previous exercise.

  2. Display all rows of the pxf_jdbc_mysql_table1 table:

    gpadmin=# SELECT * FROM pxf_jdbc_mysql_table1;
     id
    ----
      1
      2
      3
    (3 rows)