Registering Files into HAWQ Internal Tables

The hawq register utility loads and registers HDFS data files or folders into HAWQ internal tables. Files can be read directly, rather than having to be copied or loaded, resulting in higher performance and more efficient transaction processing.

Data from the file or directory specified by <hdfsfilepath> is loaded into the appropriate HAWQ table directory in HDFS, and the utility updates the corresponding HAWQ metadata for the files. Either append-only (AO) or Parquet-formatted files in HDFS can be loaded into a corresponding table in HAWQ.

You can use hawq register either to:

  • Load and register Parquet-formatted file data generated by an external system such as Hive or Spark.
  • Recover cluster data from a backup cluster for disaster recovery.

Requirements for running hawq register on the server are:

  • All hosts in your HAWQ cluster (master and segments) must have network access between them and the hosts containing the data to be loaded.
  • The Hadoop client must be configured and the HDFS file path specified.
  • The files to be registered and the HAWQ table must be located in the same HDFS cluster.
  • The target table DDL is configured with the correct data type mapping.
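
As a quick check of these requirements, you can confirm that the data to be registered is reachable from the HAWQ hosts with a standard HDFS client command (the path shown is the sample path used later in this topic):

$ hdfs dfs -ls hdfs://localhost:8020/temp/hive.paq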

Registering Externally Generated HDFS File Data to an Existing Table

Files or folders in HDFS can be registered into an existing table, allowing them to be managed as a HAWQ internal table. When registering files, you can optionally specify the maximum amount of data to be loaded, in bytes, using the --eof option. If registering a folder, the actual file sizes are used.
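
For example, the following sketch limits the amount of data registered from a single file; the connection and file options follow the file-registration example later in this topic, and --eof is given a byte count as described above:

$ hawq register -d postgres -f hdfs://localhost:8020/temp/hive.paq --eof 4096 parquet_table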

Only HAWQ or Hive-generated Parquet tables are supported. Partitioned tables are not supported. Attempting to register an unsupported table results in an error.

Metadata for the Parquet file(s) and the destination table must be consistent. Different data types are used by HAWQ tables and Parquet files, so data must be mapped. You must verify that the structure of the Parquet files and the HAWQ table are compatible before running hawq register. Not all HIVE data types can be mapped to HAWQ equivalents. The currently-supported HIVE data types are: boolean, int, smallint, tinyint, bigint, float, double, string, binary, char, and varchar.
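
As an illustration of this mapping, the following sketch shows a hypothetical Hive source table and a compatible HAWQ target table, with string mapped to varchar, int to int4, and double to float8 per Table 2 below (the table and column names are examples only):

-- Hive definition of the source Parquet table (hypothetical)
CREATE TABLE hive_orders (id int, customer string, amount double) STORED AS PARQUET;

-- Compatible HAWQ target table for hawq register
CREATE TABLE hawq_orders (id int4, customer varchar, amount float8)
WITH (appendonly=true, orientation=parquet);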

As a best practice, create a copy of the Parquet file to be registered before running hawq register. You can then run hawq register on the copy, leaving the original file available for additional Hive queries or in case a data mapping error is encountered.
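
For example, a copy can be made with the standard HDFS client, and the copy registered instead of the original (the destination path is illustrative):

$ hdfs dfs -cp hdfs://localhost:8020/temp/hive.paq hdfs://localhost:8020/temp/hive_register.paq
$ hawq register -d postgres -f hdfs://localhost:8020/temp/hive_register.paq parquet_table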

Limitations for Registering Hive Tables to HAWQ

The following HIVE data types cannot be converted to HAWQ equivalents: timestamp, decimal, array, struct, map, and union.

Example: Registering a Hive-Generated Parquet File

This example shows how to register a HIVE-generated Parquet file in HDFS into the table parquet_table in HAWQ, in the database named postgres. The file path of the HIVE-generated file is hdfs://localhost:8020/temp/hive.paq.

In this example, the location of the database is hdfs://localhost:8020/hawq_default, the tablespace id is 16385, the database id is 16387, the table filenode id is 77160, and the last file under the filenode is numbered 7.

Run the hawq register command for the file location hdfs://localhost:8020/temp/hive.paq:

$ hawq register -d postgres -f hdfs://localhost:8020/temp/hive.paq parquet_table

After running the hawq register command, the corresponding new location of the file in HDFS is: hdfs://localhost:8020/hawq_default/16385/16387/77160/8.

The command updates the metadata of the table parquet_table in HAWQ, which is contained in the table pg_aoseg.pg_paqseg_77160. The pg_aoseg schema holds the segment metadata tables for row-oriented and Parquet append-only tables: for row-oriented tables, the table name prefix is pg_aoseg; for Parquet tables, the prefix is pg_paqseg. 77160 is the relation id of the table.
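
To confirm that the metadata was updated, you can inspect the segment metadata table directly (a simple check; the exact columns of pg_paqseg_* tables can vary by HAWQ version):

SELECT * FROM pg_aoseg.pg_paqseg_77160;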

You can locate the segment metadata table in one of two ways: by relation ID or by table name.

To find the relation ID from the table name, run the following query against the catalog table pg_class:

SELECT oid FROM pg_class WHERE relname = '$relname';

To find the segment metadata table name from the relation ID, run:

SELECT segrelid FROM pg_appendonly WHERE relid = $relid;

then run:

SELECT relname FROM pg_class WHERE oid = segrelid;
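
Applied to this example, the lookups can be written as follows (a sketch using the sample table name and relation id 77160 from above; the second statement combines the two queries into a single join):

SELECT oid FROM pg_class WHERE relname = 'parquet_table';
SELECT c.relname FROM pg_appendonly a JOIN pg_class c ON c.oid = a.segrelid WHERE a.relid = 77160;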

Registering Data Using Information from a YAML Configuration File

The hawq register command can register HDFS files using metadata loaded from a YAML configuration file, specified with the --config <yaml_config> option. Both AO and Parquet tables can be registered, and the tables need not exist in HAWQ before being registered. For disaster recovery, a YAML file created by the hawq extract command provides the metadata needed to re-create HAWQ tables from a backup checkpoint.

You can also use a YAML configuration file to append HDFS files to an existing HAWQ table or to create and register a new table in HAWQ.

For disaster recovery, tables can be re-registered using the HDFS files and a YAML file. This scenario assumes that data is periodically imported from Cluster A to Cluster B.

Data is registered according to the following conditions:

  • If a table already exists, the files are appended to the existing HAWQ table.
  • If a table does not exist, it is created and registered into HAWQ. The catalog table will be updated with the file size specified by the YAML file.
  • If the --force option is used, the data in existing catalog tables is erased and re-registered. All HDFS-related catalog contents in pg_aoseg.pg_paqseg_$relid are cleared. The original files on HDFS are retained.
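
For example, a forced re-registration might look like the following sketch, reusing the paq1 table and the paq1.yml file from the example later in this section:

hawq register --config paq1.yml --force paq1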

Tables using random distribution are preferred for registering into HAWQ.

There are additional restrictions when registering hash-distributed tables. The distribution policy in the YAML file must match that of the table being registered into, and the order of the files in the YAML file should reflect the hash distribution. The size of the registered file should be identical to or a multiple of the hash table bucket number.
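
The fragment below is an illustrative sketch only of how a YAML file might describe a hash-distributed table, with the distribution policy recorded and the files listed in hash order; the field names and layout here are assumptions, and the authoritative format is whatever hawq extract produces for your table and HAWQ version:

Distribution_Policy: DISTRIBUTED BY (a)
FileFormat: Parquet
Files:
  - path: /hawq_default/16385/16387/77160/1
    size: 1024
  - path: /hawq_default/16385/16387/77160/2
    size: 1024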

Only single-level partitioned tables can be registered into HAWQ.

Example: Registration using a YAML Configuration File

This example shows how to use hawq register to register HDFS data using a YAML configuration file generated by hawq extract.

First, create a table in SQL and insert some data into it.

=> CREATE TABLE paq1 (a int, b varchar(10)) WITH (appendonly=true, orientation=parquet);
=> INSERT INTO paq1 VALUES(generate_series(1,1000), 'abcde');

Extract the table metadata by using the hawq extract utility.

hawq extract -o paq1.yml paq1

Register the data into a new table, paq2, using the --config option to identify the YAML file.

hawq register --config paq1.yml paq2

Query the new table to verify that the content has been registered. The count should match the 1,000 rows inserted into paq1.

=> SELECT count(*) FROM paq2;

Data Type Mapping

HIVE and Parquet tables use different data types than HAWQ tables and must be mapped for metadata compatibility. You are responsible for verifying that your data is mapped to the appropriate HAWQ data types before running hawq register. The tables below show equivalent data types, where available.

Table 1. HAWQ to Parquet Mapping

HAWQ Data Type                Parquet Data Type
bool                          boolean
int2/int4/date                int32
int8/money                    int64
time/timestamptz/timestamp    int64
float4                        float
float8                        double
bit/varbit/bytea/numeric      Byte array
char/bpchar/varchar/name      Byte array
text/xml/interval/timetz      Byte array
macaddr/inet/cidr             Byte array

Additional HAWQ-to-Parquet Mapping

point:

group {
    required int x;
    required int y;
}

circle:

group {
    required int x;
    required int y;
    required int r;
}

box:

group {
    required int x1;
    required int y1;
    required int x2;
    required int y2;
}

lseg:

group {
    required int x1;
    required int y1;
    required int x2;
    required int y2;
}

path:

group {
    repeated group {
        required int x;
        required int y;
    }
}

Table 2. HIVE to HAWQ Mapping

HIVE Data Type    HAWQ Data Type
boolean           bool
tinyint           int2
smallint          int2/smallint
int               int4/int
bigint            int8/bigint
float             float4
double            float8
string            varchar
binary            bytea
char              char
varchar           varchar

Extracting Metadata

For more information on extracting metadata to a YAML file and the output content of the YAML file, refer to the reference page for hawq extract.