Using Profiles to Read and Write Data

PXF profiles are collections of common metadata attributes that can be used to simplify the reading and writing of data. You can use any of the built-in profiles that come with PXF or you can create your own.

For example, if you are writing single line records to text files on HDFS, you could use the built-in HdfsTextSimple profile. You specify this profile when you create the PXF external table used to write the data to HDFS.
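As a sketch of what that looks like, a writable external table using the HdfsTextSimple profile might be defined as follows. The host name, port, and HDFS path here are placeholders; substitute the values for your PXF installation:

```sql
-- Sketch: write comma-delimited records to an HDFS directory via the
-- HdfsTextSimple profile. "namenode", the port 51200, and the
-- /data/pxf/sales path are example values only.
CREATE WRITABLE EXTERNAL TABLE sales_out (id int, item text, total numeric)
LOCATION ('pxf://namenode:51200/data/pxf/sales?PROFILE=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER ',');
```

Rows inserted into `sales_out` are then written as delimited text lines to files under the named HDFS directory.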

Built-In Profiles

PXF comes with a number of built-in profiles that group together a collection of metadata attributes. PXF built-in profiles simplify access to the following types of data storage systems:

  • HDFS File Data (Read + Write)
  • Hive (Read only)
  • HBase (Read only)
  • JSON (Read only)

You can specify a built-in profile when you want to read data that exists inside HDFS files, Hive tables, HBase tables, or JSON files, and when you want to write data into HDFS files.

Each built-in profile is described below, together with the Fragmenter, Accessor, Resolver, and (where present) Metadata and OutputFormat plug-in classes that it comprises:
HdfsTextSimple: Read or write delimited, single-line records from or to plain text files on HDFS.
  • org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter
  • org.apache.hawq.pxf.plugins.hdfs.LineBreakAccessor
  • org.apache.hawq.pxf.plugins.hdfs.StringPassResolver
HdfsTextMulti: Read delimited, single- or multi-line records (with quoted linefeeds) from plain text files on HDFS. This profile is not splittable (reads are not parallelized), so reading is slower than with HdfsTextSimple.
  • org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter
  • org.apache.hawq.pxf.plugins.hdfs.QuotedLineBreakAccessor
  • org.apache.hawq.pxf.plugins.hdfs.StringPassResolver
Hive: Read a Hive table with any of the available storage formats: text, RC, ORC, Sequence, or Parquet.
  • org.apache.hawq.pxf.plugins.hive.HiveDataFragmenter
  • org.apache.hawq.pxf.plugins.hive.HiveAccessor
  • org.apache.hawq.pxf.plugins.hive.HiveResolver
  • org.apache.hawq.pxf.plugins.hive.HiveMetadataFetcher
  • org.apache.hawq.pxf.service.io.GPDBWritable
HiveRC: Optimized read of a Hive table where each partition is stored as an RCFile.
Note: The DELIMITER parameter is mandatory.
  • org.apache.hawq.pxf.plugins.hive.HiveInputFormatFragmenter
  • org.apache.hawq.pxf.plugins.hive.HiveRCFileAccessor
  • org.apache.hawq.pxf.plugins.hive.HiveColumnarSerdeResolver
  • org.apache.hawq.pxf.plugins.hive.HiveMetadataFetcher
  • org.apache.hawq.pxf.service.io.Text
HiveORC: Optimized read of a Hive table where each partition is stored as an ORC file.
  • org.apache.hawq.pxf.plugins.hive.HiveInputFormatFragmenter
  • org.apache.hawq.pxf.plugins.hive.HiveORCAccessor
  • org.apache.hawq.pxf.plugins.hive.HiveORCSerdeResolver
  • org.apache.hawq.pxf.plugins.hive.HiveMetadataFetcher
  • org.apache.hawq.pxf.service.io.GPDBWritable
HiveVectorizedORC: Optimized bulk/batch read of a Hive table where each partition is stored as an ORC file.
  • org.apache.hawq.pxf.plugins.hive.HiveInputFormatFragmenter
  • org.apache.hawq.pxf.plugins.hive.HiveORCVectorizedAccessor
  • org.apache.hawq.pxf.plugins.hive.HiveORCVectorizedResolver
  • org.apache.hawq.pxf.plugins.hive.HiveMetadataFetcher
  • org.apache.hawq.pxf.service.io.GPDBWritable
HiveText: Optimized read of a Hive table where each partition is stored as a text file.
Note: The DELIMITER parameter is mandatory.
  • org.apache.hawq.pxf.plugins.hive.HiveInputFormatFragmenter
  • org.apache.hawq.pxf.plugins.hive.HiveLineBreakAccessor
  • org.apache.hawq.pxf.plugins.hive.HiveStringPassResolver
  • org.apache.hawq.pxf.plugins.hive.HiveMetadataFetcher
  • org.apache.hawq.pxf.service.io.Text
HBase: Read data from a table in an HBase data store.
  • org.apache.hawq.pxf.plugins.hbase.HBaseDataFragmenter
  • org.apache.hawq.pxf.plugins.hbase.HBaseAccessor
  • org.apache.hawq.pxf.plugins.hbase.HBaseResolver
Avro: Read Avro files (fileName.avro) from HDFS.
  • org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter
  • org.apache.hawq.pxf.plugins.hdfs.AvroFileAccessor
  • org.apache.hawq.pxf.plugins.hdfs.AvroResolver
JSON: Read JSON files (fileName.json) from HDFS.
  • org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter
  • org.apache.hawq.pxf.plugins.json.JsonAccessor
  • org.apache.hawq.pxf.plugins.json.JsonResolver

Notes: Metadata identifies the Java class that provides field definitions in the relation. OutputFormat identifies the output serialization format (text or binary) for which a specific profile is optimized. While the built-in Hive* profiles provide Metadata and OutputFormat classes, other profiles may have no need to implement or specify these classes.
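The OutputFormat matters when you define the external table: profiles whose OutputFormat is the binary GPDBWritable class are queried through the CUSTOM format with the pxfwritable_import formatter. As an illustration, a readable external table over a Hive table might be sketched as follows; the host name, port, and Hive table name are placeholders for your environment:

```sql
-- Sketch: read the Hive table "default.sales" through the Hive profile.
-- "namenode" and port 51200 are placeholders for your PXF service host.
-- The Hive profile's GPDBWritable (binary) OutputFormat calls for
-- FORMAT 'CUSTOM' with the pxfwritable_import formatter.
CREATE EXTERNAL TABLE hive_sales (id int, item text, total float8)
LOCATION ('pxf://namenode:51200/default.sales?PROFILE=Hive')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```

By contrast, profiles whose OutputFormat is Text (such as HiveText) are typically queried with FORMAT 'TEXT' and an explicit DELIMITER.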

Adding and Updating Profiles

Each profile has a mandatory, unique name and an optional description. In addition, each profile names a set of plug-in classes, an extensible collection of metadata attributes. Administrators can add new profiles or edit the built-in profiles defined in /etc/pxf/conf/pxf-profiles.xml.
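A custom profile entry in pxf-profiles.xml might look like the following sketch. The profile name and the com.example plug-in class names are hypothetical; replace them with the fully qualified names of your own plug-in classes:

```xml
<profiles>
  <profile>
    <name>MyCustomProfile</name>
    <description>Example custom profile; class names below are placeholders.</description>
    <plugins>
      <fragmenter>com.example.pxf.MyFragmenter</fragmenter>
      <accessor>com.example.pxf.MyAccessor</accessor>
      <resolver>com.example.pxf.MyResolver</resolver>
    </plugins>
  </profile>
</profiles>
```

Once the profile is defined and its JARs are on the PXF classpath, you reference it by name in the external table's LOCATION clause, just as with the built-in profiles.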

Note: Add the JAR files associated with custom PXF plug-ins to the /etc/pxf/conf/pxf-public.classpath configuration file.

After you make changes in pxf-profiles.xml (or any other PXF configuration file), propagate the changes to all nodes with PXF installed, and then restart the PXF service on all nodes.