Using Profiles to Read and Write Data

PXF profiles are collections of common metadata attributes that can be used to simplify the reading and writing of data. You can use any of the built-in profiles that come with PXF or you can create your own.

For example, if you are writing single line records to text files on HDFS, you could use the built-in HdfsTextSimple profile. You specify this profile when you create the PXF external table used to write the data to HDFS.
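As a sketch of what that looks like, a writable external table using the HdfsTextSimple profile might be defined as follows. The host name, port, and HDFS path here are placeholders; substitute the values for your PXF installation:

```sql
-- Sketch: write comma-delimited records to an HDFS directory via the
-- HdfsTextSimple profile. "namenode", the port 51200, and the
-- /data/pxf/sales path are example values only.
CREATE WRITABLE EXTERNAL TABLE sales_out (id int, item text, total numeric)
LOCATION ('pxf://namenode:51200/data/pxf/sales?PROFILE=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER ',');
```

Rows inserted into `sales_out` are then written as delimited text lines to files under the named HDFS directory.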

Built-In Profiles

PXF comes with a number of built-in profiles that group together a collection of metadata attributes. PXF built-in profiles simplify access to the following types of data storage systems:

  • HDFS File Data (Read + Write)
  • Hive (Read only)
  • HBase (Read only)
  • JSON (Read only)

You can specify a built-in profile when you want to read data that exists inside HDFS files, Hive tables, HBase tables, or JSON files, and when you want to write data into HDFS files.

Each built-in profile is described below, together with the Fragmenter, Accessor, Resolver, and (where present) Metadata and OutputFormat plug-in classes that it comprises:
HdfsTextSimple: Read or write delimited, single-line records from or to plain text files on HDFS.
  • org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter
  • org.apache.hawq.pxf.plugins.hdfs.LineBreakAccessor
  • org.apache.hawq.pxf.plugins.hdfs.StringPassResolver
HdfsTextMulti: Read delimited, single- or multi-line records (with quoted linefeeds) from plain text files on HDFS. This profile is not splittable (reads are not parallelized), so reading is slower than with HdfsTextSimple.
  • org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter
  • org.apache.hawq.pxf.plugins.hdfs.QuotedLineBreakAccessor
  • org.apache.hawq.pxf.plugins.hdfs.StringPassResolver
Hive: Read a Hive table with any of the available storage formats: text, RC, ORC, Sequence, or Parquet.
  • org.apache.hawq.pxf.plugins.hive.HiveDataFragmenter
  • org.apache.hawq.pxf.plugins.hive.HiveAccessor
  • org.apache.hawq.pxf.plugins.hive.HiveResolver
  • org.apache.hawq.pxf.plugins.hive.HiveMetadataFetcher
  • org.apache.hawq.pxf.service.io.GPDBWritable
HiveRC: Optimized read of a Hive table where each partition is stored as an RCFile.
Note: The DELIMITER parameter is mandatory.
  • org.apache.hawq.pxf.plugins.hive.HiveInputFormatFragmenter
  • org.apache.hawq.pxf.plugins.hive.HiveRCFileAccessor
  • org.apache.hawq.pxf.plugins.hive.HiveColumnarSerdeResolver
  • org.apache.hawq.pxf.plugins.hive.HiveMetadataFetcher
  • org.apache.hawq.pxf.service.io.Text
HiveORC: Optimized read of a Hive table where each partition is stored as an ORC file.
  • org.apache.hawq.pxf.plugins.hive.HiveInputFormatFragmenter
  • org.apache.hawq.pxf.plugins.hive.HiveORCAccessor
  • org.apache.hawq.pxf.plugins.hive.HiveORCSerdeResolver
  • org.apache.hawq.pxf.plugins.hive.HiveMetadataFetcher
  • org.apache.hawq.pxf.service.io.GPDBWritable
HiveVectorizedORC: Optimized bulk/batch read of a Hive table where each partition is stored as an ORC file.
  • org.apache.hawq.pxf.plugins.hive.HiveInputFormatFragmenter
  • org.apache.hawq.pxf.plugins.hive.HiveORCVectorizedAccessor
  • org.apache.hawq.pxf.plugins.hive.HiveORCVectorizedResolver
  • org.apache.hawq.pxf.plugins.hive.HiveMetadataFetcher
  • org.apache.hawq.pxf.service.io.GPDBWritable
HiveText: Optimized read of a Hive table where each partition is stored as a text file.
Note: The DELIMITER parameter is mandatory.
  • org.apache.hawq.pxf.plugins.hive.HiveInputFormatFragmenter
  • org.apache.hawq.pxf.plugins.hive.HiveLineBreakAccessor
  • org.apache.hawq.pxf.plugins.hive.HiveStringPassResolver
  • org.apache.hawq.pxf.plugins.hive.HiveMetadataFetcher
  • org.apache.hawq.pxf.service.io.Text
HBase: Read data from a table in an HBase data store.
  • org.apache.hawq.pxf.plugins.hbase.HBaseDataFragmenter
  • org.apache.hawq.pxf.plugins.hbase.HBaseAccessor
  • org.apache.hawq.pxf.plugins.hbase.HBaseResolver
Avro: Read Avro files (fileName.avro) from HDFS.
  • org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter
  • org.apache.hawq.pxf.plugins.hdfs.AvroFileAccessor
  • org.apache.hawq.pxf.plugins.hdfs.AvroResolver
JSON: Read JSON files (fileName.json) from HDFS.
  • org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter
  • org.apache.hawq.pxf.plugins.json.JsonAccessor
  • org.apache.hawq.pxf.plugins.json.JsonResolver

Notes: Metadata identifies the Java class that provides field definitions in the relation. OutputFormat identifies the output serialization format (text or binary) for which a specific profile is optimized. While the built-in Hive* profiles provide Metadata and OutputFormat classes, other profiles may have no need to implement or specify these classes.
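The OutputFormat matters when you define the external table: profiles whose OutputFormat is the binary GPDBWritable class are queried through the CUSTOM format with the pxfwritable_import formatter. As an illustration, a readable external table over a Hive table might be sketched as follows; the host name, port, and Hive table name are placeholders for your environment:

```sql
-- Sketch: read the Hive table "default.sales" through the Hive profile.
-- "namenode" and port 51200 are placeholders for your PXF service host.
-- The Hive profile's GPDBWritable (binary) OutputFormat calls for
-- FORMAT 'CUSTOM' with the pxfwritable_import formatter.
CREATE EXTERNAL TABLE hive_sales (id int, item text, total float8)
LOCATION ('pxf://namenode:51200/default.sales?PROFILE=Hive')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```

By contrast, profiles whose OutputFormat is Text (such as HiveText) are typically queried with FORMAT 'TEXT' and an explicit DELIMITER.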

Adding and Updating Profiles

Each profile has a mandatory, unique name and an optional description. In addition, each profile names a set of plug-in classes, an extensible collection of metadata attributes. Administrators can add new profiles or edit the built-in profiles defined in /etc/pxf/conf/pxf-profiles.xml.
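A custom profile entry in pxf-profiles.xml might look like the following sketch. The profile name and the com.example plug-in class names are hypothetical; replace them with the fully qualified names of your own plug-in classes:

```xml
<profiles>
  <profile>
    <name>MyCustomProfile</name>
    <description>Example custom profile; class names below are placeholders.</description>
    <plugins>
      <fragmenter>com.example.pxf.MyFragmenter</fragmenter>
      <accessor>com.example.pxf.MyAccessor</accessor>
      <resolver>com.example.pxf.MyResolver</resolver>
    </plugins>
  </profile>
</profiles>
```

Once the profile is defined and its JARs are on the PXF classpath, you reference it by name in the external table's LOCATION clause, just as with the built-in profiles.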

Note: Add the JAR files associated with custom PXF plug-ins to the /etc/pxf/conf/pxf-public.classpath configuration file.

After you make changes in pxf-profiles.xml (or any other PXF configuration file), propagate the changes to all nodes with PXF installed, and then restart the PXF service on all nodes.