Of course! Performing a scan in HBase using Java is a fundamental operation. Here’s a comprehensive guide covering the basics, advanced options, best practices, and complete code examples.

The Core Concept: Scan vs. Get
In HBase, there are two primary ways to read data:
- Get: Retrieves a single row by its unique row key. It's like a primary key lookup in a relational database.
- Scan: Retrieves one or more rows that fall within a specified range. It's like a SELECT query with a WHERE clause on the row key range. You can also specify filters to narrow down results based on column family, column qualifier, or cell values.
This guide focuses on the Scan operation.
Prerequisites
Before you write the code, you need to set up your project.
1. Maven Dependency
Add the HBase Client library to your pom.xml:

<dependencies>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>2.4.11</version> <!-- Use a version compatible with your HBase cluster -->
</dependency>
<!-- HBase also depends on Zookeeper, but it's usually a transitive dependency -->
</dependencies>
2. HBase Configuration
Your Java application needs to know how to connect to your HBase cluster. This is done via a Configuration object.
You have two main options:
- Programmatic Configuration: Hardcoding the configuration in your code. Good for simple tests.
- hbase-site.xml: Placing the configuration file on the classpath. This is the recommended approach for production, as it separates configuration from code.
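For the programmatic option, a minimal sketch (the ZooKeeper hostnames are placeholders; substitute your own):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Programmatic equivalent of hbase-site.xml -- fine for quick tests,
// but prefer the classpath file in production.
Configuration config = HBaseConfiguration.create();
config.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");
config.set("hbase.zookeeper.property.clientPort", "2181");
```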
Example hbase-site.xml (place this file in your project's src/main/resources directory):
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hbase.zookeeper.quorum</name>
<value>zk1.example.com,zk2.example.com,zk3.example.com</value>
<description>The directory shared by RegionServers.
</description>
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2181</value>
</property>
</configuration>
Basic Scan Example
This example demonstrates how to scan an entire table.

Step 1: Create a Connection and Table Object
It's crucial to manage connections and resources properly. Use try-with-resources to ensure they are closed automatically.
Step 2: Create a Scan Object
Instantiate a Scan object. By default, a scan has no start or stop row, meaning it will scan the entire table.
Step 3: Execute the Scan
Use the table.getScanner(scan) method to get a ResultScanner. This is an iterator-like object that you can loop over to retrieve all matching rows.
Step 4: Process the Results
Each call to scanner.next() returns a Result object, which represents a single row. You can extract data from the Result object.
Complete Code: BasicScan.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;

public class BasicScan {
    public static void main(String[] args) throws IOException {
        // 1. Create a configuration object from hbase-site.xml on the classpath
        Configuration config = HBaseConfiguration.create();

        // Use try-with-resources to manage the connection and table
        try (Connection connection = ConnectionFactory.createConnection(config);
             Table table = connection.getTable(TableName.valueOf("my_table"))) {

            // 2. Create a Scan object for the entire table
            Scan scan = new Scan();
            System.out.println("Starting scan of table: 'my_table'...");

            // 3. Get a scanner
            try (ResultScanner scanner = table.getScanner(scan)) {
                // 4. Loop through the scanner results
                for (Result result : scanner) {
                    // Print the row key
                    System.out.println("RowKey: " + Bytes.toString(result.getRow()));
                    // Print all columns and their values
                    result.listCells().forEach(cell -> {
                        String family = Bytes.toString(CellUtil.cloneFamily(cell));
                        String qualifier = Bytes.toString(CellUtil.cloneQualifier(cell));
                        String value = Bytes.toString(CellUtil.cloneValue(cell));
                        System.out.println("  -> CF: " + family + ", Qualifier: " + qualifier + ", Value: " + value);
                    });
                    System.out.println("-------------------------------------------");
                }
            }
            System.out.println("Scan complete.");
        }
    }
}
Advanced Scan Options
A basic scan is often too slow or returns too much data. HBase provides powerful options to make scans efficient.
1. Limiting the Scan with Start/Stop Rows
You can specify a range of rows to scan. The scan will include rows from the start row up to (but not including) the stop row.
// Scan rows with rowkeys from 'row_100' up to (but not including) 'row_200'
Scan scan = new Scan()
.withStartRow(Bytes.toBytes("row_100"))
.withStopRow(Bytes.toBytes("row_200"));
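One caveat worth illustrating: row keys are compared lexicographically as bytes, not numerically. The sketch below uses plain String comparison, which matches HBase's unsigned byte comparison for ASCII keys:

```java
public class LexOrderDemo {
    public static void main(String[] args) {
        String start = "row_100", stop = "row_200";
        // Numerically 1500 > 200, but lexicographically "row_1500" < "row_200",
        // so this key IS included in the [row_100, row_200) scan range above.
        String key = "row_1500";
        boolean inRange = key.compareTo(start) >= 0 && key.compareTo(stop) < 0;
        System.out.println(inRange); // prints "true"
    }
}
```

This is why numeric key components are usually zero-padded to a fixed width, so that lexicographic order and numeric order agree.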
2. Limiting Columns (Column Families and Qualifiers)
Instead of fetching all columns, you can specify which ones you need. This dramatically reduces the amount of data transferred over the network.
// Scan only the 'cf1' column family
Scan scan = new Scan().addFamily(Bytes.toBytes("cf1"));
// Scan only the 'name' column (qualifier) within the 'cf1' family
Scan scan = new Scan().addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("name"));
// Scan multiple specific columns
Scan scan = new Scan()
.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("name"))
.addColumn(Bytes.toBytes("cf2"), Bytes.toBytes("email"));
3. Limiting the Number of Versions
By default, HBase returns only the latest version of each cell. If the column family is configured to keep multiple versions, you can fetch older ones as well (HBase 2.0+ builder style):
// Fetch up to 3 versions of each cell
Scan scan = new Scan().readVersions(3);
// Or fetch every stored version
Scan allVersionsScan = new Scan().readAllVersions();
4. Using Filters (Powerful Row Selection)
Filters are the most powerful way to narrow down your results. They allow you to apply server-side logic to skip rows or cells that you don't need.
Example: Filter by Column Value
This example scans for all rows where the cf1:name column's value contains "John Doe".
import org.apache.hadoop.hbase.CompareOperator;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.filter.SubstringComparator;
// ...
// Create a filter
SingleColumnValueFilter filter = new SingleColumnValueFilter(
Bytes.toBytes("cf1"), // Column Family
Bytes.toBytes("name"), // Column Qualifier
CompareOperator.EQUAL, // Comparison Operator
new SubstringComparator("John Doe") // Comparator (use BinaryComparator for exact match)
);
// Set the filter to ignore rows that don't have the column
filter.setFilterIfMissing(true);
Scan scan = new Scan();
scan.setFilter(filter);
// Then execute the scan as shown in the basic example...
Other Common Filters:
- PageFilter: For pagination.
- PrefixFilter: To scan rows with a specific rowkey prefix.
- FamilyFilter / QualifierFilter: To filter based on column family/qualifier names.
- ColumnRangeFilter: To filter columns within a range.
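Note that a PrefixFilter on its own still makes the scan start from the beginning of the table, so prefix scans are usually paired with matching start/stop rows (HBase's Scan.setRowPrefixFilter(byte[]) does this for you). The stop-row computation itself is simple; the helper name prefixStopRow below is ours, not an HBase API:

```java
import java.util.Arrays;

public class PrefixStopRow {
    // Compute the smallest row key that sorts after every key with the given
    // prefix: increment the last non-0xFF byte and truncate after it.
    static byte[] prefixStopRow(byte[] prefix) {
        byte[] stop = Arrays.copyOf(prefix, prefix.length);
        for (int i = stop.length - 1; i >= 0; i--) {
            if (stop[i] != (byte) 0xFF) {
                stop[i]++;
                return Arrays.copyOf(stop, i + 1);
            }
        }
        return new byte[0]; // all 0xFF: scan to the end of the table
    }

    public static void main(String[] args) {
        byte[] stop = prefixStopRow("user_".getBytes());
        // '_' (0x5F) incremented becomes '`' (0x60)
        System.out.println(new String(stop)); // prints "user`"
    }
}
```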
Best Practices
- Close Resources! Always close Connection, Table, ResultScanner, and RegionLocator objects. Use try-with-resources to prevent resource leaks.
- Be Specific: Always specify the column families and columns you need. Never run a bare new Scan() on a large table in production without column restrictions.
- Use RowKey Design: The most efficient scans are those that leverage the sorted nature of rowkeys. Design your rowkeys to enable range scans (e.g., user_id_timestamp).
- Batch Results: For very large result sets, collecting every Result in memory can consume a lot of it. Process results as you iterate:
int batchSize = 100;
int count = 0;
for (Result result : scanner) {
    // process result
    if (++count % batchSize == 0) {
        // Do something with the batch, or just log progress
        System.out.println("Processed " + count + " rows...");
    }
}
- Caching: The ResultScanner fetches results from the RegionServer in batches. You can control how many rows are fetched per RPC with setCaching(). A higher value reduces RPC calls but uses more memory.
scan.setCaching(500); // Fetch 500 rows per RPC call
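The rowkey-design advice can be made concrete. One common scheme for "newest entries first per user" appends a fixed-width reversed timestamp to the user id; the method name and format below are illustrative, not an HBase API:

```java
public class RowKeyDesign {
    // Key format: <userId>_<Long.MAX_VALUE - timestamp>, zero-padded to 19
    // digits so lexicographic order matches numeric order; newer rows sort first.
    static String rowKey(String userId, long timestampMillis) {
        return String.format("%s_%019d", userId, Long.MAX_VALUE - timestampMillis);
    }

    public static void main(String[] args) {
        String older = rowKey("user42", 1_000L);
        String newer = rowKey("user42", 2_000L);
        // For the same user, the newer event sorts before the older one,
        // so a scan starting at "user42_" returns newest rows first.
        System.out.println(newer.compareTo(older) < 0); // prints "true"
    }
}
```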
Complete Advanced Example
This example combines several advanced features: a row range, column selection, and a filter.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.CompareOperator;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;

public class AdvancedScanExample {
    public static void main(String[] args) throws IOException {
        Configuration config = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(config);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // 1. Define the scan range
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("user_100")) // Start from 'user_100'
                    .withStopRow(Bytes.toBytes("user_200")); // Stop before 'user_200'

            // 2. Specify columns to retrieve
            scan.addFamily(Bytes.toBytes("profile"));
            scan.addColumn(Bytes.toBytes("contact"), Bytes.toBytes("email"));

            // 3. Add a server-side filter
            // Note: Bytes.toBytes(30) produces a 4-byte big-endian int, so this
            // only matches rows whose 'age' was written with Bytes.toBytes(int).
            SingleColumnValueFilter ageFilter = new SingleColumnValueFilter(
                    Bytes.toBytes("profile"),
                    Bytes.toBytes("age"),
                    CompareOperator.GREATER_OR_EQUAL,
                    Bytes.toBytes(30) // Age must be 30 or older
            );
            ageFilter.setFilterIfMissing(true); // Skip rows without the 'age' column
            scan.setFilter(ageFilter);

            // 4. Set caching for efficiency
            scan.setCaching(100);

            System.out.println("Starting advanced scan...");
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    String rowKey = Bytes.toString(result.getRow());
                    byte[] nameValue = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("name"));
                    byte[] emailValue = result.getValue(Bytes.toBytes("contact"), Bytes.toBytes("email"));
                    System.out.println("Found User: " + rowKey);
                    System.out.println("  Name: " + (nameValue != null ? Bytes.toString(nameValue) : "N/A"));
                    System.out.println("  Email: " + (emailValue != null ? Bytes.toString(emailValue) : "N/A"));
                    System.out.println("---");
                }
            }
            System.out.println("Advanced scan complete.");
        }
    }
}
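One caveat on the age filter above: Bytes.toBytes(30) encodes the value as a 4-byte big-endian integer, so the comparison only behaves as expected if 'age' was stored the same way, not, say, as the string "30". A pure-JDK sketch of that encoding (equivalent to HBase's Bytes.toBytes(int) for non-negative values):

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class IntEncodingDemo {
    public static void main(String[] args) {
        // Big-endian 4-byte encoding, matching what Bytes.toBytes(int) produces
        byte[] encoded = ByteBuffer.allocate(4).putInt(30).array();
        System.out.println(Arrays.toString(encoded)); // prints "[0, 0, 0, 30]"

        // The string encoding of the same number is entirely different bytes,
        // so a filter built with one encoding never matches the other.
        byte[] asString = "30".getBytes();
        System.out.println(Arrays.equals(encoded, asString)); // prints "false"
    }
}
```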