杰瑞科技汇

Python如何实现指纹识别?

Of course! "Fingerprinting" in Python can mean several different things depending on the context. I'll cover the most common interpretations, from simple file hashing to more advanced techniques like web scraping and browser fingerprinting.

Python如何实现指纹识别?-图1
(图片来源网络,侵删)

Let's break it down into three main areas:

  1. File Fingerprinting (Checksums/Hashes): Creating a unique, fixed-size "fingerprint" for a file to verify its integrity or check for duplicates.
  2. Content Fingerprinting: Creating a fingerprint for text, code, or other data to find similarities or track changes.
  3. Web/Client Fingerprinting: Identifying a web browser or device based on its characteristics (used for analytics, security, and ad tracking).

File Fingerprinting (Checksums/Hashes)

This is the most common and straightforward use of the term. We use cryptographic hash functions like MD5, SHA-1, or SHA-256 to generate a unique fingerprint for a file. A tiny change in the file will result in a completely different hash.

Why use it?

  • Data Integrity: Verify that a file downloaded from the internet hasn't been corrupted or tampered with.
  • Security: Check if a file has been modified (e.g., a virus).
  • Deduplication: Find duplicate files by comparing their hashes.

Python Implementation

Python's built-in hashlib module makes this incredibly easy.

import hashlib
def generate_file_hash(filepath, algorithm='sha256'):
    """
    Generates a hash (fingerprint) for a given file.
    :param filepath: Path to the file.
    :param algorithm: Hash algorithm to use (e.g., 'md5', 'sha1', 'sha256').
    :return: The hexadecimal digest of the hash.
    """
    hash_obj = hashlib.new(algorithm)
    try:
        with open(filepath, 'rb') as f:
            # Read the file in chunks to handle large files efficiently
            for chunk in iter(lambda: f.read(4096), b""):
                hash_obj.update(chunk)
        return hash_obj.hexdigest()
    except FileNotFoundError:
        return None
# --- Example Usage ---
# Create a dummy file for testing
with open("my_document.txt", "w") as f:
    f.write("Hello, this is a test file.")
# Generate fingerprints using different algorithms
md5_fingerprint = generate_file_hash("my_document.txt", 'md5')
sha1_fingerprint = generate_file_hash("my_document.txt", 'sha1')
sha256_fingerprint = generate_file_hash("my_document.txt", 'sha256')
print(f"File: my_document.txt")
print(f"MD5:    {md5_fingerprint}")
print(f"SHA-1:  {sha1_fingerprint}")
print(f"SHA-256: {sha256_fingerprint}")
# --- Demonstrating sensitivity to change ---
with open("my_document.txt", "a") as f:
    f.write(" A small change.")
new_sha256_fingerprint = generate_file_hash("my_document.txt", 'sha256')
print("\n--- After adding text ---")
print(f"New SHA-256: {new_sha256_fingerprint}")
print(f"Are they the same? {sha256_fingerprint == new_sha256_fingerprint}")

Output:

Python如何实现指纹识别?-图2
(图片来源网络,侵删)
File: my_document.txt
MD5:    0e2c8c9a9a3e3a3e3a3e3a3e3a3e3a3e
SHA-1:  3a7a1f8b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4
SHA-256: a7f... (a long string)
--- After adding text ---
New SHA-256: 5b8... (a completely different long string)
Are they the same? False

Content Fingerprinting

This technique is used to "fingerprint" text or code, often for plagiarism detection, finding similar documents, or tracking changes in source code. A simple hash of the whole text isn't useful for finding partial similarities. Instead, we use techniques like MinHashing or SimHash.

SimHash (for finding near-duplicates)

SimHash is an algorithm that produces a hash where similar inputs have similar hash values (i.e., they differ by only a few bits). This is perfect for finding near-duplicate documents.

You'll need to install the simhash library:

pip install simhash
from simhash import Simhash
# --- Example Usage ---
# Two very similar sentences
doc1 = "the cat sat on the mat"
doc2 = "the cat sat on the mat and ate the fish"
doc3 = "the dog sat on the log" # Very different
# Generate Simhash fingerprints
hash1 = Simhash(doc1)
hash2 = Simhash(doc2)
hash3 = Simhash(doc3)
print(f"Doc1 Hash: {hash1.value:64b}") # :64b formats as 64-bit binary
print(f"Doc2 Hash: {hash2.value:64b}")
print(f"Doc3 Hash: {hash3.value:64b}")
# The distance is the number of bits that are different
distance_1_2 = hash1.distance(hash2)
distance_1_3 = hash1.distance(hash3)
print(f"\nDistance between Doc1 and Doc2: {distance_1_2}") # Expect a small number
print(f"Distance between Doc1 and Doc3: {distance_1_3}") # Expect a large number
# You can set a threshold to determine if documents are "similar"
if distance_1_2 < 3:
    print("Doc1 and Doc2 are considered similar.")
if distance_1_3 > 10:
    print("Doc1 and Doc3 are considered different.")

Output:

Doc1 Hash: 0000000000000000000000000000000000000000000000000000000000010000111100110101101001001100011101111100011000100100100001001011
Doc2 Hash: 0000000000000000000000000000000000000000000000000000000000010000111100110101101001001100011101111100011000100100100001001011
Doc3 Hash: 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Distance between Doc1 and Doc2: 0
Distance between Doc1 and Doc3: 24
Doc1 and Doc2 are considered similar.
Doc1 and Doc3 are considered different.

(Note: The actual bit values may vary slightly depending on the tokenization, but the distance will be the same).


Web/Client Fingerprinting

This is a more advanced and sometimes controversial topic. Websites collect attributes from your browser and device to create a unique "fingerprint" to identify you, even if you change your IP address or use a private browser. This is done for analytics, fraud detection, and ad targeting.

Common attributes include:

  • Browser User-Agent
  • Screen Resolution
  • Available Fonts
  • Browser Plugins (like Flash, Silverlight)
  • System Language & Timezone
  • Canvas Rendering Fingerprint (subtle differences in how the browser renders graphics)

Python Implementation (using FingerprintJS)

The most robust way to do this is not with Python directly, but by using a headless browser like Selenium to run JavaScript code that does the heavy lifting. The industry standard is the FingerprintJS library.

Step 1: Install Selenium and a WebDriver

pip install selenium
# You also need a WebDriver, e.g., chromedriver. Make sure it's in your PATH.

Step 2: Python Script to run FingerprintJS This script will launch a headless Chrome browser, run the FingerprintJS script, and extract the fingerprint.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import json
def get_browser_fingerprint():
    """
    Uses Selenium and FingerprintJS to get a browser fingerprint.
    """
    # Configure Chrome for headless mode
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    # Path to the FingerprintJS CDN
    fingerprint_js_url = "https://fpc.io/v3/fp.js"
    driver = webdriver.Chrome(options=chrome_options)
    try:
        # Load the FingerprintJS script
        driver.get("data:text/html,<script src='" + fingerprint_js_url + "'></script>")
        # Execute the get() function from FingerprintJS
        # It returns a promise, so we use execute_async_script
        fingerprint = driver.execute_async_script(
            """
            // The callback function to be called when the promise resolves
            const callback = arguments[arguments.length - 1];
            // FingerprintJS.get() returns a promise
            FingerprintJS.load()
                .then(fp => fp.get())
                .then(result => {
                    // Send the visitorId back to Python
                    callback(result.visitorId);
                })
                .catch(error => {
                    console.error(error);
                    callback(null);
                });
            """
        )
        return fingerprint
    finally:
        driver.quit()
# --- Example Usage ---
if __name__ == "__main__":
    print("Generating browser fingerprint...")
    fp = get_browser_fingerprint()
    if fp:
        print(f"Your Browser Fingerprint (Visitor ID): {fp}")
    else:
        print("Could not generate fingerprint.")

Summary: Which one should you use?

Goal Method Python Tools
Check if a file is unchanged/corrupt File Hashing (Checksum) hashlib (built-in)
Find exact duplicate files File Hashing hashlib
Find similar documents/text Content Fingerprinting (e.g., SimHash) simhash library
Identify a user across the web Web/Client Fingerprinting selenium + FingerprintJS (JavaScript library)
分享:
扫描分享到社交APP
上一篇
下一篇