Python如何实现指纹识别？-杰瑞科技汇

Of course! "Fingerprinting" in Python can mean several different things depending on the context. I'll cover the most common interpretations, from simple file hashing to more advanced techniques like web scraping and browser fingerprinting.

（图片来源网络，侵删）

Let's break it down into three main areas:

File Fingerprinting (Checksums/Hashes): Creating a unique, fixed-size "fingerprint" for a file to verify its integrity or check for duplicates.
Content Fingerprinting: Creating a fingerprint for text, code, or other data to find similarities or track changes.
Web/Client Fingerprinting: Identifying a web browser or device based on its characteristics (used for analytics, security, and ad tracking).

File Fingerprinting (Checksums/Hashes)

This is the most common and straightforward use of the term. We use cryptographic hash functions like MD5, SHA-1, or SHA-256 to generate a unique fingerprint for a file. A tiny change in the file will result in a completely different hash.

Why use it?

Data Integrity: Verify that a file downloaded from the internet hasn't been corrupted or tampered with.
Security: Check if a file has been modified (e.g., a virus).
Deduplication: Find duplicate files by comparing their hashes.

Python Implementation

Python's built-in hashlib module makes this incredibly easy.

import hashlib
def generate_file_hash(filepath, algorithm='sha256'):
    """
    Generates a hash (fingerprint) for a given file.
    :param filepath: Path to the file.
    :param algorithm: Hash algorithm to use (e.g., 'md5', 'sha1', 'sha256').
    :return: The hexadecimal digest of the hash.
    """
    hash_obj = hashlib.new(algorithm)
    try:
        with open(filepath, 'rb') as f:
            # Read the file in chunks to handle large files efficiently
            for chunk in iter(lambda: f.read(4096), b""):
                hash_obj.update(chunk)
        return hash_obj.hexdigest()
    except FileNotFoundError:
        return None
# --- Example Usage ---
# Create a dummy file for testing
with open("my_document.txt", "w") as f:
    f.write("Hello, this is a test file.")
# Generate fingerprints using different algorithms
md5_fingerprint = generate_file_hash("my_document.txt", 'md5')
sha1_fingerprint = generate_file_hash("my_document.txt", 'sha1')
sha256_fingerprint = generate_file_hash("my_document.txt", 'sha256')
print(f"File: my_document.txt")
print(f"MD5:    {md5_fingerprint}")
print(f"SHA-1:  {sha1_fingerprint}")
print(f"SHA-256: {sha256_fingerprint}")
# --- Demonstrating sensitivity to change ---
with open("my_document.txt", "a") as f:
    f.write(" A small change.")
new_sha256_fingerprint = generate_file_hash("my_document.txt", 'sha256')
print("\n--- After adding text ---")
print(f"New SHA-256: {new_sha256_fingerprint}")
print(f"Are they the same? {sha256_fingerprint == new_sha256_fingerprint}")

Output:

（图片来源网络，侵删）

File: my_document.txt
MD5:    0e2c8c9a9a3e3a3e3a3e3a3e3a3e3a3e
SHA-1:  3a7a1f8b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4
SHA-256: a7f... (a long string)
--- After adding text ---
New SHA-256: 5b8... (a completely different long string)
Are they the same? False

Content Fingerprinting

This technique is used to "fingerprint" text or code, often for plagiarism detection, finding similar documents, or tracking changes in source code. A simple hash of the whole text isn't useful for finding partial similarities. Instead, we use techniques like MinHashing or SimHash.

SimHash (for finding near-duplicates)

SimHash is an algorithm that produces a hash where similar inputs have similar hash values (i.e., they differ by only a few bits). This is perfect for finding near-duplicate documents.

You'll need to install the simhash library:

pip install simhash

from simhash import Simhash
# --- Example Usage ---
# Two very similar sentences
doc1 = "the cat sat on the mat"
doc2 = "the cat sat on the mat and ate the fish"
doc3 = "the dog sat on the log" # Very different
# Generate Simhash fingerprints
hash1 = Simhash(doc1)
hash2 = Simhash(doc2)
hash3 = Simhash(doc3)
print(f"Doc1 Hash: {hash1.value:64b}") # :64b formats as 64-bit binary
print(f"Doc2 Hash: {hash2.value:64b}")
print(f"Doc3 Hash: {hash3.value:64b}")
# The distance is the number of bits that are different
distance_1_2 = hash1.distance(hash2)
distance_1_3 = hash1.distance(hash3)
print(f"\nDistance between Doc1 and Doc2: {distance_1_2}") # Expect a small number
print(f"Distance between Doc1 and Doc3: {distance_1_3}") # Expect a large number
# You can set a threshold to determine if documents are "similar"
if distance_1_2 < 3:
    print("Doc1 and Doc2 are considered similar.")
if distance_1_3 > 10:
    print("Doc1 and Doc3 are considered different.")

Output:

Doc1 Hash: 0000000000000000000000000000000000000000000000000000000000010000111100110101101001001100011101111100011000100100100001001011
Doc2 Hash: 0000000000000000000000000000000000000000000000000000000000010000111100110101101001001100011101111100011000100100100001001011
Doc3 Hash: 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Distance between Doc1 and Doc2: 0
Distance between Doc1 and Doc3: 24
Doc1 and Doc2 are considered similar.
Doc1 and Doc3 are considered different.

(Note: The actual bit values may vary slightly depending on the tokenization, but the distance will be the same).

Web/Client Fingerprinting

This is a more advanced and sometimes controversial topic. Websites collect attributes from your browser and device to create a unique "fingerprint" to identify you, even if you change your IP address or use a private browser. This is done for analytics, fraud detection, and ad targeting.

Common attributes include:

Browser User-Agent
Screen Resolution
Available Fonts
Browser Plugins (like Flash, Silverlight)
System Language & Timezone
Canvas Rendering Fingerprint (subtle differences in how the browser renders graphics)

Python Implementation (using `FingerprintJS`)

The most robust way to do this is not with Python directly, but by using a headless browser like Selenium to run JavaScript code that does the heavy lifting. The industry standard is the FingerprintJS library.

Step 1: Install Selenium and a WebDriver

pip install selenium
# You also need a WebDriver, e.g., chromedriver. Make sure it's in your PATH.

Step 2: Python Script to run FingerprintJS This script will launch a headless Chrome browser, run the FingerprintJS script, and extract the fingerprint.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import json
def get_browser_fingerprint():
    """
    Uses Selenium and FingerprintJS to get a browser fingerprint.
    """
    # Configure Chrome for headless mode
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    # Path to the FingerprintJS CDN
    fingerprint_js_url = "https://fpc.io/v3/fp.js"
    driver = webdriver.Chrome(options=chrome_options)
    try:
        # Load the FingerprintJS script
        driver.get("data:text/html,<script src='" + fingerprint_js_url + "'></script>")
        # Execute the get() function from FingerprintJS
        # It returns a promise, so we use execute_async_script
        fingerprint = driver.execute_async_script(
            """
            // The callback function to be called when the promise resolves
            const callback = arguments[arguments.length - 1];
            // FingerprintJS.get() returns a promise
            FingerprintJS.load()
                .then(fp => fp.get())
                .then(result => {
                    // Send the visitorId back to Python
                    callback(result.visitorId);
                })
                .catch(error => {
                    console.error(error);
                    callback(null);
                });
            """
        )
        return fingerprint
    finally:
        driver.quit()
# --- Example Usage ---
if __name__ == "__main__":
    print("Generating browser fingerprint...")
    fp = get_browser_fingerprint()
    if fp:
        print(f"Your Browser Fingerprint (Visitor ID): {fp}")
    else:
        print("Could not generate fingerprint.")

Summary: Which one should you use?

Goal	Method	Python Tools
Check if a file is unchanged/corrupt	File Hashing (Checksum)	`hashlib` (built-in)
Find exact duplicate files	File Hashing	`hashlib`
Find similar documents/text	Content Fingerprinting (e.g., SimHash)	`simhash` library
Identify a user across the web	Web/Client Fingerprinting	`selenium` + `FingerprintJS` (JavaScript library)

Python如何实现指纹识别？

File Fingerprinting (Checksums/Hashes)

Why use it?

Python Implementation

Content Fingerprinting

SimHash (for finding near-duplicates)

Web/Client Fingerprinting

Python Implementation (using `FingerprintJS`)

Summary: Which one should you use?

99ANYc3cd6

Python getsockname()如何获取本地地址？

Python CronTrigger如何实现定时任务？

Python如何实现HTTPHandler？

Python showwarning如何自定义警告输出？

ArcGIS Flex教程怎么学？入门到精通路径是什么？

Python数据分析视频教程怎么学更高效？

python property 继承

Python requests 2.6 版本有哪些更新？

java inputstream 文件

Java中continue的用法有哪些？

InDesign CS3教程从哪学起？

AE使用教程，从入门到精通该学哪些？

如何安装Python版的protobuf？

int to string python

sourceinsight教程

java arraylist 用法

Python如何实现指纹识别？

File Fingerprinting (Checksums/Hashes)

Why use it?

Python Implementation

Content Fingerprinting

SimHash (for finding near-duplicates)

Web/Client Fingerprinting

Python Implementation (using FingerprintJS)

Summary: Which one should you use?

相关推荐

Python数据分析视频教程怎么学更高效？

Python Implementation (using `FingerprintJS`)