Sharp Tools for Massive Data: Divide and Conquer, Hashing, Bitmaps, Bloom Filters, and Heaps

2022-12-31 15:47:15

Taming massive data: divide and conquer, hashing, bitmaps, Bloom filters, and heaps as your toolkit

Preface

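In this digital age, massive datasets are everywhere, and they pose real challenges for IT practitioners. How do you process, store, and analyze data at this scale? Don't worry: a handful of well-understood algorithms and data structures cover most of these problems, and this article walks through them one by one.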

Divide and conquer: carve the ox along its joints

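Divide and conquer is usually the first tool to reach for when a dataset is too large to handle in one piece. It breaks a complex problem into several smaller, easier subproblems, solves each one, and then combines the partial solutions into a solution to the original problem. Quicksort and merge sort are the classic examples, and both are widely used for sorting large datasets.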

Code example: quicksort

public static void quickSort(int[] arr, int low, int high) {
    if (low < high) {
        int partitionIndex = partition(arr, low, high);

        quickSort(arr, low, partitionIndex - 1);
        quickSort(arr, partitionIndex + 1, high);
    }
}

private static int partition(int[] arr, int low, int high) {
    int pivot = arr[high];
    int i = (low - 1);

    for (int j = low; j <= high - 1; j++) {
        if (arr[j] < pivot) {
            i++;

            int temp = arr[i];
            arr[i] = arr[j];
            arr[j] = temp;
        }
    }

    int temp = arr[i + 1];
    arr[i + 1] = arr[high];
    arr[high] = temp;

    return (i + 1);
}
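
Merge sort, the other sort named in this section, makes the divide, solve, and combine steps especially visible. The listing below is a minimal top-down sketch rather than a tuned implementation; the class name MergeSortSketch and its method names are illustrative and do not come from the original article.

```java
import java.util.Arrays;

// A minimal top-down merge sort sketch; all names here are illustrative.
final class MergeSortSketch {

    // Sorts arr in ascending order.
    static void mergeSort(int[] arr) {
        if (arr.length < 2) {
            return; // zero or one element is already sorted
        }
        int[] buffer = new int[arr.length]; // scratch space reused by every merge
        sort(arr, buffer, 0, arr.length - 1);
    }

    // Divide: split [low, high] in half, sort each half, then merge them.
    private static void sort(int[] arr, int[] buffer, int low, int high) {
        if (low >= high) {
            return;
        }
        int mid = low + (high - low) / 2; // avoids overflow of (low + high)
        sort(arr, buffer, low, mid);
        sort(arr, buffer, mid + 1, high);
        merge(arr, buffer, low, mid, high);
    }

    // Combine: merge the sorted runs [low, mid] and [mid + 1, high].
    private static void merge(int[] arr, int[] buffer, int low, int mid, int high) {
        System.arraycopy(arr, low, buffer, low, high - low + 1);
        int left = low;
        int right = mid + 1;
        for (int k = low; k <= high; k++) {
            if (left > mid) {
                arr[k] = buffer[right++];      // left run exhausted
            } else if (right > high) {
                arr[k] = buffer[left++];       // right run exhausted
            } else if (buffer[right] < buffer[left]) {
                arr[k] = buffer[right++];
            } else {
                arr[k] = buffer[left++];       // ties take the left element, so the sort is stable
            }
        }
    }

    public static void main(String[] args) {
        int[] data = {5, 2, 9, 1, 5, 6};
        mergeSort(data);
        System.out.println(Arrays.toString(data));
    }
}
```

Because the merge step only reads two sorted runs from front to back, the same idea carries over to data that does not fit in memory: sort chunks separately, then merge the sorted chunks.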

Hashing: turning the complex into the simple

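Hashing is something of a magician in the world of data storage. A hash function maps data of any size to a fixed-length output called the hash value. The key properties are that equal inputs always produce the same hash value, and that a good hash function spreads different inputs evenly across the output range so that collisions are rare. Together these make lookups very efficient: the hash value tells you directly where to look. Hashing is indispensable in large-scale storage, for example in database indexes and caching systems.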

Code example: hash table

public class MyHashMap<K, V> {
    private Entry<K, V>[] table;
    private int size;

    public MyHashMap() {
        this(16);
    }

    public MyHashMap(int capacity) {
        table = new Entry[capacity];
        size = 0;
    }

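    public void put(K key, V value) {
        int index = hash(key);

        // Walk the bucket's chain: overwrite the value if the key already exists,
        // otherwise append a new entry at the tail.
        Entry<K, V> current = table[index];
        if (current == null) {
            table[index] = new Entry<>(key, value);
            size++;
            return;
        }
        while (true) {
            if (current.key.equals(key)) {
                current.value = value; // existing key: update in place, size unchanged
                return;
            }
            if (current.next == null) {
                break;
            }
            current = current.next;
        }
        current.next = new Entry<>(key, value);
        size++;
    }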

    public V get(K key) {
        int index = hash(key);

        if (table[index] == null) {
            return null;
        } else {
            Entry<K, V> current = table[index];

            while (current != null) {
                if (current.key.equals(key)) {
                    return current.value;
                }

                current = current.next;
            }
        }

        return null;
    }

    private int hash(K key) {
        // floorMod keeps the index non-negative even when hashCode() is negative.
        return Math.floorMod(key.hashCode(), table.length);
    }

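    // static: an entry needs no reference to the enclosing map, and its own
    // K and V no longer shadow the map's type parameters.
    private static class Entry<K, V> {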
        K key;
        V value;
        Entry<K, V> next;

        public Entry(K key, V value) {
            this.key = key;
            this.value = value;
            this.next = null;
        }
    }
}
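
Because equal keys always hash to the same value, hashing can also split a dataset into independent partitions, which ties it back to divide and conquer. The sketch below counts how often each key occurs by first routing keys into buckets and then counting every bucket on its own. It keeps the buckets in memory for brevity; with truly massive data each bucket would typically be a separate file. The class HashPartitionSketch and its methods are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hash partitioning sketch; names are illustrative, buckets are in-memory lists
// standing in for what would usually be files on disk.
final class HashPartitionSketch {

    // Assigns a key to one of numBuckets partitions; floorMod keeps the index non-negative.
    static int bucketOf(String key, int numBuckets) {
        return Math.floorMod(key.hashCode(), numBuckets);
    }

    // Splits the keys into buckets, then counts each bucket independently.
    static Map<String, Integer> countByPartition(List<String> keys, int numBuckets) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numBuckets; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : keys) {
            buckets.get(bucketOf(key, numBuckets)).add(key);
        }

        Map<String, Integer> total = new HashMap<>();
        for (List<String> bucket : buckets) {
            Map<String, Integer> local = new HashMap<>();
            for (String key : bucket) {
                local.merge(key, 1, Integer::sum);
            }
            // Equal keys always land in the same bucket, so no key appears in two
            // local maps and putAll never overwrites an earlier count.
            total.putAll(local);
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("a", "b", "a", "c", "a", "b");
        System.out.println(countByPartition(keys, 4));
    }
}
```

The design point is that each bucket can be processed without looking at any other bucket, so the work can be done one bucket at a time within a fixed memory budget.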

Bitmaps: black and white, clear at a glance

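A bitmap, also called a bit array, is essentially an array of bits. Each bit stands for one element: 0 means the element is absent, 1 means it is present. The appeal of a bitmap is that it records membership for a huge number of elements in very little space, one bit per possible value, while answering queries in constant time. Bitmaps shine in large-scale statistics, log analysis, genome analysis, and similar fields.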

Code example: bitmap

public class MyBitSet {
    private long[] bits;
    private int size;

    public MyBitSet(int size) {
        this.size = size;
        bits = new long[(size + 63) / 64];
    }

    public void set(int index) {
        int wordIndex = index / 64;
        int bitIndex = index % 64;

        bits[wordIndex] |= (1L << bitIndex);
    }

    public boolean get(int index) {
        int wordIndex = index / 64;
        int bitIndex = index % 64;

        return (bits[wordIndex] & (1L << bitIndex)) != 0;
    }
}
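
A typical use of a bitmap on large inputs is to deduplicate and sort non-negative integers in a single pass. The sketch below uses the JDK's own java.util.BitSet rather than the hand-written class above; the class name BitmapSortSketch and the maxValue parameter are assumptions made for illustration, and the method assumes every input value lies in the range 0 to maxValue, with maxValue below Integer.MAX_VALUE.

```java
import java.util.Arrays;
import java.util.BitSet;

// Bitmap-based dedupe-and-sort sketch; names are illustrative.
final class BitmapSortSketch {

    // Returns the distinct values of input in ascending order.
    // Assumes 0 <= value <= maxValue for every value, and maxValue < Integer.MAX_VALUE.
    static int[] sortDistinct(int[] input, int maxValue) {
        BitSet seen = new BitSet(maxValue + 1);
        for (int value : input) {
            seen.set(value); // setting an already-set bit is a no-op, which removes duplicates
        }

        int[] result = new int[seen.cardinality()];
        int k = 0;
        // nextSetBit walks the set bits in increasing index order, so the output is sorted.
        for (int value = seen.nextSetBit(0); value >= 0; value = seen.nextSetBit(value + 1)) {
            result[k++] = value;
        }
        return result;
    }

    public static void main(String[] args) {
        int[] data = {5, 3, 5, 1, 3, 0};
        System.out.println(Arrays.toString(sortDistinct(data, 10)));
    }
}
```

The memory cost depends on the value range, not on how many numbers are read, which is why this works well when the range is bounded and the input is huge.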

Bloom filters: vast capacity in a small space

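A Bloom filter is a probabilistic data structure for quickly testing whether an element belongs to a set. It uses several hash functions to map each element to several bit positions. If any of those bits is 0, the element is definitely not in the set. If all of them are 1, the element is probably in the set, but not certainly: other elements may have set the same bits, so false positives are possible while false negatives are not. Bloom filters are popular for large-scale filtering, malware detection, web page deduplication, and similar tasks.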

Code example: Bloom filter

public class MyBloomFilter {
    private int[] bits;
    private int size;
    private int numHashFunctions;

    public MyBloomFilter(int size, int numHashFunctions) {
        this.size = size;
        this.numHashFunctions = numHashFunctions;
        bits = new int[(size + 31) / 32];
    }

    public void add(String element) {
        for (int i = 0; i < numHashFunctions; i++) {
            int hashValue = hash(element, i);
            int wordIndex = hashValue / 32;
            int bitIndex = hashValue % 32;

            bits[wordIndex] |= (1 << bitIndex);
        }
    }

    public boolean contains(String element) {
        for (int i = 0; i < numHashFunctions; i++) {
            int hashValue = hash(element, i);
            int wordIndex = hashValue / 32;
            int bitIndex = hashValue % 32;

            if ((bits[wordIndex] & (1 << bitIndex)) == 0) {
                return false;
            }
        }

        return true;
    }

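    private int hash(String element, int index) {
        // Double hashing: step through the bit array with an element-dependent stride
        // so the probes do not cluster on neighbouring bits. floorMod keeps the
        // result non-negative (Math.abs fails for Integer.MIN_VALUE).
        int h1 = element.hashCode();
        int h2 = (h1 >>> 16) | 1; // odd stride derived from the high bits
        return Math.floorMod(h1 + index * h2, size);
    }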
}
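
The constructor above takes the bit count and the number of hash functions as given. A common way to choose them, using the standard Bloom filter sizing formulas, is m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions for n expected elements and a target false-positive rate p. The sketch below simply evaluates those formulas; the class BloomSizingSketch and its method names are hypothetical and not part of the original listing.

```java
// Bloom filter sizing sketch based on the standard formulas; names are illustrative.
final class BloomSizingSketch {

    // Number of bits: m = -n * ln(p) / (ln 2)^2, rounded up.
    static long optimalBits(long expectedItems, double falsePositiveRate) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-expectedItems * Math.log(falsePositiveRate) / (ln2 * ln2));
    }

    // Number of hash functions: k = (m / n) * ln 2, rounded, and at least 1.
    static int optimalHashFunctions(long expectedItems, long bits) {
        long k = Math.round((double) bits / expectedItems * Math.log(2));
        return (int) Math.max(1, k);
    }

    public static void main(String[] args) {
        long n = 1_000_000L;
        double p = 0.01;
        long m = optimalBits(n, p);
        int k = optimalHashFunctions(n, m);
        System.out.println("bits = " + m + ", hash functions = " + k);
    }
}
```

For one million elements and a 1% false-positive target this works out to roughly 9.6 million bits, a little over one megabyte, with 7 hash functions.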

Heaps: level by level, in good order

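A heap is a special tree-shaped data structure that satisfies the heap property. In a max-heap every node's value is greater than or equal to the values of its children; in a min-heap, which is what the code below implements, every node's value is less than or equal to them. The advantage of a heap is that the largest (or smallest) element always sits at the root and can be read in O(1) time, while inserting an element or removing the root takes O(log n). Heaps play a central role in sorting large datasets, priority queues, shortest-path algorithms, and more.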
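
One common use of a heap on large inputs is top-K selection: keep a min-heap of at most K elements and replace its root whenever a larger value arrives. The sketch below uses the JDK's java.util.PriorityQueue, which is a min-heap by default; the class TopKSketch and the method topK are illustrative names, not something defined elsewhere in this article.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

// Top-K selection with a bounded min-heap; names are illustrative.
final class TopKSketch {

    // Returns the k largest values of data in ascending order
    // (or all values, sorted, if data has fewer than k elements).
    static int[] topK(int[] data, int k) {
        if (k <= 0) {
            return new int[0];
        }
        PriorityQueue<Integer> minHeap = new PriorityQueue<>(k);
        for (int value : data) {
            if (minHeap.size() < k) {
                minHeap.offer(value);
            } else if (value > minHeap.peek()) {
                // The root is the smallest of the current top k; evict it.
                minHeap.poll();
                minHeap.offer(value);
            }
        }

        int[] result = new int[minHeap.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = minHeap.poll(); // polling a min-heap yields ascending order
        }
        return result;
    }

    public static void main(String[] args) {
        int[] data = {7, 2, 9, 4, 11, 3, 8};
        System.out.println(Arrays.toString(topK(data, 3)));
    }
}
```

The heap never holds more than K elements, so memory stays at O(K) no matter how long the input is, and each element costs at most O(log K) work.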

Code example: binary heap

public class MyBinaryHeap {
    private int[] heap;
    private int size;

    public MyBinaryHeap(int capacity) {
        heap = new int[capacity + 1];
        size = 0;
    }

    public void insert(int value) {
        if (size == heap.length - 1) {
            expandHeap();
        }

        heap[++size] = value;
        heapifyUp(size);
    }

    public int extractMin() {
        if (size == 0) {
            throw new RuntimeException("Heap is empty!");
        }

        int min = heap[1];
        heap[1] = heap[size];
        size--;
        heapifyDown(1);

        return min;
    }

    private void heapifyUp(int index) {
        int parentIndex = index / 2;

        while (index > 1 && heap[index] < heap[parentIndex]) {
            swap(index, parentIndex);
            index = parentIndex;
            parentIndex = index / 2;
        }
    }

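    private void heapifyDown(int index) {
        // NOTE: the source listing is cut off inside this method. Everything from
        // here on is a conventional min-heap completion, not the author's original code.
        while (2 * index <= size) {
            int leftIndex = 2 * index;
            int rightIndex = 2 * index + 1;
            int smallest = leftIndex;

            // Pick the smaller of the two children (the right child may not exist).
            if (rightIndex <= size && heap[rightIndex] < heap[leftIndex]) {
                smallest = rightIndex;
            }
            if (heap[index] <= heap[smallest]) {
                break; // heap property restored
            }
            swap(index, smallest);
            index = smallest;
        }
    }

    private void swap(int i, int j) {
        int temp = heap[i];
        heap[i] = heap[j];
        heap[j] = temp;
    }

    private void expandHeap() {
        // Double the backing array; slot 0 stays unused because the heap is 1-based.
        heap = java.util.Arrays.copyOf(heap, heap.length * 2);
    }
}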

Kyle
