サクサク読めて、アプリ限定の機能も多数!
トップへ戻る
アメリカ大統領選
research.neustar.biz
We are always worried about the size of our sketches. At AK we like to count stuff, and if you can count stuff with smaller sketches then you can count more stuff! We constantly have conversations about how we might be able to make our sketches, and subsequently our datastores, smaller. During our science summit, Muthu pointed us at some of the new work in Frugal Streaming. The concept of Frugal S
Standard data warehouse table setup. About a dozen numeric columns per table. The largest table (impressions) is 100x larger than the next biggest table (clicks), and 5000x larger than the smallest table (conversions). All tables sorted by date and by user id (SORTKEY), distributed by user id (DISTKEY). Two types of queries. (pardon the pseudo-SQL, see Appendix for full queries) SELECT keys, aggre
Introduction Here at AK, we’re in the business of storing huge amounts of information in the form of 64 bit keys. As shown in other blog posts and in the HLL post by Matt, one efficient way of getting an estimate of the size of the set of these keys is by using the HyperLogLog (HLL) algorithm. There are two important decisions one has to make when implementing this algorithm. The first is how ma
Sketch of the Day: HyperLogLog — Cornerstone of a Big Data Infrastructure Intro In the Zipfian world of AK, the HyperLogLog distinct value (DV) sketch reigns supreme. This DV sketch is the workhorse behind the majority of our DV counters (and we’re not alone) and enables us to have a real time, in memory data store with incredibly high throughput. HLL was conceived of by Flajolet et. al. in the ph
We’re happy to announce the first open-source release of AK’s PostgreSQL extension for building and manipulating HyperLogLog data structures in SQL, postgresql-hll. We are releasing this code under the Apache License, Version 2.0 which we feel is an excellent balance between permissive usage and liability limitation. What is it and what can I do with it? The extension introduces a new data type, h
Matt Abrams recently pointed me to Google’s excellent paper “HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm” [UPDATE: changed the link to the paper version without typos] and I thought I’d share my take on it and explain a few points that I had trouble getting through the first time. The paper offers a few interesting improvements that are w
このページを最初にブックマークしてみませんか?
『http://blog.aggregateknowledge.com/』の新着エントリーを見る
j次のブックマーク
k前のブックマーク
lあとで読む
eコメント一覧を開く
oページを開く