A JavaScript library for handling dictzip-compressed files efficiently: rather than decompressing the whole data blob into memory, it provides an interface for (asynchronous or synchronous) random access to the compressed data. It can therefore handle very large amounts of data, such as local files accessed through the W3C File API. This implementation…
DICTZIP(1)

NAME
       dictzip, dictunzip - compress (or expand) files, allowing random access

SYNOPSIS
       dictzip [options] name
       dictunzip [options] name

DESCRIPTION
       dictzip compresses files using the gzip(1) algorithm (LZ77) in a manner
       which is completely compatible with the gzip file format. An extension
       to the gzip file format (Extra Field, described in section 2.3.1.1 of
       RFC 1952) allows extra data…
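The random-access trick dictzip relies on is easy to illustrate: compress the data in fixed-size chunks that each decompress independently, and keep an index of compressed chunk offsets (dictzip keeps such an index in the gzip Extra Field). Below is a minimal Python sketch of the idea using independent raw deflate streams; the chunk size and helper names are illustrative, and this is not the actual dictzip container format:

    import zlib

    CHUNK = 64 * 1024  # chunk size chosen for this sketch

    def compress_chunked(data):
        """Compress data as independently decompressible chunks and
        return (blob, offsets) so each chunk can be located later."""
        blob, offsets = bytearray(), [0]
        for i in range(0, len(data), CHUNK):
            # A fresh compressor per chunk keeps chunks independent,
            # mirroring dictzip's flushes at chunk boundaries.
            c = zlib.compressobj(wbits=-15)  # raw deflate stream
            blob += c.compress(data[i:i + CHUNK]) + c.flush()
            offsets.append(len(blob))
        return bytes(blob), offsets

    def read_range(blob, offsets, start, length):
        """Decompress only the chunks covering [start, start + length)."""
        first, last = start // CHUNK, (start + length - 1) // CHUNK
        out = bytearray()
        for n in range(first, last + 1):
            d = zlib.decompressobj(wbits=-15)
            out += d.decompress(blob[offsets[n]:offsets[n + 1]])
        skip = start - first * CHUNK
        return bytes(out[skip:skip + length])

    data = bytes(range(256)) * 1000
    blob, offsets = compress_chunked(data)
    assert read_range(blob, offsets, 130000, 50) == data[130000:130050]

Reading an arbitrary byte range then costs only the decompression of the chunks that cover it, which is what makes random access into very large compressed files practical.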
Optimize the encoding and transfer size of text-based assets. Next to eliminating unnecessary resource downloads, the best thing you can do to improve page load speed is to minimize the overall download size by optimizing and compressing the remaining resources. Data compression 101: Once you've set up your website…
id: 495 | owner: msakamoto-sf | created: 2009-11-22 17:11:47 | category: Linux UNIX Windows. Work got me curious about the history of the ZIP file format, so I did some digging. I had long vaguely wondered how gzip, zlib, and zip actually differ, but on Windows you can create zips with archivers like Lhaca or Lhaplus (or, since XP, with the OS itself), and on Linux/UNIX a couple of rounds of trial and error with command-line options plus the man pages is enough to build a tar.gz or unpack a zip made on Windows, so I had always settled for "oh well, good enough." Still, without diving into the technical details, I traced the rough history and lineage, mainly through Wikipedia…
At compression levels 2 and 3, xz achieves a higher compression ratio than bzip2 in much less time. Level 4 is interesting: the runtime increases sharply, yet the compression ratio drops. Starting at level 4, xz switches the algorithm it uses to find LZ match strings, and this seems to backfire. At level 7 and above, where the compression ratio beats bzip2 by more than 20%, the runtime is over five times longer. Mixing compression formats for log files is a hassle, and a fivefold runtime for this small a difference in ratio does not make me want to switch to xz. The compression ratio aside, xz's decompression speed is very attractive: decompressing a file compressed at the default level takes bzip2 1 minute 22 seconds, but xz only 25 seconds. Nearly three times faster decompression is a big advantage when aggregating logs. If we could choose the compression method over again, we might pick xz, adjusting the level as appropriate to bzi…
After a very fast evaluation, LZ4 has recently been integrated into the Apache project Hadoop - MapReduce. This is important news, since, in my humble opinion, Hadoop is among the most advanced and ambitious projects to date (an opinion which is shared by some). It also serves as an excellent illustration of LZ4 usage, as an in-memory compression algorithm for big server applications. But firs…
Ville Tuulos, Principal Engineer @ AdRoll (ville.tuulos@adroll.com). We faced the key technical challenge of modern Business Intelligence: how do you query tens of billions of events interactively? Our solution, DeliRoll, is implemented in Python. Everyone knows that Python is SLOW. You can't handle big data with low latency in Python! Small benchmark data: 1.5 billion rows, 400 columns, 660 GB. Smaller e…
For many companies, understanding what is going on in the business involves lots of data. But how do you query tens of billions of data points? How can a company begin to make sense of so much information? Ville Tuulos, Principal Engineer at AdRoll, a company producing tons of big data, demonstrates how AdRoll uses Python to squeeze every bit of performance out of a single high-end server. They m…
ORCFile in HDP 2: Better Compression, Better Performance. The upcoming Hive 0.12 is set to bring some great new advancements in the storage layer in the form of higher compression and better query performance. Higher compression: ORCFile was introduced in Hive 0.11 and offered excellent compression, delivered through a number of techniques including run-length encoding and dictionary encoding for string columns…
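Run-length encoding, one of the ORCFile techniques named above, is simple to sketch: collapse each run of repeated values into a (value, count) pair. A hypothetical Python illustration (the function is mine, not Hive's API):

    def rle_encode(values):
        """Collapse runs of equal values into [value, count] pairs."""
        runs = []
        for v in values:
            if runs and runs[-1][0] == v:
                runs[-1][1] += 1  # extend the current run
            else:
                runs.append([v, 1])  # start a new run
        return runs

    print(rle_encode([7, 7, 7, 3, 3, 9]))  # [[7, 3], [3, 2], [9, 1]]

Columns with long runs of repeated values collapse to very few pairs, which is where run-length encoding pays off.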
In byte dictionary encoding, a separate dictionary of unique values is created for each block of column values on disk. (An Amazon Redshift disk block occupies 1 MB.) The dictionary holds up to 256 unique values; the column values are then stored on disk as one-byte indexes into that dictionary. If a block contains more than 256 unique values, the extra values are written into the block in raw, uncompressed form. Th…
Text255 and text32k encodings are useful for compressing VARCHAR columns in which the same words recur often. A separate dictionary of unique words is created for each block of column values on disk. (An Amazon Redshift disk block occupies 1 MB.) The dictionary contains the first 245 unique words in the column; those words are replaced on disk by a one-byte index value representing one of the 245…
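Both dictionary encodings above follow the same per-block pattern: build a small dictionary from the first distinct values seen (up to 256 values for bytedict, 245 words for text255), replace occurrences with one-byte indexes, and store whatever does not fit in raw form. A hedged Python sketch of that pattern; the names and the (tag, value) layout are illustrative, not Redshift's on-disk format:

    def dict_encode_block(values, max_entries=256):
        """Encode one block: the first `max_entries` distinct values get
        one-byte dictionary indexes, later values fall back to raw."""
        dictionary, index, encoded = [], {}, []
        for v in values:
            if v in index:
                encoded.append(("idx", index[v]))
            elif len(dictionary) < max_entries:
                index[v] = len(dictionary)
                dictionary.append(v)
                encoded.append(("idx", index[v]))
            else:
                encoded.append(("raw", v))  # dictionary for this block is full
        return dictionary, encoded

For text255, the same loop would run over the words of each VARCHAR value rather than over whole column values.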
Mostly encodings are useful when the data type for a column is larger than most of the stored values require. By specifying a mostly encoding for this type of column, you can compress the majority of the values in the column to a smaller standard storage size. The remaining values that cannot be compressed are stored in their raw form. For example, you can compress a 16-bit column, such as an INT2…
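The mostly encodings can be sketched the same way: a value that fits the smaller width is stored compactly, and anything out of range falls back to raw storage. An illustrative Python version of the mostly16 idea (again, not Redshift's actual layout):

    def mostly16_encode(values):
        """Store values that fit a signed 16-bit range compactly;
        out-of-range values are kept raw."""
        lo, hi = -(1 << 15), (1 << 15) - 1
        return [("int2", v) if lo <= v <= hi else ("raw", v) for v in values]

    print(mostly16_encode([100, 32768, -5]))
    # [('int2', 100), ('raw', 32768), ('int2', -5)]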
Introduction: Apache HBase is the Hadoop open-source, distributed, versioned storage manager, well suited for random, realtime read/write access. Wait, what? Random, realtime read/write access? How is that possible? Isn't Hadoop just a sequential read/write, batch-processing system? Yes, we're talking about the same thing, and in the next few paragraphs I'm going to explain to you how HBase achieves it…
Delta encodings are very useful for date and time columns. Delta encoding compresses data by recording the difference between values that follow each other in the column. This difference is recorded in a separate dictionary for each block of column values on disk. (An Amazon Redshift disk block occupies 1 MB.) For example, suppose that the column contains 10 integers in sequence from 1 to 10. The first…
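Delta encoding itself is a two-line transform: keep the first value and store only the differences that follow. For the 1-to-10 example above, the column collapses to a run of ones, which then compresses extremely well. A minimal Python sketch:

    def delta_encode(values):
        """Store the first value, then successive differences."""
        return [values[0]] + [b - a for a, b in zip(values, values[1:])]

    def delta_decode(deltas):
        """Rebuild the original values with a running sum."""
        out = [deltas[0]]
        for d in deltas[1:]:
            out.append(out[-1] + d)
        return out

    print(delta_encode(list(range(1, 11))))  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]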
Identifying hidden causes of database latency and techniques for improving response @ dbtech showcase Tokyo 2019