[B! hdfs] s-wool's Bookmarks

s-wool id:s-wool

s-wool's bookmarks on hdfs (13)

Presto at Twitter
Presto is an open source distributed SQL query engine for running queries against large datasets stored in Hadoop/HDFS clusters. It uses in-memory parallel processing, pipelining, data locality, caching, and dynamic compilation to byte code for low query latency. Key techniques include caching frequently used metadata and compiled plans, processing data locally on nodes where it resides, and contr
s-wool 2016/03/24
presto, twitter, HDFS
Hadoop filesystem at Twitter
Twitter runs multiple large Hadoop clusters that are among the biggest in the world. Hadoop is at the core of our data platform and provides vast storage for analytics of user actions on Twitter. In this post, we will highlight our contributions to ViewFs, the client-side Hadoop filesystem view, and its versatile usage here. ViewFs makes the interaction with our HDFS infrastructure as simple as a
s-wool 2015/10/06
read later, hadoop, twitter, hdfs
Apache Hadoop YARN: Avoiding 6 Time-Consuming "Gotchas" | Cloudera Developer Blog
Having a good grasp of HDFS recovery processes is important when running or moving toward production-ready Apache Hadoop. An important design requirement of HDFS is to ensure continuous and correct operations to support production deployments. One particularly complex area is ensuring correctness of writes to HDFS in the presence of network and node failures, where the lease recovery, block recove
s-wool 2015/02/16
read later, later, hdfs, hadoop
Compression Options in Hadoop - A Tale of Tradeoffs
Yahoo! is one of the most-visited web sites in the world. It runs one of the largest private cloud infrastructures, one that operates on petabytes of data every day. Being able to store and manage that data well is essential to the efficient functioning of Yahoo!'s Hadoop clusters. A key component that enables this efficient operation is data compression. With regard to compression algorithms, the
s-wool 2014/08/13
hadoop, Yahoo, hdfs, compression
Saving capacity with HDFS RAID | Engineering Blog | Facebook Code | Facebook
As we continue to evolve our data infrastructure, we’re constantly looking for ways to maximize the utility and efficiency of our systems. One technology we’ve deployed is HDFS RAID, an implementation of Erasure Codes in HDFS to reduce the replication factor of data in HDFS. We finished putting this into production last year and wanted to share the lessons we learned along the way and how we incre
s-wool 2014/06/17
hdfs, hadoop, Facebook
dfs.datanode.failed.volumes.tolerated and datanode decommission - wyukawa's diary
HDFS has a configuration property called dfs.datanode.failed.volumes.tolerated. The default is 0. <property> <name>dfs.datanode.failed.volumes.tolerated</name> <value>0</value> <description>The number of volumes that are allowed to fail before a datanode stops offering service. By default any volume failure will cause a datanode to shutdown. </description> </property> The details are covered in the following. By default, the failure of a single dfs.data.dir
s-wool 2014/06/05
hadoop, hdfs
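Hadoop Slave Servers, JBOD, and RAID - カイワレの大冒険 Second
This is @masudaK, finding that every beer tastes great in summer. A while ago I had the chance to set up an environment for dedicated Hadoop servers, and deciding on the disk layout involved a lot that was new to me, so I am writing it down partly as a memo to myself. As a premise, assume that the JobTracker and NameNode run on the Hadoop Master, and that the TaskTracker and DataNode run on the Slaves. Assume also that the Slaves running the DataNode form a cluster, with data distributed and made redundant by HDFS. This article covers the disk layout of the Slaves, which read and write job data and need care with respect to IO. The Master stores important data such as metadata, but its read and write volume is small, and as long as it is simply set up with RAID1 its disks do not need much care, so in this article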
s-wool 2014/01/28
I feel like I understand it, and yet like I still don't.
hadoop, hdfs
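Problems Caused by Fluentd + WebHDFS and a Half-Dead DataNode | 外道父の匠
I use fluent-plugin-webhdfs to write from the Fluentd Collectors to HDFS, and when one DataNode died in an odd way, various things went wrong, so I am noting them down. I have not identified the cause or established a fix! My apologies. The direct cause was a SLAVE server (DataNode) going down halfway. The direct cause was a fault on one SLAVE server, and its state was as follows: the SLAVE server had a Kernel Panic!!; pings to the host got through; TCP connections to the various daemons could be established; none of the daemons gave any reply. I regret that what I tested was not the DataNode itself, but what I could check before recovery was the SSH connection: ssh -p22 host got no response, and telnet host 22 sat waiting for a request, a half-dead state. This state, Fluentd or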
s-wool 2013/10/01
I ran into a similar phenomenon and had no choice but to restart td-agent.
fluentd, webhdfs, hadoop, hdfs
http://www.makeitsmartjp.com/2013/01/hdfs-s3-copy.html
s-wool 2013/09/11
aws, hdfs
[Hadoop] Efficient Processing Using Multiple Disks
Hadoop can use multiple disks per node, and I investigated how much performance improves as disks are added. HDFS can use multiple disks if they are listed, comma-separated, in dfs.data.dir. <property> <name>dfs.data.dir</name> <value>/data/local/${user.name}/hadoop/dfs/data, /data/local2/${user.name}/hadoop/dfs/data</value> </property> However, this alone is not enough: unless multiple disks are also specified for the intermediate data written by map and reduce tasks, Hadoop jobs cannot use the disks efficiently. This can be set with mapred.local.dir. <property> <name>mapre
s-wool 2013/08/30
hadoop, hdfs, tuning
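The Edit Log When Using the HDFS Append Feature
The edit log and HDFS append. As I wrote previously in "Changes to HDFS fsimage and edits | Tech Blog", transactions are recorded in the HDFS edits. When data is written with the HDFS append feature, what do the contents of edits look like? I verified this using borrowed code. Preparing the sample code: copy the above code to create a Java file. (I only commented out the package line.) Preparing Avro: FsInput, which the code uses, does not appear to be included in CDH4, so download the full source from http://avro.apache.org/. $ wget http://ftp.tsukuba.wide.ad.jp/software/apache/avro/stable/avro-src-1.7.4.tar.gz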
s-wool 2013/08/27
hdfs, hadoop, webhdfs
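Write Problems with Fluentd + WebHDFS | 外道父の匠
I took the flow I posted earlier for streaming logs from Fluentd into WebHDFS, ran it in a somewhat harsher environment, and various problems surfaced, so I am recording them here. This is more about HDFS than about Fluentd. HDFS file CREATE errors: I tried a style in which multiple Fluentd Collectors all write to a single shared HDFS file, with one HDFS file per minute, like this. Collector-01 ─┐ Collector-02 ─┤ Collector-03 ─┼─> HDFS %Y%m%d-%H%M.log Collector-04 ─┤ ...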
s-wool 2013/08/22
hadoop, hdfs, fluentd, td-agent