[B! spark] mogwaing's bookmarks

mogwaing id:mogwaing

Bookmarks by mogwaing about spark (23)

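Part 20: Design and Implementation of Spark [1]: Background and Data-Processing Characteristics | gihyo.jp
Introduction. Over this and the next installment, we explain Spark, one of the parallel data processing systems. We first describe how Spark's development began and then explain its characteristics. Background to Spark's emergence: Like Hadoop MapReduce, Spark is a parallel data processing system that uses multiple machines to process data. Development began in 2009 at the AMPLab at the University of California, Berkeley, led by Matei Zaharia. When Spark's development started, Hadoop already existed, and running highly fault-tolerant, scalable parallel data processing on commodity machines was becoming commonplace. However, Hadoop MapReduce was not necessarily designed to use each machine's memory efficiently. Hadoop MapReduce ...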
mogwaing 2016/05/11
spark, hadoop, parallel processing, distributed system

How to recommend top 10 products in Spark ALS for all the users?
How can we get the top 10 recommended products in PySpark? I understand there are methods like recommendProducts to recommend products for a single user and predictAll to predict the rating for a {user, item} pair. But is there an efficient way I can output the top 10 items for each user, for all the users?
mogwaing 2016/03/21
spark, mllib

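The task asked about in the entry above, keeping the 10 best-scored items for every user, can be sketched in plain Python. This is only an illustration of the selection step, not Spark or MLlib code: the function name `top_n_per_user` and the `(user, item, score)` tuple layout are assumptions made for the example, and the scores are made-up sample values.

```python
import heapq
from collections import defaultdict


def top_n_per_user(predictions, n=10):
    """Return {user: [(item, score), ...]} holding the n highest-scored items per user.

    `predictions` is an iterable of (user, item, score) tuples (an assumed layout).
    """
    # Group the predicted scores by user.
    by_user = defaultdict(list)
    for user, item, score in predictions:
        by_user[user].append((item, score))

    # For each user, keep only the n items with the largest score, best first.
    return {
        user: heapq.nlargest(n, items, key=lambda pair: pair[1])
        for user, items in by_user.items()
    }


# Hypothetical sample data for illustration.
predictions = [
    ("u1", "a", 4.5),
    ("u1", "b", 3.0),
    ("u1", "c", 4.9),
    ("u2", "a", 2.0),
    ("u2", "d", 3.5),
]
top2 = top_n_per_user(predictions, n=2)
```

On a cluster the same group-then-select pattern would be expressed with the framework's own grouping operations; the sketch only shows the per-user logic.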
26 Trillion App Recommendations using 100 Lines of Spark Code - Ayman Farahat
mogwaing 2016/03/21
spark, mllib

PySpark: Top N Records In Each Group
mogwaing 2016/03/16
spark, tips

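[Machine Learning] Trying Recommendation by Running Spark MLlib from Python - Qiita
This is the second installment of the Spark series. This time we use MLlib to implement recommendation with collaborative filtering. Part one: [Machine Learning] Launching Spark in iPython Notebook and trying MLlib http://qiita.com/kenmatsu4/items/00ad151e857d546a97c3 Environment: OS: Mac OSX Yosemite 10.10.3; Spark: spark-1.5.0-bin-hadoop2.6; Python: 2.7.10 |Anaconda 2.2.0 (x86_64)| (default, May 28 2015, 17:04:42). This article describes what was done in the environment above, so please note that settings may differ in other environments. It also basically assumes that Spark is run from iPython Notebook.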
mogwaing 2016/02/26
spark, mllib, machine learning, collaborative filtering

Search | Packt Subscription
Search over 7,500 Programming & Development eBooks and videos to advance your IT skills, including Web Development, Application Development and Networking
mogwaing 2016/02/25
spark, mllib, collaborative filtering

https://docs.prediction.io/templates/recommendation/training-with-implicit-preference/
mogwaing 2016/02/19
collaborative filtering, spark, mllib

Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
This document discusses how the Spark Cassandra Connector optimizes for data locality when performing analytics on Cassandra data using Spark. It does this by using the partition keys and token ranges to create Spark partitions that correspond to the data distribution across the Cassandra nodes, allowing work to be done locally to each data node without moving data across the network. This improve
mogwaing 2015/09/25
cassandra, spark

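Part 8: Parallel Algorithms in Data Processing [3] | gihyo.jp
Introduction. The previous installment explained the basic strategy for parallelizing join processing and a concrete parallel algorithm for sort-merge join. This installment explains a concrete parallel algorithm for hash join, which is used in Impala and Presto as well as in Apache Spark and in the Map Join of Hadoop MapReduce. Parallel algorithm for hash join: Hash join uses the hash values of records to find records that have the same attribute value in two datasets [1]. That is, a hash function is applied to the join key of every record in one dataset to compute hash values, and a hash table made of those hash values is built in advance; the hash table is then looked up using the hash values obtained by applying the same hash function to the join keys of the records in the other dataset, so that the same ...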
mogwaing 2015/08/05
hadoop, parallel db, parallel processing, distributed system, spark, impala, presto

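The build-and-probe idea described in the hash join entry above can be sketched as a minimal single-process example. It is not the parallel algorithm from the article and not how Impala, Presto, or Spark implement it: the function name `hash_join`, the dict-based record layout, and the sample rows are all assumptions for illustration, and Python's built-in dict stands in for the hash function and hash table.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Equi-join two lists of dict records using a hash table."""
    # Build phase: index every record of one input by its join key.
    # The dict hashes the key internally, playing the role of the hash table.
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)

    # Probe phase: for each record of the other input, look up its join key
    # in the table and emit one combined record per match.
    joined = []
    for row in probe_rows:
        for match in table.get(row[probe_key], []):
            joined.append({**match, **row})
    return joined


# Hypothetical sample data for illustration.
users = [{"user_id": 1, "name": "a"}, {"user_id": 2, "name": "b"}]
orders = [
    {"user_id": 1, "item": "x"},
    {"user_id": 1, "item": "y"},
    {"user_id": 3, "item": "z"},
]
result = hash_join(users, orders, "user_id", "user_id")
```

Building the table on the smaller input keeps memory use down, since only that side has to be held in the table while the other side is streamed through the probe loop.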
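Part 5: Parallelizing Data Processing | gihyo.jp
Introduction. The previous installment organized the methods of data processing and, in preparation for a detailed look at data processing systems such as parallel databases that use a declarative language as their interface, briefly explained how such systems create an execution plan. This installment gives an overview of how such systems parallelize an execution plan. Parallelism in data processing: Data processing systems, parallel databases among them, process data according to the content of a query such as an SQL statement. From the standpoint of queries, the parallelism used in such systems can be classified into two kinds: inter-query parallelism and intra-query parallelism. In inter-query parallelism, multiple different queries ...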
mogwaing 2015/05/27
hadoop, impala, presto, spark, parallel processing, parallel db, distributed system

http://www.cise.ufl.edu/class/cis6930fa11lad/cis6930fa11_Spark.pdf
mogwaing 2014/11/25
spark, parallel processing

Spark Summit Videos Now Online
mogwaing 2014/07/19
spark, hadoop, presentation

Spark at Twitter - Seattle Spark Meetup, April 2014
The document discusses the results of a study on the impact of climate change on coffee production. Researchers found that suitable land for coffee production could decline by up to 50% by 2050 due to rising temperatures and changing rain patterns associated with climate change. Arabica coffee was found to be most at risk, as its growing regions
mogwaing 2014/07/07
spark, hadoop, pig, scalding, twitter

Classifying documents using Naive Bayes on Apache Spark / MLlib
In recent years, Apache Spark has gained in popularity as a faster alternative to Hadoop, and it reached a major milestone last month by releasing the production-ready version 1.0.0. It claims to be up to 100 times faster by leveraging the distributed memory of the cluster and by not being tied to the multi-stage execution of Map/Reduce. Like Hadoop, it offers a similar ecosystem with a database
mogwaing 2014/06/16
spark, hadoop, mllib, naive bayes

What's new in Apache Mahout
mogwaing 2014/05/28
mahout, hadoop, spark, analytics

Spark SQL: Manipulating Structured Data Using Apache Spark
mogwaing 2014/03/27
hadoop, spark, shark, parallel processing, parallel db

SparkR by amplab-extras
R frontend for Spark. SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. SparkR exposes the Spark API through the RDD class and allows users to interactively run jobs from the R shell on a cluster. NOTE: As of April 2015, SparkR has been officially merged into Apache Spa
mogwaing 2014/02/21
spark, parallel processing, hadoop, R

https://amplab.cs.berkeley.edu/2014/01/26/large-scale-data-analysis-made-easier-with-sparkr/
mogwaing 2014/02/03
parallel processing, database, spark, R

Big Data Benchmark
Introduction: Several analytic frameworks have been announced in the last year. Among them are inexpensive data-warehousing solutions based on traditional Massively Parallel Processor (MPP) architectures (Redshift), systems which impose MPP-like execution engines on top of Hadoop (Impala, HAWQ), and systems which optimize MapReduce to improve per
mogwaing 2013/06/28
database, benchmark, parallel processing, impala, shark, spark, redshift, hawq

Real-time Processing (Spark, Puma, HOP)
Spark Streaming: Spark Streaming is an interesting extension to Spark that adds support for continuous stream processing. It is in active development at UC Berkeley's AMPLab alongside the rest of the Spark project. The group recently gave a presentation at AmpCamp 2012, and the video gives a pretty good overview. If you'd like to follow along with the video with your own copy o
mogwaing 2013/06/28
parallel processing, database, spark, stream processing
