[B! spark] uokada's Bookmarks

uokada id:uokada

uokada's bookmarks on spark (23)

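Yahoo! Shopping: Migrating the Data Platform to Next-Generation Query Engines (Spark/Trino)
Introduction. This blog series covers a large-scale project undertaken to optimize Yahoo! Shopping's data analytics platform: the migration from Apache Hive to Trino and Apache Spark...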
uokada 2025/05/17
trino

spark

yahoo
Link
Roblox Research
uokada 2024/07/16
spark
Link
Easy Guide to Create a Write Data Source in Apache Spark 3
uokada 2024/03/21
spark

apache spark
Link
Easy Guide to Create a Custom Read Data Source in Apache Spark 3
uokada 2024/03/21
spark

apache spark
Link
Best practices for performance tuning AWS Glue for Apache Spark jobs
Best practices for performance tuning AWS Glue for Apache Spark jobs Roman Myers, Takashi Onikura, and Noritaka Sekiyama, Amazon Web Services (AWS) December 2023 (document history) AWS Glue provides different options for tuning performance. This guide defines key topics for tuning AWS Glue for Apache Spark. It then provides a baseline strategy for you to follow when tuning these AWS Glue for Apach
uokada 2024/01/16
aws

spark

performance
Link
How to set timezone to UTC in Apache Spark?
uokada 2023/12/18
spark
Link
Upgrading Data Warehouse Infrastructure at Airbnb
uokada 2022/10/25
Read later

data

spark

framework

infrastructure
Link
Run startup commands in spark-shell
uokada 2022/04/27
“:load /Users/steve/.scalarc”

spark
Link
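Testing data quality at scale with Deequ | Amazon Web Services
Amazon Web Services Blog: Testing data quality at scale with Deequ. You probably write unit tests for your code, but do you also test your data? Incorrect or malformed data can have a large impact on production systems. Examples of data quality issues include the following. Missing values can cause errors (NullPointerException) in production systems that require non-null values. A change in the data distribution can lead to unexpected output from machine learning models. Incorrect aggregations of data can lead to wrong business decisions. In this blog post, we introduce Deequ, an open source tool developed and used at Amazon. With Deequ, the data quality metrics of a dataset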
uokada 2021/10/29
amazon

spark

dataQuality
Link
Apache Spark 3.1 Release: Spark on Kubernetes is now Generally Available
Apache Spark 3.1 Release: Spark on Kubernetes is now Generally Available With the Apache Spark 3.1 release in March 2021, the Spark on Kubernetes project is now officially declared as production-ready and Generally Available. This is the achievement of 3 years of booming community contribution and adoption of the project – since initial support for Spark-on-Kubernetes was added in Spark 2.3 (Febru
uokada 2021/05/12
spark

k8s

kubernetes
Link
GitHub - awslabs/deequ: Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
uokada 2021/03/05
GitHub

spark
Link
Migrating Apache Spark workloads from AWS EMR to Kubernetes
Introduction: ESG research found that 43% of respondents are considering cloud as their primary deployment for Apache Spark. And it makes a lot of sense because the cloud provides scalability, reliability, availability, and massive economies of scale. Another strong selling point of cloud deployment is a low barrier of entry in the form of managed services. Each one of the ‘Big Three’ cloud providers co
uokada 2020/10/15
EKS

aws

spark

k8s
Link
GitHub - mrpowers-io/spark-daria: Essential Spark extensions and helper methods ✨😲
uokada 2020/05/04
spark
Link
Pyspark — data manipulation and pipeline
uokada 2020/04/29
python

spark
Link
Big Data: Google Replaces YARN with Kubernetes to Schedule Apache Spark
uokada 2020/02/01
kubernetes

yarn

apache

spark
Link
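Best practices to scale Apache Spark jobs and partition data with AWS Glue | Amazon Web Services
Amazon Web Services Blog: Best practices to scale Apache Spark jobs and partition data with AWS Glue. AWS Glue provides a serverless environment to prepare (extract and transform) and load large datasets from a variety of data sources for analytics and data processing with Apache Spark ETL jobs. This series of posts discusses best practices that help developers of Apache Spark applications and Glue ETL jobs, big data architects, data engineers, and business analysts automatically scale the data processing jobs they run on AWS Glue. The first post covers two AWS Glue features that are key to managing the scaling of data processing jobs. The first is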
uokada 2019/11/06
aws

spark

apache
Link
Spark+AI Summit 2019 Session Highlights (Spark Meetup Tokyo #1 - Spark+AI Summit 2019)
■Spark Meetup Tokyo #1 - Spark+AI Summit 2019 presentation slides (2019/06/12). Spark+AI Summit 2019 Session Highlights. NTT DATA Corporation, Technology and Innovation General Headquarters: 猿田浩輔 / 田中正浩 / 都築正宜. *Event overview: https://spark-meetup-tokyo.connpass.com/event/131791/
uokada 2019/06/15
spark
Link
Spark Internals - Hadoop Source Code Reading #16 in Japan
The document discusses Spark internals and provides an overview of key components such as the Spark code base size and growth over time, core developers, Scala basics used in Spark, RDDs, tasks, caching/block management, and schedulers for running Spark on clusters including Mesos and YARN. It also includes tips for using IntelliJ IDEA to work with Spark's Scala code base.
uokada 2019/06/15
spark

hadoop
Link
How to Choose a Queueing System to Support Stream Processing
This document discusses messaging queues and platforms. It begins with an introduction to messaging queues and their core components. It then provides a table comparing 8 popular open source messaging platforms: Apache Kafka, ActiveMQ, RabbitMQ, NATS, NSQ, Redis, ZeroMQ, and Nanomsg. The document discusses using Apache Kafka for streaming and integration with Google Pub/Sub, Dataflow, and BigQuery
uokada 2019/03/17
spark

queue

druid
Link
Hue - The open source SQL Assistant for Data Warehouses
How to use the Livy Spark REST Job Server API for sharing Spark RDDs and contexts. Published on 12 February 2016 in Hue 3.10 / Programming / Spark / Tutorial - 4 minutes read - Last modified on 04 February 2020 (The original blog post is here.) Livy is an open source REST interface for using Apache Spark from anywhere. Livy supports executing snippets of code or programs in Python, Scala, or R in a Spark context that runs locally or on YARN. Episode 1 previously described how to use the interactive shell API. In this follow-up,
uokada 2019/02/10
spark

livy
Link