Search results for "stream processing": 41 - 80 of 87

  • The Future of Data Engineering

    The Future of Data Engineering Chris Riccomini on July 29, 2019 I have been thinking lately about where we’ve come in data engineering over the past few years, and about what the future holds for work in this area. Most of this thought has been framed in the context of what some of our teams are doing at WePay, but I believe the framework below applies more broadly, and is worth sharing. I present

    • Data-Oriented Design

      Online release of Data-Oriented Design : This is the free, online, reduced version. Some inessential chapters are excluded from this version, but in the spirit of this being an education resource, the essentials are present for anyone wanting to learn about data-oriented design. Expect some odd formatting and some broken images and listings as this is auto generated and the Latex to html converter

      • Our First Netflix Data Engineering Summit

        Introduction: Earlier this summer Netflix held our first-ever Data Engineering Forum. Engineers from across the company came together to share best practices on everything from Data Processing Patterns to Building Reliable Data Pipelines. The result was a series of talks which we are now sharing with the rest of the Data Engineering community! You can find each of the talks below with a short descri

        • DataHub: A generalized metadata search & discovery tool

          Authored by Mars Lan, Co-Founder & CTO at Metaphor | Co-creator of DataHub August 14, 2019 Co-authors: Mars Lan, Seyi Adebajo, Shirshanka Das Editor’s note: Since publishing this blog post, the team open sourced DataHub in February 2020. You can read more on the journey of open sourcing the platform here. As the operator of the world’s largest professional network and the Economic Graph, LinkedIn’s

            • Optimizing batch processing with custom checkpoints in AWS Lambda | Amazon Web Services

              AWS Compute Blog Optimizing batch processing with custom checkpoints in AWS Lambda AWS Lambda can process batches of messages from sources like Amazon Kinesis Data Streams or Amazon DynamoDB Streams. In normal operation, the processing function moves from one batch to the next to consume messages from the stream. However, when an error occurs in one of the items in the batch, this can result in re

              • How LinkedIn customizes Apache Kafka for 7 trillion messages per day

                Open Source How LinkedIn customizes Apache Kafka for 7 trillion messages per day Co-authors: Jon Lee and Wesley Wu Apache Kafka is a core part of our infrastructure at LinkedIn. It was originally developed in-house as a stream processing platform and was subsequently open sourced, with a large external adoption rate today. While many other companies and projects leverage Kafka, few—if any—do so at

                • GitHub - ArroyoSystems/arroyo: Distributed stream processing engine in Rust

                  • Introducing Amazon Kinesis Data Analytics Studio – Quickly Interact with Streaming Data Using SQL, Python, or Scala | Amazon Web Services

                    AWS News Blog Introducing Amazon Kinesis Data Analytics Studio – Quickly Interact with Streaming Data Using SQL, Python, or Scala The best way to get timely insights and react quickly to new information you receive from your business and your applications is to analyze streaming data. This is data that must usually be processed sequentially and incrementally on a record-by-record basis or over sli

                    • A look at "the most popular OSS data projects" (ranks 1-10)

                      Note: the exact survey questions are unclear. I briefly looked into the top 20 products in the article above. Coverage is thin for products I don't know well (such as Flyte) and for products everyone probably knows (such as Spark). For reference, my own knowledge is roughly as follows. Know: Apache Airflow, Trino, Prefect, Apache Spark, Amundsen, Apache Flink, Apache Kafka, Apache Druid, pandas. Know by name only: dbt, Apache Pinot, Apache Superset, Great Expectations, Dask, Apache Arrow, Apache Gobblin. Don't know: Dagster, Flyte, RudderStack, Ray. Contents: dbt, Apache Airflow, Apache Superset

                      • Hello, Redis Stack - Redis

                        Today we’re thrilled to announce Redis Stack. Redis Stack consolidates the capabilities of the leading Redis modules into a single product, making it easy for developers to build modern, real-time applications with the speed and stability of Redis. Prologue At Redis, we’re building a real-time data layer to meet the universal demand for responsive, low-latency applications and services. To build a

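                        • Functionality required of stream processing systems, and how Apache Flink supports it

                          Introduction: This post takes topics from a survey paper on stream processing and illustrates them with examples from Apache Flink. Paper overview: Fragkoulis, M., Carbone, P., Kalavri, V., & Katsifodimos, A. (2020). A Survey on the Evolution of Stream Processing Systems. A 2020 paper. It surveys stream processing frameworks from roughly the past 30 years and discusses how they have evolved. For several functionalities characteristically required of stream processing, it lists ways of realizing each and contrasts relatively old frameworks with recent ones. Scope of this post: This post takes the functionality required of stream processing systems described above, and why it is needed, and brief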

                          • Scribe: Transporting petabytes per hour via a distributed, buffered queueing system

                            Scribe: Transporting petabytes per hour via a distributed, buffered queueing system Our hardware infrastructure comprises millions of machines, all of which generate logs that we need to process, store, and serve. The total size of these logs is several petabytes every hour. The outputs are generally processed somewhere other than where they were generated: They can be relevant to a variety of dow

                            • Project Flogo

                              Project Flogo Ecosystem Scroll through the action elements to read more about what you can build on the core! Project Flogo is a resource efficient, Go-based open source ecosystem for building event-driven apps. Event-driven, you say? Yup, the notion of triggers and actions are leveraged to process incoming events. An action, a common interface, exposes key capabilities such as application integra

                              • Spring Batch on Kubernetes: Efficient batch processing at scale

                                Spring Batch on Kubernetes: Efficient batch processing at scale Introduction Batch processing has been a challenging area of computer science since its inception in the early days of punch cards and magnetic tapes. Nowadays, the modern cloud computing era comes with a whole new set of challenges for how to develop and operate batch workload efficiently in a cloud environment. In this blog post, I

                                • GitHub - gazette/core: Build platforms that flexibly mix SQL, batch, and stream processing paradigms

                                  Gazette makes it easy to build platforms that flexibly mix SQL, batch, and millisecond-latency streaming processing paradigms. It enables teams, applications, and analysts to work from a common catalog of data in the way that's most convenient to them. Gazette's core abstraction is a "journal" -- a streaming append log that's represented using regular files in a BLOB store (i.e., S3). The magic of

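                                  • How Dataflow works: Dataflow's techniques | Google Cloud Official Blog

                                    Note: This post is an abridged translation of a post published on the Google Cloud blog on August 22, 2020 (US time). Editor's note: This is the second post in a three-part blog series that digs into the internal Google history that led to the development of Dataflow, Dataflow's capabilities as a Google Cloud service, and how it compares with other products on the market. Please see the first post, How Dataflow works: the origin story. In the first part of this series, we covered the background of Dataflow's development inside Google and explained how it compares with the Lambda architecture. This time, let's take a closer look at some of the key systems that power Dataflow. As mentioned in part one, Dataflow makes use of many technologies that were built for earlier systems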

                                    • Designing a Production-Ready Kappa Architecture for Timely Data Stream Processing

                                      Designing a Production-Ready Kappa Architecture for Timely Data Stream Processing At Uber, we use robust data processing systems such as Apache Flink and Apache Spark to power the streaming applications that help us calculate up-to-date pricing, enhance driver dispatching, and fight fraud on our platform. Such solutions can process data at a massive scale in real time with exactly-once semantics,

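                                      • We adopted Fluent Bit: how to run and verify it locally, and pitfalls we hit along the way - Uzabase for Engineers

                                        I'm Oba, an engineer working at both AlphaDrive and NewsPicks. Recently I've been developing the new platform for NewsPicks Web. The new platform is developed with Next.js and built on AWS Fargate, and we adopted Fluent Bit to send the logs collected on Fargate to S3 and New Relic. This post covers how to run and verify it locally and the problems we ran into during adoption. Contents: What is Fluent Bit; Running and verifying locally; Choosing an image; Preparing the configuration file; Adding debug settings; Verifying behavior; Expanding LTSV-format logs; Using the Stream Processor; Other settings; Pitfalls we hit when adopting Fluent Bit; The S3 plugin fixes Content-Encoding: gzip when gzip compression is used; the data in S3 objects accurately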

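                                        • Scalable real-time analytics made more accessible than ever with Pub/Sub | Google Cloud Official Blog

                                          Note: This post is an abridged translation of a post published on the Google Cloud blog on December 8, 2020 (US time). These days, real-time analytics has become essential to business. Automated real-time decision-making based on the latest data is no longer just for advanced, technology-first companies. It is becoming a basic way of doing business. According to IDC, more than a quarter of the data created will be real-time data within the next five years. One factor likely driving this growth is the competitive pressure to improve the quality of services and user experiences. Another factor is the consumerization of many traditional businesses. Many functions that used to be performed by agents are now performed by consumers themselves. Today, banks, retailers, and service providers each, internal a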

                                          • Download free O'Reilly books · GitHub

                                            books.md From theme: Programming Microservices for Java Developers: A Hands-On Introduction to Frameworks and Containers http://www.oreilly.com/programming/free/files/microservices-for-java-developers.pdf http://www.oreilly.com/programming/free/files/microservices-for-java-developers.epub http://www.oreilly.com/programming/free/files/microservices-for-java-developers.mobi Modern Java EE Design Pat

                                            • Lessons Learned: The Journey to Real-Time Machine Learning at Instacart

                                              Figure 1: How ML models support shopping journey at Instacart. Instacart incorporates machine learning extensively to improve the quality of experience for all actors in our “four-sided marketplace”: customers who place orders on Instacart apps to get deliveries in as fast as 30 minutes, shoppers who can go online at anytime to fulfill customer orders, retailers that sell their products and can mak

                                              • Announcing Message DB: Event Store and Message Store for PostgreSQL

                                                The Eventide Project team is excited to announce Message DB: A fully-featured event store and message store implemented in PostgreSQL for pub/sub, event sourcing, and evented microservices applications. For more specifics, visit Message DB on GitHub: https://github.com/message-db/message-db Message DB was distilled from the Eventide Project to make it easier for users to write clients in the langu

                                                • Data Engineer: Interview Questions

                                                  Here is a list of common data engineering interview questions, with answers, which you may encounter for an interview as a data engineer. The questions during an interview for a data engineer aim to check not only the grasp of data systems and architectures but also a keen understanding of your technical prowess and problem-solving skills. This article lists essential interview questions and answe

                                                  • New AWS Lambda controls for stream processing and asynchronous invocations | Amazon Web Services

                                                    AWS Compute Blog New AWS Lambda controls for stream processing and asynchronous invocations Today AWS Lambda is introducing new controls for asynchronous and stream processing invocations. These new features allow you to customize responses to Lambda function errors and build more resilient event-driven and stream-processing applications. Stream processing function invocations When processing data

                                                    • Change Data Capture for Microservices

                                                      Transcript Morling: Welcome to this talk about Change Data Capture for microservices. Let me set the scene a little bit with a maybe blunt statement and an observation. The world around us, this is happening in real time. People buy stuff in an online store, maybe they do some payment transactions. Maybe you have machinery or IoT devices, which send over measurements or all kinds of sensor data. N

                                                      • SE Radio 393: Jay Kreps on Enterprise Integration Architecture with a Kafka Event Log – Software Engineering Radio

                                                        SE Radio 393: Jay Kreps on Enterprise Integration Architecture with a Kafka Event Log Jay Kreps, CEO of Confluent, discusses an enterprise integration architecture organized around an event log. Robert Blumen spoke with Jay about the N-squared problem of data integration; how LinkedIn tried and failed to solve the integration problem; the nature of events; the enterprise event schema; schema defin

                                                        • Machine Learning Design Patterns - higepon blog

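                                                          Scaling: Min-max & clipping work well for uniform distributions; Z-score works well for normal distributions. Depending on the input data, a non-linear transformation is more appropriate, for example Wikipedia page views. Honestly, I hadn't been aware of this. I tried it from this angle on the data from the pressure competition (02-01-scaling.ipynb). Categorical: I had never considered the case where the input is an array of categoricals. I now understand the difference between dummy and one-hot encoding. Design Pattern 1: Hashed Feature. A pattern I haven't encountered on Kaggle. It's nice that it can handle new IDs and cold start. What do you do when an airport that isn't in the training data gets built? That was an easy-to-understand example. Intuitively, hash colli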

                                                          • Real-time machine learning: challenges and solutions

                                                            [Twitter discussion, LinkedIn] Updates Jan 3, 2023: Update the online features section to differentiate between real-time features and near real-time features. If you’re interested in this topic, my book Designing Machine Learning Systems (O’Reilly, June 2022) covers online prediction and continual learning in much more detail. Real-time machine learning is the approach of using real-time data to

                                                            • The Day of a new Command-Line Interface: Shell

                                                              This article continues the long-lost series on how to migrate away from terminal protocols as the main building block for command-line and text-dominant user interfaces. The previous ones (Chasing the dream of a terminal-free CLI (frustration/idea, 2016) and Dawn of a new Command-Line Interface (design, 2017)) might be worth an extra read afterwards, but they are not prerequisites to understanding

                                                              • Going Reactive with Spring, Coroutines and Kotlin Flow

                                                                Going Reactive with Spring, Coroutines and Kotlin Flow Since we announced Spring Framework official support for Kotlin in January 2017, a lot of things happened. Kotlin was announced as an official Android development language at Google I/O 2017, we continued to improve the Kotlin support across Spring portfolio and Kotlin itself has continued to evolve with key new features like coroutines. I wou

                                                                • Data engineering at Meta: High-Level Overview of the internal tech stack

                                                                  Data engineering at Meta: High-Level Overview of the internal tech stack This article provides an overview of the internal tech stack that we use on a daily basis as data engineers at Meta. The idea is to shed some light on the work we do, and how the tools and frameworks contribute to making our day-to-day data engineering work more efficient, and to share some of the design decisions and technic

                                                                  • Lessons learned from combining SQS and Lambda in a data project - Solita Data

                                                                    In June 2018, AWS Lambda added Amazon Simple Queue Service (SQS) to supported event sources, removing a lot of heavy lifting of running a polling service or creating extra SQS to SNS mappings. In a recent project we utilized this functionality and configured our data pipelines to use AWS Lambda functions for processing the incoming data items and SQS queues for buffering them. The built-in functio

                                                                    • GitHub - infinyon/fluvio: Lean and mean distributed stream processing system written in rust and web assembly.

                                                                      • Decoding protobuf messages using AWS Lambda | Amazon Web Services

                                                                        AWS Compute Blog Decoding protobuf messages using AWS Lambda This post is written by Ennio Pastore, Data Lab Architect. Protobuf is short for protocol buffers, which are language- and platform-neutral mechanisms for serializing structured data. Compared to XML or JSON the size of the messages is smaller, so the network transfer is faster, reducing latency in the interactions between applications.

                                                                        • Rapid Event Notification System at Netflix

                                                                          By: Ankush Gulati, David Gevorkyan Additional credits: Michael Clark, Gokhan Ozer Intro: Netflix has more than 220 million active members who perform a variety of actions throughout each session, ranging from renaming a profile to watching a title. Reacting to these actions in near real-time to keep the experience consistent across devices is critical for ensuring an optimal member experience. This

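                                                                          • OSS for data lakes - Delta Lake, Apache Hudi, Apache Kudu

                                                                            Introduction: In the previous post I looked into what a data lake is. This time I want to look at which OSS projects are drawing attention in the data lake context. Below is a presentation deck by NTT Data, which lists three "OSS storage layer software products that have appeared in recent years and can be used for real-time analytics": Delta Lake, Apache Hudi, and Apache Kudu. All of these serve as a logical storage layer. There may be nothing to add to that deck, but the aim of this post is to summarize what I researched and understood myself from the data lake perspective. Of course Hadoop, Hive, Spark, and so on are also extremely important in the data lake context, but they were in wide use before the term "data lake" became common, so this post does not cover them. Del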

                                                                            • GitHub - puresec/sas-top-10: Serverless Architectures Security Top 10 Guide

                                                                              The Ten Most Critical Risks for Serverless Applications v1.0 Preface The “Serverless architectures Security Top 10” document is meant to serve as a security awareness and education guide. The document is curated and maintained by top industry practitioners and security researchers with vast experience in application security, cloud and serverless architectures. As many organizations are still expl

                                                                              • RabbitMQ vs Kafka: Which Platform Should You Choose in 2023?

                                                                                Have you ever found yourself standing at a crossroads, trying to decide between RabbitMQ vs Kafka for your Microservices-based system? Have you ever wondered which of these messaging platforms is most suitable for your use case? RabbitMQ and Apache Kafka are well-known solutions in the asynchronous messaging domain, but despite popular belief, they aren’t one-size-fits-all solutions. As a software

                                                                                • Structured Streaming Programming Guide - Spark 3.5.1 Documentation

                                                                                  Structured Streaming Programming Guide Overview Quick Example Programming Model Basic Concepts Handling Event-time and Late Data Fault Tolerance Semantics API using Datasets and DataFrames Creating streaming DataFrames and streaming Datasets Input Sources Schema inference and partition of streaming DataFrames/Datasets Operations on streaming DataFrames/Datasets Basic Operations - Selection, Projec