horicky.blogspot.com
This is the second part of my text processing series. In this post, we'll look at how text documents can be stored in a form that a query can easily retrieve. I'll use the popular open source Apache Lucene index for illustration. There are two main processing flows in the system ... Document indexing: Given a document, add it to the index. Document retrieval: Given a query, retrieve th
Continuing from my last post, which walked down the list of machine learning techniques, in this post I will cover Decision Tree and Ensemble methods. We'll continue using the iris data we prepared in this earlier post. The Decision Tree model is one of the oldest machine learning models and is usually used to illustrate the very basic idea of machine learning. Based on a tree of decision nodes, the lea
In the previous 2 posts, we covered how to visualize input data to explore strong signals, as well as how to prepare input data in a form that is suitable for learning. In this and subsequent posts, I'll go through various machine learning techniques to build our predictive model. Linear regression Logistic regression Linear and Logistic regression with regularization Neural network Support
NOSQL has become a very hot topic for large web-scale deployments, where scalability and semi-structured data have driven DB requirements towards NOSQL. Many NOSQL products have evolved over the last couple of years. In my past posts, I have covered the underlying distributed system theory of NOSQL, as well as some specific products such as CouchDB and Cassandra/HBase. Last Friday I wa
I was motivated to write this post by a discussion on the Machine Learning Connection group. For classification and regression problems, there are different choices of machine learning models, each of which can be viewed as a black box that solves the same problem. However, each model comes from a different algorithmic approach and will perform differently on different data sets. The best way is to
Unsupervised machine learning has broad application in many e-commerce sites, and one common usage is to find clusters of consumers with common behaviors. Among clustering methods, K-means is the most basic and also an efficient one. K-Means clustering involves the following logical steps 1) Determine the value of k 2) Determine the initial k centroids 3) Repeat until convergence - Determine membership: Assi
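The K-means steps listed in the excerpt above can be sketched in a few lines of Python. This is a minimal illustration, not the post's own code: the excerpt is cut off after the membership step, so the centroid update (mean of each cluster) and the stopping rule (centroids no longer change) are the standard textbook steps filled in as assumptions. The function name `kmeans` and the sample points are hypothetical.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Plain K-means; k is implied by the number of initial centroids."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Determine membership: assign each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Assumed update step: move each centroid to the mean of its members.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        # Assumed convergence test: stop once the centroids stop moving.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

points = [(0, 0), (0, 2), (10, 10), (10, 12)]
final_centroids, final_clusters = kmeans(points, [(0, 0), (10, 10)])
print(final_centroids)
```

Here k = 2 and the initial centroids are simply picked from the data; how to choose k and the starting centroids well is a separate question that this sketch does not address.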
A lot of real life problems can be expressed in terms of entities related to each other and are best captured using graphical models. Well defined graph theory can be applied to process the graph and return interesting results. The general processing patterns can be categorized into the following ... Capture (e.g. When John is connected to Peter in a social network, a link is created between two Pe
The Hadoop Map/Reduce model is very good at processing large amounts of data in parallel. It provides a general partitioning mechanism (based on the key of the data) to distribute aggregation workload across different machines. Basically, map/reduce algorithm design is all about how to select the right key for the record at different stages of processing. However, "time dimension" has a very different c
Looking back after 2.5 years since my previous post on scalable system design techniques, I've observed an emergence of a set of commonly used design patterns. Here is my attempt to capture and share them. Load Balancer In this model, there is a dispatcher that determines which worker instance will handle the request based on different policies. The application should best be "stateless" so any wo
Recently in a number of "scalability discussion meetings", I've seen the following pattern coming up repeatedly ... To make your app scalable, you try to make your app layer “stateless”. OK, so you move the "state" out of your application layer to a shared DB, or shared data layer. Now, how do we make the data tier scalable? By definition, we cannot make the data tier stateless. OK, now let's thi
Since the emergence of the Hadoop implementation, I have been trying to morph existing algorithms from various areas into the map/reduce model. The result is pretty encouraging, and I've found Map/Reduce is applicable in a wide spectrum of application scenarios. So I wanted to write down my findings, but then found the scope is too broad and also I haven't spent enough time to explore different problem dom
In my previous post, I talked about the methodology of transforming a sequential algorithm into a parallel one. After that, we can implement the parallel algorithm; one popular framework we can use is the Apache open source Hadoop Map/Reduce framework. Functional Programming Multithreading is one of the popular ways of doing parallel programming, but the major complexity of multi-thread programming is to
One common feature of social network sites is to recommend people connections, e.g. "People you may know" from Linkedin. The basic idea is very simple: if person A and person B don't know each other but have a lot of common friends, then the system should recommend person B to person A and vice versa. From a graph theory perspective, for each person who is 2-degree reachable from person A, w
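The common-friends idea in the excerpt above can be illustrated with a small single-machine Python sketch. It is an assumption-laden illustration rather than the post's own approach (the excerpt is cut off before the method is described): the graph is a hypothetical in-memory dictionary of friend sets, and `recommend` is a made-up helper that counts, for every person two hops away from a given person, how many friends they share.

```python
from collections import Counter

def recommend(graph, person, top_n=3):
    """Rank non-friends two hops from `person` by number of common friends."""
    friends = graph.get(person, set())
    common = Counter()
    for friend in friends:
        for candidate in graph.get(friend, set()):
            # Skip the person themselves and anyone they already know.
            if candidate != person and candidate not in friends:
                common[candidate] += 1
    return common.most_common(top_n)

# Hypothetical undirected friendship graph.
graph = {
    "A": {"C", "D", "E"},
    "B": {"C", "D"},
    "C": {"A", "B"},
    "D": {"A", "B", "F"},
    "E": {"A"},
    "F": {"D"},
}
suggestions = recommend(graph, "A")
print(suggestions)  # B shares two friends with A (C and D), F shares one (D)
```

Each increment of the counter corresponds to one distinct two-hop path through a mutual friend, so the count equals the number of common friends.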
TF-IDF (Term Frequency, Inverse Document Frequency) is a basic technique to compute the relevancy of a document with respect to a particular term. A "term" is a generalized element contained within a document, i.e. a generalized idea of what a document contains (e.g. a term can be a word, a phrase, or a concept). Intuitively, the relevancy of a document to a term can be calculated from the pe
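As a rough illustration of TF-IDF, the sketch below uses one common textbook formulation: term frequency as the share of a document's tokens equal to the term, multiplied by the log of (number of documents / number of documents containing the term). The excerpt is cut off before it gives a formula, so this weighting is an assumption and may differ from the one the post uses; `tf_idf` and the toy corpus are hypothetical.

```python
import math

def tf_idf(term, doc, corpus):
    """TF-IDF of `term` in `doc`, where `corpus` is a list of token lists."""
    if not doc:
        return 0.0
    # Term frequency: fraction of the document's tokens that are the term.
    tf = doc.count(term) / len(doc)
    # Document frequency: number of documents that contain the term.
    df = sum(1 for d in corpus if term in d)
    if df == 0:
        return 0.0
    # Inverse document frequency (assumed form: natural log of N / df).
    idf = math.log(len(corpus) / df)
    return tf * idf

corpus = [
    ["apple", "banana", "apple"],
    ["banana", "cherry"],
    ["cherry", "cherry", "date"],
]
apple_score = tf_idf("apple", corpus[0], corpus)
banana_score = tf_idf("banana", corpus[0], corpus)
print(apple_score, banana_score)
```

In the first document "apple" scores higher than "banana": it occurs more often there and appears in fewer documents overall.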
I received some constructive criticism regarding my previous post on NoSQL patterns: I covered only the key/value store and left out Graph DB. The Property Graph Model A property graph is a collection of Nodes and Directed Arcs. Each node represents an entity and has a unique id as well as a Node Type. The Node Type defines a set of metadata that the node has. Each arc represents a unidi
The recent rise of NoSQL provides an alternative model for building extremely large scale storage systems. Nevertheless, compared to the more mature RDBMS, NoSQL has some fundamental limitations that we need to be aware of. It calls for a more relaxed data consistency model. It provides primitive querying and searching capability. There are techniques we can employ to mitigate each of these issues. Regar
I have attended a presentation by Simon Guest from Microsoft on their cloud computing architecture. Although there was no new concept or idea introduced, Simon has provided an excellent summary on the major patterns of doing cloud computing. I have to admit that I am not familiar with Azure and this is my first time hearing a Microsoft cloud computing presentation. I felt Microsoft has explained t
In the classical prediction use case, the predicted output is either a number (for regression) or a category (for classification). A set of training data (x, y), where x is the input and y is the labeled output, is provided to train a parameterized predictive model. The model is characterized by a set of parameters w. Given an input x, the model predicts y_hat = f(x; w) for regression, or the model pr
Over the last couple of years, we have seen an emerging data storage mechanism for storing data at large scale. These storage solutions differ quite significantly from the RDBMS model and are also known as NOSQL. Some of the key players include ... Google BigTable, HBase, Hypertable; Amazon Dynamo, Voldemort, Cassandra, Riak; Redis; CouchDB, MongoDB. These solutions have a number of characteristics in common K
Let's look at how one can layer a cluster on top of CouchDB. Couch Cluster A “Couch Cluster” is composed of multiple “partitions”. Each partition is composed of multiple replicated DB instances. We call each replica a “virtual node”, which is basically a DB instance hosted inside a "physical node", which is a CouchDB process running on a machine. A “virtual node” can migrate across machines (which we
CouchDB is an Apache open source project. It is Damien Katz's brainchild and has a number of very attractive features based on very cool technologies, such as ... RESTful API Schema-less document store (document in JSON format) Multi-Version-Concurrency-Control model User-defined query structured as map/reduce Incremental Index Update mechanism Multi-Master Replication model Written in Erlang (Erl
Under the category of "Concurrent Oriented Programming", Erlang has received a good deal of attention recently, owing to successes reported by Facebook engineers in using Erlang in large scale applications. Tempted to figure out the underlying ingredients of Erlang, I decided to spend some time learning the language. Multi-threading Problem Multiple threads of execution is a common programming model in mo
See the latest entries from "Pragmatic Programming Techniques"