Data Soup (by Ashish Thusoo)
A blog on data processing infrastructure and applications
Some exciting HDFS improvements
Posted on August 30th, 2009 (4 comments)
Dhruba has an interesting piece on some cool improvements to HDFS that he is working on, which could potentially yield 25-30% space savings. The idea is to add RAID to DFS in some form such that it does not negatively impact read and write performance, while providing fault tolerance to node failures similar to the current default scheme of 3-way replication. Dhruba explains it here: http://hadoopblog.blogspot.com/2009/08/hdfs-and-erasure-codes-hdfs-raid.html. Very cool indeed…
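To get a feel for where savings in the 25-30% range could come from, here is a rough back-of-the-envelope sketch. The parameters (2 replicas of each data block, 2 replicas of each parity block, one parity block per stripe of 10 data blocks) and the function name are illustrative assumptions, not figures taken from Dhruba's post.

```python
def effective_replication(data_replicas: int, parity_replicas: int, stripe_length: int) -> float:
    """Copies stored per data block: the data replicas plus the amortized parity share."""
    return data_replicas + parity_replicas / stripe_length


# Baseline: the default scheme keeps 3 full copies of every block.
baseline = 3.0

# Assumed RAID-style layout (hypothetical numbers, for illustration only).
raided = effective_replication(data_replicas=2, parity_replicas=2, stripe_length=10)

savings = 1 - raided / baseline
print(f"effective replication: {raided:.2f}, space savings: {savings:.1%}")
```

Under these assumed numbers the storage cost per block drops from 3 copies to about 2.2, which lands inside the quoted savings range. A different stripe length or parity scheme would shift the result.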
-
Who uses Amazon Web Services?
Posted on May 28th, 2009 (7 comments)
Amazon Web Services really changed the way one thought about cloud computing by exposing the cloud as infrastructural pieces (compute using EC2, storage using EBS and S3). Before its launch, cloud abstractions were available either at the hardware level or at the application level. By that I mean that the cloud was either the hardware hosting services provided by companies like godaddy.com, justhost.com etc., or web services and applications like salesforce.com and many others. In that regard, Amazon Web Services could truly be considered a cloud operating system, providing abstractions similar to those that an operating system provides to software applications.
I wanted to find out more about who was using these services and I found some very good information at AWS Case Studies. As mentioned there, the primary case studies are in the areas of delivery (application, media, content, web or commerce), analysis (HPC, search) or data management (backup and storage, data conversion). One can easily see the delivery applications as a great place to start, since moving compute into the cloud is easier than moving data into the cloud. I am not sure how users of AWS are able to overcome the limited bandwidth that makes moving lots of data into the cloud time consuming. Perhaps AWS Import/Export is the low tech answer to this (just ship your hard drives to Amazon), but there does not seem to be a good technology solution to the problem of moving existing data into the cloud, and that clearly limits the applicability of analytical applications on legacy data. I am not sure how much data the HPC applications mentioned in the case studies deal with. If they involve heavy computation on small data sets, they are clearly a great fit for the cloud. Search, on the other hand, is probably generating the crawl data on the cloud as well.
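A quick calculation shows why bandwidth is the sticking point. This is a minimal sketch: the data sizes, the 100 Mbps link speed and the 80% utilization factor are all hypothetical values chosen for illustration, not measurements.

```python
def transfer_days(data_terabytes: float, link_mbps: float, utilization: float = 0.8) -> float:
    """Days needed to upload data_terabytes over a link of link_mbps megabits/second.

    utilization is the assumed fraction of the link usable in practice.
    """
    bits = data_terabytes * 1e12 * 8           # decimal terabytes to bits
    usable_bits_per_second = link_mbps * 1e6 * utilization
    return bits / usable_bits_per_second / 86400  # 86400 seconds per day


# Hypothetical data set sizes over an assumed 100 Mbps uplink.
for terabytes in (1, 10, 100):
    print(f"{terabytes:>4} TB over 100 Mbps: {transfer_days(terabytes, 100):.1f} days")
```

Once the estimate runs into weeks or months, shipping drives starts to look less low tech and more like the only practical option.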
For the delivery applications, there seems to be some debate on how much users actually save by moving to the cloud. It is generally accepted that Amazon and similar clouds are more cost effective than running and maintaining your own servers and machines in a data center, but the more relevant comparison is with the hosting providers mentioned above. I found an interesting post on the relative pricing of AWS and Godaddy at Thomas Brox Rost. It is a bit dated and the prices may have moved since, but it is nevertheless an interesting take on the idea. Before reserved pricing though, the hosting providers were significantly cheaper than AWS.
-
Memory Trends
Posted on May 23rd, 2009 (2 comments)
The trends in the price, performance and capacities of different components (disk, cpu, memory and network), along with the demands imposed by applications on data processing systems, determine how the latter are architected. What got me thinking on this subject is the emergence in the last decade of clustering technologies to address the large scale data processing needs of applications in different domains like web, mobile, health care, genomics etc. The tremendous explosion of data meant that the aggregate processing power needed by these applications was initially available only through very expensive “super computers”. That, and the need for systems that would scale with data growth, led to the invention of highly scalable systems built out of commodity servers. One of the most visible examples of this breed of systems has been the infrastructure stack built at Google: GFS, MapReduce and BigTable. However, the notion of what represents the commodity building block for such systems is always evolving, and how the building blocks are tied together to create an uber “super computer” is determined largely by the relative price, performance and capacities of the different components.
As an example, while 16GB of memory is commonplace in commodity servers costing $3K-$4K today, 4GB was quite the norm in that price range just 5 years back. Given that the ecosystem of the semiconductor industry is geared towards smaller, cheaper, faster, the move towards more processing chops and capacity in each commodity node seems like a secular trend. However, the relative trends for memory, cpu, disk and network in commodity nodes ultimately determine how these have to be tied together to create balanced systems that can provide the best performance at the best price. This of course got me thinking about how to measure the trends for each of the individual components and to understand the factors that influence these trends. Hence a post on “memory trends”; hopefully, there will be others on the cpu, disk and network as well.
As far as memory is concerned, there are three predominant performance measures that I can think of in server grade memory:
- Capacity
- Speed
- Cost
Capacity and speed trends are determined by how much the semiconductor industry invests in upgrading old fabs and building new ones, and also by its investments in R&D. Cost, on the other hand, is determined by the supply and demand equation. I found some very useful information on how to measure these at semiconductor industry handbook. According to this, the book-to-bill ratio is a good proxy to determine whether there is overcapacity in the industry or not. Overcapacity, signaled by a book-to-bill ratio < 1, would indicate a falling price trend, while capacity constraints, signaled by a book-to-bill ratio >> 1, would indicate higher prices (I don’t think that this has happened yet??). A ratio close to 1 would signal price stability.
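The reading of the ratio described above can be written down as a small sketch. The book-to-bill ratio is orders booked divided by shipments billed over the same period; the 5% band used here for "close to 1" and the function names are assumptions made for illustration, since the handbook's own cutoffs are not quoted in this post.

```python
def book_to_bill(orders_booked: float, shipments_billed: float) -> float:
    """Ratio of new orders received to product shipped and billed in the same period."""
    if shipments_billed <= 0:
        raise ValueError("shipments_billed must be positive")
    return orders_booked / shipments_billed


def price_signal(ratio: float, stable_band: float = 0.05) -> str:
    """Map a book-to-bill ratio to a rough price outlook.

    stable_band is a hypothetical tolerance for 'close to 1'.
    """
    if ratio < 1 - stable_band:
        return "overcapacity: prices likely to fall"
    if ratio > 1 + stable_band:
        return "capacity constrained: prices likely to rise"
    return "balanced: prices likely stable"


# Made-up order and shipment figures that give a ratio of 0.61.
ratio = book_to_bill(orders_booked=61.0, shipments_billed=100.0)
print(f"{ratio:.2f} -> {price_signal(ratio)}")
```

A single threshold band is a simplification: the text distinguishes a ratio merely above 1 from one much greater than 1, and a fuller version would treat those differently.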
The industry itself is highly cyclical, and while it used to depend a lot on the business sector of the economy in the 1980s, it now depends a lot on the demand generated by the consumer sector as a result of new consumer devices like mobile phones, ipods, gaming consoles, PCs etc. As a result, when the economy is booming, the industry invests heavily in new manufacturing plants and plant upgrades. This leads to overcapacity, falling prices and a falling book-to-bill ratio once the economy turns into recession. That in turn stifles new investments, leads to production cuts etc., and as the economy starts rebounding, demand starts pushing the industry into capacity constraints, leading to stabilization in prices and more investments to address anticipated demand, and the whole cycle continues…
This downturn has hit the memory sector especially badly, leading to a lot of consolidation and mergers and the suspension of a number of fab investments and process improvements. That means that the industry is going to eliminate overcapacity (though that will take some time it seems, as the current book-to-bill ratio is at a pathetic 0.61). It also means, though, that the rapid fall in memory prices and the rapid increases in memory capacities that we have become so accustomed to may not continue at the same pace for some time, once we hit a bottom of course. Some excellent sources to follow are:
There is another little secret that I found about this industry. Guess what: the automobile industry is not the only sector getting bailouts. The semiconductor manufacturers have also been decimated by the downturn, and they are also viewed as primary manufacturing employers in many of their home countries. China, Taiwan, Singapore etc. have all given funds and loans to bail them out during these times. Check out some of these links at seekingalpha.com and also at zdnetasia.com. The Qimonda deal, though, seems not to have gone through, and the company has already announced bankruptcy.
Signing off now.. this has been a bit of a longish post, and I will post a chart of the book-to-bill ratio in some future post. Enough about memory for now, I don’t even remember what I started with :)
-
Analytics on Amazon EC2 and S3
Posted on May 17th, 2009 (10 comments)
Joydeep’s recipe to run Hive over Amazon EC2+S3 is a must read for the ever growing legions of hadoopers. Hive+Hadoop on Amazon EC2 and S3 is a really powerful way of running SQL-like queries on large data sets, provided you can easily get such data sets into the Amazon cloud. Once you are past that problem though, you can do some serious number crunching on this data with Hadoop and Hive.
On that note, it would be interesting to find out who the real users of Amazon Web Services are and how they are able to get these data sets into the cloud. Do they generate these data sets on the cloud itself, and how big are these data sets anyway?


