Saturday, January 14, 2012

Windows Azure and Cloud Computing Posts for 1/8/2012+

A compendium of Windows Azure, Service Bus, EAI & EDI, Access Control, Connect, SQL Azure Database, and other cloud-computing articles.


Note: This post is updated daily or more frequently, depending on the availability of new articles in the following sections:


Azure Blob, Drive, Table, Queue and Hadoop Services

Avkash Chauhan (@avkashchauhan) described Setting Windows Azure Blob Storage (asv) as data source directly from Portal at Hadoop on Azure in a 1/13/2012 post:

After you log into your Hadoop on Azure portal and configure your cluster, you can select the “Manage Data” tile as shown below:

On the next screen you can select:

  • “Set up ASV” to set your Windows Azure Blob Storage as the data source
  • “Set up S3” to set your Amazon S3 storage as the data source

When you select “Set up ASV”, the next screen asks you to enter your Windows Azure storage account name and key as shown below:

After you select “Save Settings”, if your Azure Storage credentials are correct, you will get the following message:

Now your Azure Blob Storage is set up for use with the Interactive JavaScript console, or you can remote into your cluster and access it from there as well. You can test it directly in the Interactive JavaScript console as shown below:

Note: If you want to know how Azure Blob Storage is configured with Hadoop, it is done by adding the proper Azure Storage credentials to core-site.xml as shown below:

If you open C:\Apps\Dist\conf\core-site.xml, you will see the following parameters related to Azure Blob Storage access from the Hadoop cluster:

<property>
  <name>fs.azure.buffer.dir</name>
  <value>/tmp</value>
</property>
<property>
  <name>fs.azure.storageConnectionString</name>
  <value>DefaultEndpointsProtocol=https;AccountName=happybuddha;AccountKey=***********************************************************==</value>
</property>

More info is here:

Resources:


Avkash Chauhan (@avkashchauhan) explained Using Symbolic links with local storage to store large amounts of data in Windows Azure Application Drive in a 1/13/2012 post:

By default, the application drive in a Windows Azure VM has a maximum size limit of 1 GB, regardless of VM size (small through extra-large). So if you have a large or extra-large VM running multiple sites configured in your Web Role, you might want to place lots of data within those different sites. For example, if your application is based on ASP.NET, you might download many ASP.NET templates and dynamic content from Azure blob storage that must reside within the application folder(s) and cannot be placed in local storage. In this scenario the 1 GB size limitation can become a serious application design issue.

As you may know, the application drive hosts your deployment package as a drive, and because an Azure package is limited to under ~400 MB, a 1 GB application drive normally seems logical and sufficient.

You may also know that every Azure VM comes with local storage space that you can use for any application purpose. The local storage size per VM type is shown below:

If you have a scenario in which you need to place a large amount of data in your application drive, you can use the following alternative solution:

  1. Use local storage in your Windows Azure application.
  2. Download the necessary data from Azure Blob Storage into specific folders in local storage during role startup.
  3. After you download the blobs from Azure Storage, create a symbolic link from your application folder to the local storage folder (a sketch follows this list).
  4. Windows Server 2008 supports symbolic links, so even though your files are stored in local storage, they still appear to be inside the application drive.
  5. More information about NTFS symbolic links is available in the Windows Server documentation.
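
Avkash’s post doesn’t show the startup code itself; the following is only a rough sketch of step 3, assuming a web role, a local storage resource named “TemplateStorage”, and an elevated startup context so that mklink can run (the names, folders, and settings here are assumptions, not part of the original post):

using System;
using System.Diagnostics;
using System.IO;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WebRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        // Resolve the local storage resource declared in the service definition.
        LocalResource local = RoleEnvironment.GetLocalResource("TemplateStorage");
        string target = Path.Combine(local.RootPath, "Templates");
        Directory.CreateDirectory(target);

        // ... download the blobs from Azure Blob Storage into 'target' here ...

        // Create a directory symbolic link inside the application folder that
        // points at the local storage folder (mklink /D needs elevation, e.g.
        // <Runtime executionContext="elevated" /> in the service definition).
        string link = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "Templates");
        Process.Start("cmd.exe",
            string.Format("/c mklink /D \"{0}\" \"{1}\"", link, target)).WaitForExit();

        return base.OnStart();
    }
}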


Benjamin Guinebertière (@benjguin) described On demand map reduce cluster and persistent storage | un cluster map reduce à la demande et des données persistantes in a bilingual post on 1/11/2012. Following is the English content:

Here is one of the use cases of Hadoop on Azure: you have a few applications accumulating data over time and you need to execute batches against this data a few times a month. You need many machines in an Hadoop cluster, but most of the time, you don’t need the cluster, just the data.

One possible way is shown in the following diagram, which we will explain in this post.

image

Hadoop

Hadoop is a framework that implements the map/reduce algorithm to execute code against large amounts of data (terabytes).

On an Hadoop cluster, data is typically spread across the different data nodes of the Hadoop Distributed File System (HDFS). Even one big file can be spread across the cluster in blocks of 64 MB (by default).

So data nodes play two roles at the same time: they provide processing and they also host the data itself. This means that removing processing power removes HDFS storage at the same time.

Persistent storage

In order to make data survive cluster removals, it is possible to copy the data to persistent storage. In Windows Azure, the candidate is Windows Azure Blobs, because it is what corresponds most closely to files, which is what HDFS stores.

NB: other Windows Azure persistent storage options include Windows Azure Tables (non-relational) and SQL Azure (relational, with sharding capabilities called federations).

Pricing on Windows Azure

Official pricing is described here, and you should refer to that URL for up-to-date prices.

As I write this article, the current prices are the following:

  • Using Windows Azure blobs costs
    * $0.14 per GB stored per month based on the daily average – There are discounts for high volumes. Between 1 and 50 TB, it’s $0.125 / GB / month.
    * $1.00 per 1,000,000 storage transactions
  • An Hadoop cluster uses an 8-CPU head node (Extra large) and n 2-CPU data nodes (Medium).
    * Nodes are charged $0.12 per CPU per hour. An 8 data node + 1 head node cluster costs (8x2 CPU + 1x8 CPU) x $0.12 x 750 h = $2,160/month.
  • There are also data transfer charges in and out of the Windows Azure datacenter.
    * Inbound data transfers are free of charge.
    * Outbound data transfer: North America and Europe regions: $0.12/GB, Asia Pacific Region: $0.19/GB

Disclaimer: in this post, I don’t take into account any additional cost that may come with Hadoop as a service. For now, the current version is in CTP (Community Technology Preview) and no price has been announced. I personally have no idea of how this could be charged, or even whether it would be charged. I just suppose the relative comparisons between costs would stay roughly the same.

In the current Hadoop on Azure CTP (Community Technology Preview), the following clusters are available (they are offered at no charge to a limited number of testers).

image

In order to store 1 TB of data, one needs a cluster with at least 3 TB of storage because HDFS replicates data 3 times (by default), so a medium cluster is OK. Note that moving to a large cluster may be needed for computation, as additional data will be generated by the computation.

In order to store 1 TB of data in Windows Azure blobs, one needs 1 TB of Windows Azure blob storage (replication on 3 different physical nodes is included in the price).

So storing 1 TB of data in an Hadoop cluster with HDFS costs $2160/month while storing 1 TB of data in Windows Azure storage blobs costs 1024x$0.125=$128/month.

Copying 1 TB of data into or out of Windows Azure blobs inside the datacenter will incur storage transactions. As an approximation, let’s count one storage transaction per 1 MB (per the MSDN documentation, a PUT storage transaction on a block blob may contain up to 4 MB of data). So copying 1 TB of data would cost roughly $1.

Let’s now suppose we need the Hadoop cluster 72 hours (3 x 24 h) a month for computation. We would use an extra large cluster to get the results faster and to have extra storage capacity for intermediate data. That cluster costs (32x2 CPU + 1x8 CPU) x $0.12 x 72 h = $622.08.

So using an extra large cluster 3 times 24 h a month would cost the following per month:

  • permanently store 1 TB of data in Windows Azure Storage: $128.00
  • copy 1 TB of data to and from Windows Azure storage 3 times: 3x2x$1 = $6.00
  • Hadoop Extra Large cluster: $622.08 ==> Total: $756.08

So it is ~2.9 times cheaper to store 1 TB of data in Windows Azure Storage and spin up a 32 node cluster for 24 hours three times a month than to permanently run an 8 node Hadoop cluster storing that 1 TB of data.
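
As a quick sanity check of the arithmetic above (using the prices quoted in this post, which are not current rates), the comparison can be reproduced in a few lines:

using System;

class HadoopCostComparison
{
    static void Main()
    {
        const double blobPerGbMonth = 0.125;   // $/GB/month in the 1-50 TB tier
        const double cpuHour = 0.12;           // $ per CPU per hour

        double storage = 1024 * blobPerGbMonth;              // ~$128/month for 1 TB of blobs
        double copies = 3 * 2 * 1.0;                         // ~$6/month in storage transactions
        double xlCluster = (32 * 2 + 1 * 8) * cpuHour * 72;  // $622.08 for 3 x 24 h
        double onDemand = storage + copies + xlCluster;      // ~$756/month

        double permanent = (8 * 2 + 1 * 8) * cpuHour * 750;  // $2,160/month for the permanent cluster

        Console.WriteLine("{0:F2} vs {1:F2} => ~{2:F1}x cheaper",
            onDemand, permanent, permanent / onDemand);
    }
}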

Interactions between storage and applications

An additional consideration is the way applications may interact with the storage.

HDFS would mainly be accessed thru a Java API, or a Thrift API. It may also be possible to interact with HDFS data thru other stacks like HIVE and an ODBC driver like this one or this one.

Windows Azure blobs may also be accessed thru a number of ways like .NET, REST, Java, and PHP APIs. Windows Azure storage may also offer security and permissions features that are more suited for remote access like shared access signatures.

Depending on the scenarios, it may be easier to access Windows Azure Storage rather than HDFS.

How to copy data between Windows Azure Blobs and HDFS

Let’s now see how to copy data between Windows Azure Storage and HDFS.

asv://

First of all, you need to give your Windows Azure storage credentials to the Hadoop cluster. From the www.hadooponazure.com portal, this can be done in the following way

image

image

image

Then, the asv:// scheme can be used instead of hdfs://. Here is an example:

image

image

This can also be used from the JavaScript interactive console:

image

Copying as a distributed job

In order to copy data from Windows Azure Storage to HDFS, it is interesting to have the whole cluster participate in the copy instead of just one thread on one server. While the
hadoop fs -cp
command does a single-threaded copy, the
hadoop distcp
command generates a map job that copies the data.

Here is an example

image

image

Here are a few tips and tricks:

Hadoop on Azure won’t list the content of a Windows Azure Blob container (the first-level folder, just after /). You just need to have at least a second-level folder so that you can work on folders (in other words, for Azure blob purists, the blob names need to contain at least one /). Trying to list a container’s content would result in:

ls: Path must be absolute: asv://mycontainer
Usage: java FsShell [-ls <path>]

Here is an example

image

That’s why I have a fr-fr folder under my books container in the following example:

image

A distributed copy (distcp) may generate a few more storage transactions against Windows Azure storage than strictly needed because of Hadoop’s default strategy of using idle nodes to execute the same tasks several times (speculative execution). This mainly happens at the end of the copy. Remember we calculated that copying 1 TB of data would cost ~$1 in storage transactions; that may become ~$1.20 because of speculative execution.

Why not bypass HDFS, after all?

It is possible to use asv: instead of hdfs:, including when defining the source or the destination of a map reduce job. So why use HDFS at all?

Here are a few drawbacks with this non HDFS approach:

  • you won’t have processing close to the data, which generates network traffic, and that is slower than interprocess communication inside a machine.
  • you will generate many storage transactions against Windows Azure storage (remember: 1 million of them costs $1 of real money). In particular, Hadoop may run a single task several times from multiple nodes just because it has available nodes, or because one of those tasks failed.
  • HDFS has a default behavior of spreading files in chunks of 64 MB, and this automatically spreads map tasks across those blocks of data. Running directly against Windows Azure Storage may need additional tuning (like explicitly defining the number of tasks).

Conclusion

In a case where you need to work three days a month on 1 TB of data, it is roughly three times cheaper to have a 32 node cluster that copies its data from and to Azure Blob Storage each time it is created and destroyed than to have an 8 node cluster that keeps the 1 TB of data full time. Copying data between Windows Azure storage and HDFS should be done thru distcp, which generates a map job to copy in a distributed way.

This leverages Hadoop as well as Windows Azure elasticity.


Denny Lee revised his Hadoop on Azure Scenario: Query a web log via HiveQL TechNet wiki article on 1/10/2012:

The purpose of this wiki post is to provide an example scenario on how to work with Hadoop on Azure, upload a web log sample file via secure FTP, and run some simple HiveQL queries.

Preparation
Please download the sample weblog file ex20111214.log.gz and the weblog_sample.hql file to a local location on your computer.
Upload the Weblog file

To upload the weblog, let's make use of the FTP option using curl. To do this, you will need to do the following:

1) From the Interactive Javascript Console, create the folder weblog using the command #mkdir
#mkdir weblog


2) Follow the instructions at How to FTP Data to Hadoop on Windows Azure to get the data up to the weblog folder you created (i.e., instructions on how to open the FTP ports and how to push data up to HDFS using curl). In this case, the curl command used to FTP the data is noted below.

curl -k -T ex20111214.log.gz ftps://Campschurmann:[MD5 Hash Password]@tardis.cloudapp.net:2226/user/Campschurmann/weblog/

Some quick notes based on the color coding:

  • Campschurmann - this is the username I specified when I created my cluster. Notice that it is case sensitive.
  • [MD5 Hash Password] - this is an MD5 hash of the password you created when you created your cluster.
  • tardis.cloudapp.net - this is the DNS name of my cluster, where I specified tardis as the cluster name.
  • 2226 - this is the FTP port that allows the transfer of data; it is the port you previously opened in the Open Ports Live tile.
  • weblog - this is the folder that you just created using the #mkdir command.
3) To verify the file, go back to the Interactive Javascript console and type #ls weblog and you should see the file listed.
Create HiveQL table pointing to this sample weblog file

Now that you have uploaded the sample weblog file, you can create a Hive table that points to the weblog folder you just created, which contains the sample file. To do this:

1) Go to the Interactive Hive Console, and type the command below.

CREATE EXTERNAL TABLE weblog_sample (
evtdate STRING,
evttime STRING,
svrsitename STRING,
svrip STRING,
csmethod STRING,
csuristem STRING,
csuriquery STRING,
svrport INT,
csusername STRING,
cip STRING,
UserAgent STRING,
Referer STRING,
scstatus STRING,
scsubstatus STRING,
scwin32status STRING,
scbytes STRING,
csbytes STRING,
timetaken STRING
)
COMMENT 'This is a web log sample'
ROW FORMAT DELIMITED FIELDS TERMINATED by '32'
STORED AS TEXTFILE
LOCATION '/user/campschurmann/weblog/';

You should be able to copy/paste it directly from this wiki post but just in case you cannot, the weblog_sample.hql file you had previously downloaded contains the same command.

Note
You will notice that this is a CREATE EXTERNAL TABLE command - this allows you to create a Hive table that points to the files located in a folder instead of going through the task of uploading the data into separate hive table / partitions.

More Information: For more information about CREATING TABLES in HIVE, please reference the Apache Hive Tutorial > Creating Tables

2) To verify that the table exists, type the command:
show tables
and you should see the weblog_sample table that you had created listed.

3) To validate the data can be read, you can type the command:
select * from weblog_sample limit 10;
and you should view the first ten rows from the weblog_sample Hive table, which is pointing to the ex20111214.log.gz web log file.
Note: You may notice that the weblog_sample Hive table is pointing to the weblog folder, which contains a compressed gzip file. The advantage is that if your weblog files are already gzipped, you do not need to decompress them to read them with Hive.

Querying your Hive Table

As noted above, you can run your HiveQL queries against this sample web log. But one of the key things is to utilize Hive parsing functions to extract valuable data from the key-value pairs. For example, the query below extracts the first page hierarchy information from the csuristem column, groups by that value, and does a count.

select regexp_replace(split(csuristem, "/")[1], "MainFeed.aspx", "Home"), count(*)
from weblog_sample
group by regexp_replace(split(csuristem, "/")[1], "MainFeed.aspx", "Home")

The page hierarchy in the csuristem column looks like:
/Olympics/archive/2007/09/13/Lena-Lake.aspx

By using the split function, in the form of:
split(csuristem, "/")[1]

I'm able to extract out the first value of the string array defined by "/" - in the above case, this would be the value "Olympics". I'm also using the regexp_replace function to change the MainFeed.aspx page to indicate that it's actually the Home Page.

Finally, I use the group by and count(*) functions to perform my aggregate query.

More Information: To review all of the available Hive functions, please reference the Apache Hive Language Manual UDF at Apache Hive > LanguageManual UDF .


Avkash Chauhan (@avkashchauhan) reported With Azure SDK 1.6, Azure Diagnostics is enabled by default, which can cause thousands of daily transactions to Azure Storage on 1/11/2012:

With Windows Azure SDK 1.6, Azure Diagnostics is enabled by default in your Azure application, as shown below. So when you deploy your application directly from Visual Studio, the package is temporarily stored in the “vsdeploy” container in the Azure Storage account you configured during Publish setup. Because of the setting below, the same Azure storage account is used by Azure Diagnostics to store diagnostics data:

If you deploy your application without changing the default diagnostics setting, you will see roughly 10,000 transactions added daily to your Azure storage account. The best way to eliminate these transactions is to disable Azure Diagnostics directly in your application, delete the previous deployment, and then deploy it again. When diagnostics is disabled, the setting should look like the one below:

Also, when the default diagnostics is disabled, you will NOT see the following configuration in your service definition and service configuration:

ServiceDefinition.csdef:

<Imports>
  <Import moduleName="Diagnostics" />
</Imports>

ServiceConfiguration.cscfg:

<ConfigurationSettings>
  <Setting name="Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString" value="UseDevelopmentStorage=true" />
</ConfigurationSettings>

After that you can just package your Windows Azure Application as below:

And depending on your packaging settings…

The Windows Azure application will be created as below:

Now you can just deploy this package directly at the portal, and you will not see any Azure Diagnostics transactions in any Azure storage account because we haven’t used any.

The trick here is that this deployment doesn’t use any Azure Storage reference (you must not deploy your package from VS2010 or via Azure Storage if you want to avoid it), so the default Azure Diagnostics will not be able to send diagnostics data anywhere because the Azure VM does not have a reference to any Azure Storage account.
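
Not part of Avkash’s post, but worth noting: if you would rather keep the Diagnostics module and simply reduce the transaction volume, one hedged alternative is to lengthen the configuration polling interval in the role’s OnStart (the one-hour interval below is an assumption, not a recommendation from the post):

using System;
using Microsoft.WindowsAzure.Diagnostics;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WebRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        DiagnosticMonitorConfiguration config =
            DiagnosticMonitor.GetDefaultInitialConfiguration();

        // The monitor polls the wad-control-container for configuration changes
        // every minute by default; polling less often cuts the steady-state
        // storage transactions considerably.
        config.ConfigurationChangePollInterval = TimeSpan.FromHours(1);

        DiagnosticMonitor.Start(
            "Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString", config);

        return base.OnStart();
    }
}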

Windows Azure diagnostics (WAD) data is the source of my data for the SQL Azure Federations and BigData Quartet (with apologies to Lawrence Durrell) described in the SQL Azure Database, Federations and Reporting section below.


Matthew Hayes posted Introducing DataFu: an open source collection of useful Apache Pig UDFs to the LinkedIn Engineering blog on 1/10/2012:

imageAt LinkedIn, we make extensive use of Apache Pig for performing data analysis on Hadoop. Pig is a simple, high-level programming language that consists of just a few dozen operators and makes it easy to write MapReduce jobs. For more advanced tasks, Pig also supports User Defined Functions (UDFs), which let you integrate custom code in Java, Python, and JavaScript into your Pig scripts. [See the next post below for Apache Pig support in Windows Azure.]

Over time, as we worked on data intensive products such as People You May Know and Skills, we developed a large number of UDFs at LinkedIn. Today, I'm happy to announce that we have consolidated these UDFs into a single, general-purpose library called DataFu and we are open sourcing it under the Apache 2.0 license:

Check out DataFu on GitHub!

DataFu includes UDFs for common statistics tasks, PageRank, set operations, bag operations, and a comprehensive suite of tests. Read on to learn more.

What's included?

Here's a taste of what you can do with DataFu:

Example: Computing Quantiles

Let's walk through an example of how we could use DataFu. We will compute quantiles for a fake data set. You can grab all the code for this example, including scripts to generate test data, from this gist.

Let’s imagine that we collected 10,000 temperature readings from three sensors and have stored the data in HDFS under the name temperature.txt. The readings follow a normal distribution with mean values of 60, 50, and 40 degrees and standard deviation values of 5, 10, and 3.

Box Plot

We can use DataFu to compute quantiles using the Quantile UDF. The constructor for the UDF takes the quantiles to be computed. In this case we provide 0.25, 0.5, and 0.75 to compute the 25th, 50th, and 75th percentiles (a.k.a quartiles). We also provide 0.0 and 1.0 to compute the min and max.

Quantile UDF example script

define Quartile datafu.pig.stats.Quantile('0.0','0.25','0.5','0.75','1.0');

temperature = LOAD 'temperature.txt' AS (id:chararray, temp:double);

temperature = GROUP temperature BY id;

temperature_quartiles = FOREACH temperature {
    sorted = ORDER temperature by temp; -- must be sorted
    GENERATE group as id, Quartile(sorted.temp) as quartiles;
}

DUMP temperature_quartiles


Quantile UDF example output, 10,000 measurements

(1,(41.58171454288797,56.559375253601715,59.91093458980706,63.335574106080365,79.2841731889925))

(2,(14.393515179526304,43.39558395897533,50.081758806889766,56.54245916209963,91.03574746442487))

(3,(29.865710766927595,37.86257868882021,39.97075970657039,41.989584898364704,51.31349575866486))


The values in each row of the output are the min, 25th percentile, 50th percentile (median), 75th percentile, and max.

StreamingQuantile UDF

The Quantile UDF determines the quantiles by reading the input values for a key in sorted order and picking out the quantiles based on the size of the input DataBag. Alternatively we can estimate quantiles using the StreamingQuantile UDF, contributed to DataFu by Josh Wills of Cloudera, which does not require that the input data be sorted.

StreamingQuantile UDF example script

define Quartile datafu.pig.stats.StreamingQuantile('0.0','0.25','0.5','0.75','1.0');

temperature = LOAD 'temperature.txt' AS (id:chararray, temp:double);

temperature = GROUP temperature BY id;

temperature_quartiles = FOREACH temperature {
    -- sort not necessary
    GENERATE group as id, Quartile(temperature.temp) as quartiles;
}

DUMP temperature_quartiles


StreamingQuantile UDF example output, 10,000 measurements

(1,(41.58171454288797,56.24183579452584,59.61727093346221,62.919576028265375,79.2841731889925))

(2,(14.393515179526304,42.55929349057328,49.50432161293486,56.020101184758644,91.03574746442487))

(3,(29.865710766927595,37.64744333815733,39.84941055349095,41.77693877565934,51.31349575866486))


Notice that the 25th, 50th, and 75th percentile values computed by StreamingQuantile are fairly close to the exact values computed by Quantile.

Accuracy vs. Runtime

StreamingQuantile samples the data with in-memory buffers. It implements the Accumulator interface, which makes it much more efficient than the Quantile UDF for very large input data. Where Quantile needs access to all the input data, StreamingQuantile can be fed the data incrementally. With Quantile, the input data will be spilled to disk as the DataBag is materialized if it is too large to fit in memory. For very large input data, this can be significant.

To demonstrate this, we can change our experiment so that instead of processing three sets of 10,000 measurements, we will process three sets of 1 billion. Let’s compare the output of Quantile and StreamingQuantile on this data set:

Quantile UDF example output, 1 billion measurements

(1,(30.524038,56.62764,60.000134,63.372384,90.561695))

(2,(-9.845137,43.25512,49.999536,56.74441,109.714687))

(3,(21.564769,37.976644,40.000025,42.023622,58.057268))


StreamingQuantile UDF example output, 1 billion measurements

(1,(30.524038,55.993967,59.488968,62.775554,90.561695))

(2,(-9.845137,41.95725,48.977708,55.554239,109.714687))

(3,(21.564769,37.569332,39.692373,41.666762,58.057268))


The 25th, 50th, and 75th percentile values computed using StreamingQuantile are only estimates, but they are pretty close to the exact values computed with Quantile. With StreamingQuantile and Quantile there is a tradeoff between accuracy and runtime. The script using Quantile takes 5 times as long to run as the one using StreamingQuantile when the input is the three sets of 1 billion measurements.

Testing

DataFu has a suite of unit tests for each UDF. Instead of just testing the Java code for a UDF directly, which might overlook issues with the way the UDF works in an actual Pig script, we used PigUnit to do our testing. This let us run Pig scripts locally and still integrate our tests into a framework such as JUnit or TestNG.

We have also integrated the code coverage tracking tool Cobertura into our Ant build file. This helps us flag areas in DataFu which lack sufficient testing.

Conclusion

We hope this gives you a taste of what you can do with DataFu. We are accepting contributions, so if you are interested in helping out, please fork the code and send us your pull requests!


Avkash Chauhan (@avkashchauhan) described Running Apache Pig (Pig Latin) at Apache Hadoop on Windows Azure on 1/10/2012:

The Microsoft Distribution of Apache Hadoop comes with Pig support, along with an Interactive JavaScript shell where users can run their Pig queries immediately without adding specific configuration. The Apache distribution running on Windows Azure has built-in support for Apache Pig.

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:

  • Ease of programming: It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
  • Optimization opportunities: The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
  • Extensibility: Users can create their own functions to do special-purpose processing.

Apache Pig has two execution modes or exectypes:

  • Local Mode: - To run Pig in local mode, you need access to a single machine; all files are installed and run using your local host and file system. Specify local mode using the -x flag (pig -x local).

Example:

$ pig -x local

$ pig

  • Mapreduce Mode: - To run Pig in mapreduce mode, you need access to a Hadoop cluster and HDFS installation. Mapreduce mode is the default mode; you can, but don't need to, specify it using the -x flag (pig OR pig -x mapreduce).

Example:

$ pig -x mapreduce

You can run Pig in either mode using the "pig" command (the bin/pig Perl script) or the "java" command (java -cp pig.jar ...). To learn more about Apache Pig please click here.

After you have configured your Hadoop cluster on Windows Azure, you can remotely log in to your Hadoop cluster. To run the Pig scripts, you can copy the sample Pig files into the C:\Apps\dist\pig folder from the link here:

Now you can launch the Hadoop Command Line shortcut and run the commands below:

cd c:\apps\dist\examples\pig

hadoop fs -copyFromLocal excite.log.bz2 excite.log.bz2

C:\Apps\dist\pig>pig

grunt> run script1-hadoop.pig

Once the Job has started you can see the job details at Job Tracker (http://localhost:50030/jobtracker.jsp)

script1-hadoop.pig: …

Avkash continues with the complete contents of the script.


Bruce Kyle reported the availability of an ISV Video: Whitewater Backup to the Cloud on Windows Azure Storage on 1/9/2012:

    Whitewater from Riverbed automates backup for enterprises into Windows Azure. Azure Architect Evangelist Allan Naim talks with Riverbed Marketing Director Bob Gilbert about Riverbed's decision to support Windows Azure for their Whitewater backup appliance.

    ISV Video: Whitewater Backup to the Cloud on Windows Azure Storage

    Bob explains why Riverbed selected Windows Azure for their customers to back up data into the cloud, and how data is secured using security keys in conjunction with the Whitewater appliance. Bob also explains why Azure is an important component in an enterprise backup strategy and for disaster recovery.

    Architectural Overview

    For an architectural overview of how Whitewater works with Windows Azure storage, see ISV Case Study: Cloud Storage Gateway Provides Backup into Windows Azure.

    About Whitewater

    Whitewater automates enterprise backup to the cloud. The product from Riverbed is a gateway that combines data de-duplication, network optimization, data encryption, and integration with Windows Azure storage services through a single virtual or physical appliance.

    No longer do enterprises need to use tape and move the files from disk to disk. Nor does an enterprise need to pay for a traditional disaster recovery site.

    Whitewater cloud storage gateways and public cloud storage provide secure, off-site storage with restore-anywhere capability for disaster recovery (DR). Data can be recovered by Whitewater from any location. Public cloud storage offers excellent DR capabilities without the high costs associated with remote DR sites or colocation service providers. With the public cloud and Whitewater cloud storage gateways, any size organization can significantly reduce its business risk from unplanned outages without the large capital investments and running costs required by competing solutions.

    Additional Resources

    Getting started with Windows Azure


    My (@rogerjenn) Generating Big Data for Use with SQL Azure Federations and Apache Hadoop on Windows Azure Clusters post of 1/8/2012 begins:

    Background

    Most tutorials for and demos of SQL Azure Federations use tables having only a few rows. Similarly, sample data for Apache Hadoop on Windows Azure also doesn’t qualify for “Big Data” status.

    I wanted to use larger tables with a variety of SQL Server data types for testing these new SQL Azure and Windows Azure technologies. I also wanted to determine best practices for creating large data sets, as well as discover and deal with problems creating them. For SQL Azure Federations, I wanted to demonstrate issues with federating SQL Azure tables that don’t have a column innately suited to act as a federation key. [See the SQL Azure Database, Federations and Reporting section below.]

    Fortunately, I had enabled collecting Windows Azure Diagnostics for my OakLeaf Systems Azure Table Services Sample Project in November 2011. Following is the source code and configuration data shown in the “Setting up Windows Azure Diagnostics” section of my Adding Trace, Event, Counter and Error Logging to the OakLeaf Systems Azure Table Services Sample Project post of 12/5/2010, updated 11/20/2011:

    image

    As you can see from the preceding code’s Windows Performance Counters section, data is available for the following six counters (shown in the sequence of their appearance in the data):

    1. \Network Interface(Microsoft Virtual Machine Bus Network Adapter _2)\Bytes Sent/sec
    2. \Network Interface(Microsoft Virtual Machine Bus Network Adapter _2)\Bytes Received/sec
    3. \ASP.NET Applications(__Total__)\Requests/Sec
    4. \TCPv4\Connections Established
    5. \Memory\Available Mbytes
    6. \Processor(_Total)\% Processor Time
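
    The configuration itself appears above only as a screenshot; as a rough sketch (not the post’s exact code, and with sample rates and transfer periods that are assumptions), registering counters like these in a role’s OnStart typically looks something like this:

    using System;
    using Microsoft.WindowsAzure.Diagnostics;
    using Microsoft.WindowsAzure.ServiceRuntime;

    public class WebRole : RoleEntryPoint
    {
        public override bool OnStart()
        {
            var config = DiagnosticMonitor.GetDefaultInitialConfiguration();

            string[] counters =
            {
                @"\Network Interface(Microsoft Virtual Machine Bus Network Adapter _2)\Bytes Sent/sec",
                @"\Network Interface(Microsoft Virtual Machine Bus Network Adapter _2)\Bytes Received/sec",
                @"\ASP.NET Applications(__Total__)\Requests/Sec",
                @"\TCPv4\Connections Established",
                @"\Memory\Available Mbytes",
                @"\Processor(_Total)\% Processor Time"
            };

            foreach (string counter in counters)
            {
                config.PerformanceCounters.DataSources.Add(new PerformanceCounterConfiguration
                {
                    CounterSpecifier = counter,
                    SampleRate = TimeSpan.FromMinutes(1)       // sample rate is an assumption
                });
            }

            config.PerformanceCounters.ScheduledTransferPeriod = TimeSpan.FromMinutes(5); // assumption

            DiagnosticMonitor.Start("Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString", config);
            return base.OnStart();
        }
    }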

    The post continues with a detailed, illustrated tutorial.


    <Return to section navigation list>

    SQL Azure Database, Federations and Reporting

    My (@rogerjenn) Adding Missing Rows to a SQL Azure Federation with the SQL Azure Federation Data Migration Wizard v1 post of 1/14/2012 begins:

    My Loading Big Data into Federated SQL Azure Tables with the SQL Azure Federation Data Migration Wizard v1.0 post of 1/12/2012 described a problem uploading part of the data for a federation member database in its “Auto-sharding Larger Data Batches” section near the end. This post describes the process I used to determine which of the 398,000 source rows were missing so I could restart the upload process with data for the correct row. The correct data is that which doesn’t cause a primary key constraint conflict and doesn’t result in any missing rows in the resultset.

    Update 1/14/2012 8:45 AM PST: My initial approach wasn’t successful, but executing a MERGE operation succeeded in replacing the missing rows. See the “Executing a MERGE Command to Add Missing Rows” section near the end of this post.

    Background

    The Loading Big Data into Federated SQL Azure Tables with the SQL Azure Federation Data Migration Wizard v1.0 post described initially loading 1,000 rows to the federation root database (AzureDiagnostics1), splitting that database into five additional federation members based on CounterId values of 1 through 6. This was limited to 1,000 rows so as to minimize the time required for partitioning but still deliver a reasonable number of rows (166 or 167) to each partition member. The initial Timestamp value of that rowset, created from a WADPerformanceCountersTable-1000rows.txt tab-delimited text file, was 2011-07-25 10:33:21.9432881.

    After creating the six-member federation, I uploaded a second rowset created from a WADPerformanceCountersTable-Page-79.txt file with 398,000 rows of data for the time period that immediately preceded the 1,000 row upload. Its last timestamp value was 2011-07-25 10:33:21.9432881, the same as that for the 1,000-row rowset. (There are several successive rows with identical Timestamp values.) This addition failed for federation member 4 after adding 50,000 rows.

    The reason for adding batches of data in reverse chronological order is that I had previously downloaded approximately 8 GB of Windows Azure diagnostic data in 1-GB increments for bulk loading into a local SQL Server 2008 R2 SP1 database in ascending date order. This database was intended for testing uploads on a scale similar to that which might be common for large enterprises. Adding later values assured that primary key constraint conflicts wouldn’t occur.

    The post continues with an illustrated tutorial on the topic.


    My (@rogerjenn) Loading Big Data into Federated SQL Azure Tables with the SQL Azure Federation Data Migration Wizard v1.0 post of 1/12/2012 begins:

    Most demonstrations, workshops and hands-on labs for SQL Azure Federations use simple T-SQL INSERT … VALUES() statements executed by copying and pasting into query editing windows of the SQL Azure Management Portal or SQL Server Management Studio 2008 R2 SP1. Obviously that approach won’t work for production applications.

    George Huey (@gihuey), a Microsoft Data Architect, is the author of the SQL Azure Migration Wizard (SQLAzureMW), which migrates entire SQL Server databases to SQL Azure. I described Using the SQL Azure Migration Wizard v3.3.3 with the AdventureWorksLT2008R2 Sample Database in a detailed 7/18/2010 post. SQLAzureMW was at v3.8 when I wrote this post.

    You can use SQLAzureMW to upload data to existing federated SQL Azure tables, but George’s new (as of 12/12/2011) SQL Azure Federation Data Migration Wizard (SQLAzureFedMW) v1.0 is simpler and more straightforward for uploading data, especially Big Data.

    For more information about SQL Azure Federations, see MSDN’s Federations in SQL Azure (SQL Azure Database) topic and its subtopics.

    This tutorial explains how to use SQLAzureFedMW to load data from a local SQL Server 2008 R2 SP1 WADPerfCounters table into a federated WADPerfCounters table in the AzureDiagnostics1 root member of the WADFederation. Here’s the SQL Azure Portal’s query editing page displaying the first 5 columns of 10 rows from 398,000 rows uploaded in an initial test:

    image

    The post continues with a lengthy illustrated tutorial on the topic.


    My (@rogerjenn) Creating a SQL Azure Federation in the Windows Azure Platform Portal post of 1/11/2012 begins:

    Update 1/11/2012 2:00 PM PST: The problem with uploading data to the federation via BCP with George Huey’s SQL Azure Federation Data Migration Wizard has been solved. Stay tuned for a forthcoming Loading Big Data into Federated SQL Azure Tables with the SQL Azure Federation Data Migration Wizard v1.0 post later today.

    Update 1/10/2012 2:20 PM PST: The initial problem with creating a table in the federation root has been solved at step 13 and later. The problem now is uploading data to the federation via BCP with George Huey’s SQL Azure Federation Data Migration Wizard. Stand by for a post on that later.

    This tutorial assumes that you have obtained a free trial or pay-per-use subscription to the Windows Azure Platform.

    This sample federation requires a 1-GB root database and five 1-GB member databases created from the data source described in OakLeaf’s Generating Big Data for Use with SQL Azure Federations and Apache Hadoop on Windows Azure Clusters post of 1/8/2012.

    Note: Users of a Windows Azure Free Trial subscription are allotted only one 1-GB database at no charge. The additional five 1-GB databases will cost about $1.65 per day until you drop them. To drop the spending limit on your subscription, open the Subscriptions Profile page, click the yellow Would You Like to Upgrade Now? link to open the Fully Enable Windows Azure dialog, select the Yes, Upgrade My Subscription option and click the check button to convert to pay-as-you-go pricing for more than the allotted resources.

    Warning: Removing the spending cap cannot be reversed.

    The post continues with a step-by-step tutorial.


    Cihan Biyikoglu (@cihangirb) posted an Accessing Federations in SQL Azure using Entity Framework roadmap on 1/10/2012:

    Entity Framework is popular these days in many web applications, and I get the EF-support-in-federations question a few times a week these days… Here is a quick collection of articles on the topic:

    This post on Entity Framework and Federations just came out on the ADO.NET team blog:

    SQLCAT folks, James, Rick and others;

    A few of the limitations and workarounds are listed here;


    <Return to section navigation list>

    MarketPlace DataMarket, Social Analytics and OData

    The Data Explorer Team (@DataExplorer) asked Have you tried the “Data Explorer” samples yet? on 1/12/2012:

    If you have been following this blog in the last few months you may have already seen several blog posts explaining how to accomplish some end-to-end scenarios that involve different kinds of data sources (Excel, Text files, OData feeds or some Web APIs, for example) using the “Data Explorer” Cloud Service.

    While this is great for showing some quick tips & tricks in “Data Explorer”, we have also been working on a number of additional learning resources: tutorial videos and step-by-step samples, as well as the “Data Explorer” Formula language and library specifications.

    These are some of the contents you will find:

    • Importing data from the Windows Azure Marketplace.
    • Merging multiple files.
    • Adding a Web page as a data source.
    • Using lookup and merges.
    • Playing with functions and lookup tables.
    • Importing a table from a text file.

    You can access all these resources in the Data Explorer Learning Page.

    If you have any questions, or there are any other samples that you would like to see in the future to help you learn more about “Data Explorer”, don’t hesitate to let us know with a comment on this post or on our MSDN Forum; we look forward to hearing from you!


    The Social Analytics Team described Finding Top Handles in a 1/9/2012 post:

    This post is a continuation of a series of posts that provides details on how the Entities can be used to accomplish some basic scenarios in the Social Analytics lab.

    In this scenario we’ll look at identifying top participants for one of the lab datasets. To accomplish this, we need two Entities in our dataset (pictured below):

    Using these two entities we can discover who is most active in the conversations included in our dataset.

    Handles

    Handles are the primary element of a person’s identity for a site. You may have discovered (as mentioned in our “Boo” post) the ability to open a column in the Engagement Client for a person’s conversation. Handle is the entity used to track all content items for a person. The Bill Gates lab is already tracking over 199,000 handles.

    Here’s a sampling of some handles from the Bill Gates slice:


    HandleReferenceEntities

    We look for references to a handle in every content item that is processed for a dataset. In the Bill Gates dataset, we can find references to handles as authors and as mentions:

    Although we process posts as individual Content Items, our focus in Social Analytics is on aggregates at the Thread level and above, which is why we attach MessageThreadId just about everywhere that we use ContentItemId.

    Finding the top Handles

    With these two entities, we can easily find the top participants in the Bill Gates dataset over the last 7 days. We’ll use the following LINQ statement in LINQPad:

    (
        from h in Handles.Expand("HandleReferences").AsEnumerable()
        where h.HandleReferences.Max(hr => hr.LastUpdatedOn) > DateTime.Now.AddDays(-7)
        orderby h.HandleReferences.Count descending
        select new
        {
            h.Id,
            h.Name,
            References = h.HandleReferences.Count,
            h.ProfileUrl,
            LastReference = h.HandleReferences.Max(hr => hr.LastUpdatedOn)
        }
    ).Take(5)

    The results:

    This basic analytic provides a good starting point for understanding your top participants and knowing with whom you might want to build relationships and whom you might want to ignore as noise that’s not relevant to your interests.

    Check out our next blog post to find out more about what you can do with the entities in the Social Analytics Lab.

    The team’s avatar isn’t likely to win any graphic arts design awards.


    <Return to section navigation list>

    Windows Azure Access Control, Service Bus and Workflow

    Chris Klug (@ZeroKoll) described Trying Out Azure Service Bus Relay Load Balancing in a 1/9/2012 post:

    As you might have noticed from previous posts, I am somewhat interested in the Azure Service Bus. Why? Well, why not? To be honest, I don’t actually know exactly why, but I feel that it offers something very flexible out of the box, and without too much hassle.

    One of the later features that was added to it is support for load balancing when using message relaying. (You can read more about message relaying here.)

    It is pretty cool, and just works… And by just works, I mean it really just works. If you have a service using message relaying today, adding another instance will automatically enable the feature. But remember, the messages are delivered to ONE of the services, not both. So if your service cannot handle that, make sure you change the implementation to make sure that only one instance is running at any time.

    In previous versions of the bus, adding a second service would throw an exception, which is obviously no longer the case. So if you were depending on this to make sure only one instance was ever running, you will have to revisit that code and make some changes…

    I have decided to create a tiny sample to show off the feature… So let’s have a look!

    I started by creating 3 projects: two console applications and one class library. I think this would be the most common way to do it. Not using 2 console apps (!), but having a server project, a client project and a class library with the service contract.

    Let’s start with the contract. I have decided to create one of those contracts that really shouldn’t ever exist, but still works for demos. It is called INullService and has a single method called DoNothing(), which takes no parameters, and returns void. Like this

    [ServiceContract(Name="NullService", Namespace="http://chris.59north.com/azure/relaydemo")]
    public interface INullService
    {
        [OperationContract(IsOneWay = true)]
        void DoNothing();
    }

    public interface INullServiceChannel : INullService, IChannel { }
    As you can see, the service contract is defined as an interface, which is the way it should be. Do not adorn classes with ServiceContractAttribute…

    I have also marked the method as being one-way and added an extra interface extending the INullService interface by implementing IChannel as well… Just as I have done in previous demos…

    Now that the contract is done, I will move to the server project. I start off by adding references to the class library project, Microsoft.ServiceBus and System.ServiceModel.

    Actually, I only need to add the reference to the ServiceBus assembly up front, as it is the only one used solely from config; but if you don’t have ReSharper, it is easier to add them all up front. With ReSharper, you can just type the name of the type you need, press Alt-Enter and select “Reference assembly…”. (Did I mention that I have started to like ReSharper a LOT?)

    Once the references have been added, I add a new class called NullService, which implements INullService in the simplest way possible:

    internal class NullService : INullService
    {
        public void DoNothing()
        {
            Console.WriteLine("Doing nothing...");
        }
    }

    That’s all there is to the actual service. Next up is the Main method, inside which I start off by creating a new ServiceHost instance. I do this by passing in a Type referring to the NullService class I just created, and getting the rest of the information needed from the app.config file automagically. I then open the host, write a message to the screen and wait for a key to be pressed. Once the key is pressed, I close the host and let the program close…

    static void Main(string[] args)
    {
        var host = new ServiceHost(typeof(NullService));

        host.Open();

        Console.WriteLine("Service listening at: " + host.Description.Endpoints.First().Address);
        Console.WriteLine("Press any key to exit...");
        Console.ReadKey();

        host.Close();
    }

    Once again, very simple! Unfortunately, there is as mentioned a bit of config to go with it. The config is however very standard, so I won’t cover it here, but it is available in the download at the bottom of the page…

    Ok, server done, time to look at the client!

    The client is a bit more complicated. I started out by creating a very simple client; unfortunately, it kept calling the same service instance all the time, which doesn’t really show off the load balancing very well. To get around this, I made it a bit more complicated by making sure that all the calls were being made in parallel using Tasks.

    Let’s go through it one step at a time, once again leaving out the config…

    The Main method just makes a single call to a method called ExecuteCallBatch(). The method is named like this because each time the method is run, it will call the service 50 times. Remember, we need at least a little load for this to work…

    The ExecuteCallBatch() method starts by asking the user if he/she wants to use the same channel for all calls (and also if he/she wants to quit the application).

    Console.Write("Do you want to keep the channel? (y/n/c) ");

    var key = Console.ReadKey().Key;

    if (key == ConsoleKey.C)
        return;

    var keepChannel = key == ConsoleKey.Y;

    Ok, now we know that the user wants to go on, and whether or not he/she wants to keep the channel alive for all calls.

    Next I create a couple of variables, one Task and one INullServiceChannel, and if the channel should be kept alive, I set the channel variable to a new channel.

    var channel = keepChannel ? GetNewChannel() : null;
    Task t = null;

    The GetNewChannel() method really just instantiates a new channel using a ChannelFactory<> and opens it

    private static INullServiceChannel GetNewChannel()
    {
        var channel = new ChannelFactory<INullServiceChannel>("RelayEndpoint").CreateChannel();
        channel.Open();
        return channel;
    }

    Now that I have the variables I need, it is time to create the loop that calls the service. All it does is that it creates a new Task, passing in an anonymous method, and starts it. The anonymous method is responsible for calling the service. It also checks whether or not the channel should be re-used. If not, it creates its own channel, which it also closes when done.

    for (int i = 0; i < 50; i++)
    {
        t = new Task(() =>
        {
            var c = channel;
            if (!keepChannel)
                c = GetNewChannel();

            c.DoNothing();

            if (!keepChannel)
                c.Close();
        });
        t.Start();
    }

    The reason that I keep a reference to the Task is that at the end of the loop I use that reference to wait for the last Task to complete before carrying on.

    The way I do it in this sample isn’t the best idea, as it assumes that the last Task to be created will also be the last one to finish. This isn’t necessarily true, but for this demo it is an assumption I can live with. The app might crash if I am unlucky, but it is unlikely enough for me to ignore…

    Once the last Task has finished, I close the channel if it has been kept throughout the call batch, and then finally call ExecuteCallBatch() again, causing a loop that can only be stopped by pressing “c” at the prompt.

    t.Wait();

    if (keepChannel)
        channel.Close();

    ExecuteCallBatch();

    Ok, that is all there is to it. To try it out, I start 2 or 3 instances of the server application, and let them all connect and get up and running before I start the client. I then let the client run a couple of batches, looking at the nice little printouts I get in all of the server windows, proving that the calls are being nicely load balanced…

    If you run through the demo a couple of times, varying between keeping the same channel and not, you will see that the result is the same. The messages are being load balanced in both scenarios, which I assume means that utilizing sessions in the services might cause some problems. But I still haven’t looked into this so I might be wrong… Feel free to tell me if you know!

    That was it for this time!

    Download code here: DarksideCookie.Azure.ServiceBusDemo.zip (142.84 kb)


    <Return to section navigation list>

    Windows Azure VM Role, Virtual Network, Connect, RDP and CDN

    No significant articles today.


    <Return to section navigation list>

    Live Windows Azure Apps, APIs, Tools and Test Harnesses

    Nuno Godino (@nunogodinho) wrote Windows Azure to Deliver Connected Device Applications and ACloudyPlace posted it on 1/12/2012:

    Currently the mobile market is expanding rapidly and is very important, so it’s strategic that we take mobile into account when planning a solution. A significant number of people will be accessing apps using connected devices like Smartphones, Tablets, Slates, etc.

    Based on this, we need to take into account one very important thing: connectivity, in this case wireless. In order to better understand the amount of data we are talking about, AT&T stated that “We will deliver as much data over our network in the first 5 weeks of 2015 as we did all year in 2010”. That statement illustrates how much more we’re going to need to take these types of devices into account.

    Best Practices

    In this article we’ll be focusing on some of the Best Practices that, when used, will make our applications even better and easier to reach on Connected Devices.

    When building these kinds of applications we have different kinds of data that can be accessed by the client, from Databases to Files; and these Files can be Static or Dynamic.

    Best Practice: Get Out of the Way When You Can

    If we talk about Static Content, the best thing to do would be to send the client directly to Windows Azure Blob Storage instead of placing some Hosted Service in the middle, since this way we can use the Auto-Scaling capabilities of Windows Azure Storage to manage the load on our Blob Data. This is very important for:

    • Media – like images, videos,
    • Binaries – like XAP, ZIP, MSI, DLLs,
    • Data Files – like XML, CSV, XLSX,
    Best Practice: Serve Public Blobs using Windows Azure CDN

    Also very important when talking about Static Content is that it should be available as close as possible to the client to reduce latency and serve the client more quickly. If we enable Windows Azure CDN, it will propagate the files to the 24 nodes that currently exist in the CDN, allowing them to be automatically closer to the client. This also allows us to deliver the content to more customers while reducing the number of transactions and the latency on the central storage account.

    Best Practice: Always Secure Your Content

    When delivering content, we may have some static content which is public and some that isn’t. When sensitive files need to be shared, we should always use Windows Azure Shared Access Signatures to enable security, since they provide direct access to the content in a controlled (ACLed) environment and access can be time-bound and revoked on demand. This way our private content will continue to stay private while at the same time being accessible directly.

    To use Shared Access Signatures we should place a Windows Azure Compute node that will be contacted whenever a Client needs to access specific private content. This Compute node will be responsible for generating the Shared Access Signature that should be used by the customer, and since it’s time-bound, the client device can store it and use it while it’s still active. By not allowing the Device to do the Shared Access Signature generation, we’ll be able to manage all access and prevent any unauthorized requests for content.
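
    A minimal sketch of what that compute-side SAS generation might look like with the 2011-era Microsoft.WindowsAzure.StorageClient library (the account credentials, container and blob names, and the 30-minute lifetime are all assumptions):

    using System;
    using Microsoft.WindowsAzure;
    using Microsoft.WindowsAzure.StorageClient;

    public static class SasIssuer
    {
        public static string GetReadUrl(string containerName, string blobName)
        {
            CloudStorageAccount account = CloudStorageAccount.Parse(
                "DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=...");
            CloudBlobClient client = account.CreateCloudBlobClient();
            CloudBlob blob = client.GetContainerReference(containerName)
                                   .GetBlobReference(blobName);

            // Read-only access that expires after 30 minutes; use a container-level
            // access policy instead if you need to revoke signatures on demand.
            string sas = blob.GetSharedAccessSignature(new SharedAccessPolicy
            {
                Permissions = SharedAccessPermissions.Read,
                SharedAccessExpiryTime = DateTime.UtcNow.AddMinutes(30)
            });

            return blob.Uri.AbsoluteUri + sas;
        }
    }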

    Best Practice: Partition your Data

    When we talk about data it’s very important that it’s scalable and for that we need to partition our Data. The data can be partitioned Horizontally (aka Sharding), Vertically, or Hybrid. Normally we do Hybrid partitioning even if we don’t think of it, since we place all Relational Data inside a SQL Azure Database and all the Files inside Blob Storage, even if those files are related to the elements that exist inside the Relational Database.

    (Figure 1 – Example of Hybrid Partitioning)

    If we need to partition the same type of data, we can partition it Horizontally (Sharding), which is basically done by splitting our data across several different nodes that have the same schema associated, like the example below.

    (Figure 2 – Example of Horizontal Partitioning/Sharding)
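
A minimal sketch of how an application might route requests to a shard, assuming a fixed set of identically-structured SQL Azure databases. The connection strings are placeholders, and real schemes (including SQL Azure Federations) are more sophisticated than a simple modulo:

using System;

public static class CustomerShardMap
{
    // Every shard holds the same schema; only the rows differ.
    private static readonly string[] ShardConnectionStrings =
    {
        "Server=tcp:shard0.database.windows.net;Database=Customers0;...",
        "Server=tcp:shard1.database.windows.net;Database=Customers1;...",
        "Server=tcp:shard2.database.windows.net;Database=Customers2;..."
    };

    public static string GetConnectionStringFor(int customerId)
    {
        // Simple modulo routing on the customer key.
        int shard = Math.Abs(customerId) % ShardConnectionStrings.Length;
        return ShardConnectionStrings[shard];
    }
}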

    Best Practice: Choose your Data Storage Strategy Correctly

Also important when talking about data is the fact that we need to choose our storage location carefully, as that will influence our partitioning and even the scalability of our application. We should choose the storage location based on the type of data as well as the way it’s consumed: if we have data that is searched intensively and needs to be indexed we should choose SQL Azure, but if the data doesn’t need rich indexing then Windows Azure Table Storage is better suited for it.

    Best Practice: Cache your Data

Another very important part of making our device solutions better is caching data, since devices are never truly “always on”, and sometimes they can be online but we prefer them not to be because of costs, for example when roaming. Because of this we should always cache our data to avoid incurring those costs and to avoid putting unnecessary stress on the Compute or even Storage nodes.

We can also use in-memory caching on the Compute nodes in Windows Azure; with Windows Azure Caching we can serve repeated requests from memory and avoid stressing the Storage nodes too much.
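
Assuming the role has been configured for Windows Azure Caching (the cache client settings live in the role’s configuration file), a cache-aside lookup might look roughly like this; the cache key and the loader delegate are placeholders for your own data access code:

using System;
using Microsoft.ApplicationServer.Caching;

public static class CatalogCache
{
    // Check the cache first and only hit blob/table storage on a miss.
    // (In production you would keep the DataCacheFactory around rather than re-creating it.)
    public static string GetCatalogXml(Func<string> loadFromStorage)
    {
        DataCache cache = new DataCacheFactory().GetDefaultCache();

        var catalogXml = cache.Get("catalog-xml") as string;
        if (catalogXml == null)
        {
            catalogXml = loadFromStorage();                                 // the expensive storage call
            cache.Put("catalog-xml", catalogXml, TimeSpan.FromMinutes(10)); // keep it for 10 minutes
        }
        return catalogXml;
    }
}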

    Best Practice: Consume Services efficiently

When we’re consuming services on our mobile devices, one very important thing to keep in mind is the size of the messages being exchanged, since this affects our solution’s scalability, costs, and usability. A result serialized as JSON is usually a lot smaller than the same result serialized as XML, but at the same time there may be significant differences in interoperability and security, so you’ll need to keep these trade-offs in mind.
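
A quick way to see the difference for your own DTOs is to serialize the same object both ways and compare the byte counts; the Order type here is just an illustrative placeholder:

using System;
using System.IO;
using System.Runtime.Serialization;
using System.Runtime.Serialization.Json;

[DataContract]
public class Order
{
    [DataMember] public int Id { get; set; }
    [DataMember] public string Customer { get; set; }
    [DataMember] public decimal Total { get; set; }
}

public static class PayloadSizes
{
    public static void Compare(Order order)
    {
        using (var xml = new MemoryStream())
        using (var json = new MemoryStream())
        {
            new DataContractSerializer(typeof(Order)).WriteObject(xml, order);
            new DataContractJsonSerializer(typeof(Order)).WriteObject(json, order);

            // Over a metered 3G connection this difference adds up quickly.
            Console.WriteLine("XML: {0} bytes, JSON: {1} bytes", xml.Length, json.Length);
        }
    }
}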

    Best Practice: Always Enable Compression on Dynamic Content Types

Compression is very important because it shrinks the data transmitted over the wire, resulting in reduced costs for us and for the customer.
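
Enabling dynamic compression in IIS is usually the simplest route; if that isn’t an option, one possible alternative for an ASP.NET service is to gzip the response yourself when the client advertises support for it. This is only a sketch:

using System.IO.Compression;
using System.Web;

public static class ResponseCompression
{
    // Call this early in the request pipeline (e.g. from Application_BeginRequest).
    public static void CompressIfAccepted(HttpContext context)
    {
        string acceptEncoding = context.Request.Headers["Accept-Encoding"] ?? string.Empty;
        if (acceptEncoding.Contains("gzip"))
        {
            context.Response.Filter = new GZipStream(context.Response.Filter,
                                                     CompressionMode.Compress);
            context.Response.AppendHeader("Content-Encoding", "gzip");
        }
    }
}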

    Summary

    These Best Practices are essential and, when used, will improve our solution. Getting acquainted with Best Practices will make working with Windows Azure even easier and more efficient.

In this article we haven’t focused on architecture or the back-end costs of a mobile solution, since those topics would need to be addressed in a separate article.

    Full disclosure: I’m a paid contributor to Red Gate Software’s ACloudyPlace.com.


    Pankaj Arora posted a Book Excerpt: "To The Cloud: Cloud Powering an Enterprise" on 1/11/2012:

    Hi, I’m Pankaj Arora, a senior manager in the Microsoft IT Global Strategic Initiatives team.

    When Microsoft CEO Steve Ballmer publicly declared “We’re all in” on cloud computing in March 2010, he wasn’t just referring to Microsoft’s products. He also was giving his IT organization a mandate to move to cloud computing. Since that declaration, my colleagues and I have learned a lot about what it takes to adopt cloud computing at a global enterprise. We now have cloud deployments of all the common models—SaaS, PaaS, and IaaS—and we’re starting to use Data-as-a-Service.

With numerous deployment experiences under our belt—and industry predictions of even greater cloud adoption in 2012 as a backdrop—I want this community to know about a book I’ve co-authored with two colleagues titled, To the Cloud: Cloud Powering an Enterprise. In summary, the book addresses the Why, What and How of enterprise cloud adoption. It’s based on our own experiences and best practices adopting cloud computing, while also drawing on industry and customer experiences.

    The following is an excerpt from Chapter 4 of the pre-production version of the book, which is available in print and eBook through Amazon, Barnes & Noble and McGraw-Hill amongst other outlets. You can see more on the book website here.

    Feel free to ask questions, and I hope these excerpts (and the book) help you with your cloud computing strategy and deployments.

    Pankaj

    Architectural Principles

    Moving applications and data out of the corporate data center does not eliminate the risk of hardware failures, unexpected demand for an application, or unforeseen problems that arise in production. Designed well, however, a service running in the cloud should be more scalable and fault-tolerant, and perform better than an on-premises solution.

Virtualization and cloud fabric technologies, as used by cloud providers, make it possible to scale out to a theoretically unlimited capacity. This means that application architecture and the level of automation, not physical capacity, constrain scalability. In this section, we introduce several design principles that application engineers and operations personnel need to understand to properly architect a highly scalable and reliable application for the cloud.

    Resiliency

    A properly designed application will not go down just because something happens to a single scale unit. A poorly designed application, in contrast, may experience performance problems, data loss, or an outage when a single component fails. This is why cloud-centric software engineers cultivate a certain level of pessimism. By thinking of all the worst-case scenarios, they can design applications that are fault tolerant and resilient when something goes wrong.

    Monolithic software design, in which the presentation layer and functional logic are tightly integrated into one application component, may not scale effectively or handle failure gracefully. To optimize an application for the cloud, developers need to eliminate tight dependencies and break the business logic and tasks into loosely-coupled modular components so that they can function independently. Ideally, application functionality will consist of autonomous roles that function regardless of the state of other application components. To minimize enterprise complexity, developers should also leverage reusable services where possible.

    We talked about the Microsoft online auction tool earlier. One way to design such an application would be to split it into three components, as each service has a different demand pattern and is relatively asynchronous from the others: a UI layer responsible for presenting information to the user, an image resizer, and a business logic component that applies the bidding rules and makes the appropriate database updates. At the start of the auction, a lot of image resizing occurs as people upload pictures of items they add to the catalog. Toward the end of the auction, as people try to outbid each other, the bidding engine is in higher demand. Each component adds scale units as needed based on system load. If, for example, the image resizer component fails, the entire functionality of the tool is not lost.

    Pessimism aside, the redundancy and automation built into cloud models make cloud services more reliable, in general. Often, cloud providers have multiple “availability zones” in which they segment network infrastructure, hardware, and even power from one another. Operating multiple scale units of a single application across these zones can further reduce risk; some providers require this before they will guarantee a higher SLA. Therefore, the real question when considering failure is, what happens if an instance of an application is abruptly rebooted, goes down, or is moved?

    • How will IT know the failure happened?
    • What application functionality, if any, will still be available?
    • Which steps will be required to recover data and functionality for users?

    Removing unnecessary dependencies makes applications more stable. If a service that the application relies upon for one usage scenario goes down, other application scenarios should remain available.

    For the back-end, because some cloud providers may throttle requests or terminate long-running queries on SQL PaaS and other storage platforms, engineers should include retry logic. For example, a component that requests data from another source could include logic that asks for the data a specified number of times within a specified time period before it throws an exception.
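
A bare-bones sketch of that retry idea follows; the attempt count and delay are arbitrary, and a real implementation would ideally retry only errors it knows to be transient, such as throttling responses or timeouts:

using System;
using System.Threading;

public static class Retry
{
    public static T Execute<T>(Func<T> operation, int maxAttempts, TimeSpan delay)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return operation();   // e.g. a SQL Azure query or a table storage call
            }
            catch (Exception)
            {
                if (attempt >= maxAttempts)
                    throw;            // give up and surface the exception to the caller
                Thread.Sleep(delay);  // simple fixed back-off before the next attempt
            }
        }
    }
}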

    For the occasional reboot of a cloud instance, application design should include a persistent cache so that another scale unit or the original instance that reboots can recover transactions. Using persistent state requires taking a closer look at statelessness—another design principle for cloud-based applications.

    Statelessness

    Designing for statelessness is crucial for scalability and fault tolerance in the cloud. Whether an outage is unexpected or planned (as with an operating system update), as one scale unit goes down, another picks up the work. An application user should not notice that anything happened. It is important to deploy more than one scale unit for each critical cloud service, if not for scaling purposes, simply for redundancy and availability.

    Cloud providers generally necessitate that applications be stateless. During a single session, users of an application can interact with one or more scale unit instances that operate independently in what is known as “stateless load balancing” or “lack of session affinity.” Developers should not hold application or session state in the working memory of a scale unit because there is no guarantee the user will exclusively interact with that particular scale unit. Therefore, without stateless design, many applications will not be able to scale out properly in the cloud. Most cloud providers offer persistent storage to address this issue, allowing the application to store session state in a way that any scale unit can retrieve.

    Parallelization

    Taking advantage of parallelization and multithreaded application design improves performance and is a core cloud design principle. Load balancing and other services inherent in cloud platforms can help distribute load with relative ease. With low-cost rapid provisioning in the cloud, scale units are available on-demand for parallel processing within a few API calls and are decommissioned just as easily.

    Massive parallelization can also be used for high performance computing scenarios, such as for real-time enterprise data analytics. Many cloud providers directly or indirectly support frameworks that enable splitting up massive tasks for parallel processing. For example, Microsoft partnered with the University of Washington to demonstrate the power of Windows Azure for performing scientific research. The result was 2.5 million points of calculation performed by the equivalent of 2,000 servers in less than one week,[1] a compute job that otherwise may have taken months.

    Latency

    Software engineers can apply the following general design principles to reduce the potential that network latency will interfere with availability and performance.

    • Use caching, especially for data retrieved from higher latency systems, as would be the case with cross-premises systems.
    • Reduce chattiness and/or payloads between components, especially when cross-premises integration is involved.
    • Geo-distribute and replicate content globally. As previously mentioned, enabling the content delivery network in Windows Azure, for example, allows end users to receive BLOB storage content from the closest geographical location.
    Automated Scaling

    For cloud offerings that support auto-scaling features, engineers can poll existing monitoring APIs and use service management APIs to build self-scaling capabilities into their applications. For example, consider utilization-based logic that automatically adds an application instance when traffic randomly spikes or reaches certain CPU consumption thresholds. The same routine might listen for messages, instructing instances to shut down once demand has fallen to more typical levels.

    Some logic might be finance-based. For example, developers can add cost control logic to prevent noncritical applications from auto-scaling under specified conditions or to trigger alerts in case of usage spikes.
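
Purely as an illustration of this kind of decision logic, the sketch below uses two hypothetical abstractions, IMetricsSource and IScaleController, standing in for the provider’s monitoring and service management APIs (in Windows Azure, the Diagnostics data and the Service Management API); the thresholds and the cost ceiling are arbitrary:

public interface IMetricsSource
{
    double AverageCpuPercent { get; }   // rolling average across all instances
}

public interface IScaleController
{
    int InstanceCount { get; }
    void SetInstanceCount(int count);   // calls the provider's management API
}

public class SimpleAutoScaler
{
    private readonly IMetricsSource _metrics;
    private readonly IScaleController _scaler;
    private readonly int _maxInstances; // finance-based ceiling for noncritical applications

    public SimpleAutoScaler(IMetricsSource metrics, IScaleController scaler, int maxInstances)
    {
        _metrics = metrics;
        _scaler = scaler;
        _maxInstances = maxInstances;
    }

    public void Evaluate()
    {
        double cpu = _metrics.AverageCpuPercent;
        int count = _scaler.InstanceCount;

        if (cpu > 75 && count < _maxInstances)
            _scaler.SetInstanceCount(count + 1);   // scale out under sustained load
        else if (cpu < 25 && count > 2)
            _scaler.SetInstanceCount(count - 1);   // scale back, but keep two instances for redundancy
        // If the cost ceiling has been reached, raise an alert instead of scaling out.
    }
}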

    Scaling of data is as important as application scaling, and once again, it is a matter of proper design. Rethinking the architecture of an application’s data layer for use in the cloud, while potentially cumbersome, can also lead to performance and availability improvements. For example, if no cloud data storage service offers a solution large enough to contain the data in an existing database, consider breaking the dataset into partitions and storing it across multiple instances. This practice, known as “sharding,” has become standard for many cloud platforms and is built into several, including SQL Azure. Even if this is not necessary initially, it may become so over time as data requirements grow.


    [1] “Scientists Unfolding Protein Mystery, Fighting Disease with Windows Azure.” Microsoft.


    Wade Wegner (@WadeWegner) asked Are You Building Mobile + Cloud Applications? Tell Me! on 1/9/2012:

If you follow my blog or me on Twitter then you know that I’m passionate about using services running in Windows Azure to power mobile applications. To effectively run services for mobile apps you need a platform that is responsive to a global audience and able to scale to the needs of your user base – Windows Azure provides these capabilities.

As part of the refresh of WindowsAzure.com we have also provided additional information about mobile scenarios – it’s worth taking a look.

We’ve built a lot of resources that you should take a look at, including: the Windows Azure Toolkit for iOS, the Windows Azure Toolkit for Android, the Windows Azure Toolkit for Windows Phone, and a host of NuGet packages for Windows Phone and Windows Azure. All of these resources include native libraries (e.g. Objective-C for iOS and .NET for Windows Phone), sample applications, documentation, and tools. We also have a lot of videos and guides available to make the process of getting started as easy as possible.

    How can you help?

One of my primary goals in 2012 is to continue to find and build compelling mobile applications that benefit from Windows Azure. We already have a few great stories (see T-Mobile USA, Red Badger, easyJet, and more) but that’s only scratching the surface – we can do a lot more!

So, I have a few questions for you:

    • Are you building mobile applications that use services in Windows Azure?
    • Are you looking for additional PR and opportunities to highlight your applications?
    • Have you tried any of the toolkits or NuGet packages?
    • Do you have feedback for me regarding the use of the toolkits or NuGet packages?
    • What should we do that we aren’t today?
    • Do you have an application released to a marketplace – either Windows Phone, Apple, or Android – that uses Windows Azure?

    If you have any feedback to these questions then please contact me at wade.wegner@microsoft.com. I want to hear from you!

    Let’s see what we can accomplish together!


    Ram Jeyaraman reported Windows Azure Libraries for Java Available, including support for Service Bus in a 1/9/2012 post to the Interoperability @ Microsoft blog:

    Good news for all you Java developers out there: I am happy to share with you the availability of Windows Azure libraries for Java that provide Java-based access to the functionality exposed via the REST API in Windows Azure Service Bus.

    You can download the Windows Azure libraries for Java from GitHub.

    This is an early step as we continue to make Windows Azure a great cloud platform for many languages, including .NET and Java. If you’re using Windows Azure Service Bus from Java, please let us know your feedback on how these libraries are working for you and how we can improve them. Your feedback is very important to us!

    You may refer to Windows Azure Java Developer Center for related information.

Openness and interoperability are important to Microsoft, our customers, partners, and developers, and we believe these libraries will enable Java applications to more easily connect to Windows Azure, in particular the Service Bus, making it easier for applications written on any platform to interoperate with each other through Windows Azure.


    <Return to section navigation list>

    Visual Studio LightSwitch and Entity Framework 4.1+

    Beth Massi (@bethmassi) posted Drop-down Lists Tips & Tricks (Beth Massi) on 1/12/2012 to the Visual Studio LightSwitch blog:

Drop-down lists in Visual Studio LightSwitch are represented by the Auto Complete Box control. This control allows the user to select from a list of values, either coming from another table or from a static list of choices defined by a Choice List property. There are a lot of nice things you can do with drop-down lists to make the user experience better on your screens. Check out the following articles on my blog for more information:


    Beth Massi (@bethmassi) described Creating Cascading Drop Down Lists in Visual Studio LightSwitch in a 1/12/2012 post:

A common technique on data entry screens is using one “drop down list” (called an auto-complete box in LightSwitch) as a filter into the next. This limits the number of choices that need to be brought down and guides the user into easily locating values. This technique is also useful if you have cascading filtered lists where the first selection filters data in the second, which filters data in the next, and so forth. LightSwitch makes this easy to do using parameterized queries and parameter binding. In this post let’s take a look at a couple of common scenarios.

    Cascading Lists based on Multiple Tables

Let’s take an example where we have a table of States and a child table of Cities. Cities are then selected on Customers when entering them into the system. So we have one-to-many relationships between State and Cities and between City and Customers. Our data model looks like this:

    image

    When the user is entering new customers we don’t want to display thousands of cities in the dropdown to choose from. Although the users can use the auto-complete box feature to locate a city, bringing down all these records affects performance. It’s better to either use a Modal Window Picker search dialog (like I showed in this article) or present the list of States first and then filter the list of Cities down based on that selection.

First we need to create a couple of queries. The first will simply sort the list of States so that they show up in alphabetical order in the list. Right-click on the States table in the Solution Explorer and select Add Query to open the Query Designer. Create a query called “SortedStates” that sorts on the state’s Name Ascending:

    image

    Next create a query called “CitiesByState” by right-clicking on the Cities table in the Solution Explorer and selecting Add Query again. This time we will create a parameterized query: Where the State.Id property is equal to a new parameter called Id. The Query Designer should now look like this:

    image

    Now create the Customer Detail Screen like normal. Right-click on the Screens node and select “Add Screen…”, select the Edit Details Screen template then choose Customers for the screen data. The Screen Designer opens and all the fields in the Customer entity will be in the content tree. The City field is displayed as an auto-complete box.

    image

    Next we’ll need to add a data item to our screen for tracking the selected State. We will use this value to determine the filter on the City list so that users only see cities in the selected state. Click “Add Data Item” and add a Local Property of type State called SelectedState.

    image

    Next, drag the SelectedState over onto the content tree above the City. LightSwitch will automatically create an auto-complete box control for you.

    image

    Since we want to display the states sorted, next add the SortedStates query to the screen. Click “Add Data Item” again, this time select Query and choose SortedStates.

    image

    Then select the SelectedState auto-complete box in the content tree and on the Properties window, set the Choices property to SortedStates.

    image

    Next, add the CitiesByState query to the screen and set that as the Choices property of the Cities auto-complete box. Again, click “Add Data Item” and choose the CitiesByState query.

    image

    Then select the Cities auto-complete box and set the Choices property to this query.

    image

    Lastly we need to hook up the parameter binding. Select the Id parameter of the CitiesByState query and in the properties window set the Parameter Binding to SelectedState.Id. Once you do this a grey arrow on the left side will indicate the binding.

    image

    Once you set the value of a query parameter, LightSwitch will automatically execute the query for you so you don’t need to write any code for this. Hit F5 and see what you get. Notice that the Cities drop down list is empty until you select a State at which point it feeds the CitiesByState query and executes it. Also notice that if you make a city selection and then change the state, the selection is still displayed correctly, it doesn’t disappear. Just keep in mind that anytime a user changes the state, the city query is re-executed against the server.

    image

    One additional thing that you might want to do is to initially display the state to which the city belongs. As it is, the Selected State comes up blank when the screen is opened. This is because it is bound to a screen property which is not backed by data. However we can easily set the initial value of the SelectedState in code. Back in the Screen Designer drop down the “Write Code” button at the top right and select the InitializeDataWorkspace method and write the following:

    Private Sub CustomerDetail_InitializeDataWorkspace(saveChangesTo As List(Of Microsoft.LightSwitch.IDataService))
    ' Pre-select the state of the customer's current city so the cascading filter starts out in sync.
        If Me.Customer.City IsNot Nothing Then
            Me.SelectedState = Me.Customer.City.State
        End If
    End Sub

    Now when you run the screen again, the Selected State will be displayed.

    Cascading Lists Based on a Single Table

Another option is to create cascading lists based on a single table. For instance, say we do not want to have a State table at all. Instead, it may make more sense to store the State on the City. So our data model could be simplified by having just a City table related to many Customers.

image

This time when we create a parameterized query called CitiesByState, we’ll set it up Where the State is equal to a new parameter called State.

    image

    On the screen, select “Add Data Item” to add the CitiesByState to the screen and set it as the Choices property of the City auto-complete box just like before.

    image

    This time, however, the State is the query parameter we need to bind. Add a string screen property to the screen to hold the selected state. Click “Add Data Item” again, add a required Local Property of type String and name it SelectedState.

    image

    Drag the SelectedState onto the content tree above the City. This time LightSwitch will create a textbox for us since this is just a local string property.

    image

    Finally, we need to set up the query parameter binding. Select the State query parameter and in the properties window set the Parameter Binding to SelectedState.

    image

    In order to set the SelectedState when the screen opens, the same code as before will work. Now when we run this, you will see a textbox that will filter the list of cities.

    image

However, this may not be exactly what we want. If the user has a free-form text field then they could mistype a state code and the query would return no results. It would be better to present the states in an auto-complete box like before. Close the application and open the Screen Designer again. Select the SelectedState screen property. Notice in the properties window that you can display this as a static list of values by creating a Choice List.

    image

Enter the states that the user should select from and then run the application again. Now we get an auto-complete box like before. However, this approach leaves us having to define the choice list of states on every screen where we want this functionality. The first approach using a State table solves this issue, but there is also one other approach we could take to avoid having to create a separate table.

    Using a Choice List on the Entity

    We could improve this situation by defining the choice list of states on the Customer entity. Then we would only have to define the lists of states in one place. Create a State property on the Customer using the Data Designer and on the properties window select Choice List and specify the list of states there.

    image

    Now add a new detail screen for Customer and you will see the State and City properties are set up as auto-complete boxes. Next add the CitiesByState query to the screen by clicking “Add Data Item” like before. We can use the same CitiesByState query as the previous example. Select the City auto-complete box and in the properties window set the Choices property to CitiesByState like before.

    The difference is the query parameter binding. Select the State query parameter on the left and in the properties window set the Parameter Binding to Customer.State.

    image

    With this technique, you also do not need to write any code to set the initial selected State because we are now storing that value on the Customer record. Run the application again and the screen should behave the same. The only difference now is that we are storing the State on the Customer instead of a separate table.

    image

This technique is the simplest to set up, so if you have a lot of Customer screens and a static list of values like States, then this may be the best choice for you. However, if you have a dynamic list of values then it’s better to store them in a separate table as the first technique showed.

    For more information on filtering data and configuring lists see:


    Rowan Miller reported EF 4.3 Beta 1 Released in a 1/12/2012 post to the ADO.NET Team blog:

At the end of November we released Beta 1 of Code First Migrations. At the time we released Code First Migrations Beta 1 we also announced that we would be rolling the migrations work into the main EntityFramework NuGet package and releasing it as EF 4.3.

    Today we are making Beta 1 of EF 4.3 available. This release also includes a number of bug fixes for the DbContext API and Code First.

    We are planning for this to be the last pre-release version of migrations and our next release will be the final RTM of EF 4.3.

    What Changed

This release has been primarily about integrating migrations into the EntityFramework NuGet package, improving quality and cleaning up the API surface ready for RTM.

    Notable changes to Code First Migrations include:

    • New Enable-Migrations command. You now use the Enable-Migrations command to add the Migrations folder and Configuration class to your project. This command will also automatically fill in your context type in the Configuration class (provided you have a single context defined in your project).
• Update-Database.exe command line tool. In addition to the PowerShell commands, this release also includes Update-Database.exe, which can be used to run the migrations process from a command line. You can find this tool in the ‘packages\EntityFramework.4.3.0-beta1\tools\’ folder under your solution’s directory. The syntax for this command line tool is very similar to the Update-Database PowerShell command. Run ‘Update-Database /?’ from a command prompt for more information on the syntax.
• Migrations database initializer. This release includes the System.Data.Entity.MigrateDatabaseToLatestVersion database initializer that can be used to automatically upgrade to the latest version when your application launches (see the sketch after this list).
    • Complete xml documentation. This release now includes xml documentation (IntelliSense) for the migrations API surface.
    • Improved logging. If you specify the –Verbose flag when running commands in Package Manager Console we now provide more information to help with debugging.
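
For example, in a web application you might wire up the new initializer at start-up along these lines; BlogContext and Configuration are placeholders for your own DbContext and migrations Configuration classes, not names from this release:

using System.Data.Entity;

// e.g. in Global.asax.cs
protected void Application_Start()
{
    // Upgrades the database to the latest migration the first time the context is used.
    Database.SetInitializer(
        new MigrateDatabaseToLatestVersion<BlogContext, Configuration>());
}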

    Other notable changes in EF 4.3 include:

    • Removal of EdmMetadata table. If you allow Code First to create a database by simply running your application (without using Migrations) the creation is now performed as an Automatic Migration. You can then enable migrations and continue evolving your database using migrations.
    • Bug fix for GetDatabaseValues. In earlier releases this method would fail if your entity classes and context were in different namespaces. This issue is now fixed and the classes don’t need to be in the same namespace to use GetDatabaseValues.
    • Bug fix to support Unicode DbSet names. In earlier releases you would get an exception when running a query against a DbSet that contained some Unicode characters. This issue is now fixed.
    • Data Annotations on non-public properties. Code First will not include private, protected or internal properties by default. If you manually include them in your model Code First used to ignore any Data Annotations on those members. This is now fixed and Code First will process the Data Annotations.
    • More configuration file settings. We’ve enabled more Code First related settings to be specified in the App/Web.config file. This gives you the ability to set the default connection factory and database initializers from the config file. You can also specify constructor arguments to be used when constructing these objects. More details are available in the EF 4.3 Configuration File Settings blog post.
    Getting Started

    You can get EF 4.3 Beta 1 by installing the latest pre-release version of the EntityFramework NuGet package.

    You will need NuGet 1.6 installed and specify the –IncludePrerelease flag at the Package Manager Console to get this pre-release version. Pre-release packages can only be installed from the Package Manager Console.

    InstallPackage

There are two walkthroughs for EF 4.3 Beta 1. One focuses on the no-magic workflow that uses a code-based migration for every change. The other looks at using automatic migrations to avoid having lots of code in your project for simple changes.

    Upgrading From ‘Code First Migrations Beta 1’

    If you have Code First Migrations Beta 1 installed you will need to uninstall the EntityFramework.Migrations package by running ‘Uninstall-Package EntityFramework.Migrations’ in Package Manager Console.

    You can then install EF 4.3 Beta 1 using the ‘Install-Package EntityFramework –IncludePrerelease’ command.

You will need to close and re-open Visual Studio after installing the new package; this is required to unload the old migrations commands.

    RTM Timeline

We are planning for this to be the last pre-release version of migrations and are still on track to get a fully supported, go-live release of EF 4.3 published this quarter (the first quarter of 2012).

    MSDeploy Provider Update

    We originally blogged about our plans to deliver an MSDeploy provider that could be used to apply migrations to a remote server. After many long hours iterating on this and working with the MSDeploy team we’ve concluded that we can’t deliver a good MSDeploy story for Migrations at this stage.

    The primary issues arise from us needing to execute code from your application assemblies on the remote server, in order to calculate the SQL to apply to the database. This is a requirement that other MSDeploy providers have not had in the past. We are going to continue working with the MSDeploy team to see if we can deliver something in the future, but unfortunately we won’t be shipping an MSDeploy provider in the immediate future.

    If you are able to connect to the remote database from the machine you are deploying from, then you can use the Update-Database.exe command line tool to perform the upgrade process. You can also use the System.Data.Entity.Migrations.DbMigrator class to write your own code that performs the migration process.
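
As a rough sketch of that second option, using DbMigrator from your own code might look like this; Configuration here means your project’s migrations Configuration class, and whether this fits your deployment process is up to you:

using System.Data.Entity.Migrations;

public static class DatabaseDeployer
{
    public static void MigrateToLatest()
    {
        // Uses the connection information from the migrations configuration /
        // application config to bring the target database up to date.
        var migrator = new DbMigrator(new Configuration());
        migrator.Update();
    }
}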

    EF 5.0 (Enum support is coming… finally!)

    We’ve been working on a number of features that required updates to some assemblies that are still part of the .NET Framework. These features include enums, spatial data types and some serious performance improvements.

    As soon as the next preview of the .NET Framework 4.5 is available we will be shipping EF 5.0 Beta 1, which will include all these new features.

      Support

      This is a preview of features that will be available in future releases and is designed to allow you to provide feedback on the design of these features. It is not intended or licensed for use in production. If you need assistance we have an Entity Framework Pre-Release Forum.


      Jan van der Haegen posted LightSwitch Achievements: Chapter one (Introduction) – The idea on 1/9/2012:

In this blog post series, we’re going to create a fun extension that will stimulate the users of our LightSwitch applications by rewarding them with points and achievements when they use the application…

      History 101

      If you want to fight a battle, you need an army. An army full of brave people, trained, ready to kill, and more importantly, ready to get killed.

It’s not completely unnatural to put yourself into such a dangerous situation for “the greater good”; however, if I had to live with the decision to be in the front line, that “greater good” had better be “damn good, and damn worth the risk”.

      The Romans understood that their soldiers needed some additional motivation, besides their scanty wages, and granted any soldier freedom and a substantial piece of land, for four years of their loyalty.

Hundreds of years later, Napoleon needed to motivate his soldiers as well, but quickly found himself running out of land to give away. He was the first to create a reward system based on medals, which – probably to his own surprise – worked remarkably well. Medals don’t soften the blow of losing a relative on the field, or ease the pain of having one of your limbs shot to pieces; however, the visual aspect of those shiny medal coins on a colorful bow did motivate his soldiers by meeting their “lower esteem needs” (from the fourth tier of Maslow’s pyramid): the need for the respect of others, the need for status, recognition, fame, prestige, and attention.

      Today’s plan

      History lessons aside, the rewarding / motivation system of “Points”, “Achievements” and/or “Medals” is anything but history, it is used today in all kinds of areas, from employee retention management, games (MMORPG, XBox gamer profile, …), community sites (MSDN, forums, …), that obnoxious web site that stimulates users to spam their connections on social media, each time they “check in” somewhere, … to even the boy scout badges…

As old as the system may be, it is effective, motivating and fun for the rewardees, so why not make a reusable LightSwitch extension that any developer can set up and start using in no time, to give his/her users some extra incentive? We will keep track of events that happen in our LightSwitch application, create a system to reward those events to our users, create a visual effect to show each time a medal is awarded, a summary page per user and a “Hall of fame”.

      In this blog post series, we will continue our LightSwitch hacking quest, explore some of LightSwitch’s undocumented areas and have a lot of fun in the process…


<Return to section navigation list>

      Windows Azure Infrastructure and DevOps

      Wely Lau (@wely_live) wrote Comparing IAAS and PAAS: A Developer’s Perspective, which ACloudyPlace posted on 1/13/2012:

In my previous article, I discussed the basic concepts behind Cloud Computing including definitions, characteristics, and various service models. In this article I will discuss service models in more detail, and in particular the comparison between IAAS and PAAS from a developer’s standpoint.

      I’m using two giant cloud players for illustrative purposes: Amazon Web Service representing IAAS and Windows Azure Platform representing PAAS. Nonetheless, please be informed that the emphasis is on the service models and not the actual cloud players.

      Figure 1: IAAS VS PAAS

      Infrastructure as a Service (IAAS)

IAAS refers to the cloud service model that provides on-demand infrastructure services to the customer. The infrastructure may refer to rentable resources such as computation power, storage, load balancers, and so on.

      As you can see on the left-hand side of Table 1, the IAAS provider will be responsible for managing physical resources, for example network, servers, and clustered machines. Additionally, they typically will also manage virtualization technology enabling customers to run VMs (virtual machines). When it comes to the Operating System (OS), it is often arguable whether it’s managed by the provider or customer. In most cases, the IAAS provider will be responsible for customer VM Images with a preloaded OS but the customer will need to subsequently manage it. Using AWS as an example, AMI (Amazon Machine Image) offers customers several types of Operating Systems such as Windows Server, Linux SUSE, and Linux Red Hat. Although the OS is preloaded, AWS will not maintain or update it.

      Other stacks of software including middleware (such as IIS, Tomcat, Caching Services), runtime (JRE and .NET Framework), and databases (SQL Server, Oracle, MySQL) are normally not provided in the VM Image. That’s because the IAAS provider won’t know and won’t care what customers are going to do with the VM. Customers are responsible for taking care of them. When all of the above mentioned software has been settled, customers will finally deploy the application and data on the VM. …

      Wely continues with:

      Step-by-step: Setting-up an Application on IAAS Environment

      Step-by-step: Setting-up an Application on PAAS Environment

      Summary

To summarize, we have investigated the different service models and provisioning steps of IAAS and PAAS solutions. PAAS providers indeed take on much more responsibility for your solution than an IAAS provider would. On the other hand, IAAS may offer more flexibility at a lower level (for example, public IP addresses, load balancers, etc.).

      There’s no one-size-fits-all here. As a developer or architect, you should understand a customer’s need and determine the correct model to get the best possible outcome.

      Full disclosure: I’m a paid contributor to Red Gate Software’s ACloudyPlace.com.


Tim Anderson (@timanderson) described the interest of PHP developers in cloud providers (including Windows Azure) in his PHP Developer survey shows dominance of mobile, social media and cloud post of 1/12/2012:

      imageZend, a company which specialises in PHP frameworks and tools, has released the results of a developer survey from November 2011.

      The survey attracted 3,335 respondents drawn, it says, from “enterprise, SMB and independent developers worldwide.” I have a quibble with this, since I believe the survey should state that these were PHP developers. Why? Because I have an email from November which asked me to participate and said:

      Zend is taking the pulse of PHP developers. What’s hot and what matters most in your view of PHP?

      There is a difference between “developers” and “PHP developers”, and much though I love PHP the survey should make this clear. Nevertheless, If you participated, but mainly use Java or some other language, your input is still included. Later the survey states that “more than 50% of enterprise developers and more than 65% of SMB developers surveyed report spending more than half of their time working in PHP.” But if they are already identified as PHP developers, that is not a valuable statistic.

      Caveat aside, the results make good reading. Some highlights:

      • 66% of those surveyed are working on mobile development.
      • 45% are integrating with social media
      • 41% are doing cloud-based development

      Those are huge figures, and demonstrate how far in the past was the era when mobile was some little niche compared to mainstream development. It is the mainstream now – though you would get a less mobile-oriented picture if you surveyed enterprise developers alone. Similar thoughts apply to social media and cloud deployment.

      The next figures that caught my eye relate to cloud deployment specifically.

      • 30% plan to use Amazon
      • 28% will use cloud but are undecided which to use
      • 10% plan to use Rackspace
• 6% plan to use Microsoft Azure
      • 5% have another public cloud in mind (Google? Heroku?)
      • 3% plan to use IBM Smart Cloud

The main message here is: look how much business Amazon is getting, and how little is going to giants like Microsoft, IBM and Google. Then again, these are PHP developers, in which light 6% for Microsoft Azure is not bad – or are these PHP developers who also work in .NET?

      I was also interested in the “other languages used” section. 82% use JavaScript, which is no surprise given that PHP is a web technology, but more striking is that 24% also use Java, well ahead of C/C++ at 17%, C# at 15% and Python at 11%.

      Finally, the really important stuff. 86% of developers listen to music while coding, and the most popular artists are:

      1. Metallica
      2. = Pink Floyd and Linkin Park

      Wow.

      It’s obvious that the Windows Azure marketing team has their work cut out for them.


      <Return to section navigation list>

      Windows Azure Platform Appliance (WAPA), Hyper-V and Private/Hybrid Clouds

      Bill Kleyman described Load balancing servers in a private cloud in a 1/13/2012 post to the SearchCloudComputing.com blog:

      In many ways, managing a private cloud is no different than managing an on-premises data center. IT admins still must take important steps to monitor and balance the infrastructure. But the success of a cloud environment depends on several components: security, server density, network planning and workload management.

Before placing any workload on a cloud-ready server, administrators must plan their physical server environment. During this planning phase, cloud managers can size the environment, know what workloads they are delivering and truly understand available resources.

      Distributed computing allows users to log in from any device, anywhere, at any time. This means an organization’s cloud environment must be able to handle user fluctuations -- particularly for international companies, whose users log in from various time zones. Without good server load balancing, a cloud environment can experience degraded performance as cloud servers take on more workloads than they’re capable of. …

Administrators must take time to evaluate which workloads are being deployed into the cloud, because each will have different effects on the cloud-based server. For example, if an environment is looking to deploy a virtual desktop environment, it must know the image size and how many users can safely reside on one physical server. Load balancing determines size and properly configures hardware at the server level. If a server becomes overloaded, a resource lock will occur, which can degrade performance and affect the end-user experience.

      Visibility into the cloud
      A company with multiple cloud locations must have visibility into remote data centers to avoid complications and maintain server health. By monitoring what’s running on cloud servers and setting up alerts when issues arise, IT admins are able to take proactive measures to load-balance the entire environment.

Deploying end-point monitoring tools can help with visibility. If a server’s resources are being consumed at a dangerously high rate, an engineer needs to know so he can resolve the issue quickly. Constant visibility -- monitoring who is accessing cloud machines and how dense the user count is -- can help alleviate load balancing issues.

Having visibility into your cloud presence can help you understand how resources are being used. Results can be used to determine how to properly allocate user numbers or recognize whether the environment needs additional servers to support workloads.

      Load-balancing tactics within a private cloud
      One misconception among data center and cloud managers is that load balancing is primarily a server-based function. The reality is that admins must monitor and load balance multiple devices within a cloud environment. Server load balancing is not a difficult process -- as long as it’s done proactively.

      Servers. Physical resources on a server are finite. Without proper monitoring and load balancing, an entire system can become overloaded by workloads and users. When working with data centers in the cloud, it’s important to look at the physical hosts and virtual servers running on them. …

      If a company is running a private cloud and pushing out applications using Citrix’s XenApp, for example, it must know how many apps are installed on the server and how many users it can safely support. By sizing the machine based on this information, administrators can set a cap on user count and disable additional logons once the threshold is met. Any new users will log in to a different server that has been made available for load balancing purposes.

      Access gateways. If an access gateway breaks down, so will the ability to launch cloud workloads. Global Server Load Balancing (GSLB) is one feature available on Citrix’s NetScaler appliance that can help administrators create a robust and redundant environment. If one location goes down, GSLB detects the connection loss and immediately load balances to the next available appliance, allowing continuous access into an environment -- even if a device has failed.

      Security devices. Each security device only accepts a certain amount of connections; having a backup device in case of failure is important. Properly sizing a security appliance will depend on the cloud environment and the number of users accessing it. The ability to authenticate users across the WAN is important to maintain uptime and environment stability.

      Network infrastructure. Cloud traffic bottlenecks that occur due to a poorly designed switching infrastructure can cost a company money in degraded performance and can result in man hours spent troubleshooting and fixing the issue. Network admins should start with a good core switch and have a secondary switch available. By monitoring the amount of traffic passing through the network, admins will know if the environment is properly sized or if it needs more hardware.

      Full disclosure: I’m a paid contributor to SearchCloudComputing.com.


      Bill Claybrook explained Self-service, security and storage tools for the private cloud in a 1/10/2012 article for the SearchCloudComputing.com blog:

      imagePrivate clouds often require the use of third-party tools for tasks such as migrating applications, automating virtual machine provisioning and monitoring the environment. Three other facets of private cloud that could benefit from use of third-party tools include service catalogs, security and storage.

Service catalogs, or self-service portals, are the crux of the private cloud. They put the power in the end users’ hands by allowing them to choose from a list of available cloud services. Without proper management or visibility into service use, your cloud can get out of control.

      Security is a major consideration in any virtualized environment, and the cloud is no different. But native security measures may not be enough and traditional security tactics won’t properly protect a cloud. And while cloud may seem to offer unlimited storage capabilities, mismanagement or improper allocation can actually increase storage use.

      Service catalogs and the self-service portal
Service catalogs and self-service portals are sometimes treated as different entities, wherein the self-service portal acts as the interface to the service catalog. In the cloud, however, these technologies are a single entity.

      A service catalog typically contains a list of services being automated and made available to users. It is the source of record for the services that IT offers to internal users. A service catalog can contain the name, description, cost and information for services delivered by the back-office IT infrastructure. It allows users to serve themselves from a menu of cloud service offerings. A well-designed and integrated service catalog is an essential ingredient of a cloud.

      When Suncorp, a financial services provider in Brisbane, Australia, was building its private cloud, an initial step was to create a service catalog. Suncorp’s service catalog contains the list of cloud services being automated for internal use and made available to business users via a self-service portal.

      Service catalogs not only provide the list of services and their characteristics to users in cloud environments, they can also be integrated with a configuration management database (CMDB). For example, if you use your service catalog to provision virtual servers and a change in physical servers -- as denoted in the CMDB via a configuration management ticket -- impacts the number of CPUs available for these virtual servers, then this change would also be reflected in the service catalog.

      The following is a list of companies that provide service catalogs and self-service portals:

      • newScale, which Cisco acquired in April 2011, is the basis for Cisco’s Intelligent Automation tools for IT portals, service catalogs and lifecycle management software. This software helps IT organizations create self-service storefronts for data center and workspace services across physical, virtual and cloud environments.
      • CA Service Catalog from CA Technologies enables organizations to define service offerings. Native multi-tenancy allows multiple physical catalogs to support multiple business models across physical, virtual and cloud environments. It uses a billing engine to automatically associate service usage with departments, cost centers and customers and can send out invoices.
      • Nimsoft Service Desk module is a component of the Nimsoft Unified Manager offering that enables users to access the service catalog and submit change requests, report incidents, etc. Nimsoft Service Catalog uses ticket templates that allow users to enter requests for a cloud service. A workflow engine automatically routes all tickets to the appropriate group based on a combination of the requesters’ information and ticket information.

      Where cloud security matters
      Companies that move from physical to virtual environments, such as clouds, need to update their security. You can’t install a traditional firewall or antivirus software on a cloud-based virtual environment; physical firewalls aren’t designed to inspect and filter the traffic originating from a hypervisor that’s running several virtual servers. Whatever protection you have, it must be able to handle various activities like starting and stopping virtual servers and moving them.

There is little to say about the importance of security in the cloud that hasn’t been said already. However, many admins tend to overlook where security is important. Hypervisor security, for example, is both critical and overlooked. If an intruder gains control of a virtual server, he may be able to gain control of the hypervisor. A whole new set of security issues is coming to the fore as enterprises allow employees to access corporate data with smartphones and tablets, such as Apple’s iPad.

      Security problems will be exacerbated if employees access back-office databases on mobile devices. Mobile clouds can help to resolve these security problems as they allow IT admins to centrally control security.

      Important security facets in the cloud include auditing, intrusion detection, access controls and antivirus protection. A number of vendors provide the distinctive security protection that clouds require:

      • Catbird’s vSecurity provides automated monitoring and enforcement for seven control areas: auditing, inventory management, configuration management, change management, access control, vulnerability management and incident response.
• Juniper Networks’ Altor VF integrates Altor’s virtual firewall technology with Juniper Networks’ Network and Security Manager and Juniper Networks’ STRM Series Security Threat Response Managers. It enables users to secure their virtual servers and cloud environments.
      • AppRiver SecureSurf cloud security suite includes email hosting, email security, archiving and Web protection services. SecureSurf, which is a relatively new addition to the AppRiver portfolio, is a Web filtering and malware protection offering. AppRiver provides its security services as a Software as a Service (SaaS).
• Barracuda Networks' Email Security Service provides a cloud-based email filtering service that can be used as a cloud protection layer for the Barracuda Spam and Virus Firewall.
      • McAfee Cloud Security suite secures email, identity traffic and Web traffic. The McAfee Cloud Security Platform offers a variety of deployment options, ranging from on-premises solutions to SaaS solutions, to a hybrid combination of both.

      Keeping cloud storage under control
      Server virtualization has lowered IT costs and improved server utilization, but its proliferation has increased the amount of storage required. Some IT managers have discovered that money saved with server virtualization is now being spent on storage.

      Virtual servers can consume up to 30% more disk space than physical servers. And VM sprawl, an unfortunate result of improperly managed virtual servers, has forced many enterprises to overhaul their data backup and disaster recovery (DR) strategies.

      Some companies have indicated that they had to upgrade storage devices to handle the extra storage required for virtual server environments such as clouds. Other companies, such as Concur Technologies, a travel and management solutions provider headquartered in Redmond, Wash., not only moved storage up a tier from Serial ATA to Integrated Drive Electronics (IDE) to resolve performance issues, it also used data deduplication.

As creating virtual servers in a private cloud becomes more commonplace in enterprises and IT organizations begin supporting mobile devices, the amount of required storage will increase significantly. This increased storage use will push us to take a more serious look at storage virtualization, data deduplication and thin provisioning, as well as a second look at data backup.

      Enterprises have a few options for handling storage issues that crop up in virtualized environments. Technologies such as storage virtualization, deduplication and thin provisioning can optimize the storage requirements of a cloud environment. And several vendors offer tools that address the increase in storage use in cloud environments.

      Some tools in this area include:

      • NetApp MultiStore, which lets users create isolated logical partitions on a single storage system such that unauthorized users cannot access information on a secured virtual partition. MultiStore allows you to easily move virtual partitions between storage systems and provide DR in the cloud.
      • DataCore SANsymphony-V storage hypervisor is a portable software package that’s used to enhance multiple disk storage systems by supplementing individual capabilities with extended provisioning, replication and performance. It offers a transparent virtual layer across consolidated disk pools, which can improve storage utilization.
      • FalconStor FDS is a LAN-based deduplication tool that reduces storage capacity. It uses a centralized management graphical user interface (GUI) that allows users to define deduplication policies. FalconStor FDS scales from a small footprint to rack-size installments that support petabytes of logical storage capacity.
      • Syncplicity's Virtual Private Cloud automatically synchronizes an unlimited number of files and folders across PCs, Macs, file servers, Google Docs and other cloud applications. It ensures that every file and file version is backed up to your own Virtual Private Cloud automatically -- on or off the corporate network.
      • Axcient RapidRestore is a hybrid storage model that includes a storage appliance and an Internet storage service. Customers can back up storage locally and online for archiving. Axcient’s RapidRestore storage appliances have capacities ranging from 500 GB to 10 TB.
• Riverbed Whitewater appliance focuses on data security, accelerates transmission of data over the Internet and ensures data availability in cloud environments. Data security and slow transmission speeds to and from clouds are major concerns of cloud users.

      Full disclosure: I’m a paid contributor to SearchCloudComputing.com.


      <Return to section navigation list>

      Cloud Security and Governance

      Doug Rehnstrom posted Compare Cloud Security to Your Security to the Learning Tree blog on 1/11/2012:

      imageThere’s an assumption people make that if they put their data in the cloud it is less secure. There are three aspects to security: confidentiality, integrity, and availability. They are known as the CIA security model.

      Confidentiality

imagePrivate data is kept confidential using encryption. This might require encrypting the data in the database. When data is transported across the Internet, it requires using the HTTPS protocol. Whether using the cloud or local servers, this does not change. It is our responsibility to secure our data no matter where it is physically stored.
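As a rough illustration of encrypting a value before it ever reaches the database, here is a minimal Python sketch using the third-party cryptography package (my choice of library, not one the post prescribes), with a made-up column value:

from cryptography.fernet import Fernet

# Illustrative only: in practice the key would come from a key management
# service, not be generated next to the data it protects.
key = Fernet.generate_key()
cipher = Fernet(key)

plaintext = b"4111-1111-1111-1111"       # hypothetical sensitive column value
ciphertext = cipher.encrypt(plaintext)    # this is what gets stored in the database
assert cipher.decrypt(ciphertext) == plaintext
print("stored value begins:", ciphertext[:16])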

      Integrity

Integrity is maintained in distributed systems by verifying that messages sent between computers have not been tampered with. This is also achieved by using the HTTPS protocol. Again, this does not change when using the cloud.
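HTTPS handles this transparently, but as a rough sketch of the underlying idea, here is a minimal Python example that uses a keyed hash (HMAC) from the standard library to detect tampering; the shared key and messages are made up for illustration:

import hmac
import hashlib

SHARED_KEY = b"example-shared-secret"  # illustrative only; TLS negotiates keys for you

def sign(message: bytes) -> str:
    """Produce a MAC the receiver can recompute to detect tampering."""
    return hmac.new(SHARED_KEY, message, hashlib.sha256).hexdigest()

def verify(message: bytes, mac: str) -> bool:
    """Constant-time comparison of the received and recomputed MACs."""
    return hmac.compare_digest(sign(message), mac)

if __name__ == "__main__":
    msg = b"transfer $100 to account 42"
    mac = sign(msg)
    print(verify(msg, mac))                              # True: message intact
    print(verify(b"transfer $900 to account 42", mac))   # False: tampering detected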

      Availability

      Data should only be made available to those who are allowed to see it. This is done through some sort of authentication process, along with rules that govern access to the data. Authentication can be done using passwords, digital certificates, biometrics, passcodes, keys etc.

      Securing the Infrastructure

Without a secure infrastructure, you can’t achieve the CIA elements of security. Servers must be patched, firewalls need to be configured, access to physical hardware needs to be limited, intrusion-detection systems need to be put in place, etc. Securing the infrastructure is very expensive and requires a great deal of administration.

      This is where we can take advantage of a cloud provider’s economies of scale and expertise, to make our systems more secure! The fact is, very few people can afford to do what Microsoft and Amazon do to secure their data centers. And even if you can afford it, do you have the people who know how to do it?

To better understand why this is so, read the links below, which describe what Microsoft and Amazon do to secure their data centers. Then compare what they do to what your organization does. You will likely realize that your data would be considerably MORE secure in the cloud than it is in your computer room.

      Links

      Windows Azure Security Overview – Microsoft

      AWS Security and Compliance Center – Amazon Web Services

      If you want to learn more about cloud computing and how it can benefit your organization, come to one of the courses in Learning Tree’s Cloud Computing curriculum.


      David Navetta (@DavidNavetta) described The Legal Implications of Social Networking Part Three: Data Security in a 1/9/2012 post to the Info Law Group blog:

imageIn 2011, InfoLawGroup began its “Legal Implications” series for social media by posting Part One (The Basics) and Part Two (Privacy). Well, after 4th-quarter year-end madness and a few holidays, Part Three is ready to go. In this post, we explore how security concerns and legal risk arise and interact in the social media environment. Again, the intended audience for this blog post is organizations seeking to leverage social media and to understand and address the risks associated with its use.

imageAs might be expected, criminals view social media networks as fertile ground for committing fraud. There are three main security-related issues that pose potential legal risk. First, to the extent that employees are accessing and using social media sites from company computers (or, increasingly, from personal computing devices connected to company networks or storing sensitive company data), malware, phishing and social engineering attacks could result in security breaches and legal liability. Second, spoofing and impersonation attacks on social networks could pose legal risks. In this case, the risk includes fake fan pages or fraudulent social media personas that appear to be legitimately operated. Third, information leakage is a risk in the social media context that could result in an adverse business and legal impact when confidential information is compromised.

      Social Media = Social Engineering

      One of the biggest social media security risks reveals itself in the name of the medium itself: social media yields social engineering. In short, when it comes to social media attacks, an organization’s own employees may be its worst enemy. Fraudsters leverage the central component of social media that makes it so attractive: trust between “friends.” Social media users may be tricked into downloading applications infected with malware because a posting was “recommended” by a friend. For example, almost immediately after Osama Bin Laden was killed by U.S. troops, one Facebook scam inserted malware on computers using a malicious (and false) link to the “real” Osama Bin Laden dead body photo that looked like it was posted on a friend’s wall. In addition, some scams have used messaging capabilities within social media platforms to initiate computer attacks. Unfortunately, if a company's employee is scammed and downloads malware from a social media network to the company network, it may be the company that faces legal liability.

      In addition, fraudsters use the trust users place in the social media platform itself to effectuate security breaches. For example, most would feel fairly comfortable clicking on an advertisement displayed on Facebook. However, in some cases that click could result in a “malvertisement” infection.

      Another common attack technique is phishing. Criminals create fake email notices that appear to come from social media sites. Unsuspecting users that click on links in these emails may end up providing sensitive information to fake websites that look like the social media site they belong to, or downloading malware onto a company’s system. Unfortunately, even an employee just giving up his or her personal social media passwords can be risky for a company. Many individuals use the same passwords at multiple sites and disclosing a social media password could also amount to providing the password to the network of an employee’s employer.

      There is increasing evidence that criminals are using social media to target key company personnel in order to burrow into company networks and steal trade secrets and other sensitive information. The wealth of personal information users share on social media sites provides ammunition for such attacks. Fraudsters can gather details about a user before engaging in an attack (e.g. employer, address, phone number, friends, affiliated companies, etc.) and then use the details to target the attack specifically at the individual(s) (such as a phishing email). In fact, this very technique appears to have been used in one of the biggest breaches of 2011, the RSA breach.

With regard to legal risk, companies suffering a breach arising out of social media face the same risks as for any security breach. If malware infects a system or an employee is tricked into providing his or her login credentials, and confidential or personal information is stolen, the employer may face lawsuits or regulatory scrutiny. Actions alleging breaches of NDAs may also come from third parties whose trade secrets or other confidential information a company holds. Moreover, if personal information is accessed or acquired due to the social media security breach, notification may be necessary and related costs would have to be incurred by the employer.

      Social Media Spoofing and Hijacking

      Companies may also face legal liability for failing to detect and notify social media users of scams associated with the company’s social media site or key personnel with social media presences. If an organization becomes aware of a spoofed fan page that looks like its own, or a criminal disseminating a malware-infested social media application that looks like it is sponsored by the organization, legal repercussions could arise. Similarly, fraudsters could create fake profiles of key company personnel in order to commit crimes.

      Security and legal risks can also arise if hackers are able to take over a company’s fan page or social media profiles of key company personnel. By creating a fake fan page or profile, or hijacking an existing fan page or profile, fraudsters could send out messages with malware to all of the individuals who joined the fan page or trick customers into disclosing sensitive information. From the legal risk perspective, while case law is sparse, companies that fail to have fake fan pages removed or that fail to warn their customers of scams that look like they come from the company, could face legal liability.

      Confidential Information Leakage

      Another important business and legal risk arises out of potential confidential information leakage on social media sites.

Imagine a company that is heavily reliant on traditional sales methods and has built up a customer list (a trade secret) with key, difficult-to-find contacts. Oftentimes, companies like this rely on key salespeople to bring in large portions of their revenue. Perhaps seeking to be on top of modern marketing practices, some of these salespeople establish LinkedIn accounts and naturally begin linking to dozens or perhaps hundreds of friends, colleagues and customers. On LinkedIn, if settings are not set properly, all of the contacts related to these key salespeople could be publicly viewable. That being the case, it would not be difficult for a competitor to simply view and record those contacts, thereby potentially exposing the company’s customer list and key customer contacts.

Take it one step further. Suppose one of the key salespeople leaves with the customer list and the company sues, alleging misappropriation of a trade secret. One of the elements for establishing a trade secret is the effort made to keep the secret confidential. However, by allowing the salesperson to display all of his contacts on LinkedIn, has the company effectively failed to maintain that confidentiality and lost its trade secret protection?

      In 2010, we saw an Eastern District of New York case that looked at this issue and ruled that trade secret protection was unavailable for a company where the customer list information at issue could be readily ascertained using sites like Google and by viewing LinkedIn profiles. In contrast, in 2011, the court in Syncsort Incorporated v. Innovative Routines, International, Inc., looked at the issue of whether a trade secret posted on the Internet loses its protection. While the court ruled that trade secret protection was not lost under the facts of Syncsort (where only a portion of the trade secret was available for a limited time), it appears that a different set of facts could yield a decision going the other direction.

The inadvertent disclosure of confidential information by employees may also be problematic for organizations. This problem can arise when employees mistakenly or unknowingly disclose sensitive information. For example, in September 2011 a Hewlett-Packard executive updated his LinkedIn status and revealed previously undisclosed details of HP's cloud-computing services. If he had instead posted confidential information about one of HP’s clients, it may have resulted in legal liability. Moreover, for publicly traded companies, certain inadvertent disclosures of financial information could lead to violations of securities laws and regulations.

      Even if confidential information is not directly put into a single status update or other post, the aggregated social media postings of multiple employees could yield valuable competitive information. Companies (on their own or through third party service providers) are actively data mining social media sites with the hope of gathering enough bits and pieces of information to provide a competitive edge. Employees may be unwittingly posting what they think is a single piece of non-sensitive data. However, when combined with multiple data points from other employees and sources, those innocent disclosures could suddenly reveal company or client confidential information.

      Conclusion

In summary, the key security-related legal concerns associated with social media start with the fact that social media provides a rich target environment for criminals. Social media users are literally volunteering information that may be sensitive, the disclosure of which could lead to legal risk. The culture of sharing on social media sites can itself lead to over-disclosure by employees, and the sheer volume of data that can be mined from social media sites may allow competitors and criminals to connect the dots and reveal confidential or sensitive information. Moreover, the sense of trust that comes with social media environments provides an opportunity for criminals to breach security. People may be tricked into providing certain information or downloading malware because they think they are having legitimate communications with colleagues or friends. Finally, the ability to easily spoof or create fake sites or pages on social media platforms that look legitimate can lead to increased security risk. With this increased security risk comes increased legal and liability risk (in an area of law that is very unsettled in terms of who can be liable for a security breach, and to what extent).

How can these risks be addressed and mitigated? First, it is key to understand the social media environment and how the various social media platforms work. The unique characteristics of a particular social media platform may present risks specific to that platform. Second, organizations need to develop a social media strategy to maximize their leveraging of social media while minimizing risk (Are employees allowed to use their social media sites from work computers? Can they talk about the company and its plans on social media sites? What company information can they share on social media sites? Should only a handful of marketing-oriented employees be allowed to post about or on behalf of an organization? Can the company monitor social media usage?). Once the strategy is developed, social media policies need to be drafted to reflect the strategy and address risks. In the security context, a big part of minimizing risk is educating and training employees and providing guidance on how to avoid or minimize it. Technology solutions may also exist that allow for monitoring and tracking of social media usage by employees. Ultimately, however, like social media itself, it comes down to people -- risk can only be addressed appropriately if the individuals using social media are equipped to identify and mitigate it.


      Chris Hoff (@Beaker) posted a QuickQuip: Vint Cerf “Internet Access Is Not a Human Right” < Agreed… on 1/9/2012:

      Wow, what a doozy of an OpEd!

Vint Cerf wrote an article for the NY Times with the title “Internet Access Is Not a Human Right,” wherein he suggests that Internet access and the technology that provides it is “…an enabler of rights, not a right itself” and that “…it is a mistake to place any particular technology in this exalted category [human right,] since over time we will end up valuing the wrong things.”

This article is so rich in interesting points that I could spend hours both highlighting points to agree with and squinting sternly at many of them.

      It made me think and in conclusion, I find myself in overall agreement. This topic inflames passionate debate — some really interesting debate — such as that from Rob Graham (@erratarob) here [although I'm not sure how a discussion on Human rights became anchored on U.S. centric constitutional elements which don't, by definition, apply to all humans...only Americans...]

      This ends up being much more of a complex moral issue than I expected in reviewing others’ arguments.

      I’ve positioned this point for discussion in many forums without stating my position and have generally become fascinated by the results.

      What do you think — is Internet access (not the Internet itself) a basic human right?

      /Hoff

The image is Hoff boxing with Vint Cerf. As far as I’m concerned, Internet access is not a basic human right (nor is postal service). Both are conveniences.


      <Return to section navigation list>

      Cloud Computing Events

      Brent Stineman (@BrentCodeMonkey) posted Windows Azure & PHP (for nubs)–Side note from CodeMash 2.0.1.2 on 1/13/2012:

imageWriting this from CodeMash 2.0.1.2. First time here as an attendee and a presenter. Having a great time meeting many great people. Yesterday I presented on Windows Azure with PHP (see the deck and demo script in the resource area) and it went smoothly except for a minor demo hiccup.

I tell folks that even after 3 years of working with the platform, I still learn things. Well, once I dug into the root cause of my demo hiccup, it turned out I had learned something about the Azure SDK tools (something that my use of Visual Studio always hid from me).

During the demo, I built and deployed a project locally, doing a “package create” with “-dev=true”. It turns out the PHP SDK uses the Azure SDK tools cspack and csrun to get my project packaged and deployed to the Azure Development Emulator. And when it’s doing this for the local emulator, the package isn’t encrypted as it would be for deployment to a hosted Windows Azure service. Combine that with my system settings to automatically open zip files as if they were folders, and the upload dialog at windows.azure.com simply wouldn’t let me select the file for deployment.

Oops!

But by re-running “package create” with “-dev=false”, the SDK uses cspack to create the encrypted file for deployment to the cloud.
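A quick way to catch this mistake before uploading, assuming (as the hiccup above suggests) that an emulator package built with “-dev=true” is just an unencrypted zip archive while a cloud package is not, is a small Python check; the package file name is hypothetical:

import sys
import zipfile

def looks_like_emulator_package(path: str) -> bool:
    # A -dev=true package opens as a plain zip; a -dev=false (cloud) package does not.
    return zipfile.is_zipfile(path)

if __name__ == "__main__":
    pkg = sys.argv[1] if len(sys.argv) > 1 else "MyPhpApp.cspkg"  # hypothetical path
    if looks_like_emulator_package(pkg):
        print("Looks like an emulator package - rebuild with -dev=false before uploading.")
    else:
        print("Not a readable zip - likely packaged for the cloud.")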

I’ve already updated the presentation materials listed above, so the demo script now makes it really clear that this is the case. As I prepare to pack up and leave CodeMash, it’s good to know that I learned not just from other awesome presenters but also managed to teach myself something in the process.

      PS – if you have the opportunity to attend CodeMash next year, I couldn’t recommend it enough!


      The Microsoft Server and Cloud Platform Team reported on 1/12/2012 a Transforming IT with Microsoft Private Cloud - Live Webcast with Satya Nadella and Brad Anderson on 1/17:

As we kick off the new year, Microsoft will be holding a live webcast, “Transforming IT with Microsoft Private Cloud”, on January 17 at 8:30 AM PST. Join this event to hear more from Satya Nadella, president of Microsoft’s Server and Tools Business, and corporate vice president Brad Anderson on how Microsoft’s private cloud solution can help organizations drive greater results and gain maximum competitive advantage in their cloud computing journey. In addition, customers already utilizing the Microsoft private cloud solution will share how it has helped them better meet their business needs today.

      Register here to attend the event. We hope you’ll take advantage of this opportunity to take a closer look at Microsoft’s private cloud solution – Built for the Future. Ready Now. You can learn more about private cloud today by visiting the Microsoft Private Cloud Web page.

      To register, close the page with the video ad to display the registration page under it.


Mick Badran (@mickba) announced I’m presenting this month at the Windows Azure Sydney User Group (WASUG) in a 1/8/2012 post:

      imageThought I’d start off the year with a bang around Azure and what’s been happening in the land of Integration.

So I contacted Conor Brady to see what was cooking.

      imageThe user group is meeting next Thursday 19th Jan 2012.

Here’s the blurb…

      'Integration using Windows Azure Application Integration Services'

Local integration and training guru Mick Badran, CTO at Breeze Training & Consulting and veteran BizTalk Server MVP, will present on 'Integration using Windows Azure Application Integration Services'.

      The presentation will show how to use Microsoft Windows Azure to be the cornerstone of your integration strategy, whether it’s a small piece or larger deployment. Find out what new tools you can use to extend your existing toolbox and the best way to use them.

      This session will cover:

      • Strategies on complementing your on-premise <-> cloud integration and what tool to use when.
      • High availability solutions with a demo of fault tolerance.
• Casting an eye at what’s around the corner: new features coming out of Azure Labs such as EAI, EAI Bridges, EDI (Azure style) and new XML over HTTP endpoints.

      Here’s the link to REGISTER - http://www.eventbrite.com/event/2739308345


      <Return to section navigation list>

      Other Cloud Computing Platforms and Services

Tony Baer asserted “Oracle has announced the general availability of a big data appliance” in a deck for his Oracle Fills Another Gap in Its Big Data Offering article of 1/13/2012 to Dana Gardner’s Briefings Direct blog:

      When we last left Oracle’s big data plans, there was definitely a missing piece. Oracle’s Big Data Appliance as initially disclosed at last fall’s OpenWorld was a vague plan that appeared to be positioned primarily as an appliance that would accompany and feed data to Exadata. Oracle did specify some utilities, such as an enterprise version of the open source R statistical processing program that was designed for multithreaded execution, plus a distribution of a NoSQL database based on Oracle’s BerkeleyDB as an alternative to Apache Hive. But the emphasis appeared to be extraction and transformation of data for Exadata via Oracle’s own utilities that were optimized for its platform.

As such, Oracle’s plan for Hadoop was to compete not with Cloudera (or Hortonworks), which featured a full Apache Hadoop platform, but with EMC, which offered a comparable, appliance-based strategy that pairs Hadoop with an Advanced SQL data store, and with IBM, which took a different approach by emphasizing Hadoop as an analytics platform destination enhanced with text and predictive analytics engines, and other features such as unique query languages and file systems.

      Oracle’s initial Hadoop blueprint lacked explicit support of many pieces of the Hadoop stack such as HBase, Hive, Pig, Zookeeper, and Avro. No more. With Oracle’s announcement of general availability of the big data appliance, it is filling in the blanks by disclosing that it is OEM’ing Cloudera’s CDH Hadoop distribution, and more importantly, the management tooling that is key to its revenue stream. For Oracle, OEM’ing Cloudera’s Hadoop offering fully fleshes out its Hadoop distribution and positions it as a full-fledged analytic platform in its own right; for Cloudera, the deal is a coup that will help establish its distribution as the reference. It is fully consistent with Cloudera’s goal to become the Red Hat of Hadoop as it does not aspire to spread its footprint into applications or frameworks.

      Question of acquisition

Of course, whenever you put Oracle in the same sentence as an OEM deal, the question of acquisition inevitably pops up. There are several reasons why an Oracle acquisition of Cloudera is unlikely.

      1. Little upside for Oracle. While Oracle likes to assert maximum control of the stack, from software to hardware, its foray into productizing its own support for Red Hat Enterprise Linux has been strictly defensive; its offering has not weakened Red Hat.
2. Scant leverage. Compare Hadoop to MySQL and you have a Tale of Two Open Source projects. One is hosted and controlled by Apache, the other is hosted and controlled by Oracle. As a result, while Oracle can change licensing terms for MySQL, which it owns, it has no such control over Hadoop. Were Oracle to buy Cloudera, another provider could easily move in to fill the vacuum. The same would happen to Cloudera if, as a prelude to such a deal, it began forking from the Apache project with its own proprietary add-ons or substitutions.

OEM deals are a major stage of building the market. Cloudera has used its first-mover advantage with Hadoop well, with deals with Dell and now Oracle. Microsoft, in turn, has decided to keep the “competition” honest by signing up Hortonworks to (eventually) deliver the Hadoop engine for Azure.

      OEM deals are important for attaining another key goal in developing the Hadoop market: defining the core stack – as we’ve ranted about previously. Just as Linux took off once a robust kernel was defined, the script will be identical for Hadoop. With IBM and EMC/MapR forking the Apache stack at the core file system level, and with niche providers like Hadapt offering replacement for HBase and Hive, there is growing variability in the Hadoop stack. However, to develop the third party ecosystem that will be vital to the development of Hadoop, a common target (and APIs for where the forks occur) must emerge. A year from now, the outlines of the market’s decision on what makes Hadoop Hadoop will become clear.

      The final piece of the trifecta will be commitments from the Accentures and Deloittes of the world to develop practices based on specific Hadoop platforms. For now they are still keeping their cards close to their vests.

      You may also be interested in:


      Werner Vogels (@werner) invited you on 1/13/2012 to a Countdown to What is Next in AWS on 1/18/2012:

      imageJoin me at 9AM PST on Wednesday January 18, 2012 to find out what is next in the AWS Cloud. Registration required.

      Watch live streaming video from AWSCloudEvent at livestream.com


      Adron Hall (@adron) posted Devops Invades with PaaS & CloudFoundry on 1/11/2012:

      imageI have jumped head first into CloudFoundry over the last few weeks. In doing so I’ve started working with AppFog, IronFoundry, VMware and other devops tools. There are several avenues I’m taking to get more familiar with CloudFoundry based PaaS technology. Here’s a short review:

      Writing

      imageI started writing a series which is being published by New Relic around “Removing the Operating System Barrier with Platform as a Service“. Part 1 is live NOW – so go give it a read! :)

      Working

Currently I’ve been working up some enterprise prototypes using the IronFoundry technology. The idea is to provide a seamless deployment option for enterprises that may have a very mixed environment of public and private computing options, virtualized and non-virtualized environments, and an array of other capabilities. I’ve also been toying around with Windows 2008 Server Core, which I’ll have more about shortly.

      Public Cloud AppFog

AppFog provides a public-facing PaaS supporting PHP, Ruby on Rails, Java, MongoDB and a lot of other packages. They’re currently in beta, which I was fortunate enough to snag access to, but I’m sure the covers will come off soon enough! The underlying technology is built on CloudFoundry, providing a robust, scalable and capable foundation on which to offer PaaS.

      In addition to AppFog there is the CloudFoundry.com offering, which I’ve tested out a little bit, but mostly focused on AppFog and on building out…

      Private Cloud Capabilities w/ Public Cloud Style Infrastructure

      I’ve built out some images to test out how CloudFoundry and IronFoundry works. I did pull down the provided virtual machines but I’m also building out my own to understand it better. The Ruby + C# that I’ve seen from the VMware crew & Tier 3 team has been great so far (I always dig reading some solid code).

      That’s it for this short review, more to come, and let me know what you think of my entry “Removing the Operating System Barrier with Platform as a Service” over on New Relic’s Blog.


      Guy Harrison described Oracle's Public Cloud in a 1/11/2012 article for Database Trends & Applications’ January 2012 issue:

      imageAlong with thousands of IT professionals, I was in the San Francisco Moscone Center main hall last October listening to Larry Ellison's 2011 Oracle Open world keynote. Larry [pictured at right] can always be relied upon to give an entertaining presentation, a unique blend of both technology insights and amusingly disparaging remarks about competitors.

imageLarry had made his major technical announcements and was performing a long live demonstration of the new Fusion applications. Like many, I turned to Twitter for distraction and was stunned to see the first breaking news of Steve Jobs’ death. You could virtually feel the shock wave and sadness build in the hall in those last 10 minutes, as an increasing number of us became aware of the sad news through our iPads, laptops or phones.

Steve Jobs’ passing - which must have been particularly distressing for Larry Ellison, who was a close personal friend - overshadowed what would normally have been a very big announcement for our industry: the announcement of the Oracle Public Cloud. Cloud computing being the dominant IT buzzword - and Oracle being the software behemoth that it is - Oracle’s entry into the public cloud space is very big news, indeed.

      imageOracle and Larry Ellison have been famously sceptical about cloud computing mania, at one point claiming, "It's absurdity - it's nonsense ... What are you talking about? It's not water vapor. It's a computer attached to a network!" Despite that, it's been obvious for some time that Oracle has been patiently assembling foundation technologies that would position them to compete in enterprise cloud computing.

      The standard cloud computing taxonomy identifies three types of public clouds:

      • Infrastructure as a Service (IaaS) is the provision of raw compute and storage across the Internet, effectively the ability to create virtual machines and storage devices. Amazon Web Services is the canonical example.
      • Platform as a Service (PaaS) provides a complete application framework. You "drop" your code into the service, and computing resources are made available on demand. Microsoft Azure offers this sort of cloud for .NET applications.
      • Software as a Service (SaaS) provides a complete packaged application across the internet. Salesforce.com is a well-known example of a SaaS application.

The Oracle Public Cloud is both a SaaS offering of Oracle’s Fusion applications and a PaaS for Java applications.

      The SaaS side of the cloud runs standard versions of Oracle's new CRM and HCM applications fully hosted in an Oracle data center. The new breed of Oracle Fusion applications, though long overdue, are impressively complete, integrated, and boast modern web 2.0 features such as integrated social networking.

      The Java Cloud Service offers the ability to deploy a Java application to WebLogic services hosted by Oracle. Standard Java Enterprise applications should work with little or no modification and, of course, Oracle extensions are also provided.

      Oracle also offers a limited Database as a Service (DBaaS) offering, which allows users to directly manage a cloud-based database schema, although in a fairly limited way (no transactions are allowed from remote clients, for instance).

      All of these offerings will be made available on a monthly subscription basis.

The WebLogic and database servers that power the Oracle Public Cloud are hosted on Oracle Exalogic Java servers and Exadata database servers. Oracle has claimed, very credibly, that these servers can provide cost-effective consolidation of disparate workloads. By using these servers as the basis for its own public cloud, Oracle is demonstrating its own confidence in the cost effectiveness and scalability of the “Exa” product line. Furthermore, Oracle is establishing a hardware/software architecture that can be mirrored in the customer’s own data center - raising the possibility of powering a hybrid public/private cloud in the future.


      Jeff Barr (@jeffbarr) reported AWS Direct Connect - Now Available in Four Additional Locations on 1/10/2012:

      imageAWS Direct Connect lets you create a dedicated network connection between your office, data center, or colocation facility to an AWS Region. You might want to do this for privacy, to reduce your network costs, or to get a more consistent network experience than is possible across the Internet.

      imageWe launched AWS Direct Connect in US East (Northern Virginia) this past summer and we expanded it to Silicon Valley shortly thereafter.

      Today we are making Direct Connect available in four more locations. Here's the complete list of Regions and the associated data centers:

      Two of the locations listed above are not in the same city as the associated AWS Region. These locations provide you with additional flexibility when connecting to AWS from those cities.

      You can initiate the Direct Connect provisioning process by simply filling out a form:


      Jeff Barr (@jeffbarr) announced Additional Reserved Instance Options for Amazon RDS on 1/9/2012:

      imageHot on the heels of our announcement of Additional Reserved Instance Options for Amazon EC2, I would like to tell you about a similar option for the Amazon Relational Database Service.

      We have added Light and Heavy Utilization Reserved Instances for the MySQL and Oracle database engines. You can save 30% to 55% of your On-Demand DB Instance costs, depending on your usage.

imageLight Utilization Reserved Instances offer the lowest upfront payment and are ideal for DB instances that are used sporadically for development and testing, or for short-term projects. You can save up to 30% on a 1-year term and 35% on a 3-year term when compared to the same instance on an On-Demand basis.

      Medium Utilization Reserved Instances have a higher upfront payment than Light Utilization Reserved Instances, but a much lower hourly usage fee. They are suitable for workloads that run most of the time, with some variability in usage. Savings range up to 35% for a 1-year term and 48% for a 3-year term when compared to On-Demand. These are the same Reserved Instances that we have offered since August 2010.

Heavy Utilization Reserved Instances are the best value for steady-state production database instances that are destined to run 24x7. With this type of Reserved Instance, you pay an upfront fee and a low hourly rate for every hour of the one- or three-year term. You can save 41% for a 1-year term and 55% for a 3-year term.

      These Reserved Instance offerings allow you to optimize your costs depending on your workload. The table below shows which Amazon RDS offerings you can use to lower your RDS costs. For example, if you need a DB instance for 5 months, a Light Utilization Reserved Instance will provide you the lowest effective cost.

      image
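As a rough way to compare the options for a planned workload, here is a minimal Python sketch; the upfront fees and hourly rates are placeholder numbers rather than actual AWS prices (check the Amazon RDS pricing page), and it deliberately simplifies Heavy Utilization billing, which actually charges its hourly rate for every hour of the term:

HOURS_PER_MONTH = 730  # approximate hours in a month

# name: (upfront fee in dollars, hourly rate in dollars) -- placeholder figures only
options = {
    "On-Demand":                (0.0,   0.100),
    "Light Utilization (1 yr)": (160.0, 0.060),
    "Heavy Utilization (1 yr)": (390.0, 0.025),
}

def effective_cost(upfront: float, hourly: float, hours: float) -> float:
    """Simplified total cost for the planned number of usage hours."""
    return upfront + hourly * hours

for months in (3, 6, 12):
    hours = months * HOURS_PER_MONTH
    cheapest = min(options.items(), key=lambda item: effective_cost(*item[1], hours))
    print(f"{months:>2} months of steady use -> cheapest option: {cheapest[0]}")

Plugging in the current published rates and your expected hours of use makes the break-even points between the three models easy to see.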

      Learn more about this feature and other RDS pricing options on the Amazon RDS pricing page.

      As always, we enjoy lowering our prices so that AWS becomes an even better value for you.


      <Return to section navigation list>
