Amazon Web Services: CloudFront Update - Trends, Metrics, Charts, More Timely Logs

The Amazon CloudFront team has added a slew of analytics and reporting features this year. I would like to recap a pair of recent releases and then introduce you to the features that we are releasing today. As you probably know, CloudFront is a content delivery web service that integrates with the other parts of AWS for easy and efficient low-latency delivery of content to end users.

CloudFront Usage Charts
We launched a set of CloudFront Usage Charts back in March. The charts let you track trends in data transfer and requests (both HTTP and HTTPS) for each of your active CloudFront web distributions. Data is shown with daily or hourly granularity. These charts are available to you at no extra charge. You don't have to make any changes to your distribution in order to collect the data or to view the charts. Here is a month's worth of data for one of my distributions:

You can easily choose the distribution of interest, the desired time period and the reporting granularity:

You can also narrow down the reports by billing region:

Operational Metrics
Earlier this month CloudFront began to publish a set of Operational Metrics to AWS CloudWatch. These metrics are published every minute and reflect activity that's just a few minutes old, giving you information that is almost real-time in nature. As is the case with any CloudWatch metric, you can display and alarm on any of the items. The following metrics are available for each of your distributions:

  • Requests - Number of requests for all HTTP methods and for both HTTP and HTTPS requests.
  • BytesDownloaded - Number of bytes downloaded by viewers for GET, HEAD, and OPTIONS requests.
  • BytesUploaded - Number of bytes uploaded to the origin with CloudFront using POST and PUT requests.
  • TotalErrorRate - Percentage of all requests for which the HTTP status code is 4xx or 5xx.
  • 4xxErrorRate - Percentage of all requests for which the HTTP status code is 4xx.
  • 5xxErrorRate - Percentage of all requests for which the HTTP status code is 5xx.

The first three metrics are absolute values and make the most sense when you view the Sum statistic. For example, here is the hourly request rate for my distribution:

The other three metrics are percentages and the Average statistic is appropriate. Here is the error rate for my distribution (I had no idea that it was so high and need to spend some time investigating):

Once I track this down (a task that will have to wait until after AWS re:Invent), I will set an Alarm as follows:

The metrics are always delivered to the US East (Northern Virginia) Region; you'll want to make sure that it is selected in the Console's drop-down menu. Metrics are not emitted for a distribution with no traffic, so a metric may not appear in CloudWatch until the distribution receives requests.
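
If you prefer the command line, you can pull the same metrics and create an alarm with the AWS CLI. This is a minimal sketch, assuming the CLI is installed and configured; the distribution ID, time range, threshold, and SNS topic ARN are placeholders you would replace with your own:

# Hourly request counts for one distribution (CloudFront metrics live in us-east-1,
# in the AWS/CloudFront namespace, with DistributionId and Region=Global dimensions).
aws cloudwatch get-metric-statistics \
    --region us-east-1 \
    --namespace AWS/CloudFront \
    --metric-name Requests \
    --dimensions Name=DistributionId,Value=EDFDVBD6EXAMPLE Name=Region,Value=Global \
    --start-time 2014-10-20T00:00:00Z --end-time 2014-10-21T00:00:00Z \
    --period 3600 --statistics Sum

# Alarm when the average TotalErrorRate exceeds 5% over a 5-minute period.
aws cloudwatch put-metric-alarm \
    --region us-east-1 \
    --alarm-name cloudfront-high-error-rate \
    --namespace AWS/CloudFront \
    --metric-name TotalErrorRate \
    --dimensions Name=DistributionId,Value=EDFDVBD6EXAMPLE Name=Region,Value=Global \
    --statistic Average --period 300 --evaluation-periods 1 \
    --threshold 5 --comparison-operator GreaterThanThreshold \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:my-alert-topic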

New - More Timely Logs
Today we are improving the timeliness of the CloudFront logs. There are two aspects to this change. First, we are increasing the frequency with which CloudFront delivers log files to your Amazon Simple Storage Service (S3) bucket. Second, we are reducing the delay between data collection and data delivery. With these changes, the newest log files in your bucket will reflect events that have happened as recently as an hour ago.

We have also improved the batching model as part of this release. As a result, many applications will see fewer files now than they did in the past, despite the increased delivery frequency.
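
One quick way to see the effect, assuming the AWS CLI is configured and your distribution already logs to an S3 bucket, is to list the most recently delivered log files (the bucket name and prefix below are placeholders):

aws s3 ls s3://my-log-bucket/cloudfront/ --recursive | sort | tail -5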

New - Cache Statistics & Popular Objects Report
We are also launching a set of new Cache Statistics reports today. These reports are based on the entries in your log files and are available on a per-distribution and all-distribution basis, with day-level granularity for any time frame within a 60-day period and hour-level granularity for any 14-day interval within the same 60-day period. These reports allow filtering by viewer location. You can, for example, filter by continent in order to gain a better understanding of traffic characteristics that depend on the geographic location of your viewers.

The following reports are available:

  • Total Requests - This report shows the total number of requests for all HTTP status codes and all methods.
  • Percentage of Viewer Requests by Result Type - This report shows cache hits, misses, and errors as percentages of total viewer requests.
  • Bytes Transferred to Viewers - This report shows the total number of bytes that CloudFront served to viewers in response to all requests for all HTTP methods. It also shows the number of bytes served to viewers for objects that were not in the edge cache (CloudFront node) at the time of the request. This is a good approximation for the number of bytes transferred from the origin.
  • HTTP Status Codes - This report shows the number of viewer requests by HTTP status code (2xx, 3xx, 4xx, and 5xx).
  • Unfinished GET Requests - This report shows GET requests that didn't finish downloading the requested object, as a percentage of total requests.

Here are the reports:

The new Popular Objects report shows request count, cache hit and cache miss counts, as well as error rates for the 50 most popular objects during the specified period. This helps you understand which content is most popular among your viewers, or identify any issues (such as high error rates) with your most requested objects. Here's a sample report from one of my distributions:

Available Now
The new reports and the more timely logs are available now. Data is collected in all public AWS Regions.

-- Jeff;

If you want to learn even more about these cool new features, please join us at 10:00 AM (PT) on November 20th for our Introduction to CloudFront Reporting Features webinar.

ProgrammableWeb: Today in APIs: Breathometer Connects Drunks to Uber with API

Breathometer, the maker of a bluetooth connected breathalyzer, uses Uber's API to connect people who need a ride. Newsly, an overnight hack, provides tailored news using machine learning APIs. Plus: hackathons head to rural colleges, and Infected Flight is the timely winner at Disrupt Europe hackathon.

ProgrammableWeb: 6 Essential BaaS Features Every Mobile App Needs

Whether you’re building a new mobile app or updating an existing one, adding BaaS features will drive an increase in user engagement and retention, not to mention provide a competitive edge over other apps.

Amazon Web Services: CloudWatch Update - Enhanced Support for Windows Log Files

Earlier this year, we launched a log storage and monitoring feature for AWS CloudWatch. As a quick recap, this feature allows you to upload log files from your Amazon Elastic Compute Cloud (EC2) instances to CloudWatch, where they are stored durably and easily monitored for specific symbols or messages.

The EC2Config service runs on Microsoft Windows instances on EC2 and takes on a number of important tasks. For example, it is responsible for uploading log files to CloudWatch. Today we are enhancing this service with support for Windows Performance Counter data and ETW (Event Tracing for Windows) logs. We are also adding support for custom log files.

In order to use this feature, you must enable CloudWatch logs integration and then tell it which files to upload. You can do this from the instance by running EC2Config and checking Enable CloudWatch Logs integration:

The file %PROGRAMFILES%\Amazon\Ec2ConfigService\Settings\AWS.EC2.Windows.CloudWatch.json specifies the files to be uploaded.

To learn more about how this feature works and how to configure it, head on over to the AWS Application Management Blog and read about Using CloudWatch Logs with Amazon EC2 Running Microsoft Windows Server.

-- Jeff;

Amazon Web Services: Speak to Amazon Kinesis in Python

My colleague Rahul Patil sent me a nice guest post. In the post Rahul shows you how to use the new Kinesis Client Library (KCL) for Python developers.

-- Jeff;


The Amazon Kinesis team is excited to release the Kinesis Client Library (KCL) for Python developers! Developers can use the KCL to build distributed applications that process streaming data reliably at scale. The KCL takes care of many of the complex tasks associated with distributed computing, such as load-balancing across multiple instances, responding to instance failures, checkpointing processed records, and reacting to changes in stream volume.

You can download the KCL for Python from GitHub or PyPI.

Getting Started
Once you are familiar with key concepts of Kinesis and KCL, you are ready to write your first application. Your code has the following duties:

  1. Set up application configuration parameters.
  2. Implement a record processor.

The application configuration parameters are specified by adding a properties file. For example:

# The python executable script 
executableName = sample_kclpy_app.py

# The name of an Amazon Kinesis stream to process.
streamName = words

# Unique KCL application name
applicationName = PythonKCLSample

# Read from the beginning of the stream
initialPositionInStream = TRIM_HORIZON

The above example configures KCL to process a Kinesis stream called "words" using the record processor supplied in sample_kclpy_app.py. The unique application name is used to coordinate amongst workers running on multiple instances.

Developers have to implement the following three methods in their record processor:

initialize(self, shard_id)
process_records(self, records, checkpointer)
shutdown(self, checkpointer, reason)

initialize() and shutdown() are self-explanatory; they are called once in the lifecycle of the record processor to initialize and clean up the record processor respectively. If the shutdown reason is TERMINATE (because the shard has ended due to split/merge operations), then you must also take care to checkpoint all of the processed records.

You implement the record processing logic inside the process_records() method. The code should loop through the batch of records and checkpoint at the end of the call; the KCL assumes that all of the records have been processed. In the event that the worker fails, the KCL uses the checkpoint information to restart processing of the shard at the last checkpointed record.

    # Process records and checkpoint at the end of the batch
    # (assumes "import base64" at the top of the module)
    def process_records(self, records, checkpointer):
        for record in records:
            # Record data is base64 encoded
            data = base64.b64decode(record.get('data'))
            # Insert your processing logic here

        # Checkpoint after you are done processing the batch
        checkpointer.checkpoint()

The KCL connects to the stream, enumerates shards, and instantiates a record processor for each shard. It pulls data records from the stream and pushes them into the corresponding record processor. The record processor is also responsible for checkpointing processed records.

Since each record processor is associated with a unique shard, multiple record processors can run in parallel. To take advantage of multiple CPUs on the machine, each Python record processor runs in a separate process. If you run the same KCL application on multiple machines, the record processors will be load-balanced across these machines. This way, KCL enables you to seamlessly change machine types or alter the size of the fleet.

Running the Sample
The release also comes with a sample word counting application. Navigate to the amazon_kclpy directory and install the package.

$ python setup.py download_jars
$ python setup.py install

A sample putter is provided to create a Kinesis stream called "words" and put random words into that stream. To start the sample putter, run:

$ sample_kinesis_wordputter.py --stream words -p 1 -w cat -w dog -w bird

You can now run the sample python application that processes records from the stream we just created:

$ amazon_kclpy_helper.py --print_command --java <path-to-java> --properties samples/sample.properties

Before running the samples, you'll want to make sure that your environment is configured to allow the samples to use your AWS credentials via the default AWS Credentials Provider Chain.
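
For example, one simple way to satisfy the default provider chain is to export your credentials as environment variables before launching the sample (the values shown are placeholders, not real keys):

export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY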

Under the Hood - What You Should Know
KCL for Python uses KCL for Java. We have implemented a Java-based daemon, called the MultiLangDaemon, that does all the heavy lifting. Our approach has the daemon spawn a sub-process, which in turn runs the record processor, which can be written in any language. The MultiLangDaemon process and the record processor sub-process communicate with each other over STDIN and STDOUT using a defined protocol. There is a one-to-one correspondence between record processors, child processes, and shards. For Python developers specifically, we have abstracted these implementation details away and exposed an interface that enables you to focus on writing record processing logic in Python. This approach enables KCL to be language-agnostic, while providing identical features and a similar parallel processing model across all languages.

Join the Kinesis Team
The Amazon Kinesis team is looking for talented Web Developers and Software Development Engineers to push the boundaries of stream data processing! Here are some of our open positions:

-- Rahul Patil

ProgrammableWeb: KPIs for APIs: Developer Experience Can Make or Break Your API

This is the second post of a three-part series covering key performance indicators for APIs, based on John Musser's presentation at the Business of APIs Conference.

ProgrammableWeb: Mendix Moves to Simplify Mobile App Development

Mendix today moved to simplify the development of mobile applications with enhancements to its cloud platform that enable developers to tie components together to create a mobile application that can be instantly deployed on any number of mobile computing devices.

Amazon Web Services: Next Generation Genomics With AWS

My colleague Matt Wood wrote a great guest post to announce new support for one of our genomics partners.

-- Jeff;


I am happy to announce that AWS will be supporting the work of our partner, Seven Bridges Genomics, who has been selected as one of the National Cancer Institute (NCI) Cancer Genomics Cloud Pilots. The cloud has become the new normal for genomics workloads, and AWS has been actively involved since the earliest days, from being the first cloud vendor to host the 1000 Genomes Project, to newer projects like designing synthetic microbes, and development of novel genomics algorithms that work at population scale. The NCI Cancer Genomics Cloud Pilots are focused on how the cloud has the potential to be a game changer in terms of scientific discovery and innovation in the diagnosis and treatment of cancer.

The NCI Cancer Genomics Cloud Pilots will help address a problem in cancer genomics that is all too familiar to the wider genomics community: data portability. Today's typical research workflow involves downloading large data sets (such as the previously mentioned 1000 Genomes Project or The Cancer Genome Atlas (TCGA)) to on-premises hardware, and running the analysis locally. Genomic datasets are growing at an exponential rate and becoming more complex as phenotype-genotype discoveries are made, making the current workflow slow and cumbersome for researchers. This data is difficult to maintain locally and share between organizations. As a result, genomic research and collaborations have become limited by the available IT infrastructure at any given institution.

The NCI Cancer Genomics Cloud Pilots will take the natural step to solve this problem, by bringing the computation to where the data is, rather than the other way around. The goal of the NCI Cancer Genomics Cloud Pilots is to create cloud-hosted repositories for cancer genome data that reside alongside the tools, algorithms, and data analysis pipelines needed to make use of the data. These Pilots will provide ways to provision computational resources within the cloud so that researchers can analyze the data in place. By collocating data in the cloud with the necessary interface, algorithms, and self-provisioned resources, these Pilots will remove barriers to entry, allowing researchers to more easily participate in cancer research and accelerating the pace of discovery. This means more life-saving discoveries such as better ways to diagnose stomach cancer, or the identification of novel mutations in lung cancer that allow for new drug targets.

The Pilots will also allow cancer researchers to provision compute clusters that change as their research needs change. They will have the necessary infrastructure to support their research when they need it, rather than make a guess at the resources that they will need in the future every time grant writing season starts. They will also be able to ask many more novel questions of the data, now that they are no longer constrained by a static set of computational resources.

Finally, the NCI Cancer Genomics Pilots will help researchers collaborate. When data sets are publicly shared, it becomes simple to exchange and share all the tools necessary to reproduce and expand upon another lab's work. Other researchers will then be able to leverage that software within the community, or perhaps even in an unrelated field of study, resulting in even more ideas being generated.

Since 2009, Seven Bridges Genomics has developed a platform that allows biomedical researchers to leverage AWS's cloud infrastructure to focus on their science rather than managing computational resources for storage and execution. Additionally, Seven Bridges has developed security measures to ensure compliance with the Health Insurance Portability and Accountability Act (HIPAA) for all data stored in the cloud. For the NCI Cancer Genomics Cloud Pilots, the team will adapt the platform to meet the specific needs of the cancer research community as they develop over the course of the Pilots. If you are interested in following the work being done by Seven Bridges Genomics or giving feedback as their work on the NCI Cancer Genomics Cloud Pilots progresses, you can do so here.

We look forward to the journey ahead with Seven Bridges Genomics. You can learn more about AWS and Genomics here.

-- Matt Wood, General Manager, Data Science

Jeremy Keith (Adactio): Indie web building blocks

I was back in Nürnberg last week for the second border:none. Joschi tried an interesting format for this year’s event. The first day was a small conference-like gathering with an interesting mix of speakers, but the second day was much more collaborative, with people working together in “creator units”—part workshop, part round-table discussion.

I teamed up with Aaron to lead the session on all things indie web. It turned out to be a lot of fun. Throughout the day, we introduced the little building blocks, one by one. By the end of the day, it was amazing to see how much progress people made by taking this layered approach of small pieces, loosely stacked.

relme

The first step is: do you have a domain name?

Okay, next step: are you linking from that domain to other profiles of you on the web? Twitter, Instagram, Github, Dribbble, whatever. If so, here’s the first bit of hands-on work: add rel="me" to those links.

<a rel="me" href="https://twitter.com/adactio">Twitter</a>
<a rel="me" href="https://github.com/adactio">Github</a>
<a rel="me" href="https://www.flickr.com/people/adactio">Flickr</a>

If you don’t have any profiles on other sites, you can still mark up your telephone number or email address with rel="me". You might want to do this in a link element in the head of your HTML.

<link rel="me" href="mailto:jeremy@adactio.com" />
<link rel="me" href="sms:+447792069292" />

IndieAuth

As soon as you’ve done that, you can make use of IndieAuth. This is a technique that demonstrates a recurring theme in indie web building blocks: take advantage of the strengths of existing third-party sites. In this case, IndieAuth piggybacks on top of the fact that many third-party sites have some kind of authentication mechanism, usually through OAuth. The fact that you’re “claiming” a profile on a third-party site using rel="me"—and the third-party profile in turn links back to your site—means that we can use all the smart work that went into their authentication flow.

You can see IndieAuth in action by logging into the Indie Web Camp wiki. It’s pretty nifty.

If you’ve used rel="me" to link to a profile on something like Twitter, Github, or Flickr, you can authenticate with their OAuth flow. If you’ve used rel="me" for your email address or phone number, you can authenticate by email or SMS.

h-entry

Next question: are you publishing stuff on your site? If so, mark it up using h-entry. This involves adding a few classes to your existing markup.

<article class="h-entry">
  <div class="e-content">
    <p>Having fun with @aaronpk, helping @border_none attendees mark up their sites with rel="me" links, h-entry classes, and webmention endpoints.</p>
  </div>
  <time class="dt-published" datetime="2014-10-18 08:42:37">8:42am</time>
</article>

Now, the reason for doing this isn’t for some theoretical benefit from search engines, or browsers, but simply to make the content you’re publishing machine-parsable (which will come in handy in the next steps).

Aaron published a note on his website, inviting everyone to leave a comment. The trick is that to leave a comment on Aaron's site, you need to publish it on your own site.

Webmention

Here’s my response to Aaron’s post. As well as being published on my own site, it also shows up on Aaron’s. That’s because I sent a webmention to Aaron.

Webmention is basically a reimplementation of pingback, but without any of the XML silliness; it’s just a POST request with two values—the URL of the origin post, and the URL of the response.
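
In other words, if you know the target site's webmention endpoint, you can send one with nothing fancier than curl. A sketch, with purely illustrative URLs:

curl -i https://example.com/webmention \
     -d source=https://yoursite.example/replies/123 \
     -d target=https://example.com/posts/original-post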

My site doesn’t automatically send webmentions to any links I reference in my posts—I should really fix that—but that’s okay; Aaron—like me—has a form under each of his posts where you can paste in the URL of your response.

This is where those h-entry classes come in. If your post is marked up with h-entry, then it can be parsed to figure out which bit of your post is the body, which bit is the author, and so on. If your response isn’t marked up as h-entry, Aaron just displays a link back to your post. But if it is marked up in h-entry, Aaron can show the whole post on his site.

Okay. By this point, we’ve already come really far, and all people had to do was edit their HTML to add some rel attributes and class values.

For true site-to-site communication, you'll need to have a webmention endpoint. That's a bit trickier to add to your own site; it requires some programming. Here's my minimum viable webmention that I wrote in PHP. But there are plenty of existing implementations you can use, like this webmention plug-in for WordPress.

Or you could request an account on webmention.io, which is basically webmention-as-a-service. Handy!

Once you have a webmention endpoint, you can point to it from the head of your HTML using a link element:

<link rel="mention" href="https://adactio.com/webmention" />

Now you can receive responses to your posts.

Here’s the really cool bit: if you sign up for Bridgy, you can start receiving responses from third-party sites like Twitter, Facebook, etc. Bridgy just needs to know who you are on those networks, looks at your website, and figures everything out from there. And it automatically turns the responses from those networks into h-entry. It feels like magic!

Here are responses from Twitter to my posts, as captured by Bridgy.

POSSE

That was mostly what Aaron and I covered in our one-day introduction to the indie web. I think that’s pretty good going.

The next step would be implementing the idea of POSSE: Publish on your Own Site, Syndicate Elsewhere.

You could do this using something as simple as If This Then That (IFTTT): e.g., every time something crops up in your RSS feed, post it to Twitter, or Facebook, or both. If you don’t have an RSS feed, don’t worry: because you’re already marking your HTML up in h-entry, it can be converted to RSS easily.

I’m doing my own POSSEing to Twitter, which I’ve written about already. Since then, I’ve also started publishing photos here, which I sometimes POSSE to Twitter, and always POSSE to Flickr. Here’s my code for posting to Flickr.

I’d really like to POSSE my photos to Instagram, but that’s impossible. Instagram is a data roach-motel. The API provides no method for posting photos. The only way to post a picture to Instagram is with the Instagram app.

My only option is to do the opposite of POSSEing, which is PESOS: Publish Elsewhere, and Syndicate to your Own Site. To do that, I need to have an endpoint on my own site that can receive posts.

Micropub

Working side by side with Aaron at border:none inspired me to finally implement one more indie web building block I needed: micropub.

Having a micropub endpoint here on my own site means that I can publish from third-party sites …or even from native apps. The reason why I didn’t have one already was that I thought it would be really complicated to implement. But it turns out that, once again, the trick is to let other services do all the hard work.

First of all, I need to have something to manage authentication. Well, I already have that with IndieAuth. I got that for free just by adding rel="me" to my links to other profiles. So now I can declare indieauth.com as my authorization endpoint in the head of my HTML:

<link rel="authorization_endpoint" href="https://indieauth.com/auth" />

Now I need some way of creating and issuing authentication tokens. See what I mean about it sounding like hard work? Creating a token endpoint seems complicated.

But once again, someone else has done the hard work so I don’t have to. Tokens-as-a-service:

<link rel="token_endpoint" href="https://tokens.indieauth.com/token" />

The last piece of the puzzle is to point to my own micropub endpoint:

<link rel="micropub" href="https://adactio.com/micropub" />

That URL is where I will receive posts from third-party sites and apps (sent through a POST request with an access token in the header). It’s up to me to verify that the post is authenticated properly with a valid access token. Here’s the PHP code I’m using.

It wasn’t nearly as complicated as I thought it would be. By the time a post and a token hits the micropub endpoint, most of the hard work has already been done (authenticating, issuing a token, etc.). But there are still a few steps that I have to do:

  1. Make a GET request (I’m using cURL) back to the token endpoint I specified—sending the access token I’ve been sent in a header—verifying the token.
  2. Check that the “me” value that I get back corresponds to my identity, which is https://adactio.com
  3. Take the h-entry values that have been sent as POST variables and create a new post on my site.
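
Put together, the request that arrives at the endpoint is just a form-encoded POST carrying a bearer token, so you can exercise it by hand with curl. A sketch (the token value is a placeholder):

curl -i https://adactio.com/micropub \
     -H "Authorization: Bearer XXXXXXXX" \
     -d h=entry \
     -d "content=Testing my micropub endpoint"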

I tested my micropub endpoint using Quill, a nice little posting interface that Aaron built. It comes with great documentation, including a guide to creating a micropub endpoint.

It worked.

Here’s another example: Ben Roberts has a posting interface that publishes to micropub, which means I can authenticate myself and post to my site from his interface.

Finally, there’s OwnYourGram, a service that monitors your Instagram account and posts to your micropub endpoint whenever there’s a new photo.

That worked too. And I can also hook up Bridgy to my Instagram account so that any activity on my Instagram photos also gets sent to my webmention endpoint.

Indie Web Camp

Each one of these building blocks you implement unlocks more and more powerful tools.

But it’s worth remembering that these are just implementation details. What really matters is that you’re publishing your stuff on your website. If you want to use different formats and protocols to do that, that’s absolutely fine. The whole point is that this is the independent web—you can do whatever you please on your own website.

Still, if you decide to start using these tools and technologies, you’ll get the benefit of all the other people who are working on this stuff. If you have the chance to attend an Indie Web Camp, you should definitely take it: I’m always amazed by how much is accomplished in one weekend.

Some people have started referring to the indie web movement. I understand where they’re coming from; it certainly looks like a “movement” from the outside, and if you attend an Indie Web Camp, there’s a great spirit of sharing. But my underlying motivations are entirely selfish. In the same way that I don’t really care about particular formats or protocols, I don’t really care about being part of any kind of “movement.” I care about my website.

As it happens, my selfish motivations align perfectly with the principles of an indie web.

ProgrammableWeb: COWL Project Promises to Better Secure JavaScript Applications

Modern Web applications by definition are an amalgamation of JavaScript code typically mashed together to create something greater than the sum of its parts. The challenge is that every developer has to trust that the sensitive data won’t inadvertently leak out.

ProgrammableWeb: Infinigon Launches API for Access to ECHO Platform

Infinigon Group, a real-time social analytics solution provider, has launched an API for access to its ECHO platform. ECHO is a real-time analytics tool that captures market data from Tweets and provides actionable data to traders. The platform processes millions of Tweets each day, and the new API provides programmatic access to the data.

ProgrammableWeb: 5 Ways To Increase Developer Onboarding

Having developers adopt an API can be a difficult task, but one that can be overcome by framing the API as a product, with developers being the number one customer. If an API provider is noticing a lag in general interest, it could be due to any one of 5 major problems.

ProgrammableWeb: FIWARE Open API Platform Makes 80 Million EUR Available to Startups

Open API platform FIWARE is inviting applications from businesses, small enterprise and startups to participate in a range of accelerator programs aimed at creating a new wave of innovative tech for agriculture, smart cities, e-health, manufacturing and the Internet of Things. FIWARE is a European Commission-funded initiative under the Future Internet program.

ProgrammableWeb: Presentation at KeenCon on the "APIs of Things"

Hugo Fiennes, a designer of the Nest and Co-Founder and CEO of Electric Imp, speaks at KeenCon regarding the APIs of Things. Coming from both a hardware and a software background, he gives a half-hour talk that is an informative overview of his experience designing and implementing IoT devices.

ProgrammableWeb: The IoT Enters a 20 Year Prototyping Phase

The number of "things" plugging into the IoT realm is changing rapidly, not only in number but in scale and complexity. The IoT is part of an evolving consumer space determined largely by taste, style, and culture. One needs only to look at the varieties of multi-colored leather bands offered with Apple's Watch to see that IoT commodities are intermingled with fashion trends. This early into the game, one might go as far as to call all IoT devices prototypes.

ProgrammableWeb: LinguaSys launching GlobalNLP API for natural language processing in the cloud

Human language technology company LinguaSys is this week launching an API offering that allows developers to use its GlobalNLP natural language processing software in the cloud.

Paul Downey (British Telecom): One CSV, thirty stories: 7. Prices redux

This is day 7 of One CSV, 30 stories, a series of articles exploring price paid data from the Land Registry found on GOV.UK. The code for this and the other articles is available as open source from GitHub.

Continuing on from yesterday’s foray into prices, today sees more of the same with more or less the same gnuplot script.

The prices file from Day 2 contains almost 150,000 different prices:

$ wc -l price.tsv
141464
Count Price (£)
208199 250000
185912 125000
163323 120000
159519 60000
147645 110000
145214 150000
140833 115000
134731 135000
131334 175000
131223 85000
129597 130000
129336 105000
126161 165000
126004 95000
124379 145000
123968 75000
123893 140000
123451 160000
123340 90000
120306 100000
119776 80000

which, when plotted by rank using the gnuplot pseudo-column zero:

plot "/dev/stdin" using 0:1 with boxes lc rgb "black"

shows how the prices are distributed in quite a steep power-curve, a long-tail if you will:

Price rank

A quick awk script to collate prices, modulo 10:

cut -f1 < data/pp.tsv | awk '{ print $1 % 10 }' | sort | uniq -c | sort -rn

gives us the distribution of the last digit in the prices:

Count Price (£1)
18437019 0
715633 5
56195 9
21890 2
17549 6
17395 3
16889 1
16235 7
14888 8
11878 4

Last digit of the price

and can be tweaked to show the last two digits:
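
Presumably the tweak is just a change of modulus, along these lines:

cut -f1 < data/pp.tsv | awk '{ print $1 % 100 }' | sort | uniq -c | sort -rn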

Count Price (£10)
16282411 0
2087949 50
636253 95
45710 99
22419 75
20194 25
11271 45
11121 60
9890 20
9425 80
9235 40
7677 90
6855 70
6532 10
6519 55
5924 30

Last two digits of the price

and the last three digits in the prices:

Count Price (£100)
3682320 0
3332503 5000
980975 8000
897786 2000
835579 7000
765799 3000
732587 9950
713121 6000
707063 4000
687129 9000
596687 7500
567882 2500
503076 1000
298398 8500
294878 4950
267618 9995

Last three digits of the price

A logarithmic scale can help see patterns in the lower values whilst showing the peaks on the same page; it’s a bit like squinting at the chart from a low angle:

Last 3 digits of the price on a log scale

I think tomorrow will be pretty average.

David Megginson: A different kind of data standard

This year, the UN gave me the chance to bring people together to work on data standards for humanitarian crises (like the Ebola or Syria crisis). We put together a working group from multiple humanitarian agencies and NGOs and got to work in February.  The result is the alpha version of the Humanitarian Exchange Language (HXL, pronounced /HEX-el/), a very different kind of data standard.

<section id="whats-wrong">

What’s wrong with data standards these days?

Unlike most data standards, HXL is cooperative rather than competitive. A competitive standard typically considers the way you currently work to be a problem, and starts by presenting you with a list of demands:

  • Switch to a different data format (and acquire and learn new software tools).
  • Change the information you share (and the way your organisation collects and uses that information).
  • Abandon what is valuable and unique about your organisation’s data (and conform to the common denominator).

For HXL, we reversed the process and started by asking humanitarian responders how they’re actually working right now, then thought out how we could build a cooperative standard to enhance what they already do.

</section> <section id="not-json">

Not JSON or XML

Given the conditions under which humanitarian responders work in the field (iffy connectivity, time pressure, lots to do besides putting together data reports), we realised that an XML-, JSON-, or RDF-based standard wasn’t going to work.

The one data tool people already know is the infamous spreadsheet program, so HXL would have to work with spreadsheets. But it also had to be able to accommodate local naming conventions (e.g. we couldn’t force everyone to use “ADM1” as a header for what is a province in Guinea, a departamento in Colombia, or a governorate in Syria). So in the end, we decided to add a row of hashtags below the human-readable headers to signal the common meaning of the columns and the start of the actual data. It looks a bit like this:

Location name    Location code    People affected
#loc             #loc_id          #aff_num
Town A           01000001         2000
Town B           01000002         750
Town C           01000003         1920

The tagging conventions are slightly more sophisticated than that, also including special support for repeated fields, multiple languages, and compact-disaggregated data (e.g. time-series data).

</section> <section id="uptake">

HXL in action

While HXL is still in the alpha stage, the Standby Task Force is already using it as part of the international Ebola response, and we’re running informal interoperability trials with the International Aid Transparency Initiative, with more formal trials planned with UNHCR and IOM.

We also have an interactive HXL showcase demo site, and a collection of public-domain HXL libraries available on GitHub. More news soon.

</section> <section id="credits">

Credits

Thanks to the Humanitarian Data Exchange project (managed by Sarah Telford) at the UN’s Office for the Coordination of Humanitarian Affairs for giving me the opportunity and support to do this kind of work, to the Humanitarian Innovation Fund for backing it financially, to the HXL Working Group for coming together to figure this stuff out, and especially to CJ Hendrix and Carsten Keßler for their excellent work on an earlier incarnation of HXL and for their ongoing support.

</section>
Tagged: hxl

ProgrammableWeb: Today in APIs: Facebook Doubles Bug Bounty for Advertising

Facebook ponies up even more for developers finding bugs. Apple squeezes out more performance with Metal. Plus: Stitch engineers win $1 million Salesforce hackathon, and tips on creating APIs from the creator of NetBeans.

ProgrammableWeb: Vodafone India Searches for Best API Use via appStar 2014

Vodafone India, India's leading telecommunications service provider, has launched appStar 2014. appStar 2014 offers brands, e-tailers, app developers, and more the opportunity to show off their API skills through app development that utilizes Vodafone's network APIs.

Amazon Web Services: AWS Week in Review - October 13, 2014

Let's take a quick look at what happened in AWS-land last week:

Monday, October 13
Tuesday, October 14
Wednesday, October 15
Thursday, October 16
Friday, October 17

Here are some of the events that we have on tap for the next week or two:

Stay tuned for next week! In the meantime, follow me on Twitter and subscribe to the RSS feed.

-- Jeff;

Amazon Web Services: Fast, Easy, Free Data Sync from RDS MySQL to Amazon Redshift

As you know, I'm a big fan of Amazon RDS. I love the fact that it allows you to focus on your applications and not on keeping your database up and running. I'm also excited by the disruptive price, performance, and ease of use of Amazon Redshift, our petabyte-scale, fully managed data warehouse service that lets you get started for $0.25 per hour and costs less than $1,000 per TB per year. Many customers agree, as you can see from recent posts by Pinterest, Monetate, and Upworthy.

Many AWS customers want to get their operational and transactional data from RDS into Redshift in order to run analytics. Until recently, it's been a somewhat complicated process. A few weeks ago, the RDS team simplified the process by enabling row-based binary logging, which in turn has allowed our AWS Partner Network (APN) partners to build products that continuously replicate data from RDS MySQL to Redshift.

Two APN data integration partners, FlyData and Attunity, currently leverage row-based binary logging to continuously replicate data from RDS MySQL to Redshift. Both offer free trials of their software in conjunction with Redshift's two month free trial. After a few simple configuration steps, these products will automatically copy schemas and data from RDS MySQL to Redshift and keep them in sync. This will allow you to run high performance reports and analytics on up-to-date data in Redshift without having to design a complex data loading process or put unnecessary load on your RDS database instances.

If you're using RDS MySQL 5.6, you can replicate directly from your database instance by enabling row-based logging, as shown below. If you're using RDS MySQL 5.5, you'll need to set up a MySQL 5.6 read replica and configure the replication tools to use the replica to sync your data to Redshift. To learn more about these two solutions, see FlyData's Free Trial Guide for RDS MySQL to Redshift as well as Attunity's Free Trial and the RDS MySQL to Redshift Guide. Attunity's trial is available through the AWS Marketplace, where you can find and immediately start using software with Redshift with just a few clicks.

Informatica and SnapLogic also enable data integration between RDS and Redshift, using a SQL-based mechanism that queries your database to identify data to transfer to your Amazon Redshift clusters. Informatica is offering a 60-day free trial and SnapLogic has a 30 day free trial.

All four data integration solutions discussed above can be used with all RDS database engines (MySQL, SQL Server, PostgreSQL, and Oracle). You can also use AWS Data Pipeline (which added some recent Redshift enhancements), to move data between your RDS database instances and Redshift clusters. If you have analytics workloads, now is a great time to take advantage of these tools and begin continuously loading and analyzing data in Redshift.

Enabling Amazon RDS MySQL 5.6 Row Based Logging
Here's how you enable row based logging for MySQL 5.6:

  1. Go to the Amazon RDS Console and click Parameter Groups in the left pane:
  2. Click on the Create DB Parameter Group button and create a new parameter group in the mysql5.6 family:
  3. Once in the detail view, click the Edit Parameters button. Then set the binlog_format parameter to ROW:
For more details please see Working with MySQL Database Log Files.
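
If you prefer the command line, the same parameter group can be created and edited with the AWS CLI. A sketch, assuming the CLI is configured; the group name is illustrative:

aws rds create-db-parameter-group \
    --db-parameter-group-name mysql56-row-logging \
    --db-parameter-group-family mysql5.6 \
    --description "Row-based binary logging for Redshift replication"

aws rds modify-db-parameter-group \
    --db-parameter-group-name mysql56-row-logging \
    --parameters "ParameterName=binlog_format,ParameterValue=ROW,ApplyMethod=immediate"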

Free Trials for Continuous RDS to Redshift Replication from APN Partners
FlyData has published a step-by-step guide and a video demo to show you how to continuously and automatically sync your RDS MySQL 5.6 data to Redshift, and you can get started for free for 30 days. You will need to create a new parameter group with binlog_format set to ROW and binlog_checksum set to NONE, and adjust a few other parameters as described in the guide above.

AWS customers are already using FlyData for continuous replication to Redshift from RDS. For example, rideshare startup Sidecar seamlessly syncs tens of millions of records per day to Redshift from two RDS instances in order to analyze how customers utilize Sidecar's custom ride services. According to Sidecar, their analytics run 3x faster and the near-real-time access to data helps them to provide a great experience for riders and drivers. Here's the data flow when using FlyData:

Attunity CloudBeam has published a configuration guide that describes how you can enable continuous, incremental change data capture from RDS MySQL 5.6 to Redshift (you can get started for free for 5 days directly from the AWS Marketplace). You will need to create a new parameter group with binlog_format set to ROW and binlog_checksum set to NONE.

For additional information on configuring Attunity for use with Redshift please see this quick start guide.

Redshift Free Trial
If you are new to Amazon Redshift, you’re eligible for a free trial and can get 750 free hours for each of two months to try a dw2.large node (16 GB of RAM, 2 virtual cores, and 160 GB of compressed SSD storage). This gives you enough hours to continuously run a single node for two months. You can also build clusters with multiple dw2.large nodes to test larger data sets; this will consume your free hours more quickly. Each month's 750 free hours are shared across all running dw2.large nodes in all regions.

To start using Redshift for free, simply go to the Redshift Console, launch a cluster, and select dw2.large for the Node Type:
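
The console steps above have a CLI equivalent if you prefer to script it; here's a sketch with placeholder identifiers and password:

aws redshift create-cluster \
    --cluster-identifier my-trial-cluster \
    --node-type dw2.large \
    --cluster-type single-node \
    --master-username admin \
    --master-user-password 'Replace-Me-1'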

Big Data Webinar
If you want to learn more, do not miss the AWS Big Data Webinar showcasing how startup Couchsurfing used Attunity’s continuous CDC to reduce their ETL process from 3 months to 3 hours and cut costs by nearly $40K.

-- Jeff;

ProgrammableWeb: APIs: Diffbot Discussion

Diffbot provides developers tools that can identify, analyze, and extract the main content and sections from any web page. The Diffbot Discussion API extracts discussions and posting information from web pages. It can return information about all identified objects on a submitted page and the Discussion API returns all post data in a single object. The Diffbot Discussion API is currently in Beta.
Date Updated: 2014-10-20

ProgrammableWeb: APIs: Diffbot Image

Diffbot provides developers tools that can identify, analyze, and extract the main content and sections from any web page. The purpose of Diffbot’s Image API is to extract the main images from web pages. The Image API can analyze a web page and return full details on the extracted images.
Date Updated: 2014-10-20

ProgrammableWeb: APIs: Diffbot Analyze

Diffbot provides developers tools that can identify, analyze, and extract the main content and sections from any web page. The Diffbot Analyze API analyzes a web page visually, taking a URL and identifying what type of page it is. It then decides which Diffbot extraction API (article, discussion, image, or product) is appropriate, and the results of that automatic extraction are returned in the Analyze API call.
Date Updated: 2014-10-20

Paul Downey (British Telecom): One CSV, thirty stories: 6. Prices

This is day 6 of One CSV, 30 stories, a series of articles exploring price paid data from the Land Registry found on GOV.UK. The code for this and the other articles is available as open source from GitHub.

I was confident today was going to be “Talk like a statistician day” but my laptop was tied up for most of it whilst Yosemite installed itself, meaning I didn’t have time to play with R after all. Instead let’s continue to dig into how property is priced.

We saw in yesterday’s scatter plots how prices clump around integer values, and then skip around where stamp duty kicks in, £60k in this section:

Zooming in on the prices scatterplot

I didn’t have much time, so grabbed gnuplot again to make another scatter plot, this time using the prices file we made on Day 2:

#!/usr/bin/env gnuplot
set terminal png font "helvetica,14" size 1600,1200 transparent truecolor
set output "/dev/stdout"
set key off
set xlabel "Price paid (£)"
set xrange [0:1500000]
set format x "%.0s%c"
set ylabel "Number of transactions"
set yrange [0:150000]
set format y "%.0s%c"
set style circle radius 4500
plot "/dev/stdin" using 2:1 \
    with circles lc rgb "black" \
    fs transparent \
    solid 0.5 noborder
$ price.gpi < price.tsv > price.png

Transactions by price

Maybe the same plot with boxes will be clearer:

 plot "/dev/stdin" using 2:1 with boxes lc rgb "black"

Frequency of prices

So even more confirmation that people prefer whole numbers and multiples of 10 when pricing houses, and market them either just below a stamp duty band or some way beyond it. The interference lines at the lower prices look interesting. More on that tomorrow.

Paul Downey (British Telecom): One CSV, thirty stories: 5. Axes

This is day 5 of One CSV, 30 stories, a series of articles exploring price paid data from the Land Registry found on GOV.UK. The code for this and the other articles is available as open source from GitHub.

I’m falling behind on the schedule to write a post each day, thanks to falling into a time sink hand-coding PostScript to generate axes. As fun as that was, it wasn’t helping us towards the goal of better understanding the data. I had literally lost the plot. Returning to the brief, the scatter plots from yesterday need axes so we can understand when the dips occurred and at what price the horizontal bands sit.

So it's time to break out gnuplot, a great package for generating charts from scripts. I found gnuplotting.org extremely helpful when it came to remembering how to drive this venerable beast, and trying to fathom new features for transparency:

#!/usr/bin/env gnuplot
set terminal png \
    font "helvetica,14" \
    size 1600,1200 \
    transparent truecolor
set output "/dev/stdout"
set key off
set xlabel "Date"
set xdata time
set timefmt "%Y-%m-%d"
set xrange ["1994-10-01":"2015-01-01"]
set format x "%Y"
set ylabel "Price paid (£)"
set yrange [0:300000]
set format y "%.0s%c"
set style circle radius 100
plot "/dev/stdin" using 1:2 \
    with circles lc rgb "black" fs transparent solid 0.01 noborder

Ignoring the outliers, and digging into the lower popular prices:

Scatter plot of lower house prices

The axes help us confirm the dip of the recession in 2009, and reveal seasonal peaks in summer and strong vertical gaps at each new year. Horizontal bands show how property prices bunch between round numbers. Prices below £50k start to disappear from 2004, and prices skip around stamp duty bands, particularly noticeably at £250k and at £60k, which was withdrawn in 2005 (when that gap closes); a gap then opens up again at £125k, which was introduced in 2006. Finally, there's a prominent gap correlating with the £175k band which ran between 2008 and 2010.

The seasonal trends are worth exploring further, but I think we first need to dig deeper into the horizontal banding, so I’m 82.3% confident tomorrow will be “Talk like a statistician day”.

Daniel Glazman (Disruptive Innovations): France and Ebola

I absolutely do not understand what is happening in France's airports with this idiotic measure of screening only those passengers arriving from Conakry, Guinea. It is so easy to make a stopover, and it is so easy for a customs officer to tell, when a passport is presented, exactly where the passenger is arriving from. Every passenger on any flight from an at-risk country should be screened more thoroughly than is being done today. The current measure is a band-aid on a wooden leg; it covers only a small part of the precautions that ought to be in place. Knowing even a little about Ebola, and the absence of any symptoms during its relatively long incubation period, we should be doing better and more than this. I find it pathetic and dangerous.

ProgrammableWeb: Today in APIs: PowerClerk Interconnect Reduces Solar Install Cost, Integrates Via API

Clean Power Research launches PowerClerk Interconnect to dramatically reduce software costs associated with solar power installs. Google has announced the launch of a new Tag Manager API. Plus: Tech Crunch and Makers Academy search for rockstar developers at London hackathon. 

ProgrammableWeb: Google Announces Improvements to Google Tag Manager Including New API

Google has just announced improvements to the Google Tag Manager tool which includes a new intuitive interface, more 3rd-party templates, and the new Google Tag Manager API. Google Tag Manager is a free tool designed primarily for marketers that makes it easy to add and update website and mobile app tags including conversion tracking, site analytics, and remarketing tags.

ProgrammableWeb: Google Launches Android 5.0 Lollipop SDK

Google today delivered an updated set of tools to Android app writers. It published the latest preview images of Android 5.0 Lollipop for the Nexus 5 smartphone and Nexus 7 tablets, as well as released the Android 5.0 SDK. With these tools, developers have all they need to get their apps up to speed.

Google made early versions of both Android 5.0 and the SDK available in June. Now, with Android 5.0's commercial release just several weeks away, it's time to get your apps in shape.

Paul Downey (British Telecom): One CSV, thirty stories: 4. Scattering

This is day 4 of One CSV, 30 stories, a series of articles exploring price paid data from the Land Registry found on GOV.UK. The code for this and the other articles is available as open source from GitHub.

I had some feedback after yesterday, mostly from people enjoying my low-tech approach, which was nice. Today I wanted to look at the price paid for property: all 19 million prices on a single page, in the hope of spotting any apparent trends or anomalies.

To do this we only need the date and the price columns, and we might as well sort them by date as I’m pretty sure that’ll be useful later:

awk -F'\t' '{print $2 "\t" $1}' < data/pp.tsv | sort > prices.tsv

Now to scatter the prices with time on the x-axis, and the price paid on the y-axis. We’ll use yet another awk script to do this:

cat prices.tsv | {
cat <<!
%!
%%Orientation: Landscape
%%Page: 1 1
0 0 0 setrgbcolor
/p {
    1 0 360 arc fill
} def
!
awk -F'	' -v max=15000000 '
    function epoch(s) {
        gsub(/[:-]/, " ", s);
        s = s " 00 00 00"
        return mktime(s);
    }
    NR == 1 {
        first = epoch($1);
        last = systime() - first;
    }
    {
        this = epoch($1) - first;
        x = 600 * this / last;
        y = 600 * $2 / max;
        printf "%d %d p\n", x, y;
    }'
echo showpage
}

which generates a rather large PostScript document:

%!
%%Orientation: Landscape
%%Page: 1 1
0 0 0 setrgbcolor
/p {
    1 0 360 arc fill
} def
0 4 p
0 0 p
   ... [19 million lines removed] ...
595 3 p
595 13 p
showpage

Back in the day the quickest way to see the output would be to attach a laser printer to the parallel port on the back of a server and cat prices.ps > /dev/lp but these days we have a raft of ways of executing PostScript. Most anything that can render a PDF can usually also run the older PostScript language — it’s a little bit weird how we bat executable programs back and forth when we’re exchanging text and images. Just to emphasise the capacity for mischief, the generated 1.5 Gig PostScript reliably crashes the Apple OS X preview application, so it’s best to use something more solid, such as the open source ImageMagick in this case to make a raster image:

scatterps.sh < data/prices.tsv | convert -density 300 - out.png

This image is intriguing, but we should be able to differentiate the density of points if we make them slightly transparent. PostScript is notoriously poor at rendering opacity, but luckily ImageMagick has its own drawing language, MVG, which makes PNG files directly, and it's fairly straightforward to tweak the awk to generate MVG:
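
A rough sketch of what that tweak might look like (not the original script), reusing the same scaling as the PostScript version and requiring GNU awk for mktime/systime:

awk -F'\t' -v max=15000000 '
    function epoch(s) { gsub(/[:-]/, " ", s); return mktime(s " 00 00 00"); }
    NR == 1 { first = epoch($1); last = systime() - first;
              print "fill black"; print "fill-opacity 0.01"; }
    {
        x = 600 * (epoch($1) - first) / last;
        y = 600 - 600 * $2 / max;     # MVG origin is top-left, so flip y
        printf "circle %d,%d %d,%d\n", x, y, x + 1, y;
    }' < data/prices.tsv > prices.mvg
convert -size 600x600 mvg:prices.mvg out.png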

We can see from this a general, apparently slow, trend in the bulk of house prices, with seasonal variation and a marked dip at what looks like 2009. There's also a strange vertical gap in higher-priced properties towards the right, which, along with the horizontal bands more apparent on the first plot, could be down to bunching around the stamp duty bands.

So there are a few stories to delve into. I completely mismanaged my time writing this post, so will leave adding axes to the graphs until tomorrow.

ProgrammableWeb: How To Develop an Android Wear Application

Android Wear from Google is a platform that connects your Android phone to your wrist. Since its release earlier this year, Android Wear has garnered a lot of attention, both from a consumer angle and also from developers, who want to ensure that they understand and have their applications ready to take advantage of a new way in which users will be interacting with contextual information.

This article will give a brief introduction to Android Wear and then jump into the platform vis-a-vis the developer.

ProgrammableWeb: KPIs for APIs: API Calls Are the New Web Hits

This is the first post of a three-part series covering key performance indicators (KPIs) for APIs, based on John Musser's presentation at the Business of APIs Conference.

ProgrammableWeb: APIs: bx.in.th

bx.in.th is a Thailand-based Bitcoin and cryptocurrency exchange platform operated by Bitcoin Exchange Thailand (Bitcoin Co. Ltd.). Their API accessibility is divided into Public and Private. The bx.in.th Public API allows anyone to view market data from the exchange, including rates, orderbook, currency pairing for comparison, high and low trades, average Bitcoin pricing, and more. The Private API requires an API key for use. HTTP POST requests can be made to place orders and manage existing orders. Private account data may be returned, such as balances, order history, transaction history, and withdrawal requests. All requests made to the API will return JSON encoded data as a response.
Date Updated: 2014-10-17

ProgrammableWeb: APIs: VIDAL Group

VIDAL Group is a French healthcare informatics group specializing in databasing and distributing healthcare data, pharmaceutical information, treatment specifications, and scientific publications for patients and healthcare practitioners in Europe and worldwide. VIDAL Group also supports a medical software application under the same name. VIDAL's database may be accessed by third-party developers to construct healthcare-related applications and websites. After acquiring an app ID and API key from VIDAL, users can query the VIDAL server to return data on drug scores, allergies, product information, ingredients, related documents, and more.
Date Updated: 2014-10-17

ProgrammableWeb: APIsCrowdfunder

Crowdfunder is a UK-based platform where people can crowdsource funding for unique projects. Crowdfunder projects typically involve social endeavors related to community, charity, environment, art, music, publishing, film, and theatre. Currently in an open beta, the Crowdfunder API accepts HTTP GET calls that return JSON lists of all current campaigns, filtered by project name and category. Using the API, developers have programmatic access to specific details on individual projects, including all project fields: biography, description, URL, current funding amount, last pledge amount, project video, image, category, and additional details. As the API is in beta, Crowdfunder is accepting any feedback users may have while implementing it.
Date Updated: 2014-10-17

ProgrammableWeb: APIsCompany Check

The Company Check API provides direct access to a wealth of information on companies and directors. The API lets developers incorporate company, director, financial, credit, and other data fields into software and business apps. By applying for an API key, developers can choose between different levels of account plans.
Date Updated: 2014-10-17

ProgrammableWeb: APIsGlobalNLP

Via RESTful connectivity, GlobalNLP handles a wide variety of natural language processing tasks. Currently, the API supports many NLP processes, including stemming, morphological synthesis, word sense disambiguation, entity extraction, and automatic translation. A full list of supported processes is given in the documentation, along with code samples in JavaScript, C#, PHP, Python, Ruby, and more. A free account guarantees 20 API calls a minute and 500 calls a month, with higher volumes available on a paid account.
Date Updated: 2014-10-17

Bob DuCharme (Innodata Isogen)Dropping OPTIONAL blocks from SPARQL CONSTRUCT queries

And retrieving those triples much, much faster.

animals taxonomy

While preparing a demo for the upcoming Taxonomy Boot Camp conference, I hit upon a trick for revising SPARQL CONSTRUCT queries so that they don't need OPTIONAL blocks. As I wrote in the new "Query Efficiency and Debugging" chapter in the second edition of Learning SPARQL, "Academic papers on SPARQL query optimization agree: OPTIONAL is the guiltiest party in slowing down queries, adding the most complexity to the job that the SPARQL processor must do to find the relevant data and return it." My new trick not only made the retrieval much faster; it also made it possible to retrieve a lot more data from a remote endpoint.

First, let's look at a simple version of the use case. DBpedia has a lot of SKOS taxonomy data in it, and at Taxonomy Boot Camp I'm going to show how you can pull down and use that data. Now, imagine that a little animal taxonomy like the one shown in the illustration here is stored on an endpoint, and I want to write a query to retrieve all the triples showing preferred labels and "has broader" values up to three levels down from the Mammal concept, assuming that the taxonomy uses SKOS to represent its structure. The following query asks for all three levels of the taxonomy below Mammal, but it won't get the whole taxonomy:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
# (a PREFIX declaration for the example's local v: vocabulary is assumed but not shown)
CONSTRUCT {
  ?level1 skos:prefLabel ?level1label . 
  ?level2 skos:broader ?level1 ;
          skos:prefLabel ?level2label . 
  ?level3 skos:broader ?level2 ;
          skos:prefLabel ?level3label . 
}
WHERE {
  ?level1 skos:broader v:Mammal ;
          skos:prefLabel ?level1label . 
  ?level2 skos:broader ?level1 ;
          skos:prefLabel ?level2label .
  ?level3 skos:broader ?level2 ;
          skos:prefLabel ?level3label . 
}

As with any SPARQL query, it's only going to return triples for which all the triple patterns in the WHERE clause match. While Horse may have a broader value of Mammal and therefore match the triple pattern {?level1 skos:broader v:Mammal}, there are no nodes that have Horse as a broader value, so there will be no match for {?level2 skos:broader v:Horse}. So, the Horse triples won't be in the output. The same thing will happen with the Cat triples; only the Dog ones, which go down three levels below Mammal, will match the graph pattern in the WHERE clause above.

If we want a CONSTRUCT query that retrieves all the triples of the subtree under Mammal, we need a way to retrieve the Horse and Cat concepts and any descendants they have, even if they have no descendants, and OPTIONAL makes this possible. The following will do this:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
CONSTRUCT {
  ?level1 skos:prefLabel ?level1label . 
  ?level2 skos:broader ?level1 ;
          skos:prefLabel ?level2label . 
  ?level3 skos:broader ?level2 ;
          skos:prefLabel ?level3label . 
}
WHERE {
  ?level1 skos:broader v:Mammal ;
          skos:prefLabel ?level1label . 
  OPTIONAL {
    ?level2 skos:broader ?level1 ;
            skos:prefLabel ?level2label .
  }
  OPTIONAL {
    ?level3 skos:broader ?level2 ;
            skos:prefLabel ?level3label . 
  }
}

The problem: this doesn't scale. When I sent a nearly identical query to DBpedia to ask for the triples representing the hierarchy three levels down from <http://dbpedia.org/resource/Category:Mammals>, it timed out after 20 minutes, because the two OPTIONAL graph patterns gave DBpedia too much work to do.

As a review, let's restate the problem: we want the identified concept and the preferred labels and broader values of concepts up to three levels down from that concept, but without using the OPTIONAL keyword. How can we do this?

By asking for each level in a separate query. When I split the DBpedia version of the query above into the following three queries, each retrieved its data in under a second, retrieving a total of 2,597 triples representing a taxonomy of 1,107 concepts:

# query 1
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
CONSTRUCT {
  <http://dbpedia.org/resource/Category:Mammals> a skos:Concept . 
  ?level1 a skos:Concept ;
          skos:broader <http://dbpedia.org/resource/Category:Mammals> ;
          skos:prefLabel ?level1label .  
}
WHERE {
  ?level1 skos:broader <http://dbpedia.org/resource/Category:Mammals> ;
          skos:prefLabel ?level1label .  
}

# query 2
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
CONSTRUCT {
  ?level2 a skos:Concept ;
          skos:broader ?level1 ;  
          skos:prefLabel ?level2label .  
}
WHERE {
  ?level1 skos:broader <http://dbpedia.org/resource/Category:Mammals> .
  ?level2 skos:broader ?level1 ;  
            skos:prefLabel ?level2label .  
}

# query 3
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
CONSTRUCT {
  ?level3 a skos:Concept ;
          skos:broader ?level2 ;  
          skos:prefLabel ?level3label .  
}
WHERE {
  ?level2 skos:broader/skos:broader <http://dbpedia.org/resource/Category:Mammals> .
  ?level3 skos:broader ?level2 ;  
          skos:prefLabel ?level3label .  
}
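
If you want to script this against DBpedia yourself, a minimal shell sketch follows. It assumes the three queries are saved as query1.rq, query2.rq, and query3.rq (file names I've made up), and that the endpoint at http://dbpedia.org/sparql is up and will honour a Turtle Accept header:

for q in query1.rq query2.rq query3.rq
do
    # -G sends the urlencoded file contents as a GET parameter named "query"
    curl -s -G 'http://dbpedia.org/sparql' \
         -H 'Accept: text/turtle' \
         --data-urlencode query@"$q"
done > mammals.ttl

Concatenating the three Turtle responses into one file is fine, because Turtle allows prefix declarations to be repeated.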

Going from timing out after 20 minutes to successful execution in under 3 seconds is quite a performance improvement. Below, you can see how the beginning of a small piece of this taxonomy looks in TopQuadrant's TopBraid EVN vocabulary manager. At the first level down, you can only see Afrosoricida, Australosphenida, and Bats in the picture; I then drilled down three more levels from there to show that Fictional bats has the single subcategory Silverwing.

As you can tell from the Mammals URI in the queries above, these taxonomy concepts are categories, and each category has at least one member (for example, Bats as food) in Wikipedia and is therefore represented as triples in DBpedia, ready for you to retrieve with SPARQL CONSTRUCT queries. I didn't retrieve any instance triples here, but it's great to know that they're available, and that this technique for avoiding OPTIONAL graph patterns in CONSTRUCT queries will serve me for much more than SKOS taxonomy work.

There has been plenty of talk lately on Twitter and in blogs about how it's not a good idea for important applications to have serious dependencies on public SPARQL endpoints such as DBpedia. (Orri Erling has one of the most level-headed discussions of this that I've seen in SEMANTiCS 2014 (part 3 of 3): Conversations; in my posting Semantic Web Journal article on DBpedia on this blog I described a great article that lists other options.) There's all this great data to use in DBpedia, and besides spinning up an Amazon Web Services image with your own copy of DBpedia, as Orri suggests, you can pull down the data you need and store it locally while the endpoint is up. If you're unsure about the structure and connections of the data you're pulling down, OPTIONAL graph patterns seem like an obvious fix, but this trick of splitting up CONSTRUCT queries to avoid the use of OPTIONAL means that you can pull down a lot more data a lot more efficiently.

Stickin' to the UNION

October 16th update: Once I split out the pieces of the original query into separate files, it should have occurred to me to at least try joining them back up into a single query with UNION instead of OPTIONAL, but it didn't. Luckily for me, John Walker suggested in the comments for this blog entry that I try this, so I did. It worked great, with the benefit of being simpler to read and maintain than using a collection of queries to retrieve a single set of triples. This version only took three seconds to retrieve the triples:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
CONSTRUCT {
  <http://dbpedia.org/resource/Category:Mammals> a skos:Concept . 
  ?level1 a skos:Concept ;
          skos:broader <http://dbpedia.org/resource/Category:Mammals> ;
          skos:prefLabel ?level1label .  
  ?level2 a skos:Concept ;
          skos:broader ?level1 ;  
          skos:prefLabel ?level2label .  
  ?level3 a skos:Concept ;
          skos:broader ?level2 ;  
          skos:prefLabel ?level3label .  

}
WHERE {
  ?level1 skos:broader <http://dbpedia.org/resource/Category:Mammals> ;
          skos:prefLabel ?level1label .  
  {
    ?level2 skos:broader ?level1 ;  
            skos:prefLabel ?level2label .
  }
  UNION
  {
    ?level2 skos:broader ?level1 .
    ?level3 skos:broader ?level2 ;  
            skos:prefLabel ?level3label .  
  }
}

There are two lessons here:

  • If you've figured out a way to do something better, don't be too satisfied too quickly—keep trying to make it even better.

  • UNION is going to be useful in more situations than I originally thought it would.


ProgrammableWebToday in APIs: BBC Worldwide Chooses Apigee Edge API to Manage Traffic

BBC Worldwide chooses Apigee Edge to manage its store traffic. The API-based GroundWork Network Hub for software-defined networks is launched. Plus: the MTS Bonds.com API is being used by buy-side firms for trading, and apps using Facebook Graph need upgrading by December 25.

Apigee API Edge Chosen by BBC as Store Traffic Manager

BBC Worldwide has chosen Apigee's Edge API to manage traffic for its BBC store. Consumers will be able to buy and store content, including from its extensive archives.

ProgrammableWebApple Watch SDK, WatchKit, Will Be Released in November

Apple just announced that they will release WatchKit, an SDK that allows third-party developers to work within the Apple Watch platform, in November. Apple has been working with select partners to develop applications for the platform since well before the announcement of their newest product; with the release of WatchKit, thousands of other developers will be able to get in on the action.

ProgrammableWebURBAN4M Releases aboutPLACE API to Shake Up Hyperlocal Product Design

URBAN4M has released a beta version of its aboutPLACE API, bringing a new approach to serving location analytics within apps and Web-based products. Founder and CEO Hillit Meidar-Alfi spoke with ProgrammableWeb about how URBAN4M's delivery of urban data will transform hyperlocal products and services.

ProgrammableWebHello Doctor Raises New Funding for API-driven Health Data Aggregation

Hello Doctor, a mobile app designed to empower people to control their health, has raised a new round of funding to continue building its API-driven health data aggregation platform. Through its API, Hello Doctor pulls clinical data from hospitals and consolidates the data into a single, easily digestible format and location. The app is designed to break down walls that keep patient data segregated across different medical institutions.

Amazon Web ServicesAmazon WorkSpaces Supports PCoIP Zero Clients

Amazon WorkSpaces provides a persistent, cloud-based desktop experience that can be accessed from a variety of devices including PC and Mac desktops and laptops, iPads, Kindle Fires, and Android tablets.

Support for PCoIP Zero Clients
Today we are making WorkSpaces even more flexible by adding support for PCoIP zero clients. WorkSpaces desktops are rendered on the server and then transmitted to the endpoint as a highly compressed bitmap via the PCoIP protocol.

Zero clients are simple, secure, single-purpose clients that are equipped with a monitor, keyboard, mouse, and other peripherals. The clients use a dedicated PCoIP chipset for bitmap decompression and decoding and require very little in the way of local software maintenance (there is no operating system running on the device), making them a great match for Amazon WorkSpaces.

You can use any zero client device that contains the Teradici Tera 2 zero client chipset. Currently, over 30 hardware manufacturers provide such devices; check Teradici's supported devices list for more information.

Getting Started
In order to connect your existing zero clients to Amazon WorkSpaces, first verify that they are running version 4.6.0 (or newer) of the PCoIP firmware.

You will need to run the PCoIP Connection Manager authentication appliance in a Virtual Private Cloud. The Connection Manager is built on Ubuntu 12.04 LTS and is available as an HVM AMI. It brokers the authentication process and enables the creation of streaming sessions from WorkSpaces to the clients, thereby offloading all non-streaming work from the clients. The Connection Manager must be run in the VPC that hosts your Amazon WorkSpaces endpoint.

To learn more about this important new AWS feature, read the PCoIP Zero Client Admin Guide.

-- Jeff;

ProgrammableWebWhat The Snappening Taught Us About API Security

The recent hack that exposed over 200,000 explicit photos from Snapchat users reiterates the importance of comprehensive app and API security measures. Though Snapchat is blaming illegal, unofficial third-party apps for the hack, the fact is that Snapchat's API was too frail from the start.

ProgrammableWebToday in APIs: PayPal Flubs Response to Security Flaw

PayPal, along with other companies, responds poorly to the discovery of security flaws. Duo Security offers an API for its two-factor authentication service. Plus: 5 tips on how to craft APIs for developers, and will Apple make a social TV play?

Amazon Web ServicesNew AWS Quick Start - Cloudera Enterprise Data Hub

The new Quick Start Reference Deployment Guide for Cloudera Enterprise Data Hub does exactly what the title suggests! The comprehensive (20 page) guide includes the architectural considerations and configuration steps that will help you to launch the new Cloudera Director and an associated Cloudera Enterprise Data Hub (EDH) in a matter of minutes. As the folks at Cloudera said in their blog post, "Cloudera Director delivers an enterprise-class, elastic, self-service experience for Hadoop in cloud environments."

The reference deployment takes the form of a twelve-node cluster that will cost between $12 and $82 per hour in the US East (Northern Virginia) Region, depending on the instance type that you choose to deploy.

The cluster runs within a Virtual Private Cloud that includes public and private subnets, a NAT instance, security groups, a placement group for low-latency networking within the cluster, and an IAM role. The EDH cluster is fully customizable and includes worker nodes, edge nodes, and management nodes, each running on the EC2 instance type that you designate:

The entire cluster is launched and configured by way of a parameterized AWS CloudFormation template that is fully described in the guide.
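
If you would rather drive the launch from the AWS Command Line Interface than from the console, creating the stack looks roughly like the sketch below. The stack name, template URL, and parameter here are placeholders of my own; the real template location and the full parameter list are in the guide:

aws cloudformation create-stack \
    --region us-east-1 \
    --stack-name cloudera-edh \
    --template-url https://example-bucket.s3.amazonaws.com/cloudera-edh.template \
    --parameters ParameterKey=KeyName,ParameterValue=my-keypair \
    --capabilities CAPABILITY_IAM    # the template creates an IAM role

You can then follow progress with aws cloudformation describe-stacks --stack-name cloudera-edh, which also reports the stack outputs once the launch completes.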

-- Jeff;

Paul Downey (British Telecom)One CSV, thirty stories: 3. Minimal viable histograms

This is day 3 of One CSV, 30 stories, a series of articles exploring price paid data from the Land Registry found on GOV.UK. The code for this and the other articles is available as open source from GitHub.

Yesterday we counted the occurrence of values in each column and made some frequency data. Processing the files and writing up the findings took me a bit longer than I'd have liked. Today I've a seat on the train and aim to use my commute home to timebox some quick and dirty histograms. These are a cinch in awk for anyone who grew up playing with BASIC, which in my case involved chewing up rolls of paper on my sixth-form college's teletype link to a Univac 1100/60 at Teesside Polytechnic:

$ head -25 price.tsv | awk 'NR == 1 {
    width = 60
    max = $1;
}
{
    title = $2;
    size = width * ($1 / max);
    bar = "";
    for (i = 0; i < size; i++)
        bar = bar"#";
    count = sprintf("(%d)", $1);
    printf "%-10s %10s %s\n", title, count, bar;
}'

which gives:

250000            (208199) #################################################################
125000            (185912) ###########################################################
120000            (163323) ###################################################
60000             (159519) ##################################################
110000            (147645) ###############################################
150000            (145214) ##############################################
115000            (140833) ############################################
135000            (134731) ###########################################
175000            (131334) ##########################################
85000             (131223) #########################################
130000            (129597) #########################################
105000            (129336) #########################################
165000            (126161) ########################################
95000             (126004) ########################################
145000            (124379) #######################################
75000             (123968) #######################################
140000            (123893) #######################################
160000            (123451) #######################################
90000             (123340) #######################################
100000            (120306) ######################################
80000             (119776) ######################################
155000            (115309) ####################################
185000            (111410) ###################################
180000            (111090) ###################################
65000             (109939) ###################################

Putting this awk script into its own file, histogram.awk, gives us a command we can use again and again.
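
The repository has the real thing; a minimal sketch of what bin/histogram.awk might contain is below. The shebang and the field widths are guesses on my part, and labels containing spaces (streets, counties) need a smarter split than the default whitespace one shown here. Remember to chmod +x it:

#!/usr/bin/awk -f
# bin/histogram.awk -- feed it "count label" lines, largest count first
NR == 1 {
    width = 65                  # length of the longest bar, in characters
    max = $1                    # the first count sets the scale
}
{
    title = $2
    size = width * ($1 / max)
    bar = ""
    for (i = 0; i < size; i++)
        bar = bar "#"
    count = sprintf("(%d)", $1)
    printf "%-17.17s %10s %s\n", title, count, bar
}

With that in place we can compare new versus old builds: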

$ bin/histogram.awk < data/new.tsv
N               (17351417) ############################################################
Y                (1974154) #######

freehold (F) versus leaseholds (L) versus uncategorised (U):

$ head -10 data/duration.tsv | bin/histogram.awk 
F               (14871813) #################################################################
L                (4450166) ####################
U                   (3592) #

and the distribution of prices paid is a typical long-tail:

$ head -80 data/price.tsv | bin/histogram.awk 
250000            (208199) #################################################################
125000            (185912) ###########################################################
120000            (163323) ###################################################
60000             (159519) ##################################################
110000            (147645) ###############################################
150000            (145214) ##############################################
115000            (140833) ############################################
135000            (134731) ###########################################
175000            (131334) ##########################################
85000             (131223) #########################################
130000            (129597) #########################################
105000            (129336) #########################################
165000            (126161) ########################################
95000             (126004) ########################################
145000            (124379) #######################################
75000             (123968) #######################################
140000            (123893) #######################################
160000            (123451) #######################################
90000             (123340) #######################################
100000            (120306) ######################################
80000             (119776) ######################################
155000            (115309) ####################################
185000            (111410) ###################################
180000            (111090) ###################################
65000             (109939) ###################################
170000            (109751) ###################################
70000             (106529) ##################################
55000             (103816) #################################
200000            (100245) ################################
50000              (98030) ###############################
210000             (96644) ###############################
195000             (96202) ###############################
190000             (95358) ##############################
220000             (94951) ##############################
225000             (94851) ##############################
45000              (94356) ##############################
40000              (88038) ############################
215000             (85757) ###########################
230000             (82970) ##########################
245000             (82390) ##########################
235000             (79704) #########################
240000             (78959) #########################
205000             (76377) ########################
59950              (75494) ########################
35000              (70778) #######################
58000              (65743) #####################
30000              (64087) #####################
52000              (63009) ####################
68000              (61675) ####################
275000             (60762) ###################
78000              (60281) ###################
249950             (60125) ###################
72000              (58979) ###################
42000              (58826) ###################
48000              (58184) ###################
59000              (57247) ##################
57000              (56613) ##################
54000              (56264) ##################
118000             (56147) ##################
82000              (55993) ##################
56000              (55033) ##################
53000              (55025) ##################
38000              (54752) ##################
67000              (52000) #################
87000              (51853) #################
112000             (51682) #################
285000             (51649) #################
92000              (51467) #################
88000              (51394) #################
300000             (51274) #################
46000              (51099) ################
280000             (50749) ################
84000              (49414) ################
43000              (49405) ################
47000              (49340) ################
64000              (49208) ################
83000              (49149) ################
74000              (48966) ################
73000              (48934) ################
270000             (48631) ################

Streets are something worth exploring another day; there's real history in them names:

$ head -20 street.tsv  | ../bin/histogram.awk 
                  (284236) #################################################################
HIGH STREET       (111407) ##########################
STATION ROAD       (61918) ###############
LONDON ROAD        (41652) ##########
CHURCH ROAD        (35596) #########
CHURCH STREET      (35022) #########
MAIN STREET        (34270) ########
PARK ROAD          (29846) #######
VICTORIA ROAD      (25819) ######
CHURCH LANE        (23414) ######
QUEENS ROAD        (21434) #####
NEW ROAD           (21423) #####
MAIN ROAD          (21308) #####
MANOR ROAD         (21069) #####
THE STREET         (21068) #####
WEST STREET        (17359) ####
GREEN LANE         (17231) ####
MILL LANE          (16908) ####
THE GREEN          (16886) ####
THE AVENUE         (16853) ####

and the other address fields are something we also need to dig into further because they’re pretty inconsistent. I assume that’s due to the longevity of this data, and differences in how the data is recorded by different people over the years:

$ head -10 data/locality.tsv | bin/histogram.awk
                 (4354411) #################################################################
LONDON            (915332) ##############
BIRMINGHAM        (112836) ##
MANCHESTER        (102982) ##
LIVERPOOL         (101127) ##
LEEDS              (90163) ##
BRISTOL            (89995) ##
SHEFFIELD          (77372) ##
BOURNEMOUTH        (61337) #
SOUTHAMPTON        (57342) #
$ head -10 town.tsv  | ../bin/histogram.awk
LONDON           (1499904) #################################################################
MANCHESTER        (312555) ##############
BRISTOL           (296232) #############
BIRMINGHAM        (285479) #############
NOTTINGHAM        (251575) ###########
LEEDS             (217870) ##########
LIVERPOOL         (190190) #########
SHEFFIELD         (183048) ########
LEICESTER         (169283) ########
SOUTHAMPTON       (161070) #######
$ head -10 district.tsv  | ../bin/histogram.awk
BIRMINGHAM        (287828) #################################################################
LEEDS             (257632) ###########################################################
BRADFORD          (173120) ########################################
SHEFFIELD         (155900) ####################################
MANCHESTER        (154451) ###################################
CITY OF BRISTOL   (149031) ##################################
WANDSWORTH        (134851) ###############################
KIRKLEES          (131368) ##############################
LIVERPOOL         (126932) #############################
EAST RIDING OF YO (125271) #############################
$ head -10 county.tsv  | ../bin/histogram.awk 
GREATER LONDON   (2520251) #################################################################
GREATER MANCHESTE (843778) ######################
WEST MIDLANDS     (739206) ####################
WEST YORKSHIRE    (733941) ###################
KENT              (544938) ###############
ESSEX             (544377) ###############
HAMPSHIRE         (516377) ##############
SURREY            (450990) ############
LANCASHIRE        (435968) ############
HERTFORDSHIRE     (426834) ############

Time’s up! Tomorrow we should make some timelines and dig more deeply into those prices.

ProgrammableWebInteractive Brokers Hopes to Encourage Developer Engagement With New API strategy

Interactive Brokers, a leading forex broker-dealer, has begun tests for a Mobile API. The API will allow third-party developers to integrate with Interactive Brokers accounts.

ProgrammableWebOrchestrate Announces New Enterprise Multimodel Database Service and Updated Pricing

Last week, Orchestrate announced the addition of geospatial search to its database service, allowing developers to build Web, mobile and Internet of Things (IoT) applications that utilize a complete database and data processing solution. Orchestrate has just announced the availability of the new multimodel database service portfolio for enterprises.

ProgrammableWebWhy APIs Help Drive Today's Corporate Acquisitions

APIs continue to create tremendous value for organizations. In a recent article written for Nordic APIs, Mark Boyd considered the transformative power of cloud computing, and the disruption it is stirring in the corporate acquisition market. With today's ability to include location, payment, social networking, messaging, CRM, and more services into one platform through the use of specific APIs, these interfaces are playing a more important role in influencing corporate acquisitions, valuations, and garnering interest in new technologies.

ProgrammableWebWise.io API Puts Machine Learning Functionality at Developers' Fingertips

For business, the power of machine learning is a game-changing tool in the understanding of customer behavior. The ability to find patterns and trends buried deep in complicated data makes creating a more optimized customer experience an exciting reality. Wise.io leverages the power of machine learning in its applications, providing users with the ability to do just that. The Wise.io API gives developers access to this functionality.

ProgrammableWebSnapchat Not Blameless for Snappening After Snapsaved Accepts Guilt

Last week, popular photo-sharing service Snapchat made headlines after photos its users had sent were leaked on 4chan, an online forum where photos obtained in other high-profile breaches have been previously released. Over the weekend, Snapsaved.com, an unauthorized website that enabled Snapchat users to save snaps, acknowledged that it was the source of the incident.

ProgrammableWebNew Ziftr API Will Support Major CryptoCurrencies

Ziftr, an eCommerce payment provider, recently announced that its soon-to-be-released improved API will allow merchants not only to accept Bitcoin for transactions but also to accept other major cryptocurrencies on the market. So far, the company has confirmed Litecoin, and though nothing else is officially announced, the industry expects others such as Peercoin, Dogecoin, and Darkcoin to join the family of alternative payment options for Ziftr merchants and users.

Shelley Powers (Burningbird)Koster's Right to...collect large campaign contributions from big Agribusiness Interests

Koster and Nixon

Photo of Chris Koster, left, and Jay Nixon by Missouri News Horizon, used without edit, shared under CC by 2.0

I was not surprised to read that Missouri's Attorney General, Chris Koster, has come out in support of Amendment 1, the so-called Right to Farm Amendment. He used Missouri taxpayer money to sue the state of California on behalf of a few large egg producers in the state. It's pretty obvious that Mr. Koster is regretting that whole "I'm now a Democrat" thing, especially among the rural, large agribusiness types.

It must also sadden Mr. Koster to realize that the egg lawsuit isn't doing all that well. California and other intervenor defendants moved to dismiss the lawsuit under the reasonable claim that Missouri can't sue because it doesn't have standing. States can sue, but only if a significant percentage of the population of the state is impacted by the lawsuit. I don't work in the chicken or egg industry, and I have a strong suspicion neither does a significant number of other Missourians.

I strongly doubt the lawsuit will survive, and I believe Koster knows this, which is why he claims the costs will be less than $10,000. But it doesn't matter in the end, because it makes Koster look real good to large agribusiness interests in the state. Large agribusiness interests that are known to donate big bucks to election campaigns.

Koster's support for Amendment 1 is more of the same. You'd think a state Attorney General would know the costly, negative impacts of such a vaguely worded piece of legislation. Legal analysis has demonstrated that the Right to Farm Amendment is an awful piece of drivel that will clutter up an already cluttered state Constitution and cost millions to defend in court. At best. At worst it can mire critical decisions in uncertainty and contentiousness.

But there you go, Koster supports Amendment 1, and he supports his ill-considered lawsuit against California. So does a new libertarian legal entity in Missouri called the Missouri Liberty Project. They filed an Amicus Curiae brief in opposition to California's motion to dismiss. The group was founded by Joshua Hawley, and if you don't recognize that name, he's one of the lawyers who represented Hobby Lobby in its successful drive to get the Supreme Court to recognize that corporations can go to church on Sunday.

Josh Hawley's name might also be familiar to those interested in the Amendment 1 vote, because Hawley wrote a piece in favor of the amendment, just before Nixon decided to put it to the vote in August instead of November. The piece contains the usual references to farming and how it hasn't changed all that much since Jefferson's time (just ignore those CAFOs with rivers of pink manure, and genetically altered corn implanted with some kind of bug DNA). According to him, all this innocent little Amendment will do is ensure that farmers can continue doing what farmers have been doing since the dawn of time.

But then he slips a little and writes:

Unfortunately, Missouri agriculture is under attack from government bureaucrats and outside interest groups that want to tie down farmers with burdensome regulations.

But, but...don't all industries have to conform to one regulation or another? After all, we don't allow oil companies to drill anywhere they want, nor can coal-fired utilities dump their waste into waters, putting wildlife, livestock, and people at risk—not without suffering consequences. Car makers have to ensure air bags inflate when they're supposed to, planes really do need to stay in the air, and most of us just hate it when banks—or cable companies—rip us off.

Come to think of it, we're not all that happy when we find out that rare beef hamburger we just ate comes with a generous dose of E. coli, or that the cow that supplies the milk the little ones drink can glow in the dark. And yeah, I don't really want to eat the meat of an animal that's so sick, some idiot in a fork lift has to lift her up to get her ready to be killed. It's that whole, not wanting to die because I lost the luck of the draw in the food safety game, thing.

Then there's that whole issue of clean water. I've seen some creeks and streams in Missouri that are so clear, you can see the fins on the tiny fish that inhabit them. I'd really hate to see these streams turned pink and murky, and all those cute little fish killed off before they have a chance to develop into nice big trout.

Though I live in the city, I'm also sympathetic to the small, organic farmer, fighting to keep pesticide off his or her field, and the country home dweller who once lived next to a corn field, only to wake up to 5,000 hogs the next day. Right next door.

And yeah, I like puppies. I like dogs, cats, horses, elephants, bats, bees, deer, and whole host of critters. I hate to think of any of them suffering or dying unnecessarily because of greed, stupidity, and cruelty. True, some animals we make into pets, some we eat, and some we leave alone, but that doesn't mean any of them deserve abuse. I'd like to think we humans are better than that.

Regulations may sound scary, but they ensure we all have at least a fighting chance for a decent life. And a fighting chance to be decent human beings.

Amendment 1 isn't about outside interest groups—it's about people who live here, in Missouri. It's about our interests, our concerns, and our responsibilities. Right to Farm sounds innocent, but it's a backdoor method to undermine every county, city, and state-based regulation that impacts on anything even remotely related to agriculture. And it's a way of making it almost impossible to adapt our laws to new information, new concerns, and new discoveries. It permanently enshrines the worst of behavior into the State Constitution for one single industry.

I'm thankful that Josh Hawley at least had the decency to come out and say that Amendment 1 is about undermining state and other regulations. That's more than you'll get from Chris Koster.

After reading all this, do you still think it's all about the farmers? How's this then: Go move a mile down river from a CAFO with 5,000 hogs and then tell me you're going to vote for Amendment 1. Just be careful not to step in the pink stuff on your way to the poll.

Shelley Powers (Burningbird)Posse

old photo of posse

1890 photo of posse taken by Barbara Harris, uploaded by Chuck Coker. Uploaded photo cc by-nd 2.0

I was disappointed to see Kathy Sierra leave Twitter, but respect her decision to do so. I read her writing about why she left, but I also dug through past Twitter postings with the individual she references, Rob Graham. Twitter is not the best of places for thoughtful discussion when parties agree, but it is especially bad when two people have views that are diametrically opposed.

I knew the people involved with Kathy's original leaving years ago. Or I should say, I knew a group of people who got conflated with others in a case of rotten timing.

Three different events happened in the same period.

1. A group of people wanted to start a site where people could speak freely, even critically. Abusive, childish photos were posted related to Kathy, as well as racist comments made about another well known woman in the tech community. The site was immediately shut down by the originators. Rogers Cadenhead wrote a good summary of this event.

2. In comments to a weblog post Kathy posted, a man suggested the worst, most violent act be committed on Kathy. Later, we discovered he was a British ex-pat who lived in Spain.

3. Another individual posted personal information about Kathy, including her Social Security Number and address. He did so in a highly fabricated context, making the act that much worse. In a 2008 New York Times article, a man who goes by the name "weev" took credit for the posting. Weev's real name is Andrew Auernheimer. Auernheimer also took credit for the posting in an article for Esquire.

Individually, these three acts would be enough to stress any individual, but coming at the same time, it could feel like a conspiracy to the impacted person. But it was not a conspiracy. Each was an individual act, not some form of black internet ops of the unknowns against the famous. It is important to understand that the acts were independent of each other.

Andrew "weev" Auernheimer was arrested and convicted for violations of the Computer Fraud and Abuse Act (CFAA) for an unrelated incident, but was exonerated and released earlier this year. He was somewhat of a cause célèbre in tech and transparency circles, where the CFAA is universally loathed. Understandably, Kathy was less than happy about the celebration of a man who claimed responsibility for a posting that caused her so much pain.

Fast forward to recent events. On October 4, Rob Graham, who tweets as ErrataRob, got into a Twitter discussion with Kathy Sierra (@seriouspony) and other individuals. I managed to capture a PDF of the tweets and replies, though by this time Kathy's tweets are gone. You'll have to dig through the recent postings until you get to the right day (October 4). The links to conversations work, so you can expand the discussions if you wish.

Graham believes, strongly, that weev was incorrectly prosecuted for violations of the CFAA. Evidently, one or more individuals expressed an opinion to Graham that weev should be jailed for what happened to Kathy. He disagrees with this because, as he later wrote, "there is no evidence supporting such a conviction".

As I pointed out on Twitter, we can’t believe Weev either way. He is notoriously unreliable. We can’t trust his denials today, but at the same time, we can’t trust his statements from 2008. As I pointed out on Twitter, Weev has claimed credit for trolls that he was at best only peripherally involved in. Yet, Kathy Sierra insultingly claims this means I somehow believe Weev.

Kathy wrote of her reaction:

But a few days ago, in the middle of one of those “discussions”, this time with @erratarob, I realized it wasn’t worth it. He concluded that I was just trolling so people would troll me back. I asked him what he thought I should have done. And his answer was “don’t feed the trolls.” “Ignore it and move on.” Perhaps Rob didn’t know that I'd already tried that for six years, but that it was weev who kept that damn thing alive no matter how gone I was. He managed to tweet to my social security number not long before he went to prison, and well before I resurfaced. No, I didn't troll him into that. I didn't "engage".

But Rob didn’t do anything wrong. He was saying what he truly believes. What, sadly, a whole lot of people in tech believe. Rob just happened to be the last “you asked for it” message I wanted to hear. So I just stopped.

What Graham had said was:

@seriouspony you are a passive-aggressive troll, a different kind of troll than weev's naked aggression, but a troll nonetheless.

Graham stated he politely responded to Kathy's Twitter posts; I can't quite see the politeness in this response. Regardless, it's important to understand the context of Kathy's "you asked for it".

Rob Graham and Kathy Sierra approached this Twitter discussion from positions that are black and white. Graham doesn't believe weev's claims, and definitely doesn't believe that weev should be prosecuted for something without proof. Kathy believes the claims weev made in the past, and while she isn't advocating jail time for him, she is not happy with the acclaim weev is receiving in tech circles. There is no middle ground, no gray area where they can meet and find some commonality.

This really is the end of the story. Rob Graham did not drive Kathy off of Twitter, the web, or the internet. Kathy decided Twitter was not a healthy place for her, and she left. They disagree on whether weev is the man responsible for the posting of her personal information. They disagree in how trolls should be handled.

There is no need of a posse. Nothing needs to be done about this specific event. No change needs to be made, and no larger story needs to be told.

That tech women have been the recipients of harassment is a larger story, and continues to be written. The never ending flow of naughty boys and girls who infest our online lives is another larger story, and I don't see an ending for this one. But the exchange between Kathy and Rob is not a chapter in either story. It's just two people who don't know each other and who profoundly disagree discovering no number of 140 character or less posts will change these circumstances.

If you respect and/or care for the individuals, you should support them whatever the cause of pain and discomfort, but that doesn't mean you have to find someone to hang over the nearest branch. Not every difficult event that happens to people we care about requires a posse.

Jeremy Keith (Adactio)Stupid brain

I went to the States to speak at the Artifact conference in Providence (which was great). I extended the trip so that I could make it to Science Hack Day in San Francisco (which was also great). Then I made my way back with a stopover in New York for the fifth and final Brooklyn Beta (which was, you guessed it, great).

The last day of Brooklyn Beta was a big friendly affair with close to a thousand people descending on a hangar-like building in Brooklyn's Navy Yard. But it was the preceding two days in the much cosier environs of The Invisible Dog that really captured the spirit of the event.

The talks were great—John Maeda! David Lowery!—but the real reason for going all the way to Brooklyn for this event was to hang out with the people. Old friends, new friends; just nice people all ‘round.

But it felt strange this year, and not just because it was the last time.

At the end of the second day, people were encouraged to spontaneously get up on stage, introduce themselves, and then introduce someone that they think is a great person, working on something interesting (that twist was Sam’s idea).

I didn’t get up on stage. The person I would’ve introduced wasn’t there. I wish she had been. Mind you, she would’ve absolutely hated being called out like that.

Chloe wasn’t there. Of course she wasn’t there. How could she be there?

But there was this stupid, stupid part of my brain that kept expecting to see her walk into the room. That stupid, stupid part of my brain that still expected that I’d spend Brooklyn Beta sitting next to Chloe because, after all, we always ended up sitting together.

(I think it must be the same stupid part of my brain that still expects to see her name pop up in my chat client every morning.)

By the time the third day rolled around in the bigger venue, I thought it wouldn’t be so bad, what with it not being in the same location. But that stupid, stupid part of my brain just wouldn’t give up. Every time I looked around the room and caught a glimpse of someone in the distance who had the same length hair as Chloe, or dressed like her, or just had a bag slung over her hip just so …that stupid, stupid part of my brain would trigger a jolt of recognition, and then I’d have that horrible sinking feeling (literally, like something inside of me was sinking down) when the rational part of my brain corrected the stupid, stupid part.

I think that deep down, there’s a part of me—a stupid, stupid part of me—that still doesn’t quite believe that she’s gone.

Paul Downey (British Telecom)One CSV, thirty stories: 2. Counting things

This is day 2 of One CSV, 30 stories, a series of articles exploring price paid data from the Land Registry found on GOV.UK. The code for this and the other articles is available as open source from GitHub.

Statistics: The science of producing unreliable facts from reliable figures — Evan Esar

The file we made yesterday contains 19 million property transactions. Let’s use awk to find some basic information:

$ cut -f1 pp.tsv | awk 'NR == 1 {
    min = $1;
    max = $1;
}
{
    if ($1 < min) min = $1;
    if ($1 > max) max = $1;
    sum += $1;
    sumsq += $1 * $1
}
END {
    printf "count\t%d\n", NR
    printf "min\t%d\n", min
    printf "max\t%d\n", max
    printf "sum\t%d\n", sum
    printf "mean\t%f\n", sum/NR
    printf "sd\t%f\n", sqrt(sumsq/NR - (sum/NR)**2)
}' > stats.tsv

That gives us some basic statistics about the property prices contained within the file:

$ cat stats.tsv
count  19325571
min    5050
max    54959000
sum    3078384329109
mean   159290.730872
sd     180368.570700

Which tells us our file contains a record of more than £3 trillion transacted over the course of a number of years, but over how many years? We can find that out by chopping out the date column, removing the month and day and counting the uniquely sorted result:

$ cut -f2 < data/pp.tsv | sed 's/-.*$//' | sort | uniq | wc -l
20

The standard deviation makes me think the median price would be useful. We can use sort to find that, along with the count of records:

$ cut -f1 < pp.tsv | sort | sed -n $(expr $(grep count stats.tsv|cut -f2) / 2)p
265000
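
One caveat: sort on its own compares lines lexically rather than numerically, so if you want the middle value of a strictly numeric ordering of the prices, add the -n flag to the same pipeline:

$ cut -f1 < pp.tsv | sort -n | sed -n $(expr $(grep count stats.tsv | cut -f2) / 2)p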

Judging from enterprisey emails in my inbox, some people are quite excited about Data Warehousing and cool things like Hadoop, but for this kind of experimental hackery Unix sort is great. I guess it’s had 40 odd years of computer scientists showing off by optimising the heck out of the open source code. There’s an idiom of sort which I use a lot to find the distribution of a data item, for example we can quickly find the busiest days using:

cut -f2 pp.tsv | sort | uniq -c | sort -rn | head

I say quickly, but even with the wonder of sort, counting the occurrences of every value in such a large dataset is a reasonably expensive operation and we’re sorting things a lot, so let’s create some index files as a one-off activity. I’m sure they’ll come in handy later:

while read column title
do
    cat data/pp.tsv |
        cut -d'	' -f$column |
        sort |
        uniq -c |
        sort -rn |
        sed -e 's/^ *//' -e 's/  */⋯/g' > $title.tsv
done <<-!
1   price
2   date
3   postcode
4   type
5   new
6   duration
7   paon
8   saon
9   street
10  locality
11  town
12  district
13  county
14  status
!

You might like to make some tea whilst that happens. You could probably use your laptop to warm the pot. When it’s complete we have a file for each column, allowing us to find the busiest days:

$ head date.tsv 
26299 2001-06-29
26154 2002-06-28
26141 2002-05-31
25454 2003-11-28
24889 2007-06-29
24749 2000-06-30
24146 2006-06-30
24138 1999-05-28
23195 2000-03-31
22870 2003-12-19

and the most popular prices:

$ head price.tsv
208199 250000
185912 125000
163323 120000
159519 60000
147645 110000
145214 150000
140833 115000
134731 135000
131334 175000
131223 85000

So that’s the mode for each column, and a breakdown of categories such as the number of recorded transactions on new versus old builds:

$ cat new.tsv
17351417 N
1974154 Y

and the most active postcodes:

$ head postcode.tsv
29913⋯
280⋯TR8 4LX
274⋯CM21 9PF
266⋯B5 4TR
260⋯BS3 3NG
258⋯CM7 1WG
255⋯N7 6JT
253⋯HA1 2EX
248⋯W2 6HP
242⋯M3 5AS

Shame the most popular postcode is blank. That could be for legitimate reasons; after all, not every parcel of land bought or sold has a postal address. We’ll get to that another day.

I’ve gone this far without looking into any particular record. That’s because the data contains addresses and it feels a little strange to highlight what is after all probably someone’s home. Ideally I’d cite somewhere such as Admiralty Arch or the Land Registry Head Office but the dataset excludes quite a few transactions including those on commercial properties, and leaseholds. That’s definitely a thing I should talk to people about.

To be fair and reduce the risk of weirding someone out I need to pick on a property almost at random. I was quite interested in the maximum price paid for a property. Let’s look for that one:

$ grep "54959000"  data/pp.tsv
54959000    2012-03-26  SW10 9SU    S   N   F   20      THE BOLTONS     LONDON  KENSINGTON AND CHELSEA  GREATER LONDON  A

It’s a very large house in Chelsea! No surprise there then. If I was interested I could now go to the Land Registry view title service, pay £3 and find out who lives there, details of others with an interest in the property including any mortgage details, a title plan and possibly other restrictions such as any rights of way and if they have to pay to repair the local church spire. This is what Theodore Burton Fox Ruoff called looking behind the curtain.

Anyway, back to the spelunking. Something that is noticeable is how slowly that grep command operates on our large file:

$ time grep "54959000"  data/pp.tsv
...
real    0m46.950s
user    0m45.926s
sys 0m0.837s

Maybe we can speed that up a little using a regular expression:

$ time egrep "^54959000"  data/pp.tsv
...
real    0m30.036s
user    0m29.130s
sys 0m0.727s

Which is still quite slow. I hope I’ve shown how plugging around with the Unix pipeline can be quick and fun, but it’s easy to forget how quick a programming language can be, even one as simplistic as awk:

$ time awk '$1 == "54959000" { print }' <  data/pp.tsv 
...
real    0m11.475s
user    0m8.553s
sys 0m1.086s

With that my lap is suffering from local warming, it’s very late, and I’m quite tired so I think that’s probably enough command line bashing for now. Statistics are great, but it’s quite hard to grok numbers. Tomorrow we should probably draw some charts.

Norman Walsh (Sun)Tarot scoring


Score keeping for the French card game Tarot is way too difficult, especially after a couple of cocktails. Here's my attempt to fix that problem.

Tarot is a French trick-taking card game. There are variations for three, four, or five players. Play requires a special deck of cards with 21 trumps, four suits of 14 cards, and a fool. The cards are an aesthetically pleasing shape and the decks I've seen have all been beautiful. You can find the rest of the rules on the Wikipedia French tarot page, of course.

It's a fun game, but the scoring is…complicated. Given, as I said, that we usually play after a couple of cocktails, often too complicated.

The answer, clearly, is a web app to do the scoring! Well, we all thought that was the clear answer, especially if I was tapped to write it.

So I have. It's hosted on tarotscore.nwalsh.com and GitHub.

The current state of play is that I have the scoring page for four players but not three or five. Also, it's just about as ugly as sin. They're called “pull requests”, folks.

I plan to implement it as an “offline” app at some point, but I haven't had the time yet.

Share and enjoy!

P.S. If you're struggling for cocktail recipes, I've got that covered too.

