Tuesday, January 17, 2017

GitHub Growth Appears Scale Free

In 2013, a blogger claimed that the growth of GitHub (GH) users follows a certain type of diffusion model called Bass diffusion. Here, growth refers to the number of unique user IDs as a function of time, not the number of project repositories, which can have a high degree of multiplicity.

In a response, I tweeted a plot that suggested GH growth might be following a power law, aka scale free growth. The tell-tale sign is the asymptotic linearity of the growth data on double-log axes, which the original blog post did not discuss. The periods on the x-axis correspond to years, with the first period representing calendar year 2008 and the fifth period being the year 2012.

Scale free networks can arise from preferential attachment to super-nodes that have a higher vertex degree, and therefore more connections to other nodes, i.e., a kind of rich-get-richer effect. The same mechanism plausibly applies to GH growth viewed as a particular kind of social network, where the super-nodes correspond to influential users who attract prospective GH users to open a new account and contribute to their projects.

On this basis, I predicted GH would reach 4 million users during October 2013 and 5 million users during April 2014 (yellow points in the Linear axes plot below). In fact, GH reached those values slightly earlier than predicted by the power law model, and slightly later than the dates predicted by the diffusion model (modulo unreported errors in the data).

Since 2013, new data have been reported, so I extended my previous analysis. Details of the respective models are contained in the R script at the end of this post. In the Linear axes plot below, the diffusion model and the power law model essentially form an envelope around the newer data: diffusive on the upper side (red curve) and power law on the lower side (blue curve). In this sense, it could be argued that the jury is still out on which model offers the more reliable predictions.

However, there is an aspect of the diffusion model that was overlooked in 2013. It predicts that GH growth will eventually plateau at 20 million users in 2020 (the 12th period, not shown) because it is a type of logistic function that has a characteristic sigmoidal or 'S' shape. The beginning of this leveling off (the top of the 'S') is apparent in the 10th period (i.e., 2017). By contrast, the power law model predicts that GH will reach 23.65 million users by the end of 2017 (yellow point). Whereas the two curves envelop the more recent data in periods 6–9, they start to diverge significantly in the 10th period.

"GitHub is not the only player in the market. Other companies like GitLab are doing a good job but GitHub has a huge head start and the advantage of the network effect around public repositories. Although GitHub’s network effect is weaker compared to the likes of Facebook/Twitter or Lyft/Uber, they are the default choice right now."  —GitHub is Doing Much Better Than Bloomberg Thinks
Although there will inevitably be an equilibrium bound on the number of active GH users, it seems unlikely to be as small as 20 million, given the combination of GH's first-mover advantage and its current popularity. Presumably the private investors in GH also hope it will be a large number. This year will tell.
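As a quick numerical illustration (my own sketch, not part of the original analysis), the Bass diffusion curve plotted by the R script below can be evaluated directly. The helper name `bass_users` is mine; the parameter values M = 21 million, p = 0.003, and q = 0.83 are the ones hard-coded in the script:

```python
from math import exp

def bass_users(t, M=21e6, p=0.003, q=0.83):
    """Bass diffusion model: a logistic-type S-curve, where p is the
    innovation (external) coefficient and q the imitation (internal)
    coefficient. Parameter values are taken from the R script below."""
    return M * (1 - exp(-(p + q) * t)) / (1 + (q / p) * exp(-(p + q) * t))

# The curve saturates toward M as t grows, which is why the diffusion
# model predicts a plateau of roughly 20 million users by the 12th period.
for t in (5, 8, 10, 12):
    print(t, round(bass_users(t) / 1e6, 2))
```

Evaluating at t = 5 (calendar 2012) gives roughly 3.9 million users, consistent with the early data, while t = 12 is already within a few percent of the 21 million ceiling.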

# Data source ... https://classic.scraperwiki.com/scrapers/github_users_each_year/

#LINEAR axes plot
plot(df.gh3$index, df.gh3$users, xlab="Period (years)", 
     ylab="Users (million)", col="gray", 
     ylim=c(0, 3e7), xaxt="n", yaxt="n")
axis(side=1, tck=1, at=c(0, seq(12,120,12)), labels=0:10, 
     col.ticks="lightgray", lty="dotted")
axis(side=2, tck=1, at=c(0, 10e6, 20e6, 30e6), labels=c(0,10,20,30), 
     col.ticks="lightgray", lty="dotted")

# Simple exp model
curve(coef(gh.exp)[2] * exp(coef(gh.exp)[1] * (x/13)), 
      from=1, to=108, add=TRUE, col="red2", lty="dotdash")

# Super-exp model 
curve(49100 * (x/13) * exp(0.54 * (x/13)), 
      from=1, to=120, add=TRUE, col="red", lty="dashed")

# Bass diffusion model
curve(21e6 * ( 1 - exp(-(0.003 + 0.83) * (x/13)) ) / ( 1 + (0.83 / 0.003) * exp(-(0.003 + 0.83) * (x/13)) ), 
      from=1, to=120, add=TRUE, col="red")

# Power law model
curve(10^coef(gh.fit)[2] * (x/13)^coef(gh.fit)[1], from=1, to=120, add=TRUE, 
      col="blue")
title(main="Linear axes: GitHub Growth 2008-2017")
legend("topleft", 
       legend=c("Original data", "New data", "Predictions", "Exponential", "Super exp", "Bass diffusion", "Scale free"), 
       lty=c(NA,NA,NA,4,2,1,1), pch=c(1,19,21,NA,NA,NA,NA), 
       col=c("gray", "black", "yellow", "red", "red", "red", "blue"), 
       pt.bg=c(NA,NA,"yellow",NA,NA,NA,NA),
       cex=0.75, inset=0.05)

Saturday, October 8, 2016

Crib Sheet for Emulating Web Traffic

Our paper, How to Emulate Web Traffic Using Standard Load Testing Tools, is now available online and will be presented at the upcoming CMG conference in November.

Presenter: James Brady
Session Number: 436
Subject Area: APM
Session Date: WED, November 9, 2016
Session Time: 1:00 PM - 2:00 PM
Session Room: PortofinoB

Saturday, October 1, 2016

A Clue for Remembering Little's Law

During the Guerrilla Data Analysis class last week, alumnus Jeff P. came up with this novel mnemonic device for remembering all three forms of Little's law for a queue with mean arrival rate $\lambda$.

The right-hand side of each equation representing a version of Little's law is written vertically in the order $R, W, S$, which matches the expression $R=W+S$ for the mean residence time, viz., the sum of the mean waiting time ($W$) and the mean service time ($S$).

The letters on the left-hand side: $Q, L, U$ (reading vertically) respectively correspond to the queueing metrics: queue-length, waiting-line length, and utilization, which can be read as the word clue.
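Written out explicitly (these are the three standard forms of Little's law), the stacking looks like this, with $Q, L, U$ down the left and $R, W, S$ down the right: \begin{equation*} \begin{aligned} Q \, &= \, \lambda R \\ L \, &= \, \lambda W \\ U \, &= \, \lambda S \end{aligned} \end{equation*}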

Incidentally, the middle formula is the version that appears in the title of John Little's original paper:

J. D. C. Little, "A Proof for the Queuing Formula: L = λ W,"
Operations Research, 9(3): 383–387 (1961).

Wednesday, August 3, 2016

PDQ as a Performance Periscope

This is a guest post by performance modeling enthusiast, Mohit Chawla, who brought the following very interesting example to my attention. In contrast to many of the examples in my Perl::PDQ book, these performance data come from a production web server, not a test rig. —NJG

Performance analysts usually build performance models based on their understanding of the software application's behavior. However, this post describes how a PDQ model acted like a periscope and also turned out to be pedagogical by uncovering otherwise hidden details about the application's inner workings and performance characteristics.

Some Background

Thursday, July 28, 2016

Erlang Redux Resolved! (This time for real)

As I show in my Perl::PDQ book, the residence time at an M/M/1 queue is trivial to derive and (unlike most queueing theory texts) does not require any probability theory arguments. Great for Guerrillas! However, by simply adding another server (i.e., M/M/2), that same Guerrilla approach falls apart. This situation has always bothered me profoundly and on several occasions I thought I saw how to get to the exact formula—the Erlang C formula—Guerrilla style. But, on later review, I always found something wrong.

Although I've certainly had correct pieces of the puzzle, at various times, I could never get everything to fit in a completely consistent way. No matter how creative I got, I always found a fly in the ointment. The best I had been able to come up with is what I call the "morphing model" approximation where you start out with $m$ parallel queues at low loads and it morphs into a single $m$-times faster M/M/1 queue at high loads.

That model is also exact for $m = 2$ servers—which is some kind of progress, but not much. Consequently, despite a few misplaced enthusiastic announcements in the past, I've never been able to publish the fully corrected morphing model.

Previous misfires have included:

  • Falsely claimed in this 2008 blog post. There, I show a Table of how complicated the Erlang B and C functions are when expressed as rational functions of $\rho$ (the per-server utilization) for $m = 1, 2, \ldots, 6$ service facilities.
  • As a side note, it's precisely those impenetrable polynomials with unfathomable coefficients that completely put me off the approach I am about to describe here.
  • In the Comments section of that same post, Boris Solovyov asked in 2013: "Did you ever write this out in full?" I sheepishly had to report that I hadn't had time to pursue it (which is mostly true). I hope he reads this post.
  • A more intuitive explanation of the motivation for the morphing model was given in this 2011 blog post, but no advance on the long sought-after correction terms.
  • Falsely claimed again in this 2012 blog post. There's a photo of one of my whiteboards, deliberately kept inscrutably small—a good choice, as it turned out.

I think I know how Kepler must've felt. The difference between an M/M/1 queue and an M/M/m queue is like the difference between a circle and an ellipse. Just by changing the width slightly, the calculation of the perimeter becomes enormously complicated; in fact, it has no closed-form solution in terms of elementary functions! Now, however, I believe I finally have the correct approach for sure. Really!... Yep... Uh huh... No, seriously. Well, see for yourself.

The Starting Point

Start with the morphing approximation for the M/M/m waiting time in my Perl::PDQ book—Chap. 2 in the original 2004 edition or Chap. 4 in the 2nd 2011 edition. \begin{equation} \Phi_W^{approx} (m, \rho) \, = \, \frac{1}{\rho^{-m} \, \sum_{k=0}^{m-1} \rho^k} \label{eqn:morphfun} \end{equation} This definition has a denominator involving a truncated geometric series in $\rho$. It is used to derive the Guerrilla approximation for the residence time at an M/M/m queue with mean service time $S$ \begin{equation*} R^{approx} \, = \, \frac{S}{1 - \rho^m} \end{equation*} which is exact for $m = 1, 2$. The exact general formula for the residence time is given by \begin{equation*} R^{exact} \, = \, S \; + \; \frac{C(m, \rho)}{m( 1 - \rho)} \, S \end{equation*} where $C(m, \rho)$ is the Erlang C function defined in (\ref{sec:realEC}).

Five Easy Pieces

The following 5 steps provide the corrections to the approximation in (\ref{eqn:morphfun}) and result in the exact Erlang C function without resorting to the usual probability arguments found in almost all queueing theory textbooks. Along the way, and quite unexpectedly (for me), I derive the Erlang B function as an intermediate result. Originally, I thought I would have to introduce that function as a kind of axiomatic probability. I wasn't thinking of it as a major waypoint. As far as I'm aware, this derivation has never been presented before because you have to be sufficiently perverse to have thought up the morphing approximation in the first place.
  1. Replace $\rho$ in (\ref{eqn:morphfun}) by the traffic intensity $a = m \rho$ \begin{equation} \Phi_W^{approx} (m, a) \, = \, \frac{1}{a^{-m} \, \sum_{k=0}^{m-1} ~a^k} \label{eqn:morpha} \end{equation}
  2. Convert $\Phi_W^{approx}$ to a truncated exponential series by applying $a^n \mapsto a^n / n!$ \begin{equation} \frac{1}{m! a^{-m} \, \sum_{k=0}^{m-1}~\frac{a^k}{k!}} \end{equation}
  3. Extend the summation over all $m$ servers to yield the famous Erlang B function for an M/M/m/m queue \begin{equation} B(m, a) \, = \, \frac{\frac{a^m}{m!}}{\sum_{k=0}^{m}~\frac{a^k}{k!}} \label{eqn:EB} \end{equation} Historically, $B(m, a)$ has been associated with the probability that an incoming telephone call is blocked and lost from the system, e.g., gets an engaged signal. I can't remember the last time I heard an engaged signal. I think they've been replaced by voice-mail and Muzak.
  4. Using the more compact notation: $A_m = a^m / m!$ and $\Sigma_k$ for the reduced sum over $(m - 1)$ terms, rewrite (\ref{eqn:EB}) as \begin{equation} B(m, a) \, = \, \frac{A_m}{A_m + \Sigma_k} \end{equation}
  5. Scale $A_m$ by $(1 - \rho)^{-1}$ to introduce the infinite possible wait states and arrive at \begin{equation} \Phi_W^{exact} (m, a) \, = \, \frac{1}{1 \, + \, (1 - \rho) \, m! \, a^{-m} \, \sum_{k=0}^{m-1}~\frac{a^k}{k!}} \label{eqn:myEC} \end{equation} which is the fully corrected version of (\ref{eqn:morphfun}).      Q.E.D.
Furthermore, it can be shown that (\ref{eqn:myEC}) is identical to the famous Erlang C function \begin{equation} C(m, a) \, = \, \frac{ \frac{a^m}{m!} \, \big(\frac{m}{m-a}\big) }{1 \, + \, \frac{a}{1!} \, + \, \frac{a^2}{2!} \, + \, \cdots \, + \, \frac{a^{m-1}}{(m-1)!} \, + \, \frac{a^m}{m!} \, \big(\frac{m}{m-a}\big) } \label{sec:realEC} \end{equation} Historically, $C(m, a)$ has been associated with the probability that an incoming telephone call must wait to get a connection (aka "call waiting"), rather than being dropped, as it is in B(m, a). However, since $C(m, a)$ determines the mean waiting time in any M/M/m queue, it applies to any multi-server system, e.g., modern multi-threaded applications.
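As a numerical cross-check (mine, not part of the original derivation), both forms can be evaluated and compared directly; they agree to machine precision:

```python
from math import factorial

def erlang_c_derived(m, a):
    """Eq. (myEC): the corrected morphing function, with rho = a/m."""
    rho = a / m
    partial = sum(a**k / factorial(k) for k in range(m))  # k = 0..m-1
    return 1.0 / (1.0 + (1.0 - rho) * factorial(m) * partial / a**m)

def erlang_c_classic(m, a):
    """Erlang's 1917 form (realEC)."""
    last = (a**m / factorial(m)) * (m / (m - a))
    return last / (sum(a**k / factorial(k) for k in range(m)) + last)

# e.g., m = 4 servers at 75% per-server utilization (a = 3 Erlangs)
print(erlang_c_derived(4, 3.0), erlang_c_classic(4, 3.0))  # both ~0.5094
```

For $m = 1$ both forms reduce to $C(1, \rho) = \rho$, as expected for M/M/1.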

Equation (\ref{sec:realEC}) was first published by A. K. Erlang in 1917 (he of the Copenhagen Telephone Company), along with (\ref{eqn:EB}). Thus, next year will be its centennial. Nice timing on my part (although that was never the plan—there never being any plan).

It's worth emphasizing that the morphing approximation (\ref{eqn:morphfun}) accounts for about 90% of what is going on with $R^{exact}$ in the M/M/m queue. The remaining 10% contains the minutiae regarding how the waiting line actually forms. But, as you can see from the above transformations, it's a rather subtle 10%.
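That 90% figure is easy to probe numerically. Here is my own sketch comparing $R^{approx}$ with $R^{exact}$, using the classical Erlang C form; it confirms the morphing approximation is exact for $m = 1, 2$ and only a few percent low for moderately larger $m$:

```python
from math import factorial

def erlang_c(m, a):
    """Classical Erlang C function C(m, a)."""
    last = (a**m / factorial(m)) * (m / (m - a))
    return last / (sum(a**k / factorial(k) for k in range(m)) + last)

def R_exact(m, rho, S=1.0):
    """R = S + C(m, a) * S / (m * (1 - rho)), with traffic a = m * rho."""
    return S + erlang_c(m, m * rho) * S / (m * (1 - rho))

def R_approx(m, rho, S=1.0):
    """Guerrilla morphing approximation: R = S / (1 - rho^m)."""
    return S / (1 - rho**m)

# Identical for m = 1 and m = 2; the approximation sits slightly
# below the exact residence time for larger m.
for m in (1, 2, 4, 8):
    print(m, round(R_approx(m, 0.75), 4), round(R_exact(m, 0.75), 4))
```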

Next Steps

This is just a sketch proof. I've suppressed a lot of details because there are many, and I have 100 pages of typeset notes to back that up. I know a lot about what doesn't work. (Sigh!) Now that I have the correct mathematical logic sketched out, I'm quite confident that it can also be supplemented with a more visual representation of how the corrections to the morphing function (\ref{eqn:morphfun}) arise.

Wednesday, June 8, 2016

2016 Guerrilla Training Schedule

After a six month hiatus working on a major consulting gig, Guerrilla training classes are back in business with the three classic courses: Guerrilla Bootcamp (GBOOT), Guerrilla Capacity Planning (GCAP) and Guerrilla Data Analysis Techniques (GDAT).

See what graduates are saying about these courses.

Some course highlights:

  • There are only 3 performance metrics you need to know
  • How to quantify scalability with the Universal Scalability Law
  • Hadoop performance and capacity management
  • Virtualization Spectrum from hyper-threads to cloud services
  • How to detect bad data
  • Statistical forecasting techniques
  • Machine learning algorithms applied to performance data

Register online. Early-bird discounts run through the end of July.

As usual, Sheraton Four Points has bedrooms available at the Performance Dynamics discounted rate.

Tell a friend and see you in September!

Saturday, May 14, 2016

PDQ 7.0 Dev is Underway

The primary goal for this release is to make PDQ acceptable for uploading to CRAN. This is a non-trivial exercise because there is some legacy C code in the PDQ library that needs to be reorganized while, at the same time, keeping it consistent for programmatically porting to other languages besides R—chiefly Perl (for the book) and Python.

To get there, the following steps have been identified:

  1. High Priority

    1. Migrate from SourceForge to GitHub.
    2. Change the return type for these functions from int to void:
      • PDQ_CreateOpen()
      • PDQ_CreateClosed()
      • PDQ_CreateNode()
      • PDQ_CreateMultiNode()
      Using the returned int as a counter was deprecated in version 6.1.1.
    3. Convert PDQ-R to Rcpp interface.
    4. Clean out the Examples directory and other contributed code directories leaving only Examples that actually use the PDQ C library.
    5. Add unit tests for PDQ C library, as well as the Perl, Python, and R languages.
    6. Get the interface accepted on CRAN.
    7. Add the ability to solve multi-server queueing nodes servicing an arbitrary number of workloads.

  2. Low Priority

    1. Get the interface accepted on CPAN and PyPI.
    2. Convert the build system from makefiles to CMake.

Stay tuned!

—njg and pjp