What is Spark?

Spark is a platform for running computations or tasks on large amounts of data (we are talking petabytes) – these tasks can range from map-reduce-style jobs to streaming and machine learning applications. The real power of Spark comes from the extensive APIs and supported languages (Python, Java, Scala, and R) that developers can use to create and manage data-based workflows, plus the fact that it supports integrations with pretty much any data store that you’d want to use.

Why is Spark used today? Companies like Databricks have poured tons of money into the technology, keeping the project alive and up to date. It is also extremely fast – at its core, Spark does a great job of distributing computations across multiple nodes, which compute and cache whatever operation is being requested, while a driver coordinates requests and results. It is often said to be faster and easier to use than Hadoop MapReduce for big data processing.
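Curious what that driver/executor split looks like in practice? Below is a minimal PySpark sketch (the file name and filter condition are hypothetical): transformations like filter are lazy plans built up on the driver, and only an action like count actually runs the distributed job, with cache() keeping results in memory for later actions.

from pyspark.sql import SparkSession

# The driver process builds the plan; executors on the cluster do the work.
spark = SparkSession.builder.appName("demo").getOrCreate()

lines = spark.read.text("logs.txt")                    # lazy: nothing runs yet
errors = lines.filter(lines.value.contains("ERROR"))   # lazy transformation
errors.cache()                                         # keep results in memory once computed

print(errors.count())             # first action: triggers the distributed job
print(errors.limit(5).collect())  # second action: served from the cache
spark.stop()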


However, there are reportedly many issues with Spark as well. For instance, it can take lots of debugging to make sure configurations are right so that memory errors are not encountered, and not all of the supported languages are updated in unison. Overall, things seem to work well when going through tutorials and typical use cases, but large-scale usage can be finicky.

GAN – Generative Adversarial Networks

TLDR; GANs are cool! If you have a problem where you need to use training data to generate new data, as well as verify if some input is similar to the training data, they may prove useful.

GANs are a type of machine learning technique that uses two neural networks to generate and verify data. These two networks are:

  1. A generative network, which uses random noise to generate some new piece of data.
  2. A discriminative network, which uses that generated data (and real data from the training set) to determine whether a given sample is likely to have come from the training set.

By iterating through this process of generating and verifying data, over time we get two networks that do useful things – one is trained to mimic the training set data, and the other can be used to classify data as “real” or “fake”. These models have a tremendous number of uses in the wild. In the classic task of digit recognition, for example, one network is trained to generate digits, and the other is trained to classify whether some image is a digit (or even a specific digit).
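To make that loop concrete, here is a minimal sketch of the generate-then-verify cycle (assuming PyTorch, with a toy 1-D Gaussian standing in for the training set – an illustration, not a serious model):

import torch
import torch.nn as nn

# Generator: maps random noise to a fake sample.
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
# Discriminator: maps a sample to the probability that it is "real".
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real = 2.0 + 0.5 * torch.randn(32, 1)  # "training set": samples from N(2, 0.5)
    fake = G(torch.randn(32, 8))           # generator turns noise into data

    # Discriminator step: label real samples 1 and generated samples 0.
    d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: try to make the discriminator call fakes "real".
    g_loss = bce(D(fake), torch.ones(32, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

After enough iterations, G mimics the training distribution and D can be used as a real-vs-fake classifier.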


What’s interesting is that this technique can be applied to so many different neural network architectures, since it is pretty much a wrapper around two separate networks. For instance, we might use CNNs to generate and discriminate between real and fake faces, or LSTMs to do the same for human text.

What are some regulatory implications of this kind of work? GANs have been used to generate fake faces and human text, for instance, which may be wielded by harmful bots or scammers.

Want to dive deeper? Check out some example code and more details in this post.

Quantum Computing – Hype vs. Reality

TLDR; Be pessimistic, and let reputable sources such as Scott Aaronson’s blog inform you about the real progress being made.

With recent developments in quantum computing, as well as a surge of media around the progress and use of this technology, it can be difficult to determine exactly what is hype and what is reality. Most of the discussion is around the idea of quantum supremacy – the point at which we can perform a specific type of computation on a quantum computer much faster than on a classical computer.

What makes this discussion difficult? For one, the choice of problem / computation to evaluate on a given quantum computer versus a classical one is entirely up to the researcher. In other words, the problem doesn’t need to be a “useful” one – any problem on which the QC demonstrably outperforms the classical computer will do.

Second, the evaluation is very hardware dependent. As Aaronson summarizes in his blog post, a group that aims to show quantum supremacy can say “oh look, we computed this result on our QC which would take a million years on a large classical computer.” However, by creating better classical algorithms, or using larger machines, another group can say “your QC on this problem is not that great.”

Why is there this discussion in the first place? Isn’t it obvious that quantum computers are better than classical computers? Not quite – at the time of writing, it has not been proven (in a theoretical sense) that QCs are actually any better than classical computers. In fact, a common theme we see is that breakthroughs in QC algorithms lead to improvements in classical algorithms, which sometimes even beat their quantum counterparts.

If I were you, I would remain pessimistic when seeing headlines such as “Quantum Computer Solves Impossible Problem.” Just stick with the facts – yes, quantum computers exist, and yes, there are some quantum algorithms that are better than their known classical counterparts, but QC companies will claim tons of progress, and it’s best to see what the experts say.

P.S. I would definitely take a look at the Quantum Bullshit Detector – I have no idea who runs this account, but their classifications seem very plausible.

Code Reusability – ViewModels and RxJava

TLDR; Think about how the code you are writing today can be reused. Will this exact code be used in another project or module, or does it follow a boilerplate pattern? If so, write some helper functions and classes.

Code reusability and organization are obvious concerns to always think about, so I won’t go into the details here. However, it’s not always clear when it is the right time to abstract away code that may be used multiple times. Should we wait until the code is about to be used multiple times? Or should the code be abstracted and organized as soon as we write its original usage? How do we even know if what we are writing can be expressed in a reusable way?

In order to avoid technical debt, I always like to consider the following questions whenever I am about to write a piece of code:

  • Am I going to use this exact piece of code multiple times through this project, or across multiple projects?
  • Even though the code may be entirely specific to a piece of my application, does it have a lot of boilerplate code within?

If either of these bullets is true, I immediately think about how to write this code in a reusable way.

A great example of this comes from my good friend Elijah Stiles regarding Android ViewModels and Reactive Streams. When using Android ViewModels with RxJava, each ViewModel is very specific to the business logic for part of your application, but there are still a lot of reusable boilerplate concepts that can be abstracted away – disposing of Observables, setting threads to subscribe on, registering on a specific Android lifecycle, etc. By building a few helper methods within a base class which all ViewModels can extend, not only do you organize your code, but you reduce a ton of mental load when writing code later in your application. In the example below, we never really have to worry about disposing of our reactive streams – we simply wrap our streams with disposeOnCleared.

import androidx.lifecycle.ViewModel
import io.reactivex.disposables.CompositeDisposable
import io.reactivex.disposables.Disposable

/** Base view model class used for making common functionality easier. */
abstract class BaseViewModel : ViewModel() {
    private val compositeDisposable = CompositeDisposable()

    override fun onCleared() {
        // Let the framework run its own cleanup, then drop all streams at once.
        super.onCleared()
        compositeDisposable.dispose()
    }

    /** Disposes of the given disposable when the view model is cleared. */
    fun disposeOnCleared(disposable: Disposable?) {
        disposable ?: return
        compositeDisposable.add(disposable)
    }
}

We no longer need to maintain pointers to our disposables within our ViewModels – the CompositeDisposable handles everything for us, giving us the freedom to write the logic that matters.

Getting Started with NLP – spaCy

TLDR; Think about how you might add value by understanding natural human text – try spaCy to hit the ground running with a few simple experiments.

spaCy is an easy-to-use natural language understanding platform, providing both out-of-the-box capabilities and extensibility features. Whether you are building a consumer mobile app or enterprise software, spaCy may help with capabilities such as tokenization, part-of-speech tagging, named entity recognition, dependency parsing, and sentence segmentation.

I’ve been using spaCy for some time now, and I’ve found that the time to go from idea to experiment is extremely fast. Let’s say you have a mobile app where users upload text, and you simply want to gather some metadata or summaries of the text as a small feature add to your product – you can probably play around with a few of spaCy’s features and get a small prototype ready within a day.
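To give a sense of how little code that prototype takes, here is a quick sketch using spaCy’s pretrained English pipeline (this assumes you’ve installed the small model via python -m spacy download en_core_web_sm):

import spacy

# Load the pretrained small English pipeline.
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# Named entities detected out of the box.
for ent in doc.ents:
    print(ent.text, ent.label_)

# Tokenization, lemmas, and part-of-speech tags come for free too.
for token in doc:
    print(token.text, token.lemma_, token.pos_)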

If you start to find success with this framework, I highly recommend checking out spaCy Universe, a set of community-made plugins for spaCy, ranging from techniques like entity coreference to extensions for chat bots.