[OPINION] How AI will change Information Security


AI is becoming more and more prevalent in basically every single research area; that is, to my mind, undeniable. I remember when using neural nets used to be experimental (or hip and cool); now you can download a Python package that handles building and training them for you! So there is definitely a significant upward trend in the prevalence of AI and machine-learning-based technology in research. I would need to be a special kind of moron not to guess that this will also spill over into information security. The question is: how will this affect us infosec people?




Infosec folks like tools! AI makes tools easier to build! 

Given the history of how things developed in infosec, we can definitely say that our folks love to use tools! This was once touted as impending doom for hacker culture: people who don't know anything about actually breaking into stuff would walk around calling themselves hackers. Well, you have to admit those extremely paranoid purists were absolutely correct (sort of)! There are people walking around abusing tools and calling themselves "hackers". But not everyone who uses these tools is such a foolhardy neophyte idiot; pros use them too! Pros use them because they make simple actions during a pentest, like port scanning and fuzzing, much faster, more autonomous and applicable to larger scopes.

So we will definitely use AI to make InfoSec work better, and in fact we already are. The interesting question is what the near future could look like if it follows this trend. What astounding tech do we have to look forward to in information security if we eventually throw ourselves over completely to building with AI?

What will offensive AI InfoSec tools look like?

Well, one thing we have seen rapidly spur on AI application is something called deep learning. As far as I understand this innovation, it is basically a way to let the AI decide on the most valuable features to train on. The problem with AI before deep learning was that we were the ones selecting the best features for the AI. This was a necessary effort, but one that limited the rapid application of AI: before you could build a working system you needed someone very clever to study the data and produce a neural net model (design), a training regimen, data samples and a feature selection in order for the AI to be reproducible and verifiable. Now we've designed a system that (as far as I understand it) starts from randomly initialized features, checks how well they match the data, and keeps adjusting them, dropping what doesn't help, until it ends up with the set of features that lets it mimic the data most accurately.

The sequencing conjecture


The other clearly massive advantage of deep learning structures, especially those like LSTMs, is that we can very easily and rapidly produce sequence prediction technology. This means ANYTHING that is a sequence, or can be expressed as a sequence, can be "learned" or fitted by an LSTM without much need to engineer feature selection. The biggest challenges you face in designing such a system are deciding how to represent the data as a sequence (so it interfaces cleanly with the LSTM) and how your LSTM should be designed!
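
To give a feel for the "represent the data as a sequence" part, here is a minimal Python sketch. It only covers the data preparation, not the LSTM itself, and the event names, `build_vocab` and `make_windows` are hypothetical names I made up for illustration. The idea is to map each event to an integer id and slide a fixed-size window over the ids, so that every window is paired with the id that follows it.

```python
from typing import Dict, List, Sequence, Tuple


def build_vocab(events: Sequence[str]) -> Dict[str, int]:
    """Map each distinct event to a small integer id (0 is left free for padding)."""
    vocab: Dict[str, int] = {}
    for event in events:
        if event not in vocab:
            vocab[event] = len(vocab) + 1
    return vocab


def make_windows(ids: Sequence[int], window: int) -> List[Tuple[List[int], int]]:
    """Slide a fixed-size window over the ids; each window is paired with the next id."""
    pairs = []
    for start in range(len(ids) - window):
        pairs.append((list(ids[start:start + window]), ids[start + window]))
    return pairs


# Hypothetical event log, invented purely for illustration.
events = ["login", "list_dir", "read_file", "login", "list_dir", "delete_file"]
vocab = build_vocab(events)
ids = [vocab[event] for event in events]
pairs = make_windows(ids, window=3)

for history, target in pairs:
    print(history, "->", target)
```

Pairs like these are the sort of thing you would then feed to a sequence model as (input, expected next step). How long the window should be and how to encode richer events are exactly the design decisions mentioned above.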

Given these observations, we can then turn LSTMs to everything in InfoSec that can be solved with the "sequencing conjecture" (just to give that idea a label for reference in this article). Here's what I imagine people will come up with:

Attack vector prediction (given a collection of attributes of a network, produce a set of attack strategies as learned from real penetration testers): you can have AI test your programs just like an actual pentester would! Or rather, just like millions of pentesters would, over possibly thousands of years of testing (based on current testing trends).

Password change/choice mimicry (given a collection of password change sequences, produce a statistical prediction of what the next password change will be and, most likely, WHEN it will be): you can also apply this to the amount of entropy in password changes; if the entropy dwindles, an attacker can predict when the next entropy drop will be and guess using fewer resources. The other amazing application is that you can have an active password guesser gauging how your organization changes and chooses passwords, and model your "password security" on how hard it is for the AI to eventually guess a password based on ALL the available data about people's password choices.

Automated phishing (an intelligent phishing firewall): it is easy to model emotional and impulsive triggers, and the emails and discussion topics that produce them, as a sequence! So you could essentially build a "bot" that constantly tries to catch people in your own organization with simulated phishing attacks, so that they learn to avoid replying to emails of a certain kind in case they are an actual attack (kind of like a live, constant phishing fire drill). Doing this means your attackers are actually competing with your artificial intelligence "phishing" bot, not only trying to catch the people in your org. They now need to beat the bot AND catch the people, which definitely makes phishing attacks harder and more complex to pull off.

Vulnerability prediction (model the code paths, commit logs and other code-oriented behavior as predictors for vulnerabilities): there must be pieces of code that can be reliably predicted to eventually cause vulnerabilities, based on the commit log or code change "signature" of code that is already known to be vulnerable.
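
To make the password change idea from the list above a bit more concrete, here is a minimal Python sketch of how you might measure how much each password change actually changes. It uses plain Levenshtein edit distance as a crude stand-in for the "entropy" of a change, which is my own simplification and not a proper entropy estimate. The password history is invented, and `edit_distance` and `change_sizes` are hypothetical helper names.

```python
from typing import List, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete char_a
                            curr[j - 1] + 1,      # insert char_b
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def change_sizes(history: Sequence[str]) -> List[int]:
    """Edit distance between each password and the one that replaced it."""
    return [edit_distance(old, new) for old, new in zip(history, history[1:])]


# Hypothetical password history for one account, oldest first.
history = ["hunter1", "hunter2", "hunter3", "correcthorse"]
sizes = change_sizes(history)
print(sizes)
```

In this made-up history the first two changes each touch a single character, which is exactly the kind of low-effort pattern a sequence model could learn to predict. A defender could track these numbers over time and treat a run of tiny changes as a warning sign.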

These are merely a few of the very, very obvious ideas for applications, but I think the effect AI will have on the technical information security industry will spread much further and deeper than this.

Start buying linear algebra text books now!



If we are going to build tools based on these curve-fitting algorithms, we will most definitely also build defenses (i.e. it would be foolish to imagine this will only be useful to offensive infoseccers). It would also be foolish to propose that the software we test will be devoid of these curve fitters, or that we, the ones who determine and engineer the security of such systems, will never come to tussle with AI in our targets.

So what does that mean for the penetration testers of today? It means we will definitely either need to become AI people ourselves, or our penetration testing teams will strictly require an AI person! I say this because to be able to deduce anything about the behavior of a system you first need to be able to detail it! You cannot DEDUCE from a point of doubt; that is called INDUCTION (the opposite approach to deduction, IN-duction). So any security proof that involves an AI system must make arguments about, and take measurements of, how the neural net fits, the mathematical tenets that promise its eventual convergence to a solution, and how this affects the system's security standing.

Considering the opposite of this prediction is hilarious; for instance, can you imagine us testing AI-driven systems, talking about how they work, and not being able to discuss the inference? It's a ridiculous concept, because ANY business that can afford it would certainly prefer having their system analyzed by those who understand AI as well as infosec!

For cryptographers it also means the idea of modeling an attacker during a security game, or a game-based proof for a cipher or cryptosystem, MUST become more interesting. It MUST affect all of the other crypto we were doing, since that crypto wasn't modeled on such a capable attacker! Will we start to see AI-safe crypto? Hehe, but this is perhaps a little out of my expertise ;)


It possibly also means very interesting things for vulnerability taxonomy, or software vulnerability anthropology lol, whatever you call the study of vulnerability types. Because we know that there is already a purely AI-based attack vector, one that attacks the very heart of how AI works: inference attacks (the research literature usually files the kind I describe below under "adversarial examples")! We've seen examples of this at Black Hat / DEF CON and possibly (not that I've checked) USENIX already! So right now it is possible to train a neural net on the error bounds of another neural net in order to exploit the false positives and negatives harboured in its prediction fittings.

Here's a quick layman's breakdown of the idea behind these vulnerabilities:

AI is basically just curve fitting: you take a set of data points and try to draw up a function that mimics the underlying function that produced them (should such a function exist), so that its output matches the real one as closely as possible. This is an age-old practice in physics, statistics and other scientific fields. The equations of Maxwell, Faraday, Newton, Planck and literally every other physicist were drawn up for exactly this purpose: to predict the behavior of nature consistently and as accurately as possible from a limited set of data points! This is the crux of what AI allows us to do; the only advantage is that it allows us to do so autonomously and much more rapidly than ever before.
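
As a tiny illustration of what "curve fitting" means here, the following Python sketch fits a straight line to a handful of points with ordinary least squares. The data points are invented (roughly y = 2x + 1 with some noise I typed in by hand) and `fit_line` is a hypothetical helper name; a neural net fits a far more flexible function, but the principle of minimizing the mismatch with the data is the same.

```python
from typing import List, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented data points: roughly y = 2x + 1 plus a little noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

slope, intercept = fit_line(xs, ys)

# Residuals: how far the fitted line is from each real data point.
residuals: List[float] = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

print(f"fit: y = {slope:.2f} * x + {intercept:.2f}")
print("residuals:", [round(r, 2) for r in residuals])
```

Note that the residuals are not zero: even the best line misses every one of these points by a little. That leftover gap between the fit and the data is what the rest of this section is about.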

Because these "fits of curves" will never be exact (in fact some of them can be mathematically proven to never be exact, much as you can prove that no S-box in a cipher can ever be absolutely flawless), there will always be an error in the accuracy of the neural net: when you input points into the function that was fitted, some of them will definitely produce inaccurate behavior. There is a proverbial "distance" between the AI fit and the real behavior of the data. This "difference" is what the "inference attacks" exploit: they look for distances from the real data that are so large that a classification or prediction completely messes up. To determine these points you need only "fuzz" the neural net and detect them.
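
Here is a toy Python sketch of that fuzzing idea, under some loudly stated assumptions: the "real" behavior is simply whether a point lies inside the unit circle, the "AI" is a deliberately crude model (an axis-aligned square whose size is fitted to training samples), and the fuzzer is allowed to ask the ground truth directly, which a real attacker usually cannot do. All names (`truth`, `make_box_model`, `fit`, `fuzz`) are hypothetical.

```python
import random
from typing import Callable, List, Tuple

Point = Tuple[float, float]


def truth(x: float, y: float) -> bool:
    """Hypothetical 'real' behavior: is the point inside the unit circle?"""
    return x * x + y * y <= 1.0


def make_box_model(half_width: float) -> Callable[[float, float], bool]:
    """A deliberately simple model family: an axis-aligned square around the origin."""
    def model(x: float, y: float) -> bool:
        return abs(x) <= half_width and abs(y) <= half_width
    return model


def fit(samples: List[Tuple[float, float, bool]]) -> float:
    """Pick the half-width with the fewest training errors (a crude 'curve fit')."""
    candidates = [i / 100 for i in range(50, 151)]

    def errors(half_width: float) -> int:
        model = make_box_model(half_width)
        return sum(model(x, y) != label for x, y, label in samples)

    return min(candidates, key=errors)


def fuzz(model: Callable[[float, float], bool], trials: int,
         rng: random.Random) -> List[Point]:
    """Throw random inputs at the model and keep those where it disagrees with reality."""
    found = []
    for _ in range(trials):
        x, y = rng.uniform(-1.5, 1.5), rng.uniform(-1.5, 1.5)
        if model(x, y) != truth(x, y):
            found.append((x, y))
    return found


rng = random.Random(0)  # fixed seed so the run is repeatable

train = []
for _ in range(2000):
    x, y = rng.uniform(-1.5, 1.5), rng.uniform(-1.5, 1.5)
    train.append((x, y, truth(x, y)))

half_width = fit(train)
model = make_box_model(half_width)
mistakes = fuzz(model, 5000, rng)

print("fitted half-width:", half_width)
print("inputs the model gets wrong:", len(mistakes), "out of 5000")
```

A square can never match a circle exactly, so no matter how well it is fitted the fuzzer keeps finding points near the corners and edges where the model is simply wrong. Those are the points an attacker would go looking for.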


Now, because this exists, there will likely be other kinds of inference attacks that emerge soon. Here are some verifiable hypotheses that could give rise to other such attacks against AI:

* Data signature detection: AI prediction is based solely on the data you train it on, so what if this data is sensitive? What if you can tell, based on a number of perturbations, that the data used to train this AI exhibits a definite behavior that exposes critical details of that data?

* Data inference poisoning attacks: certain AI systems will be employed in training mode, i.e. applied to problems in the real world in order to constantly train on and adapt to the natural data. That means attackers could possibly attack the system in this state, or target this functionality as part of the attack surface! If the AI is in learning mode, there must be a way to negatively influence how it trains, since as an attacker you are the one capable of influencing the training data.
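
The poisoning hypothesis above can be sketched with a toy example in Python. Assume a hypothetical detector (`OnlineDetector`, a name and design I made up) that flags values far above a running mean and keeps learning from everything it accepts as normal. An attacker who can feed it inputs just under the current threshold can drag the mean upwards until a value that was originally blocked sails through.

```python
class OnlineDetector:
    """Flags values far above a running mean, and keeps learning from accepted values."""

    def __init__(self, mean: float, tolerance: float, rate: float) -> None:
        self.mean = mean            # current idea of "normal"
        self.tolerance = tolerance  # how far above the mean is still acceptable
        self.rate = rate            # how quickly the mean adapts to new data

    def is_anomalous(self, value: float) -> bool:
        return value > self.mean + self.tolerance

    def observe(self, value: float) -> bool:
        """Classify a value, then learn from it only if it looked normal."""
        flagged = self.is_anomalous(value)
        if not flagged:
            self.mean += self.rate * (value - self.mean)
        return flagged


# All numbers here are invented for illustration.
detector = OnlineDetector(mean=100.0, tolerance=20.0, rate=0.2)
attack_value = 200.0
blocked_before = detector.is_anomalous(attack_value)

# Poisoning loop: always send something just under the current threshold,
# so it is accepted as "normal" and nudges the running mean upwards.
steps = 0
while detector.is_anomalous(attack_value) and steps < 1000:
    detector.observe(detector.mean + detector.tolerance - 1.0)
    steps += 1

blocked_after = detector.is_anomalous(attack_value)

print("blocked before poisoning:", blocked_before)
print("blocked after poisoning: ", blocked_after, "after", steps, "poisoned inputs")
```

Real systems are more complicated than a running mean, but the weakness is the same in spirit: whatever keeps learning from live data gives the attacker a lever on the model.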

Anyhow, this is what I see happening. Hopefully it will allow orgs to respond more rationally to the onset of AI-driven tech and help prep infosec peeps on how to stay ahead of the software curve and catch or prevent problems before they really start making life hard!