Showing posts with label machine learning. Show all posts
Showing posts with label machine learning. Show all posts

Thursday, 15 November 2018

Combating Potentially Harmful Applications with Machine Learning at Google: Datasets and Models




Posted by Mo Yu, Damien Octeau, and Chuangang Ren, Android Security & Privacy Team




In a previous blog post, we talked about using machine learning to combat Potentially Harmful Applications (PHAs). This blog post covers how Google uses machine learning techniques to detect and classify PHAs. We'll discuss the challenges in the PHA detection space, including the scale of data, the correct identification of PHA behaviors, and the evolution of PHA families. Next, we will introduce two of the datasets that make the training and implementation of machine learning models possible, such as app analysis data and Google Play data. Finally, we will present some of the approaches we use, including logistic regression and deep neural networks.


Using machine learning to scale





Detecting PHAs is challenging and requires a lot of resources. Our security experts need to understand how apps interact with the system and the user, analyze complex signals to find PHA behavior, and evolve their tactics to stay ahead of PHA authors. Every day, Google Play Protect (GPP) analyzes over half a million apps, which makes a lot of new data for our security experts to process.



Leveraging machine learning helps us detect PHAs faster and at a larger scale. We can detect more PHAs just by adding additional computing resources. In many cases, machine learning can find PHA signals in the training data without human intervention. Sometimes, those signals are different than signals found by security experts. Machine learning can take better advantage of this data, and discover hidden relationships between signals more effectively.



There are two major parts of Google Play Protect's machine learning protections: the data and the machine learning models.


Data sources





The quality and quantity of the data used to create a model are crucial to the success of the system. For the purpose of PHA detection and classification, our system mainly uses two anonymous data sources: data from analyzing apps and data from how users experience apps.


App data





Google Play Protect analyzes every app that it can find on the internet. We created a dataset by decomposing each app's APK and extracting PHA signals with deep analysis. We execute various processes on each app to find particular features and behaviors that are relevant to the PHA categories in scope (for example, SMS fraud, phishing, privilege escalation). Static analysis examines the different resources inside an APK file while dynamic analysis checks the behavior of the app when it's actually running. These two approaches complement each other. For example, dynamic analysis requires the execution of the app regardless of how obfuscated its code is (obfuscation hinders static analysis), and static analysis can help detect cloaking attempts in the code that may in practice bypass dynamic analysis-based detection. In the end, this analysis produces information about the app's characteristics, which serve as a fundamental data source for machine learning algorithms.


Google Play data





In addition to analyzing each app, we also try to understand how users perceive that app. User feedback (such as the number of installs, uninstalls, user ratings, and comments) collected from Google Play can help us identify problematic apps. Similarly, information about the developer (such as the certificates they use and their history of published apps) contribute valuable knowledge that can be used to identify PHAs. All these metrics are generated when developers submit a new app (or new version of an app) and by millions of Google Play users every day. This information helps us to understand the quality, behavior, and purpose of an app so that we can identify new PHA behaviors or identify similar apps.



In general, our data sources yield raw signals, which then need to be transformed into machine learning features for use by our algorithms. Some signals, such as the permissions that an app requests, have a clear semantic meaning and can be directly used. In other cases, we need to engineer our data to make new, more powerful features. For example, we can aggregate the ratings of all apps that a particular developer owns, so we can calculate a rating per developer and use it to validate future apps. We also employ several techniques to focus in on interesting data.To create compact representations for sparse data, we use embedding. To help streamline the data to make it more useful to models, we use feature selection. Depending on the target, feature selection helps us keep the most relevant signals and remove irrelevant ones.



By combining our different datasets and investing in feature engineering and feature selection, we improve the quality of the data that can be fed to various types of machine learning models.


Models




Building a good machine learning model is like building a skyscraper: quality materials are important, but a great design is also essential. Like the materials in a skyscraper, good datasets and features are important to machine learning, but a great algorithm is essential to identify PHA behaviors effectively and efficiently.



We train models to identify PHAs that belong to a specific category, such as SMS-fraud or phishing. Such categories are quite broad and contain a large number of samples given the number of PHA families that fit the definition. Alternatively, we also have models focusing on a much smaller scale, such as a family, which is composed of a group of apps that are part of the same PHA campaign and that share similar source code and behaviors. On the one hand, having a single model to tackle an entire PHA category may be attractive in terms of simplicity but precision may be an issue as the model will have to generalize the behaviors of a large number of PHAs believed to have something in common. On the other hand, developing multiple PHA models may require additional engineering efforts, but may result in better precision at the cost of reduced scope.



We use a variety of modeling techniques to modify our machine learning approach, including supervised and unsupervised ones.



One supervised technique we use is logistic regression, which has been widely adopted in the industry. These models have a simple structure and can be trained quickly. Logistic regression models can be analyzed to understand the importance of the different PHA and app features they are built with, allowing us to improve our feature engineering process. After a few cycles of training, evaluation, and improvement, we can launch the best models in production and monitor their performance.



For more complex cases, we employ deep learning. Compared to logistic regression, deep learning is good at capturing complicated interactions between different features and extracting hidden patterns. The millions of apps in Google Play provide a rich dataset, which is advantageous to deep learning.



In addition to our targeted feature engineering efforts, we experiment with many aspects of deep neural networks. For example, a deep neural network can have multiple layers and each layer has several neurons to process signals. We can experiment with the number of layers and neurons per layer to change model behaviors.



We also adopt unsupervised machine learning methods. Many PHAs use similar abuse techniques and tricks, so they look almost identical to each other. An unsupervised approach helps define clusters of apps that look or behave similarly, which allows us to mitigate and identify PHAs more effectively. We can automate the process of categorizing that type of app if we are confident in the model or can request help from a human expert to validate what the model found.



PHAs are constantly evolving, so our models need constant updating and monitoring. In production, models are fed with data from recent apps, which help them stay relevant. However, new abuse techniques and behaviors need to be continuously detected and fed into our machine learning models to be able to catch new PHAs and stay on top of recent trends. This is a continuous cycle of model creation and updating that also requires tuning to ensure that the precision and coverage of the system as a whole matches our detection goals.


Looking forward




As part of Google's AI-first strategy, our work leverages many machine learning resources across the company, such as tools and infrastructures developed by Google Brain and Google Research. In 2017, our machine learning models successfully detected 60.3% of PHAs identified by Google Play Protect, covering over 2 billion Android devices. We continue to research and invest in machine learning to scale and simplify the detection of PHAs in the Android ecosystem.




Acknowledgements



This work was developed in joint collaboration with Google Play Protect, Safe Browsing and Play Abuse teams with contributions from Andrew Ahn, Hrishikesh Aradhye, Daniel Bali, Hongji Bao, Yajie Hu, Arthur Kaiser, Elena Kovakina, Salvador Mandujano, Melinda Miller, Rahul Mishra, Sebastian Porst, Monirul Sharif, Sri Somanchi, Sai Deep Tetali, and Zhikun Wang.

Thursday, 24 May 2018

Keeping 2 billion Android devices safe with machine learning





Posted by Sai Deep Tetali, Software Engineer, Google Play Protect



At Google I/O 2017, we introduced Google Play Protect, our comprehensive set of security services for Android. While the name is new, the smarts powering Play Protect have protected Android users for years.



Google Play Protect's suite of mobile threat protections are built into more than 2 billion Android devices, automatically taking action in the background. We're constantly updating these protections so you don't have to think about security: it just happens. Our protections have been made even smarter by adding machine learning elements to Google Play Protect.


Security at scale





Google Play Protect provides in-the-moment protection from potentially harmful apps (PHAs), but Google's protections start earlier.



Before they're published in Google Play, all apps are rigorously analyzed by our security systems and Android security experts. Thanks to this process, Android devices that only download apps from Google Play are 9 times less likely to get a PHA than devices that download apps from other sources.



After you install an app, Google Play Protect continues its quest to keep your device safe by regularly scanning your device to make sure all apps are behaving properly. If it finds an app that is misbehaving, Google Play Protect either notifies you, or simply removes the harmful app to keep your device safe.



Our systems scan over 50 billion apps every day. To keep on the cutting edge of security, we look for new risks in a variety of ways, such as identifying specific code paths that signify bad behavior, investigating behavior patterns to correlate bad apps, and reviewing possible PHAs with our security experts.



In 2016, we added machine learning as a new detection mechanism and it soon became a critical part of our systems and tools.


Training our machines





In the most basic terms, machine learning means training a computer algorithm to recognize a behavior. To train the algorithm, we give it hundreds of thousands of examples of that behavior.



In the case of Google Play Protect, we are developing algorithms that learn which apps are "potentially harmful" and which are "safe." To learn about PHAs, the machine learning algorithms analyze our entire catalog of applications. Then our algorithms look at hundreds of signals combined with anonymized data to compare app behavior across the Android ecosystem to find PHAs. They look for behavior common to PHAs, such as apps that attempt to interact with other apps on the device, access or share your personal data, download something without your knowledge, connect to phishing websites, or bypass built-in security features.



When we find apps exhibit similar malicious behavior, we group them into families. Visualizing these PHA families helps us uncover apps that share similarities to known bad apps, but have yet remained under our radar.









After we identify a new PHA, we confirm our findings with expert security reviews. If the app in question is a PHA, Google Play Protect takes action on the app and then we feed information about that PHA back into our algorithms to help find more PHAs.


Doubling down on security





So far, our machine learning systems have successfully detected 60.3% of the malware identified by Google Play Protect in 2017.



In 2018, we're devoting a massive amount of computing power and talent to create, maintain and improve these machine learning algorithms. We're constantly leveraging artificial intelligence and our highly skilled researchers and engineers from all across Google to find new ways to keep Android devices safe and secure. In addition to our talented team, we work with the foremost security experts and researchers from around the world. These researchers contribute even more data and insights to keep Google Play Protect on the cutting edge of mobile security.



To check out Google Play Protect, open the Google Play app and tap Play Protect in the left panel.



Acknowledgements: This work was developed in joint collaboration with Google Play Protect, Safe Browsing and Play Abuse teams with contributions from Andrew Ahn, Hrishikesh Aradhye, Daniel Bali, Hongji Bao, Yajie Hu, Arthur Kaiser, Elena Kovakina, Salvador Mandujano, Melinda Miller, Rahul Mishra, Damien Octeau, Sebastian Porst, Chuangang Ren, Monirul Sharif, Sri Somanchi, Sai Deep Tetali, Zhikun Wang, and Mo Yu.