Spam, scams, phishing attacks, and other fraudulent activities are prevalent problems for social app users, and can pose a serious threat both to their finances and to their private or sensitive information. Spam also damages the user experience: customers can lose trust in the app, which leads to a shrinking user base and, eventually, the downfall of the business.
Anti-spam is a tug-of-war. Spammers continuously seek out new ways to hack your technology and become more sophisticated at circumventing anti-spam strategies. Businesses rely on techniques such as data mining (text mining) and machine learning to advance their anti-spam mechanisms. Hyphenate recognizes the need to proactively address spamming, prevent potential data breaches, and help customers who are deeply affected by these issues.
- Advertising: Merchandise, promotional content, websites, shops, drugs, cosmetics, etc.
- Phishing: Fish for personal information by publishing false and misleading information.
- Adult content: Pornographic transactions, redirecting to pornographic sites, or adult products.
- Clickbait for SEO: Increase search rankings by incentivizing the user to click a link.
- Political threats, terrorism, or acts of hatred.
- Financial fraud: Loans, insurance, etc.
- Product loopholes and a lack of preventative / defensive mechanisms
- Penetrating a legitimate user account’s contact list
- Spammers maintain massive amounts of fake user accounts and pretend to be legitimate users, switching between different accounts to spread low volume spam per account
- Spammers use bots to create massive amounts of fake user accounts and then spread the spam rapidly before anti-spam filters recognize, report, and block the account
Here are some approaches employed by Hyphenate to mitigate the impact of spam. We can also apply a hybrid solution with mixed and matched techniques to enhance the effectiveness:
The basis of this technology is a tree algorithm for keyword queries with a series of preprocessing filters for textual cues that match the following criteria:
- Basic lexicon and synonyms, e.g. pornography, profanity, etc.
- Self-defined thesaurus
- Special characters
- Textspeak: abbreviations, acronyms, initials, emoji, etc.
- Keyword shielding, where characters are replaced with wildcards or asterisks, e.g. f*k.
For simple junk and spam, a sweep with keyword filtering is enough. Keyword filtering is also an effective way to mitigate the impact of unexpected outbursts. However, spammers can alter the form of their messages and use more advanced techniques to avoid interception.
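As a rough illustration of the tree-based keyword query described above, here is a minimal sketch of a prefix tree (trie) with a small normalization step. It is not Hyphenate's implementation: the class name `KeywordTrie`, the `normalize` helper, and the rule that an asterisk stands for exactly one character are all assumptions made for the example.

```python
import re

END = "$end"  # sentinel key marking the end of a complete keyword


def normalize(text):
    """Lowercase and drop everything except letters, digits, spaces and '*'."""
    return re.sub(r"[^a-z0-9* ]", "", text.lower())


class KeywordTrie:
    """Prefix tree holding blocked keywords, one character per edge."""

    def __init__(self, keywords):
        self.root = {}
        for word in keywords:
            node = self.root
            for ch in word.lower():
                node = node.setdefault(ch, {})
            node[END] = True

    def _match_from(self, text, start):
        """Return the end index of the longest keyword starting at `start`, or -1."""
        frontier = [self.root]  # trie nodes still compatible with the text so far
        best = -1
        for i in range(start, len(text)):
            ch = text[i]
            next_frontier = []
            for node in frontier:
                if ch == "*":
                    # Assumption: a masking asterisk stands for any one character.
                    next_frontier.extend(v for k, v in node.items() if k != END)
                elif ch in node:
                    next_frontier.append(node[ch])
            if not next_frontier:
                break
            frontier = next_frontier
            if any(END in node for node in frontier):
                best = i + 1
        return best

    def find(self, text):
        """Return the spans of the normalized text that match a keyword."""
        normalized = normalize(text)
        hits, i = [], 0
        while i < len(normalized):
            end = self._match_from(normalized, i)
            if end > 0:
                hits.append(normalized[i:end])
                i = end
            else:
                i += 1
        return hits


trie = KeywordTrie(["spam", "casino"])
print(trie.find("Buy S.P.A.M now"))   # punctuation is stripped before matching
print(trie.find("c*sino bonus"))      # the asterisk matches the hidden letter
```

Treating every asterisk as a wildcard keeps the sketch short but will produce false positives on text that is mostly asterisks, so a production filter would need stricter rules.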
Behavioral pattern analysis identifies spam by its characteristics and patterns. For example, a clustering algorithm is used to group user behavior by behavioral factors (such as sender, receiver, time, content type, etc.). The behavioral relationship covers the user's social attributes, message transmission frequency, time interval, message response rate, and the relationship and messaging interaction pattern between senders and receivers. Spam tends to have a high ratio of one-way communication and a lack of interaction, since receivers recognize the spam before engaging any further.
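One of the behavioral signals mentioned above, the share of one-way communication, can be sketched as follows. This computes a single feature rather than a full clustering model, and the function name `one_way_ratio` and the `(sender, receiver)` log format are assumptions made for the example.

```python
from collections import defaultdict


def one_way_ratio(messages):
    """Map each sender to the share of its recipients that never wrote back.

    `messages` is an iterable of (sender, receiver) pairs.
    """
    contacted = defaultdict(set)
    for sender, receiver in messages:
        contacted[sender].add(receiver)

    ratios = {}
    for sender, receivers in contacted.items():
        # A recipient is "silent" if it never sent anything to this sender.
        silent = sum(1 for r in receivers if sender not in contacted.get(r, ()))
        ratios[sender] = silent / len(receivers)
    return ratios


log = [
    ("alice", "bob"), ("bob", "alice"),             # a two-way conversation
    ("promo", "alice"), ("promo", "bob"),
    ("promo", "carol"), ("promo", "dave"),          # nobody replies to promo
]
print(one_way_ratio(log))
```

A ratio close to 1.0 across many recipients is only a hint; in practice it would be combined with the other factors listed above, such as frequency and time interval.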
Spammers deliver a large quantity of repetitive messages within a relatively short time period, whether they are posing as official personnel or sending links, advertising and marketing content, or pornographic material. A legitimate chat user, by contrast, can only send a limited number of messages within a given timeframe: there are physical limits on how quickly a person can type and send even a very short word, such as "hi", several times in a row.
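A common way to enforce such a limit on sending frequency is a sliding-window counter per sender. The sketch below is one possible approach, not a description of Hyphenate's service; the class name `SlidingWindowLimiter` and the thresholds are illustrative assumptions.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_messages` per sender within `window_seconds`."""

    def __init__(self, max_messages, window_seconds):
        self.max_messages = max_messages
        self.window_seconds = window_seconds
        self.history = defaultdict(deque)  # sender -> timestamps of accepted messages

    def allow(self, sender, timestamp):
        recent = self.history[sender]
        # Forget messages that have fallen out of the window.
        while recent and timestamp - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_messages:
            return False  # too many messages in the window: reject or flag
        recent.append(timestamp)
        return True


limiter = SlidingWindowLimiter(max_messages=3, window_seconds=10.0)
for t in (0.0, 1.0, 2.0, 3.0):
    print(t, limiter.allow("user-1", t))  # the fourth message is rejected
```

Timestamps are passed in explicitly so the behavior is easy to test; a live system would typically read them from the message metadata.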
Spammers might maintain a vast library of "sockpuppets" to disguise their true identity, or generate a large number of user accounts using bots. Multi-user or multi-account spammers switch between different fake accounts, each delivering a low volume of spam made up of repeated messages. They can also launch a small-scale DDoS attack. Detecting this kind of misbehavior requires message analysis on a global scale, across accounts.
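Because each fake account stays under any per-account limit, the repetition only shows up when messages are compared across accounts. A minimal sketch of that idea, assuming a simple text fingerprint and a hypothetical threshold `min_senders`, groups near-identical messages and counts the distinct senders behind each one.

```python
import re
from collections import defaultdict


def fingerprint(text):
    """Collapse case, punctuation and spacing so trivial variations compare equal."""
    return re.sub(r"\W+", " ", text.lower()).strip()


def shared_content_clusters(messages, min_senders=3):
    """Return {fingerprint: senders} for content sent by many distinct accounts.

    `messages` is an iterable of (sender, text) pairs.
    """
    senders_by_text = defaultdict(set)
    for sender, text in messages:
        senders_by_text[fingerprint(text)].add(sender)
    return {
        text: senders
        for text, senders in senders_by_text.items()
        if len(senders) >= min_senders
    }


traffic = [
    ("u1", "Cheap loans!!! Call now"),
    ("u2", "cheap loans, call now"),
    ("u3", "CHEAP LOANS call NOW"),
    ("u4", "see you at lunch?"),
]
print(shared_content_clusters(traffic))
```

Exact fingerprints miss messages that have been reworded, so this would normally be paired with a fuzzier similarity measure.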
Sentiment analysis is based on Natural Language Processing (NLP) or textual content analysis technologies. The models improve in precision over time, with more training data and content sampling with smarter spam pattern recognition. Here unsupervised clustering analysis is emphasized to sort out clusters of features and apply labeling.
Here are some common factors to help distinguish spam from healthy messages. They are subtle but have distinct characteristics, sentimental patterns, and feature correlations:
- Spam ratio - the concentration of potential spam words in a message
- Stopwords ratio - stop words are the most common words in a language, which are usually filtered out before or after natural language processing, e.g. a, the, are, is, at, which, in, on, etc.
Stopwords ratio = # of stopwords in the message / total # of words in the message
- Sentiment library
- Noun ratio - spam tends to have a higher concentration of nouns due to packing dense information into one message
- Adjective ratio - similar to noun ratio pattern
- Number of sentences - spam tends to have longer sentences
- Spam similarity - a high concentration of hyperlinks, similar lexicon, etc.
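Several of the factors above are simple ratios that can be computed directly from the text. The sketch below covers the stopwords ratio (using the formula given above), the spam ratio, the sentence count, and the hyperlink count. The word lists are tiny placeholders and the function name `message_features` is hypothetical; noun and adjective ratios are left out because they need a part-of-speech tagger.

```python
import re

# Placeholder lists: the stop words are the examples from the text,
# the spam words are invented for illustration.
STOPWORDS = {"a", "the", "are", "is", "at", "which", "in", "on"}
SPAM_WORDS = {"free", "loan", "winner"}

URL_PATTERN = re.compile(r"https?://\S+")


def message_features(text):
    """Return a dict of simple ratio and count features for one message."""
    links = URL_PATTERN.findall(text)
    body = URL_PATTERN.sub(" ", text)  # keep URL fragments out of the word count
    words = re.findall(r"[a-z']+", body.lower())
    total = len(words) or 1  # avoid dividing by zero on empty messages
    sentences = [s for s in re.split(r"[.!?]+", body) if s.strip()]
    return {
        "stopwords_ratio": sum(w in STOPWORDS for w in words) / total,
        "spam_ratio": sum(w in SPAM_WORDS for w in words) / total,
        "sentence_count": len(sentences),
        "link_count": len(links),
    }


print(message_features("The winner is here! Claim a free loan at http://example.test now."))
```

Features like these are typically fed into a classifier or clustering step rather than judged one at a time.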
Spam that is injected into otherwise irrelevant content, for example contact information or links inserted into ordinary text, is difficult to filter. It is typically handled with a combination of clustering analysis, keyword detection, semantic analysis, and behavioral recognition.
Real-time message processing is not only an essential feature of instant messaging; it also mitigates the impact of spam by intercepting an outbreak as it happens, for example by imposing a limit on message sending frequency. Achieving real-time anti-spam is not an easy task: it requires processing millions of messages instantly and simultaneously. Hyphenate has the expertise to help you cope with such sudden events.
In addition to taking a technical approach, app developers can tighten their grip on user account registration and management in order to raise the bar for spammers and mitigate the impact of abusive accounts or scalping activities. This can be accomplished with some of the following techniques:
- Enhance user identity verification: Enforce user identity verification, such as phone number verification, which is more effective than email. You can also combine multiple verification methods, such as Facebook login.
- Two-factor account authentication during login: Prevent account takeover due to weak passwords, stolen credentials, or bot-driven attacks.
- Impose verification codes during registration
- Reporting and flagging mechanism: Leverage the community to report inappropriate content and users.
- Effective account management: Unlike email, where spam filtering tends to have a high false positive rate, IM offers significantly better control over user accounts, so account abuse can be prevented by deactivating or banning a spammer’s account.
- Verify account registration: Detect the creation of multiple accounts from a single device, geo-location, or IP address in a relatively short time period.
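The last check in the list, spotting many registrations from one source in a short period, can be sketched as below. The function name `suspicious_sources`, the record layout, and the thresholds are assumptions made for the example; the source key could be a device ID, a geo-location bucket, or an IP address.

```python
from collections import defaultdict


def suspicious_sources(registrations, max_accounts=3, window_seconds=3600):
    """Return the sources that created more than `max_accounts` within the window.

    `registrations` is an iterable of (account_id, source, timestamp) tuples.
    """
    times_by_source = defaultdict(list)
    for _account_id, source, timestamp in registrations:
        times_by_source[source].append(timestamp)

    flagged = set()
    for source, times in times_by_source.items():
        times.sort()
        start = 0
        for end, current in enumerate(times):
            # Shrink the window until it spans at most `window_seconds`.
            while current - times[start] > window_seconds:
                start += 1
            if end - start + 1 > max_accounts:
                flagged.add(source)
                break
    return flagged


signups = [
    ("a1", "192.0.2.10", 0), ("a2", "192.0.2.10", 60),
    ("a3", "192.0.2.10", 120), ("a4", "192.0.2.10", 180),
    ("b1", "192.0.2.20", 0), ("b2", "192.0.2.20", 7200),
]
print(suspicious_sources(signups))
```

Shared networks such as offices or campuses can legitimately produce many signups from one IP address, so a flag here is better treated as a reason for extra verification than as an automatic ban.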
With sufficient training data, Hyphenate is able to recognize spam messages with 90+ percent accuracy, a high recall rate (or true positive rate), and a low false positive rate. Hyphenate’s customers have applied our anti-spam services in a multitude of scenarios. For instance, a well-known social app aimed at women is able to weed out over 40% of junk messages using Hyphenate’s behavior recognition algorithm. We enable you to effectively filter out illegal advertising and fraudulent job postings from among the junk messages; the proportion is as great as 9:1. A satisfied customer's testimonial: "the anti-spam service by Hyphenate has significantly improved the number of active users, dramatically mitigated the worrisome of great varieties of pornographic, profanity, or illegal content lingering in the app."
Hyphenate processes the user’s anonymous Hyphenate ID to ensure the greatest user privacy. Anonymous messages are fed into a machine learning model for semantic analysis in order to recognize spam content and make the model more intelligent over time.
Hyphenate also provides other complementary services, such as:
URL detection - Detect malicious URLs, phishing sites, malicious code, cross-site scripting (XSS) attacks, and other potentially malicious attacks by comparing URLs against a URL database.
Multimedia recognition - Use our intelligent recognition system to process images, audio, or video that encapsulate spam content. To keep computation efficient, user behavior is evaluated first to decide which media need to be processed.
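The database comparison behind URL detection can be illustrated with a small blocklist lookup. This is a sketch under stated assumptions, not Hyphenate's service: the `BLOCKED_HOSTS` set stands in for a real URL database, the host names are invented, and the function name `blocked_urls` is hypothetical. It only covers the lookup step, not the detection of malicious code or XSS payloads.

```python
import re
from urllib.parse import urlsplit

# Stand-in for a real URL database; these hosts are invented examples.
BLOCKED_HOSTS = {"malicious.example", "phish.example"}

URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def blocked_urls(text, blocked_hosts=BLOCKED_HOSTS):
    """Return the URLs in `text` whose host, or a parent domain, is blocked."""
    hits = []
    for url in URL_PATTERN.findall(text):
        host = (urlsplit(url).hostname or "").lower()
        parts = host.split(".")
        # Check the host and every parent domain: a.b.example, b.example, example.
        candidates = {".".join(parts[i:]) for i in range(len(parts))}
        if candidates & blocked_hosts:
            hits.append(url)
    return hits


message = "Reset here: http://login.phish.example/reset or visit https://safe.example/"
print(blocked_urls(message))
```

Matching on parent domains catches subdomains of a blocked host, at the cost of also blocking any legitimate subdomain under it.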
Please contact Hyphenate about our anti-spam service so that we can better understand your requirements and tailor the solution to suit your needs. Beyond providing a standard out-of-the-box anti-spam suite, we will also work closely with you to understand your requirements and dataset and train the model to further optimize the spam message handling process. Hyphenate continues to enhance the precision and **recall rate of the anti-spam algorithm, as well as refining the model to make your app community safe.
**Recall rate (true positive rate)
Recall rate is one of the important indicators in machine learning. For example, suppose there are 1000 messages in the app and 300 of them are spam. If the algorithm flags 240 messages and human review verifies that all 240 are spam, then the recall rate is 240/300 = 80% and the precision (the share of flagged messages that really are spam) is 100%. The 60 missed spam messages are considered a learning cost.
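The arithmetic in this example can be checked with a few lines of code. The sketch assumes, as the example implies, that every one of the 240 flagged messages was confirmed as spam, so there are no false positives; the function names are illustrative.

```python
def recall(true_positives, false_negatives):
    """Share of all actual spam that the algorithm caught."""
    return true_positives / (true_positives + false_negatives)


def precision(true_positives, false_positives):
    """Share of flagged messages that really were spam."""
    return true_positives / (true_positives + false_positives)


total_spam = 300
flagged_and_verified = 240                 # true positives
missed = total_spam - flagged_and_verified  # false negatives: 60
wrongly_flagged = 0                        # assumption: all flags were verified as spam

print(f"recall    = {recall(flagged_and_verified, missed):.0%}")
print(f"precision = {precision(flagged_and_verified, wrongly_flagged):.0%}")
```

Recall measures how much spam slips through, while precision measures how often a flag is correct, which is why the two figures in the example differ.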