
Detecting fake viral stories before they become viral using FB API

by Florin Badita, November 24th, 2016



This is the first part of a multi-part article.

Part 2


The Outbreak — Detecting fake viral news, automatically (hackernoon.com)

Part 3


The Outbreak — How to detect the real viral posts compared to the one-hour share spike (medium.com)

Part 4


Understanding Facebook Reactions using Google Sentiment Analysis (medium.com)

First of all, a question: how do you find out about a fake viral article?

Does somebody tell you about it? Are you using special software? Do you use Facebook trends and Twitter trends to find out what is viral?


To find out how journalists do this today, I made this Google Form. I encourage you to complete it so we can get a better understanding of how journalists and bloggers find out about viral news articles.

What we need to build to accomplish this:

The Outbreak is a service that I want to develop for journalists, allowing them to find viral articles before they go viral.

Idea

By automatically crawling the FB pages that usually share fake or misleading articles, journalists can see the “next lies” before they become so popular that, even if you explain that something is not true, most people have already heard the fake version.

An article that has 600K shares now had 5K shares in its first hour, 11K in the second, and 19K in the third.

Using this, we can detect these articles well before they spread from one vertical into another and become viral.
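
A minimal sketch of that heuristic in Python; the function name and growth threshold are my own illustration, not the production tool:

```python
# Hypothetical spike detector: flag an article when the number of NEW
# shares gained each hour keeps growing (5K -> 11K -> 19K cumulative
# shares means 6K, then 8K new shares per hour). Threshold is made up.
def is_going_viral(hourly_share_counts, min_growth=1.3):
    if len(hourly_share_counts) < 3:
        return False  # not enough hourly samples to judge a trend
    # New shares gained in each one-hour window.
    deltas = [b - a for a, b in zip(hourly_share_counts, hourly_share_counts[1:])]
    # Genuinely viral posts keep accelerating hour over hour;
    # a one-off share spike does not.
    return all(later >= min_growth * earlier
               for earlier, later in zip(deltas, deltas[1:]))

print(is_going_viral([5000, 11000, 19000]))  # True: 6000 -> 8000 new shares/hour
```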

The high-level details

  1. Every hour, we look at what news articles the top 1,000 most popular news websites in the US have posted and download them into a database, searching for the ones that are going viral.
  2. Every hour, we download data from Facebook to see how many shares, likes and comments each article has, and detect the ones that are rising the fastest, the ones that are going viral (a sketch of this call follows the list).
  3. Using NLP, we analyze the comments that Facebook users post on the viral articles, to get a better understanding of what differentiates the viral posts from the average posts.
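
For step 2, a minimal sketch of the Facebook call, assuming a valid Graph API access token; the `engagement` field on URL nodes reflects the 2016-era v2.x Graph API, and field names have changed across versions, so treat this as illustrative:

```python
# Poll the Graph API for an article URL's engagement counts.
# Assumes a valid access token; field names follow the 2016-era
# v2.8 API and may differ in later versions.
import requests

GRAPH_ENDPOINT = "https://graph.facebook.com/v2.8/"

def fetch_engagement(article_url, access_token):
    resp = requests.get(GRAPH_ENDPOINT, params={
        "id": article_url,          # the article URL itself is the node id
        "fields": "engagement",
        "access_token": access_token,
    })
    resp.raise_for_status()
    eng = resp.json().get("engagement", {})
    return {
        "shares": eng.get("share_count", 0),
        "comments": eng.get("comment_count", 0),
        "reactions": eng.get("reaction_count", 0),
    }
```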

Low-level details



Using a custom Python web parser, we crawl, every hour, a list of the top 1,000 news websites in the US and add the articles to a database. The second time we parse the same link, we add it to the database and calculate the difference in the number of likes, reactions and shares using the Facebook API. We monitor each story for 3 days before we stop indexing that particular link. If the story resurfaces in our database later, we monitor it again for 3 days before stopping again.
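
A sketch of that bookkeeping using SQLite; the schema and helper are illustrative, not the actual back office:

```python
# Illustrative hourly bookkeeping: store one (url, timestamp, shares)
# snapshot per crawl, return the delta since the previous crawl, and
# stop tracking a URL three days after it was first seen.
import sqlite3
import time

THREE_DAYS = 3 * 24 * 3600

db = sqlite3.connect("outbreak.db")
db.execute("""CREATE TABLE IF NOT EXISTS snapshots (
                  url TEXT, ts INTEGER, shares INTEGER,
                  PRIMARY KEY (url, ts))""")

def record_snapshot(url, shares, now=None):
    now = now or int(time.time())
    first_seen = db.execute("SELECT MIN(ts) FROM snapshots WHERE url = ?",
                            (url,)).fetchone()[0]
    if first_seen is not None and now - first_seen > THREE_DAYS:
        return None  # past the 3-day window (resurfacing logic omitted here)
    prev = db.execute("SELECT shares FROM snapshots WHERE url = ? "
                      "ORDER BY ts DESC LIMIT 1", (url,)).fetchone()
    db.execute("INSERT OR REPLACE INTO snapshots VALUES (?, ?, ?)",
               (url, now, shares))
    db.commit()
    return shares - prev[0] if prev else 0  # shares gained since last crawl
```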

Only for around 5% of the posts, the ones we see becoming viral, do we download the comments, so that we can analyze them and see what people are talking about.

The rationale is that, based on the comments users post on an article, we can automate the process of learning what the article is about and how valid it is.
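
This part of the series does not name the NLP stack (a later part uses Google's sentiment analysis), so as a stand-in, here is a sketch with NLTK's VADER analyzer comparing average comment sentiment between two sets of posts:

```python
# Stand-in comment analysis using NLTK's VADER sentiment scorer;
# the real tool's NLP pipeline is not specified in this post.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
sia = SentimentIntensityAnalyzer()

def mean_sentiment(comments):
    # "compound" is VADER's overall sentiment score in [-1, 1]
    scores = [sia.polarity_scores(c)["compound"] for c in comments]
    return sum(scores) / len(scores) if scores else 0.0

viral_comments = ["This cannot be true!!!", "Total lie, already debunked."]
average_comments = ["Nice write-up.", "Interesting read, thanks."]
print(mean_sentiment(viral_comments), mean_sentiment(average_comments))
```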

Facebook API payload:

# Getting the articles from the FB pages.



1,000 FB pages * 24 hours = 24,000 requests per day. If the average size of a request is 1 MB, per day we need to download 24 GB from Facebook; per month, this means 720 GB.

# Getting the number of comments, likes and shares from the FB API



1,000 news websites * 50 articles per day * 5 days (the average time an article is crawled) = 250,000 articles tracked at any time, i.e. 250,000 requests per hour. Per day this means 6,000,000 requests. If the average size of a request is 100 KB, per hour we need to download 25 GB from Facebook; per day, 600 GB; per month, a minimum bandwidth need of 18 TB.

# Getting the text comments for the Viral Articles.

Around 1-5% of all articles.





Per day we will download around 10,000 viral/top news articles. Per month this means 300,000 articles * 100 nested comment pages = 30,000,000 requests. If the average download per article is 1 MB, per day we need to download 10 GB from Facebook; per month, the minimum bandwidth needed is 300 GB. This is data that we will keep almost entirely, so it's costly on the server side.

Web crawler payload:



1,000 news websites * 50 articles per day = 50,000 requests per day. If the average size of a request is 1 MB, this means 50 GB per day; per month, 1.5 TB.

Per month, we need to send over 200M API requests to Facebook, downloading over 20 TB of data.
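
Those totals check out with some quick arithmetic (assuming a 30-day month):

```python
# Sanity-checking the monthly totals above (30-day month assumed).
page_requests = 1000 * 24 * 30            # FB page crawl: 720,000
stats_requests = 1000 * 50 * 5 * 24 * 30  # hourly count polling: 180,000,000
comment_requests = 10000 * 30 * 100       # nested comment pages: 30,000,000
print(f"{page_requests + stats_requests + comment_requests:,} requests/month")
# -> 210,720,000, i.e. "over 200M"

monthly_gb = 720 + 18000 + 300 + 1500     # GB from the four payload estimates
print(f"{monthly_gb / 1000:.2f} TB/month")  # -> 20.52, i.e. "over 20 TB"
```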

Next steps:

I will officially kick off this project at the second @Debug Politics hackathon, December 9th to 11th, in San Francisco.

But I'm already working on it. What I need is somebody with experience in database design to create the back-office architecture, and also somebody with experience in design.

I need financing to push this project forward, to cover the server and other costs associated with the project. If you want to be part of this project, either as a sponsor or as a developer, email me at [email protected]

About Me

I collaborate with Rise Project, where I do data analysis and pattern recognition to uncover patterns in unstructured datasets.

You can find me online on Medium (Florin Badita), AngelList, Twitter, LinkedIn, OpenStreetMap, GitHub, Quora, and Facebook.

Sometimes I write on my blog: http://florinbadita.com/