False positives in co-link network? #46

Open
havardl opened this issue May 6, 2022 · 3 comments
Comments

@havardl

havardl commented May 6, 2022

I'm seeing a big difference between the two output networks when I preprocess a .csv file with and without tweets which contain urls.

When I preprocess a .csv file which contains tweets without urls, I get more than 40 pairs of source/target combinations between profiles. But when I remove tweets without links in them, my network gets reduced to just a handful of profiles.

This makes me wonder whether I'm processing my data incorrectly when generating the .csv file. This is the current format of my csv file:

message_id,user_id,username,repost_id,reply_id,message,timestamp,urls
id,id,username,,id,"Tweet content.",1645488019,
id,id,username2,,id,Tweet content,1645488035,
id,id,username3,,id,Tweet content,1645488035,url
id,id,username4,,id,Tweet content,1645488035,url1 url2

Is this the correct way of doing it?
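One quick way to sanity-check a file in this format is to count how many rows actually carry a URL before handing it to the toolkit. The sketch below uses only the standard library and assumes, based on the `url1 url2` sample row above, that multiple URLs in the `urls` column are separated by whitespace; the thread does not confirm that this is the format the toolkit requires. The helper names `split_urls` and `count_url_rows`, and the IDs and URLs in the sample, are hypothetical placeholders.

```python
import csv
import io

# Sample rows in the format shown above; IDs and URLs are placeholders.
SAMPLE = """\
message_id,user_id,username,repost_id,reply_id,message,timestamp,urls
1,10,username,,100,"Tweet content.",1645488019,
2,11,username2,,101,Tweet content,1645488035,
3,12,username3,,102,Tweet content,1645488035,https://example.com/a
4,13,username4,,103,Tweet content,1645488035,https://example.com/a https://example.com/b
"""


def split_urls(cell):
    """Split a urls cell on whitespace; an empty or blank cell gives []."""
    return cell.split()


def count_url_rows(csv_file):
    """Return (rows_with_urls, rows_without_urls) for an open CSV file."""
    with_urls = without_urls = 0
    for row in csv.DictReader(csv_file):
        # A missing trailing field is read as None, so fall back to "".
        if split_urls(row["urls"] or ""):
            with_urls += 1
        else:
            without_urls += 1
    return with_urls, without_urls


print(count_url_rows(io.StringIO(SAMPLE)))  # (2, 2)
```

If the count of rows with URLs is much smaller than expected, the problem is more likely in how the file was generated than in the network computation.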

@havardl
Author

havardl commented May 6, 2022

btw, this is how I preprocess the .csv file and build the colink network in python:

coord_net_tk.preprocess.preprocess_csv_files(db_name, [csv_filename_path])
coord_net_tk.compute_networks.compute_co_link_network(db_name, 10, min_edge_weight=5, resolved=False)
G = coord_net_tk.graph.load_networkx_graph(db_name, "co_link")
nx.write_graphml_lxml(G, "filename.graphml")

@SamHames
Collaborator

SamHames commented May 8, 2022

> When I preprocess a .csv file which contains tweets without urls, I get more than 40 pairs of source/target combinations between profiles. But when I remove tweets without links in them, my network gets reduced to just a handful of profiles.

If your input is only tweets with no urls, there shouldn't be anything in the output co-link network, so something has gone wrong somewhere.

From a quick glance your preprocessing/data looks reasonable to me, but I'll take a closer look later when I have more time.

A few questions:

  • how have you filtered out tweets without urls?
  • are you testing with a completely new db when you're doing the different trials?
  • are you running the latest version? It's possible I broke something when cleaning up for "unable to resolve urls from csv" (#42)
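The second question can be ruled out mechanically by making sure each trial starts from a database file that does not exist yet. The sketch below is one way to do that with the standard library; `fresh_db_path` and the file name `trial.db` are hypothetical, and the assumption that a reused file could carry data over from an earlier run is a precaution, not something the thread confirms about the toolkit.

```python
import tempfile
from pathlib import Path


def fresh_db_path(directory, name="trial.db"):
    """Return a path for a database file, deleting any leftover file first."""
    db_path = Path(directory) / name
    db_path.unlink(missing_ok=True)  # missing_ok requires Python 3.8+
    return db_path


with tempfile.TemporaryDirectory() as tmp:
    # Simulate a file left behind by an earlier trial.
    stale = Path(tmp) / "trial.db"
    stale.write_text("left over from an earlier run")

    db_path = fresh_db_path(tmp)
    print(db_path.exists())  # False
```

The returned path would then be passed wherever the trial currently passes `db_name`.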

@havardl
Author

havardl commented May 10, 2022

> If your input is only tweets with no urls, there shouldn't be anything in the output co-link network, so something has gone wrong somewhere.

This is very helpful. I was wondering if the co-link network perhaps also looked at some other variables, but this makes me think I can just remove all the rows which have no urls before preprocessing the .csv file. That way I'm sure the network is only made up of link connections.

To your questions:

  • I build the .csv file with pandas and remove all rows where the url column is empty. The url column is based on the content of entities.urls.expanded_url for a given tweet from the Twitter API
  • I noticed that I had to delete the .db for each iteration, so yes, I'm only testing on new databases
  • I'm running version 1.5.1
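For the pandas filtering step described in the first answer, one thing worth checking is what counts as "empty". `dropna` only removes missing values, so a cell holding an empty or whitespace-only string would survive it. The sketch below is not the code used in this thread; it shows one way to drop both kinds of empty cell. The function name `keep_rows_with_urls` and the sample data are hypothetical, and the column name `urls` is taken from the CSV header posted earlier.

```python
import pandas as pd

# Placeholder data: one missing value, one whitespace-only cell, two real rows.
df = pd.DataFrame(
    {
        "message_id": ["1", "2", "3", "4"],
        "urls": [
            None,
            "   ",
            "https://example.com/a",
            "https://example.com/a https://example.com/b",
        ],
    }
)


def keep_rows_with_urls(frame, column="urls"):
    """Keep only rows whose URL cell contains non-whitespace text."""
    has_url = frame[column].fillna("").astype(str).str.strip() != ""
    return frame[has_url]


filtered = keep_rows_with_urls(df)
print(len(df), len(filtered))  # 4 2
```

Comparing the row counts before and after filtering gives a quick check that the filter removed what was expected.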
