Link to github
Table: date | u1 | u2 | n_msg
n_msg: the number of messsages between one unique user pair at someday.
DDL Scripts
create table messages( date varchar(80), u1 varchar(80), u2 varchar(80), n_msg integer ); insert into messages values ('10/01/2018','user1','user2',4); insert into messages values ('10/01/2018','user2','user1',4); insert into messages values ('10/01/2018','user1','user4',2); insert into messages values ('10/02/2018','user1','user2',2); insert into messages values ('10/02/2018','user2','user1',1); insert into messages values ('10/02/2018','user3','user4',10); insert into messages values ('10/02/2018','user3','user2',14); insert into messages values ('10/02/2018','user1','user2',14); insert into messages values ('10/03/2018','user4','user1',4); insert into messages values ('10/03/2018','user4','user5',25);
QUESTIONS
What can we get some insights from this table? user activities, represent closeness. Write a query about the distribution of number of conversations among users on someday. Before we run any SQL, what is your gut sense that what the distribution will look like? Why?
1) GET ALL THE MESSAGES BY ALL USERS ON A PARTICULAR DAY
select * from messages where date='10/02/2018'; +------------+-------+-------+-------+ | date | u1 | u2 | n_msg | +------------+-------+-------+-------+ | 10/02/2018 | user1 | user2 | 2 | | 10/02/2018 | user2 | user1 | 1 | | 10/02/2018 | user3 | user4 | 10 | | 10/02/2018 | user3 | user2 | 14 | | 10/02/2018 | user1 | user2 | 14 | +------------+-------+-------+-------+
2) SELECT DISTINCT CONVERSATIONS BY ALL USERS ON A PARTICULAR DAY
select u1,count(distinct u2) conversations from messages where n_msg >0 and date ='10/02/2018' group by u1; select u1,count(distinct(u2)) as conversations from messages where date ='10/02/2018' group by u1; +-------+---------------+ | u1 | conversations | +-------+---------------+ | user1 | 1 | | user2 | 1 | | user3 | 2 | +-------+---------------+
) DISTRIBUTIONS OF CONVERSATIONS
select a.conversations,count(a.u1) freq from (select u1,count(distinct u2) conversations from messages where n_msg >0 and date ='10/02/2018' group by u1) a group by a.conversations order by conversations; +---------------+------+ | conversations | freq | +---------------+------+ | 1 | 2 | | 2 | 1 | +---------------+------+
4) SUM OF ALL THE MESSAGES SENT BY A USER TO OTHER USERS
select u1,u2, sum(n_msg) as sum_msg from messages group by u1,u2; +-------+-------+---------+ | u1 | u2 | sum_msg | +-------+-------+---------+ | user1 | user2 | 20 | | user2 | user1 | 5 | | user1 | user4 | 2 | | user3 | user4 | 10 | | user3 | user2 | 14 | | user4 | user1 | 4 | | user4 | user5 | 25 | +-------+-------+---------+
5) TOP PARTNERS WHO SENDS THE MOST NUMBER OF MESSAGES TO EACH OTHER.
select u1, max(sum_n_msg) sum_top from (select u1, u2, sum(n_msg) sum_n_msg from table1 group by u1, u2) tmp group by u1 +-------+-------------+ | u1 | top_partner | +-------+-------------+ | user1 | 20 | | user2 | 5 | | user3 | 14 | | user4 | 25 | +-------+-------------+
FROM THE ABOVE QUERY CALCULATE THE FOLLOWING QUERY: sum(n_msg_with_top_partners) / sum(n_msg_with_all_contacts)
select sum(tmp.sum_top)/(select sum(n_msg) from messages) as fraction from (select a.u1,max(a.sum_msg) as sum_top from (select u1,u2,sum(n_msg) as sum_msg from messages group by u1,u2 order by sum_msg desc) a group by a.u1) tmp; +----------+ | fraction | +----------+ | 0.8000 | +----------+
Thank you very much for this great post. 铭识协议
ReplyDelete