Source data

The source data lives on the private GitLab, under the corresponding 019-20231121-RFM 模型 directory.
Source data description

Table: action.csv

user_id: user ID;
sku_id: product (SKU) ID;
time: timestamp of the action (browse, place order, add to cart, delete, purchase, etc.);
model_id: model ID;
type: action type ID;
cate: product category (clothing, skincare, fresh food, etc.);
brand: brand
Table: user.csv

user_id: user ID;
age: user age band (-1 means the user did not set an age band; 6 corresponds roughly to ages 50–60);
sex: user gender (0 …);
user_lv_cd: user level;
browse_num: number of browse records;
addcart_num: number of add-to-cart actions;
delcart_num: number of remove-from-cart actions;
buy_num: number of purchases;
favor_num: number of favorites;
click_num: number of clicks;
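user.csv is described here but never loaded in the rest of the walkthrough. As a minimal sketch, it could be read the same way as action.csv below; the path is an assumption that simply mirrors the action.csv path used later.

```python
import pandas as pd

# Assumed location, next to action.csv in the same data directory
user_path = r'/data/gitlab/python3-data-analysis/019-20231121-RFM 模型/data/user.csv'
user = pd.read_csv(user_path)
print(user.head())
print(user.info())
```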
Reading and inspecting the data

Reading the source data

```python
import pandas as pd
import numpy as np

path = r'/data/gitlab/python3-data-analysis/019-20231121-RFM 模型/data/action.csv'
# Read the large CSV in chunks, then concatenate into a single DataFrame
data = pd.read_csv(path, chunksize=10000)
chunks = []
for chunk in data:
    chunk = pd.DataFrame(chunk)
    chunks.append(chunk)
action = pd.concat(chunks)
action = action[['user_id', 'sku_id', 'type', 'time', 'cate']]

print(action.head())
'''
    user_id  sku_id  type                 time  cate
0  287842.0   75018     6  2016-03-31 23:59:01     9
1  208266.0   31662     1  2016-03-31 23:59:04     8
2  209390.0  118799     6  2016-03-31 23:59:05     8
3  237311.0    5825     1  2016-03-31 23:59:05     8
4  257651.0  128104     1  2016-03-31 23:59:05     4
'''

print(action.info())
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13199934 entries, 0 to 13199933
Data columns (total 5 columns):
 #   Column   Dtype
---  ------   -----
 0   user_id  float64
 1   sku_id   int64
 2   type     int64
 3   time     object
 4   cate     int64
dtypes: float64(1), int64(3), object(1)
memory usage: 503.5+ MB
None
'''
```
Data cleaning

Missing values

```python
print(action.isnull().sum())
'''
user_id    0
sku_id     0
type       0
time       0
cate       0
dtype: int64
'''
```
Duplicates

```python
print(action.duplicated().sum())
'''
7262534
'''
```
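The count shows roughly 7.26 million duplicated rows, but the walkthrough keeps them, since at second-level time resolution identical rows can still be genuine repeated views or clicks. If deduplication were wanted, a minimal sketch:

```python
# Not applied in this analysis; shown only as the straightforward deduplication step
action_dedup = action.drop_duplicates()
print(action_dedup.shape)
```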
Outliers

```python
print(action.describe())
'''
            user_id        sku_id          type          cate
count  1.319993e+07  1.319993e+07  1.319993e+07  1.319993e+07
mean   2.509635e+05  8.344591e+04  4.146564e+00  6.571679e+00
std    3.009638e+04  4.931126e+04  2.395629e+00  1.785305e+00
min    2.000010e+05  2.000000e+00  1.000000e+00  4.000000e+00
25%    2.251310e+05  4.033600e+04  1.000000e+00  5.000000e+00
50%    2.493220e+05  7.908000e+04  6.000000e+00  7.000000e+00
75%    2.770750e+05  1.269060e+05  6.000000e+00  8.000000e+00
max    3.053200e+05  1.712250e+05  6.000000e+00  1.100000e+01
'''

print(action.describe(include='all'))
'''
             user_id        sku_id  ...                 time          cate
count   1.319993e+07  1.319993e+07  ...             13199934  1.319993e+07
unique           NaN           NaN  ...              1078080           NaN
top              NaN           NaN  ...  2016-04-14 00:00:01           NaN
freq             NaN           NaN  ...                  347           NaN
mean    2.509635e+05  8.344591e+04  ...                  NaN  6.571679e+00
std     3.009638e+04  4.931126e+04  ...                  NaN  1.785305e+00
min     2.000010e+05  2.000000e+00  ...                  NaN  4.000000e+00
25%     2.251310e+05  4.033600e+04  ...                  NaN  5.000000e+00
50%     2.493220e+05  7.908000e+04  ...                  NaN  7.000000e+00
75%     2.770750e+05  1.269060e+05  ...                  NaN  8.000000e+00
max     3.053200e+05  1.712250e+05  ...                  NaN  1.100000e+01

[11 rows x 5 columns]
'''
```
Data type conversion

```python
action['user_id'] = action['user_id'].astype('int64')
action['time'] = pd.to_datetime(action['time'])
print(action.info())
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13199934 entries, 0 to 13199933
Data columns (total 5 columns):
 #   Column   Dtype
---  ------   -----
 0   user_id  int64
 1   sku_id   int64
 2   type     int64
 3   time     datetime64[ns]
 4   cate     int64
dtypes: datetime64[ns](1), int64(4)
memory usage: 503.5 MB
None
'''
```
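At roughly 13.2 million rows the frame occupies about 500 MB. Since the IDs and codes are small integers, downcasting is one optional way to shrink it; a sketch under that assumption, not part of the original steps:

```python
# Optional memory optimization (assumption: values fit the smaller integer types;
# the maxima in describe() above suggest int32 for the IDs and int8 for type/cate)
action_small = action.astype({'user_id': 'int32', 'sku_id': 'int32',
                              'type': 'int8', 'cate': 'int8'})
print(action_small.info())
```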
Identifying high-potential users

Rules:

the user must have at least one purchase;
the user purchased a product and also had other interactions with it (browsing, clicking, favoriting, etc.);
the time gap between the purchase of a product and the other interactions with it is greater than 1 day;
for each user we therefore need the last purchase time;
and the earliest interaction time with that product.
First, check which action types appear in the action table:
```python
print(action['type'].unique())
'''
[6 1 4 2 3 5]
'''
```
Next, count how often each action type occurs:
```python
print(action['type'].value_counts())
'''
type
6    8219746
1    4715843
2     154935
3      72294
5      24837
4      12279
Name: count, dtype: int64
'''
```
Finally, following the funnel principle of e-commerce platforms (the number of events shrinks at each step toward purchase), we can roughly infer that type 4, the rarest action, is the purchase behavior. We can therefore extract the rows that represent purchases:
```python
action_type4 = action[action['type'] == 4]
print(action_type4.head())
'''
     user_id  sku_id  type                 time  cate
94    250576  162108     4  2016-04-01 00:00:00     4
1019  264694    5825     4  2016-04-01 00:01:00     8
1027  266911    5825     4  2016-04-01 00:01:00     8
1029  286589  154504     4  2016-04-01 00:01:00     5
1030  244299   87057     4  2016-04-01 00:01:00     7
'''

print(action_type4['cate'].unique())
'''
[ 4  8  5  7  9  6 10 11]
'''

print(action_type4)
'''
          user_id  sku_id  type                 time  cate
94         250576  162108     4  2016-04-01 00:00:00     4
1019       264694    5825     4  2016-04-01 00:01:00     8
1027       266911    5825     4  2016-04-01 00:01:00     8
1029       286589  154504     4  2016-04-01 00:01:00     5
1030       244299   87057     4  2016-04-01 00:01:00     7
...           ...     ...   ...                  ...   ...
13195275   242683  161265     4  2016-04-15 23:53:00     8
13195919   215804  157750     4  2016-04-15 23:54:00     5
13197284   275840   18412     4  2016-04-15 23:56:00     8
13198572   291147   70491     4  2016-04-15 23:58:00     6
13199309   254642   25003     4  2016-04-15 23:59:00     5

[12279 rows x 5 columns]
'''
```
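The next merge relies on ac_lastbuytime, which the original post never shows being built. A minimal sketch, assuming it is each user's most recent purchase time taken from the type-4 rows and kept indexed by user_id so the later merges can join on 'user_id':

```python
# Assumption: ac_lastbuytime = latest purchase timestamp per user,
# derived from the purchase rows (type == 4); the user_id index lets
# the later pd.merge(..., on='user_id') calls join against it.
ac_lastbuytime = action_type4.groupby('user_id')['time'].max().to_frame()
```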
Merge each user's last purchase time (ac_lastbuytime) with the full action table, so every interaction row carries that user's last purchase time:
```python
ac_all_buy = pd.merge(ac_lastbuytime, action, left_on='user_id', right_on='user_id')
print(ac_all_buy.head())
'''
   user_id              time_x  sku_id  type               time_y  cate
0   200014 2016-04-05 10:47:00   88414     3  2016-04-03 09:08:02     4
1   200014 2016-04-05 10:47:00   72967     1  2016-04-03 09:08:23     4
2   200014 2016-04-05 10:47:00   72967     6  2016-04-03 09:08:24     4
3   200014 2016-04-05 10:47:00   72967     6  2016-04-03 09:09:19     4
4   200014 2016-04-05 10:47:00   72967     1  2016-04-03 09:09:34     4
'''
```
Group by user to find each user's earliest interaction time:
```python
ac_firsttime = ac_all_buy.groupby(by='user_id')['time_y'].apply(lambda x: x.min())
print(ac_firsttime)
'''
user_id
200014   2016-04-03 09:08:02
200033   2016-04-01 18:05:49
200034   2016-04-14 08:12:03
200049   2016-04-01 18:35:43
200053   2016-04-01 18:16:48
                 ...
305295   2016-04-11 20:13:02
305296   2016-04-01 11:40:20
305308   2016-04-08 17:37:14
305313   2016-04-05 21:35:15
305318   2016-04-01 00:00:36
Name: time_y, Length: 9356, dtype: datetime64[ns]
'''
```
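A side note on the groupby above: the lambda only forwards to min(), so pandas' built-in reduction gives the same result more directly and is typically faster.

```python
# Equivalent to .apply(lambda x: x.min()) above, using the built-in groupby reduction
ac_firsttime = ac_all_buy.groupby('user_id')['time_y'].min()
```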
Merge the earliest interaction time with the last purchase time:
```python
df = pd.merge(ac_lastbuytime, ac_firsttime.to_frame(), on='user_id')
df.columns = ['buytime', 'ac_time']
print(df.head())
'''
                     buytime             ac_time
user_id
200014   2016-04-05 10:47:00 2016-04-03 09:08:02
200033   2016-04-06 01:28:00 2016-04-01 18:05:49
200034   2016-04-14 08:18:00 2016-04-14 08:12:03
200049   2016-04-10 10:14:00 2016-04-01 18:35:43
200053   2016-04-13 20:16:00 2016-04-01 18:16:48
'''
```
Compute the time difference between the last purchase time and the earliest interaction time:
```python
df['days'] = (pd.to_datetime(df['buytime']) - pd.to_datetime(df['ac_time'])).dt.days
print(df)
'''
                     buytime             ac_time  days
user_id
200014   2016-04-05 10:47:00 2016-04-03 09:08:02     2
200033   2016-04-06 01:28:00 2016-04-01 18:05:49     4
200034   2016-04-14 08:18:00 2016-04-14 08:12:03     0
200049   2016-04-10 10:14:00 2016-04-01 18:35:43     8
200053   2016-04-13 20:16:00 2016-04-01 18:16:48    12
...                      ...                 ...   ...
305295   2016-04-13 19:19:00 2016-04-11 20:13:02     1
305296   2016-04-01 11:41:00 2016-04-01 11:40:20     0
305308   2016-04-08 18:40:00 2016-04-08 17:37:14     0
305313   2016-04-08 17:08:00 2016-04-05 21:35:15     2
305318   2016-04-01 00:05:00 2016-04-01 00:00:36     0

[9356 rows x 3 columns]
'''
```
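One detail worth noting: `.dt.days` keeps only the whole-day part of the difference, so a 47-hour gap counts as 1 day, and the `days > 1` filter used below effectively requires a gap of at least two full days. A tiny illustration with made-up timestamps:

```python
# Hypothetical timestamps, only to show that the days attribute truncates partial days
gap = pd.to_datetime('2016-04-05 10:00:00') - pd.to_datetime('2016-04-03 11:00:00')
print(gap)       # 1 days 23:00:00
print(gap.days)  # 1
```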
Select the high-potential users (those whose time gap is greater than 1 day):
```python
# High-potential users: gap between earliest interaction and last purchase > 1 day
hight_pot = df[df['days'] > 1]
print(hight_pot)
'''
                     buytime             ac_time  days
user_id
200014   2016-04-05 10:47:00 2016-04-03 09:08:02     2
200033   2016-04-06 01:28:00 2016-04-01 18:05:49     4
200049   2016-04-10 10:14:00 2016-04-01 18:35:43     8
200053   2016-04-13 20:16:00 2016-04-01 18:16:48    12
200054   2016-04-14 08:22:00 2016-04-07 13:25:56     6
...                      ...                 ...   ...
305193   2016-04-08 08:05:00 2016-04-01 06:17:20     7
305206   2016-04-14 22:55:00 2016-04-01 18:33:50    13
305254   2016-04-14 22:37:00 2016-04-02 15:36:15    12
305291   2016-04-11 12:29:00 2016-04-03 20:38:10     7
305313   2016-04-08 17:08:00 2016-04-05 21:35:15     2

[6162 rows x 3 columns]
'''
```
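As a quick closing check (not part of the original walkthrough), the two row counts above give the share of purchasing users that end up in the high-potential group:

```python
# Assumes df (9356 purchasing users) and hight_pot (6162 high-potential users) from above
share = len(hight_pot) / len(df)
print(f'{share:.1%}')  # roughly 65.9%
```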