Source Data

The source data lives in the 019-20231121-RFM 模型 directory of the private GitLab repository.

Source Data Description

Table: action.csv

  • user_id: user ID;
  • sku_id: product (SKU) ID;
  • time: time of the action (browsing, ordering, adding to cart, deleting, purchasing, etc.);
  • model_id: model ID;
  • type: action type ID;
  • cate: product category (clothing, skincare, fresh food, etc.);
  • brand: brand

Table: user.csv

  • user_id: user ID;
  • age: user age band (-1 means the user did not set an age band; 6 roughly means 50-60 years old);
  • sex: user gender (0
  • user_lv_cd: user level;
  • browse_num: number of browse records;
  • addcart_num: number of add-to-cart actions;
  • delcart_num: number of remove-from-cart actions;
  • buy_num: number of purchases;
  • favor_num: number of favorites;
  • click_num: number of clicks;
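
The analysis below only uses action.csv, but user.csv can be loaded and inspected the same way. A minimal sketch; the path is an assumption modeled on the action.csv path used later:

import pandas as pd

# Assumed location, mirroring the action.csv path under the same data directory.
user_path = r'/data/gitlab/python3-data-analysis/019-20231121-RFM 模型/data/user.csv'
user = pd.read_csv(user_path)
print(user.head())
print(user.info())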

Data Reading and Analysis

Reading the Source Data

import pandas as pd
import numpy as np

path = r'/data/gitlab/python3-data-analysis/019-20231121-RFM 模型/data/action.csv'
data = pd.read_csv(path, chunksize=10000)

chunks = []
for chunk in data:
    chunks.append(chunk)  # each chunk is already a DataFrame

action = pd.concat(chunks)  # concatenate the chunks into a single DataFrame
action = action[['user_id', 'sku_id', 'type', 'time', 'cate']]  # keep only the columns needed here
print(action.head())
'''
user_id sku_id type time cate
0 287842.0 75018 6 2016-03-31 23:59:01 9
1 208266.0 31662 1 2016-03-31 23:59:04 8
2 209390.0 118799 6 2016-03-31 23:59:05 8
3 237311.0 5825 1 2016-03-31 23:59:05 8
4 257651.0 128104 1 2016-03-31 23:59:05 4
'''

print(action.info())
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13199934 entries, 0 to 13199933
Data columns (total 5 columns):
# Column Dtype
--- ------ -----
0 user_id float64
1 sku_id int64
2 type int64
3 time object
4 cate int64
dtypes: float64(1), int64(3), object(1)
memory usage: 503.5+ MB
None
'''
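
The raw file is about 13 million rows and roughly 500 MB in memory, so it can also help to narrow the column types at read time. A minimal variation of the read above, assuming the small integer ranges for type and cate that the describe() output below confirms; user_id is read as float64 from the source file, so it is left for the later conversion step:

import pandas as pd

path = r'/data/gitlab/python3-data-analysis/019-20231121-RFM 模型/data/action.csv'

# Select and downcast columns while reading to reduce memory; the dtype choices
# are assumptions based on the value ranges reported later in this section.
action = pd.read_csv(
    path,
    usecols=['user_id', 'sku_id', 'type', 'time', 'cate'],
    dtype={'sku_id': 'int32', 'type': 'int8', 'cate': 'int8'},
    parse_dates=['time'],
)
print(action.info())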

Data Cleaning

Missing Values

print(action.isnull().sum())
'''
user_id 0
sku_id 0
type 0
time 0
cate 0
dtype: int64
'''

Duplicate Rows

print(action.duplicated().sum())
'''
7262534
'''
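
The check reports about 7.26 million fully duplicated rows, and the later info() output shows the row count unchanged, so the original analysis keeps them. Whether they should be dropped depends on whether identical (user_id, sku_id, type, time, cate) rows can still be legitimate separate events now that model_id has been removed. If they are to be treated as true duplicates, a minimal sketch:

# Assumption: exact duplicates across the remaining columns are unwanted repeats.
action = action.drop_duplicates().reset_index(drop=True)
print(action.duplicated().sum())  # now 0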

Outliers

print(action.describe())
'''
user_id sku_id type cate
count 1.319993e+07 1.319993e+07 1.319993e+07 1.319993e+07
mean 2.509635e+05 8.344591e+04 4.146564e+00 6.571679e+00
std 3.009638e+04 4.931126e+04 2.395629e+00 1.785305e+00
min 2.000010e+05 2.000000e+00 1.000000e+00 4.000000e+00
25% 2.251310e+05 4.033600e+04 1.000000e+00 5.000000e+00
50% 2.493220e+05 7.908000e+04 6.000000e+00 7.000000e+00
75% 2.770750e+05 1.269060e+05 6.000000e+00 8.000000e+00
max 3.053200e+05 1.712250e+05 6.000000e+00 1.100000e+01
'''

print(action.describe(include='all'))
'''
user_id sku_id ... time cate
count 1.319993e+07 1.319993e+07 ... 13199934 1.319993e+07
unique NaN NaN ... 1078080 NaN
top NaN NaN ... 2016-04-14 00:00:01 NaN
freq NaN NaN ... 347 NaN
mean 2.509635e+05 8.344591e+04 ... NaN 6.571679e+00
std 3.009638e+04 4.931126e+04 ... NaN 1.785305e+00
min 2.000010e+05 2.000000e+00 ... NaN 4.000000e+00
25% 2.251310e+05 4.033600e+04 ... NaN 5.000000e+00
50% 2.493220e+05 7.908000e+04 ... NaN 7.000000e+00
75% 2.770750e+05 1.269060e+05 ... NaN 8.000000e+00
max 3.053200e+05 1.712250e+05 ... NaN 1.100000e+01

[11 rows x 5 columns]
'''

Data Type Conversion

action['user_id'] = action['user_id'].astype('int64')  # user_id was read in as float64
action['time'] = pd.to_datetime(action['time'])  # parse the timestamp strings into datetime64
print(action.info())
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13199934 entries, 0 to 13199933
Data columns (total 5 columns):
# Column Dtype
--- ------ -----
0 user_id int64
1 sku_id int64
2 type int64
3 time datetime64[ns]
4 cate int64
dtypes: datetime64[ns](1), int64(4)
memory usage: 503.5 MB
None
'''

Identifying High-Potential Users

Rules:

  • the user must have purchase behavior;
  • the user purchased a product and also had other interactions with it (browsing, clicks, favorites, etc.);
  • the time difference between the purchase and those other interactions is greater than 1 day;
  • to measure this we need the last purchase time;
  • and the earliest interaction time with the product;

First, check which action types appear in the action table:

print(action['type'].unique())
'''
[6 1 4 2 3 5]
'''

Next, count how many times each action type occurs:

print(action['type'].value_counts())
'''
type
6 8219746
1 4715843
2 154935
3 72294
5 24837
4 12279
Name: count, dtype: int64
'''

Finally, based on the typical e-commerce conversion funnel, we can infer that type 4 corresponds to a purchase (it is the rarest action). So we can extract the rows with purchase behavior:

action_type4 = action[action['type']==4]
print(action_type4.head())
'''
user_id sku_id type time cate
94 250576 162108 4 2016-04-01 00:00:00 4
1019 264694 5825 4 2016-04-01 00:01:00 8
1027 266911 5825 4 2016-04-01 00:01:00 8
1029 286589 154504 4 2016-04-01 00:01:00 5
1030 244299 87057 4 2016-04-01 00:01:00 7
'''
print(action_type4['cate'].unique())  # which categories appear among purchases
'''
[ 4 8 5 7 9 6 10 11]
'''
print(action_type4)  # all rows with purchase behavior
'''
user_id sku_id type time cate
94 250576 162108 4 2016-04-01 00:00:00 4
1019 264694 5825 4 2016-04-01 00:01:00 8
1027 266911 5825 4 2016-04-01 00:01:00 8
1029 286589 154504 4 2016-04-01 00:01:00 5
1030 244299 87057 4 2016-04-01 00:01:00 7
... ... ... ... ... ...
13195275 242683 161265 4 2016-04-15 23:53:00 8
13195919 215804 157750 4 2016-04-15 23:54:00 5
13197284 275840 18412 4 2016-04-15 23:56:00 8
13198572 291147 70491 4 2016-04-15 23:58:00 6
13199309 254642 25003 4 2016-04-15 23:59:00 5

[12279 rows x 5 columns]
'''
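
The next step merges a table called ac_lastbuytime, whose definition does not appear in this section. A minimal sketch of what it presumably is, namely the latest purchase time per user taken from action_type4; the exact form is an assumption inferred from how it is used in the merges below (user_id ends up as the index, with the purchase time as the only column):

# Latest purchase time per user, with user_id as the index so it can serve
# as a merge key in the steps below. This reconstruction is an assumption.
ac_lastbuytime = action_type4.groupby('user_id')['time'].max().to_frame()
print(ac_lastbuytime.head())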

Merge the last purchase time with the action table:

ac_all_buy = pd.merge(ac_lastbuytime, action, left_on='user_id', right_on='user_id')
print(ac_all_buy.head())
'''
user_id time_x sku_id type time_y cate
0 200014 2016-04-05 10:47:00 88414 3 2016-04-03 09:08:02 4
1 200014 2016-04-05 10:47:00 72967 1 2016-04-03 09:08:23 4
2 200014 2016-04-05 10:47:00 72967 6 2016-04-03 09:08:24 4
3 200014 2016-04-05 10:47:00 72967 6 2016-04-03 09:09:19 4
4 200014 2016-04-05 10:47:00 72967 1 2016-04-03 09:09:34 4
'''
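
Because both frames contain a time column, merge adds its default suffixes: time_x is the per-user last purchase time and time_y is the time of each original action row.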

Group by user to get the earliest interaction time:

ac_firsttime = ac_all_buy.groupby('user_id')['time_y'].min()  # earliest interaction time per user
print(ac_firsttime)
'''
user_id
200014 2016-04-03 09:08:02
200033 2016-04-01 18:05:49
200034 2016-04-14 08:12:03
200049 2016-04-01 18:35:43
200053 2016-04-01 18:16:48
...
305295 2016-04-11 20:13:02
305296 2016-04-01 11:40:20
305308 2016-04-08 17:37:14
305313 2016-04-05 21:35:15
305318 2016-04-01 00:00:36
Name: time_y, Length: 9356, dtype: datetime64[ns]
'''

Merge the earliest interaction time with the last purchase time:

df = pd.merge(ac_lastbuytime, ac_firsttime.to_frame(), on='user_id')  # join on the user_id index
df.columns = ['buytime', 'ac_time']  # last purchase time, earliest interaction time
print(df.head())
'''
buytime ac_time
user_id
200014 2016-04-05 10:47:00 2016-04-03 09:08:02
200033 2016-04-06 01:28:00 2016-04-01 18:05:49
200034 2016-04-14 08:18:00 2016-04-14 08:12:03
200049 2016-04-10 10:14:00 2016-04-01 18:35:43
200053 2016-04-13 20:16:00 2016-04-01 18:16:48
'''

Compute the time difference between the last purchase time and the earliest interaction time:

df['days'] = (pd.to_datetime(df['buytime']) - pd.to_datetime(df['ac_time'])).dt.days
print(df)
'''
buytime ac_time days
user_id
200014 2016-04-05 10:47:00 2016-04-03 09:08:02 2
200033 2016-04-06 01:28:00 2016-04-01 18:05:49 4
200034 2016-04-14 08:18:00 2016-04-14 08:12:03 0
200049 2016-04-10 10:14:00 2016-04-01 18:35:43 8
200053 2016-04-13 20:16:00 2016-04-01 18:16:48 12
... ... ... ...
305295 2016-04-13 19:19:00 2016-04-11 20:13:02 1
305296 2016-04-01 11:41:00 2016-04-01 11:40:20 0
305308 2016-04-08 18:40:00 2016-04-08 17:37:14 0
305313 2016-04-08 17:08:00 2016-04-05 21:35:15 2
305318 2016-04-01 00:05:00 2016-04-01 00:00:36 0

[9356 rows x 3 columns]
'''

Select the high-potential users (those with a time difference greater than 1 day):

# high-potential users: purchase happened more than 1 day after the first interaction
hight_pot = df[df['days'] > 1]
print(hight_pot)
'''
buytime ac_time days
user_id
200014 2016-04-05 10:47:00 2016-04-03 09:08:02 2
200033 2016-04-06 01:28:00 2016-04-01 18:05:49 4
200049 2016-04-10 10:14:00 2016-04-01 18:35:43 8
200053 2016-04-13 20:16:00 2016-04-01 18:16:48 12
200054 2016-04-14 08:22:00 2016-04-07 13:25:56 6
... ... ... ...
305193 2016-04-08 08:05:00 2016-04-01 06:17:20 7
305206 2016-04-14 22:55:00 2016-04-01 18:33:50 13
305254 2016-04-14 22:37:00 2016-04-02 15:36:15 12
305291 2016-04-11 12:29:00 2016-04-03 20:38:10 7
305313 2016-04-08 17:08:00 2016-04-05 21:35:15 2

[6162 rows x 3 columns]
'''
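
This leaves 6,162 high-potential users. Note that .dt.days truncates to whole days, so df['days'] > 1 keeps users whose purchase came at least two full days after their first interaction; if the rule is meant as "more than 24 hours", comparing the raw time difference against pd.Timedelta(days=1) would be the more literal reading.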